Merge tag 'v3.7.0' into aosp/main
2023-08-10 v3.7.0
This release includes new codec interfaces, compression efficiency and
perceptual quality improvements, speed and memory optimizations, and many
bug fixes.
This release is ABI compatible with the last release.
- New Features
* New codec controls (a usage sketch follows this section):
* AV1E_SET_QUANTIZER_ONE_PASS: Set the quantizer for each frame.
* AV1E_ENABLE_RATE_GUIDE_DELTAQ: Enable the rate-distribution-guided delta
quantization in all intra mode. The "enable-rate-guide-deltaq" option is
added for this control.
* AV1E_SET_RATE_DISTRIBUTION_INFO: Set the input file for the rate
distribution used in all intra mode. The "rate-distribution-info" option
is added for this control (a file-format sketch follows this message).
* AV1E_GET_LUMA_CDEF_STRENGTH
* AV1E_SET_BITRATE_ONE_PASS_CBR
* AOM_SCALING_MODE is extended to include 2/3 and 1/3 scaling.
* aom_tune_metric is extended to include AOM_TUNE_VMAF_SALIENCY_MAP.
The "tune" option is extended to include "vmaf_saliency_map".
* SVC example encoder svc_encoder_rtc is able to use the rate control
library.
* Loopfilter level and CDEF filter level are now supported by the RTC rate
control library.
* New speed (--cpu-used) 11, intended for RTC screen sharing, added for
faster encoding with ~3% BDrate loss and a 16% IC (instruction count)
speedup compared to speed 10.
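
As a usage sketch for the new controls (not authoritative; assumes "codec"
is an aom_codec_ctx_t already initialized for 1-pass CBR encoding):

    #include "aom/aom_encoder.h"
    #include "aom/aomcx.h"

    // Update the target bitrate (kilobits per second) without a full
    // aom_codec_enc_config_set() round trip (1-pass CBR, single layer).
    aom_codec_control(&codec, AV1E_SET_BITRATE_ONE_PASS_CBR, 600u);

    // Force a specific quantizer in [0, 63] for the next frame; per the
    // control's documentation this turns off cyclic refresh (1-pass only).
    aom_codec_control(&codec, AV1E_SET_QUANTIZER_ONE_PASS, 40);
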
- Compression Efficiency Improvements
* Improved VoD encoding performance
* 0.1-0.6% BDrate gains for encoding speeds 2 to 6
* Rate control accuracy improvement in VBR mode
* RTC encoding improvements
* Screen content mode: 10-19% BDrate gains for speeds 6-10
* Temporal layers video mode, for speed 10:
* 2 temporal layers on low resolutions: 13-15% BDrate gain
* 3 temporal layers on VGA/HD: 3-4% BDrate gain
- Perceptual Quality Improvements
* Fixed multiple block and color artifacts for RTC screen content by:
* Incorporating color into the RD cost for IDTX
* Reducing thresholds for palette mode in non-RD mode
* Allowing more palette mode testing
* Improved color sensitivity for altref in non-RD mode.
* Reduced video flickering for temporal layer encoding.
- Speedup and Memory Optimizations
* Sped up the VoD encoder
* 2-5% for encoding speed 2 to 4
* 9-15% for encoding speed 5 to 6
* ARM
* Standard bitdepth
* speed 5: +31%
* speed 4: +2%
* speed 3: +9%
* speed 2: +157%
* High bitdepth
* speed 5: +85%
* RTC speedups
* Screen content mode
* 15% IC speedup for speeds 6-8
* ARM: 7% for speed 9, 3% for speed 10
* Temporal layers video mode
* 7% speedup for 3 temporal layers on VGA/HD, for speed 10
* Single layer video
* x86: 2% IC speedup for speeds 7-10
* ARM: 2-4% speedup across speeds 5-10
- Other Improvements
* VoD: Major improvements to global motion estimation, now enabled up to
speed 4
* RTC
* Fixes to make lossless coding work.
* Fixes to make frame dropper (--drop_frames) work for single and temporal
layers.
* Improvements to RPS (reference picture selection) recovery frames.
* Improvements to rate control for temporal layers.
* libwebm is updated to libwebm-1.0.0.29-9-g1930e3c
- Bug Fixes
* aomedia:3261 Assertion failed when encoding av1 with film grain and
'--monochrome' flag
* aomedia:3276 ensure all allocations are checked (partial fix)
* aomedia:3451 The libaom library calls exit()
* aomedia:3450 enable -Wshadow for C++ sources
* aomedia:3449 Test Seg Faults After
b459af3e345be402db052a143fcc5383d4b74cbd
* aomedia:3416 prune unused symbols / restrict symbol visibility
* aomedia:3443 Jenkins failure:
UninstantiatedParameterizedTestSuite<EstimateNoiseTest>
* aomedia:3434 realtime failures with CONFIG_BITSTREAM_DEBUG=1
* aomedia:3433 DeltaqModeTest crash w/row_mt=0
* aomedia:3429 Encoder crash when turn on both ExternalResize and
g_threads > 2
* aomedia:3438 Build failure with
`-DSANITIZE=address -DBUILD_SHARED_LIBS=ON` when using clang.
* aomedia:3435 Block artifacts when scrolling with AV1 in screen sharing
scenarios
* aomedia:3170 vmaf tune presets produce extreme glitches in one scene
* aomedia:3401 Building shared libaom with MSVC results in a race condition
with the export library
* aomedia:3420 Floating point exception in av1_tpl_get_frame_importance()
* aomedia:3424 heap-buffer-overflow in ScaleFilterCols_16_C() (SIGABRT)
* aomedia:3417 examples/svc_encoder_rtc.c is using internal macros and
functions
* aomedia:3372 SEGV in assign_frame_buffer_p av1_common_int.h
* aomedia:3130 'cpu-features.h' file not found on Android NDK 22
* aomedia:3415 Encoder/decoder mismatch for svc_encoder_rtc running
1 SL 3 TL
* aomedia:3412 Lossless Mode Fails Loopback Bit Test
* aomedia:3409 The use of AV1_VAR_OFFS in av1/encoder/var_based_part.c is
incorrect for high bit depths
* aomedia:3403 test_libaom fails with error message
"feenableexcept() failed" on Linux arm
* aomedia:3370 Random color block at fast motion area
* aomedia:3393 Assertion failure in av1_convolve_2d_sr_c()
* aomedia:3392 Strong artifacting for high bit-depth real-time
* aomedia:3376 aomenc --threads=10 --deltaq-mode=3 crashes after
"Allintra: multi-threading of calculating differential contrast"
* aomedia:3380 Crashes and ASan and TSan errors in deltaq-mode=3
multithreading code
* chromium:1410766 heap-buffer-overflow in aom_yv12_copy_v_c
* Cannot set level via AV1E_SET_TARGET_SEQ_LEVEL_IDX
* Encoding failure due to the use of loop restoration with unintended use of
lossless mode.
* Signed integer overflow in scan_past_frames
* Signed integer overflow in update_a_sep_sym
* Flickering in AV1 1440p/2160p HDR transcodes
* Fixed artifacts with screen share at encoder speed 10
* Fixed prediction setup for IDTX
Bug: 299684368
Test: atest CtsMediaV2TestCases
(cherry picked from https://android-review.googlesource.com/q/commit:eb47d839a7b1731d294f150ee256cec1546958b3)
Merged-In: Ic153bff94ca4bdb8f60f2769026615cd10e07bac
Change-Id: Ic153bff94ca4bdb8f60f2769026615cd10e07bac
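
For AV1E_SET_RATE_DISTRIBUTION_INFO, the aomcx.h hunk below documents the
expected input: a text file of (rows x cols) whitespace-separated floats,
one per 16x16 block, with rows = (frame_height + 15) / 16 and
cols = (frame_width + 15) / 16. A minimal sketch of producing such a file
(uniform, illustrative values; the helper name is hypothetical):

    #include <stdio.h>

    static int write_rate_distribution(const char *path, int frame_width,
                                       int frame_height,
                                       float bits_per_block) {
      const int rows = (frame_height + 15) / 16;
      const int cols = (frame_width + 15) / 16;
      FILE *f = fopen(path, "w");
      if (!f) return -1;
      for (int r = 0; r < rows; ++r) {
        // One row of per-block bit estimates; the newline also counts as
        // whitespace separation.
        for (int c = 0; c < cols; ++c) fprintf(f, "%f ", bits_per_block);
        fprintf(f, "\n");
      }
      return fclose(f);  // 0 on success, EOF on write failure.
    }

The file is then passed with --rate-distribution-info=rate_distribution.txt
together with --deltaq-mode=3 and --enable-rate-guide-deltaq=1.
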
diff --git a/.mailmap b/.mailmap
index 7ee51a4..7d31a70 100644
--- a/.mailmap
+++ b/.mailmap
@@ -9,6 +9,8 @@
Arild Fuldseth <[email protected]> <[email protected]>
Aasaipriya Chandran <[email protected]>
Aasaipriya Chandran <[email protected]> Aasaipriya C <[email protected]>
+Apurve Pandey <[email protected]>
+Apurve Kumar Pandey <[email protected]> Apurve Pandey
Bohan Li <[email protected]>
Changjun Yang <[email protected]>
Chi Yo Tsai <[email protected]>
@@ -53,8 +55,11 @@
Michael Horowitz <[email protected]> <[email protected]>
Mingliang Chen <[email protected]>
Monty Montgomery <[email protected]>
+Mudassir Galaganath <[email protected]>
+Mudassir Galaganath <[email protected]> Mudassir Galagnath
Nathan E. Egge <[email protected]>
Nathan E. Egge <[email protected]> <[email protected]>
+Onur Guleryuz <[email protected]>
Pascal Massimino <[email protected]>
Pascal Massimino <[email protected]> <[email protected]>
Paul Wilkins <[email protected]>
diff --git a/AUTHORS b/AUTHORS
index 3695891..79056a1 100644
--- a/AUTHORS
+++ b/AUTHORS
@@ -26,7 +26,7 @@
Aniket Wanare <[email protected]>
Ankur Saxena <[email protected]>
Anupam Pandey <[email protected]>
-Apurve Pandey <[email protected]>
+Apurve Kumar Pandey <[email protected]>
Arild Fuldseth <[email protected]>
Aron Rosenberg <[email protected]>
Arun Singh Negi <[email protected]>
@@ -36,6 +36,7 @@
Brennan Shacklett <[email protected]>
Brion Vibber <[email protected]>
Bruno Berthier <[email protected]>
+Casey Smalley <[email protected]>
Changjun Yang <[email protected]>
Charles 'Buck' Krasic <[email protected]>
Cheng Chen <[email protected]>
@@ -80,6 +81,7 @@
Fritz Koenig <[email protected]>
Fyodor Kyslov <[email protected]>
Gaute Strokkenes <[email protected]>
+George Steed <[email protected]>
Gerda Zsejke More <[email protected]>
Geza Lore <[email protected]>
Ghislain MARY <[email protected]>
@@ -140,6 +142,7 @@
Katsuhisa Yuasa <[email protected]>
Kavi Ramamurthy <[email protected]>
KO Myung-Hun <[email protected]>
+Konstantinos Margaritis <[email protected]>
Krishna Malladi <[email protected]>
Kwanghoon Son <[email protected]>
Kyle Siefring <[email protected]>
@@ -163,6 +166,7 @@
Makoto Kato <[email protected]>
Mans Rullgard <[email protected]>
Marco Paniconi <[email protected]>
+Mark Horvath <[email protected]>
Mark Mentovai <[email protected]>
Mark Wachsler <[email protected]>
Martin Ettl <[email protected]>
@@ -184,8 +188,9 @@
Mirko Bonadei <[email protected]>
Monty Montgomery <[email protected]>
Morton Jonuschat <[email protected]>
-Mudassir Galagnath <[email protected]>
+Mudassir Galaganath <[email protected]>
Mufaddal Chakera <[email protected]>
+Narayan Kalaburgi <[email protected]>
Narayan <[email protected]>
Nathan E. Egge <[email protected]>
Neeraj Gadgil <[email protected]>
@@ -195,6 +200,7 @@
Nithya V S <[email protected]>
Ola Hugosson <[email protected]>
Oleg Nalivayko <[email protected]>
+Onur Guleryuz <[email protected]>
Parag Salasakar <[email protected]>
Pascal Massimino <[email protected]>
Patrik Westin <[email protected]>
@@ -232,6 +238,7 @@
Ryan Overbeck <[email protected]>
Sachin Kumar Garg <[email protected]>
Sai Deng <[email protected]>
+Salome Thirot <[email protected]>
Sami Boukortt <[email protected]>
Sami Pietilä <[email protected]>
Samuel Thibault <[email protected]>
@@ -298,6 +305,7 @@
Yingying Ma <[email protected]>
Yongzhe Wang <[email protected]>
Yuan Tong <[email protected]>
+Yu-Chen (Eric) Sun <[email protected]>
Yue Chen <[email protected]>
Yunqing Wang <[email protected]>
Yury Gitman <[email protected]>
diff --git a/Android.bp b/Android.bp
index 7f37b93..c7cb621 100644
--- a/Android.bp
+++ b/Android.bp
@@ -30,6 +30,7 @@
"av1/common/arm/cdef_block_neon.c",
"av1/common/arm/cfl_neon.c",
"av1/common/arm/convolve_neon.c",
+ "av1/common/arm/highbd_convolve_neon.c",
"av1/common/arm/highbd_inv_txfm_neon.c",
"av1/common/arm/jnt_convolve_neon.c",
"av1/common/arm/reconinter_neon.c",
@@ -164,6 +165,7 @@
"av1/encoder/arm/neon/av1_error_neon.c",
"av1/encoder/arm/neon/av1_fwd_txfm2d_neon.c",
"av1/encoder/arm/neon/av1_highbd_quantize_neon.c",
+ "av1/encoder/arm/neon/av1_k_means_neon.c",
"av1/encoder/arm/neon/encodetxb_neon.c",
"av1/encoder/arm/neon/highbd_fwd_txfm_neon.c",
"av1/encoder/arm/neon/hybrid_fwd_txfm_neon.c",
@@ -171,6 +173,7 @@
"av1/encoder/arm/neon/picksrt_neon.c",
"av1/encoder/arm/neon/quantize_neon.c",
"av1/encoder/arm/neon/rdopt_neon.c",
+ "av1/encoder/arm/neon/reconinter_enc_neon.c",
"av1/encoder/arm/neon/temporal_filter_neon.c",
"av1/encoder/arm/neon/wedge_utils_neon.c",
]
@@ -252,6 +255,7 @@
"av1/encoder/ml.c",
"av1/encoder/motion_search_facade.c",
"av1/encoder/mv_prec.c",
+ "av1/encoder/nonrd_opt.c",
"av1/encoder/nonrd_pickmode.c",
"av1/encoder/palette.c",
"av1/encoder/partition_search.c",
@@ -282,11 +286,25 @@
"third_party/vector/vector.c",
]
-aom_av1_rc_qmode_sources = [
- "av1/qmode_rc/ducky_encode.cc",
- "av1/qmode_rc/ratectrl_qmode.cc",
- "av1/qmode_rc/ratectrl_qmode_interface.cc",
- "av1/qmode_rc/reference_manager.cc",
+aom_av1_rc_sources = [
+ "av1/ratectrl_rtc.cc",
+]
+
+aom_common_app_util_sources = [
+ "av1/arg_defs.c",
+ "common/args.c",
+ "common/args_helper.c",
+ "common/av1_config.c",
+ "common/ivfdec.c",
+ "common/md5_utils.c",
+ "common/rawenc.c",
+ "common/tools_common.c",
+ "common/y4menc.c",
+]
+
+aom_decoder_app_util_sources = [
+ "common/obudec.c",
+ "common/video_reader.c",
]
aom_dsp_common_asm_sse2 = [
@@ -316,7 +334,9 @@
]
aom_dsp_common_intrin_neon = [
+ "aom_dsp/arm/aom_convolve8_neon.c",
"aom_dsp/arm/aom_convolve_copy_neon.c",
+ "aom_dsp/arm/avg_pred_neon.c",
"aom_dsp/arm/blend_a64_mask_neon.c",
"aom_dsp/arm/fwd_txfm_neon.c",
"aom_dsp/arm/highbd_intrapred_neon.c",
@@ -422,10 +442,18 @@
aom_dsp_encoder_intrin_neon = [
"aom_dsp/arm/avg_neon.c",
"aom_dsp/arm/hadamard_neon.c",
+ "aom_dsp/arm/highbd_avg_neon.c",
+ "aom_dsp/arm/highbd_hadamard_neon.c",
"aom_dsp/arm/highbd_quantize_neon.c",
+ "aom_dsp/arm/highbd_sad4d_neon.c",
+ "aom_dsp/arm/highbd_sad_neon.c",
"aom_dsp/arm/highbd_variance_neon.c",
- "aom_dsp/arm/sad4d_neon.c",
+ "aom_dsp/arm/masked_sad4d_neon.c",
+ "aom_dsp/arm/masked_sad_neon.c",
+ "aom_dsp/arm/obmc_sad_neon.c",
+ "aom_dsp/arm/obmc_variance_neon.c",
"aom_dsp/arm/sad_neon.c",
+ "aom_dsp/arm/sadxd_neon.c",
"aom_dsp/arm/sse_neon.c",
"aom_dsp/arm/subpel_variance_neon.c",
"aom_dsp/arm/sum_squares_neon.c",
@@ -441,6 +469,7 @@
"aom_dsp/x86/highbd_quantize_intrin_sse2.c",
"aom_dsp/x86/highbd_subtract_sse2.c",
"aom_dsp/x86/highbd_variance_sse2.c",
+ "aom_dsp/x86/jnt_sad_sse2.c",
"aom_dsp/x86/quantize_sse2.c",
"aom_dsp/x86/sum_squares_sse2.c",
"aom_dsp/x86/variance_sse2.c",
@@ -448,6 +477,7 @@
aom_dsp_encoder_intrin_sse4_1 = [
"aom_dsp/flow_estimation/x86/corner_match_sse4.c",
+ "aom_dsp/flow_estimation/x86/disflow_sse4.c",
"aom_dsp/x86/avg_intrin_sse4.c",
"aom_dsp/x86/highbd_variance_sse4.c",
"aom_dsp/x86/obmc_sad_sse4.c",
@@ -456,7 +486,6 @@
]
aom_dsp_encoder_intrin_ssse3 = [
- "aom_dsp/x86/jnt_sad_ssse3.c",
"aom_dsp/x86/jnt_variance_ssse3.c",
"aom_dsp/x86/masked_sad4d_ssse3.c",
"aom_dsp/x86/masked_sad_intrin_ssse3.c",
@@ -481,6 +510,7 @@
"aom_dsp/noise_model.c",
"aom_dsp/noise_util.c",
"aom_dsp/psnr.c",
+ "aom_dsp/pyramid.c",
"aom_dsp/quantize.c",
"aom_dsp/sad.c",
"aom_dsp/sad_av1.c",
@@ -490,11 +520,28 @@
"aom_dsp/variance.c",
]
+aom_encoder_app_util_sources = [
+ "common/ivfenc.c",
+ "common/video_writer.c",
+ "common/warnings.c",
+ "common/y4minput.c",
+ "examples/encoder_util.c",
+]
+
aom_encoder_stats_sources = [
"stats/aomstats.c",
"stats/rate_hist.c",
]
+aom_libwebm_sources = [
+ "third_party/libwebm/common/hdr_util.cc",
+ "third_party/libwebm/mkvmuxer/mkvmuxer.cc",
+ "third_party/libwebm/mkvmuxer/mkvmuxerutil.cc",
+ "third_party/libwebm/mkvmuxer/mkvwriter.cc",
+ "third_party/libwebm/mkvparser/mkvparser.cc",
+ "third_party/libwebm/mkvparser/mkvreader.cc",
+]
+
aom_mem_sources = [
"aom_mem/aom_mem.c",
]
@@ -503,14 +550,6 @@
"aom_ports/float.asm",
]
-aom_rc_interface_sources = [
- "common/y4minput.c",
- "test/decode_test_driver.cc",
- "test/encode_test_driver.cc",
- "test/ratectrl_rtc_test.cc",
- "test/test_aom_rc_interface.cc",
-]
-
aom_rtcd_sources = [
"aom_dsp/aom_dsp_rtcd.c",
"aom_scale/aom_scale_rtcd.c",
@@ -534,7 +573,6 @@
aom_util_sources = [
"aom_util/aom_thread.c",
- "aom_util/debug_util.c",
]
aom_webm_decoder_sources = [
@@ -545,13 +583,6 @@
"common/webmenc.cc",
]
-av1_rc_qmode_sources = [
- "common/tools_common.c",
- "common/y4minput.c",
- "test/ducky_encode_test.cc",
- "test/ratectrl_qmode_test.cc",
-]
-
aom_rtcd_sources_gen = [
]
@@ -562,10 +593,6 @@
aom_version_sources_gen = [
]
-av1_rc_qmode_sources_gen = [
- "gen_src/usage_exit.c",
-]
-
aom_av1_common_sources += ["common/av1_config.c"]
package {
diff --git a/CHANGELOG b/CHANGELOG
index 531c6d9..f35903d 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,3 +1,133 @@
+2023-08-10 v3.7.0
+ This release includes new codec interfaces, compression efficiency and
+ perceptual quality improvements, speed and memory optimizations, and many bug fixes.
+ This release is ABI compatible with the last release.
+
+ - New Features
+ * New codec controls:
+ * AV1E_SET_QUANTIZER_ONE_PASS: Set the quantizer for each frame.
+ * AV1E_ENABLE_RATE_GUIDE_DELTAQ: Enable the rate-distribution-guided delta
+ quantization in all intra mode. The "enable-rate-guide-deltaq" option is
+ added for this control.
+ * AV1E_SET_RATE_DISTRIBUTION_INFO: Set the input file for the rate
+ distribution used in all intra mode. The "rate-distribution-info" option
+ is added for this control.
+ * AV1E_GET_LUMA_CDEF_STRENGTH
+ * AV1E_SET_BITRATE_ONE_PASS_CBR
+ * AOM_SCALING_MODE is extended to include 2/3 and 1/3 scaling.
+ * aom_tune_metric is extended to include AOM_TUNE_VMAF_SALIENCY_MAP.
+ The "tune" option is extended to include "vmaf_saliency_map".
+ * SVC example encoder svc_encoder_rtc is able to use the rate control
+ library.
+ * Loopfilter level and CDEF filter level are now supported by the RTC rate
+ control library.
+ * New speed (--cpu-used) 11, intended for RTC screen sharing, added for
+ faster encoding with ~3% BDrate loss and a 16% IC (instruction count)
+ speedup compared to speed 10.
+
+ - Compression Efficiency Improvements
+ * Improved VoD encoding performance
+ * 0.1-0.6% BDrate gains for encoding speeds 2 to 6
+ * Rate control accuracy improvement in VBR mode
+ * RTC encoding improvements
+ * Screen content mode: 10-19% BDrate gains for speeds 6-10
+ * Temporal layers video mode, for speed 10:
+ * 2 temporal layers on low resolutions: 13-15% BDrate gain
+ * 3 temporal layers on VGA/HD: 3-4% BDrate gain
+
+ - Perceptual Quality Improvements
+ * Fixed multiple block and color artifacts for RTC screen content by:
+ * Incorporating color into the RD cost for IDTX
+ * Reducing thresholds for palette mode in non-RD mode
+ * Allowing more palette mode testing
+ * Improved color sensitivity for altref in non-RD mode.
+ * Reduced video flickering for temporal layer encoding.
+
+ - Speedup and Memory Optimizations
+ * Sped up the VoD encoder
+ * 2-5% for encoding speed 2 to 4
+ * 9-15% for encoding speed 5 to 6
+ * ARM
+ * Standard bitdepth
+ * speed 5: +31%
+ * speed 4: +2%
+ * speed 3: +9%
+ * speed 2: +157%
+ * High bitdepth
+ * speed 5: +85%
+ * RTC speedups
+ * Screen content mode
+ * 15% IC speedup for speeds 6-8
+ * ARM: 7% for speed 9, 3% for speed 10
+ * Temporal layers video mode
+ * 7% speedup for 3 temporal layers on VGA/HD, for speed 10
+ * Single layer video
+ * x86: 2% IC speedup for speeds 7-10
+ * ARM: 2-4% speedup across speeds 5-10
+
+ - Other Improvements
+ * VoD: Major improvements to global motion estimation, now enabled up to
+ speed 4
+ * RTC
+ * Fixes to make lossless coding work.
+ * Fixes to make frame dropper (--drop_frames) work for single and temporal
+ layers.
+ * Improvements to RPS (reference picture selection) recovery frames.
+ * Improvements to rate control for temporal layers.
+ * libwebm is updated to libwebm-1.0.0.29-9-g1930e3c
+
+ - Bug Fixes
+ * aomedia:3261 Assertion failed when encoding av1 with film grain and
+ '--monochrome' flag
+ * aomedia:3276 ensure all allocations are checked (partial fix)
+ * aomedia:3451 The libaom library calls exit()
+ * aomedia:3450 enable -Wshadow for C++ sources
+ * aomedia:3449 Test Seg Faults After
+ b459af3e345be402db052a143fcc5383d4b74cbd
+ * aomedia:3416 prune unused symbols / restrict symbol visibility
+ * aomedia:3443 Jenkins failure:
+ UninstantiatedParameterizedTestSuite<EstimateNoiseTest>
+ * aomedia:3434 realtime failures with CONFIG_BITSTREAM_DEBUG=1
+ * aomedia:3433 DeltaqModeTest crash w/row_mt=0
+ * aomedia:3429 Encoder crash when turn on both ExternalResize and
+ g_threads > 2
+ * aomedia:3438 Build failure with
+ `-DSANITIZE=address -DBUILD_SHARED_LIBS=ON` when using clang.
+ * aomedia:3435 Block artifacts when scrolling with AV1 in screen sharing
+ scenarios
+ * aomedia:3170 vmaf tune presets produce extreme glitches in one scene
+ * aomedia:3401 Building shared libaom with MSVC results in a race condition
+ with the export library
+ * aomedia:3420 Floating point exception in av1_tpl_get_frame_importance()
+ * aomedia:3424 heap-buffer-overflow in ScaleFilterCols_16_C() (SIGABRT)
+ * aomedia:3417 examples/svc_encoder_rtc.c is using internal macros and
+ functions
+ * aomedia:3372 SEGV in assign_frame_buffer_p av1_common_int.h
+ * aomedia:3130 'cpu-features.h' file not found on Android NDK 22
+ * aomedia:3415 Encoder/decoder mismatch for svc_encoder_rtc running
+ 1 SL 3 TL
+ * aomedia:3412 Lossless Mode Fails Loopback Bit Test
+ * aomedia:3409 The use of AV1_VAR_OFFS in av1/encoder/var_based_part.c is
+ incorrect for high bit depths
+ * aomedia:3403 test_libaom fails with error message
+ "feenableexcept() failed" on Linux arm
+ * aomedia:3370 Random color block at fast motion area
+ * aomedia:3393 Assertion failure in av1_convolve_2d_sr_c()
+ * aomedia:3392 Strong artifacting for high bit-depth real-time
+ * aomedia:3376 aomenc --threads=10 --deltaq-mode=3 crashes after
+ "Allintra: multi-threading of calculating differential contrast"
+ * aomedia:3380 Crashes and ASan and TSan errors in deltaq-mode=3
+ multithreading code
+ * chromium:1410766 heap-buffer-overflow in aom_yv12_copy_v_c
+ * Cannot set level via AV1E_SET_TARGET_SEQ_LEVEL_IDX
+ * Encoding failure due to the use of loop restoration with unintended use of
+ lossless mode.
+ * Signed integer overflow in scan_past_frames
+ * Signed integer overflow in update_a_sep_sym
+ * Flickering in AV1 1440p/2160p HDR transcodes
+ * Fixed artifacts with screen share at encoder speed 10
+ * Fixed prediction setup for IDTX
+
2023-05-08 v3.6.1
This release includes several bug fixes. This release is ABI
compatible with the last release. See
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 87d88fa..8f459f3 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -11,7 +11,7 @@
if(CONFIG_TFLITE)
cmake_minimum_required(VERSION 3.11)
else()
- cmake_minimum_required(VERSION 3.7)
+ cmake_minimum_required(VERSION 3.9)
endif()
set(AOM_ROOT "${CMAKE_CURRENT_SOURCE_DIR}")
@@ -41,6 +41,13 @@
endif()
endif()
+if(MSVC AND MSVC_VERSION LESS 1920)
+ message(
+ WARNING
+ "MSVC versions prior to 2019 (v16) are not supported and may generate"
+ " incorrect code!")
+endif()
+
# Library version info. Update LT_CURRENT, LT_REVISION and LT_AGE when making a
# public release by following the guidelines in the libtool document:
# https://www.gnu.org/software/libtool/manual/libtool.html#Updating-version-info
@@ -51,9 +58,9 @@
# passed to libtool.
#
# We set SO_FILE_VERSION = [c-a].a.r
-set(LT_CURRENT 9)
-set(LT_REVISION 1)
-set(LT_AGE 6)
+set(LT_CURRENT 10)
+set(LT_REVISION 0)
+set(LT_AGE 7)
math(EXPR SO_VERSION "${LT_CURRENT} - ${LT_AGE}")
set(SO_FILE_VERSION "${SO_VERSION}.${LT_AGE}.${LT_REVISION}")
unset(LT_CURRENT)
@@ -210,13 +217,9 @@
include_directories(${AOM_ROOT} ${AOM_CONFIG_DIR} ${AOM_ROOT}/apps
${AOM_ROOT}/common ${AOM_ROOT}/examples ${AOM_ROOT}/stats)
-if(CONFIG_RUNTIME_CPU_DETECT AND ANDROID_NDK)
- include_directories(${ANDROID_NDK}/sources/android/cpufeatures)
-endif()
-
# Targets
add_library(aom_version ${AOM_VERSION_SOURCES})
-add_dummy_source_file_to_target(aom_version c)
+add_no_op_source_file_to_target(aom_version c)
add_custom_command(OUTPUT "${AOM_CONFIG_DIR}/config/aom_version.h"
COMMAND ${CMAKE_COMMAND} ARGS
-DAOM_CONFIG_DIR=${AOM_CONFIG_DIR}
@@ -270,10 +273,26 @@
set(AOM_LIB_TARGETS ${AOM_LIB_TARGETS} aom_encoder_stats)
endif()
-add_library(aom ${AOM_SOURCES} $<TARGET_OBJECTS:aom_rtcd>)
+# Xcode generator cannot take a library composed solely of objects. See
+# https://gitlab.kitware.com/cmake/cmake/-/issues/17500
+if(XCODE)
+ set(target_objs_aom ${AOM_SOURCES})
+else()
+ add_library(aom_obj OBJECT ${AOM_SOURCES})
+ set(AOM_LIB_TARGETS ${AOM_LIB_TARGETS} aom_obj)
+ set(target_objs_aom $<TARGET_OBJECTS:aom_obj>)
+endif()
+add_library(aom ${target_objs_aom} $<TARGET_OBJECTS:aom_rtcd>)
+
if(BUILD_SHARED_LIBS)
- add_library(aom_static STATIC ${AOM_SOURCES} $<TARGET_OBJECTS:aom_rtcd>)
+ add_library(aom_static STATIC ${target_objs_aom} $<TARGET_OBJECTS:aom_rtcd>)
set_target_properties(aom_static PROPERTIES OUTPUT_NAME aom)
+ if(MSVC OR (WIN32 AND NOT MINGW))
+ # Fix race condition on the export library file between the two versions.
+ # Affects MSVC in all three flavors (stock, Clang/CL, LLVM-- the latter sets
+ # MSVC and MINGW both to FALSE).
+ set_target_properties(aom PROPERTIES ARCHIVE_OUTPUT_NAME "aom_dll")
+ endif()
if(NOT MSVC)
# Extract version string and set VERSION/SOVERSION for the aom target.
@@ -304,7 +323,7 @@
endif()
endif()
-if(CONFIG_AV1_RC_RTC AND CONFIG_AV1_ENCODER AND NOT BUILD_SHARED_LIBS)
+if(CONFIG_AV1_ENCODER AND NOT CONFIG_REALTIME_ONLY AND NOT BUILD_SHARED_LIBS)
list(APPEND AOM_AV1_RC_SOURCES "${AOM_ROOT}/av1/ratectrl_rtc.h"
"${AOM_ROOT}/av1/ratectrl_rtc.cc")
add_library(aom_av1_rc ${AOM_AV1_RC_SOURCES})
@@ -312,33 +331,13 @@
if(NOT WIN32 AND NOT APPLE)
target_link_libraries(aom_av1_rc ${AOM_LIB_LINK_TYPE} m)
endif()
-endif()
-
-if(CONFIG_AV1_ENCODER AND NOT CONFIG_REALTIME_ONLY AND NOT BUILD_SHARED_LIBS)
- list(APPEND AOM_AV1_RC_QMODE_SOURCES
- "${AOM_ROOT}/av1/qmode_rc/ratectrl_qmode_interface.h"
- "${AOM_ROOT}/av1/qmode_rc/ratectrl_qmode_interface.cc"
- "${AOM_ROOT}/av1/qmode_rc/reference_manager.h"
- "${AOM_ROOT}/av1/qmode_rc/reference_manager.cc"
- "${AOM_ROOT}/av1/qmode_rc/ratectrl_qmode.h"
- "${AOM_ROOT}/av1/qmode_rc/ratectrl_qmode.cc"
- "${AOM_ROOT}/av1/qmode_rc/ducky_encode.h"
- "${AOM_ROOT}/av1/qmode_rc/ducky_encode.cc")
- add_library(av1_rc_qmode ${AOM_AV1_RC_QMODE_SOURCES})
- target_link_libraries(av1_rc_qmode ${AOM_LIB_LINK_TYPE} aom)
- if(NOT MSVC AND NOT APPLE)
- target_link_libraries(av1_rc_qmode ${AOM_LIB_LINK_TYPE} m)
- endif()
- set_target_properties(av1_rc_qmode PROPERTIES LINKER_LANGUAGE CXX)
+ set_target_properties(aom_av1_rc PROPERTIES LINKER_LANGUAGE CXX)
endif()
# List of object and static library targets.
set(AOM_LIB_TARGETS ${AOM_LIB_TARGETS} aom_rtcd aom_mem aom_scale aom)
-if(CONFIG_AV1_RC_RTC AND CONFIG_AV1_ENCODER AND NOT BUILD_SHARED_LIBS)
- set(AOM_LIB_TARGETS ${AOM_LIB_TARGETS} aom_av1_rc)
-endif()
if(CONFIG_AV1_ENCODER AND NOT CONFIG_REALTIME_ONLY AND NOT BUILD_SHARED_LIBS)
- set(AOM_LIB_TARGETS ${AOM_LIB_TARGETS} av1_rc_qmode)
+ set(AOM_LIB_TARGETS ${AOM_LIB_TARGETS} aom_av1_rc)
endif()
if(BUILD_SHARED_LIBS)
set(AOM_LIB_TARGETS ${AOM_LIB_TARGETS} aom_static)
@@ -362,13 +361,13 @@
endif()
endforeach()
-# Generate C/C++ stub files containing the function usage_exit(). Users of the
+# Generate a C file containing the function usage_exit(). Users of the
# aom_common_app_util library must define this function. This is a convenience
# to allow omission of the function from applications that might want to use
# other pieces of the util support without defining usage_exit().
-file(WRITE "${AOM_GEN_SRC_DIR}/usage_exit.c" "void usage_exit(void) {}")
-file(WRITE "${AOM_GEN_SRC_DIR}/usage_exit.cc"
- "extern \"C\" void usage_exit(void) {}")
+file(WRITE "${AOM_GEN_SRC_DIR}/usage_exit.c"
+ "#include <stdlib.h>\n\n#include \"common/tools_common.h\"\n\n"
+ "void usage_exit(void) { exit(EXIT_FAILURE); }\n")
#
# Application and application support targets.
@@ -461,7 +460,7 @@
if(CONFIG_LIBYUV OR CONFIG_TUNE_BUTTERAUGLI)
add_library(yuv OBJECT ${AOM_LIBYUV_SOURCES})
if(NOT MSVC)
- target_compile_options(yuv PRIVATE -Wno-unused-parameter)
+ target_compile_options(yuv PRIVATE -Wno-shadow)
endif()
include_directories("${AOM_ROOT}/third_party/libyuv/include")
endif()
@@ -495,7 +494,7 @@
$<TARGET_OBJECTS:aom_common_app_util>
$<TARGET_OBJECTS:aom_encoder_app_util>)
- add_executable(svc_encoder_rtc "${AOM_ROOT}/examples/svc_encoder_rtc.c"
+ add_executable(svc_encoder_rtc "${AOM_ROOT}/examples/svc_encoder_rtc.cc"
$<TARGET_OBJECTS:aom_common_app_util>
$<TARGET_OBJECTS:aom_encoder_app_util>)
@@ -634,15 +633,15 @@
if(PKG_CONFIG_FOUND)
pkg_check_modules(VMAF REQUIRED libvmaf)
if(BUILD_SHARED_LIBS)
- target_link_libraries(aom PRIVATE ${VMAF_LDFLAGS} ${VMAF_LIBRARIES})
- else()
- target_link_libraries(aom
- PRIVATE ${VMAF_LDFLAGS} ${VMAF_LIBRARIES} -static)
+ target_link_libraries(aom_static
+ PRIVATE ${VMAF_LDFLAGS} ${VMAF_LIBRARIES})
endif()
- target_include_directories(aom PRIVATE ${VMAF_INCLUDE_DIRS})
+ target_link_libraries(aom PRIVATE ${VMAF_LDFLAGS} ${VMAF_LIBRARIES})
target_include_directories(aom_dsp_encoder PRIVATE ${VMAF_INCLUDE_DIRS})
if(VMAF_CFLAGS)
- append_compiler_flag("${VMAF_CFLAGS}")
+ foreach(flag "${VMAF_CFLAGS}")
+ append_compiler_flag("${flag}")
+ endforeach()
endif()
else()
message(FATAL_ERROR "CONFIG_TUNE_VMAF error: pkg-config not found.")
@@ -665,7 +664,7 @@
if(ENABLE_TOOLS)
if(CONFIG_AV1_DECODER)
- add_executable(dump_obu "${AOM_GEN_SRC_DIR}/usage_exit.cc"
+ add_executable(dump_obu "${AOM_GEN_SRC_DIR}/usage_exit.c"
"${AOM_ROOT}/tools/dump_obu.cc"
"${AOM_ROOT}/tools/obu_parser.cc"
"${AOM_ROOT}/tools/obu_parser.h"
@@ -795,7 +794,7 @@
# here, it really is the Xcode generator's fault, or just a deficiency in
# Xcode itself.
foreach(aom_app ${AOM_APP_TARGETS})
- add_dummy_source_file_to_target("${aom_app}" "cc")
+ add_no_op_source_file_to_target("${aom_app}" "cc")
endforeach()
endif()
endif()
@@ -824,7 +823,15 @@
endif()
if(BUILD_SHARED_LIBS)
- if(NOT WIN32 AND NOT APPLE)
+ # Don't use -Wl,-z,defs with Clang's sanitizers.
+ #
+ # Clang's AddressSanitizer documentation says "When linking shared libraries,
+ # the AddressSanitizer run-time is not linked, so -Wl,-z,defs may cause link
+ # errors (don't use it with AddressSanitizer)." See
+ # https://clang.llvm.org/docs/AddressSanitizer.html#usage.
+ if(NOT WIN32
+ AND NOT APPLE
+ AND NOT (CMAKE_C_COMPILER_ID MATCHES "Clang" AND SANITIZE))
# The -z defs linker option reports unresolved symbol references from object
# files when building a shared library.
if("${CMAKE_VERSION}" VERSION_LESS "3.13")
@@ -935,7 +942,7 @@
get_cmake_property(all_cmake_vars VARIABLES)
foreach(var ${all_cmake_vars})
if("${var}" MATCHES "SOURCES$\|_INTRIN_\|_ASM_"
- AND NOT "${var}" MATCHES "_APP_\|DOXYGEN\|LIBWEBM\|LIBYUV\|_PKG_\|TEST")
+ AND NOT "${var}" MATCHES "DOXYGEN\|LIBYUV\|_PKG_\|TEST")
list(APPEND aom_source_vars ${var})
endif()
endforeach()
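
Regarding the usage_exit() hunk above: applications that link
aom_common_app_util must supply usage_exit() themselves; the generated file
only covers targets that omit it. A hedged sketch of a custom tool providing
its own definition (the tool name and message are illustrative):

    #include <stdio.h>
    #include <stdlib.h>

    #include "common/tools_common.h"  // declares usage_exit()

    void usage_exit(void) {
      fprintf(stderr, "Usage: my_tool <infile> <outfile>\n");
      exit(EXIT_FAILURE);
    }
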
diff --git a/METADATA b/METADATA
index eeaae2a..2534531 100644
--- a/METADATA
+++ b/METADATA
@@ -20,10 +20,10 @@
type: GIT
value: "https://aomedia.googlesource.com/aom/"
}
- version: "v3.6.1"
+ version: "v3.7.0"
last_upgrade_date {
year: 2023
- month: 5
- day: 11
+ month: 10
+ day: 9
}
}
diff --git a/README.android b/README.android
index bff3665..fcffb85 100644
--- a/README.android
+++ b/README.android
@@ -1,12 +1,12 @@
Name: libaom
URL: https://aomedia.org
-Version: v3.6.1
+Version: v3.7.0
License: BSD
License File: libaom/LICENSE
-Date: Thursday May 11 2023
-Branch: helia
-Commit: 7ade96172b95adc91a5d85bf80c90989cd543ee8
+Date: Monday October 09 2023
+Branch: ironbark
+Commit: 6054fae218eda6e53e1e3b4f7ef0fff4877c7bf1
Description:
Contains the sources used to compile libaom.
diff --git a/README.md b/README.md
index 0d51080..d7b66e0 100644
--- a/README.md
+++ b/README.md
@@ -217,27 +217,26 @@
### Microsoft Visual Studio builds {#microsoft-visual-studio-builds}
Building the AV1 codec library in Microsoft Visual Studio is supported. Visual
-Studio 2017 (15.0) or later is required. The following example demonstrates
+Studio 2019 (16.0) or later is required. The following example demonstrates
generating projects and a solution for the Microsoft IDE:
~~~
# This does not require a bash shell; Command Prompt (cmd.exe) is fine.
# This assumes the build host is a Windows x64 computer.
- # To build with Visual Studio 2019 for the x64 target:
+ # To create a Visual Studio 2022 solution for the x64 target:
+ $ cmake path/to/aom -G "Visual Studio 17 2022"
+
+ # To create a Visual Studio 2022 solution for the 32-bit x86 target:
+ $ cmake path/to/aom -G "Visual Studio 17 2022" -A Win32
+
+ # To create a Visual Studio 2019 solution for the x64 target:
$ cmake path/to/aom -G "Visual Studio 16 2019"
- $ cmake --build .
- # To build with Visual Studio 2019 for the 32-bit x86 target:
+ # To create a Visual Studio 2019 solution for the 32-bit x86 target:
$ cmake path/to/aom -G "Visual Studio 16 2019" -A Win32
- $ cmake --build .
- # To build with Visual Studio 2017 for the x64 target:
- $ cmake path/to/aom -G "Visual Studio 15 2017" -T host=x64 -A x64
- $ cmake --build .
-
- # To build with Visual Studio 2017 for the 32-bit x86 target:
- $ cmake path/to/aom -G "Visual Studio 15 2017" -T host=x64
+ # To build the solution:
$ cmake --build .
~~~
@@ -575,12 +574,19 @@
`Generate Password` Password link at the top of the page. You’ll be given
instructions for creating a cookie to use with our Git repos.
+You must also have a Gerrit account associated with your Google account. To do
+this visit the [Gerrit review server](https://aomedia-review.googlesource.com)
+and click "Sign in" (top right).
+
### Contributor agreement {#contributor-agreement}
You will be required to execute a
[contributor agreement](http://aomedia.org/license) to ensure that the AOMedia
Project has the right to distribute your changes.
+Note: If you are pushing changes on behalf of an Alliance for Open Media member
+organization this step is not necessary.
+
### Testing your code {#testing-your-code}
The testing basics are covered in the [testing section](#testing-the-av1-codec)
diff --git a/README.version b/README.version
index 398ffb8..1ce2056 100644
--- a/README.version
+++ b/README.version
@@ -1,4 +1,3 @@
URL: https://aomedia.googlesource.com/aom/
-Version: v3.6.1
+Version: v3.7.0
Local Modifications:
-* cherry-pick 4781b9f7f6 nonrd_opt: align scan tables
diff --git a/aom/aom_codec.h b/aom/aom_codec.h
index 6a9fb7b..d5b8790 100644
--- a/aom/aom_codec.h
+++ b/aom/aom_codec.h
@@ -417,19 +417,21 @@
* \param[in] ctx Pointer to this instance's context.
*
*/
-const char *aom_codec_error(aom_codec_ctx_t *ctx);
+const char *aom_codec_error(const aom_codec_ctx_t *ctx);
/*!\brief Retrieve detailed error information for codec context
*
* Returns a human readable string providing detailed information about
- * the last error.
+ * the last error. The returned string is only valid until the next
+ * aom_codec_* function call (except aom_codec_error and
+ * aom_codec_error_detail) on the codec context.
*
* \param[in] ctx Pointer to this instance's context.
*
* \retval NULL
* No detailed information is available.
*/
-const char *aom_codec_error_detail(aom_codec_ctx_t *ctx);
+const char *aom_codec_error_detail(const aom_codec_ctx_t *ctx);
/* REQUIRED FUNCTIONS
*
@@ -444,9 +446,11 @@
* \param[in] ctx Pointer to this instance's context
*
* \retval #AOM_CODEC_OK
- * The codec algorithm initialized.
- * \retval #AOM_CODEC_MEM_ERROR
- * Memory allocation failed.
+ * The codec instance has been destroyed.
+ * \retval #AOM_CODEC_INVALID_PARAM
+ * ctx is a null pointer.
+ * \retval #AOM_CODEC_ERROR
+ * Codec context not initialized.
*/
aom_codec_err_t aom_codec_destroy(aom_codec_ctx_t *ctx);
diff --git a/aom/aom_decoder.h b/aom/aom_decoder.h
index 5ce7c7b..f3f11d8 100644
--- a/aom/aom_decoder.h
+++ b/aom/aom_decoder.h
@@ -113,7 +113,7 @@
* \param[in] ver ABI version number. Must be set to
* AOM_DECODER_ABI_VERSION
* \retval #AOM_CODEC_OK
- * The decoder algorithm initialized.
+ * The decoder algorithm has been initialized.
* \retval #AOM_CODEC_MEM_ERROR
* Memory allocation failed.
*/
diff --git a/aom/aom_encoder.h b/aom/aom_encoder.h
index c0efe79..e3d8d29 100644
--- a/aom/aom_encoder.h
+++ b/aom/aom_encoder.h
@@ -903,7 +903,7 @@
/*!\brief Initialize an encoder instance
*
- * Initializes a encoder context using the given interface. Applications
+ * Initializes an encoder context using the given interface. Applications
* should call the aom_codec_enc_init convenience macro instead of this
* function directly, to ensure that the ABI version number parameter
* is properly initialized.
@@ -912,6 +912,9 @@
* is not thread safe and should be guarded with a lock if being used
* in a multithreaded context.
*
+ * If aom_codec_enc_init_ver() fails, it is not necessary to call
+ * aom_codec_destroy() on the encoder context.
+ *
* \param[in] ctx Pointer to this instance's context.
* \param[in] iface Pointer to the algorithm interface to use.
* \param[in] cfg Configuration to use, if known.
@@ -919,7 +922,7 @@
* \param[in] ver ABI version number. Must be set to
* AOM_ENCODER_ABI_VERSION
* \retval #AOM_CODEC_OK
- * The decoder algorithm initialized.
+ * The encoder algorithm has been initialized.
* \retval #AOM_CODEC_MEM_ERROR
* Memory allocation failed.
*/
@@ -1024,6 +1027,10 @@
* \param[in] img Image data to encode, NULL to flush.
* Encoding sample values outside the range
* [0..(1<<img->bit_depth)-1] is undefined behavior.
+ * Note: Although img is declared as a const pointer,
+ * if AV1E_SET_DENOISE_NOISE_LEVEL is set to a nonzero
+ * value aom_codec_encode() modifies (denoises) the
+ * samples in img->planes[i].
* \param[in] pts Presentation time stamp, in timebase units. If img
* is NULL, pts is ignored.
* \param[in] duration Duration to show frame, in timebase units. If img
diff --git a/aom/aomcx.h b/aom/aomcx.h
index 906cf2a..a5db0a5 100644
--- a/aom/aomcx.h
+++ b/aom/aomcx.h
@@ -1481,6 +1481,52 @@
*/
AV1E_ENABLE_SB_QP_SWEEP = 158,
+ /*!\brief Codec control to set quantizer for the next frame, int parameter.
+ *
+ * - Valid range [0, 63]
+ *
+ * This will turn off cyclic refresh. Only applicable to 1-pass.
+ */
+ AV1E_SET_QUANTIZER_ONE_PASS = 159,
+
+ /*!\brief Codec control to enable the rate distribution guided delta
+ * quantization in all intra mode, unsigned int parameter
+ *
+ * - 0 = disable (default)
+ * - 1 = enable
+ *
+ * \attention This feature requires --deltaq-mode=3, as well as an input
+ * file which contains the rate distribution for each 16x16 block,
+ * passed in by --rate-distribution-info=rate_distribution.txt.
+ */
+ AV1E_ENABLE_RATE_GUIDE_DELTAQ = 160,
+
+ /*!\brief Codec control to set the input file for rate distribution used
+ * in all intra mode, const char * parameter
+ * The input should be the name of a text file, which
+ * contains (rows x cols) float values separated by whitespace.
+ * Each float value represents the number of bits for each 16x16 block.
+ * rows = (frame_height + 15) / 16
+ * cols = (frame_width + 15) / 16
+ *
+ * \attention This feature requires --enable-rate-guide-deltaq=1.
+ */
+ AV1E_SET_RATE_DISTRIBUTION_INFO = 161,
+
+ /*!\brief Codec control to get the CDEF strength for Y / luma plane,
+ * int * parameter.
+ * Returns an integer array of CDEF_MAX_STRENGTHS elements.
+ */
+ AV1E_GET_LUMA_CDEF_STRENGTH = 162,
+
+ /*!\brief Codec control to set the target bitrate in kilobits per second,
+ * unsigned int parameter. For 1 pass CBR mode, single layer encoding.
+ * This control replaces the call aom_codec_enc_config_set(&codec, &cfg)
+ * when only the target bitrate is changed, and so is much cheaper as it
+ * bypasses a lot of unneeded code checks.
+ */
+ AV1E_SET_BITRATE_ONE_PASS_CBR = 163,
+
// Any new encoder control IDs should be added above.
// Maximum allowed encoder control ID is 229.
// No encoder control ID should be added below.
@@ -1497,7 +1543,9 @@
AOME_THREEFOUR = 3,
AOME_ONEFOUR = 4,
AOME_ONEEIGHT = 5,
- AOME_ONETWO = 6
+ AOME_ONETWO = 6,
+ AOME_TWOTHREE = 7,
+ AOME_ONETHREE = 8
} AOM_SCALING_MODE;
/*!\brief Max number of segments
@@ -1579,6 +1627,7 @@
AOM_TUNE_VMAF_MAX_GAIN = 6,
AOM_TUNE_VMAF_NEG_MAX_GAIN = 7,
AOM_TUNE_BUTTERAUGLI = 8,
+ AOM_TUNE_VMAF_SALIENCY_MAP = 9,
} aom_tune_metric;
/*!\brief Distortion metric to use for RD optimization.
@@ -1608,7 +1657,12 @@
int temporal_layer_id; /**< Temporal layer ID */
} aom_svc_layer_id_t;
-/*!brief Parameter type for SVC */
+/*!\brief Parameter type for SVC
+ *
+ * In the arrays of size AOM_MAX_LAYERS, the index for spatial layer `sl` and
+ * temporal layer `tl` is sl * number_temporal_layers + tl.
+ *
+ */
typedef struct aom_svc_params {
int number_spatial_layers; /**< Number of spatial layers */
int number_temporal_layers; /**< Number of temporal layers */
@@ -1616,7 +1670,7 @@
int min_quantizers[AOM_MAX_LAYERS]; /**< Min Q for each layer */
int scaling_factor_num[AOM_MAX_SS_LAYERS]; /**< Scaling factor-numerator */
int scaling_factor_den[AOM_MAX_SS_LAYERS]; /**< Scaling factor-denominator */
- /*! Target bitrate for each layer */
+ /*! Target bitrate for each layer, in kilobits per second */
int layer_target_bitrate[AOM_MAX_LAYERS];
/*! Frame rate factor for each temporal layer */
int framerate_factor[AOM_MAX_TS_LAYERS];
@@ -2103,6 +2157,21 @@
AOM_CTRL_USE_TYPE(AV1E_ENABLE_SB_QP_SWEEP, unsigned int)
#define AOM_CTRL_AV1E_ENABLE_SB_QP_SWEEP
+AOM_CTRL_USE_TYPE(AV1E_SET_QUANTIZER_ONE_PASS, int)
+#define AOM_CTRL_AV1E_SET_QUANTIZER_ONE_PASS
+
+AOM_CTRL_USE_TYPE(AV1E_ENABLE_RATE_GUIDE_DELTAQ, unsigned int)
+#define AOM_CTRL_AV1E_ENABLE_RATE_GUIDE_DELTAQ
+
+AOM_CTRL_USE_TYPE(AV1E_SET_RATE_DISTRIBUTION_INFO, const char *)
+#define AOM_CTRL_AV1E_SET_RATE_DISTRIBUTION_INFO
+
+AOM_CTRL_USE_TYPE(AV1E_GET_LUMA_CDEF_STRENGTH, int *)
+#define AOM_CTRL_AV1E_GET_LUMA_CDEF_STRENGTH
+
+AOM_CTRL_USE_TYPE(AV1E_SET_BITRATE_ONE_PASS_CBR, unsigned int)
+#define AOM_CTRL_AV1E_SET_BITRATE_ONE_PASS_CBR
+
/*!\endcond */
/*! @} - end defgroup aom_encoder */
#ifdef __cplusplus
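
To illustrate the layer indexing documented in the aom_svc_params hunk above,
a minimal sketch (layer counts and bitrates are illustrative; assumes "codec"
is an initialized encoder context):

    aom_svc_params_t svc_params = { 0 };
    svc_params.number_spatial_layers = 2;
    svc_params.number_temporal_layers = 3;
    for (int sl = 0; sl < svc_params.number_spatial_layers; ++sl) {
      for (int tl = 0; tl < svc_params.number_temporal_layers; ++tl) {
        // Index per the new documentation: sl * number_temporal_layers + tl.
        const int layer = sl * svc_params.number_temporal_layers + tl;
        svc_params.layer_target_bitrate[layer] = 100 * (layer + 1);  // kbps
      }
    }
    aom_codec_control(&codec, AV1E_SET_SVC_PARAMS, &svc_params);
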
diff --git a/aom/src/aom_codec.c b/aom/src/aom_codec.c
index bc2039a..4e75fcb 100644
--- a/aom/src/aom_codec.c
+++ b/aom/src/aom_codec.c
@@ -52,12 +52,12 @@
return "Unrecognized error code";
}
-const char *aom_codec_error(aom_codec_ctx_t *ctx) {
+const char *aom_codec_error(const aom_codec_ctx_t *ctx) {
return (ctx) ? aom_codec_err_to_string(ctx->err)
: aom_codec_err_to_string(AOM_CODEC_INVALID_PARAM);
}
-const char *aom_codec_error_detail(aom_codec_ctx_t *ctx) {
+const char *aom_codec_error_detail(const aom_codec_ctx_t *ctx) {
if (ctx && ctx->err)
return ctx->priv ? ctx->priv->err_detail : ctx->err_detail;
@@ -81,7 +81,7 @@
}
aom_codec_caps_t aom_codec_get_caps(aom_codec_iface_t *iface) {
- return (iface) ? iface->caps : 0;
+ return iface ? iface->caps : 0;
}
aom_codec_err_t aom_codec_control(aom_codec_ctx_t *ctx, int ctrl_id, ...) {
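
A small sketch of what the const-qualified accessors above enable: read-only
error-reporting helpers can now take a const context (the helper name is
illustrative; note the aom_codec_error_detail() string is only valid until
the next aom_codec_* call on the context):

    #include <stdio.h>

    #include "aom/aom_codec.h"

    static void report_error(const aom_codec_ctx_t *ctx, const char *op) {
      const char *detail = aom_codec_error_detail(ctx);
      fprintf(stderr, "%s failed: %s\n", op, aom_codec_error(ctx));
      if (detail) fprintf(stderr, "  detail: %s\n", detail);
    }
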
diff --git a/aom/src/aom_encoder.c b/aom/src/aom_encoder.c
index 6ec2f34..a4acbcc 100644
--- a/aom/src/aom_encoder.c
+++ b/aom/src/aom_encoder.c
@@ -80,6 +80,10 @@
res = ctx->iface->init(ctx);
if (res) {
+ // IMPORTANT: ctx->priv->err_detail must be null or point to a string
+ // that remains valid after ctx->priv is destroyed, such as a C string
+ // literal. This makes it safe to call aom_codec_error_detail() after
+ // aom_codec_enc_init_ver() failed.
ctx->err_detail = ctx->priv ? ctx->priv->err_detail : NULL;
aom_codec_destroy(ctx);
}
@@ -92,7 +96,6 @@
aom_codec_enc_cfg_t *cfg,
unsigned int usage) {
aom_codec_err_t res;
- int i;
if (!iface || !cfg)
res = AOM_CODEC_INVALID_PARAM;
@@ -101,26 +104,24 @@
else {
res = AOM_CODEC_INVALID_PARAM;
- for (i = 0; i < iface->enc.cfg_count; ++i) {
+ for (int i = 0; i < iface->enc.cfg_count; ++i) {
if (iface->enc.cfgs[i].g_usage == usage) {
*cfg = iface->enc.cfgs[i];
res = AOM_CODEC_OK;
+ /* default values */
+ memset(&cfg->encoder_cfg, 0, sizeof(cfg->encoder_cfg));
+ cfg->encoder_cfg.super_block_size = 0; // Dynamic
+ cfg->encoder_cfg.max_partition_size = 128;
+ cfg->encoder_cfg.min_partition_size = 4;
+ cfg->encoder_cfg.disable_trellis_quant = 3;
break;
}
}
}
- /* default values */
- if (cfg) {
- memset(&cfg->encoder_cfg, 0, sizeof(cfg->encoder_cfg));
- cfg->encoder_cfg.super_block_size = 0; // Dynamic
- cfg->encoder_cfg.max_partition_size = 128;
- cfg->encoder_cfg.min_partition_size = 4;
- cfg->encoder_cfg.disable_trellis_quant = 3;
- }
return res;
}
-#if ARCH_X86 || ARCH_X86_64
+#if AOM_ARCH_X86 || AOM_ARCH_X86_64
/* On X86, disable the x87 unit's internal 80 bit precision for better
* consistency with the SSE unit's 64 bit precision.
*/
@@ -131,15 +132,17 @@
#else
#define FLOATING_POINT_SET_PRECISION
#define FLOATING_POINT_RESTORE_PRECISION
-#endif // ARCH_X86 || ARCH_X86_64
+#endif // AOM_ARCH_X86 || AOM_ARCH_X86_64
#if HAVE_FEXCEPT && CONFIG_DEBUG
#define FLOATING_POINT_SET_EXCEPTIONS \
const int float_excepts = \
feenableexcept(FE_DIVBYZERO | FE_UNDERFLOW | FE_OVERFLOW);
#define FLOATING_POINT_RESTORE_EXCEPTIONS \
- fedisableexcept(FE_ALL_EXCEPT); \
- feenableexcept(float_excepts);
+ if (float_excepts != -1) { \
+ fedisableexcept(FE_ALL_EXCEPT); \
+ feenableexcept(float_excepts); \
+ }
#else
#define FLOATING_POINT_SET_EXCEPTIONS
#define FLOATING_POINT_RESTORE_EXCEPTIONS
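
The FLOATING_POINT_RESTORE_EXCEPTIONS change above guards against
feenableexcept() failing: it returns the previous exception mask, or -1 on
failure (see aomedia:3403, "feenableexcept() failed" on Linux arm), and -1
must not be re-enabled. A standalone sketch of the same pattern (glibc
extension; the wrapper function is illustrative):

    #define _GNU_SOURCE
    #include <fenv.h>

    static void run_with_fp_traps(void (*fn)(void)) {
      // Returns the previous mask of enabled exceptions, or -1 on failure.
      const int prev =
          feenableexcept(FE_DIVBYZERO | FE_UNDERFLOW | FE_OVERFLOW);
      fn();
      if (prev != -1) {
        fedisableexcept(FE_ALL_EXCEPT);
        feenableexcept(prev);
      }
    }
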
diff --git a/aom_dsp/aom_dsp.cmake b/aom_dsp/aom_dsp.cmake
index c5c2db7..4c60e5c 100644
--- a/aom_dsp/aom_dsp.cmake
+++ b/aom_dsp/aom_dsp.cmake
@@ -112,12 +112,14 @@
list(APPEND AOM_DSP_COMMON_INTRIN_NEON
"${AOM_ROOT}/aom_dsp/arm/aom_convolve_copy_neon.c"
+ "${AOM_ROOT}/aom_dsp/arm/aom_convolve8_neon.c"
"${AOM_ROOT}/aom_dsp/arm/fwd_txfm_neon.c"
"${AOM_ROOT}/aom_dsp/arm/loopfilter_neon.c"
"${AOM_ROOT}/aom_dsp/arm/highbd_intrapred_neon.c"
"${AOM_ROOT}/aom_dsp/arm/intrapred_neon.c"
"${AOM_ROOT}/aom_dsp/arm/subtract_neon.c"
- "${AOM_ROOT}/aom_dsp/arm/blend_a64_mask_neon.c")
+ "${AOM_ROOT}/aom_dsp/arm/blend_a64_mask_neon.c"
+ "${AOM_ROOT}/aom_dsp/arm/avg_pred_neon.c")
if(CONFIG_AV1_HIGHBITDEPTH)
list(APPEND AOM_DSP_COMMON_INTRIN_SSE2
@@ -176,7 +178,7 @@
# Flow estimation library
if(NOT CONFIG_REALTIME_ONLY)
- list(APPEND AOM_DSP_ENCODER_SOURCES
+ list(APPEND AOM_DSP_ENCODER_SOURCES "${AOM_ROOT}/aom_dsp/pyramid.c"
"${AOM_ROOT}/aom_dsp/flow_estimation/corner_detect.c"
"${AOM_ROOT}/aom_dsp/flow_estimation/corner_match.c"
"${AOM_ROOT}/aom_dsp/flow_estimation/disflow.c"
@@ -184,7 +186,8 @@
"${AOM_ROOT}/aom_dsp/flow_estimation/ransac.c")
list(APPEND AOM_DSP_ENCODER_INTRIN_SSE4_1
- "${AOM_ROOT}/aom_dsp/flow_estimation/x86/corner_match_sse4.c")
+ "${AOM_ROOT}/aom_dsp/flow_estimation/x86/corner_match_sse4.c"
+ "${AOM_ROOT}/aom_dsp/flow_estimation/x86/disflow_sse4.c")
list(APPEND AOM_DSP_ENCODER_INTRIN_AVX2
"${AOM_ROOT}/aom_dsp/flow_estimation/x86/corner_match_avx2.c")
@@ -208,7 +211,8 @@
"${AOM_ROOT}/aom_dsp/x86/quantize_x86.h"
"${AOM_ROOT}/aom_dsp/x86/blk_sse_sum_sse2.c"
"${AOM_ROOT}/aom_dsp/x86/sum_squares_sse2.c"
- "${AOM_ROOT}/aom_dsp/x86/variance_sse2.c")
+ "${AOM_ROOT}/aom_dsp/x86/variance_sse2.c"
+ "${AOM_ROOT}/aom_dsp/x86/jnt_sad_sse2.c")
list(APPEND AOM_DSP_ENCODER_ASM_SSSE3_X86_64
"${AOM_ROOT}/aom_dsp/x86/fwd_txfm_ssse3_x86_64.asm"
@@ -245,8 +249,7 @@
"${AOM_ROOT}/aom_dsp/x86/masked_variance_intrin_ssse3.c"
"${AOM_ROOT}/aom_dsp/x86/quantize_ssse3.c"
"${AOM_ROOT}/aom_dsp/x86/variance_impl_ssse3.c"
- "${AOM_ROOT}/aom_dsp/x86/jnt_variance_ssse3.c"
- "${AOM_ROOT}/aom_dsp/x86/jnt_sad_ssse3.c")
+ "${AOM_ROOT}/aom_dsp/x86/jnt_variance_ssse3.c")
list(APPEND AOM_DSP_ENCODER_INTRIN_SSE4_1
"${AOM_ROOT}/aom_dsp/x86/avg_intrin_sse4.c"
@@ -254,12 +257,17 @@
"${AOM_ROOT}/aom_dsp/x86/obmc_sad_sse4.c"
"${AOM_ROOT}/aom_dsp/x86/obmc_variance_sse4.c")
- list(APPEND AOM_DSP_ENCODER_INTRIN_NEON "${AOM_ROOT}/aom_dsp/arm/sad4d_neon.c"
+ list(APPEND AOM_DSP_ENCODER_INTRIN_NEON
+ "${AOM_ROOT}/aom_dsp/arm/sadxd_neon.c"
"${AOM_ROOT}/aom_dsp/arm/sad_neon.c"
+ "${AOM_ROOT}/aom_dsp/arm/masked_sad_neon.c"
+ "${AOM_ROOT}/aom_dsp/arm/masked_sad4d_neon.c"
"${AOM_ROOT}/aom_dsp/arm/subpel_variance_neon.c"
"${AOM_ROOT}/aom_dsp/arm/variance_neon.c"
"${AOM_ROOT}/aom_dsp/arm/hadamard_neon.c"
"${AOM_ROOT}/aom_dsp/arm/avg_neon.c"
+ "${AOM_ROOT}/aom_dsp/arm/obmc_variance_neon.c"
+ "${AOM_ROOT}/aom_dsp/arm/obmc_sad_neon.c"
"${AOM_ROOT}/aom_dsp/arm/sse_neon.c"
"${AOM_ROOT}/aom_dsp/arm/sum_squares_neon.c")
@@ -283,7 +291,11 @@
"${AOM_ROOT}/aom_dsp/x86/highbd_variance_sse4.c")
list(APPEND AOM_DSP_ENCODER_INTRIN_NEON
+ "${AOM_ROOT}/aom_dsp/arm/highbd_avg_neon.c"
+ "${AOM_ROOT}/aom_dsp/arm/highbd_hadamard_neon.c"
"${AOM_ROOT}/aom_dsp/arm/highbd_quantize_neon.c"
+ "${AOM_ROOT}/aom_dsp/arm/highbd_sad_neon.c"
+ "${AOM_ROOT}/aom_dsp/arm/highbd_sad4d_neon.c"
"${AOM_ROOT}/aom_dsp/arm/highbd_variance_neon.c")
endif()
@@ -322,8 +334,8 @@
function(setup_aom_dsp_targets)
add_library(aom_dsp_common OBJECT ${AOM_DSP_COMMON_SOURCES})
list(APPEND AOM_LIB_TARGETS aom_dsp_common)
- create_dummy_source_file("aom_av1" "c" "dummy_source_file")
- add_library(aom_dsp OBJECT "${dummy_source_file}")
+ create_no_op_source_file("aom_av1" "c" "no_op_source_file")
+ add_library(aom_dsp OBJECT "${no_op_source_file}")
target_sources(aom PRIVATE $<TARGET_OBJECTS:aom_dsp_common>)
if(BUILD_SHARED_LIBS)
target_sources(aom_static PRIVATE $<TARGET_OBJECTS:aom_dsp_common>)
@@ -331,8 +343,8 @@
list(APPEND AOM_LIB_TARGETS aom_dsp)
# Not all generators support libraries consisting only of object files. Add a
- # dummy source file to the aom_dsp target.
- add_dummy_source_file_to_target("aom_dsp" "c")
+ # source file to the aom_dsp target.
+ add_no_op_source_file_to_target("aom_dsp" "c")
if(CONFIG_AV1_DECODER)
add_library(aom_dsp_decoder OBJECT ${AOM_DSP_DECODER_SOURCES})
diff --git a/aom_dsp/aom_dsp_rtcd_defs.pl b/aom_dsp/aom_dsp_rtcd_defs.pl
index b3f8ec7..e738971 100755
--- a/aom_dsp/aom_dsp_rtcd_defs.pl
+++ b/aom_dsp/aom_dsp_rtcd_defs.pl
@@ -16,8 +16,8 @@
#include "aom/aom_integer.h"
#include "aom_dsp/aom_dsp_common.h"
-#include "av1/common/enums.h"
#include "av1/common/blockd.h"
+#include "av1/common/enums.h"
EOF
}
@@ -86,104 +86,104 @@
}
specialize qw/aom_dc_top_predictor_4x4 neon sse2/;
-specialize qw/aom_dc_top_predictor_4x8 sse2/;
-specialize qw/aom_dc_top_predictor_4x16 sse2/;
-specialize qw/aom_dc_top_predictor_8x4 sse2/;
+specialize qw/aom_dc_top_predictor_4x8 neon sse2/;
+specialize qw/aom_dc_top_predictor_4x16 neon sse2/;
+specialize qw/aom_dc_top_predictor_8x4 neon sse2/;
specialize qw/aom_dc_top_predictor_8x8 neon sse2/;
-specialize qw/aom_dc_top_predictor_8x16 sse2/;
-specialize qw/aom_dc_top_predictor_8x32 sse2/;
-specialize qw/aom_dc_top_predictor_16x4 sse2/;
-specialize qw/aom_dc_top_predictor_16x8 sse2/;
+specialize qw/aom_dc_top_predictor_8x16 neon sse2/;
+specialize qw/aom_dc_top_predictor_8x32 neon sse2/;
+specialize qw/aom_dc_top_predictor_16x4 neon sse2/;
+specialize qw/aom_dc_top_predictor_16x8 neon sse2/;
specialize qw/aom_dc_top_predictor_16x16 neon sse2/;
-specialize qw/aom_dc_top_predictor_16x32 sse2/;
-specialize qw/aom_dc_top_predictor_16x64 sse2/;
-specialize qw/aom_dc_top_predictor_32x8 sse2/;
-specialize qw/aom_dc_top_predictor_32x16 sse2 avx2/;
+specialize qw/aom_dc_top_predictor_16x32 neon sse2/;
+specialize qw/aom_dc_top_predictor_16x64 neon sse2/;
+specialize qw/aom_dc_top_predictor_32x8 neon sse2/;
+specialize qw/aom_dc_top_predictor_32x16 neon sse2 avx2/;
specialize qw/aom_dc_top_predictor_32x32 neon sse2 avx2/;
-specialize qw/aom_dc_top_predictor_32x64 sse2 avx2/;
-specialize qw/aom_dc_top_predictor_64x16 sse2 avx2/;
-specialize qw/aom_dc_top_predictor_64x32 sse2 avx2/;
-specialize qw/aom_dc_top_predictor_64x64 sse2 avx2/;
+specialize qw/aom_dc_top_predictor_32x64 neon sse2 avx2/;
+specialize qw/aom_dc_top_predictor_64x16 neon sse2 avx2/;
+specialize qw/aom_dc_top_predictor_64x32 neon sse2 avx2/;
+specialize qw/aom_dc_top_predictor_64x64 neon sse2 avx2/;
specialize qw/aom_dc_left_predictor_4x4 neon sse2/;
-specialize qw/aom_dc_left_predictor_4x8 sse2/;
-specialize qw/aom_dc_left_predictor_4x16 sse2/;
-specialize qw/aom_dc_left_predictor_8x4 sse2/;
+specialize qw/aom_dc_left_predictor_4x8 neon sse2/;
+specialize qw/aom_dc_left_predictor_4x16 neon sse2/;
+specialize qw/aom_dc_left_predictor_8x4 neon sse2/;
specialize qw/aom_dc_left_predictor_8x8 neon sse2/;
-specialize qw/aom_dc_left_predictor_8x16 sse2/;
-specialize qw/aom_dc_left_predictor_8x32 sse2/;
-specialize qw/aom_dc_left_predictor_16x4 sse2/;
-specialize qw/aom_dc_left_predictor_16x8 sse2/;
+specialize qw/aom_dc_left_predictor_8x16 neon sse2/;
+specialize qw/aom_dc_left_predictor_8x32 neon sse2/;
+specialize qw/aom_dc_left_predictor_16x4 neon sse2/;
+specialize qw/aom_dc_left_predictor_16x8 neon sse2/;
specialize qw/aom_dc_left_predictor_16x16 neon sse2/;
-specialize qw/aom_dc_left_predictor_16x32 sse2/;
-specialize qw/aom_dc_left_predictor_16x64 sse2/;
-specialize qw/aom_dc_left_predictor_32x8 sse2/;
-specialize qw/aom_dc_left_predictor_32x16 sse2 avx2/;
+specialize qw/aom_dc_left_predictor_16x32 neon sse2/;
+specialize qw/aom_dc_left_predictor_16x64 neon sse2/;
+specialize qw/aom_dc_left_predictor_32x8 neon sse2/;
+specialize qw/aom_dc_left_predictor_32x16 neon sse2 avx2/;
specialize qw/aom_dc_left_predictor_32x32 neon sse2 avx2/;
-specialize qw/aom_dc_left_predictor_32x64 sse2 avx2/;
-specialize qw/aom_dc_left_predictor_64x16 sse2 avx2/;
-specialize qw/aom_dc_left_predictor_64x32 sse2 avx2/;
-specialize qw/aom_dc_left_predictor_64x64 sse2 avx2/;
+specialize qw/aom_dc_left_predictor_32x64 neon sse2 avx2/;
+specialize qw/aom_dc_left_predictor_64x16 neon sse2 avx2/;
+specialize qw/aom_dc_left_predictor_64x32 neon sse2 avx2/;
+specialize qw/aom_dc_left_predictor_64x64 neon sse2 avx2/;
specialize qw/aom_dc_128_predictor_4x4 neon sse2/;
-specialize qw/aom_dc_128_predictor_4x8 sse2/;
-specialize qw/aom_dc_128_predictor_4x16 sse2/;
-specialize qw/aom_dc_128_predictor_8x4 sse2/;
+specialize qw/aom_dc_128_predictor_4x8 neon sse2/;
+specialize qw/aom_dc_128_predictor_4x16 neon sse2/;
+specialize qw/aom_dc_128_predictor_8x4 neon sse2/;
specialize qw/aom_dc_128_predictor_8x8 neon sse2/;
-specialize qw/aom_dc_128_predictor_8x16 sse2/;
-specialize qw/aom_dc_128_predictor_8x32 sse2/;
-specialize qw/aom_dc_128_predictor_16x4 sse2/;
-specialize qw/aom_dc_128_predictor_16x8 sse2/;
+specialize qw/aom_dc_128_predictor_8x16 neon sse2/;
+specialize qw/aom_dc_128_predictor_8x32 neon sse2/;
+specialize qw/aom_dc_128_predictor_16x4 neon sse2/;
+specialize qw/aom_dc_128_predictor_16x8 neon sse2/;
specialize qw/aom_dc_128_predictor_16x16 neon sse2/;
-specialize qw/aom_dc_128_predictor_16x32 sse2/;
-specialize qw/aom_dc_128_predictor_16x64 sse2/;
-specialize qw/aom_dc_128_predictor_32x8 sse2/;
-specialize qw/aom_dc_128_predictor_32x16 sse2 avx2/;
+specialize qw/aom_dc_128_predictor_16x32 neon sse2/;
+specialize qw/aom_dc_128_predictor_16x64 neon sse2/;
+specialize qw/aom_dc_128_predictor_32x8 neon sse2/;
+specialize qw/aom_dc_128_predictor_32x16 neon sse2 avx2/;
specialize qw/aom_dc_128_predictor_32x32 neon sse2 avx2/;
-specialize qw/aom_dc_128_predictor_32x64 sse2 avx2/;
-specialize qw/aom_dc_128_predictor_64x16 sse2 avx2/;
-specialize qw/aom_dc_128_predictor_64x32 sse2 avx2/;
-specialize qw/aom_dc_128_predictor_64x64 sse2 avx2/;
+specialize qw/aom_dc_128_predictor_32x64 neon sse2 avx2/;
+specialize qw/aom_dc_128_predictor_64x16 neon sse2 avx2/;
+specialize qw/aom_dc_128_predictor_64x32 neon sse2 avx2/;
+specialize qw/aom_dc_128_predictor_64x64 neon sse2 avx2/;
specialize qw/aom_v_predictor_4x4 neon sse2/;
-specialize qw/aom_v_predictor_4x8 sse2/;
-specialize qw/aom_v_predictor_4x16 sse2/;
-specialize qw/aom_v_predictor_8x4 sse2/;
+specialize qw/aom_v_predictor_4x8 neon sse2/;
+specialize qw/aom_v_predictor_4x16 neon sse2/;
+specialize qw/aom_v_predictor_8x4 neon sse2/;
specialize qw/aom_v_predictor_8x8 neon sse2/;
-specialize qw/aom_v_predictor_8x16 sse2/;
-specialize qw/aom_v_predictor_8x32 sse2/;
-specialize qw/aom_v_predictor_16x4 sse2/;
-specialize qw/aom_v_predictor_16x8 sse2/;
+specialize qw/aom_v_predictor_8x16 neon sse2/;
+specialize qw/aom_v_predictor_8x32 neon sse2/;
+specialize qw/aom_v_predictor_16x4 neon sse2/;
+specialize qw/aom_v_predictor_16x8 neon sse2/;
specialize qw/aom_v_predictor_16x16 neon sse2/;
-specialize qw/aom_v_predictor_16x32 sse2/;
-specialize qw/aom_v_predictor_16x64 sse2/;
-specialize qw/aom_v_predictor_32x8 sse2/;
-specialize qw/aom_v_predictor_32x16 sse2 avx2/;
+specialize qw/aom_v_predictor_16x32 neon sse2/;
+specialize qw/aom_v_predictor_16x64 neon sse2/;
+specialize qw/aom_v_predictor_32x8 neon sse2/;
+specialize qw/aom_v_predictor_32x16 neon sse2 avx2/;
specialize qw/aom_v_predictor_32x32 neon sse2 avx2/;
-specialize qw/aom_v_predictor_32x64 sse2 avx2/;
-specialize qw/aom_v_predictor_64x16 sse2 avx2/;
-specialize qw/aom_v_predictor_64x32 sse2 avx2/;
-specialize qw/aom_v_predictor_64x64 sse2 avx2/;
+specialize qw/aom_v_predictor_32x64 neon sse2 avx2/;
+specialize qw/aom_v_predictor_64x16 neon sse2 avx2/;
+specialize qw/aom_v_predictor_64x32 neon sse2 avx2/;
+specialize qw/aom_v_predictor_64x64 neon sse2 avx2/;
specialize qw/aom_h_predictor_4x4 neon sse2/;
-specialize qw/aom_h_predictor_4x8 sse2/;
-specialize qw/aom_h_predictor_4x16 sse2/;
-specialize qw/aom_h_predictor_8x4 sse2/;
+specialize qw/aom_h_predictor_4x8 neon sse2/;
+specialize qw/aom_h_predictor_4x16 neon sse2/;
+specialize qw/aom_h_predictor_8x4 neon sse2/;
specialize qw/aom_h_predictor_8x8 neon sse2/;
-specialize qw/aom_h_predictor_8x16 sse2/;
-specialize qw/aom_h_predictor_8x32 sse2/;
-specialize qw/aom_h_predictor_16x4 sse2/;
-specialize qw/aom_h_predictor_16x8 sse2/;
+specialize qw/aom_h_predictor_8x16 neon sse2/;
+specialize qw/aom_h_predictor_8x32 neon sse2/;
+specialize qw/aom_h_predictor_16x4 neon sse2/;
+specialize qw/aom_h_predictor_16x8 neon sse2/;
specialize qw/aom_h_predictor_16x16 neon sse2/;
-specialize qw/aom_h_predictor_16x32 sse2/;
-specialize qw/aom_h_predictor_16x64 sse2/;
-specialize qw/aom_h_predictor_32x8 sse2/;
-specialize qw/aom_h_predictor_32x16 sse2/;
+specialize qw/aom_h_predictor_16x32 neon sse2/;
+specialize qw/aom_h_predictor_16x64 neon sse2/;
+specialize qw/aom_h_predictor_32x8 neon sse2/;
+specialize qw/aom_h_predictor_32x16 neon sse2/;
specialize qw/aom_h_predictor_32x32 neon sse2 avx2/;
-specialize qw/aom_h_predictor_32x64 sse2/;
-specialize qw/aom_h_predictor_64x16 sse2/;
-specialize qw/aom_h_predictor_64x32 sse2/;
-specialize qw/aom_h_predictor_64x64 sse2/;
+specialize qw/aom_h_predictor_32x64 neon sse2/;
+specialize qw/aom_h_predictor_64x16 neon sse2/;
+specialize qw/aom_h_predictor_64x32 neon sse2/;
+specialize qw/aom_h_predictor_64x64 neon sse2/;
specialize qw/aom_paeth_predictor_4x4 ssse3 neon/;
specialize qw/aom_paeth_predictor_4x8 ssse3 neon/;
@@ -268,24 +268,24 @@
# TODO(yunqingwang): optimize rectangular DC_PRED to replace division
# by multiply and shift.
specialize qw/aom_dc_predictor_4x4 neon sse2/;
-specialize qw/aom_dc_predictor_4x8 sse2/;
-specialize qw/aom_dc_predictor_4x16 sse2/;
-specialize qw/aom_dc_predictor_8x4 sse2/;
+specialize qw/aom_dc_predictor_4x8 neon sse2/;
+specialize qw/aom_dc_predictor_4x16 neon sse2/;
+specialize qw/aom_dc_predictor_8x4 neon sse2/;
specialize qw/aom_dc_predictor_8x8 neon sse2/;
-specialize qw/aom_dc_predictor_8x16 sse2/;
-specialize qw/aom_dc_predictor_8x32 sse2/;
-specialize qw/aom_dc_predictor_16x4 sse2/;
-specialize qw/aom_dc_predictor_16x8 sse2/;
+specialize qw/aom_dc_predictor_8x16 neon sse2/;
+specialize qw/aom_dc_predictor_8x32 neon sse2/;
+specialize qw/aom_dc_predictor_16x4 neon sse2/;
+specialize qw/aom_dc_predictor_16x8 neon sse2/;
specialize qw/aom_dc_predictor_16x16 neon sse2/;
-specialize qw/aom_dc_predictor_16x32 sse2/;
-specialize qw/aom_dc_predictor_16x64 sse2/;
-specialize qw/aom_dc_predictor_32x8 sse2/;
-specialize qw/aom_dc_predictor_32x16 sse2 avx2/;
+specialize qw/aom_dc_predictor_16x32 neon sse2/;
+specialize qw/aom_dc_predictor_16x64 neon sse2/;
+specialize qw/aom_dc_predictor_32x8 neon sse2/;
+specialize qw/aom_dc_predictor_32x16 neon sse2 avx2/;
specialize qw/aom_dc_predictor_32x32 neon sse2 avx2/;
-specialize qw/aom_dc_predictor_32x64 sse2 avx2/;
-specialize qw/aom_dc_predictor_64x64 sse2 avx2/;
-specialize qw/aom_dc_predictor_64x32 sse2 avx2/;
-specialize qw/aom_dc_predictor_64x16 sse2 avx2/;
+specialize qw/aom_dc_predictor_32x64 neon sse2 avx2/;
+specialize qw/aom_dc_predictor_64x64 neon sse2 avx2/;
+specialize qw/aom_dc_predictor_64x32 neon sse2 avx2/;
+specialize qw/aom_dc_predictor_64x16 neon sse2 avx2/;
if (aom_config("CONFIG_AV1_HIGHBITDEPTH") eq "yes") {
specialize qw/aom_highbd_v_predictor_4x4 sse2 neon/;
specialize qw/aom_highbd_v_predictor_4x8 sse2 neon/;
@@ -310,57 +310,104 @@
# TODO(yunqingwang): optimize rectangular DC_PRED to replace division
# by multiply and shift.
specialize qw/aom_highbd_dc_predictor_4x4 sse2 neon/;
- specialize qw/aom_highbd_dc_predictor_4x8 sse2/;
- specialize qw/aom_highbd_dc_predictor_8x4 sse2/;;
+ specialize qw/aom_highbd_dc_predictor_4x8 sse2 neon/;
+ specialize qw/aom_highbd_dc_predictor_4x16 neon/;
+ specialize qw/aom_highbd_dc_predictor_8x4 sse2 neon/;
specialize qw/aom_highbd_dc_predictor_8x8 sse2 neon/;
- specialize qw/aom_highbd_dc_predictor_8x16 sse2/;;
- specialize qw/aom_highbd_dc_predictor_16x8 sse2/;
+ specialize qw/aom_highbd_dc_predictor_8x16 sse2 neon/;
+ specialize qw/aom_highbd_dc_predictor_8x32 neon/;
+ specialize qw/aom_highbd_dc_predictor_16x4 neon/;
+ specialize qw/aom_highbd_dc_predictor_16x8 sse2 neon/;
specialize qw/aom_highbd_dc_predictor_16x16 sse2 neon/;
- specialize qw/aom_highbd_dc_predictor_16x32 sse2/;
- specialize qw/aom_highbd_dc_predictor_32x16 sse2/;
+ specialize qw/aom_highbd_dc_predictor_16x32 sse2 neon/;
+ specialize qw/aom_highbd_dc_predictor_16x64 neon/;
+ specialize qw/aom_highbd_dc_predictor_32x8 neon/;
+ specialize qw/aom_highbd_dc_predictor_32x16 sse2 neon/;
specialize qw/aom_highbd_dc_predictor_32x32 sse2 neon/;
+ specialize qw/aom_highbd_dc_predictor_32x64 neon/;
+ specialize qw/aom_highbd_dc_predictor_64x16 neon/;
+ specialize qw/aom_highbd_dc_predictor_64x32 neon/;
specialize qw/aom_highbd_dc_predictor_64x64 neon/;
- specialize qw/aom_highbd_h_predictor_4x4 sse2/;
- specialize qw/aom_highbd_h_predictor_4x8 sse2/;
- specialize qw/aom_highbd_h_predictor_8x4 sse2/;
- specialize qw/aom_highbd_h_predictor_8x8 sse2/;
- specialize qw/aom_highbd_h_predictor_8x16 sse2/;
- specialize qw/aom_highbd_h_predictor_16x8 sse2/;
- specialize qw/aom_highbd_h_predictor_16x16 sse2/;
- specialize qw/aom_highbd_h_predictor_16x32 sse2/;
- specialize qw/aom_highbd_h_predictor_32x16 sse2/;
- specialize qw/aom_highbd_h_predictor_32x32 sse2/;
- specialize qw/aom_highbd_dc_left_predictor_4x4 sse2/;
- specialize qw/aom_highbd_dc_top_predictor_4x4 sse2/;
- specialize qw/aom_highbd_dc_128_predictor_4x4 sse2/;
- specialize qw/aom_highbd_dc_left_predictor_4x8 sse2/;
- specialize qw/aom_highbd_dc_top_predictor_4x8 sse2/;
- specialize qw/aom_highbd_dc_128_predictor_4x8 sse2/;
- specialize qw/aom_highbd_dc_left_predictor_8x4 sse2/;
- specialize qw/aom_highbd_dc_top_predictor_8x4 sse2/;
- specialize qw/aom_highbd_dc_128_predictor_8x4 sse2/;
- specialize qw/aom_highbd_dc_left_predictor_8x8 sse2/;
- specialize qw/aom_highbd_dc_top_predictor_8x8 sse2/;
- specialize qw/aom_highbd_dc_128_predictor_8x8 sse2/;
- specialize qw/aom_highbd_dc_left_predictor_8x16 sse2/;
- specialize qw/aom_highbd_dc_top_predictor_8x16 sse2/;
- specialize qw/aom_highbd_dc_128_predictor_8x16 sse2/;
- specialize qw/aom_highbd_dc_left_predictor_16x8 sse2/;
- specialize qw/aom_highbd_dc_top_predictor_16x8 sse2/;
- specialize qw/aom_highbd_dc_128_predictor_16x8 sse2/;
- specialize qw/aom_highbd_dc_left_predictor_16x16 sse2/;
- specialize qw/aom_highbd_dc_top_predictor_16x16 sse2/;
- specialize qw/aom_highbd_dc_128_predictor_16x16 sse2/;
- specialize qw/aom_highbd_dc_left_predictor_16x32 sse2/;
- specialize qw/aom_highbd_dc_top_predictor_16x32 sse2/;
- specialize qw/aom_highbd_dc_128_predictor_16x32 sse2/;
- specialize qw/aom_highbd_dc_left_predictor_32x16 sse2/;
- specialize qw/aom_highbd_dc_top_predictor_32x16 sse2/;
- specialize qw/aom_highbd_dc_128_predictor_32x16 sse2/;
- specialize qw/aom_highbd_dc_left_predictor_32x32 sse2/;
- specialize qw/aom_highbd_dc_top_predictor_32x32 sse2/;
- specialize qw/aom_highbd_dc_128_predictor_32x32 sse2/;
+ specialize qw/aom_highbd_h_predictor_4x4 sse2 neon/;
+ specialize qw/aom_highbd_h_predictor_4x8 sse2 neon/;
+ specialize qw/aom_highbd_h_predictor_4x16 neon/;
+ specialize qw/aom_highbd_h_predictor_8x4 sse2 neon/;
+ specialize qw/aom_highbd_h_predictor_8x8 sse2 neon/;
+ specialize qw/aom_highbd_h_predictor_8x16 sse2 neon/;
+ specialize qw/aom_highbd_h_predictor_8x32 neon/;
+ specialize qw/aom_highbd_h_predictor_16x4 neon/;
+ specialize qw/aom_highbd_h_predictor_16x8 sse2 neon/;
+ specialize qw/aom_highbd_h_predictor_16x16 sse2 neon/;
+ specialize qw/aom_highbd_h_predictor_16x32 sse2 neon/;
+ specialize qw/aom_highbd_h_predictor_16x64 neon/;
+ specialize qw/aom_highbd_h_predictor_32x8 neon/;
+ specialize qw/aom_highbd_h_predictor_32x16 sse2 neon/;
+ specialize qw/aom_highbd_h_predictor_32x32 sse2 neon/;
+ specialize qw/aom_highbd_h_predictor_32x64 neon/;
+ specialize qw/aom_highbd_h_predictor_64x16 neon/;
+ specialize qw/aom_highbd_h_predictor_64x32 neon/;
+ specialize qw/aom_highbd_h_predictor_64x64 neon/;
+
+ specialize qw/aom_highbd_dc_128_predictor_4x4 sse2 neon/;
+ specialize qw/aom_highbd_dc_128_predictor_4x8 sse2 neon/;
+ specialize qw/aom_highbd_dc_128_predictor_4x16 neon/;
+ specialize qw/aom_highbd_dc_128_predictor_8x4 sse2 neon/;
+ specialize qw/aom_highbd_dc_128_predictor_8x8 sse2 neon/;
+ specialize qw/aom_highbd_dc_128_predictor_8x16 sse2 neon/;
+ specialize qw/aom_highbd_dc_128_predictor_8x32 neon/;
+ specialize qw/aom_highbd_dc_128_predictor_16x4 neon/;
+ specialize qw/aom_highbd_dc_128_predictor_16x8 sse2 neon/;
+ specialize qw/aom_highbd_dc_128_predictor_16x16 sse2 neon/;
+ specialize qw/aom_highbd_dc_128_predictor_16x32 sse2 neon/;
+ specialize qw/aom_highbd_dc_128_predictor_16x64 neon/;
+ specialize qw/aom_highbd_dc_128_predictor_32x8 neon/;
+ specialize qw/aom_highbd_dc_128_predictor_32x16 sse2 neon/;
+ specialize qw/aom_highbd_dc_128_predictor_32x32 sse2 neon/;
+ specialize qw/aom_highbd_dc_128_predictor_32x64 neon/;
+ specialize qw/aom_highbd_dc_128_predictor_64x16 neon/;
+ specialize qw/aom_highbd_dc_128_predictor_64x32 neon/;
+ specialize qw/aom_highbd_dc_128_predictor_64x64 neon/;
+
+ specialize qw/aom_highbd_dc_left_predictor_4x4 sse2 neon/;
+ specialize qw/aom_highbd_dc_left_predictor_4x8 sse2 neon/;
+ specialize qw/aom_highbd_dc_left_predictor_4x16 neon/;
+ specialize qw/aom_highbd_dc_left_predictor_8x4 sse2 neon/;
+ specialize qw/aom_highbd_dc_left_predictor_8x8 sse2 neon/;
+ specialize qw/aom_highbd_dc_left_predictor_8x16 sse2 neon/;
+ specialize qw/aom_highbd_dc_left_predictor_8x32 neon/;
+ specialize qw/aom_highbd_dc_left_predictor_16x4 neon/;
+ specialize qw/aom_highbd_dc_left_predictor_16x8 sse2 neon/;
+ specialize qw/aom_highbd_dc_left_predictor_16x16 sse2 neon/;
+ specialize qw/aom_highbd_dc_left_predictor_16x32 sse2 neon/;
+ specialize qw/aom_highbd_dc_left_predictor_16x64 neon/;
+ specialize qw/aom_highbd_dc_left_predictor_32x8 neon/;
+ specialize qw/aom_highbd_dc_left_predictor_32x16 sse2 neon/;
+ specialize qw/aom_highbd_dc_left_predictor_32x32 sse2 neon/;
+ specialize qw/aom_highbd_dc_left_predictor_32x64 neon/;
+ specialize qw/aom_highbd_dc_left_predictor_64x16 neon/;
+ specialize qw/aom_highbd_dc_left_predictor_64x32 neon/;
+ specialize qw/aom_highbd_dc_left_predictor_64x64 neon/;
+
+ specialize qw/aom_highbd_dc_top_predictor_4x4 sse2 neon/;
+ specialize qw/aom_highbd_dc_top_predictor_4x8 sse2 neon/;
+ specialize qw/aom_highbd_dc_top_predictor_4x16 neon/;
+ specialize qw/aom_highbd_dc_top_predictor_8x4 sse2 neon/;
+ specialize qw/aom_highbd_dc_top_predictor_8x8 sse2 neon/;
+ specialize qw/aom_highbd_dc_top_predictor_8x16 sse2 neon/;
+ specialize qw/aom_highbd_dc_top_predictor_8x32 neon/;
+ specialize qw/aom_highbd_dc_top_predictor_16x4 neon/;
+ specialize qw/aom_highbd_dc_top_predictor_16x8 sse2 neon/;
+ specialize qw/aom_highbd_dc_top_predictor_16x16 sse2 neon/;
+ specialize qw/aom_highbd_dc_top_predictor_16x32 sse2 neon/;
+ specialize qw/aom_highbd_dc_top_predictor_16x64 neon/;
+ specialize qw/aom_highbd_dc_top_predictor_32x8 neon/;
+ specialize qw/aom_highbd_dc_top_predictor_32x16 sse2 neon/;
+ specialize qw/aom_highbd_dc_top_predictor_32x32 sse2 neon/;
+ specialize qw/aom_highbd_dc_top_predictor_32x64 neon/;
+ specialize qw/aom_highbd_dc_top_predictor_64x16 neon/;
+ specialize qw/aom_highbd_dc_top_predictor_64x32 neon/;
+ specialize qw/aom_highbd_dc_top_predictor_64x64 neon/;
specialize qw/aom_highbd_paeth_predictor_4x4 neon/;
specialize qw/aom_highbd_paeth_predictor_4x8 neon/;
@@ -451,8 +498,8 @@
add_proto qw/void aom_convolve8_vert/, "const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst, ptrdiff_t dst_stride, const int16_t *filter_x, int x_step_q4, const int16_t *filter_y, int y_step_q4, int w, int h";
specialize qw/aom_convolve_copy neon sse2 avx2/;
-specialize qw/aom_convolve8_horiz sse2 ssse3/, "$avx2_ssse3";
-specialize qw/aom_convolve8_vert sse2 ssse3/, "$avx2_ssse3";
+specialize qw/aom_convolve8_horiz neon sse2 ssse3/, "$avx2_ssse3";
+specialize qw/aom_convolve8_vert neon sse2 ssse3/, "$avx2_ssse3";
add_proto qw/void aom_scaled_2d/, "const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst, ptrdiff_t dst_stride, const InterpKernel *filter, int x0_q4, int x_step_q4, int y0_q4, int y_step_q4, int w, int h";
specialize qw/aom_scaled_2d ssse3 neon/;
@@ -607,12 +654,16 @@
add_proto qw/void aom_fdct4x4_lp/, "const int16_t *input, int16_t *output, int stride";
specialize qw/aom_fdct4x4_lp neon sse2/;
- add_proto qw/void aom_fdct8x8/, "const int16_t *input, tran_low_t *output, int stride";
- specialize qw/aom_fdct8x8 neon sse2/, "$ssse3_x86_64";
- # High bit depth
- if (aom_config("CONFIG_AV1_HIGHBITDEPTH") eq "yes") {
- add_proto qw/void aom_highbd_fdct8x8/, "const int16_t *input, tran_low_t *output, int stride";
- specialize qw/aom_highbd_fdct8x8 sse2/;
+ if (aom_config("CONFIG_INTERNAL_STATS") eq "yes"){
+  # 8x8 DCT transform for psnr-hvs. Unlike the other transforms, it isn't
+  # compatible with av1 scan orders, because it does two transposes.
+ add_proto qw/void aom_fdct8x8/, "const int16_t *input, tran_low_t *output, int stride";
+ specialize qw/aom_fdct8x8 neon sse2/, "$ssse3_x86_64";
+ # High bit depth
+ if (aom_config("CONFIG_AV1_HIGHBITDEPTH") eq "yes") {
+ add_proto qw/void aom_highbd_fdct8x8/, "const int16_t *input, tran_low_t *output, int stride";
+ specialize qw/aom_highbd_fdct8x8 sse2/;
+ }
}
# FFT/IFFT (float) only used for denoising (and noise power spectral density estimation)
add_proto qw/void aom_fft2x2_float/, "const float *input, float *temp, float *output";
@@ -743,13 +794,13 @@
specialize qw/aom_sum_squares_2d_i16 sse2 avx2 neon/;
add_proto qw/uint64_t aom_sum_squares_i16/, "const int16_t *src, uint32_t N";
- specialize qw/aom_sum_squares_i16 sse2/;
+ specialize qw/aom_sum_squares_i16 sse2 neon/;
add_proto qw/uint64_t aom_var_2d_u8/, "uint8_t *src, int src_stride, int width, int height";
- specialize qw/aom_var_2d_u8 sse2 avx2/;
+ specialize qw/aom_var_2d_u8 sse2 avx2 neon/;
add_proto qw/uint64_t aom_var_2d_u16/, "uint8_t *src, int src_stride, int width, int height";
- specialize qw/aom_var_2d_u16 sse2 avx2/;
+ specialize qw/aom_var_2d_u16 sse2 avx2 neon/;
}
#
@@ -802,9 +853,12 @@
specialize qw/aom_sad_skip_16x8 sse2 neon/;
specialize qw/aom_sad_skip_8x16 sse2 neon/;
specialize qw/aom_sad_skip_8x8 sse2 neon/;
+ specialize qw/aom_sad_skip_8x4 neon/;
specialize qw/aom_sad_skip_4x8 sse2 neon/;
+ specialize qw/aom_sad_skip_4x4 neon/;
specialize qw/aom_sad_skip_4x16 sse2 neon/;
+ specialize qw/aom_sad_skip_16x4 neon/;
specialize qw/aom_sad_skip_8x32 sse2 neon/;
specialize qw/aom_sad_skip_32x8 sse2 neon/;
specialize qw/aom_sad_skip_16x64 sse2 neon/;
@@ -834,43 +888,31 @@
specialize qw/aom_sad16x64_avg sse2 neon/;
specialize qw/aom_sad64x16_avg sse2 neon/;
- specialize qw/aom_dist_wtd_sad128x128_avg ssse3/;
- specialize qw/aom_dist_wtd_sad128x64_avg ssse3/;
- specialize qw/aom_dist_wtd_sad64x128_avg ssse3/;
- specialize qw/aom_dist_wtd_sad64x64_avg ssse3/;
- specialize qw/aom_dist_wtd_sad64x32_avg ssse3/;
- specialize qw/aom_dist_wtd_sad32x64_avg ssse3/;
- specialize qw/aom_dist_wtd_sad32x32_avg ssse3/;
- specialize qw/aom_dist_wtd_sad32x16_avg ssse3/;
- specialize qw/aom_dist_wtd_sad16x32_avg ssse3/;
- specialize qw/aom_dist_wtd_sad16x16_avg ssse3/;
- specialize qw/aom_dist_wtd_sad16x8_avg ssse3/;
- specialize qw/aom_dist_wtd_sad8x16_avg ssse3/;
- specialize qw/aom_dist_wtd_sad8x8_avg ssse3/;
- specialize qw/aom_dist_wtd_sad8x4_avg ssse3/;
- specialize qw/aom_dist_wtd_sad4x8_avg ssse3/;
- specialize qw/aom_dist_wtd_sad4x4_avg ssse3/;
+ specialize qw/aom_dist_wtd_sad128x128_avg sse2/;
+ specialize qw/aom_dist_wtd_sad128x64_avg sse2/;
+ specialize qw/aom_dist_wtd_sad64x128_avg sse2/;
+ specialize qw/aom_dist_wtd_sad64x64_avg sse2/;
+ specialize qw/aom_dist_wtd_sad64x32_avg sse2/;
+ specialize qw/aom_dist_wtd_sad32x64_avg sse2/;
+ specialize qw/aom_dist_wtd_sad32x32_avg sse2/;
+ specialize qw/aom_dist_wtd_sad32x16_avg sse2/;
+ specialize qw/aom_dist_wtd_sad16x32_avg sse2/;
+ specialize qw/aom_dist_wtd_sad16x16_avg sse2/;
+ specialize qw/aom_dist_wtd_sad16x8_avg sse2/;
+ specialize qw/aom_dist_wtd_sad8x16_avg sse2/;
+ specialize qw/aom_dist_wtd_sad8x8_avg sse2/;
+ specialize qw/aom_dist_wtd_sad8x4_avg sse2/;
+ specialize qw/aom_dist_wtd_sad4x8_avg sse2/;
+ specialize qw/aom_dist_wtd_sad4x4_avg sse2/;
- specialize qw/aom_dist_wtd_sad4x16_avg ssse3/;
- specialize qw/aom_dist_wtd_sad16x4_avg ssse3/;
- specialize qw/aom_dist_wtd_sad8x32_avg ssse3/;
- specialize qw/aom_dist_wtd_sad32x8_avg ssse3/;
- specialize qw/aom_dist_wtd_sad16x64_avg ssse3/;
- specialize qw/aom_dist_wtd_sad64x16_avg ssse3/;
-
- add_proto qw/unsigned int/, "aom_sad4xh", "const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height";
- add_proto qw/unsigned int/, "aom_sad8xh", "const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height";
- add_proto qw/unsigned int/, "aom_sad16xh", "const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height";
- add_proto qw/unsigned int/, "aom_sad32xh", "const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height";
- add_proto qw/unsigned int/, "aom_sad64xh", "const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height";
- add_proto qw/unsigned int/, "aom_sad128xh", "const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height";
-
- specialize qw/aom_sad4xh sse2/;
- specialize qw/aom_sad8xh sse2/;
- specialize qw/aom_sad16xh sse2/;
- specialize qw/aom_sad32xh sse2/;
- specialize qw/aom_sad64xh sse2/;
- specialize qw/aom_sad128xh sse2/;
+ if (aom_config("CONFIG_REALTIME_ONLY") ne "yes") {
+ specialize qw/aom_dist_wtd_sad4x16_avg sse2/;
+ specialize qw/aom_dist_wtd_sad16x4_avg sse2/;
+ specialize qw/aom_dist_wtd_sad8x32_avg sse2/;
+ specialize qw/aom_dist_wtd_sad32x8_avg sse2/;
+ specialize qw/aom_dist_wtd_sad16x64_avg sse2/;
+ specialize qw/aom_dist_wtd_sad64x16_avg sse2/;
+ }
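+
+  # A rough model of the dist_wtd *_avg kernels: the reference and second
+  # predictor are combined with unequal weights from DIST_WTD_COMP_PARAMS
+  # (the weight pair sums to 16, hence the shift by 4) and the SAD is taken
+  # against the blend:
+  #
+  #   comp[i] = (fwd_offset * ref[i] + bck_offset * second_pred[i] + 8) >> 4
+  #   sad     = sum(abs(src[i] - comp[i]))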
if (aom_config("CONFIG_AV1_HIGHBITDEPTH") eq "yes") {
foreach (@encoder_block_sizes) {
@@ -884,50 +926,53 @@
}
add_proto qw/unsigned int/, "aom_highbd_dist_wtd_sad${w}x${h}_avg", "const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param";
}
- specialize qw/aom_highbd_sad128x128 avx2/;
- specialize qw/aom_highbd_sad128x64 avx2/;
- specialize qw/aom_highbd_sad64x128 avx2/;
- specialize qw/aom_highbd_sad64x64 avx2 sse2/;
- specialize qw/aom_highbd_sad64x32 avx2 sse2/;
- specialize qw/aom_highbd_sad32x64 avx2 sse2/;
- specialize qw/aom_highbd_sad32x32 avx2 sse2/;
- specialize qw/aom_highbd_sad32x16 avx2 sse2/;
- specialize qw/aom_highbd_sad16x32 avx2 sse2/;
- specialize qw/aom_highbd_sad16x16 avx2 sse2/;
- specialize qw/aom_highbd_sad16x8 avx2 sse2/;
- specialize qw/aom_highbd_sad8x16 sse2/;
- specialize qw/aom_highbd_sad8x8 sse2/;
- specialize qw/aom_highbd_sad8x4 sse2/;
- specialize qw/aom_highbd_sad4x8 sse2/;
- specialize qw/aom_highbd_sad4x4 sse2/;
+ specialize qw/aom_highbd_sad128x128 avx2 neon/;
+ specialize qw/aom_highbd_sad128x64 avx2 neon/;
+ specialize qw/aom_highbd_sad64x128 avx2 neon/;
+ specialize qw/aom_highbd_sad64x64 avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad64x32 avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad32x64 avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad32x32 avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad32x16 avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad16x32 avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad16x16 avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad16x8 avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad8x16 sse2 neon/;
+ specialize qw/aom_highbd_sad8x8 sse2 neon/;
+ specialize qw/aom_highbd_sad8x4 sse2 neon/;
+ specialize qw/aom_highbd_sad4x8 sse2 neon/;
+ specialize qw/aom_highbd_sad4x4 sse2 neon/;
- specialize qw/aom_highbd_sad4x16 sse2/;
- specialize qw/aom_highbd_sad16x4 avx2 sse2/;
- specialize qw/aom_highbd_sad8x32 sse2/;
- specialize qw/aom_highbd_sad32x8 avx2 sse2/;
- specialize qw/aom_highbd_sad16x64 avx2 sse2/;
- specialize qw/aom_highbd_sad64x16 avx2 sse2/;
+ specialize qw/aom_highbd_sad4x16 sse2 neon/;
+ specialize qw/aom_highbd_sad16x4 avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad8x32 sse2 neon/;
+ specialize qw/aom_highbd_sad32x8 avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad16x64 avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad64x16 avx2 sse2 neon/;
- specialize qw/aom_highbd_sad_skip_128x128 avx2/;
- specialize qw/aom_highbd_sad_skip_128x64 avx2/;
- specialize qw/aom_highbd_sad_skip_64x128 avx2/;
- specialize qw/aom_highbd_sad_skip_64x64 avx2 sse2/;
- specialize qw/aom_highbd_sad_skip_64x32 avx2 sse2/;
- specialize qw/aom_highbd_sad_skip_32x64 avx2 sse2/;
- specialize qw/aom_highbd_sad_skip_32x32 avx2 sse2/;
- specialize qw/aom_highbd_sad_skip_32x16 avx2 sse2/;
- specialize qw/aom_highbd_sad_skip_16x32 avx2 sse2/;
- specialize qw/aom_highbd_sad_skip_16x16 avx2 sse2/;
- specialize qw/aom_highbd_sad_skip_16x8 avx2 sse2/;
- specialize qw/aom_highbd_sad_skip_8x16 sse2/;
- specialize qw/aom_highbd_sad_skip_8x8 sse2/;
- specialize qw/aom_highbd_sad_skip_4x8 sse2/;
+ specialize qw/aom_highbd_sad_skip_128x128 avx2 neon/;
+ specialize qw/aom_highbd_sad_skip_128x64 avx2 neon/;
+ specialize qw/aom_highbd_sad_skip_64x128 avx2 neon/;
+ specialize qw/aom_highbd_sad_skip_64x64 avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_64x32 avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_32x64 avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_32x32 avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_32x16 avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_16x32 avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_16x16 avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_16x8 avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_16x4 neon/;
+ specialize qw/aom_highbd_sad_skip_8x16 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_8x4 neon/;
+ specialize qw/aom_highbd_sad_skip_8x8 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_4x8 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_4x4 neon/;
- specialize qw/aom_highbd_sad_skip_4x16 sse2/;
- specialize qw/aom_highbd_sad_skip_8x32 sse2/;
- specialize qw/aom_highbd_sad_skip_32x8 avx2 sse2/;
- specialize qw/aom_highbd_sad_skip_16x64 avx2 sse2/;
- specialize qw/aom_highbd_sad_skip_64x16 avx2 sse2/;
+ specialize qw/aom_highbd_sad_skip_4x16 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_8x32 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_32x8 avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_16x64 avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_64x16 avx2 sse2 neon/;
specialize qw/aom_highbd_sad128x128_avg avx2/;
specialize qw/aom_highbd_sad128x64_avg avx2/;
@@ -957,7 +1002,7 @@
foreach (@encoder_block_sizes) {
($w, $h) = @$_;
add_proto qw/unsigned int/, "aom_masked_sad${w}x${h}", "const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask";
- specialize "aom_masked_sad${w}x${h}", qw/ssse3 avx2/;
+ specialize "aom_masked_sad${w}x${h}", qw/ssse3 avx2 neon/;
}
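
  # Masked SAD blends the two inputs through a 6-bit mask before the
  # difference is taken; a sketch (mask values lie in [0, 64]):
  #
  #   p[i] = (msk[i] * ref[i] + (64 - msk[i]) * second_pred[i] + 32) >> 6
  #   sad  = sum(abs(src[i] - p[i]))
  #
  # with invert_mask swapping which input msk[i] weights.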
if (aom_config("CONFIG_AV1_HIGHBITDEPTH") eq "yes") {
@@ -976,7 +1021,7 @@
($w, $h) = @$_;
add_proto qw/unsigned int/, "aom_obmc_sad${w}x${h}", "const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask";
if (! (($w == 128 && $h == 32) || ($w == 32 && $h == 128))) {
- specialize "aom_obmc_sad${w}x${h}", qw/sse4_1 avx2/;
+ specialize "aom_obmc_sad${w}x${h}", qw/sse4_1 avx2 neon/;
}
}
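
  # OBMC SAD compares a prediction against a pre-weighted source: wsrc and
  # mask carry 12 fractional bits, so conceptually
  #
  #   sad = sum(round(abs(wsrc[i] - pre[i] * mask[i]) / 2^12))
  #
  # (a sketch of the reference behaviour, not the exact C code).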
@@ -998,7 +1043,6 @@
($w, $h) = @$_;
add_proto qw/void/, "aom_sad${w}x${h}x4d", "const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]";
add_proto qw/void/, "aom_sad${w}x${h}x3d", "const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]";
- add_proto qw/void/, "aom_sad${w}x${h}x4d_avg", "const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]";
add_proto qw/void/, "aom_sad_skip_${w}x${h}x4d", "const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]";
add_proto qw/void/, "aom_masked_sad${w}x${h}x4d", "const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]";
}
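
  # The x4d/x3d forms amortize work across candidates: one call computes the
  # SAD of the same source block against four (or three) reference pointers,
  # conceptually
  #
  #   for (r = 0; r < 4; ++r)
  #     sad_array[r] = sad(src_ptr, src_stride, ref_ptr[r], ref_stride);
  #
  # while the sad_skip_ variants sample every other row and scale the result
  # back up, roughly halving the cost of a search.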
@@ -1018,7 +1062,6 @@
specialize qw/aom_sad8x16x4d neon sse2/;
specialize qw/aom_sad8x8x4d neon sse2/;
specialize qw/aom_sad8x4x4d neon sse2/;
- specialize qw/aom_sad4x32x4d neon sse2/;
specialize qw/aom_sad4x8x4d neon sse2/;
specialize qw/aom_sad4x4x4d neon sse2/;
@@ -1044,88 +1087,66 @@
specialize qw/aom_sad_skip_16x32x4d avx2 sse2 neon/;
specialize qw/aom_sad_skip_16x16x4d avx2 sse2 neon/;
specialize qw/aom_sad_skip_16x8x4d avx2 sse2 neon/;
+ specialize qw/aom_sad_skip_16x4x4d neon/;
specialize qw/aom_sad_skip_8x32x4d sse2 neon/;
specialize qw/aom_sad_skip_8x16x4d sse2 neon/;
specialize qw/aom_sad_skip_8x8x4d sse2 neon/;
- specialize qw/aom_sad_skip_4x32x4d sse2 neon/;
+ specialize qw/aom_sad_skip_8x4x4d neon/;
specialize qw/aom_sad_skip_4x16x4d sse2 neon/;
specialize qw/aom_sad_skip_4x8x4d sse2 neon/;
+ specialize qw/aom_sad_skip_4x4x4d neon/;
- specialize qw/aom_sad128x128x3d avx2/;
- specialize qw/aom_sad128x64x3d avx2/;
- specialize qw/aom_sad64x128x3d avx2/;
- specialize qw/aom_sad64x64x3d avx2/;
- specialize qw/aom_sad64x32x3d avx2/;
- specialize qw/aom_sad32x64x3d avx2/;
- specialize qw/aom_sad32x32x3d avx2/;
- specialize qw/aom_sad32x16x3d avx2/;
- specialize qw/aom_sad16x32x3d avx2/;
- specialize qw/aom_sad16x16x3d avx2/;
- specialize qw/aom_sad16x8x3d avx2/;
+ specialize qw/aom_sad128x128x3d neon avx2/;
+ specialize qw/aom_sad128x64x3d neon avx2/;
+ specialize qw/aom_sad64x128x3d neon avx2/;
+ specialize qw/aom_sad64x64x3d neon avx2/;
+ specialize qw/aom_sad64x32x3d neon avx2/;
+ specialize qw/aom_sad32x64x3d neon avx2/;
+ specialize qw/aom_sad32x32x3d neon avx2/;
+ specialize qw/aom_sad32x16x3d neon avx2/;
+ specialize qw/aom_sad16x32x3d neon avx2/;
+ specialize qw/aom_sad16x16x3d neon avx2/;
+ specialize qw/aom_sad16x8x3d neon avx2/;
+ specialize qw/aom_sad8x16x3d neon/;
+ specialize qw/aom_sad8x8x3d neon/;
+ specialize qw/aom_sad8x4x3d neon/;
+ specialize qw/aom_sad4x8x3d neon/;
+ specialize qw/aom_sad4x4x3d neon/;
- specialize qw/aom_sad64x16x3d avx2/;
- specialize qw/aom_sad32x8x3d avx2/;
- specialize qw/aom_sad16x64x3d avx2/;
+ specialize qw/aom_sad64x16x3d neon avx2/;
+ specialize qw/aom_sad32x8x3d neon avx2/;
+ specialize qw/aom_sad16x64x3d neon avx2/;
+ specialize qw/aom_sad16x4x3d neon/;
+ specialize qw/aom_sad8x32x3d neon/;
+ specialize qw/aom_sad4x16x3d neon/;
- if (aom_config("CONFIG_REALTIME_ONLY") ne "yes") {
- specialize qw/aom_sad128x128x4d_avg sse2/;
- specialize qw/aom_sad128x64x4d_avg sse2/;
- specialize qw/aom_sad64x128x4d_avg sse2/;
- specialize qw/aom_sad64x64x4d_avg sse2/;
- specialize qw/aom_sad64x32x4d_avg sse2/;
- specialize qw/aom_sad64x16x4d_avg sse2/;
- specialize qw/aom_sad32x64x4d_avg sse2/;
- specialize qw/aom_sad32x32x4d_avg sse2/;
- specialize qw/aom_sad32x16x4d_avg sse2/;
- specialize qw/aom_sad32x8x4d_avg sse2/;
- specialize qw/aom_sad16x64x4d_avg sse2/;
- specialize qw/aom_sad16x32x4d_avg sse2/;
- specialize qw/aom_sad16x16x4d_avg sse2/;
- specialize qw/aom_sad16x8x4d_avg sse2/;
+ specialize qw/aom_masked_sad128x128x4d ssse3 neon/;
+ specialize qw/aom_masked_sad128x64x4d ssse3 neon/;
+ specialize qw/aom_masked_sad64x128x4d ssse3 neon/;
+ specialize qw/aom_masked_sad64x64x4d ssse3 neon/;
+ specialize qw/aom_masked_sad64x32x4d ssse3 neon/;
+ specialize qw/aom_masked_sad64x16x4d ssse3 neon/;
+ specialize qw/aom_masked_sad32x64x4d ssse3 neon/;
+ specialize qw/aom_masked_sad32x32x4d ssse3 neon/;
+ specialize qw/aom_masked_sad32x16x4d ssse3 neon/;
+ specialize qw/aom_masked_sad32x8x4d ssse3 neon/;
+ specialize qw/aom_masked_sad16x64x4d ssse3 neon/;
+ specialize qw/aom_masked_sad16x32x4d ssse3 neon/;
+ specialize qw/aom_masked_sad16x16x4d ssse3 neon/;
+ specialize qw/aom_masked_sad16x8x4d ssse3 neon/;
- specialize qw/aom_sad8x16x4d_avg sse2/;
- specialize qw/aom_sad8x8x4d_avg sse2/;
- specialize qw/aom_sad8x4x4d_avg sse2/;
- specialize qw/aom_sad4x16x4d_avg sse2/;
- specialize qw/aom_sad4x8x4d_avg sse2/;
- specialize qw/aom_sad4x4x4d_avg sse2/;
+ specialize qw/aom_masked_sad8x16x4d ssse3 neon/;
+ specialize qw/aom_masked_sad8x8x4d ssse3 neon/;
+ specialize qw/aom_masked_sad8x4x4d ssse3 neon/;
+ specialize qw/aom_masked_sad4x16x4d ssse3 neon/;
+ specialize qw/aom_masked_sad4x8x4d ssse3 neon/;
+ specialize qw/aom_masked_sad4x4x4d ssse3 neon/;
- specialize qw/aom_sad4x32x4d_avg sse2/;
- specialize qw/aom_sad4x16x4d_avg sse2/;
- specialize qw/aom_sad16x4x4d_avg sse2/;
- specialize qw/aom_sad8x32x4d_avg sse2/;
- specialize qw/aom_sad32x8x4d_avg sse2/;
- specialize qw/aom_sad64x16x4d_avg sse2/;
- }
-
- specialize qw/aom_masked_sad128x128x4d ssse3/;
- specialize qw/aom_masked_sad128x64x4d ssse3/;
- specialize qw/aom_masked_sad64x128x4d ssse3/;
- specialize qw/aom_masked_sad64x64x4d ssse3/;
- specialize qw/aom_masked_sad64x32x4d ssse3/;
- specialize qw/aom_masked_sad64x16x4d ssse3/;
- specialize qw/aom_masked_sad32x64x4d ssse3/;
- specialize qw/aom_masked_sad32x32x4d ssse3/;
- specialize qw/aom_masked_sad32x16x4d ssse3/;
- specialize qw/aom_masked_sad32x8x4d ssse3/;
- specialize qw/aom_masked_sad16x64x4d ssse3/;
- specialize qw/aom_masked_sad16x32x4d ssse3/;
- specialize qw/aom_masked_sad16x16x4d ssse3/;
- specialize qw/aom_masked_sad16x8x4d ssse3/;
-
- specialize qw/aom_masked_sad8x16x4d ssse3/;
- specialize qw/aom_masked_sad8x8x4d ssse3/;
- specialize qw/aom_masked_sad8x4x4d ssse3/;
- specialize qw/aom_masked_sad4x16x4d ssse3/;
- specialize qw/aom_masked_sad4x8x4d ssse3/;
- specialize qw/aom_masked_sad4x4x4d ssse3/;
-
- specialize qw/aom_masked_sad4x32x4d ssse3/;
- specialize qw/aom_masked_sad4x16x4d ssse3/;
- specialize qw/aom_masked_sad16x4x4d ssse3/;
- specialize qw/aom_masked_sad8x32x4d ssse3/;
- specialize qw/aom_masked_sad32x8x4d ssse3/;
- specialize qw/aom_masked_sad64x16x4d ssse3/;
+ specialize qw/aom_masked_sad4x16x4d ssse3 neon/;
+ specialize qw/aom_masked_sad16x4x4d ssse3 neon/;
+ specialize qw/aom_masked_sad8x32x4d ssse3 neon/;
+ specialize qw/aom_masked_sad32x8x4d ssse3 neon/;
+ specialize qw/aom_masked_sad64x16x4d ssse3 neon/;
#
# Multi-block SAD, comparing a reference to N independent blocks
#
@@ -1139,50 +1160,53 @@
specialize "aom_highbd_sad${w}x${h}x4d", qw/sse2/;
}
}
- specialize qw/aom_highbd_sad128x128x4d avx2/;
- specialize qw/aom_highbd_sad128x64x4d avx2/;
- specialize qw/aom_highbd_sad64x128x4d avx2/;
- specialize qw/aom_highbd_sad64x64x4d sse2 avx2/;
- specialize qw/aom_highbd_sad64x32x4d sse2 avx2/;
- specialize qw/aom_highbd_sad32x64x4d sse2 avx2/;
- specialize qw/aom_highbd_sad32x32x4d sse2 avx2/;
- specialize qw/aom_highbd_sad32x16x4d sse2 avx2/;
- specialize qw/aom_highbd_sad16x32x4d sse2 avx2/;
- specialize qw/aom_highbd_sad16x16x4d sse2 avx2/;
- specialize qw/aom_highbd_sad16x8x4d sse2 avx2/;
- specialize qw/aom_highbd_sad8x16x4d sse2/;
- specialize qw/aom_highbd_sad8x8x4d sse2/;
- specialize qw/aom_highbd_sad8x4x4d sse2/;
- specialize qw/aom_highbd_sad4x8x4d sse2/;
- specialize qw/aom_highbd_sad4x4x4d sse2/;
+ specialize qw/aom_highbd_sad128x128x4d avx2 neon/;
+ specialize qw/aom_highbd_sad128x64x4d avx2 neon/;
+ specialize qw/aom_highbd_sad64x128x4d avx2 neon/;
+ specialize qw/aom_highbd_sad64x64x4d sse2 avx2 neon/;
+ specialize qw/aom_highbd_sad64x32x4d sse2 avx2 neon/;
+ specialize qw/aom_highbd_sad32x64x4d sse2 avx2 neon/;
+ specialize qw/aom_highbd_sad32x32x4d sse2 avx2 neon/;
+ specialize qw/aom_highbd_sad32x16x4d sse2 avx2 neon/;
+ specialize qw/aom_highbd_sad16x32x4d sse2 avx2 neon/;
+ specialize qw/aom_highbd_sad16x16x4d sse2 avx2 neon/;
+ specialize qw/aom_highbd_sad16x8x4d sse2 avx2 neon/;
+ specialize qw/aom_highbd_sad8x16x4d sse2 neon/;
+ specialize qw/aom_highbd_sad8x8x4d sse2 neon/;
+ specialize qw/aom_highbd_sad8x4x4d sse2 neon/;
+ specialize qw/aom_highbd_sad4x8x4d sse2 neon/;
+ specialize qw/aom_highbd_sad4x4x4d sse2 neon/;
- specialize qw/aom_highbd_sad4x16x4d sse2/;
- specialize qw/aom_highbd_sad16x4x4d avx2 sse2/;
- specialize qw/aom_highbd_sad8x32x4d sse2/;
- specialize qw/aom_highbd_sad32x8x4d avx2 sse2/;
- specialize qw/aom_highbd_sad16x64x4d avx2 sse2/;
- specialize qw/aom_highbd_sad64x16x4d avx2 sse2/;
+ specialize qw/aom_highbd_sad4x16x4d sse2 neon/;
+ specialize qw/aom_highbd_sad16x4x4d avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad8x32x4d sse2 neon/;
+ specialize qw/aom_highbd_sad32x8x4d avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad16x64x4d avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad64x16x4d avx2 sse2 neon/;
- specialize qw/aom_highbd_sad_skip_128x128x4d avx2/;
- specialize qw/aom_highbd_sad_skip_128x64x4d avx2/;
- specialize qw/aom_highbd_sad_skip_64x128x4d avx2/;
- specialize qw/aom_highbd_sad_skip_64x64x4d avx2 sse2/;
- specialize qw/aom_highbd_sad_skip_64x32x4d avx2 sse2/;
- specialize qw/aom_highbd_sad_skip_32x64x4d avx2 sse2/;
- specialize qw/aom_highbd_sad_skip_32x32x4d avx2 sse2/;
- specialize qw/aom_highbd_sad_skip_32x16x4d avx2 sse2/;
- specialize qw/aom_highbd_sad_skip_16x32x4d avx2 sse2/;
- specialize qw/aom_highbd_sad_skip_16x16x4d avx2 sse2/;
- specialize qw/aom_highbd_sad_skip_16x8x4d avx2 sse2/;
- specialize qw/aom_highbd_sad_skip_8x16x4d sse2/;
- specialize qw/aom_highbd_sad_skip_8x8x4d sse2/;
- specialize qw/aom_highbd_sad_skip_4x8x4d sse2/;
+ specialize qw/aom_highbd_sad_skip_128x128x4d avx2 neon/;
+ specialize qw/aom_highbd_sad_skip_128x64x4d avx2 neon/;
+ specialize qw/aom_highbd_sad_skip_64x128x4d avx2 neon/;
+ specialize qw/aom_highbd_sad_skip_64x64x4d avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_64x32x4d avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_32x64x4d avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_32x32x4d avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_32x16x4d avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_16x32x4d avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_16x16x4d avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_16x8x4d avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_16x4x4d neon/;
+ specialize qw/aom_highbd_sad_skip_8x16x4d sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_8x8x4d sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_8x4x4d neon/;
+ specialize qw/aom_highbd_sad_skip_4x8x4d sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_4x4x4d neon/;
- specialize qw/aom_highbd_sad_skip_4x16x4d sse2/;
- specialize qw/aom_highbd_sad_skip_8x32x4d sse2/;
- specialize qw/aom_highbd_sad_skip_32x8x4d avx2 sse2/;
- specialize qw/aom_highbd_sad_skip_16x64x4d avx2 sse2/;
- specialize qw/aom_highbd_sad_skip_64x16x4d avx2 sse2/;
+ specialize qw/aom_highbd_sad_skip_4x16x4d sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_8x32x4d sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_32x8x4d avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_16x64x4d avx2 sse2 neon/;
+ specialize qw/aom_highbd_sad_skip_64x16x4d avx2 sse2 neon/;
specialize qw/aom_highbd_sad128x128x3d avx2/;
specialize qw/aom_highbd_sad128x64x3d avx2/;
@@ -1214,13 +1238,15 @@
specialize qw/aom_avg_8x8_quad avx2 sse2 neon/;
add_proto qw/void aom_minmax_8x8/, "const uint8_t *s, int p, const uint8_t *d, int dp, int *min, int *max";
- specialize qw/aom_minmax_8x8 sse2/;
+ specialize qw/aom_minmax_8x8 sse2 neon/;
if (aom_config("CONFIG_AV1_HIGHBITDEPTH") eq "yes") {
add_proto qw/unsigned int aom_highbd_avg_8x8/, "const uint8_t *, int p";
+ specialize qw/aom_highbd_avg_8x8 neon/;
add_proto qw/unsigned int aom_highbd_avg_4x4/, "const uint8_t *, int p";
specialize qw/aom_highbd_avg_4x4 neon/;
add_proto qw/void aom_highbd_minmax_8x8/, "const uint8_t *s, int p, const uint8_t *d, int dp, int *min, int *max";
+ specialize qw/aom_highbd_minmax_8x8 neon/;
}
add_proto qw/void aom_int_pro_row/, "int16_t *hbuf, const uint8_t *ref, const int ref_stride, const int width, const int height, int norm_factor";
@@ -1238,7 +1264,7 @@
# hadamard transform and satd for implementing the temporal dependency model
#
add_proto qw/void aom_hadamard_4x4/, "const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff";
- specialize qw/aom_hadamard_4x4 sse2/;
+ specialize qw/aom_hadamard_4x4 sse2 neon/;
add_proto qw/void aom_hadamard_8x8/, "const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff";
specialize qw/aom_hadamard_8x8 sse2 neon/;
@@ -1247,7 +1273,7 @@
specialize qw/aom_hadamard_16x16 avx2 sse2 neon/;
add_proto qw/void aom_hadamard_32x32/, "const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff";
- specialize qw/aom_hadamard_32x32 avx2 sse2/;
+ specialize qw/aom_hadamard_32x32 avx2 sse2 neon/;
add_proto qw/void aom_hadamard_lp_8x8/, "const int16_t *src_diff, ptrdiff_t src_stride, int16_t *coeff";
specialize qw/aom_hadamard_lp_8x8 sse2 neon/;
@@ -1258,18 +1284,15 @@
add_proto qw/void aom_hadamard_lp_8x8_dual/, "const int16_t *src_diff, ptrdiff_t src_stride, int16_t *coeff";
specialize qw/aom_hadamard_lp_8x8_dual sse2 avx2 neon/;
- add_proto qw/void aom_pixel_scale/, "const int16_t *src_diff, ptrdiff_t src_stride, int16_t *coeff, int log_scale, int h8, int w8";
- specialize qw/aom_pixel_scale sse2/;
-
if (aom_config("CONFIG_AV1_HIGHBITDEPTH") eq "yes") {
add_proto qw/void aom_highbd_hadamard_8x8/, "const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff";
- specialize qw/aom_highbd_hadamard_8x8 avx2/;
+ specialize qw/aom_highbd_hadamard_8x8 avx2 neon/;
add_proto qw/void aom_highbd_hadamard_16x16/, "const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff";
- specialize qw/aom_highbd_hadamard_16x16 avx2/;
+ specialize qw/aom_highbd_hadamard_16x16 avx2 neon/;
add_proto qw/void aom_highbd_hadamard_32x32/, "const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff";
- specialize qw/aom_highbd_hadamard_32x32 avx2/;
+ specialize qw/aom_highbd_hadamard_32x32 avx2 neon/;
}
add_proto qw/int aom_satd/, "const tran_low_t *coeff, int length";
specialize qw/aom_satd neon sse2 avx2/;
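
  # SATD here is simply the sum of absolute transform coefficients; a scalar
  # sketch of aom_satd over a Hadamard-transformed block:
  #
  #   int satd = 0;
  #   for (int i = 0; i < length; ++i) satd += abs((int)coeff[i]);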
@@ -1299,17 +1322,11 @@
#
# Specialty Variance
#
- add_proto qw/void aom_get16x16var/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum";
- add_proto qw/void aom_get8x8var/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum";
-
- specialize qw/aom_get16x16var neon/;
- specialize qw/aom_get8x8var sse2 neon/;
-
add_proto qw/void aom_get_var_sse_sum_8x8_quad/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse8x8, int *sum8x8, unsigned int *tot_sse, int *tot_sum, uint32_t *var8x8";
specialize qw/aom_get_var_sse_sum_8x8_quad avx2 sse2 neon/;
add_proto qw/void aom_get_var_sse_sum_16x16_dual/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse16x16, unsigned int *tot_sse, int *tot_sum, uint32_t *var16x16";
- specialize qw/aom_get_var_sse_sum_16x16_dual avx2/;
+ specialize qw/aom_get_var_sse_sum_16x16_dual avx2 sse2 neon/;
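# The fused helpers above rely on the identity used throughout this file:
#
#   variance = sse - (sum * sum) / (w * h)
#
# so one pass over the pixels yields sse, sum and hence the variance; the
# _quad/_dual kernels batch that pass across several sub-blocks at once.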
add_proto qw/unsigned int aom_mse16x16/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse";
add_proto qw/unsigned int aom_mse16x8/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse";
@@ -1323,9 +1340,6 @@
if (aom_config("CONFIG_AV1_HIGHBITDEPTH") eq "yes") {
foreach $bd (8, 10, 12) {
- add_proto qw/void/, "aom_highbd_${bd}_get16x16var", "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum";
- add_proto qw/void/, "aom_highbd_${bd}_get8x8var", "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum";
-
add_proto qw/unsigned int/, "aom_highbd_${bd}_mse16x16", "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse";
add_proto qw/unsigned int/, "aom_highbd_${bd}_mse16x8", "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse";
add_proto qw/unsigned int/, "aom_highbd_${bd}_mse8x16", "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse";
@@ -1340,10 +1354,7 @@
#
#
add_proto qw/unsigned int aom_get_mb_ss/, "const int16_t *";
- add_proto qw/unsigned int aom_get4x4sse_cs/, "const unsigned char *src_ptr, int source_stride, const unsigned char *ref_ptr, int ref_stride";
-
specialize qw/aom_get_mb_ss sse2/;
- specialize qw/aom_get4x4sse_cs neon/;
#
# Variance / Subpixel Variance / Subpixel Avg Variance
@@ -1522,7 +1533,7 @@
foreach (@encoder_block_sizes) {
($w, $h) = @$_;
add_proto qw/unsigned int/, "aom_masked_sub_pixel_variance${w}x${h}", "const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse";
- specialize "aom_masked_sub_pixel_variance${w}x${h}", qw/ssse3/;
+ specialize "aom_masked_sub_pixel_variance${w}x${h}", qw/ssse3 neon/;
}
if (aom_config("CONFIG_AV1_HIGHBITDEPTH") eq "yes") {
@@ -1543,8 +1554,8 @@
($w, $h) = @$_;
add_proto qw/unsigned int/, "aom_obmc_variance${w}x${h}", "const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse";
add_proto qw/unsigned int/, "aom_obmc_sub_pixel_variance${w}x${h}", "const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse";
- specialize "aom_obmc_variance${w}x${h}", qw/sse4_1 avx2/;
- specialize "aom_obmc_sub_pixel_variance${w}x${h}", q/sse4_1/;
+ specialize "aom_obmc_variance${w}x${h}", qw/sse4_1 avx2 neon/;
+ specialize "aom_obmc_sub_pixel_variance${w}x${h}", qw/sse4_1 neon/;
}
if (aom_config("CONFIG_AV1_HIGHBITDEPTH") eq "yes") {
@@ -1602,6 +1613,7 @@
# Comp Avg
#
add_proto qw/void aom_comp_avg_pred/, "uint8_t *comp_pred, const uint8_t *pred, int width, int height, const uint8_t *ref, int ref_stride";
+ specialize qw/aom_comp_avg_pred avx2 neon/;
add_proto qw/void aom_dist_wtd_comp_avg_pred/, "uint8_t *comp_pred, const uint8_t *pred, int width, int height, const uint8_t *ref, int ref_stride, const DIST_WTD_COMP_PARAMS *jcp_param";
specialize qw/aom_dist_wtd_comp_avg_pred ssse3/;
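# Plain compound averaging is a rounded mean of the two predictors,
#
#   comp_pred[i] = (pred[i] + ref[i] + 1) >> 1
#
# while the dist_wtd variant swaps the 1:1 weights for a pair from
# DIST_WTD_COMP_PARAMS summing to 16, as sketched for the dist_wtd SAD
# kernels earlier.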
@@ -1609,47 +1621,52 @@
if (aom_config("CONFIG_AV1_HIGHBITDEPTH") eq "yes") {
add_proto qw/unsigned int aom_highbd_12_variance128x128/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_12_variance128x128 sse2/;
+ specialize qw/aom_highbd_12_variance128x128 sse2 neon/;
add_proto qw/unsigned int aom_highbd_12_variance128x64/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_12_variance128x64 sse2/;
+ specialize qw/aom_highbd_12_variance128x64 sse2 neon/;
add_proto qw/unsigned int aom_highbd_12_variance64x128/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_12_variance64x128 sse2/;
+ specialize qw/aom_highbd_12_variance64x128 sse2 neon/;
add_proto qw/unsigned int aom_highbd_12_variance64x64/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_12_variance64x64 sse2/;
+ specialize qw/aom_highbd_12_variance64x64 sse2 neon/;
add_proto qw/unsigned int aom_highbd_12_variance64x32/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_12_variance64x32 sse2/;
+ specialize qw/aom_highbd_12_variance64x32 sse2 neon/;
add_proto qw/unsigned int aom_highbd_12_variance32x64/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_12_variance32x64 sse2/;
+ specialize qw/aom_highbd_12_variance32x64 sse2 neon/;
add_proto qw/unsigned int aom_highbd_12_variance32x32/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_12_variance32x32 sse2/;
+ specialize qw/aom_highbd_12_variance32x32 sse2 neon/;
add_proto qw/unsigned int aom_highbd_12_variance32x16/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_12_variance32x16 sse2/;
+ specialize qw/aom_highbd_12_variance32x16 sse2 neon/;
add_proto qw/unsigned int aom_highbd_12_variance16x32/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_12_variance16x32 sse2/;
+ specialize qw/aom_highbd_12_variance16x32 sse2 neon/;
add_proto qw/unsigned int aom_highbd_12_variance16x16/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_12_variance16x16 sse2/;
+ specialize qw/aom_highbd_12_variance16x16 sse2 neon/;
add_proto qw/unsigned int aom_highbd_12_variance16x8/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_12_variance16x8 sse2/;
+ specialize qw/aom_highbd_12_variance16x8 sse2 neon/;
add_proto qw/unsigned int aom_highbd_12_variance8x16/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_12_variance8x16 sse2/;
+ specialize qw/aom_highbd_12_variance8x16 sse2 neon/;
add_proto qw/unsigned int aom_highbd_12_variance8x8/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_12_variance8x8 sse2/;
+ specialize qw/aom_highbd_12_variance8x8 sse2 neon/;
add_proto qw/unsigned int aom_highbd_12_variance8x4/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
+ specialize qw/aom_highbd_12_variance8x4 neon/;
+
add_proto qw/unsigned int aom_highbd_12_variance4x8/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
+ specialize qw/aom_highbd_12_variance4x8 neon/;
+
add_proto qw/unsigned int aom_highbd_12_variance4x4/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
+ specialize qw/aom_highbd_12_variance4x4 neon/;
add_proto qw/unsigned int aom_highbd_10_variance128x128/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
specialize qw/aom_highbd_10_variance128x128 sse2 avx2 neon/;
@@ -1691,84 +1708,113 @@
specialize qw/aom_highbd_10_variance8x8 sse2 avx2 neon/;
add_proto qw/unsigned int aom_highbd_10_variance8x4/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
+ specialize qw/aom_highbd_10_variance8x4 neon/;
+
add_proto qw/unsigned int aom_highbd_10_variance4x8/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
+ specialize qw/aom_highbd_10_variance4x8 neon/;
+
add_proto qw/unsigned int aom_highbd_10_variance4x4/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
+ specialize qw/aom_highbd_10_variance4x4 neon/;
add_proto qw/unsigned int aom_highbd_8_variance128x128/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_8_variance128x128 sse2/;
+ specialize qw/aom_highbd_8_variance128x128 sse2 neon/;
add_proto qw/unsigned int aom_highbd_8_variance128x64/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_8_variance128x64 sse2/;
+ specialize qw/aom_highbd_8_variance128x64 sse2 neon/;
add_proto qw/unsigned int aom_highbd_8_variance64x128/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_8_variance64x128 sse2/;
+ specialize qw/aom_highbd_8_variance64x128 sse2 neon/;
add_proto qw/unsigned int aom_highbd_8_variance64x64/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_8_variance64x64 sse2/;
+ specialize qw/aom_highbd_8_variance64x64 sse2 neon/;
add_proto qw/unsigned int aom_highbd_8_variance64x32/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_8_variance64x32 sse2/;
+ specialize qw/aom_highbd_8_variance64x32 sse2 neon/;
add_proto qw/unsigned int aom_highbd_8_variance32x64/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_8_variance32x64 sse2/;
+ specialize qw/aom_highbd_8_variance32x64 sse2 neon/;
add_proto qw/unsigned int aom_highbd_8_variance32x32/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_8_variance32x32 sse2/;
+ specialize qw/aom_highbd_8_variance32x32 sse2 neon/;
add_proto qw/unsigned int aom_highbd_8_variance32x16/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_8_variance32x16 sse2/;
+ specialize qw/aom_highbd_8_variance32x16 sse2 neon/;
add_proto qw/unsigned int aom_highbd_8_variance16x32/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_8_variance16x32 sse2/;
+ specialize qw/aom_highbd_8_variance16x32 sse2 neon/;
add_proto qw/unsigned int aom_highbd_8_variance16x16/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_8_variance16x16 sse2/;
+ specialize qw/aom_highbd_8_variance16x16 sse2 neon/;
add_proto qw/unsigned int aom_highbd_8_variance16x8/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_8_variance16x8 sse2/;
+ specialize qw/aom_highbd_8_variance16x8 sse2 neon/;
add_proto qw/unsigned int aom_highbd_8_variance8x16/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_8_variance8x16 sse2/;
+ specialize qw/aom_highbd_8_variance8x16 sse2 neon/;
add_proto qw/unsigned int aom_highbd_8_variance8x8/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
- specialize qw/aom_highbd_8_variance8x8 sse2/;
+ specialize qw/aom_highbd_8_variance8x8 sse2 neon/;
add_proto qw/unsigned int aom_highbd_8_variance8x4/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
+ specialize qw/aom_highbd_8_variance8x4 neon/;
+
add_proto qw/unsigned int aom_highbd_8_variance4x8/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
+ specialize qw/aom_highbd_8_variance4x8 neon/;
+
add_proto qw/unsigned int aom_highbd_8_variance4x4/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
+ specialize qw/aom_highbd_8_variance4x4 neon/;
- add_proto qw/void aom_highbd_8_get16x16var/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum";
- add_proto qw/void aom_highbd_8_get8x8var/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum";
+ if (aom_config("CONFIG_REALTIME_ONLY") ne "yes") {
+ foreach $bd (8, 10, 12) {
+ add_proto qw/unsigned int/, "aom_highbd_${bd}_variance64x16", "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
+ specialize "aom_highbd_${bd}_variance64x16" , qw/neon/;
- add_proto qw/void aom_highbd_10_get16x16var/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum";
- add_proto qw/void aom_highbd_10_get8x8var/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum";
+ add_proto qw/unsigned int/, "aom_highbd_${bd}_variance32x8", "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
+ specialize "aom_highbd_${bd}_variance32x8" , qw/neon/;
- add_proto qw/void aom_highbd_12_get16x16var/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum";
- add_proto qw/void aom_highbd_12_get8x8var/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum";
+ add_proto qw/unsigned int/, "aom_highbd_${bd}_variance16x64", "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
+ specialize "aom_highbd_${bd}_variance16x64" , qw/neon/;
+
+ add_proto qw/unsigned int/, "aom_highbd_${bd}_variance16x4", "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
+ specialize "aom_highbd_${bd}_variance16x4" , qw/neon/;
+
+ add_proto qw/unsigned int/, "aom_highbd_${bd}_variance8x32", "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
+ specialize "aom_highbd_${bd}_variance8x32" , qw/neon/;
+
+ add_proto qw/unsigned int/, "aom_highbd_${bd}_variance4x16", "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse";
+ specialize "aom_highbd_${bd}_variance4x16" , qw/neon/;
+ }
+ }
add_proto qw/unsigned int aom_highbd_8_mse16x16/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse";
- specialize qw/aom_highbd_8_mse16x16 sse2/;
+ specialize qw/aom_highbd_8_mse16x16 sse2 neon/;
add_proto qw/unsigned int aom_highbd_8_mse16x8/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse";
+ specialize qw/aom_highbd_8_mse16x8 neon/;
add_proto qw/unsigned int aom_highbd_8_mse8x16/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse";
+ specialize qw/aom_highbd_8_mse8x16 neon/;
add_proto qw/unsigned int aom_highbd_8_mse8x8/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse";
- specialize qw/aom_highbd_8_mse8x8 sse2/;
+ specialize qw/aom_highbd_8_mse8x8 sse2 neon/;
add_proto qw/unsigned int aom_highbd_10_mse16x16/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse";
- specialize qw/aom_highbd_10_mse16x16 sse2/;
+ specialize qw/aom_highbd_10_mse16x16 sse2 neon/;
add_proto qw/unsigned int aom_highbd_10_mse16x8/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse";
+ specialize qw/aom_highbd_10_mse16x8 neon/;
add_proto qw/unsigned int aom_highbd_10_mse8x16/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse";
+ specialize qw/aom_highbd_10_mse8x16 neon/;
add_proto qw/unsigned int aom_highbd_10_mse8x8/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse";
- specialize qw/aom_highbd_10_mse8x8 sse2/;
+ specialize qw/aom_highbd_10_mse8x8 sse2 neon/;
add_proto qw/unsigned int aom_highbd_12_mse16x16/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse";
- specialize qw/aom_highbd_12_mse16x16 sse2/;
+ specialize qw/aom_highbd_12_mse16x16 sse2 neon/;
add_proto qw/unsigned int aom_highbd_12_mse16x8/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse";
+ specialize qw/aom_highbd_12_mse16x8 neon/;
add_proto qw/unsigned int aom_highbd_12_mse8x16/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse";
+ specialize qw/aom_highbd_12_mse8x16 neon/;
add_proto qw/unsigned int aom_highbd_12_mse8x8/, "const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse";
- specialize qw/aom_highbd_12_mse8x8 sse2/;
+ specialize qw/aom_highbd_12_mse8x8 sse2 neon/;
add_proto qw/void aom_highbd_comp_avg_pred/, "uint8_t *comp_pred8, const uint8_t *pred8, int width, int height, const uint8_t *ref8, int ref_stride";
@@ -2028,7 +2074,7 @@
add_proto qw/void aom_comp_mask_pred/, "uint8_t *comp_pred, const uint8_t *pred, int width, int height, const uint8_t *ref, int ref_stride, const uint8_t *mask, int mask_stride, int invert_mask";
- specialize qw/aom_comp_mask_pred ssse3 avx2/;
+ specialize qw/aom_comp_mask_pred ssse3 avx2 neon/;
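# aom_comp_mask_pred materializes the same 6-bit A64 mask blend sketched for
# the masked SAD kernels earlier, writing the blended block to comp_pred
# rather than folding it into a SAD.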
if (aom_config("CONFIG_AV1_HIGHBITDEPTH") eq "yes") {
add_proto qw/void aom_highbd_comp_mask_pred/, "uint8_t *comp_pred, const uint8_t *pred8, int width, int height, const uint8_t *ref8, int ref_stride, const uint8_t *mask, int mask_stride, int invert_mask";
@@ -2037,8 +2083,11 @@
# Flow estimation library
if (aom_config("CONFIG_REALTIME_ONLY") ne "yes") {
- add_proto qw/double av1_compute_cross_correlation/, "unsigned char *im1, int stride1, int x1, int y1, unsigned char *im2, int stride2, int x2, int y2";
+ add_proto qw/double av1_compute_cross_correlation/, "const unsigned char *frame1, int stride1, int x1, int y1, const unsigned char *frame2, int stride2, int x2, int y2";
specialize qw/av1_compute_cross_correlation sse4_1 avx2/;
+
+ add_proto qw/void aom_compute_flow_at_point/, "const uint8_t *src, const uint8_t *ref, int x, int y, int width, int height, int stride, double *u, double *v";
+ specialize qw/aom_compute_flow_at_point sse4_1/;
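+
+  # (Assumed from context: this refines the local optical-flow vector (u, v)
+  # at one feature point for the flow-based global motion search.)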
}
} # CONFIG_AV1_ENCODER
diff --git a/aom_dsp/arm/aom_convolve8_neon.c b/aom_dsp/arm/aom_convolve8_neon.c
new file mode 100644
index 0000000..3d07a0f
--- /dev/null
+++ b/aom_dsp/arm/aom_convolve8_neon.c
@@ -0,0 +1,1176 @@
+/*
+ * Copyright (c) 2014 The WebM project authors. All Rights Reserved.
+ * Copyright (c) 2023, Alliance for Open Media. All rights reserved
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#include <arm_neon.h>
+#include <assert.h>
+#include <string.h>
+
+#include "config/aom_config.h"
+#include "config/aom_dsp_rtcd.h"
+
+#include "aom/aom_integer.h"
+#include "aom_dsp/aom_dsp_common.h"
+#include "aom_dsp/aom_filter.h"
+#include "aom_dsp/arm/mem_neon.h"
+#include "aom_dsp/arm/transpose_neon.h"
+#include "aom_ports/mem.h"
+
+#if AOM_ARCH_AARCH64 && \
+ (defined(__ARM_FEATURE_DOTPROD) || defined(__ARM_FEATURE_MATMUL_INT8))
+
+DECLARE_ALIGNED(16, static const uint8_t, dot_prod_permute_tbl[48]) = {
+ 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6,
+ 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10,
+ 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14
+};
+
+DECLARE_ALIGNED(16, static const uint8_t, dot_prod_tran_concat_tbl[32]) = {
+ 0, 8, 16, 24, 1, 9, 17, 25, 2, 10, 18, 26, 3, 11, 19, 27,
+ 4, 12, 20, 28, 5, 13, 21, 29, 6, 14, 22, 30, 7, 15, 23, 31
+};
+
+DECLARE_ALIGNED(16, static const uint8_t, dot_prod_merge_block_tbl[48]) = {
+ /* Shift left and insert new last column in transposed 4x4 block. */
+ 1, 2, 3, 16, 5, 6, 7, 20, 9, 10, 11, 24, 13, 14, 15, 28,
+ /* Shift left and insert two new columns in transposed 4x4 block. */
+ 2, 3, 16, 17, 6, 7, 20, 21, 10, 11, 24, 25, 14, 15, 28, 29,
+ /* Shift left and insert three new columns in transposed 4x4 block. */
+ 3, 16, 17, 18, 7, 20, 21, 22, 11, 24, 25, 26, 15, 28, 29, 30
+};
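+/* Note on indexing: with vqtbl2q_u8, indices 0-15 select bytes from the
+ * first source register (the previous block) and 16-31 from the second (the
+ * incoming block). So row one of the merge table above, { 1, 2, 3, 16, ... },
+ * drops the oldest sample of each transposed row and appends one new one. */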
+
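+/* Two dot-product paths follow: when the I8MM extension
+ * (__ARM_FEATURE_MATMUL_INT8) is available, USDOT multiplies the unsigned
+ * samples against the signed filter directly; otherwise the DOTPROD SDOT
+ * path is used, which first biases the samples into the signed range and
+ * compensates with a precomputed correction term. */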
+#if defined(__ARM_FEATURE_MATMUL_INT8)
+
+static INLINE int16x4_t convolve8_4_usdot(uint8x16_t samples,
+ const int8x8_t filter,
+ const uint8x16x2_t permute_tbl) {
+ uint8x16_t permuted_samples[2];
+ int32x4_t sum;
+
+ /* Permute samples ready for dot product. */
+ /* { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 } */
+ permuted_samples[0] = vqtbl1q_u8(samples, permute_tbl.val[0]);
+ /* { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 } */
+ permuted_samples[1] = vqtbl1q_u8(samples, permute_tbl.val[1]);
+
+  /* Accumulate the dot product from zero; USDOT takes the unsigned samples
+   * directly, so no range-clamp correction is needed (cf. the SDOT path). */
+ sum = vusdotq_lane_s32(vdupq_n_s32(0), permuted_samples[0], filter, 0);
+ sum = vusdotq_lane_s32(sum, permuted_samples[1], filter, 1);
+
+ /* Further narrowing and packing is performed by the caller. */
+ return vqmovn_s32(sum);
+}
+
+static INLINE uint8x8_t convolve8_8_usdot(uint8x16_t samples,
+ const int8x8_t filter,
+ const uint8x16x3_t permute_tbl) {
+ uint8x16_t permuted_samples[3];
+ int32x4_t sum0, sum1;
+ int16x8_t sum;
+
+ /* Permute samples ready for dot product. */
+ /* { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 } */
+ permuted_samples[0] = vqtbl1q_u8(samples, permute_tbl.val[0]);
+ /* { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 } */
+ permuted_samples[1] = vqtbl1q_u8(samples, permute_tbl.val[1]);
+ /* { 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 } */
+ permuted_samples[2] = vqtbl1q_u8(samples, permute_tbl.val[2]);
+
+ /* First 4 output values. */
+ sum0 = vusdotq_lane_s32(vdupq_n_s32(0), permuted_samples[0], filter, 0);
+ sum0 = vusdotq_lane_s32(sum0, permuted_samples[1], filter, 1);
+ /* Second 4 output values. */
+ sum1 = vusdotq_lane_s32(vdupq_n_s32(0), permuted_samples[1], filter, 0);
+ sum1 = vusdotq_lane_s32(sum1, permuted_samples[2], filter, 1);
+
+ /* Narrow and re-pack. */
+ sum = vcombine_s16(vqmovn_s32(sum0), vqmovn_s32(sum1));
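+  /* vqrshrun applies the rounded right shift by FILTER_BITS (7) and
+   * saturates the result into the unsigned 8-bit pixel range. */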
+ return vqrshrun_n_s16(sum, FILTER_BITS);
+}
+
+void aom_convolve8_horiz_neon(const uint8_t *src, ptrdiff_t src_stride,
+ uint8_t *dst, ptrdiff_t dst_stride,
+ const int16_t *filter_x, int x_step_q4,
+ const int16_t *filter_y, int y_step_q4, int w,
+ int h) {
+ const int8x8_t filter = vmovn_s16(vld1q_s16(filter_x));
+ uint8x16_t s0, s1, s2, s3;
+
+ assert((intptr_t)dst % 4 == 0);
+ assert(dst_stride % 4 == 0);
+
+ (void)x_step_q4;
+ (void)filter_y;
+ (void)y_step_q4;
+
+ src -= ((SUBPEL_TAPS / 2) - 1);
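+  /* Back up the source pointer by (SUBPEL_TAPS / 2 - 1) = 3 samples so the
+   * 8-tap filter window is centred on the output pixel. */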
+
+ if (w == 4) {
+ const uint8x16x2_t perm_tbl = vld1q_u8_x2(dot_prod_permute_tbl);
+ do {
+ int16x4_t t0, t1, t2, t3;
+ uint8x8_t d01, d23;
+
+ load_u8_16x4(src, src_stride, &s0, &s1, &s2, &s3);
+
+ t0 = convolve8_4_usdot(s0, filter, perm_tbl);
+ t1 = convolve8_4_usdot(s1, filter, perm_tbl);
+ t2 = convolve8_4_usdot(s2, filter, perm_tbl);
+ t3 = convolve8_4_usdot(s3, filter, perm_tbl);
+ d01 = vqrshrun_n_s16(vcombine_s16(t0, t1), FILTER_BITS);
+ d23 = vqrshrun_n_s16(vcombine_s16(t2, t3), FILTER_BITS);
+
+ store_u8_4x1(dst + 0 * dst_stride, d01, 0);
+ store_u8_4x1(dst + 1 * dst_stride, d01, 1);
+ store_u8_4x1(dst + 2 * dst_stride, d23, 0);
+ store_u8_4x1(dst + 3 * dst_stride, d23, 1);
+
+ src += 4 * src_stride;
+ dst += 4 * dst_stride;
+ h -= 4;
+ } while (h > 0);
+ } else {
+ const uint8x16x3_t perm_tbl = vld1q_u8_x3(dot_prod_permute_tbl);
+ const uint8_t *s;
+ uint8_t *d;
+ int width;
+ uint8x8_t d0, d1, d2, d3;
+
+ do {
+ width = w;
+ s = src;
+ d = dst;
+ do {
+ load_u8_16x4(s, src_stride, &s0, &s1, &s2, &s3);
+
+ d0 = convolve8_8_usdot(s0, filter, perm_tbl);
+ d1 = convolve8_8_usdot(s1, filter, perm_tbl);
+ d2 = convolve8_8_usdot(s2, filter, perm_tbl);
+ d3 = convolve8_8_usdot(s3, filter, perm_tbl);
+
+ store_u8_8x4(d, dst_stride, d0, d1, d2, d3);
+
+ s += 8;
+ d += 8;
+ width -= 8;
+ } while (width != 0);
+ src += 4 * src_stride;
+ dst += 4 * dst_stride;
+ h -= 4;
+ } while (h > 0);
+ }
+}
+
+static INLINE void transpose_concat_4x4(uint8x8_t a0, uint8x8_t a1,
+ uint8x8_t a2, uint8x8_t a3,
+ uint8x16_t *b,
+ const uint8x16_t permute_tbl) {
+ /* Transpose 8-bit elements and concatenate result rows as follows:
+ * a0: 00, 01, 02, 03, XX, XX, XX, XX
+ * a1: 10, 11, 12, 13, XX, XX, XX, XX
+ * a2: 20, 21, 22, 23, XX, XX, XX, XX
+ * a3: 30, 31, 32, 33, XX, XX, XX, XX
+ *
+ * b: 00, 10, 20, 30, 01, 11, 21, 31, 02, 12, 22, 32, 03, 13, 23, 33
+ *
+ * The 'permute_tbl' is always 'dot_prod_tran_concat_tbl' above. Passing it
+ * as an argument is preferable to loading it directly from memory as this
+ * inline helper is called many times from the same parent function.
+ */
+
+ uint8x16x2_t samples = { { vcombine_u8(a0, a1), vcombine_u8(a2, a3) } };
+ *b = vqtbl2q_u8(samples, permute_tbl);
+}
+
+static INLINE void transpose_concat_8x4(uint8x8_t a0, uint8x8_t a1,
+ uint8x8_t a2, uint8x8_t a3,
+ uint8x16_t *b0, uint8x16_t *b1,
+ const uint8x16x2_t permute_tbl) {
+ /* Transpose 8-bit elements and concatenate result rows as follows:
+ * a0: 00, 01, 02, 03, 04, 05, 06, 07
+ * a1: 10, 11, 12, 13, 14, 15, 16, 17
+ * a2: 20, 21, 22, 23, 24, 25, 26, 27
+ * a3: 30, 31, 32, 33, 34, 35, 36, 37
+ *
+ * b0: 00, 10, 20, 30, 01, 11, 21, 31, 02, 12, 22, 32, 03, 13, 23, 33
+ * b1: 04, 14, 24, 34, 05, 15, 25, 35, 06, 16, 26, 36, 07, 17, 27, 37
+ *
+ * The 'permute_tbl' is always 'dot_prod_tran_concat_tbl' above. Passing it
+ * as an argument is preferable to loading it directly from memory as this
+ * inline helper is called many times from the same parent function.
+ */
+
+ uint8x16x2_t samples = { { vcombine_u8(a0, a1), vcombine_u8(a2, a3) } };
+ *b0 = vqtbl2q_u8(samples, permute_tbl.val[0]);
+ *b1 = vqtbl2q_u8(samples, permute_tbl.val[1]);
+}
+
+static INLINE int16x4_t convolve8_4_usdot_partial(const uint8x16_t samples_lo,
+ const uint8x16_t samples_hi,
+ const int8x8_t filter) {
+ /* Sample permutation is performed by the caller. */
+ int32x4_t sum;
+
+ sum = vusdotq_lane_s32(vdupq_n_s32(0), samples_lo, filter, 0);
+ sum = vusdotq_lane_s32(sum, samples_hi, filter, 1);
+
+ /* Further narrowing and packing is performed by the caller. */
+ return vqmovn_s32(sum);
+}
+
+static INLINE uint8x8_t convolve8_8_usdot_partial(const uint8x16_t samples0_lo,
+ const uint8x16_t samples0_hi,
+ const uint8x16_t samples1_lo,
+ const uint8x16_t samples1_hi,
+ const int8x8_t filter) {
+ /* Sample permutation is performed by the caller. */
+ int32x4_t sum0, sum1;
+ int16x8_t sum;
+
+ /* First 4 output values. */
+ sum0 = vusdotq_lane_s32(vdupq_n_s32(0), samples0_lo, filter, 0);
+ sum0 = vusdotq_lane_s32(sum0, samples0_hi, filter, 1);
+ /* Second 4 output values. */
+ sum1 = vusdotq_lane_s32(vdupq_n_s32(0), samples1_lo, filter, 0);
+ sum1 = vusdotq_lane_s32(sum1, samples1_hi, filter, 1);
+
+ /* Narrow and re-pack. */
+ sum = vcombine_s16(vqmovn_s32(sum0), vqmovn_s32(sum1));
+ return vqrshrun_n_s16(sum, FILTER_BITS);
+}
+
+void aom_convolve8_vert_neon(const uint8_t *src, ptrdiff_t src_stride,
+ uint8_t *dst, ptrdiff_t dst_stride,
+ const int16_t *filter_x, int x_step_q4,
+ const int16_t *filter_y, int y_step_q4, int w,
+ int h) {
+ const int8x8_t filter = vmovn_s16(vld1q_s16(filter_y));
+ const uint8x16x3_t merge_block_tbl = vld1q_u8_x3(dot_prod_merge_block_tbl);
+ uint8x8_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10;
+ uint8x16x2_t samples_LUT;
+
+ assert((intptr_t)dst % 4 == 0);
+ assert(dst_stride % 4 == 0);
+
+ (void)filter_x;
+ (void)x_step_q4;
+ (void)y_step_q4;
+
+ src -= ((SUBPEL_TAPS / 2) - 1) * src_stride;
+
+ if (w == 4) {
+ const uint8x16_t tran_concat_tbl = vld1q_u8(dot_prod_tran_concat_tbl);
+ uint8x16_t s0123, s1234, s2345, s3456, s4567, s5678, s6789, s78910;
+ int16x4_t d0, d1, d2, d3;
+ uint8x8_t d01, d23;
+
+ load_u8_8x7(src, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6);
+ src += 7 * src_stride;
+
+ s7 = vdup_n_u8(0);
+ s8 = vdup_n_u8(0);
+ s9 = vdup_n_u8(0);
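+  /* Only 7 rows are available ahead of the loop, so rows 7-9 were zeroed
+   * above to keep the transposes below fully defined. The vectors built from
+   * them are recomputed by the merge step in the first loop iteration. */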
+
+ /* This operation combines a conventional transpose and the sample permute
+ * (see horizontal case) required before computing the dot product.
+ */
+ transpose_concat_4x4(s0, s1, s2, s3, &s0123, tran_concat_tbl);
+ transpose_concat_4x4(s1, s2, s3, s4, &s1234, tran_concat_tbl);
+ transpose_concat_4x4(s2, s3, s4, s5, &s2345, tran_concat_tbl);
+ transpose_concat_4x4(s3, s4, s5, s6, &s3456, tran_concat_tbl);
+ transpose_concat_4x4(s4, s5, s6, s7, &s4567, tran_concat_tbl);
+ transpose_concat_4x4(s5, s6, s7, s8, &s5678, tran_concat_tbl);
+ transpose_concat_4x4(s6, s7, s8, s9, &s6789, tran_concat_tbl);
+
+ do {
+ load_u8_8x4(src, src_stride, &s7, &s8, &s9, &s10);
+
+ transpose_concat_4x4(s7, s8, s9, s10, &s78910, tran_concat_tbl);
+
+ /* Merge new data into block from previous iteration. */
+ samples_LUT.val[0] = s3456;
+ samples_LUT.val[1] = s78910;
+ s4567 = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[0]);
+ s5678 = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[1]);
+ s6789 = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[2]);
+
+ d0 = convolve8_4_usdot_partial(s0123, s4567, filter);
+ d1 = convolve8_4_usdot_partial(s1234, s5678, filter);
+ d2 = convolve8_4_usdot_partial(s2345, s6789, filter);
+ d3 = convolve8_4_usdot_partial(s3456, s78910, filter);
+ d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS);
+ d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS);
+
+ store_u8_4x1(dst + 0 * dst_stride, d01, 0);
+ store_u8_4x1(dst + 1 * dst_stride, d01, 1);
+ store_u8_4x1(dst + 2 * dst_stride, d23, 0);
+ store_u8_4x1(dst + 3 * dst_stride, d23, 1);
+
+ /* Prepare block for next iteration - re-using as much as possible. */
+ /* Shuffle everything up four rows. */
+ s0123 = s4567;
+ s1234 = s5678;
+ s2345 = s6789;
+ s3456 = s78910;
+
+ src += 4 * src_stride;
+ dst += 4 * dst_stride;
+ h -= 4;
+ } while (h != 0);
+ } else {
+ const uint8x16x2_t tran_concat_tbl = vld1q_u8_x2(dot_prod_tran_concat_tbl);
+ uint8x16_t s0123_lo, s0123_hi, s1234_lo, s1234_hi, s2345_lo, s2345_hi,
+ s3456_lo, s3456_hi, s4567_lo, s4567_hi, s5678_lo, s5678_hi, s6789_lo,
+ s6789_hi, s78910_lo, s78910_hi;
+ uint8x8_t d0, d1, d2, d3;
+ const uint8_t *s;
+ uint8_t *d;
+ int height;
+
+ do {
+ height = h;
+ s = src;
+ d = dst;
+
+ load_u8_8x7(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6);
+ s += 7 * src_stride;
+
+ s7 = vdup_n_u8(0);
+ s8 = vdup_n_u8(0);
+ s9 = vdup_n_u8(0);
+
+ /* This operation combines a conventional transpose and the sample permute
+ * (see horizontal case) required before computing the dot product.
+ */
+ transpose_concat_8x4(s0, s1, s2, s3, &s0123_lo, &s0123_hi,
+ tran_concat_tbl);
+ transpose_concat_8x4(s1, s2, s3, s4, &s1234_lo, &s1234_hi,
+ tran_concat_tbl);
+ transpose_concat_8x4(s2, s3, s4, s5, &s2345_lo, &s2345_hi,
+ tran_concat_tbl);
+ transpose_concat_8x4(s3, s4, s5, s6, &s3456_lo, &s3456_hi,
+ tran_concat_tbl);
+ transpose_concat_8x4(s4, s5, s6, s7, &s4567_lo, &s4567_hi,
+ tran_concat_tbl);
+ transpose_concat_8x4(s5, s6, s7, s8, &s5678_lo, &s5678_hi,
+ tran_concat_tbl);
+ transpose_concat_8x4(s6, s7, s8, s9, &s6789_lo, &s6789_hi,
+ tran_concat_tbl);
+
+ do {
+ load_u8_8x4(s, src_stride, &s7, &s8, &s9, &s10);
+
+ transpose_concat_8x4(s7, s8, s9, s10, &s78910_lo, &s78910_hi,
+ tran_concat_tbl);
+
+ /* Merge new data into block from previous iteration. */
+ samples_LUT.val[0] = s3456_lo;
+ samples_LUT.val[1] = s78910_lo;
+ s4567_lo = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[0]);
+ s5678_lo = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[1]);
+ s6789_lo = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[2]);
+
+ samples_LUT.val[0] = s3456_hi;
+ samples_LUT.val[1] = s78910_hi;
+ s4567_hi = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[0]);
+ s5678_hi = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[1]);
+ s6789_hi = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[2]);
+
+ d0 = convolve8_8_usdot_partial(s0123_lo, s4567_lo, s0123_hi, s4567_hi,
+ filter);
+ d1 = convolve8_8_usdot_partial(s1234_lo, s5678_lo, s1234_hi, s5678_hi,
+ filter);
+ d2 = convolve8_8_usdot_partial(s2345_lo, s6789_lo, s2345_hi, s6789_hi,
+ filter);
+ d3 = convolve8_8_usdot_partial(s3456_lo, s78910_lo, s3456_hi, s78910_hi,
+ filter);
+
+ store_u8_8x4(d, dst_stride, d0, d1, d2, d3);
+
+ /* Prepare block for next iteration - re-using as much as possible. */
+ /* Shuffle everything up four rows. */
+ s0123_lo = s4567_lo;
+ s0123_hi = s4567_hi;
+ s1234_lo = s5678_lo;
+ s1234_hi = s5678_hi;
+ s2345_lo = s6789_lo;
+ s2345_hi = s6789_hi;
+ s3456_lo = s78910_lo;
+ s3456_hi = s78910_hi;
+
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ height -= 4;
+ } while (height != 0);
+ src += 8;
+ dst += 8;
+ w -= 8;
+ } while (w != 0);
+ }
+}
+
+#else // !defined(__ARM_FEATURE_MATMUL_INT8)
+
+static INLINE int16x4_t convolve8_4_sdot(uint8x16_t samples,
+ const int8x8_t filter,
+ const int32x4_t correction,
+ const uint8x16_t range_limit,
+ const uint8x16x2_t permute_tbl) {
+ int8x16_t clamped_samples, permuted_samples[2];
+ int32x4_t sum;
+
+ /* Clamp sample range to [-128, 127] for 8-bit signed dot product. */
+ clamped_samples = vreinterpretq_s8_u8(vsubq_u8(samples, range_limit));
+
+ /* Permute samples ready for dot product. */
+ /* { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 } */
+ permuted_samples[0] = vqtbl1q_s8(clamped_samples, permute_tbl.val[0]);
+ /* { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 } */
+ permuted_samples[1] = vqtbl1q_s8(clamped_samples, permute_tbl.val[1]);
+
+ /* Accumulate dot product into 'correction' to account for range clamp. */
+ sum = vdotq_lane_s32(correction, permuted_samples[0], filter, 0);
+ sum = vdotq_lane_s32(sum, permuted_samples[1], filter, 1);
+
+ /* Further narrowing and packing is performed by the caller. */
+ return vqmovn_s32(sum);
+}
+
+static INLINE uint8x8_t convolve8_8_sdot(uint8x16_t samples,
+ const int8x8_t filter,
+ const int32x4_t correction,
+ const uint8x16_t range_limit,
+ const uint8x16x3_t permute_tbl) {
+ int8x16_t clamped_samples, permuted_samples[3];
+ int32x4_t sum0, sum1;
+ int16x8_t sum;
+
+ /* Clamp sample range to [-128, 127] for 8-bit signed dot product. */
+ clamped_samples = vreinterpretq_s8_u8(vsubq_u8(samples, range_limit));
+
+ /* Permute samples ready for dot product. */
+ /* { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 } */
+ permuted_samples[0] = vqtbl1q_s8(clamped_samples, permute_tbl.val[0]);
+ /* { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 } */
+ permuted_samples[1] = vqtbl1q_s8(clamped_samples, permute_tbl.val[1]);
+ /* { 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 } */
+ permuted_samples[2] = vqtbl1q_s8(clamped_samples, permute_tbl.val[2]);
+
+ /* Accumulate dot product into 'correction' to account for range clamp. */
+ /* First 4 output values. */
+ sum0 = vdotq_lane_s32(correction, permuted_samples[0], filter, 0);
+ sum0 = vdotq_lane_s32(sum0, permuted_samples[1], filter, 1);
+ /* Second 4 output values. */
+ sum1 = vdotq_lane_s32(correction, permuted_samples[1], filter, 0);
+ sum1 = vdotq_lane_s32(sum1, permuted_samples[2], filter, 1);
+
+ /* Narrow and re-pack. */
+ sum = vcombine_s16(vqmovn_s32(sum0), vqmovn_s32(sum1));
+ return vqrshrun_n_s16(sum, FILTER_BITS);
+}
+
+void aom_convolve8_horiz_neon(const uint8_t *src, ptrdiff_t src_stride,
+ uint8_t *dst, ptrdiff_t dst_stride,
+ const int16_t *filter_x, int x_step_q4,
+ const int16_t *filter_y, int y_step_q4, int w,
+ int h) {
+ const int8x8_t filter = vmovn_s16(vld1q_s16(filter_x));
+ const int16x8_t correct_tmp = vmulq_n_s16(vld1q_s16(filter_x), 128);
+ const int32x4_t correction = vdupq_n_s32((int32_t)vaddvq_s16(correct_tmp));
+ const uint8x16_t range_limit = vdupq_n_u8(128);
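+  /* Biasing the samples down by 128 makes the dot product compute
+   * sum(filter[k] * (src[k] - 128)); seeding the accumulator with
+   * correction = 128 * sum(filter[k]) restores sum(filter[k] * src[k]). */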
+ uint8x16_t s0, s1, s2, s3;
+
+ assert((intptr_t)dst % 4 == 0);
+ assert(dst_stride % 4 == 0);
+
+ (void)x_step_q4;
+ (void)filter_y;
+ (void)y_step_q4;
+
+ src -= ((SUBPEL_TAPS / 2) - 1);
+
+ if (w == 4) {
+ const uint8x16x2_t perm_tbl = vld1q_u8_x2(dot_prod_permute_tbl);
+ do {
+ int16x4_t t0, t1, t2, t3;
+ uint8x8_t d01, d23;
+
+ load_u8_16x4(src, src_stride, &s0, &s1, &s2, &s3);
+
+ t0 = convolve8_4_sdot(s0, filter, correction, range_limit, perm_tbl);
+ t1 = convolve8_4_sdot(s1, filter, correction, range_limit, perm_tbl);
+ t2 = convolve8_4_sdot(s2, filter, correction, range_limit, perm_tbl);
+ t3 = convolve8_4_sdot(s3, filter, correction, range_limit, perm_tbl);
+ d01 = vqrshrun_n_s16(vcombine_s16(t0, t1), FILTER_BITS);
+ d23 = vqrshrun_n_s16(vcombine_s16(t2, t3), FILTER_BITS);
+
+ store_u8_4x1(dst + 0 * dst_stride, d01, 0);
+ store_u8_4x1(dst + 1 * dst_stride, d01, 1);
+ store_u8_4x1(dst + 2 * dst_stride, d23, 0);
+ store_u8_4x1(dst + 3 * dst_stride, d23, 1);
+
+ src += 4 * src_stride;
+ dst += 4 * dst_stride;
+ h -= 4;
+ } while (h > 0);
+ } else {
+ const uint8x16x3_t perm_tbl = vld1q_u8_x3(dot_prod_permute_tbl);
+ const uint8_t *s;
+ uint8_t *d;
+ int width;
+ uint8x8_t d0, d1, d2, d3;
+
+ do {
+ width = w;
+ s = src;
+ d = dst;
+ do {
+ load_u8_16x4(s, src_stride, &s0, &s1, &s2, &s3);
+
+ d0 = convolve8_8_sdot(s0, filter, correction, range_limit, perm_tbl);
+ d1 = convolve8_8_sdot(s1, filter, correction, range_limit, perm_tbl);
+ d2 = convolve8_8_sdot(s2, filter, correction, range_limit, perm_tbl);
+ d3 = convolve8_8_sdot(s3, filter, correction, range_limit, perm_tbl);
+
+ store_u8_8x4(d, dst_stride, d0, d1, d2, d3);
+
+ s += 8;
+ d += 8;
+ width -= 8;
+ } while (width != 0);
+ src += 4 * src_stride;
+ dst += 4 * dst_stride;
+ h -= 4;
+ } while (h > 0);
+ }
+}
+
+static INLINE void transpose_concat_4x4(int8x8_t a0, int8x8_t a1, int8x8_t a2,
+ int8x8_t a3, int8x16_t *b,
+ const uint8x16_t permute_tbl) {
+ /* Transpose 8-bit elements and concatenate result rows as follows:
+ * a0: 00, 01, 02, 03, XX, XX, XX, XX
+ * a1: 10, 11, 12, 13, XX, XX, XX, XX
+ * a2: 20, 21, 22, 23, XX, XX, XX, XX
+ * a3: 30, 31, 32, 33, XX, XX, XX, XX
+ *
+ * b: 00, 10, 20, 30, 01, 11, 21, 31, 02, 12, 22, 32, 03, 13, 23, 33
+ *
+ * The 'permute_tbl' is always 'dot_prod_tran_concat_tbl' above. Passing it
+ * as an argument is preferable to loading it directly from memory as this
+ * inline helper is called many times from the same parent function.
+ */
+
+ int8x16x2_t samples = { { vcombine_s8(a0, a1), vcombine_s8(a2, a3) } };
+ *b = vqtbl2q_s8(samples, permute_tbl);
+}
+
+static INLINE void transpose_concat_8x4(int8x8_t a0, int8x8_t a1, int8x8_t a2,
+ int8x8_t a3, int8x16_t *b0,
+ int8x16_t *b1,
+ const uint8x16x2_t permute_tbl) {
+ /* Transpose 8-bit elements and concatenate result rows as follows:
+ * a0: 00, 01, 02, 03, 04, 05, 06, 07
+ * a1: 10, 11, 12, 13, 14, 15, 16, 17
+ * a2: 20, 21, 22, 23, 24, 25, 26, 27
+ * a3: 30, 31, 32, 33, 34, 35, 36, 37
+ *
+ * b0: 00, 10, 20, 30, 01, 11, 21, 31, 02, 12, 22, 32, 03, 13, 23, 33
+ * b1: 04, 14, 24, 34, 05, 15, 25, 35, 06, 16, 26, 36, 07, 17, 27, 37
+ *
+ * The 'permute_tbl' is always 'dot_prod_tran_concat_tbl' above. Passing it
+ * as an argument is preferable to loading it directly from memory as this
+ * inline helper is called many times from the same parent function.
+ */
+
+ int8x16x2_t samples = { { vcombine_s8(a0, a1), vcombine_s8(a2, a3) } };
+ *b0 = vqtbl2q_s8(samples, permute_tbl.val[0]);
+ *b1 = vqtbl2q_s8(samples, permute_tbl.val[1]);
+}
+
+static INLINE int16x4_t convolve8_4_sdot_partial(const int8x16_t samples_lo,
+ const int8x16_t samples_hi,
+ const int32x4_t correction,
+ const int8x8_t filter) {
+ /* Sample range-clamping and permutation are performed by the caller. */
+ int32x4_t sum;
+
+ /* Accumulate dot product into 'correction' to account for range clamp. */
+ sum = vdotq_lane_s32(correction, samples_lo, filter, 0);
+ sum = vdotq_lane_s32(sum, samples_hi, filter, 1);
+
+ /* Further narrowing and packing is performed by the caller. */
+ return vqmovn_s32(sum);
+}
+
+static INLINE uint8x8_t convolve8_8_sdot_partial(const int8x16_t samples0_lo,
+ const int8x16_t samples0_hi,
+ const int8x16_t samples1_lo,
+ const int8x16_t samples1_hi,
+ const int32x4_t correction,
+ const int8x8_t filter) {
+ /* Sample range-clamping and permutation are performed by the caller. */
+ int32x4_t sum0, sum1;
+ int16x8_t sum;
+
+ /* Accumulate dot product into 'correction' to account for range clamp. */
+ /* First 4 output values. */
+ sum0 = vdotq_lane_s32(correction, samples0_lo, filter, 0);
+ sum0 = vdotq_lane_s32(sum0, samples0_hi, filter, 1);
+ /* Second 4 output values. */
+ sum1 = vdotq_lane_s32(correction, samples1_lo, filter, 0);
+ sum1 = vdotq_lane_s32(sum1, samples1_hi, filter, 1);
+
+ /* Narrow and re-pack. */
+ sum = vcombine_s16(vqmovn_s32(sum0), vqmovn_s32(sum1));
+ return vqrshrun_n_s16(sum, FILTER_BITS);
+}
+
+void aom_convolve8_vert_neon(const uint8_t *src, ptrdiff_t src_stride,
+ uint8_t *dst, ptrdiff_t dst_stride,
+ const int16_t *filter_x, int x_step_q4,
+ const int16_t *filter_y, int y_step_q4, int w,
+ int h) {
+ const int8x8_t filter = vmovn_s16(vld1q_s16(filter_y));
+ const int16x8_t correct_tmp = vmulq_n_s16(vld1q_s16(filter_y), 128);
+ const int32x4_t correction = vdupq_n_s32((int32_t)vaddvq_s16(correct_tmp));
+ const uint8x8_t range_limit = vdup_n_u8(128);
+ const uint8x16x3_t merge_block_tbl = vld1q_u8_x3(dot_prod_merge_block_tbl);
+ uint8x8_t t0, t1, t2, t3, t4, t5, t6;
+ int8x8_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10;
+ int8x16x2_t samples_LUT;
+
+ assert((intptr_t)dst % 4 == 0);
+ assert(dst_stride % 4 == 0);
+
+ (void)filter_x;
+ (void)x_step_q4;
+ (void)y_step_q4;
+
+ src -= ((SUBPEL_TAPS / 2) - 1) * src_stride;
+
+ if (w == 4) {
+ const uint8x16_t tran_concat_tbl = vld1q_u8(dot_prod_tran_concat_tbl);
+ int8x16_t s0123, s1234, s2345, s3456, s4567, s5678, s6789, s78910;
+ int16x4_t d0, d1, d2, d3;
+ uint8x8_t d01, d23;
+
+ load_u8_8x7(src, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6);
+ src += 7 * src_stride;
+
+ /* Clamp sample range to [-128, 127] for 8-bit signed dot product. */
+ s0 = vreinterpret_s8_u8(vsub_u8(t0, range_limit));
+ s1 = vreinterpret_s8_u8(vsub_u8(t1, range_limit));
+ s2 = vreinterpret_s8_u8(vsub_u8(t2, range_limit));
+ s3 = vreinterpret_s8_u8(vsub_u8(t3, range_limit));
+ s4 = vreinterpret_s8_u8(vsub_u8(t4, range_limit));
+ s5 = vreinterpret_s8_u8(vsub_u8(t5, range_limit));
+ s6 = vreinterpret_s8_u8(vsub_u8(t6, range_limit));
+ s7 = vdup_n_s8(0);
+ s8 = vdup_n_s8(0);
+ s9 = vdup_n_s8(0);
+
+ /* This operation combines a conventional transpose and the sample permute
+ * (see horizontal case) required before computing the dot product.
+ */
+ transpose_concat_4x4(s0, s1, s2, s3, &s0123, tran_concat_tbl);
+ transpose_concat_4x4(s1, s2, s3, s4, &s1234, tran_concat_tbl);
+ transpose_concat_4x4(s2, s3, s4, s5, &s2345, tran_concat_tbl);
+ transpose_concat_4x4(s3, s4, s5, s6, &s3456, tran_concat_tbl);
+ transpose_concat_4x4(s4, s5, s6, s7, &s4567, tran_concat_tbl);
+ transpose_concat_4x4(s5, s6, s7, s8, &s5678, tran_concat_tbl);
+ transpose_concat_4x4(s6, s7, s8, s9, &s6789, tran_concat_tbl);
+
+ do {
+ uint8x8_t t7, t8, t9, t10;
+
+ load_u8_8x4(src, src_stride, &t7, &t8, &t9, &t10);
+
+ s7 = vreinterpret_s8_u8(vsub_u8(t7, range_limit));
+ s8 = vreinterpret_s8_u8(vsub_u8(t8, range_limit));
+ s9 = vreinterpret_s8_u8(vsub_u8(t9, range_limit));
+ s10 = vreinterpret_s8_u8(vsub_u8(t10, range_limit));
+
+ transpose_concat_4x4(s7, s8, s9, s10, &s78910, tran_concat_tbl);
+
+ /* Merge new data into block from previous iteration. */
+ samples_LUT.val[0] = s3456;
+ samples_LUT.val[1] = s78910;
+ s4567 = vqtbl2q_s8(samples_LUT, merge_block_tbl.val[0]);
+ s5678 = vqtbl2q_s8(samples_LUT, merge_block_tbl.val[1]);
+ s6789 = vqtbl2q_s8(samples_LUT, merge_block_tbl.val[2]);
+
+ d0 = convolve8_4_sdot_partial(s0123, s4567, correction, filter);
+ d1 = convolve8_4_sdot_partial(s1234, s5678, correction, filter);
+ d2 = convolve8_4_sdot_partial(s2345, s6789, correction, filter);
+ d3 = convolve8_4_sdot_partial(s3456, s78910, correction, filter);
+ d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS);
+ d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS);
+
+ store_u8_4x1(dst + 0 * dst_stride, d01, 0);
+ store_u8_4x1(dst + 1 * dst_stride, d01, 1);
+ store_u8_4x1(dst + 2 * dst_stride, d23, 0);
+ store_u8_4x1(dst + 3 * dst_stride, d23, 1);
+
+ /* Prepare block for next iteration - re-using as much as possible. */
+ /* Shuffle everything up four rows. */
+ s0123 = s4567;
+ s1234 = s5678;
+ s2345 = s6789;
+ s3456 = s78910;
+
+ src += 4 * src_stride;
+ dst += 4 * dst_stride;
+ h -= 4;
+ } while (h != 0);
+ } else {
+ const uint8x16x2_t tran_concat_tbl = vld1q_u8_x2(dot_prod_tran_concat_tbl);
+ int8x16_t s0123_lo, s0123_hi, s1234_lo, s1234_hi, s2345_lo, s2345_hi,
+ s3456_lo, s3456_hi, s4567_lo, s4567_hi, s5678_lo, s5678_hi, s6789_lo,
+ s6789_hi, s78910_lo, s78910_hi;
+ uint8x8_t d0, d1, d2, d3;
+ const uint8_t *s;
+ uint8_t *d;
+ int height;
+
+ do {
+ height = h;
+ s = src;
+ d = dst;
+
+ load_u8_8x7(s, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6);
+ s += 7 * src_stride;
+
+ /* Clamp sample range to [-128, 127] for 8-bit signed dot product. */
+ s0 = vreinterpret_s8_u8(vsub_u8(t0, range_limit));
+ s1 = vreinterpret_s8_u8(vsub_u8(t1, range_limit));
+ s2 = vreinterpret_s8_u8(vsub_u8(t2, range_limit));
+ s3 = vreinterpret_s8_u8(vsub_u8(t3, range_limit));
+ s4 = vreinterpret_s8_u8(vsub_u8(t4, range_limit));
+ s5 = vreinterpret_s8_u8(vsub_u8(t5, range_limit));
+ s6 = vreinterpret_s8_u8(vsub_u8(t6, range_limit));
+ s7 = vdup_n_s8(0);
+ s8 = vdup_n_s8(0);
+ s9 = vdup_n_s8(0);
+
+ /* This operation combines a conventional transpose and the sample permute
+ * (see horizontal case) required before computing the dot product.
+ */
+ transpose_concat_8x4(s0, s1, s2, s3, &s0123_lo, &s0123_hi,
+ tran_concat_tbl);
+ transpose_concat_8x4(s1, s2, s3, s4, &s1234_lo, &s1234_hi,
+ tran_concat_tbl);
+ transpose_concat_8x4(s2, s3, s4, s5, &s2345_lo, &s2345_hi,
+ tran_concat_tbl);
+ transpose_concat_8x4(s3, s4, s5, s6, &s3456_lo, &s3456_hi,
+ tran_concat_tbl);
+ transpose_concat_8x4(s4, s5, s6, s7, &s4567_lo, &s4567_hi,
+ tran_concat_tbl);
+ transpose_concat_8x4(s5, s6, s7, s8, &s5678_lo, &s5678_hi,
+ tran_concat_tbl);
+ transpose_concat_8x4(s6, s7, s8, s9, &s6789_lo, &s6789_hi,
+ tran_concat_tbl);
+
+ do {
+ uint8x8_t t7, t8, t9, t10;
+
+ load_u8_8x4(s, src_stride, &t7, &t8, &t9, &t10);
+
+ s7 = vreinterpret_s8_u8(vsub_u8(t7, range_limit));
+ s8 = vreinterpret_s8_u8(vsub_u8(t8, range_limit));
+ s9 = vreinterpret_s8_u8(vsub_u8(t9, range_limit));
+ s10 = vreinterpret_s8_u8(vsub_u8(t10, range_limit));
+
+ transpose_concat_8x4(s7, s8, s9, s10, &s78910_lo, &s78910_hi,
+ tran_concat_tbl);
+
+ /* Merge new data into block from previous iteration. */
+ samples_LUT.val[0] = s3456_lo;
+ samples_LUT.val[1] = s78910_lo;
+ s4567_lo = vqtbl2q_s8(samples_LUT, merge_block_tbl.val[0]);
+ s5678_lo = vqtbl2q_s8(samples_LUT, merge_block_tbl.val[1]);
+ s6789_lo = vqtbl2q_s8(samples_LUT, merge_block_tbl.val[2]);
+
+ samples_LUT.val[0] = s3456_hi;
+ samples_LUT.val[1] = s78910_hi;
+ s4567_hi = vqtbl2q_s8(samples_LUT, merge_block_tbl.val[0]);
+ s5678_hi = vqtbl2q_s8(samples_LUT, merge_block_tbl.val[1]);
+ s6789_hi = vqtbl2q_s8(samples_LUT, merge_block_tbl.val[2]);
+
+ d0 = convolve8_8_sdot_partial(s0123_lo, s4567_lo, s0123_hi, s4567_hi,
+ correction, filter);
+ d1 = convolve8_8_sdot_partial(s1234_lo, s5678_lo, s1234_hi, s5678_hi,
+ correction, filter);
+ d2 = convolve8_8_sdot_partial(s2345_lo, s6789_lo, s2345_hi, s6789_hi,
+ correction, filter);
+ d3 = convolve8_8_sdot_partial(s3456_lo, s78910_lo, s3456_hi, s78910_hi,
+ correction, filter);
+
+ store_u8_8x4(d, dst_stride, d0, d1, d2, d3);
+
+ /* Prepare block for next iteration - re-using as much as possible. */
+ /* Shuffle everything up four rows. */
+ s0123_lo = s4567_lo;
+ s0123_hi = s4567_hi;
+ s1234_lo = s5678_lo;
+ s1234_hi = s5678_hi;
+ s2345_lo = s6789_lo;
+ s2345_hi = s6789_hi;
+ s3456_lo = s78910_lo;
+ s3456_hi = s78910_hi;
+
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ height -= 4;
+ } while (height != 0);
+ src += 8;
+ dst += 8;
+ w -= 8;
+ } while (w != 0);
+ }
+}
+
+#endif // defined(__ARM_FEATURE_MATMUL_INT8)
+
+#else // !(AOM_ARCH_AARCH64 &&
+ // (defined(__ARM_FEATURE_DOTPROD) ||
+ // defined(__ARM_FEATURE_MATMUL_INT8)))
+
+static INLINE int16x4_t convolve8_4(const int16x4_t s0, const int16x4_t s1,
+ const int16x4_t s2, const int16x4_t s3,
+ const int16x4_t s4, const int16x4_t s5,
+ const int16x4_t s6, const int16x4_t s7,
+ const int16x8_t filter) {
+ const int16x4_t filter_lo = vget_low_s16(filter);
+ const int16x4_t filter_hi = vget_high_s16(filter);
+ int16x4_t sum;
+
+ sum = vmul_lane_s16(s0, filter_lo, 0);
+ sum = vmla_lane_s16(sum, s1, filter_lo, 1);
+ sum = vmla_lane_s16(sum, s2, filter_lo, 2);
+ sum = vmla_lane_s16(sum, s5, filter_hi, 1);
+ sum = vmla_lane_s16(sum, s6, filter_hi, 2);
+ sum = vmla_lane_s16(sum, s7, filter_hi, 3);
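+  /* The two centre taps carry the filter's largest coefficients, so their
+   * products are added last with saturating arithmetic to guard against
+   * int16_t overflow. */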
+ sum = vqadd_s16(sum, vmul_lane_s16(s3, filter_lo, 3));
+ sum = vqadd_s16(sum, vmul_lane_s16(s4, filter_hi, 0));
+ return sum;
+}
+
+static INLINE uint8x8_t convolve8_8(const int16x8_t s0, const int16x8_t s1,
+ const int16x8_t s2, const int16x8_t s3,
+ const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t s6, const int16x8_t s7,
+ const int16x8_t filter) {
+ const int16x4_t filter_lo = vget_low_s16(filter);
+ const int16x4_t filter_hi = vget_high_s16(filter);
+ int16x8_t sum;
+
+ sum = vmulq_lane_s16(s0, filter_lo, 0);
+ sum = vmlaq_lane_s16(sum, s1, filter_lo, 1);
+ sum = vmlaq_lane_s16(sum, s2, filter_lo, 2);
+ sum = vmlaq_lane_s16(sum, s5, filter_hi, 1);
+ sum = vmlaq_lane_s16(sum, s6, filter_hi, 2);
+ sum = vmlaq_lane_s16(sum, s7, filter_hi, 3);
+ sum = vqaddq_s16(sum, vmulq_lane_s16(s3, filter_lo, 3));
+ sum = vqaddq_s16(sum, vmulq_lane_s16(s4, filter_hi, 0));
+ return vqrshrun_n_s16(sum, FILTER_BITS);
+}
+
+void aom_convolve8_horiz_neon(const uint8_t *src, ptrdiff_t src_stride,
+ uint8_t *dst, ptrdiff_t dst_stride,
+ const int16_t *filter_x, int x_step_q4,
+ const int16_t *filter_y, int y_step_q4, int w,
+ int h) {
+ const int16x8_t filter = vld1q_s16(filter_x);
+
+ assert((intptr_t)dst % 4 == 0);
+ assert(dst_stride % 4 == 0);
+
+ (void)x_step_q4;
+ (void)filter_y;
+ (void)y_step_q4;
+
+ src -= ((SUBPEL_TAPS / 2) - 1);
+
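+  /* Without dot-product instructions the rows are transposed first so the
+   * horizontal filter can be applied down a register, then the results are
+   * transposed back before storing. */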
+ if (h == 4) {
+ uint8x8_t t0, t1, t2, t3, d01, d23;
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, d0, d1, d2, d3;
+
+ load_u8_8x4(src, src_stride, &t0, &t1, &t2, &t3);
+ transpose_u8_8x4(&t0, &t1, &t2, &t3);
+ s0 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s1 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s2 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t2)));
+ s3 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t3)));
+ s4 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s5 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s6 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t2)));
+
+ src += 7;
+
+ do {
+ load_u8_8x4(src, src_stride, &t0, &t1, &t2, &t3);
+ transpose_u8_8x4(&t0, &t1, &t2, &t3);
+ s7 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s8 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s9 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t2)));
+ s10 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t3)));
+
+ d0 = convolve8_4(s0, s1, s2, s3, s4, s5, s6, s7, filter);
+ d1 = convolve8_4(s1, s2, s3, s4, s5, s6, s7, s8, filter);
+ d2 = convolve8_4(s2, s3, s4, s5, s6, s7, s8, s9, filter);
+ d3 = convolve8_4(s3, s4, s5, s6, s7, s8, s9, s10, filter);
+ d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS);
+ d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS);
+
+ transpose_u8_4x4(&d01, &d23);
+
+ store_u8_4x1(dst + 0 * dst_stride, d01, 0);
+ store_u8_4x1(dst + 1 * dst_stride, d23, 0);
+ store_u8_4x1(dst + 2 * dst_stride, d01, 1);
+ store_u8_4x1(dst + 3 * dst_stride, d23, 1);
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ src += 4;
+ dst += 4;
+ w -= 4;
+ } while (w != 0);
+ } else {
+ uint8x8_t t0, t1, t2, t3, t4, t5, t6, t7, d0, d1, d2, d3;
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10;
+
+ if (w == 4) {
+ do {
+ load_u8_8x8(src, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+ transpose_u8_8x8(&t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+ s0 = vreinterpretq_s16_u16(vmovl_u8(t0));
+ s1 = vreinterpretq_s16_u16(vmovl_u8(t1));
+ s2 = vreinterpretq_s16_u16(vmovl_u8(t2));
+ s3 = vreinterpretq_s16_u16(vmovl_u8(t3));
+ s4 = vreinterpretq_s16_u16(vmovl_u8(t4));
+ s5 = vreinterpretq_s16_u16(vmovl_u8(t5));
+ s6 = vreinterpretq_s16_u16(vmovl_u8(t6));
+
+ load_u8_8x8(src + 7, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6,
+ &t7);
+ transpose_u8_4x8(&t0, &t1, &t2, &t3, t4, t5, t6, t7);
+ s7 = vreinterpretq_s16_u16(vmovl_u8(t0));
+ s8 = vreinterpretq_s16_u16(vmovl_u8(t1));
+ s9 = vreinterpretq_s16_u16(vmovl_u8(t2));
+ s10 = vreinterpretq_s16_u16(vmovl_u8(t3));
+
+ d0 = convolve8_8(s0, s1, s2, s3, s4, s5, s6, s7, filter);
+ d1 = convolve8_8(s1, s2, s3, s4, s5, s6, s7, s8, filter);
+ d2 = convolve8_8(s2, s3, s4, s5, s6, s7, s8, s9, filter);
+ d3 = convolve8_8(s3, s4, s5, s6, s7, s8, s9, s10, filter);
+
+ transpose_u8_8x4(&d0, &d1, &d2, &d3);
+
+ store_u8_4x1(dst + 0 * dst_stride, d0, 0);
+ store_u8_4x1(dst + 1 * dst_stride, d1, 0);
+ store_u8_4x1(dst + 2 * dst_stride, d2, 0);
+ store_u8_4x1(dst + 3 * dst_stride, d3, 0);
+ store_u8_4x1(dst + 4 * dst_stride, d0, 1);
+ store_u8_4x1(dst + 5 * dst_stride, d1, 1);
+ store_u8_4x1(dst + 6 * dst_stride, d2, 1);
+ store_u8_4x1(dst + 7 * dst_stride, d3, 1);
+
+ src += 8 * src_stride;
+ dst += 8 * dst_stride;
+ h -= 8;
+ } while (h > 0);
+ } else {
+ uint8x8_t d4, d5, d6, d7;
+ int16x8_t s11, s12, s13, s14;
+ int width;
+ const uint8_t *s;
+ uint8_t *d;
+
+ do {
+ load_u8_8x8(src, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+ transpose_u8_8x8(&t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+ s0 = vreinterpretq_s16_u16(vmovl_u8(t0));
+ s1 = vreinterpretq_s16_u16(vmovl_u8(t1));
+ s2 = vreinterpretq_s16_u16(vmovl_u8(t2));
+ s3 = vreinterpretq_s16_u16(vmovl_u8(t3));
+ s4 = vreinterpretq_s16_u16(vmovl_u8(t4));
+ s5 = vreinterpretq_s16_u16(vmovl_u8(t5));
+ s6 = vreinterpretq_s16_u16(vmovl_u8(t6));
+
+ width = w;
+ s = src + 7;
+ d = dst;
+
+ do {
+ load_u8_8x8(s, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+ transpose_u8_8x8(&t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+ s7 = vreinterpretq_s16_u16(vmovl_u8(t0));
+ s8 = vreinterpretq_s16_u16(vmovl_u8(t1));
+ s9 = vreinterpretq_s16_u16(vmovl_u8(t2));
+ s10 = vreinterpretq_s16_u16(vmovl_u8(t3));
+ s11 = vreinterpretq_s16_u16(vmovl_u8(t4));
+ s12 = vreinterpretq_s16_u16(vmovl_u8(t5));
+ s13 = vreinterpretq_s16_u16(vmovl_u8(t6));
+ s14 = vreinterpretq_s16_u16(vmovl_u8(t7));
+
+ d0 = convolve8_8(s0, s1, s2, s3, s4, s5, s6, s7, filter);
+ d1 = convolve8_8(s1, s2, s3, s4, s5, s6, s7, s8, filter);
+ d2 = convolve8_8(s2, s3, s4, s5, s6, s7, s8, s9, filter);
+ d3 = convolve8_8(s3, s4, s5, s6, s7, s8, s9, s10, filter);
+ d4 = convolve8_8(s4, s5, s6, s7, s8, s9, s10, s11, filter);
+ d5 = convolve8_8(s5, s6, s7, s8, s9, s10, s11, s12, filter);
+ d6 = convolve8_8(s6, s7, s8, s9, s10, s11, s12, s13, filter);
+ d7 = convolve8_8(s7, s8, s9, s10, s11, s12, s13, s14, filter);
+
+ transpose_u8_8x8(&d0, &d1, &d2, &d3, &d4, &d5, &d6, &d7);
+
+ store_u8_8x8(d, dst_stride, d0, d1, d2, d3, d4, d5, d6, d7);
+
+ s0 = s8;
+ s1 = s9;
+ s2 = s10;
+ s3 = s11;
+ s4 = s12;
+ s5 = s13;
+ s6 = s14;
+ s += 8;
+ d += 8;
+ width -= 8;
+ } while (width != 0);
+ src += 8 * src_stride;
+ dst += 8 * dst_stride;
+ h -= 8;
+ } while (h > 0);
+ }
+ }
+}
+
+void aom_convolve8_vert_neon(const uint8_t *src, ptrdiff_t src_stride,
+ uint8_t *dst, ptrdiff_t dst_stride,
+ const int16_t *filter_x, int x_step_q4,
+ const int16_t *filter_y, int y_step_q4, int w,
+ int h) {
+ const int16x8_t filter = vld1q_s16(filter_y);
+
+ assert((intptr_t)dst % 4 == 0);
+ assert(dst_stride % 4 == 0);
+
+ (void)filter_x;
+ (void)x_step_q4;
+ (void)y_step_q4;
+
+ src -= ((SUBPEL_TAPS / 2) - 1) * src_stride;
+
+ if (w == 4) {
+ uint8x8_t t0, t1, t2, t3, t4, t5, t6, d01, d23;
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, d0, d1, d2, d3;
+
+ load_u8_8x7(src, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6);
+ s0 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s1 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s2 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t2)));
+ s3 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t3)));
+ s4 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t4)));
+ s5 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t5)));
+ s6 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t6)));
+
+ src += 7 * src_stride;
+
+ do {
+ load_u8_8x4(src, src_stride, &t0, &t1, &t2, &t3);
+ s7 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s8 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s9 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t2)));
+ s10 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t3)));
+
+ d0 = convolve8_4(s0, s1, s2, s3, s4, s5, s6, s7, filter);
+ d1 = convolve8_4(s1, s2, s3, s4, s5, s6, s7, s8, filter);
+ d2 = convolve8_4(s2, s3, s4, s5, s6, s7, s8, s9, filter);
+ d3 = convolve8_4(s3, s4, s5, s6, s7, s8, s9, s10, filter);
+ d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS);
+ d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS);
+
+ store_u8_4x1(dst + 0 * dst_stride, d01, 0);
+ store_u8_4x1(dst + 1 * dst_stride, d01, 1);
+ store_u8_4x1(dst + 2 * dst_stride, d23, 0);
+ store_u8_4x1(dst + 3 * dst_stride, d23, 1);
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ src += 4 * src_stride;
+ dst += 4 * dst_stride;
+ h -= 4;
+ } while (h != 0);
+ } else {
+ uint8x8_t t0, t1, t2, t3, t4, t5, t6, d0, d1, d2, d3;
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10;
+ int height;
+ const uint8_t *s;
+ uint8_t *d;
+
+ do {
+ load_u8_8x7(src, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6);
+ s0 = vreinterpretq_s16_u16(vmovl_u8(t0));
+ s1 = vreinterpretq_s16_u16(vmovl_u8(t1));
+ s2 = vreinterpretq_s16_u16(vmovl_u8(t2));
+ s3 = vreinterpretq_s16_u16(vmovl_u8(t3));
+ s4 = vreinterpretq_s16_u16(vmovl_u8(t4));
+ s5 = vreinterpretq_s16_u16(vmovl_u8(t5));
+ s6 = vreinterpretq_s16_u16(vmovl_u8(t6));
+
+ height = h;
+ s = src + 7 * src_stride;
+ d = dst;
+
+ do {
+ load_u8_8x4(s, src_stride, &t0, &t1, &t2, &t3);
+ s7 = vreinterpretq_s16_u16(vmovl_u8(t0));
+ s8 = vreinterpretq_s16_u16(vmovl_u8(t1));
+ s9 = vreinterpretq_s16_u16(vmovl_u8(t2));
+ s10 = vreinterpretq_s16_u16(vmovl_u8(t3));
+
+ d0 = convolve8_8(s0, s1, s2, s3, s4, s5, s6, s7, filter);
+ d1 = convolve8_8(s1, s2, s3, s4, s5, s6, s7, s8, filter);
+ d2 = convolve8_8(s2, s3, s4, s5, s6, s7, s8, s9, filter);
+ d3 = convolve8_8(s3, s4, s5, s6, s7, s8, s9, s10, filter);
+
+ store_u8_8x4(d, dst_stride, d0, d1, d2, d3);
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ height -= 4;
+ } while (height != 0);
+ src += 8;
+ dst += 8;
+ w -= 8;
+ } while (w != 0);
+ }
+}
+
+#endif  // AOM_ARCH_AARCH64 &&
+        // (defined(__ARM_FEATURE_DOTPROD) ||
+        //  defined(__ARM_FEATURE_MATMUL_INT8))
diff --git a/aom_dsp/arm/avg_neon.c b/aom_dsp/arm/avg_neon.c
index 991fd3f..ef2f3af 100644
--- a/aom_dsp/arm/avg_neon.c
+++ b/aom_dsp/arm/avg_neon.c
@@ -9,7 +9,9 @@
*/
#include <arm_neon.h>
+#include <assert.h>
+#include "config/aom_config.h"
#include "config/aom_dsp_rtcd.h"
#include "aom/aom_integer.h"
#include "aom_dsp/arm/mem_neon.h"
@@ -17,7 +19,7 @@
#include "aom_dsp/arm/transpose_neon.h"
#include "aom_ports/mem.h"
-#if !defined(__aarch64__)
+#if !AOM_ARCH_AARCH64
static INLINE uint32x2_t horizontal_add_u16x8_v(const uint16x8_t a) {
const uint32x4_t b = vpaddlq_u16(a);
const uint64x2_t c = vpaddlq_u32(b);
@@ -29,7 +31,7 @@
unsigned int aom_avg_4x4_neon(const uint8_t *a, int a_stride) {
const uint8x16_t b = load_unaligned_u8q(a, a_stride);
const uint16x8_t c = vaddl_u8(vget_low_u8(b), vget_high_u8(b));
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
const uint32_t d = vaddlvq_u16(c);
return (d + 8) >> 4;
#else
@@ -52,7 +54,7 @@
sum = vaddw_u8(sum, e);
}
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
const uint32_t d = vaddlvq_u16(sum);
return (d + 32) >> 6;
#else
@@ -92,52 +94,90 @@
void aom_int_pro_row_neon(int16_t *hbuf, const uint8_t *ref,
const int ref_stride, const int width,
const int height, int norm_factor) {
- const uint8_t *idx = ref;
- const uint16x8_t zero = vdupq_n_u16(0);
- const int16x8_t neg_norm_factor = vdupq_n_s16(-norm_factor);
+ assert(width % 16 == 0);
+ assert(height % 4 == 0);
- for (int wd = 0; wd < width; wd += 16) {
- uint16x8_t vec0 = zero;
- uint16x8_t vec1 = zero;
- idx = ref + wd;
- for (int ht = 0; ht < height; ++ht) {
- const uint8x16_t tmp = vld1q_u8(idx);
- idx += ref_stride;
- vec0 = vaddw_u8(vec0, vget_low_u8(tmp));
- vec1 = vaddw_u8(vec1, vget_high_u8(tmp));
+ const int16x8_t neg_norm_factor = vdupq_n_s16(-norm_factor);
+ uint16x8_t sum_lo[2], sum_hi[2];
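+  // Sums are accumulated in 16-bit lanes; this assumes height never exceeds
+  // 256, since 256 * 255 still fits in a uint16_t.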
+
+ int w = 0;
+ do {
+ const uint8_t *r = ref + w;
+ uint8x16_t r0 = vld1q_u8(r + 0 * ref_stride);
+ uint8x16_t r1 = vld1q_u8(r + 1 * ref_stride);
+ uint8x16_t r2 = vld1q_u8(r + 2 * ref_stride);
+ uint8x16_t r3 = vld1q_u8(r + 3 * ref_stride);
+
+ sum_lo[0] = vaddl_u8(vget_low_u8(r0), vget_low_u8(r1));
+ sum_hi[0] = vaddl_u8(vget_high_u8(r0), vget_high_u8(r1));
+ sum_lo[1] = vaddl_u8(vget_low_u8(r2), vget_low_u8(r3));
+ sum_hi[1] = vaddl_u8(vget_high_u8(r2), vget_high_u8(r3));
+
+ r += 4 * ref_stride;
+
+ for (int h = height - 4; h != 0; h -= 4) {
+ r0 = vld1q_u8(r + 0 * ref_stride);
+ r1 = vld1q_u8(r + 1 * ref_stride);
+ r2 = vld1q_u8(r + 2 * ref_stride);
+ r3 = vld1q_u8(r + 3 * ref_stride);
+
+ uint16x8_t tmp0_lo = vaddl_u8(vget_low_u8(r0), vget_low_u8(r1));
+ uint16x8_t tmp0_hi = vaddl_u8(vget_high_u8(r0), vget_high_u8(r1));
+ uint16x8_t tmp1_lo = vaddl_u8(vget_low_u8(r2), vget_low_u8(r3));
+ uint16x8_t tmp1_hi = vaddl_u8(vget_high_u8(r2), vget_high_u8(r3));
+
+ sum_lo[0] = vaddq_u16(sum_lo[0], tmp0_lo);
+ sum_hi[0] = vaddq_u16(sum_hi[0], tmp0_hi);
+ sum_lo[1] = vaddq_u16(sum_lo[1], tmp1_lo);
+ sum_hi[1] = vaddq_u16(sum_hi[1], tmp1_hi);
+
+ r += 4 * ref_stride;
}
- const int16x8_t result0 =
- vshlq_s16(vreinterpretq_s16_u16(vec0), neg_norm_factor);
- const int16x8_t result1 =
- vshlq_s16(vreinterpretq_s16_u16(vec1), neg_norm_factor);
+ sum_lo[0] = vaddq_u16(sum_lo[0], sum_lo[1]);
+ sum_hi[0] = vaddq_u16(sum_hi[0], sum_hi[1]);
- vst1q_s16(hbuf + wd, result0);
- vst1q_s16(hbuf + wd + 8, result1);
- }
+ const int16x8_t avg0 =
+ vshlq_s16(vreinterpretq_s16_u16(sum_lo[0]), neg_norm_factor);
+ const int16x8_t avg1 =
+ vshlq_s16(vreinterpretq_s16_u16(sum_hi[0]), neg_norm_factor);
+
+ vst1q_s16(hbuf + w, avg0);
+ vst1q_s16(hbuf + w + 8, avg1);
+ w += 16;
+ } while (w < width);
}
void aom_int_pro_col_neon(int16_t *vbuf, const uint8_t *ref,
const int ref_stride, const int width,
const int height, int norm_factor) {
- for (int ht = 0; ht < height; ++ht) {
- uint16x8_t sum = vdupq_n_u16(0);
- for (int wd = 0; wd < width; wd += 16) {
- const uint8x16_t vec = vld1q_u8(ref + wd);
- sum = vaddq_u16(sum, vpaddlq_u8(vec));
+ assert(width % 16 == 0);
+ assert(height % 4 == 0);
+
+ const int16x4_t neg_norm_factor = vdup_n_s16(-norm_factor);
+ uint16x8_t sum[4];
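+  // vpaddlq_u8 pairwise-widens 16 bytes into eight 16-bit sums; vpadalq_u8
+  // then accumulates further columns of each row into the same lanes.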
+
+ int h = 0;
+ do {
+ sum[0] = vpaddlq_u8(vld1q_u8(ref + 0 * ref_stride));
+ sum[1] = vpaddlq_u8(vld1q_u8(ref + 1 * ref_stride));
+ sum[2] = vpaddlq_u8(vld1q_u8(ref + 2 * ref_stride));
+ sum[3] = vpaddlq_u8(vld1q_u8(ref + 3 * ref_stride));
+
+ for (int w = 16; w < width; w += 16) {
+ sum[0] = vpadalq_u8(sum[0], vld1q_u8(ref + 0 * ref_stride + w));
+ sum[1] = vpadalq_u8(sum[1], vld1q_u8(ref + 1 * ref_stride + w));
+ sum[2] = vpadalq_u8(sum[2], vld1q_u8(ref + 2 * ref_stride + w));
+ sum[3] = vpadalq_u8(sum[3], vld1q_u8(ref + 3 * ref_stride + w));
}
-#if defined(__aarch64__)
- vbuf[ht] = ((int16_t)vaddvq_u16(sum)) >> norm_factor;
-#else
- const uint32x4_t a = vpaddlq_u16(sum);
- const uint64x2_t b = vpaddlq_u32(a);
- const uint32x2_t c = vadd_u32(vreinterpret_u32_u64(vget_low_u64(b)),
- vreinterpret_u32_u64(vget_high_u64(b)));
- vbuf[ht] = ((int16_t)vget_lane_u32(c, 0)) >> norm_factor;
-#endif
- ref += ref_stride;
- }
+ uint16x4_t sum_4d = vmovn_u32(horizontal_add_4d_u16x8(sum));
+ int16x4_t avg = vshl_s16(vreinterpret_s16_u16(sum_4d), neg_norm_factor);
+ vst1_s16(vbuf + h, avg);
+
+ ref += 4 * ref_stride;
+ h += 4;
+ } while (h < height);
}
// coeff: 16 bits, dynamic range [-32640, 32640].
@@ -177,7 +217,7 @@
v_mean = vpadalq_s16(v_mean, diff);
v_low = vget_low_s16(diff);
v_sse = vmlal_s16(v_sse, v_low, v_low);
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
v_sse = vmlal_high_s16(v_sse, diff, diff);
#else
const int16x4_t v_high = vget_high_s16(diff);
@@ -192,27 +232,56 @@
return var;
}
-#if CONFIG_AV1_HIGHBITDEPTH
-unsigned int aom_highbd_avg_4x4_neon(const uint8_t *s, int p) {
- const uint16_t *src = CONVERT_TO_SHORTPTR(s);
- const uint16x4_t r0 = vld1_u16(src);
- src += p;
- uint16x4_t r1, r2, r3;
- r1 = vld1_u16(src);
- src += p;
- r2 = vld1_u16(src);
- src += p;
- r3 = vld1_u16(src);
- const uint16x4_t s1 = vadd_u16(r0, r1);
- const uint16x4_t s2 = vadd_u16(r2, r3);
- const uint16x4_t s3 = vadd_u16(s1, s2);
-#if defined(__aarch64__)
- return (vaddv_u16(s3) + 8) >> 4;
+void aom_minmax_8x8_neon(const uint8_t *a, int a_stride, const uint8_t *b,
+ int b_stride, int *min, int *max) {
+ // Load and concatenate.
+ const uint8x16_t a01 = load_u8_8x2(a + 0 * a_stride, a_stride);
+ const uint8x16_t a23 = load_u8_8x2(a + 2 * a_stride, a_stride);
+ const uint8x16_t a45 = load_u8_8x2(a + 4 * a_stride, a_stride);
+ const uint8x16_t a67 = load_u8_8x2(a + 6 * a_stride, a_stride);
+
+ const uint8x16_t b01 = load_u8_8x2(b + 0 * b_stride, b_stride);
+ const uint8x16_t b23 = load_u8_8x2(b + 2 * b_stride, b_stride);
+ const uint8x16_t b45 = load_u8_8x2(b + 4 * b_stride, b_stride);
+ const uint8x16_t b67 = load_u8_8x2(b + 6 * b_stride, b_stride);
+
+ // Absolute difference.
+ const uint8x16_t ab01_diff = vabdq_u8(a01, b01);
+ const uint8x16_t ab23_diff = vabdq_u8(a23, b23);
+ const uint8x16_t ab45_diff = vabdq_u8(a45, b45);
+ const uint8x16_t ab67_diff = vabdq_u8(a67, b67);
+
+ // Max values between the Q vectors.
+ const uint8x16_t ab0123_max = vmaxq_u8(ab01_diff, ab23_diff);
+ const uint8x16_t ab4567_max = vmaxq_u8(ab45_diff, ab67_diff);
+ const uint8x16_t ab0123_min = vminq_u8(ab01_diff, ab23_diff);
+ const uint8x16_t ab4567_min = vminq_u8(ab45_diff, ab67_diff);
+
+ const uint8x16_t ab07_max = vmaxq_u8(ab0123_max, ab4567_max);
+ const uint8x16_t ab07_min = vminq_u8(ab0123_min, ab4567_min);
+
+#if AOM_ARCH_AARCH64
+ *min = *max = 0; // Clear high bits
+ *((uint8_t *)max) = vmaxvq_u8(ab07_max);
+ *((uint8_t *)min) = vminvq_u8(ab07_min);
#else
- const uint16x4_t h1 = vpadd_u16(s3, s3);
- const uint16x4_t h2 = vpadd_u16(h1, h1);
- const uint16x4_t res = vrshr_n_u16(h2, 4);
- return vget_lane_u16(res, 0);
+ // Split into 64-bit vectors and execute pairwise min/max.
+ uint8x8_t ab_max = vmax_u8(vget_high_u8(ab07_max), vget_low_u8(ab07_max));
+ uint8x8_t ab_min = vmin_u8(vget_high_u8(ab07_min), vget_low_u8(ab07_min));
+
+  // Three rounds of pairwise vpmax/vpmin propagate the max/min value to
+  // every lane.
+ ab_max = vpmax_u8(ab_max, ab_max);
+ ab_min = vpmin_u8(ab_min, ab_min);
+
+ ab_max = vpmax_u8(ab_max, ab_max);
+ ab_min = vpmin_u8(ab_min, ab_min);
+
+ ab_max = vpmax_u8(ab_max, ab_max);
+ ab_min = vpmin_u8(ab_min, ab_min);
+
+ *min = *max = 0; // Clear high bits
+  // Store directly to avoid a costly NEON->GPR transfer.
+ vst1_lane_u8((uint8_t *)max, ab_max, 0);
+ vst1_lane_u8((uint8_t *)min, ab_min, 0);
#endif
}
-#endif // CONFIG_AV1_HIGHBITDEPTH
diff --git a/aom_dsp/arm/avg_pred_neon.c b/aom_dsp/arm/avg_pred_neon.c
new file mode 100644
index 0000000..04e0904
--- /dev/null
+++ b/aom_dsp/arm/avg_pred_neon.c
@@ -0,0 +1,171 @@
+/*
+ * Copyright (c) 2023, Alliance for Open Media. All rights reserved
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#include <arm_neon.h>
+#include <assert.h>
+
+#include "config/aom_dsp_rtcd.h"
+#include "aom_dsp/arm/mem_neon.h"
+#include "aom_dsp/blend.h"
+
+void aom_comp_avg_pred_neon(uint8_t *comp_pred, const uint8_t *pred, int width,
+ int height, const uint8_t *ref, int ref_stride) {
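+  // vrhaddq_u8 computes the rounding average (p + r + 1) >> 1, matching the
+  // scalar ROUND_POWER_OF_TWO(pred[i] + ref[i], 1).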
+ if (width > 8) {
+ do {
+ const uint8_t *pred_ptr = pred;
+ const uint8_t *ref_ptr = ref;
+ uint8_t *comp_pred_ptr = comp_pred;
+ int w = width;
+
+ do {
+ const uint8x16_t p = vld1q_u8(pred_ptr);
+ const uint8x16_t r = vld1q_u8(ref_ptr);
+ const uint8x16_t avg = vrhaddq_u8(p, r);
+
+ vst1q_u8(comp_pred_ptr, avg);
+
+ ref_ptr += 16;
+ pred_ptr += 16;
+ comp_pred_ptr += 16;
+ w -= 16;
+ } while (w != 0);
+
+ ref += ref_stride;
+ pred += width;
+ comp_pred += width;
+ } while (--height != 0);
+ } else if (width == 8) {
+ int h = height / 2;
+
+ do {
+ const uint8x16_t p = vld1q_u8(pred);
+ const uint8x16_t r = load_u8_8x2(ref, ref_stride);
+ const uint8x16_t avg = vrhaddq_u8(p, r);
+
+ vst1q_u8(comp_pred, avg);
+
+ ref += 2 * ref_stride;
+ pred += 16;
+ comp_pred += 16;
+ } while (--h != 0);
+ } else {
+ int h = height / 4;
+ assert(width == 4);
+
+ do {
+ const uint8x16_t p = vld1q_u8(pred);
+ const uint8x16_t r = load_unaligned_u8q(ref, ref_stride);
+ const uint8x16_t avg = vrhaddq_u8(p, r);
+
+ vst1q_u8(comp_pred, avg);
+
+ ref += 4 * ref_stride;
+ pred += 16;
+ comp_pred += 16;
+ } while (--h != 0);
+ }
+}
+
+void aom_comp_mask_pred_neon(uint8_t *comp_pred, const uint8_t *pred, int width,
+ int height, const uint8_t *ref, int ref_stride,
+ const uint8_t *mask, int mask_stride,
+ int invert_mask) {
+ const uint8_t *src0 = invert_mask ? pred : ref;
+ const uint8_t *src1 = invert_mask ? ref : pred;
+ const int src_stride0 = invert_mask ? width : ref_stride;
+ const int src_stride1 = invert_mask ? ref_stride : width;
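+  // Each output pixel is the blend (m * s0 + (64 - m) * s1 + 32) >> 6, with
+  // AOM_BLEND_A64_MAX_ALPHA = 64 and AOM_BLEND_A64_ROUND_BITS = 6.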
+
+ if (width > 8) {
+ const uint8x16_t max_alpha = vdupq_n_u8(AOM_BLEND_A64_MAX_ALPHA);
+ do {
+ const uint8_t *src0_ptr = src0;
+ const uint8_t *src1_ptr = src1;
+ const uint8_t *mask_ptr = mask;
+ uint8_t *comp_pred_ptr = comp_pred;
+ int w = width;
+
+ do {
+ const uint8x16_t s0 = vld1q_u8(src0_ptr);
+ const uint8x16_t s1 = vld1q_u8(src1_ptr);
+ const uint8x16_t m0 = vld1q_u8(mask_ptr);
+
+ uint8x16_t m0_inv = vsubq_u8(max_alpha, m0);
+ uint16x8_t blend_u16_lo = vmull_u8(vget_low_u8(s0), vget_low_u8(m0));
+ uint16x8_t blend_u16_hi = vmull_u8(vget_high_u8(s0), vget_high_u8(m0));
+ blend_u16_lo =
+ vmlal_u8(blend_u16_lo, vget_low_u8(s1), vget_low_u8(m0_inv));
+ blend_u16_hi =
+ vmlal_u8(blend_u16_hi, vget_high_u8(s1), vget_high_u8(m0_inv));
+
+ uint8x8_t blend_u8_lo =
+ vrshrn_n_u16(blend_u16_lo, AOM_BLEND_A64_ROUND_BITS);
+ uint8x8_t blend_u8_hi =
+ vrshrn_n_u16(blend_u16_hi, AOM_BLEND_A64_ROUND_BITS);
+ uint8x16_t blend_u8 = vcombine_u8(blend_u8_lo, blend_u8_hi);
+
+ vst1q_u8(comp_pred_ptr, blend_u8);
+
+ src0_ptr += 16;
+ src1_ptr += 16;
+ mask_ptr += 16;
+ comp_pred_ptr += 16;
+ w -= 16;
+ } while (w != 0);
+
+ src0 += src_stride0;
+ src1 += src_stride1;
+ mask += mask_stride;
+ comp_pred += width;
+ } while (--height != 0);
+ } else if (width == 8) {
+ const uint8x8_t max_alpha = vdup_n_u8(AOM_BLEND_A64_MAX_ALPHA);
+
+ do {
+ const uint8x8_t s0 = vld1_u8(src0);
+ const uint8x8_t s1 = vld1_u8(src1);
+ const uint8x8_t m0 = vld1_u8(mask);
+
+ uint8x8_t m0_inv = vsub_u8(max_alpha, m0);
+ uint16x8_t blend_u16 = vmull_u8(s0, m0);
+ blend_u16 = vmlal_u8(blend_u16, s1, m0_inv);
+ uint8x8_t blend_u8 = vrshrn_n_u16(blend_u16, AOM_BLEND_A64_ROUND_BITS);
+
+ vst1_u8(comp_pred, blend_u8);
+
+ src0 += src_stride0;
+ src1 += src_stride1;
+ mask += mask_stride;
+ comp_pred += 8;
+ } while (--height != 0);
+ } else {
+ const uint8x8_t max_alpha = vdup_n_u8(AOM_BLEND_A64_MAX_ALPHA);
+ int h = height / 2;
+ assert(width == 4);
+
+ do {
+ const uint8x8_t s0 = load_unaligned_u8(src0, src_stride0);
+ const uint8x8_t s1 = load_unaligned_u8(src1, src_stride1);
+ const uint8x8_t m0 = load_unaligned_u8(mask, mask_stride);
+
+ uint8x8_t m0_inv = vsub_u8(max_alpha, m0);
+ uint16x8_t blend_u16 = vmull_u8(s0, m0);
+ blend_u16 = vmlal_u8(blend_u16, s1, m0_inv);
+ uint8x8_t blend_u8 = vrshrn_n_u16(blend_u16, AOM_BLEND_A64_ROUND_BITS);
+
+ vst1_u8(comp_pred, blend_u8);
+
+ src0 += 2 * src_stride0;
+ src1 += 2 * src_stride1;
+ mask += 2 * mask_stride;
+ comp_pred += 8;
+ } while (--h != 0);
+ }
+}
diff --git a/aom_dsp/arm/blend_a64_mask_neon.c b/aom_dsp/arm/blend_a64_mask_neon.c
index f11d57e..c3ee0b7 100644
--- a/aom_dsp/arm/blend_a64_mask_neon.c
+++ b/aom_dsp/arm/blend_a64_mask_neon.c
@@ -86,19 +86,21 @@
const int16x8_t vec_round_bits) {
int16x8_t src0_0, src0_1;
int16x8_t src1_0, src1_1;
- uint64x2_t tu0 = vdupq_n_u64(0), tu1 = vdupq_n_u64(0), tu2 = vdupq_n_u64(0),
- tu3 = vdupq_n_u64(0);
+ uint16x8_t tu0 = vdupq_n_u16(0);
+ uint16x8_t tu1 = vdupq_n_u16(0);
+ uint16x8_t tu2 = vdupq_n_u16(0);
+ uint16x8_t tu3 = vdupq_n_u16(0);
int16x8_t mask0_1, mask2_3;
int16x8_t res0, res1;
load_unaligned_u16_4x4(src0, src0_stride, &tu0, &tu1);
load_unaligned_u16_4x4(src1, src1_stride, &tu2, &tu3);
- src0_0 = vreinterpretq_s16_u64(tu0);
- src0_1 = vreinterpretq_s16_u64(tu1);
+ src0_0 = vreinterpretq_s16_u16(tu0);
+ src0_1 = vreinterpretq_s16_u16(tu1);
- src1_0 = vreinterpretq_s16_u64(tu2);
- src1_1 = vreinterpretq_s16_u64(tu3);
+ src1_0 = vreinterpretq_s16_u16(tu2);
+ src1_1 = vreinterpretq_s16_u16(tu3);
mask0_1 = vcombine_s16(mask0, mask1);
mask2_3 = vcombine_s16(mask2, mask3);
@@ -150,9 +152,10 @@
assert(IS_POWER_OF_TWO(h));
assert(IS_POWER_OF_TWO(w));
- uint8x8_t s0, s1, s2, s3;
- uint32x2_t tu0 = vdup_n_u32(0), tu1 = vdup_n_u32(0), tu2 = vdup_n_u32(0),
- tu3 = vdup_n_u32(0);
+ uint8x8_t s0 = vdup_n_u8(0);
+ uint8x8_t s1 = vdup_n_u8(0);
+ uint8x8_t s2 = vdup_n_u8(0);
+ uint8x8_t s3 = vdup_n_u8(0);
uint8x16_t t0, t1, t2, t3, t4, t5, t6, t7;
int16x8_t mask0, mask1, mask2, mask3;
int16x8_t mask4, mask5, mask6, mask7;
@@ -197,10 +200,10 @@
} while (i < h);
} else {
do {
- load_unaligned_u8_4x4(mask_tmp, mask_stride, &tu0, &tu1);
+ load_unaligned_u8_4x4(mask_tmp, mask_stride, &s0, &s1);
- mask0 = vreinterpretq_s16_u16(vmovl_u8(vreinterpret_u8_u32(tu0)));
- mask1 = vreinterpretq_s16_u16(vmovl_u8(vreinterpret_u8_u32(tu1)));
+ mask0 = vreinterpretq_s16_u16(vmovl_u8(s0));
+ mask1 = vreinterpretq_s16_u16(vmovl_u8(s1));
mask0_low = vget_low_s16(mask0);
mask1_low = vget_high_s16(mask0);
@@ -412,14 +415,9 @@
} while (i < h);
} else {
do {
- load_unaligned_u8_4x4(mask_tmp, 2 * mask_stride, &tu0, &tu1);
- load_unaligned_u8_4x4(mask_tmp + mask_stride, 2 * mask_stride, &tu2,
- &tu3);
-
- s0 = vreinterpret_u8_u32(tu0);
- s1 = vreinterpret_u8_u32(tu1);
- s2 = vreinterpret_u8_u32(tu2);
- s3 = vreinterpret_u8_u32(tu3);
+ load_unaligned_u8_4x4(mask_tmp, 2 * mask_stride, &s0, &s1);
+ load_unaligned_u8_4x4(mask_tmp + mask_stride, 2 * mask_stride, &s2,
+ &s3);
mask0 = vreinterpretq_s16_u16(vaddl_u8(s0, s2));
mask1 = vreinterpretq_s16_u16(vaddl_u8(s1, s3));
diff --git a/aom_dsp/arm/fwd_txfm_neon.c b/aom_dsp/arm/fwd_txfm_neon.c
index 7fccdab..a7d66b3 100644
--- a/aom_dsp/arm/fwd_txfm_neon.c
+++ b/aom_dsp/arm/fwd_txfm_neon.c
@@ -67,7 +67,10 @@
int16x4_t out_1 = vrshrn_n_s32(temp3, DCT_CONST_BITS);
int16x4_t out_3 = vrshrn_n_s32(temp4, DCT_CONST_BITS);
- transpose_s16_4x4d(&out_0, &out_1, &out_2, &out_3);
+  // Only transpose the results of the first pass.
+ if (i == 0) {
+ transpose_s16_4x4d(&out_0, &out_1, &out_2, &out_3);
+ }
*input_0 = out_0;
*input_1 = out_1;
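The skipped transpose is justified by separability. With each pass applying a 1-D operator A followed by a transpose, two passes give

    pass 1: (A X)^T = X^T A^T
    pass 2: A X^T A^T = (A X A^T)^T

so omitting the transpose in the second pass yields the transpose of the conventional 2-D result Y = A X A^T; the i == 0 guard above implies the callers of this 4x4 forward transform tolerate that layout.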
diff --git a/aom_dsp/arm/hadamard_neon.c b/aom_dsp/arm/hadamard_neon.c
index 75dd7d6..82ce0cd 100644
--- a/aom_dsp/arm/hadamard_neon.c
+++ b/aom_dsp/arm/hadamard_neon.c
@@ -15,6 +15,38 @@
#include "aom_dsp/arm/mem_neon.h"
#include "aom_dsp/arm/transpose_neon.h"
+static INLINE void hadamard_4x4_one_pass(int16x4_t *a0, int16x4_t *a1,
+ int16x4_t *a2, int16x4_t *a3) {
+ const int16x4_t b0 = vhadd_s16(*a0, *a1);
+ const int16x4_t b1 = vhsub_s16(*a0, *a1);
+ const int16x4_t b2 = vhadd_s16(*a2, *a3);
+ const int16x4_t b3 = vhsub_s16(*a2, *a3);
+
+ *a0 = vadd_s16(b0, b2);
+ *a1 = vadd_s16(b1, b3);
+ *a2 = vsub_s16(b0, b2);
+ *a3 = vsub_s16(b1, b3);
+}
+
+void aom_hadamard_4x4_neon(const int16_t *src_diff, ptrdiff_t src_stride,
+ tran_low_t *coeff) {
+ int16x4_t a0 = vld1_s16(src_diff);
+ int16x4_t a1 = vld1_s16(src_diff + src_stride);
+ int16x4_t a2 = vld1_s16(src_diff + 2 * src_stride);
+ int16x4_t a3 = vld1_s16(src_diff + 3 * src_stride);
+
+ hadamard_4x4_one_pass(&a0, &a1, &a2, &a3);
+
+ transpose_s16_4x4d(&a0, &a1, &a2, &a3);
+
+ hadamard_4x4_one_pass(&a0, &a1, &a2, &a3);
+
+ store_s16_to_tran_low(coeff, a0);
+ store_s16_to_tran_low(coeff + 4, a1);
+ store_s16_to_tran_low(coeff + 8, a2);
+ store_s16_to_tran_low(coeff + 12, a3);
+}
+
static void hadamard8x8_one_pass(int16x8_t *a0, int16x8_t *a1, int16x8_t *a2,
int16x8_t *a3, int16x8_t *a4, int16x8_t *a5,
int16x8_t *a6, int16x8_t *a7) {
@@ -154,44 +186,106 @@
void aom_hadamard_16x16_neon(const int16_t *src_diff, ptrdiff_t src_stride,
tran_low_t *coeff) {
- DECLARE_ALIGNED(32, tran_low_t, temp_coeff[16 * 16]);
/* Rearrange 16x16 to 8x32 and remove stride.
* Top left first. */
- aom_hadamard_8x8_neon(src_diff + 0 + 0 * src_stride, src_stride,
- temp_coeff + 0);
+ aom_hadamard_8x8_neon(src_diff + 0 + 0 * src_stride, src_stride, coeff + 0);
/* Top right. */
- aom_hadamard_8x8_neon(src_diff + 8 + 0 * src_stride, src_stride,
- temp_coeff + 64);
+ aom_hadamard_8x8_neon(src_diff + 8 + 0 * src_stride, src_stride, coeff + 64);
/* Bottom left. */
- aom_hadamard_8x8_neon(src_diff + 0 + 8 * src_stride, src_stride,
- temp_coeff + 128);
+ aom_hadamard_8x8_neon(src_diff + 0 + 8 * src_stride, src_stride, coeff + 128);
/* Bottom right. */
- aom_hadamard_8x8_neon(src_diff + 8 + 8 * src_stride, src_stride,
- temp_coeff + 192);
+ aom_hadamard_8x8_neon(src_diff + 8 + 8 * src_stride, src_stride, coeff + 192);
- tran_low_t *t_coeff = temp_coeff;
- for (int i = 0; i < 64; i += 8) {
- const int16x8_t a0 = load_tran_low_to_s16q(t_coeff + 0);
- const int16x8_t a1 = load_tran_low_to_s16q(t_coeff + 64);
- const int16x8_t a2 = load_tran_low_to_s16q(t_coeff + 128);
- const int16x8_t a3 = load_tran_low_to_s16q(t_coeff + 192);
+ for (int i = 0; i < 64; i += 16) {
+ const int16x8_t a00 = load_tran_low_to_s16q(coeff + 0);
+ const int16x8_t a01 = load_tran_low_to_s16q(coeff + 64);
+ const int16x8_t a02 = load_tran_low_to_s16q(coeff + 128);
+ const int16x8_t a03 = load_tran_low_to_s16q(coeff + 192);
- const int16x8_t b0 = vhaddq_s16(a0, a1);
- const int16x8_t b1 = vhsubq_s16(a0, a1);
- const int16x8_t b2 = vhaddq_s16(a2, a3);
- const int16x8_t b3 = vhsubq_s16(a2, a3);
+ const int16x8_t b00 = vhaddq_s16(a00, a01);
+ const int16x8_t b01 = vhsubq_s16(a00, a01);
+ const int16x8_t b02 = vhaddq_s16(a02, a03);
+ const int16x8_t b03 = vhsubq_s16(a02, a03);
- const int16x8_t c0 = vaddq_s16(b0, b2);
- const int16x8_t c1 = vaddq_s16(b1, b3);
- const int16x8_t c2 = vsubq_s16(b0, b2);
- const int16x8_t c3 = vsubq_s16(b1, b3);
+ const int16x8_t c00 = vaddq_s16(b00, b02);
+ const int16x8_t c01 = vaddq_s16(b01, b03);
+ const int16x8_t c02 = vsubq_s16(b00, b02);
+ const int16x8_t c03 = vsubq_s16(b01, b03);
- store_s16q_to_tran_low_offset_4(coeff + 0, c0);
- store_s16q_to_tran_low_offset_4(coeff + 64, c1);
- store_s16q_to_tran_low_offset_4(coeff + 128, c2);
- store_s16q_to_tran_low_offset_4(coeff + 192, c3);
+ const int16x8_t a10 = load_tran_low_to_s16q(coeff + 8 + 0);
+ const int16x8_t a11 = load_tran_low_to_s16q(coeff + 8 + 64);
+ const int16x8_t a12 = load_tran_low_to_s16q(coeff + 8 + 128);
+ const int16x8_t a13 = load_tran_low_to_s16q(coeff + 8 + 192);
- t_coeff += 8;
- coeff += (4 + (((i >> 3) & 1) << 3));
+ const int16x8_t b10 = vhaddq_s16(a10, a11);
+ const int16x8_t b11 = vhsubq_s16(a10, a11);
+ const int16x8_t b12 = vhaddq_s16(a12, a13);
+ const int16x8_t b13 = vhsubq_s16(a12, a13);
+
+ const int16x8_t c10 = vaddq_s16(b10, b12);
+ const int16x8_t c11 = vaddq_s16(b11, b13);
+ const int16x8_t c12 = vsubq_s16(b10, b12);
+ const int16x8_t c13 = vsubq_s16(b11, b13);
+
+ store_s16_to_tran_low(coeff + 0 + 0, vget_low_s16(c00));
+ store_s16_to_tran_low(coeff + 0 + 4, vget_low_s16(c10));
+ store_s16_to_tran_low(coeff + 0 + 8, vget_high_s16(c00));
+ store_s16_to_tran_low(coeff + 0 + 12, vget_high_s16(c10));
+
+ store_s16_to_tran_low(coeff + 64 + 0, vget_low_s16(c01));
+ store_s16_to_tran_low(coeff + 64 + 4, vget_low_s16(c11));
+ store_s16_to_tran_low(coeff + 64 + 8, vget_high_s16(c01));
+ store_s16_to_tran_low(coeff + 64 + 12, vget_high_s16(c11));
+
+ store_s16_to_tran_low(coeff + 128 + 0, vget_low_s16(c02));
+ store_s16_to_tran_low(coeff + 128 + 4, vget_low_s16(c12));
+ store_s16_to_tran_low(coeff + 128 + 8, vget_high_s16(c02));
+ store_s16_to_tran_low(coeff + 128 + 12, vget_high_s16(c12));
+
+ store_s16_to_tran_low(coeff + 192 + 0, vget_low_s16(c03));
+ store_s16_to_tran_low(coeff + 192 + 4, vget_low_s16(c13));
+ store_s16_to_tran_low(coeff + 192 + 8, vget_high_s16(c03));
+ store_s16_to_tran_low(coeff + 192 + 12, vget_high_s16(c13));
+
+ coeff += 16;
+ }
+}
+
+void aom_hadamard_32x32_neon(const int16_t *src_diff, ptrdiff_t src_stride,
+ tran_low_t *coeff) {
+ /* Top left first. */
+ aom_hadamard_16x16_neon(src_diff + 0 + 0 * src_stride, src_stride, coeff + 0);
+ /* Top right. */
+ aom_hadamard_16x16_neon(src_diff + 16 + 0 * src_stride, src_stride,
+ coeff + 256);
+ /* Bottom left. */
+ aom_hadamard_16x16_neon(src_diff + 0 + 16 * src_stride, src_stride,
+ coeff + 512);
+ /* Bottom right. */
+ aom_hadamard_16x16_neon(src_diff + 16 + 16 * src_stride, src_stride,
+ coeff + 768);
+
+ for (int i = 0; i < 256; i += 4) {
+ const int32x4_t a0 = vld1q_s32(coeff);
+ const int32x4_t a1 = vld1q_s32(coeff + 256);
+ const int32x4_t a2 = vld1q_s32(coeff + 512);
+ const int32x4_t a3 = vld1q_s32(coeff + 768);
+
+ const int32x4_t b0 = vshrq_n_s32(vaddq_s32(a0, a1), 2);
+ const int32x4_t b1 = vshrq_n_s32(vsubq_s32(a0, a1), 2);
+ const int32x4_t b2 = vshrq_n_s32(vaddq_s32(a2, a3), 2);
+ const int32x4_t b3 = vshrq_n_s32(vsubq_s32(a2, a3), 2);
+
+ const int32x4_t c0 = vaddq_s32(b0, b2);
+ const int32x4_t c1 = vaddq_s32(b1, b3);
+ const int32x4_t c2 = vsubq_s32(b0, b2);
+ const int32x4_t c3 = vsubq_s32(b1, b3);
+
+ vst1q_s32(coeff + 0, c0);
+ vst1q_s32(coeff + 256, c1);
+ vst1q_s32(coeff + 512, c2);
+ vst1q_s32(coeff + 768, c3);
+
+ coeff += 4;
}
}
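One lane of hadamard_4x4_one_pass in scalar form (a sketch; vhadd_s16/vhsub_s16 are halving operations, (x +/- y) >> 1 with arithmetic shift, so each pass halves once in its first butterfly stage):

    #include <stdint.h>

    static void hadamard_pass_scalar(int16_t *a0, int16_t *a1, int16_t *a2,
                                     int16_t *a3) {
      const int16_t b0 = (int16_t)((*a0 + *a1) >> 1);  // halving add
      const int16_t b1 = (int16_t)((*a0 - *a1) >> 1);  // halving sub
      const int16_t b2 = (int16_t)((*a2 + *a3) >> 1);
      const int16_t b3 = (int16_t)((*a2 - *a3) >> 1);
      *a0 = b0 + b2;
      *a1 = b1 + b3;
      *a2 = b0 - b2;
      *a3 = b1 - b3;
    }

aom_hadamard_4x4_neon applies this pass, transposes, then applies it again, giving the 2-D Hadamard transform of the 4x4 block scaled by 1/4 overall from the two halving stages.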
diff --git a/aom_dsp/arm/highbd_avg_neon.c b/aom_dsp/arm/highbd_avg_neon.c
new file mode 100644
index 0000000..47d5dae
--- /dev/null
+++ b/aom_dsp/arm/highbd_avg_neon.c
@@ -0,0 +1,125 @@
+/*
+ * Copyright (c) 2023 The WebM project authors. All Rights Reserved.
+ * Copyright (c) 2023, Alliance for Open Media. All Rights Reserved.
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#include <arm_neon.h>
+
+#include "config/aom_config.h"
+#include "config/aom_dsp_rtcd.h"
+#include "aom/aom_integer.h"
+#include "aom_dsp/arm/mem_neon.h"
+#include "aom_dsp/arm/sum_neon.h"
+#include "aom_ports/mem.h"
+
+uint32_t aom_highbd_avg_4x4_neon(const uint8_t *a, int a_stride) {
+ const uint16_t *a_ptr = CONVERT_TO_SHORTPTR(a);
+ uint16x4_t sum, a0, a1, a2, a3;
+
+ load_u16_4x4(a_ptr, a_stride, &a0, &a1, &a2, &a3);
+
+ sum = vadd_u16(a0, a1);
+ sum = vadd_u16(sum, a2);
+ sum = vadd_u16(sum, a3);
+
+ return (horizontal_add_u16x4(sum) + (1 << 3)) >> 4;
+}
+
+uint32_t aom_highbd_avg_8x8_neon(const uint8_t *a, int a_stride) {
+ const uint16_t *a_ptr = CONVERT_TO_SHORTPTR(a);
+ uint16x8_t sum, a0, a1, a2, a3, a4, a5, a6, a7;
+
+ load_u16_8x8(a_ptr, a_stride, &a0, &a1, &a2, &a3, &a4, &a5, &a6, &a7);
+
+ sum = vaddq_u16(a0, a1);
+ sum = vaddq_u16(sum, a2);
+ sum = vaddq_u16(sum, a3);
+ sum = vaddq_u16(sum, a4);
+ sum = vaddq_u16(sum, a5);
+ sum = vaddq_u16(sum, a6);
+ sum = vaddq_u16(sum, a7);
+
+ return (horizontal_add_u16x8(sum) + (1 << 5)) >> 6;
+}
+
+void aom_highbd_minmax_8x8_neon(const uint8_t *s8, int p, const uint8_t *d8,
+ int dp, int *min, int *max) {
+ const uint16_t *a_ptr = CONVERT_TO_SHORTPTR(s8);
+ const uint16_t *b_ptr = CONVERT_TO_SHORTPTR(d8);
+
+ const uint16x8_t a0 = vld1q_u16(a_ptr + 0 * p);
+ const uint16x8_t a1 = vld1q_u16(a_ptr + 1 * p);
+ const uint16x8_t a2 = vld1q_u16(a_ptr + 2 * p);
+ const uint16x8_t a3 = vld1q_u16(a_ptr + 3 * p);
+ const uint16x8_t a4 = vld1q_u16(a_ptr + 4 * p);
+ const uint16x8_t a5 = vld1q_u16(a_ptr + 5 * p);
+ const uint16x8_t a6 = vld1q_u16(a_ptr + 6 * p);
+ const uint16x8_t a7 = vld1q_u16(a_ptr + 7 * p);
+
+ const uint16x8_t b0 = vld1q_u16(b_ptr + 0 * dp);
+ const uint16x8_t b1 = vld1q_u16(b_ptr + 1 * dp);
+ const uint16x8_t b2 = vld1q_u16(b_ptr + 2 * dp);
+ const uint16x8_t b3 = vld1q_u16(b_ptr + 3 * dp);
+ const uint16x8_t b4 = vld1q_u16(b_ptr + 4 * dp);
+ const uint16x8_t b5 = vld1q_u16(b_ptr + 5 * dp);
+ const uint16x8_t b6 = vld1q_u16(b_ptr + 6 * dp);
+ const uint16x8_t b7 = vld1q_u16(b_ptr + 7 * dp);
+
+ const uint16x8_t abs_diff0 = vabdq_u16(a0, b0);
+ const uint16x8_t abs_diff1 = vabdq_u16(a1, b1);
+ const uint16x8_t abs_diff2 = vabdq_u16(a2, b2);
+ const uint16x8_t abs_diff3 = vabdq_u16(a3, b3);
+ const uint16x8_t abs_diff4 = vabdq_u16(a4, b4);
+ const uint16x8_t abs_diff5 = vabdq_u16(a5, b5);
+ const uint16x8_t abs_diff6 = vabdq_u16(a6, b6);
+ const uint16x8_t abs_diff7 = vabdq_u16(a7, b7);
+
+ const uint16x8_t max01 = vmaxq_u16(abs_diff0, abs_diff1);
+ const uint16x8_t max23 = vmaxq_u16(abs_diff2, abs_diff3);
+ const uint16x8_t max45 = vmaxq_u16(abs_diff4, abs_diff5);
+ const uint16x8_t max67 = vmaxq_u16(abs_diff6, abs_diff7);
+
+ const uint16x8_t max0123 = vmaxq_u16(max01, max23);
+ const uint16x8_t max4567 = vmaxq_u16(max45, max67);
+ const uint16x8_t max07 = vmaxq_u16(max0123, max4567);
+
+ const uint16x8_t min01 = vminq_u16(abs_diff0, abs_diff1);
+ const uint16x8_t min23 = vminq_u16(abs_diff2, abs_diff3);
+ const uint16x8_t min45 = vminq_u16(abs_diff4, abs_diff5);
+ const uint16x8_t min67 = vminq_u16(abs_diff6, abs_diff7);
+
+ const uint16x8_t min0123 = vminq_u16(min01, min23);
+ const uint16x8_t min4567 = vminq_u16(min45, min67);
+ const uint16x8_t min07 = vminq_u16(min0123, min4567);
+
+#if AOM_ARCH_AARCH64
+ *max = (int)vmaxvq_u16(max07);
+ *min = (int)vminvq_u16(min07);
+#else
+ // Split into 64-bit vectors and execute pairwise min/max.
+ uint16x4_t ab_max = vmax_u16(vget_high_u16(max07), vget_low_u16(max07));
+ uint16x4_t ab_min = vmin_u16(vget_high_u16(min07), vget_low_u16(min07));
+
+  // Two rounds of pairwise max/min propagate the value to every lane of the
+  // four-lane vectors.
+  ab_max = vpmax_u16(ab_max, ab_max);
+  ab_min = vpmin_u16(ab_min, ab_min);
+
+  ab_max = vpmax_u16(ab_max, ab_max);
+  ab_min = vpmin_u16(ab_min, ab_min);
+
+ *min = *max = 0; // Clear high bits
+ // Store directly to avoid costly neon->gpr transfer.
+ vst1_lane_u16((uint16_t *)max, ab_max, 0);
+ vst1_lane_u16((uint16_t *)min, ab_min, 0);
+#endif
+}
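aom_highbd_avg_4x4_neon and aom_highbd_avg_8x8_neon above compute rounded block means; a scalar model of the same rounding (a sketch):

    #include <stdint.h>

    // Rounded mean of an n x n block of 16-bit samples; log2_n2 = log2(n * n),
    // i.e. 4 for the 4x4 kernel ((sum + 8) >> 4) and 6 for 8x8
    // ((sum + 32) >> 6).
    static uint32_t highbd_avg_nxn_scalar(const uint16_t *p, int stride, int n,
                                          int log2_n2) {
      uint32_t sum = 0;
      for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) sum += p[i * stride + j];
      }
      return (sum + (1u << (log2_n2 - 1))) >> log2_n2;
    }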
diff --git a/aom_dsp/arm/highbd_hadamard_neon.c b/aom_dsp/arm/highbd_hadamard_neon.c
new file mode 100644
index 0000000..aad2046
--- /dev/null
+++ b/aom_dsp/arm/highbd_hadamard_neon.c
@@ -0,0 +1,213 @@
+/*
+ * Copyright (c) 2023 The WebM project authors. All Rights Reserved.
+ * Copyright (c) 2023, Alliance for Open Media. All Rights Reserved.
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#include <arm_neon.h>
+#include "config/aom_dsp_rtcd.h"
+#include "aom/aom_integer.h"
+#include "aom_dsp/arm/mem_neon.h"
+#include "aom_dsp/arm/transpose_neon.h"
+#include "aom_dsp/arm/sum_neon.h"
+#include "aom_ports/mem.h"
+
+static INLINE void hadamard_highbd_col8_first_pass(int16x8_t *a0, int16x8_t *a1,
+ int16x8_t *a2, int16x8_t *a3,
+ int16x8_t *a4, int16x8_t *a5,
+ int16x8_t *a6,
+ int16x8_t *a7) {
+ int16x8_t b0 = vaddq_s16(*a0, *a1);
+ int16x8_t b1 = vsubq_s16(*a0, *a1);
+ int16x8_t b2 = vaddq_s16(*a2, *a3);
+ int16x8_t b3 = vsubq_s16(*a2, *a3);
+ int16x8_t b4 = vaddq_s16(*a4, *a5);
+ int16x8_t b5 = vsubq_s16(*a4, *a5);
+ int16x8_t b6 = vaddq_s16(*a6, *a7);
+ int16x8_t b7 = vsubq_s16(*a6, *a7);
+
+ int16x8_t c0 = vaddq_s16(b0, b2);
+ int16x8_t c2 = vsubq_s16(b0, b2);
+ int16x8_t c1 = vaddq_s16(b1, b3);
+ int16x8_t c3 = vsubq_s16(b1, b3);
+ int16x8_t c4 = vaddq_s16(b4, b6);
+ int16x8_t c6 = vsubq_s16(b4, b6);
+ int16x8_t c5 = vaddq_s16(b5, b7);
+ int16x8_t c7 = vsubq_s16(b5, b7);
+
+ *a0 = vaddq_s16(c0, c4);
+ *a2 = vsubq_s16(c0, c4);
+ *a7 = vaddq_s16(c1, c5);
+ *a6 = vsubq_s16(c1, c5);
+ *a3 = vaddq_s16(c2, c6);
+ *a1 = vsubq_s16(c2, c6);
+ *a4 = vaddq_s16(c3, c7);
+ *a5 = vsubq_s16(c3, c7);
+}
+
+static INLINE void hadamard_highbd_col4_second_pass(int16x4_t a0, int16x4_t a1,
+ int16x4_t a2, int16x4_t a3,
+ int16x4_t a4, int16x4_t a5,
+ int16x4_t a6, int16x4_t a7,
+ tran_low_t *coeff) {
+ int32x4_t b0 = vaddl_s16(a0, a1);
+ int32x4_t b1 = vsubl_s16(a0, a1);
+ int32x4_t b2 = vaddl_s16(a2, a3);
+ int32x4_t b3 = vsubl_s16(a2, a3);
+ int32x4_t b4 = vaddl_s16(a4, a5);
+ int32x4_t b5 = vsubl_s16(a4, a5);
+ int32x4_t b6 = vaddl_s16(a6, a7);
+ int32x4_t b7 = vsubl_s16(a6, a7);
+
+ int32x4_t c0 = vaddq_s32(b0, b2);
+ int32x4_t c2 = vsubq_s32(b0, b2);
+ int32x4_t c1 = vaddq_s32(b1, b3);
+ int32x4_t c3 = vsubq_s32(b1, b3);
+ int32x4_t c4 = vaddq_s32(b4, b6);
+ int32x4_t c6 = vsubq_s32(b4, b6);
+ int32x4_t c5 = vaddq_s32(b5, b7);
+ int32x4_t c7 = vsubq_s32(b5, b7);
+
+ int32x4_t d0 = vaddq_s32(c0, c4);
+ int32x4_t d2 = vsubq_s32(c0, c4);
+ int32x4_t d7 = vaddq_s32(c1, c5);
+ int32x4_t d6 = vsubq_s32(c1, c5);
+ int32x4_t d3 = vaddq_s32(c2, c6);
+ int32x4_t d1 = vsubq_s32(c2, c6);
+ int32x4_t d4 = vaddq_s32(c3, c7);
+ int32x4_t d5 = vsubq_s32(c3, c7);
+
+ vst1q_s32(coeff + 0, d0);
+ vst1q_s32(coeff + 4, d1);
+ vst1q_s32(coeff + 8, d2);
+ vst1q_s32(coeff + 12, d3);
+ vst1q_s32(coeff + 16, d4);
+ vst1q_s32(coeff + 20, d5);
+ vst1q_s32(coeff + 24, d6);
+ vst1q_s32(coeff + 28, d7);
+}
+
+void aom_highbd_hadamard_8x8_neon(const int16_t *src_diff, ptrdiff_t src_stride,
+ tran_low_t *coeff) {
+ int16x4_t b0, b1, b2, b3, b4, b5, b6, b7;
+
+ int16x8_t s0 = vld1q_s16(src_diff + 0 * src_stride);
+ int16x8_t s1 = vld1q_s16(src_diff + 1 * src_stride);
+ int16x8_t s2 = vld1q_s16(src_diff + 2 * src_stride);
+ int16x8_t s3 = vld1q_s16(src_diff + 3 * src_stride);
+ int16x8_t s4 = vld1q_s16(src_diff + 4 * src_stride);
+ int16x8_t s5 = vld1q_s16(src_diff + 5 * src_stride);
+ int16x8_t s6 = vld1q_s16(src_diff + 6 * src_stride);
+ int16x8_t s7 = vld1q_s16(src_diff + 7 * src_stride);
+
+ // For the first pass we can stay in 16-bit elements (4095*8 = 32760).
+ hadamard_highbd_col8_first_pass(&s0, &s1, &s2, &s3, &s4, &s5, &s6, &s7);
+
+ transpose_s16_8x8(&s0, &s1, &s2, &s3, &s4, &s5, &s6, &s7);
+
+ // For the second pass we need to widen to 32-bit elements, so we're
+ // processing 4 columns at a time.
+ // Skip the second transpose because it is not required.
+
+ b0 = vget_low_s16(s0);
+ b1 = vget_low_s16(s1);
+ b2 = vget_low_s16(s2);
+ b3 = vget_low_s16(s3);
+ b4 = vget_low_s16(s4);
+ b5 = vget_low_s16(s5);
+ b6 = vget_low_s16(s6);
+ b7 = vget_low_s16(s7);
+
+ hadamard_highbd_col4_second_pass(b0, b1, b2, b3, b4, b5, b6, b7, coeff);
+
+ b0 = vget_high_s16(s0);
+ b1 = vget_high_s16(s1);
+ b2 = vget_high_s16(s2);
+ b3 = vget_high_s16(s3);
+ b4 = vget_high_s16(s4);
+ b5 = vget_high_s16(s5);
+ b6 = vget_high_s16(s6);
+ b7 = vget_high_s16(s7);
+
+ hadamard_highbd_col4_second_pass(b0, b1, b2, b3, b4, b5, b6, b7, coeff + 32);
+}
+
+void aom_highbd_hadamard_16x16_neon(const int16_t *src_diff,
+ ptrdiff_t src_stride, tran_low_t *coeff) {
+ // Rearrange 16x16 to 8x32 and remove stride.
+ // Top left first.
+ aom_highbd_hadamard_8x8_neon(src_diff, src_stride, coeff);
+ // Top right.
+ aom_highbd_hadamard_8x8_neon(src_diff + 8, src_stride, coeff + 64);
+ // Bottom left.
+ aom_highbd_hadamard_8x8_neon(src_diff + 8 * src_stride, src_stride,
+ coeff + 128);
+ // Bottom right.
+ aom_highbd_hadamard_8x8_neon(src_diff + 8 * src_stride + 8, src_stride,
+ coeff + 192);
+
+ for (int i = 0; i < 16; i++) {
+ int32x4_t a0 = vld1q_s32(coeff + 4 * i);
+ int32x4_t a1 = vld1q_s32(coeff + 4 * i + 64);
+ int32x4_t a2 = vld1q_s32(coeff + 4 * i + 128);
+ int32x4_t a3 = vld1q_s32(coeff + 4 * i + 192);
+
+ int32x4_t b0 = vhaddq_s32(a0, a1);
+ int32x4_t b1 = vhsubq_s32(a0, a1);
+ int32x4_t b2 = vhaddq_s32(a2, a3);
+ int32x4_t b3 = vhsubq_s32(a2, a3);
+
+ int32x4_t c0 = vaddq_s32(b0, b2);
+ int32x4_t c1 = vaddq_s32(b1, b3);
+ int32x4_t c2 = vsubq_s32(b0, b2);
+ int32x4_t c3 = vsubq_s32(b1, b3);
+
+ vst1q_s32(coeff + 4 * i, c0);
+ vst1q_s32(coeff + 4 * i + 64, c1);
+ vst1q_s32(coeff + 4 * i + 128, c2);
+ vst1q_s32(coeff + 4 * i + 192, c3);
+ }
+}
+
+void aom_highbd_hadamard_32x32_neon(const int16_t *src_diff,
+ ptrdiff_t src_stride, tran_low_t *coeff) {
+ // Rearrange 32x32 to 16x64 and remove stride.
+ // Top left first.
+ aom_highbd_hadamard_16x16_neon(src_diff, src_stride, coeff);
+ // Top right.
+ aom_highbd_hadamard_16x16_neon(src_diff + 16, src_stride, coeff + 256);
+ // Bottom left.
+ aom_highbd_hadamard_16x16_neon(src_diff + 16 * src_stride, src_stride,
+ coeff + 512);
+ // Bottom right.
+ aom_highbd_hadamard_16x16_neon(src_diff + 16 * src_stride + 16, src_stride,
+ coeff + 768);
+
+ for (int i = 0; i < 64; i++) {
+ int32x4_t a0 = vld1q_s32(coeff + 4 * i);
+ int32x4_t a1 = vld1q_s32(coeff + 4 * i + 256);
+ int32x4_t a2 = vld1q_s32(coeff + 4 * i + 512);
+ int32x4_t a3 = vld1q_s32(coeff + 4 * i + 768);
+
+ int32x4_t b0 = vshrq_n_s32(vaddq_s32(a0, a1), 2);
+ int32x4_t b1 = vshrq_n_s32(vsubq_s32(a0, a1), 2);
+ int32x4_t b2 = vshrq_n_s32(vaddq_s32(a2, a3), 2);
+ int32x4_t b3 = vshrq_n_s32(vsubq_s32(a2, a3), 2);
+
+ int32x4_t c0 = vaddq_s32(b0, b2);
+ int32x4_t c1 = vaddq_s32(b1, b3);
+ int32x4_t c2 = vsubq_s32(b0, b2);
+ int32x4_t c3 = vsubq_s32(b1, b3);
+
+ vst1q_s32(coeff + 4 * i, c0);
+ vst1q_s32(coeff + 4 * i + 256, c1);
+ vst1q_s32(coeff + 4 * i + 512, c2);
+ vst1q_s32(coeff + 4 * i + 768, c3);
+ }
+}
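The comments in aom_highbd_hadamard_8x8_neon above are backed by a simple bound: each of the three butterfly stages in a pass at most doubles the magnitude, so with 12-bit residuals

    |pass 1 output| <= 8 * 4095 = 32760 < 2^15   (int16_t is enough)
    |pass 2 output| <= 8 * 32760 = 262080 < 2^19 (needs int32_t)

which is why the second pass starts with the widening vaddl_s16/vsubl_s16.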
diff --git a/aom_dsp/arm/highbd_intrapred_neon.c b/aom_dsp/arm/highbd_intrapred_neon.c
index fa2f11e..63f53c3 100644
--- a/aom_dsp/arm/highbd_intrapred_neon.c
+++ b/aom_dsp/arm/highbd_intrapred_neon.c
@@ -20,66 +20,333 @@
// -----------------------------------------------------------------------------
// DC
-static INLINE void highbd_dc_predictor(uint16_t *dst, ptrdiff_t stride, int bw,
- const uint16_t *above,
- const uint16_t *left) {
- assert(bw >= 4);
- assert(IS_POWER_OF_TWO(bw));
- int expected_dc, sum = 0;
- const int count = bw * 2;
- uint32x4_t sum_q = vdupq_n_u32(0);
- uint32x2_t sum_d;
- uint16_t *dst_1;
- if (bw >= 8) {
- for (int i = 0; i < bw; i += 8) {
- sum_q = vpadalq_u16(sum_q, vld1q_u16(above));
- sum_q = vpadalq_u16(sum_q, vld1q_u16(left));
- above += 8;
- left += 8;
- }
- sum_d = vadd_u32(vget_low_u32(sum_q), vget_high_u32(sum_q));
- sum = vget_lane_s32(vreinterpret_s32_u64(vpaddl_u32(sum_d)), 0);
- expected_dc = (sum + (count >> 1)) / count;
- const uint16x8_t dc = vdupq_n_u16((uint16_t)expected_dc);
- for (int r = 0; r < bw; r++) {
- dst_1 = dst;
- for (int i = 0; i < bw; i += 8) {
- vst1q_u16(dst_1, dc);
- dst_1 += 8;
- }
- dst += stride;
- }
- } else { // 4x4
- sum_q = vaddl_u16(vld1_u16(above), vld1_u16(left));
- sum_d = vadd_u32(vget_low_u32(sum_q), vget_high_u32(sum_q));
- sum = vget_lane_s32(vreinterpret_s32_u64(vpaddl_u32(sum_d)), 0);
- expected_dc = (sum + (count >> 1)) / count;
- const uint16x4_t dc = vdup_n_u16((uint16_t)expected_dc);
- for (int r = 0; r < bw; r++) {
- vst1_u16(dst, dc);
- dst += stride;
- }
+static INLINE void highbd_dc_store_4xh(uint16_t *dst, ptrdiff_t stride, int h,
+ uint16x4_t dc) {
+ for (int i = 0; i < h; ++i) {
+ vst1_u16(dst + i * stride, dc);
}
}
-#define INTRA_PRED_HIGHBD_SIZED_NEON(type, width) \
- void aom_highbd_##type##_predictor_##width##x##width##_neon( \
- uint16_t *dst, ptrdiff_t stride, const uint16_t *above, \
- const uint16_t *left, int bd) { \
- (void)bd; \
- highbd_##type##_predictor(dst, stride, width, above, left); \
+static INLINE void highbd_dc_store_8xh(uint16_t *dst, ptrdiff_t stride, int h,
+ uint16x8_t dc) {
+ for (int i = 0; i < h; ++i) {
+ vst1q_u16(dst + i * stride, dc);
+ }
+}
+
+static INLINE void highbd_dc_store_16xh(uint16_t *dst, ptrdiff_t stride, int h,
+ uint16x8_t dc) {
+ for (int i = 0; i < h; ++i) {
+ vst1q_u16(dst + i * stride, dc);
+ vst1q_u16(dst + i * stride + 8, dc);
+ }
+}
+
+static INLINE void highbd_dc_store_32xh(uint16_t *dst, ptrdiff_t stride, int h,
+ uint16x8_t dc) {
+ for (int i = 0; i < h; ++i) {
+ vst1q_u16(dst + i * stride, dc);
+ vst1q_u16(dst + i * stride + 8, dc);
+ vst1q_u16(dst + i * stride + 16, dc);
+ vst1q_u16(dst + i * stride + 24, dc);
+ }
+}
+
+static INLINE void highbd_dc_store_64xh(uint16_t *dst, ptrdiff_t stride, int h,
+ uint16x8_t dc) {
+ for (int i = 0; i < h; ++i) {
+ vst1q_u16(dst + i * stride, dc);
+ vst1q_u16(dst + i * stride + 8, dc);
+ vst1q_u16(dst + i * stride + 16, dc);
+ vst1q_u16(dst + i * stride + 24, dc);
+ vst1q_u16(dst + i * stride + 32, dc);
+ vst1q_u16(dst + i * stride + 40, dc);
+ vst1q_u16(dst + i * stride + 48, dc);
+ vst1q_u16(dst + i * stride + 56, dc);
+ }
+}
+
+static INLINE uint32x4_t horizontal_add_and_broadcast_long_u16x8(uint16x8_t a) {
+ // Need to assume input is up to 16 bits wide from dc 64x64 partial sum, so
+ // promote first.
+ const uint32x4_t b = vpaddlq_u16(a);
+#if AOM_ARCH_AARCH64
+ const uint32x4_t c = vpaddq_u32(b, b);
+ return vpaddq_u32(c, c);
+#else
+ const uint32x2_t c = vadd_u32(vget_low_u32(b), vget_high_u32(b));
+ const uint32x2_t d = vpadd_u32(c, c);
+ return vcombine_u32(d, d);
+#endif
+}
+
+static INLINE uint16x8_t highbd_dc_load_partial_sum_4(const uint16_t *left) {
+  // Nothing to do since the sum already fits in one vector, but this saves
+  // special-casing w == 4 or h == 4. The combine is zero cost for a sane
+  // compiler since vld1 already sets the top half of a vector to zero as part
+  // of the operation.
+ return vcombine_u16(vld1_u16(left), vdup_n_u16(0));
+}
+
+static INLINE uint16x8_t highbd_dc_load_partial_sum_8(const uint16_t *left) {
+  // Nothing to do since the sum already fits in one vector, but this saves
+  // special-casing w == 8 or h == 8.
+ return vld1q_u16(left);
+}
+
+static INLINE uint16x8_t highbd_dc_load_partial_sum_16(const uint16_t *left) {
+ const uint16x8_t a0 = vld1q_u16(left + 0); // up to 12 bits
+ const uint16x8_t a1 = vld1q_u16(left + 8);
+ return vaddq_u16(a0, a1); // up to 13 bits
+}
+
+static INLINE uint16x8_t highbd_dc_load_partial_sum_32(const uint16_t *left) {
+ const uint16x8_t a0 = vld1q_u16(left + 0); // up to 12 bits
+ const uint16x8_t a1 = vld1q_u16(left + 8);
+ const uint16x8_t a2 = vld1q_u16(left + 16);
+ const uint16x8_t a3 = vld1q_u16(left + 24);
+ const uint16x8_t b0 = vaddq_u16(a0, a1); // up to 13 bits
+ const uint16x8_t b1 = vaddq_u16(a2, a3);
+ return vaddq_u16(b0, b1); // up to 14 bits
+}
+
+static INLINE uint16x8_t highbd_dc_load_partial_sum_64(const uint16_t *left) {
+ const uint16x8_t a0 = vld1q_u16(left + 0); // up to 12 bits
+ const uint16x8_t a1 = vld1q_u16(left + 8);
+ const uint16x8_t a2 = vld1q_u16(left + 16);
+ const uint16x8_t a3 = vld1q_u16(left + 24);
+ const uint16x8_t a4 = vld1q_u16(left + 32);
+ const uint16x8_t a5 = vld1q_u16(left + 40);
+ const uint16x8_t a6 = vld1q_u16(left + 48);
+ const uint16x8_t a7 = vld1q_u16(left + 56);
+ const uint16x8_t b0 = vaddq_u16(a0, a1); // up to 13 bits
+ const uint16x8_t b1 = vaddq_u16(a2, a3);
+ const uint16x8_t b2 = vaddq_u16(a4, a5);
+ const uint16x8_t b3 = vaddq_u16(a6, a7);
+ const uint16x8_t c0 = vaddq_u16(b0, b1); // up to 14 bits
+ const uint16x8_t c1 = vaddq_u16(b2, b3);
+ return vaddq_u16(c0, c1); // up to 15 bits
+}
+
+#define HIGHBD_DC_PREDICTOR(w, h, shift) \
+ void aom_highbd_dc_predictor_##w##x##h##_neon( \
+ uint16_t *dst, ptrdiff_t stride, const uint16_t *above, \
+ const uint16_t *left, int bd) { \
+ (void)bd; \
+ const uint16x8_t a = highbd_dc_load_partial_sum_##w(above); \
+ const uint16x8_t l = highbd_dc_load_partial_sum_##h(left); \
+ const uint32x4_t sum = \
+ horizontal_add_and_broadcast_long_u16x8(vaddq_u16(a, l)); \
+ const uint16x4_t dc0 = vrshrn_n_u32(sum, shift); \
+ highbd_dc_store_##w##xh(dst, stride, (h), vdupq_lane_u16(dc0, 0)); \
}
-#define INTRA_PRED_SQUARE(type) \
- INTRA_PRED_HIGHBD_SIZED_NEON(type, 4) \
- INTRA_PRED_HIGHBD_SIZED_NEON(type, 8) \
- INTRA_PRED_HIGHBD_SIZED_NEON(type, 16) \
- INTRA_PRED_HIGHBD_SIZED_NEON(type, 32) \
- INTRA_PRED_HIGHBD_SIZED_NEON(type, 64)
+void aom_highbd_dc_predictor_4x4_neon(uint16_t *dst, ptrdiff_t stride,
+ const uint16_t *above,
+ const uint16_t *left, int bd) {
+  // In the rectangular cases we simply extend the shorter vector to uint16x8
+  // in order to accumulate. In the 4x4 case, however, there is no shorter
+  // vector to extend, so it is beneficial to do the whole calculation in
+  // uint16x4 instead.
+ (void)bd;
+ const uint16x4_t a = vld1_u16(above); // up to 12 bits
+ const uint16x4_t l = vld1_u16(left);
+ uint16x4_t sum = vpadd_u16(a, l); // up to 13 bits
+ sum = vpadd_u16(sum, sum); // up to 14 bits
+ sum = vpadd_u16(sum, sum);
+ const uint16x4_t dc = vrshr_n_u16(sum, 3);
+ highbd_dc_store_4xh(dst, stride, 4, dc);
+}
-INTRA_PRED_SQUARE(dc)
+HIGHBD_DC_PREDICTOR(8, 8, 4)
+HIGHBD_DC_PREDICTOR(16, 16, 5)
+HIGHBD_DC_PREDICTOR(32, 32, 6)
+HIGHBD_DC_PREDICTOR(64, 64, 7)
-#undef INTRA_PRED_SQUARE
+#undef HIGHBD_DC_PREDICTOR
+
+static INLINE int divide_using_multiply_shift(int num, int shift1,
+ int multiplier, int shift2) {
+ const int interm = num >> shift1;
+ return interm * multiplier >> shift2;
+}
+
+#define HIGHBD_DC_MULTIPLIER_1X2 0xAAAB
+#define HIGHBD_DC_MULTIPLIER_1X4 0x6667
+#define HIGHBD_DC_SHIFT2 17
+
+static INLINE int highbd_dc_predictor_rect(int bw, int bh, int sum, int shift1,
+ uint32_t multiplier) {
+ return divide_using_multiply_shift(sum + ((bw + bh) >> 1), shift1, multiplier,
+ HIGHBD_DC_SHIFT2);
+}
+
+#undef HIGHBD_DC_SHIFT2
+
+#define HIGHBD_DC_PREDICTOR_RECT(w, h, q, shift, mult) \
+ void aom_highbd_dc_predictor_##w##x##h##_neon( \
+ uint16_t *dst, ptrdiff_t stride, const uint16_t *above, \
+ const uint16_t *left, int bd) { \
+ (void)bd; \
+ uint16x8_t sum_above = highbd_dc_load_partial_sum_##w(above); \
+ uint16x8_t sum_left = highbd_dc_load_partial_sum_##h(left); \
+ uint16x8_t sum_vec = vaddq_u16(sum_left, sum_above); \
+ int sum = horizontal_add_and_broadcast_long_u16x8(sum_vec)[0]; \
+ int dc0 = highbd_dc_predictor_rect((w), (h), sum, (shift), (mult)); \
+ highbd_dc_store_##w##xh(dst, stride, (h), vdup##q##_n_u16(dc0)); \
+ }
+
+HIGHBD_DC_PREDICTOR_RECT(4, 8, , 2, HIGHBD_DC_MULTIPLIER_1X2)
+HIGHBD_DC_PREDICTOR_RECT(4, 16, , 2, HIGHBD_DC_MULTIPLIER_1X4)
+HIGHBD_DC_PREDICTOR_RECT(8, 4, q, 2, HIGHBD_DC_MULTIPLIER_1X2)
+HIGHBD_DC_PREDICTOR_RECT(8, 16, q, 3, HIGHBD_DC_MULTIPLIER_1X2)
+HIGHBD_DC_PREDICTOR_RECT(8, 32, q, 3, HIGHBD_DC_MULTIPLIER_1X4)
+HIGHBD_DC_PREDICTOR_RECT(16, 4, q, 2, HIGHBD_DC_MULTIPLIER_1X4)
+HIGHBD_DC_PREDICTOR_RECT(16, 8, q, 3, HIGHBD_DC_MULTIPLIER_1X2)
+HIGHBD_DC_PREDICTOR_RECT(16, 32, q, 4, HIGHBD_DC_MULTIPLIER_1X2)
+HIGHBD_DC_PREDICTOR_RECT(16, 64, q, 4, HIGHBD_DC_MULTIPLIER_1X4)
+HIGHBD_DC_PREDICTOR_RECT(32, 8, q, 3, HIGHBD_DC_MULTIPLIER_1X4)
+HIGHBD_DC_PREDICTOR_RECT(32, 16, q, 4, HIGHBD_DC_MULTIPLIER_1X2)
+HIGHBD_DC_PREDICTOR_RECT(32, 64, q, 5, HIGHBD_DC_MULTIPLIER_1X2)
+HIGHBD_DC_PREDICTOR_RECT(64, 16, q, 4, HIGHBD_DC_MULTIPLIER_1X4)
+HIGHBD_DC_PREDICTOR_RECT(64, 32, q, 5, HIGHBD_DC_MULTIPLIER_1X2)
+
+#undef HIGHBD_DC_PREDICTOR_RECT
+#undef HIGHBD_DC_MULTIPLIER_1X2
+#undef HIGHBD_DC_MULTIPLIER_1X4
+
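divide_using_multiply_shift above replaces the division by w + h (not a power of two) with a shift plus a Q17 fixed-point reciprocal: shift1 strips the power-of-two factor and 0xAAAB ~ 2^17/3 (0x6667 ~ 2^17/5) handles the rest. A self-contained check for the 4x8 case, exhaustive over the reachable 12-bit range:

    #include <assert.h>

    static int div_mul_shift(int num, int shift1, int mult, int shift2) {
      return (num >> shift1) * mult >> shift2;  // mirrors the helper above
    }

    int main(void) {
      // 4x8: w + h = 12 = 4 * 3, bias (w + h) / 2 = 6, shift1 = 2, shift2 = 17.
      for (int sum = 0; sum <= 12 * 4095; ++sum) {
        assert(div_mul_shift(sum + 6, 2, 0xAAAB, 17) == (sum + 6) / 12);
      }
      return 0;
    }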
+// -----------------------------------------------------------------------------
+// DC_128
+
+#define HIGHBD_DC_PREDICTOR_128(w, h, q) \
+ void aom_highbd_dc_128_predictor_##w##x##h##_neon( \
+ uint16_t *dst, ptrdiff_t stride, const uint16_t *above, \
+ const uint16_t *left, int bd) { \
+ (void)above; \
+ (void)left; \
+ highbd_dc_store_##w##xh(dst, stride, (h), \
+ vdup##q##_n_u16(0x80 << (bd - 8))); \
+ }
+
+HIGHBD_DC_PREDICTOR_128(4, 4, )
+HIGHBD_DC_PREDICTOR_128(4, 8, )
+HIGHBD_DC_PREDICTOR_128(4, 16, )
+HIGHBD_DC_PREDICTOR_128(8, 4, q)
+HIGHBD_DC_PREDICTOR_128(8, 8, q)
+HIGHBD_DC_PREDICTOR_128(8, 16, q)
+HIGHBD_DC_PREDICTOR_128(8, 32, q)
+HIGHBD_DC_PREDICTOR_128(16, 4, q)
+HIGHBD_DC_PREDICTOR_128(16, 8, q)
+HIGHBD_DC_PREDICTOR_128(16, 16, q)
+HIGHBD_DC_PREDICTOR_128(16, 32, q)
+HIGHBD_DC_PREDICTOR_128(16, 64, q)
+HIGHBD_DC_PREDICTOR_128(32, 8, q)
+HIGHBD_DC_PREDICTOR_128(32, 16, q)
+HIGHBD_DC_PREDICTOR_128(32, 32, q)
+HIGHBD_DC_PREDICTOR_128(32, 64, q)
+HIGHBD_DC_PREDICTOR_128(64, 16, q)
+HIGHBD_DC_PREDICTOR_128(64, 32, q)
+HIGHBD_DC_PREDICTOR_128(64, 64, q)
+
+#undef HIGHBD_DC_PREDICTOR_128
+
+// -----------------------------------------------------------------------------
+// DC_LEFT
+
+static INLINE uint32x4_t highbd_dc_load_sum_4(const uint16_t *left) {
+ const uint16x4_t a = vld1_u16(left); // up to 12 bits
+ const uint16x4_t b = vpadd_u16(a, a); // up to 13 bits
+ return vcombine_u32(vpaddl_u16(b), vdup_n_u32(0));
+}
+
+static INLINE uint32x4_t highbd_dc_load_sum_8(const uint16_t *left) {
+ return horizontal_add_and_broadcast_long_u16x8(vld1q_u16(left));
+}
+
+static INLINE uint32x4_t highbd_dc_load_sum_16(const uint16_t *left) {
+ return horizontal_add_and_broadcast_long_u16x8(
+ highbd_dc_load_partial_sum_16(left));
+}
+
+static INLINE uint32x4_t highbd_dc_load_sum_32(const uint16_t *left) {
+ return horizontal_add_and_broadcast_long_u16x8(
+ highbd_dc_load_partial_sum_32(left));
+}
+
+static INLINE uint32x4_t highbd_dc_load_sum_64(const uint16_t *left) {
+ return horizontal_add_and_broadcast_long_u16x8(
+ highbd_dc_load_partial_sum_64(left));
+}
+
+#define DC_PREDICTOR_LEFT(w, h, shift, q) \
+ void aom_highbd_dc_left_predictor_##w##x##h##_neon( \
+ uint16_t *dst, ptrdiff_t stride, const uint16_t *above, \
+ const uint16_t *left, int bd) { \
+ (void)above; \
+ (void)bd; \
+ const uint32x4_t sum = highbd_dc_load_sum_##h(left); \
+ const uint16x4_t dc0 = vrshrn_n_u32(sum, (shift)); \
+ highbd_dc_store_##w##xh(dst, stride, (h), vdup##q##_lane_u16(dc0, 0)); \
+ }
+
+DC_PREDICTOR_LEFT(4, 4, 2, )
+DC_PREDICTOR_LEFT(4, 8, 3, )
+DC_PREDICTOR_LEFT(4, 16, 4, )
+DC_PREDICTOR_LEFT(8, 4, 2, q)
+DC_PREDICTOR_LEFT(8, 8, 3, q)
+DC_PREDICTOR_LEFT(8, 16, 4, q)
+DC_PREDICTOR_LEFT(8, 32, 5, q)
+DC_PREDICTOR_LEFT(16, 4, 2, q)
+DC_PREDICTOR_LEFT(16, 8, 3, q)
+DC_PREDICTOR_LEFT(16, 16, 4, q)
+DC_PREDICTOR_LEFT(16, 32, 5, q)
+DC_PREDICTOR_LEFT(16, 64, 6, q)
+DC_PREDICTOR_LEFT(32, 8, 3, q)
+DC_PREDICTOR_LEFT(32, 16, 4, q)
+DC_PREDICTOR_LEFT(32, 32, 5, q)
+DC_PREDICTOR_LEFT(32, 64, 6, q)
+DC_PREDICTOR_LEFT(64, 16, 4, q)
+DC_PREDICTOR_LEFT(64, 32, 5, q)
+DC_PREDICTOR_LEFT(64, 64, 6, q)
+
+#undef DC_PREDICTOR_LEFT
+
+// -----------------------------------------------------------------------------
+// DC_TOP
+
+#define DC_PREDICTOR_TOP(w, h, shift, q) \
+ void aom_highbd_dc_top_predictor_##w##x##h##_neon( \
+ uint16_t *dst, ptrdiff_t stride, const uint16_t *above, \
+ const uint16_t *left, int bd) { \
+ (void)bd; \
+ (void)left; \
+ const uint32x4_t sum = highbd_dc_load_sum_##w(above); \
+ const uint16x4_t dc0 = vrshrn_n_u32(sum, (shift)); \
+ highbd_dc_store_##w##xh(dst, stride, (h), vdup##q##_lane_u16(dc0, 0)); \
+ }
+
+DC_PREDICTOR_TOP(4, 4, 2, )
+DC_PREDICTOR_TOP(4, 8, 2, )
+DC_PREDICTOR_TOP(4, 16, 2, )
+DC_PREDICTOR_TOP(8, 4, 3, q)
+DC_PREDICTOR_TOP(8, 8, 3, q)
+DC_PREDICTOR_TOP(8, 16, 3, q)
+DC_PREDICTOR_TOP(8, 32, 3, q)
+DC_PREDICTOR_TOP(16, 4, 4, q)
+DC_PREDICTOR_TOP(16, 8, 4, q)
+DC_PREDICTOR_TOP(16, 16, 4, q)
+DC_PREDICTOR_TOP(16, 32, 4, q)
+DC_PREDICTOR_TOP(16, 64, 4, q)
+DC_PREDICTOR_TOP(32, 8, 5, q)
+DC_PREDICTOR_TOP(32, 16, 5, q)
+DC_PREDICTOR_TOP(32, 32, 5, q)
+DC_PREDICTOR_TOP(32, 64, 5, q)
+DC_PREDICTOR_TOP(64, 16, 6, q)
+DC_PREDICTOR_TOP(64, 32, 6, q)
+DC_PREDICTOR_TOP(64, 64, 6, q)
+
+#undef DC_PREDICTOR_TOP
// -----------------------------------------------------------------------------
// V_PRED
@@ -213,6 +480,170 @@
HIGHBD_V_NXM(64, 64)
// -----------------------------------------------------------------------------
+// H_PRED
+
+static INLINE void highbd_h_store_4x4(uint16_t *dst, ptrdiff_t stride,
+ uint16x4_t left) {
+ vst1_u16(dst + 0 * stride, vdup_lane_u16(left, 0));
+ vst1_u16(dst + 1 * stride, vdup_lane_u16(left, 1));
+ vst1_u16(dst + 2 * stride, vdup_lane_u16(left, 2));
+ vst1_u16(dst + 3 * stride, vdup_lane_u16(left, 3));
+}
+
+static INLINE void highbd_h_store_8x4(uint16_t *dst, ptrdiff_t stride,
+ uint16x4_t left) {
+ vst1q_u16(dst + 0 * stride, vdupq_lane_u16(left, 0));
+ vst1q_u16(dst + 1 * stride, vdupq_lane_u16(left, 1));
+ vst1q_u16(dst + 2 * stride, vdupq_lane_u16(left, 2));
+ vst1q_u16(dst + 3 * stride, vdupq_lane_u16(left, 3));
+}
+
+static INLINE void highbd_h_store_16x1(uint16_t *dst, uint16x8_t left) {
+ vst1q_u16(dst + 0, left);
+ vst1q_u16(dst + 8, left);
+}
+
+static INLINE void highbd_h_store_16x4(uint16_t *dst, ptrdiff_t stride,
+ uint16x4_t left) {
+ highbd_h_store_16x1(dst + 0 * stride, vdupq_lane_u16(left, 0));
+ highbd_h_store_16x1(dst + 1 * stride, vdupq_lane_u16(left, 1));
+ highbd_h_store_16x1(dst + 2 * stride, vdupq_lane_u16(left, 2));
+ highbd_h_store_16x1(dst + 3 * stride, vdupq_lane_u16(left, 3));
+}
+
+static INLINE void highbd_h_store_32x1(uint16_t *dst, uint16x8_t left) {
+ vst1q_u16(dst + 0, left);
+ vst1q_u16(dst + 8, left);
+ vst1q_u16(dst + 16, left);
+ vst1q_u16(dst + 24, left);
+}
+
+static INLINE void highbd_h_store_32x4(uint16_t *dst, ptrdiff_t stride,
+ uint16x4_t left) {
+ highbd_h_store_32x1(dst + 0 * stride, vdupq_lane_u16(left, 0));
+ highbd_h_store_32x1(dst + 1 * stride, vdupq_lane_u16(left, 1));
+ highbd_h_store_32x1(dst + 2 * stride, vdupq_lane_u16(left, 2));
+ highbd_h_store_32x1(dst + 3 * stride, vdupq_lane_u16(left, 3));
+}
+
+static INLINE void highbd_h_store_64x1(uint16_t *dst, uint16x8_t left) {
+ vst1q_u16(dst + 0, left);
+ vst1q_u16(dst + 8, left);
+ vst1q_u16(dst + 16, left);
+ vst1q_u16(dst + 24, left);
+ vst1q_u16(dst + 32, left);
+ vst1q_u16(dst + 40, left);
+ vst1q_u16(dst + 48, left);
+ vst1q_u16(dst + 56, left);
+}
+
+static INLINE void highbd_h_store_64x4(uint16_t *dst, ptrdiff_t stride,
+ uint16x4_t left) {
+ highbd_h_store_64x1(dst + 0 * stride, vdupq_lane_u16(left, 0));
+ highbd_h_store_64x1(dst + 1 * stride, vdupq_lane_u16(left, 1));
+ highbd_h_store_64x1(dst + 2 * stride, vdupq_lane_u16(left, 2));
+ highbd_h_store_64x1(dst + 3 * stride, vdupq_lane_u16(left, 3));
+}
+
+void aom_highbd_h_predictor_4x4_neon(uint16_t *dst, ptrdiff_t stride,
+ const uint16_t *above,
+ const uint16_t *left, int bd) {
+ (void)above;
+ (void)bd;
+ highbd_h_store_4x4(dst, stride, vld1_u16(left));
+}
+
+void aom_highbd_h_predictor_4x8_neon(uint16_t *dst, ptrdiff_t stride,
+ const uint16_t *above,
+ const uint16_t *left, int bd) {
+ (void)above;
+ (void)bd;
+ uint16x8_t l = vld1q_u16(left);
+ highbd_h_store_4x4(dst + 0 * stride, stride, vget_low_u16(l));
+ highbd_h_store_4x4(dst + 4 * stride, stride, vget_high_u16(l));
+}
+
+void aom_highbd_h_predictor_8x4_neon(uint16_t *dst, ptrdiff_t stride,
+ const uint16_t *above,
+ const uint16_t *left, int bd) {
+ (void)above;
+ (void)bd;
+ highbd_h_store_8x4(dst, stride, vld1_u16(left));
+}
+
+void aom_highbd_h_predictor_8x8_neon(uint16_t *dst, ptrdiff_t stride,
+ const uint16_t *above,
+ const uint16_t *left, int bd) {
+ (void)above;
+ (void)bd;
+ uint16x8_t l = vld1q_u16(left);
+ highbd_h_store_8x4(dst + 0 * stride, stride, vget_low_u16(l));
+ highbd_h_store_8x4(dst + 4 * stride, stride, vget_high_u16(l));
+}
+
+void aom_highbd_h_predictor_16x4_neon(uint16_t *dst, ptrdiff_t stride,
+ const uint16_t *above,
+ const uint16_t *left, int bd) {
+ (void)above;
+ (void)bd;
+ highbd_h_store_16x4(dst, stride, vld1_u16(left));
+}
+
+void aom_highbd_h_predictor_16x8_neon(uint16_t *dst, ptrdiff_t stride,
+ const uint16_t *above,
+ const uint16_t *left, int bd) {
+ (void)above;
+ (void)bd;
+ uint16x8_t l = vld1q_u16(left);
+ highbd_h_store_16x4(dst + 0 * stride, stride, vget_low_u16(l));
+ highbd_h_store_16x4(dst + 4 * stride, stride, vget_high_u16(l));
+}
+
+void aom_highbd_h_predictor_32x8_neon(uint16_t *dst, ptrdiff_t stride,
+ const uint16_t *above,
+ const uint16_t *left, int bd) {
+ (void)above;
+ (void)bd;
+ uint16x8_t l = vld1q_u16(left);
+ highbd_h_store_32x4(dst + 0 * stride, stride, vget_low_u16(l));
+ highbd_h_store_32x4(dst + 4 * stride, stride, vget_high_u16(l));
+}
+
+// For cases where height >= 16 we use pairs of loads to get LDP instructions.
+#define HIGHBD_H_WXH_LARGE(w, h) \
+ void aom_highbd_h_predictor_##w##x##h##_neon( \
+ uint16_t *dst, ptrdiff_t stride, const uint16_t *above, \
+ const uint16_t *left, int bd) { \
+ (void)above; \
+ (void)bd; \
+ for (int i = 0; i < (h) / 16; ++i) { \
+ uint16x8_t l0 = vld1q_u16(left + 0); \
+ uint16x8_t l1 = vld1q_u16(left + 8); \
+ highbd_h_store_##w##x4(dst + 0 * stride, stride, vget_low_u16(l0)); \
+ highbd_h_store_##w##x4(dst + 4 * stride, stride, vget_high_u16(l0)); \
+ highbd_h_store_##w##x4(dst + 8 * stride, stride, vget_low_u16(l1)); \
+ highbd_h_store_##w##x4(dst + 12 * stride, stride, vget_high_u16(l1)); \
+ left += 16; \
+ dst += 16 * stride; \
+ } \
+ }
+
+HIGHBD_H_WXH_LARGE(4, 16)
+HIGHBD_H_WXH_LARGE(8, 16)
+HIGHBD_H_WXH_LARGE(8, 32)
+HIGHBD_H_WXH_LARGE(16, 16)
+HIGHBD_H_WXH_LARGE(16, 32)
+HIGHBD_H_WXH_LARGE(16, 64)
+HIGHBD_H_WXH_LARGE(32, 16)
+HIGHBD_H_WXH_LARGE(32, 32)
+HIGHBD_H_WXH_LARGE(32, 64)
+HIGHBD_H_WXH_LARGE(64, 16)
+HIGHBD_H_WXH_LARGE(64, 32)
+HIGHBD_H_WXH_LARGE(64, 64)
+
+#undef HIGHBD_H_WXH_LARGE
+
+// -----------------------------------------------------------------------------
// PAETH
static INLINE void highbd_paeth_4or8_x_h_neon(uint16_t *dest, ptrdiff_t stride,
diff --git a/aom_dsp/arm/highbd_loopfilter_neon.c b/aom_dsp/arm/highbd_loopfilter_neon.c
index 0b720ce..2b5128e 100644
--- a/aom_dsp/arm/highbd_loopfilter_neon.c
+++ b/aom_dsp/arm/highbd_loopfilter_neon.c
@@ -247,12 +247,12 @@
filter4_masks(p0q0, p1q1, hev_thresh, outer_mask, inner_thresh, &hev_mask,
&needs_filter4_mask);
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
if (vaddv_u16(needs_filter4_mask) == 0) {
// None of the values will be filtered.
return;
}
-#endif // defined(__aarch64__)
+#endif // AOM_ARCH_AARCH64
// Copy the masks to the high bits for packed comparisons later.
const uint16x8_t hev_mask_8 = vcombine_u16(hev_mask, hev_mask);
@@ -313,12 +313,12 @@
filter4_masks(p0q0, p1q1, hev_thresh, outer_mask, inner_thresh, &hev_mask,
&needs_filter4_mask);
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
if (vaddv_u16(needs_filter4_mask) == 0) {
// None of the values will be filtered.
return;
}
-#endif // defined(__aarch64__)
+#endif // AOM_ARCH_AARCH64
// Copy the masks to the high bits for packed comparisons later.
const uint16x8_t hev_mask_8 = vcombine_u16(hev_mask, hev_mask);
@@ -437,12 +437,12 @@
filter6_masks(p2q2, p1q1, p0q0, hev_thresh, outer_mask, inner_thresh, bd,
&needs_filter_mask, &is_flat3_mask, &hev_mask);
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
if (vaddv_u16(needs_filter_mask) == 0) {
// None of the values will be filtered.
return;
}
-#endif // defined(__aarch64__)
+#endif // AOM_ARCH_AARCH64
// Copy the masks to the high bits for packed comparisons later.
const uint16x8_t hev_mask_8 = vcombine_u16(hev_mask, hev_mask);
@@ -528,12 +528,12 @@
filter6_masks(p2q2, p1q1, p0q0, hev_thresh, outer_mask, inner_thresh, bd,
&needs_filter_mask, &is_flat3_mask, &hev_mask);
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
if (vaddv_u16(needs_filter_mask) == 0) {
// None of the values will be filtered.
return;
}
-#endif // defined(__aarch64__)
+#endif // AOM_ARCH_AARCH64
// Copy the masks to the high bits for packed comparisons later.
const uint16x8_t hev_mask_8 = vcombine_u16(hev_mask, hev_mask);
@@ -684,12 +684,12 @@
filter8_masks(p3q3, p2q2, p1q1, p0q0, hev_thresh, outer_mask, inner_thresh,
bd, &needs_filter_mask, &is_flat4_mask, &hev_mask);
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
if (vaddv_u16(needs_filter_mask) == 0) {
// None of the values will be filtered.
return;
}
-#endif // defined(__aarch64__)
+#endif // AOM_ARCH_AARCH64
// Copy the masks to the high bits for packed comparisons later.
const uint16x8_t hev_mask_8 = vcombine_u16(hev_mask, hev_mask);
@@ -783,12 +783,12 @@
filter8_masks(p3q3, p2q2, p1q1, p0q0, hev_thresh, outer_mask, inner_thresh,
bd, &needs_filter_mask, &is_flat4_mask, &hev_mask);
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
if (vaddv_u16(needs_filter_mask) == 0) {
// None of the values will be filtered.
return;
}
-#endif // defined(__aarch64__)
+#endif // AOM_ARCH_AARCH64
// Copy the masks to the high bits for packed comparisons later.
const uint16x8_t hev_mask_8 = vcombine_u16(hev_mask, hev_mask);
@@ -976,12 +976,12 @@
filter8_masks(p3q3, p2q2, p1q1, p0q0, hev_thresh, outer_mask, inner_thresh,
bd, &needs_filter_mask, &is_flat4_mask, &hev_mask);
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
if (vaddv_u16(needs_filter_mask) == 0) {
// None of the values will be filtered.
return;
}
-#endif // defined(__aarch64__)
+#endif // AOM_ARCH_AARCH64
const uint16x8_t p4q4 = vcombine_u16(src[2], src[11]);
const uint16x8_t p5q5 = vcombine_u16(src[1], src[12]);
const uint16x8_t p6q6 = vcombine_u16(src[0], src[13]);
@@ -1083,7 +1083,7 @@
static INLINE uint16x8x2_t permute_acdb64(const uint16x8_t ab,
const uint16x8_t cd) {
uint16x8x2_t acdb;
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
// a[b] <- [c]d
acdb.val[0] = vreinterpretq_u16_u64(
vtrn1q_u64(vreinterpretq_u64_u16(ab), vreinterpretq_u64_u16(cd)));
@@ -1099,7 +1099,7 @@
acdb.val[1] = vreinterpretq_u16_u64(
vsetq_lane_u64(vgetq_lane_u64(vreinterpretq_u64_u16(cd), 1),
vreinterpretq_u64_u16(ab), 0));
-#endif // defined(__aarch64__)
+#endif // AOM_ARCH_AARCH64
return acdb;
}
@@ -1144,12 +1144,12 @@
filter8_masks(p3q3, p2q2, p1q1, p0q0, hev_thresh, outer_mask, inner_thresh,
bd, &needs_filter_mask, &is_flat4_mask, &hev_mask);
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
if (vaddv_u16(needs_filter_mask) == 0) {
// None of the values will be filtered.
return;
}
-#endif // defined(__aarch64__)
+#endif // AOM_ARCH_AARCH64
const uint16x8_t p4q4 =
vcombine_u16(vget_low_u16(src_p[3]), vget_high_u16(src_q[0]));
const uint16x8_t p5q5 =
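Beyond the macro rename (AOM_ARCH_AARCH64 comes from config/aom_config.h rather than probing __aarch64__ directly), the guard exists because vaddv_u16 is an AArch64-only horizontal reduction. A sketch of an equivalent early-exit test that also covers 32-bit ARM (the helper name is illustrative):

    #include <arm_neon.h>

    static inline int all_lanes_zero_u16x4(uint16x4_t mask) {
    #if defined(__aarch64__)
      return vaddv_u16(mask) == 0;  // single-instruction reduction
    #else
      // View the 64-bit vector as one scalar; zero iff every lane is zero.
      return vget_lane_u64(vreinterpret_u64_u16(mask), 0) == 0;
    #endif
    }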
diff --git a/aom_dsp/arm/highbd_quantize_neon.c b/aom_dsp/arm/highbd_quantize_neon.c
index 927e13c..77a7aac 100644
--- a/aom_dsp/arm/highbd_quantize_neon.c
+++ b/aom_dsp/arm/highbd_quantize_neon.c
@@ -12,6 +12,8 @@
#include <arm_neon.h>
#include <assert.h>
+#include "config/aom_config.h"
+
#include "aom_dsp/quantize.h"
#include "aom_dsp/arm/mem_neon.h"
@@ -19,7 +21,7 @@
#include "av1/encoder/av1_quantize.h"
static INLINE uint32_t sum_abs_coeff(const uint32x4_t a) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vaddvq_u32(a);
#else
const uint64x2_t b = vpaddlq_u32(a);
@@ -98,7 +100,7 @@
}
static INLINE uint16_t get_max_eob(int16x8_t v_eobmax) {
-#ifdef __aarch64__
+#if AOM_ARCH_AARCH64
return (uint16_t)vmaxvq_s16(v_eobmax);
#else
const int16x4_t v_eobmax_3210 =
@@ -116,7 +118,7 @@
}
static INLINE uint16_t get_min_eob(int16x8_t v_eobmin) {
-#ifdef __aarch64__
+#if AOM_ARCH_AARCH64
return (uint16_t)vminvq_s16(v_eobmin);
#else
const int16x4_t v_eobmin_3210 =
diff --git a/aom_dsp/arm/highbd_sad4d_neon.c b/aom_dsp/arm/highbd_sad4d_neon.c
new file mode 100644
index 0000000..f2fda36
--- /dev/null
+++ b/aom_dsp/arm/highbd_sad4d_neon.c
@@ -0,0 +1,360 @@
+/*
+ * Copyright (c) 2023 The WebM project authors. All Rights Reserved.
+ * Copyright (c) 2023, Alliance for Open Media. All rights reserved
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#include <arm_neon.h>
+
+#include "config/aom_config.h"
+#include "config/aom_dsp_rtcd.h"
+
+#include "aom/aom_integer.h"
+#include "aom_dsp/arm/mem_neon.h"
+#include "aom_dsp/arm/sum_neon.h"
+
+static INLINE void highbd_sad4xhx4d_small_neon(const uint8_t *src_ptr,
+ int src_stride,
+ const uint8_t *const ref_ptr[4],
+ int ref_stride, uint32_t res[4],
+ int h) {
+ const uint16_t *src16_ptr = CONVERT_TO_SHORTPTR(src_ptr);
+ const uint16_t *ref16_ptr0 = CONVERT_TO_SHORTPTR(ref_ptr[0]);
+ const uint16_t *ref16_ptr1 = CONVERT_TO_SHORTPTR(ref_ptr[1]);
+ const uint16_t *ref16_ptr2 = CONVERT_TO_SHORTPTR(ref_ptr[2]);
+ const uint16_t *ref16_ptr3 = CONVERT_TO_SHORTPTR(ref_ptr[3]);
+
+ uint32x4_t sum[4] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0),
+ vdupq_n_u32(0) };
+
+ int i = 0;
+ do {
+ uint16x4_t s = vld1_u16(src16_ptr + i * src_stride);
+ uint16x4_t r0 = vld1_u16(ref16_ptr0 + i * ref_stride);
+ uint16x4_t r1 = vld1_u16(ref16_ptr1 + i * ref_stride);
+ uint16x4_t r2 = vld1_u16(ref16_ptr2 + i * ref_stride);
+ uint16x4_t r3 = vld1_u16(ref16_ptr3 + i * ref_stride);
+
+ sum[0] = vabal_u16(sum[0], s, r0);
+ sum[1] = vabal_u16(sum[1], s, r1);
+ sum[2] = vabal_u16(sum[2], s, r2);
+ sum[3] = vabal_u16(sum[3], s, r3);
+
+ } while (++i < h);
+
+ vst1q_u32(res, horizontal_add_4d_u32x4(sum));
+}
+
+static INLINE void highbd_sad8xhx4d_small_neon(const uint8_t *src_ptr,
+ int src_stride,
+ const uint8_t *const ref_ptr[4],
+ int ref_stride, uint32_t res[4],
+ int h) {
+ const uint16_t *src16_ptr = CONVERT_TO_SHORTPTR(src_ptr);
+ const uint16_t *ref16_ptr0 = CONVERT_TO_SHORTPTR(ref_ptr[0]);
+ const uint16_t *ref16_ptr1 = CONVERT_TO_SHORTPTR(ref_ptr[1]);
+ const uint16_t *ref16_ptr2 = CONVERT_TO_SHORTPTR(ref_ptr[2]);
+ const uint16_t *ref16_ptr3 = CONVERT_TO_SHORTPTR(ref_ptr[3]);
+
+ uint16x8_t sum[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0) };
+ uint32x4_t sum_u32[4];
+
+ int i = 0;
+ do {
+ uint16x8_t s = vld1q_u16(src16_ptr + i * src_stride);
+
+ sum[0] = vabaq_u16(sum[0], s, vld1q_u16(ref16_ptr0 + i * ref_stride));
+ sum[1] = vabaq_u16(sum[1], s, vld1q_u16(ref16_ptr1 + i * ref_stride));
+ sum[2] = vabaq_u16(sum[2], s, vld1q_u16(ref16_ptr2 + i * ref_stride));
+ sum[3] = vabaq_u16(sum[3], s, vld1q_u16(ref16_ptr3 + i * ref_stride));
+
+ } while (++i < h);
+
+ sum_u32[0] = vpaddlq_u16(sum[0]);
+ sum_u32[1] = vpaddlq_u16(sum[1]);
+ sum_u32[2] = vpaddlq_u16(sum[2]);
+ sum_u32[3] = vpaddlq_u16(sum[3]);
+ vst1q_u32(res, horizontal_add_4d_u32x4(sum_u32));
+}
+
+static INLINE void sad8_neon(uint16x8_t src, uint16x8_t ref,
+ uint32x4_t *const sad_sum) {
+ uint16x8_t abs_diff = vabdq_u16(src, ref);
+ *sad_sum = vpadalq_u16(*sad_sum, abs_diff);
+}
+
+static INLINE void highbd_sad8xhx4d_large_neon(const uint8_t *src_ptr,
+ int src_stride,
+ const uint8_t *const ref_ptr[4],
+ int ref_stride, uint32_t res[4],
+ int h) {
+ const uint16_t *src16_ptr = CONVERT_TO_SHORTPTR(src_ptr);
+ const uint16_t *ref16_ptr0 = CONVERT_TO_SHORTPTR(ref_ptr[0]);
+ const uint16_t *ref16_ptr1 = CONVERT_TO_SHORTPTR(ref_ptr[1]);
+ const uint16_t *ref16_ptr2 = CONVERT_TO_SHORTPTR(ref_ptr[2]);
+ const uint16_t *ref16_ptr3 = CONVERT_TO_SHORTPTR(ref_ptr[3]);
+
+ uint32x4_t sum[4] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0),
+ vdupq_n_u32(0) };
+
+ int i = 0;
+ do {
+ uint16x8_t s = vld1q_u16(src16_ptr + i * src_stride);
+ sad8_neon(s, vld1q_u16(ref16_ptr0 + i * ref_stride), &sum[0]);
+ sad8_neon(s, vld1q_u16(ref16_ptr1 + i * ref_stride), &sum[1]);
+ sad8_neon(s, vld1q_u16(ref16_ptr2 + i * ref_stride), &sum[2]);
+ sad8_neon(s, vld1q_u16(ref16_ptr3 + i * ref_stride), &sum[3]);
+
+ } while (++i < h);
+
+ vst1q_u32(res, horizontal_add_4d_u32x4(sum));
+}
+
+static INLINE void highbd_sad16xhx4d_large_neon(const uint8_t *src_ptr,
+ int src_stride,
+ const uint8_t *const ref_ptr[4],
+ int ref_stride, uint32_t res[4],
+ int h) {
+ const uint16_t *src16_ptr = CONVERT_TO_SHORTPTR(src_ptr);
+ const uint16_t *ref16_ptr0 = CONVERT_TO_SHORTPTR(ref_ptr[0]);
+ const uint16_t *ref16_ptr1 = CONVERT_TO_SHORTPTR(ref_ptr[1]);
+ const uint16_t *ref16_ptr2 = CONVERT_TO_SHORTPTR(ref_ptr[2]);
+ const uint16_t *ref16_ptr3 = CONVERT_TO_SHORTPTR(ref_ptr[3]);
+
+ uint32x4_t sum_lo[4] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0),
+ vdupq_n_u32(0) };
+ uint32x4_t sum_hi[4] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0),
+ vdupq_n_u32(0) };
+ uint32x4_t sum[4];
+
+ int i = 0;
+ do {
+ uint16x8_t s0 = vld1q_u16(src16_ptr + i * src_stride);
+ sad8_neon(s0, vld1q_u16(ref16_ptr0 + i * ref_stride), &sum_lo[0]);
+ sad8_neon(s0, vld1q_u16(ref16_ptr1 + i * ref_stride), &sum_lo[1]);
+ sad8_neon(s0, vld1q_u16(ref16_ptr2 + i * ref_stride), &sum_lo[2]);
+ sad8_neon(s0, vld1q_u16(ref16_ptr3 + i * ref_stride), &sum_lo[3]);
+
+ uint16x8_t s1 = vld1q_u16(src16_ptr + i * src_stride + 8);
+ sad8_neon(s1, vld1q_u16(ref16_ptr0 + i * ref_stride + 8), &sum_hi[0]);
+ sad8_neon(s1, vld1q_u16(ref16_ptr1 + i * ref_stride + 8), &sum_hi[1]);
+ sad8_neon(s1, vld1q_u16(ref16_ptr2 + i * ref_stride + 8), &sum_hi[2]);
+ sad8_neon(s1, vld1q_u16(ref16_ptr3 + i * ref_stride + 8), &sum_hi[3]);
+
+ } while (++i < h);
+
+ sum[0] = vaddq_u32(sum_lo[0], sum_hi[0]);
+ sum[1] = vaddq_u32(sum_lo[1], sum_hi[1]);
+ sum[2] = vaddq_u32(sum_lo[2], sum_hi[2]);
+ sum[3] = vaddq_u32(sum_lo[3], sum_hi[3]);
+
+ vst1q_u32(res, horizontal_add_4d_u32x4(sum));
+}
+
+static INLINE void highbd_sadwxhx4d_large_neon(const uint8_t *src_ptr,
+ int src_stride,
+ const uint8_t *const ref_ptr[4],
+ int ref_stride, uint32_t res[4],
+ int w, int h) {
+ const uint16_t *src16_ptr = CONVERT_TO_SHORTPTR(src_ptr);
+ const uint16_t *ref16_ptr0 = CONVERT_TO_SHORTPTR(ref_ptr[0]);
+ const uint16_t *ref16_ptr1 = CONVERT_TO_SHORTPTR(ref_ptr[1]);
+ const uint16_t *ref16_ptr2 = CONVERT_TO_SHORTPTR(ref_ptr[2]);
+ const uint16_t *ref16_ptr3 = CONVERT_TO_SHORTPTR(ref_ptr[3]);
+
+ uint32x4_t sum_lo[4] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0),
+ vdupq_n_u32(0) };
+ uint32x4_t sum_hi[4] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0),
+ vdupq_n_u32(0) };
+ uint32x4_t sum[4];
+
+ int i = 0;
+ do {
+ int j = 0;
+ do {
+ uint16x8_t s0 = vld1q_u16(src16_ptr + i * src_stride + j);
+ sad8_neon(s0, vld1q_u16(ref16_ptr0 + i * ref_stride + j), &sum_lo[0]);
+ sad8_neon(s0, vld1q_u16(ref16_ptr1 + i * ref_stride + j), &sum_lo[1]);
+ sad8_neon(s0, vld1q_u16(ref16_ptr2 + i * ref_stride + j), &sum_lo[2]);
+ sad8_neon(s0, vld1q_u16(ref16_ptr3 + i * ref_stride + j), &sum_lo[3]);
+
+ uint16x8_t s1 = vld1q_u16(src16_ptr + i * src_stride + j + 8);
+ sad8_neon(s1, vld1q_u16(ref16_ptr0 + i * ref_stride + j + 8), &sum_hi[0]);
+ sad8_neon(s1, vld1q_u16(ref16_ptr1 + i * ref_stride + j + 8), &sum_hi[1]);
+ sad8_neon(s1, vld1q_u16(ref16_ptr2 + i * ref_stride + j + 8), &sum_hi[2]);
+ sad8_neon(s1, vld1q_u16(ref16_ptr3 + i * ref_stride + j + 8), &sum_hi[3]);
+
+ uint16x8_t s2 = vld1q_u16(src16_ptr + i * src_stride + j + 16);
+ sad8_neon(s2, vld1q_u16(ref16_ptr0 + i * ref_stride + j + 16),
+ &sum_lo[0]);
+ sad8_neon(s2, vld1q_u16(ref16_ptr1 + i * ref_stride + j + 16),
+ &sum_lo[1]);
+ sad8_neon(s2, vld1q_u16(ref16_ptr2 + i * ref_stride + j + 16),
+ &sum_lo[2]);
+ sad8_neon(s2, vld1q_u16(ref16_ptr3 + i * ref_stride + j + 16),
+ &sum_lo[3]);
+
+ uint16x8_t s3 = vld1q_u16(src16_ptr + i * src_stride + j + 24);
+ sad8_neon(s3, vld1q_u16(ref16_ptr0 + i * ref_stride + j + 24),
+ &sum_hi[0]);
+ sad8_neon(s3, vld1q_u16(ref16_ptr1 + i * ref_stride + j + 24),
+ &sum_hi[1]);
+ sad8_neon(s3, vld1q_u16(ref16_ptr2 + i * ref_stride + j + 24),
+ &sum_hi[2]);
+ sad8_neon(s3, vld1q_u16(ref16_ptr3 + i * ref_stride + j + 24),
+ &sum_hi[3]);
+
+ j += 32;
+ } while (j < w);
+
+ } while (++i < h);
+
+ sum[0] = vaddq_u32(sum_lo[0], sum_hi[0]);
+ sum[1] = vaddq_u32(sum_lo[1], sum_hi[1]);
+ sum[2] = vaddq_u32(sum_lo[2], sum_hi[2]);
+ sum[3] = vaddq_u32(sum_lo[3], sum_hi[3]);
+
+ vst1q_u32(res, horizontal_add_4d_u32x4(sum));
+}
+
+static INLINE void highbd_sad128xhx4d_large_neon(
+ const uint8_t *src_ptr, int src_stride, const uint8_t *const ref_ptr[4],
+ int ref_stride, uint32_t res[4], int h) {
+ highbd_sadwxhx4d_large_neon(src_ptr, src_stride, ref_ptr, ref_stride, res,
+ 128, h);
+}
+
+static INLINE void highbd_sad64xhx4d_large_neon(const uint8_t *src_ptr,
+ int src_stride,
+ const uint8_t *const ref_ptr[4],
+ int ref_stride, uint32_t res[4],
+ int h) {
+ highbd_sadwxhx4d_large_neon(src_ptr, src_stride, ref_ptr, ref_stride, res, 64,
+ h);
+}
+
+static INLINE void highbd_sad32xhx4d_large_neon(const uint8_t *src_ptr,
+ int src_stride,
+ const uint8_t *const ref_ptr[4],
+ int ref_stride, uint32_t res[4],
+ int h) {
+ highbd_sadwxhx4d_large_neon(src_ptr, src_stride, ref_ptr, ref_stride, res, 32,
+ h);
+}
+
+#define HBD_SAD_WXH_4D_SMALL_NEON(w, h) \
+ void aom_highbd_sad##w##x##h##x4d_neon( \
+ const uint8_t *src, int src_stride, const uint8_t *const ref_array[4], \
+ int ref_stride, uint32_t sad_array[4]) { \
+ highbd_sad##w##xhx4d_small_neon(src, src_stride, ref_array, ref_stride, \
+ sad_array, (h)); \
+ }
+
+#define HBD_SAD_WXH_4D_LARGE_NEON(w, h) \
+ void aom_highbd_sad##w##x##h##x4d_neon( \
+ const uint8_t *src, int src_stride, const uint8_t *const ref_array[4], \
+ int ref_stride, uint32_t sad_array[4]) { \
+ highbd_sad##w##xhx4d_large_neon(src, src_stride, ref_array, ref_stride, \
+ sad_array, (h)); \
+ }
+
+HBD_SAD_WXH_4D_SMALL_NEON(4, 4)
+HBD_SAD_WXH_4D_SMALL_NEON(4, 8)
+
+HBD_SAD_WXH_4D_SMALL_NEON(8, 4)
+HBD_SAD_WXH_4D_SMALL_NEON(8, 8)
+HBD_SAD_WXH_4D_SMALL_NEON(8, 16)
+
+HBD_SAD_WXH_4D_LARGE_NEON(16, 8)
+HBD_SAD_WXH_4D_LARGE_NEON(16, 16)
+HBD_SAD_WXH_4D_LARGE_NEON(16, 32)
+
+HBD_SAD_WXH_4D_LARGE_NEON(32, 16)
+HBD_SAD_WXH_4D_LARGE_NEON(32, 32)
+HBD_SAD_WXH_4D_LARGE_NEON(32, 64)
+
+HBD_SAD_WXH_4D_LARGE_NEON(64, 32)
+HBD_SAD_WXH_4D_LARGE_NEON(64, 64)
+HBD_SAD_WXH_4D_LARGE_NEON(64, 128)
+
+HBD_SAD_WXH_4D_LARGE_NEON(128, 64)
+HBD_SAD_WXH_4D_LARGE_NEON(128, 128)
+
+#if !CONFIG_REALTIME_ONLY
+HBD_SAD_WXH_4D_SMALL_NEON(4, 16)
+
+HBD_SAD_WXH_4D_LARGE_NEON(8, 32)
+
+HBD_SAD_WXH_4D_LARGE_NEON(16, 4)
+HBD_SAD_WXH_4D_LARGE_NEON(16, 64)
+
+HBD_SAD_WXH_4D_LARGE_NEON(32, 8)
+
+HBD_SAD_WXH_4D_LARGE_NEON(64, 16)
+#endif // !CONFIG_REALTIME_ONLY
+
+#define HBD_SAD_SKIP_WXH_4D_SMALL_NEON(w, h) \
+ void aom_highbd_sad_skip_##w##x##h##x4d_neon( \
+ const uint8_t *src, int src_stride, const uint8_t *const ref_array[4], \
+ int ref_stride, uint32_t sad_array[4]) { \
+ highbd_sad##w##xhx4d_small_neon(src, 2 * src_stride, ref_array, \
+ 2 * ref_stride, sad_array, ((h) >> 1)); \
+ sad_array[0] <<= 1; \
+ sad_array[1] <<= 1; \
+ sad_array[2] <<= 1; \
+ sad_array[3] <<= 1; \
+ }
+
+#define HBD_SAD_SKIP_WXH_4D_LARGE_NEON(w, h) \
+ void aom_highbd_sad_skip_##w##x##h##x4d_neon( \
+ const uint8_t *src, int src_stride, const uint8_t *const ref_array[4], \
+ int ref_stride, uint32_t sad_array[4]) { \
+ highbd_sad##w##xhx4d_large_neon(src, 2 * src_stride, ref_array, \
+ 2 * ref_stride, sad_array, ((h) >> 1)); \
+ sad_array[0] <<= 1; \
+ sad_array[1] <<= 1; \
+ sad_array[2] <<= 1; \
+ sad_array[3] <<= 1; \
+ }
+
+HBD_SAD_SKIP_WXH_4D_SMALL_NEON(4, 4)
+HBD_SAD_SKIP_WXH_4D_SMALL_NEON(4, 8)
+
+HBD_SAD_SKIP_WXH_4D_SMALL_NEON(8, 4)
+HBD_SAD_SKIP_WXH_4D_SMALL_NEON(8, 8)
+HBD_SAD_SKIP_WXH_4D_SMALL_NEON(8, 16)
+
+HBD_SAD_SKIP_WXH_4D_LARGE_NEON(16, 8)
+HBD_SAD_SKIP_WXH_4D_LARGE_NEON(16, 16)
+HBD_SAD_SKIP_WXH_4D_LARGE_NEON(16, 32)
+
+HBD_SAD_SKIP_WXH_4D_LARGE_NEON(32, 16)
+HBD_SAD_SKIP_WXH_4D_LARGE_NEON(32, 32)
+HBD_SAD_SKIP_WXH_4D_LARGE_NEON(32, 64)
+
+HBD_SAD_SKIP_WXH_4D_LARGE_NEON(64, 32)
+HBD_SAD_SKIP_WXH_4D_LARGE_NEON(64, 64)
+HBD_SAD_SKIP_WXH_4D_LARGE_NEON(64, 128)
+
+HBD_SAD_SKIP_WXH_4D_LARGE_NEON(128, 64)
+HBD_SAD_SKIP_WXH_4D_LARGE_NEON(128, 128)
+
+#if !CONFIG_REALTIME_ONLY
+HBD_SAD_SKIP_WXH_4D_SMALL_NEON(4, 16)
+
+HBD_SAD_SKIP_WXH_4D_SMALL_NEON(8, 32)
+
+HBD_SAD_SKIP_WXH_4D_LARGE_NEON(16, 4)
+HBD_SAD_SKIP_WXH_4D_LARGE_NEON(16, 64)
+
+HBD_SAD_SKIP_WXH_4D_LARGE_NEON(32, 8)
+
+HBD_SAD_SKIP_WXH_4D_LARGE_NEON(64, 16)
+#endif // !CONFIG_REALTIME_ONLY
diff --git a/aom_dsp/arm/highbd_sad_neon.c b/aom_dsp/arm/highbd_sad_neon.c
new file mode 100644
index 0000000..919eb55
--- /dev/null
+++ b/aom_dsp/arm/highbd_sad_neon.c
@@ -0,0 +1,285 @@
+/*
+ * Copyright (c) 2023 The WebM project authors. All Rights Reserved.
+ * Copyright (c) 2023, Alliance for Open Media. All rights reserved
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#include <arm_neon.h>
+
+#include "config/aom_config.h"
+#include "config/aom_dsp_rtcd.h"
+
+#include "aom/aom_integer.h"
+#include "aom_dsp/arm/mem_neon.h"
+#include "aom_dsp/arm/sum_neon.h"
+
+static INLINE uint32_t highbd_sad4xh_small_neon(const uint8_t *src_ptr,
+ int src_stride,
+ const uint8_t *ref_ptr,
+ int ref_stride, int h) {
+ const uint16_t *src16_ptr = CONVERT_TO_SHORTPTR(src_ptr);
+ const uint16_t *ref16_ptr = CONVERT_TO_SHORTPTR(ref_ptr);
+ uint32x4_t sum = vdupq_n_u32(0);
+
+ int i = h;
+ do {
+ uint16x4_t s = vld1_u16(src16_ptr);
+ uint16x4_t r = vld1_u16(ref16_ptr);
+ sum = vabal_u16(sum, s, r);
+
+ src16_ptr += src_stride;
+ ref16_ptr += ref_stride;
+ } while (--i != 0);
+
+ return horizontal_add_u32x4(sum);
+}
+
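+// For 8-wide blocks the "small" variant accumulates absolute differences in
+// 16-bit lanes. With 12-bit input each row adds at most 4095 per lane, so up
+// to 16 rows (16 * 4095 = 65520) fit without overflow.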
+static INLINE uint32_t highbd_sad8xh_small_neon(const uint8_t *src_ptr,
+ int src_stride,
+ const uint8_t *ref_ptr,
+ int ref_stride, int h) {
+ const uint16_t *src16_ptr = CONVERT_TO_SHORTPTR(src_ptr);
+ const uint16_t *ref16_ptr = CONVERT_TO_SHORTPTR(ref_ptr);
+ uint16x8_t sum = vdupq_n_u16(0);
+
+ int i = h;
+ do {
+ uint16x8_t s = vld1q_u16(src16_ptr);
+ uint16x8_t r = vld1q_u16(ref16_ptr);
+ sum = vabaq_u16(sum, s, r);
+
+ src16_ptr += src_stride;
+ ref16_ptr += ref_stride;
+ } while (--i != 0);
+
+ return horizontal_add_u16x8(sum);
+}
+
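+// The "large" variant widens each row's absolute differences into 32-bit
+// accumulators (vpadalq_u16), so taller blocks cannot overflow.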
+static INLINE uint32_t highbd_sad8xh_large_neon(const uint8_t *src_ptr,
+ int src_stride,
+ const uint8_t *ref_ptr,
+ int ref_stride, int h) {
+ const uint16_t *src16_ptr = CONVERT_TO_SHORTPTR(src_ptr);
+ const uint16_t *ref16_ptr = CONVERT_TO_SHORTPTR(ref_ptr);
+ uint32x4_t sum_u32 = vdupq_n_u32(0);
+
+ int i = h;
+ do {
+ uint16x8_t s = vld1q_u16(src16_ptr);
+ uint16x8_t r = vld1q_u16(ref16_ptr);
+ uint16x8_t sum_u16 = vabdq_u16(s, r);
+ sum_u32 = vpadalq_u16(sum_u32, sum_u16);
+
+ src16_ptr += src_stride;
+ ref16_ptr += ref_stride;
+ } while (--i != 0);
+
+ return horizontal_add_u32x4(sum_u32);
+}
+
+static INLINE uint32_t highbd_sad16xh_large_neon(const uint8_t *src_ptr,
+ int src_stride,
+ const uint8_t *ref_ptr,
+ int ref_stride, int h) {
+ const uint16_t *src16_ptr = CONVERT_TO_SHORTPTR(src_ptr);
+ const uint16_t *ref16_ptr = CONVERT_TO_SHORTPTR(ref_ptr);
+ uint32x4_t sum[2] = { vdupq_n_u32(0), vdupq_n_u32(0) };
+
+ int i = h;
+ do {
+ uint16x8_t s0 = vld1q_u16(src16_ptr);
+ uint16x8_t r0 = vld1q_u16(ref16_ptr);
+ uint16x8_t diff0 = vabdq_u16(s0, r0);
+ sum[0] = vpadalq_u16(sum[0], diff0);
+
+ uint16x8_t s1 = vld1q_u16(src16_ptr + 8);
+ uint16x8_t r1 = vld1q_u16(ref16_ptr + 8);
+ uint16x8_t diff1 = vabdq_u16(s1, r1);
+ sum[1] = vpadalq_u16(sum[1], diff1);
+
+ src16_ptr += src_stride;
+ ref16_ptr += ref_stride;
+ } while (--i != 0);
+
+ sum[0] = vaddq_u32(sum[0], sum[1]);
+ return horizontal_add_u32x4(sum[0]);
+}
+
+static INLINE uint32_t highbd_sadwxh_large_neon(const uint8_t *src_ptr,
+ int src_stride,
+ const uint8_t *ref_ptr,
+ int ref_stride, int w, int h) {
+ const uint16_t *src16_ptr = CONVERT_TO_SHORTPTR(src_ptr);
+ const uint16_t *ref16_ptr = CONVERT_TO_SHORTPTR(ref_ptr);
+ uint32x4_t sum[4] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0),
+ vdupq_n_u32(0) };
+
+ int i = h;
+ do {
+ int j = 0;
+ do {
+ uint16x8_t s0 = vld1q_u16(src16_ptr + j);
+ uint16x8_t r0 = vld1q_u16(ref16_ptr + j);
+ uint16x8_t diff0 = vabdq_u16(s0, r0);
+ sum[0] = vpadalq_u16(sum[0], diff0);
+
+ uint16x8_t s1 = vld1q_u16(src16_ptr + j + 8);
+ uint16x8_t r1 = vld1q_u16(ref16_ptr + j + 8);
+ uint16x8_t diff1 = vabdq_u16(s1, r1);
+ sum[1] = vpadalq_u16(sum[1], diff1);
+
+ uint16x8_t s2 = vld1q_u16(src16_ptr + j + 16);
+ uint16x8_t r2 = vld1q_u16(ref16_ptr + j + 16);
+ uint16x8_t diff2 = vabdq_u16(s2, r2);
+ sum[2] = vpadalq_u16(sum[2], diff2);
+
+ uint16x8_t s3 = vld1q_u16(src16_ptr + j + 24);
+ uint16x8_t r3 = vld1q_u16(ref16_ptr + j + 24);
+ uint16x8_t diff3 = vabdq_u16(s3, r3);
+ sum[3] = vpadalq_u16(sum[3], diff3);
+
+ j += 32;
+ } while (j < w);
+
+ src16_ptr += src_stride;
+ ref16_ptr += ref_stride;
+ } while (--i != 0);
+
+ sum[0] = vaddq_u32(sum[0], sum[1]);
+ sum[2] = vaddq_u32(sum[2], sum[3]);
+ sum[0] = vaddq_u32(sum[0], sum[2]);
+
+ return horizontal_add_u32x4(sum[0]);
+}
+
+static INLINE unsigned int highbd_sad128xh_large_neon(const uint8_t *src_ptr,
+ int src_stride,
+ const uint8_t *ref_ptr,
+ int ref_stride, int h) {
+ return highbd_sadwxh_large_neon(src_ptr, src_stride, ref_ptr, ref_stride, 128,
+ h);
+}
+
+static INLINE unsigned int highbd_sad64xh_large_neon(const uint8_t *src_ptr,
+ int src_stride,
+ const uint8_t *ref_ptr,
+ int ref_stride, int h) {
+ return highbd_sadwxh_large_neon(src_ptr, src_stride, ref_ptr, ref_stride, 64,
+ h);
+}
+
+static INLINE unsigned int highbd_sad32xh_large_neon(const uint8_t *src_ptr,
+ int src_stride,
+ const uint8_t *ref_ptr,
+ int ref_stride, int h) {
+ return highbd_sadwxh_large_neon(src_ptr, src_stride, ref_ptr, ref_stride, 32,
+ h);
+}
+
+#define HBD_SAD_WXH_SMALL_NEON(w, h) \
+ unsigned int aom_highbd_sad##w##x##h##_neon( \
+ const uint8_t *src, int src_stride, const uint8_t *ref, \
+ int ref_stride) { \
+ return highbd_sad##w##xh_small_neon(src, src_stride, ref, ref_stride, \
+ (h)); \
+ }
+
+#define HBD_SAD_WXH_LARGE_NEON(w, h) \
+ unsigned int aom_highbd_sad##w##x##h##_neon( \
+ const uint8_t *src, int src_stride, const uint8_t *ref, \
+ int ref_stride) { \
+ return highbd_sad##w##xh_large_neon(src, src_stride, ref, ref_stride, \
+ (h)); \
+ }
+
+HBD_SAD_WXH_SMALL_NEON(4, 4)
+HBD_SAD_WXH_SMALL_NEON(4, 8)
+
+HBD_SAD_WXH_SMALL_NEON(8, 4)
+HBD_SAD_WXH_SMALL_NEON(8, 8)
+HBD_SAD_WXH_SMALL_NEON(8, 16)
+
+HBD_SAD_WXH_LARGE_NEON(16, 8)
+HBD_SAD_WXH_LARGE_NEON(16, 16)
+HBD_SAD_WXH_LARGE_NEON(16, 32)
+
+HBD_SAD_WXH_LARGE_NEON(32, 16)
+HBD_SAD_WXH_LARGE_NEON(32, 32)
+HBD_SAD_WXH_LARGE_NEON(32, 64)
+
+HBD_SAD_WXH_LARGE_NEON(64, 32)
+HBD_SAD_WXH_LARGE_NEON(64, 64)
+HBD_SAD_WXH_LARGE_NEON(64, 128)
+
+HBD_SAD_WXH_LARGE_NEON(128, 64)
+HBD_SAD_WXH_LARGE_NEON(128, 128)
+
+#if !CONFIG_REALTIME_ONLY
+HBD_SAD_WXH_SMALL_NEON(4, 16)
+
+HBD_SAD_WXH_LARGE_NEON(8, 32)
+
+HBD_SAD_WXH_LARGE_NEON(16, 4)
+HBD_SAD_WXH_LARGE_NEON(16, 64)
+
+HBD_SAD_WXH_LARGE_NEON(32, 8)
+
+HBD_SAD_WXH_LARGE_NEON(64, 16)
+#endif // !CONFIG_REALTIME_ONLY
+
+#define HBD_SAD_SKIP_WXH_SMALL_NEON(w, h) \
+ unsigned int aom_highbd_sad_skip_##w##x##h##_neon( \
+ const uint8_t *src, int src_stride, const uint8_t *ref, \
+ int ref_stride) { \
+ return 2 * highbd_sad##w##xh_small_neon(src, 2 * src_stride, ref, \
+ 2 * ref_stride, (h) / 2); \
+ }
+
+#define HBD_SAD_SKIP_WXH_LARGE_NEON(w, h) \
+ unsigned int aom_highbd_sad_skip_##w##x##h##_neon( \
+ const uint8_t *src, int src_stride, const uint8_t *ref, \
+ int ref_stride) { \
+ return 2 * highbd_sad##w##xh_large_neon(src, 2 * src_stride, ref, \
+ 2 * ref_stride, (h) / 2); \
+ }
+
+HBD_SAD_SKIP_WXH_SMALL_NEON(4, 4)
+HBD_SAD_SKIP_WXH_SMALL_NEON(4, 8)
+
+HBD_SAD_SKIP_WXH_SMALL_NEON(8, 4)
+HBD_SAD_SKIP_WXH_SMALL_NEON(8, 8)
+HBD_SAD_SKIP_WXH_SMALL_NEON(8, 16)
+
+HBD_SAD_SKIP_WXH_LARGE_NEON(16, 8)
+HBD_SAD_SKIP_WXH_LARGE_NEON(16, 16)
+HBD_SAD_SKIP_WXH_LARGE_NEON(16, 32)
+
+HBD_SAD_SKIP_WXH_LARGE_NEON(32, 16)
+HBD_SAD_SKIP_WXH_LARGE_NEON(32, 32)
+HBD_SAD_SKIP_WXH_LARGE_NEON(32, 64)
+
+HBD_SAD_SKIP_WXH_LARGE_NEON(64, 32)
+HBD_SAD_SKIP_WXH_LARGE_NEON(64, 64)
+HBD_SAD_SKIP_WXH_LARGE_NEON(64, 128)
+
+HBD_SAD_SKIP_WXH_LARGE_NEON(128, 64)
+HBD_SAD_SKIP_WXH_LARGE_NEON(128, 128)
+
+#if !CONFIG_REALTIME_ONLY
+HBD_SAD_SKIP_WXH_SMALL_NEON(4, 16)
+
+HBD_SAD_SKIP_WXH_SMALL_NEON(8, 32)
+
+HBD_SAD_SKIP_WXH_LARGE_NEON(16, 4)
+HBD_SAD_SKIP_WXH_LARGE_NEON(16, 64)
+
+HBD_SAD_SKIP_WXH_LARGE_NEON(32, 8)
+
+HBD_SAD_SKIP_WXH_LARGE_NEON(64, 16)
+#endif // !CONFIG_REALTIME_ONLY
diff --git a/aom_dsp/arm/highbd_variance_neon.c b/aom_dsp/arm/highbd_variance_neon.c
index 3c3877a..948f2f7 100644
--- a/aom_dsp/arm/highbd_variance_neon.c
+++ b/aom_dsp/arm/highbd_variance_neon.c
@@ -1,4 +1,5 @@
/*
+ * Copyright (c) 2023 The WebM project authors. All Rights Reserved.
* Copyright (c) 2022, Alliance for Open Media. All rights reserved
*
* This source code is subject to the terms of the BSD 2 Clause License and
@@ -16,156 +17,515 @@
#include "aom_dsp/variance.h"
#include "aom_dsp/aom_filter.h"
+#include "aom_dsp/arm/mem_neon.h"
#include "aom_dsp/arm/sum_neon.h"
-typedef void (*high_variance_fn_t)(const uint16_t *src, int src_stride,
- const uint16_t *ref, int ref_stride,
- uint32_t *sse, int *sum);
+// Process a block of width 4 two rows at a time.
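+// With 12-bit input each int16 sum lane sees one difference per two-row
+// iteration, so the tallest 4-wide block (4x16) stays in range:
+// 8 * 4095 = 32760 <= INT16_MAX.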
+static INLINE void highbd_variance_4xh_neon(const uint16_t *src_ptr,
+ int src_stride,
+ const uint16_t *ref_ptr,
+ int ref_stride, int h,
+ uint64_t *sse, int64_t *sum) {
+ int16x8_t sum_s16 = vdupq_n_s16(0);
+ int32x4_t sse_s32 = vdupq_n_s32(0);
-void aom_highbd_calc16x16var_neon(const uint16_t *src, int src_stride,
- const uint16_t *ref, int ref_stride,
- uint32_t *sse, int *sum) {
- int i, j;
- int16x8_t v_sum = vdupq_n_s16(0);
- int32x4_t v_sse_lo = vdupq_n_s32(0);
- int32x4_t v_sse_hi = vdupq_n_s32(0);
+ int i = h;
+ do {
+ const uint16x8_t s = load_unaligned_u16_4x2(src_ptr, src_stride);
+ const uint16x8_t r = load_unaligned_u16_4x2(ref_ptr, ref_stride);
- for (i = 0; i < 16; ++i) {
- for (j = 0; j < 16; j += 8) {
- const uint16x8_t v_a = vld1q_u16(&src[j]);
- const uint16x8_t v_b = vld1q_u16(&ref[j]);
- const int16x8_t sv_diff = vreinterpretq_s16_u16(vsubq_u16(v_a, v_b));
- v_sum = vaddq_s16(v_sum, sv_diff);
- v_sse_lo =
- vmlal_s16(v_sse_lo, vget_low_s16(sv_diff), vget_low_s16(sv_diff));
- v_sse_hi =
- vmlal_s16(v_sse_hi, vget_high_s16(sv_diff), vget_high_s16(sv_diff));
- }
- src += src_stride;
- ref += ref_stride;
- }
+ int16x8_t diff = vreinterpretq_s16_u16(vsubq_u16(s, r));
+ sum_s16 = vaddq_s16(sum_s16, diff);
- *sum = horizontal_add_s16x8(v_sum);
- *sse = (unsigned int)horizontal_add_s32x4(vaddq_s32(v_sse_lo, v_sse_hi));
+ sse_s32 = vmlal_s16(sse_s32, vget_low_s16(diff), vget_low_s16(diff));
+ sse_s32 = vmlal_s16(sse_s32, vget_high_s16(diff), vget_high_s16(diff));
+
+ src_ptr += 2 * src_stride;
+ ref_ptr += 2 * ref_stride;
+ i -= 2;
+ } while (i != 0);
+
+ *sum = horizontal_add_s16x8(sum_s16);
+ *sse = horizontal_add_s32x4(sse_s32);
}
-void aom_highbd_calc8x8var_neon(const uint16_t *src, int src_stride,
- const uint16_t *ref, int ref_stride,
- uint32_t *sse, int *sum) {
- int i;
- int16x8_t v_sum = vdupq_n_s16(0);
- int32x4_t v_sse_lo = vdupq_n_s32(0);
- int32x4_t v_sse_hi = vdupq_n_s32(0);
+// For 8-bit and 10-bit data, since we're using two int32x4 accumulators, all
+// block sizes can be processed in 32-bit elements: even for a 128x128 block
+// the combined per-lane total is at most 1023*1023*128*32 = 4286582784, which
+// still fits in an unsigned 32-bit element.
+static INLINE void highbd_variance_large_neon(const uint16_t *src_ptr,
+ int src_stride,
+ const uint16_t *ref_ptr,
+ int ref_stride, int w, int h,
+ uint64_t *sse, int64_t *sum) {
+ int32x4_t sum_s32 = vdupq_n_s32(0);
+ int32x4_t sse_s32[2] = { vdupq_n_s32(0), vdupq_n_s32(0) };
- for (i = 0; i < 8; ++i) {
- const uint16x8_t v_a = vld1q_u16(&src[0]);
- const uint16x8_t v_b = vld1q_u16(&ref[0]);
- const int16x8_t sv_diff = vreinterpretq_s16_u16(vsubq_u16(v_a, v_b));
- v_sum = vaddq_s16(v_sum, sv_diff);
- v_sse_lo =
- vmlal_s16(v_sse_lo, vget_low_s16(sv_diff), vget_low_s16(sv_diff));
- v_sse_hi =
- vmlal_s16(v_sse_hi, vget_high_s16(sv_diff), vget_high_s16(sv_diff));
- src += src_stride;
- ref += ref_stride;
- }
+ int i = h;
+ do {
+ int j = 0;
+ do {
+ const uint16x8_t s = vld1q_u16(src_ptr + j);
+ const uint16x8_t r = vld1q_u16(ref_ptr + j);
- *sum = horizontal_add_s16x8(v_sum);
- *sse = (unsigned int)horizontal_add_s32x4(vaddq_s32(v_sse_lo, v_sse_hi));
+ const int16x8_t diff = vreinterpretq_s16_u16(vsubq_u16(s, r));
+ sum_s32 = vpadalq_s16(sum_s32, diff);
+
+ sse_s32[0] =
+ vmlal_s16(sse_s32[0], vget_low_s16(diff), vget_low_s16(diff));
+ sse_s32[1] =
+ vmlal_s16(sse_s32[1], vget_high_s16(diff), vget_high_s16(diff));
+
+ j += 8;
+ } while (j < w);
+
+ src_ptr += src_stride;
+ ref_ptr += ref_stride;
+ } while (--i != 0);
+
+ *sum = horizontal_add_s32x4(sum_s32);
+ *sse = horizontal_long_add_u32x4(vaddq_u32(
+ vreinterpretq_u32_s32(sse_s32[0]), vreinterpretq_u32_s32(sse_s32[1])));
}
-void aom_highbd_calc4x4var_neon(const uint16_t *src, int src_stride,
- const uint16_t *ref, int ref_stride,
- uint32_t *sse, int *sum) {
- int i;
- int16x8_t v_sum = vdupq_n_s16(0);
- int32x4_t v_sse_lo = vdupq_n_s32(0);
- int32x4_t v_sse_hi = vdupq_n_s32(0);
-
- for (i = 0; i < 4; i += 2) {
- const uint16x4_t v_a_r0 = vld1_u16(&src[0]);
- const uint16x4_t v_b_r0 = vld1_u16(&ref[0]);
- const uint16x4_t v_a_r1 = vld1_u16(&src[src_stride]);
- const uint16x4_t v_b_r1 = vld1_u16(&ref[ref_stride]);
- const uint16x8_t v_a = vcombine_u16(v_a_r0, v_a_r1);
- const uint16x8_t v_b = vcombine_u16(v_b_r0, v_b_r1);
- const int16x8_t sv_diff = vreinterpretq_s16_u16(vsubq_u16(v_a, v_b));
- v_sum = vaddq_s16(v_sum, sv_diff);
- v_sse_lo =
- vmlal_s16(v_sse_lo, vget_low_s16(sv_diff), vget_low_s16(sv_diff));
- v_sse_hi =
- vmlal_s16(v_sse_hi, vget_high_s16(sv_diff), vget_high_s16(sv_diff));
- src += src_stride << 1;
- ref += ref_stride << 1;
- }
-
- *sum = horizontal_add_s16x8(v_sum);
- *sse = (unsigned int)horizontal_add_s32x4(vaddq_s32(v_sse_lo, v_sse_hi));
+static INLINE void highbd_variance_8xh_neon(const uint16_t *src, int src_stride,
+ const uint16_t *ref, int ref_stride,
+ int h, uint64_t *sse,
+ int64_t *sum) {
+ highbd_variance_large_neon(src, src_stride, ref, ref_stride, 8, h, sse, sum);
}
-static void highbd_10_variance_neon(const uint16_t *src, int src_stride,
- const uint16_t *ref, int ref_stride, int w,
- int h, uint32_t *sse, int *sum,
- high_variance_fn_t var_fn, int block_size) {
- int i, j;
- uint64_t sse_long = 0;
- int32_t sum_long = 0;
-
- for (i = 0; i < h; i += block_size) {
- for (j = 0; j < w; j += block_size) {
- unsigned int sse0;
- int sum0;
- var_fn(src + src_stride * i + j, src_stride, ref + ref_stride * i + j,
- ref_stride, &sse0, &sum0);
- sse_long += sse0;
- sum_long += sum0;
- }
- }
- *sum = ROUND_POWER_OF_TWO(sum_long, 2);
- *sse = (uint32_t)ROUND_POWER_OF_TWO(sse_long, 4);
+static INLINE void highbd_variance_16xh_neon(const uint16_t *src,
+ int src_stride,
+ const uint16_t *ref,
+ int ref_stride, int h,
+ uint64_t *sse, int64_t *sum) {
+ highbd_variance_large_neon(src, src_stride, ref, ref_stride, 16, h, sse, sum);
}
-#define VAR_FN(w, h, block_size, shift) \
- uint32_t aom_highbd_10_variance##w##x##h##_neon( \
- const uint8_t *src8, int src_stride, const uint8_t *ref8, \
- int ref_stride, uint32_t *sse) { \
- int sum; \
- int64_t var; \
- uint16_t *src = CONVERT_TO_SHORTPTR(src8); \
- uint16_t *ref = CONVERT_TO_SHORTPTR(ref8); \
- highbd_10_variance_neon( \
- src, src_stride, ref, ref_stride, w, h, sse, &sum, \
- aom_highbd_calc##block_size##x##block_size##var_neon, block_size); \
- var = (int64_t)(*sse) - (((int64_t)sum * sum) >> shift); \
- return (var >= 0) ? (uint32_t)var : 0; \
+static INLINE void highbd_variance_32xh_neon(const uint16_t *src,
+ int src_stride,
+ const uint16_t *ref,
+ int ref_stride, int h,
+ uint64_t *sse, int64_t *sum) {
+ highbd_variance_large_neon(src, src_stride, ref, ref_stride, 32, h, sse, sum);
+}
+
+static INLINE void highbd_variance_64xh_neon(const uint16_t *src,
+ int src_stride,
+ const uint16_t *ref,
+ int ref_stride, int h,
+ uint64_t *sse, int64_t *sum) {
+ highbd_variance_large_neon(src, src_stride, ref, ref_stride, 64, h, sse, sum);
+}
+
+static INLINE void highbd_variance_128xh_neon(const uint16_t *src,
+ int src_stride,
+ const uint16_t *ref,
+ int ref_stride, int h,
+ uint64_t *sse, int64_t *sum) {
+ highbd_variance_large_neon(src, src_stride, ref, ref_stride, 128, h, sse,
+ sum);
+}
+
+// For 12-bit data, we can only accumulate up to 128 elements in the sum of
+// squares (4095*4095*128 = 2146435200), and because we're using two int32x4
+// accumulators, we can only process up to 32 32-element rows (32*32/8 = 128)
+// or 16 64-element rows before we have to accumulate into 64-bit elements.
+// Therefore blocks of size 32x64, 64x32, 64x64, 64x128, 128x64, 128x128 are
+// processed in a different helper function.
+
+// Process a block of any size where the width is divisible by 8, with
+// accumulation into 64-bit elements.
+static INLINE void highbd_variance_xlarge_neon(
+ const uint16_t *src_ptr, int src_stride, const uint16_t *ref_ptr,
+ int ref_stride, int w, int h, int h_limit, uint64_t *sse, int64_t *sum) {
+ int32x4_t sum_s32 = vdupq_n_s32(0);
+ int64x2_t sse_s64 = vdupq_n_s64(0);
+
+ // 'h_limit' is the number of 'w'-width rows we can process before our 32-bit
+ // accumulator overflows. After hitting this limit we accumulate into 64-bit
+ // elements.
+ int h_tmp = h > h_limit ? h_limit : h;
+
+ int i = 0;
+ do {
+ int32x4_t sse_s32[2] = { vdupq_n_s32(0), vdupq_n_s32(0) };
+ do {
+ int j = 0;
+ do {
+ const uint16x8_t s0 = vld1q_u16(src_ptr + j);
+ const uint16x8_t r0 = vld1q_u16(ref_ptr + j);
+
+ const int16x8_t diff = vreinterpretq_s16_u16(vsubq_u16(s0, r0));
+ sum_s32 = vpadalq_s16(sum_s32, diff);
+
+ sse_s32[0] =
+ vmlal_s16(sse_s32[0], vget_low_s16(diff), vget_low_s16(diff));
+ sse_s32[1] =
+ vmlal_s16(sse_s32[1], vget_high_s16(diff), vget_high_s16(diff));
+
+ j += 8;
+ } while (j < w);
+
+ src_ptr += src_stride;
+ ref_ptr += ref_stride;
+ i++;
+ } while (i < h_tmp);
+
+ sse_s64 = vpadalq_s32(sse_s64, sse_s32[0]);
+ sse_s64 = vpadalq_s32(sse_s64, sse_s32[1]);
+ h_tmp += h_limit;
+ } while (i < h);
+
+ *sum = horizontal_add_s32x4(sum_s32);
+ *sse = (uint64_t)horizontal_add_s64x2(sse_s64);
+}
+
+static INLINE void highbd_variance_32xh_xlarge_neon(
+ const uint16_t *src, int src_stride, const uint16_t *ref, int ref_stride,
+ int h, uint64_t *sse, int64_t *sum) {
+ highbd_variance_xlarge_neon(src, src_stride, ref, ref_stride, 32, h, 32, sse,
+ sum);
+}
+
+static INLINE void highbd_variance_64xh_xlarge_neon(
+ const uint16_t *src, int src_stride, const uint16_t *ref, int ref_stride,
+ int h, uint64_t *sse, int64_t *sum) {
+ highbd_variance_xlarge_neon(src, src_stride, ref, ref_stride, 64, h, 16, sse,
+ sum);
+}
+
+static INLINE void highbd_variance_128xh_xlarge_neon(
+ const uint16_t *src, int src_stride, const uint16_t *ref, int ref_stride,
+ int h, uint64_t *sse, int64_t *sum) {
+ highbd_variance_xlarge_neon(src, src_stride, ref, ref_stride, 128, h, 8, sse,
+ sum);
+}
+
+#define HBD_VARIANCE_WXH_8_NEON(w, h) \
+ uint32_t aom_highbd_8_variance##w##x##h##_neon( \
+ const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, \
+ int ref_stride, uint32_t *sse) { \
+ int sum; \
+ uint64_t sse_long = 0; \
+ int64_t sum_long = 0; \
+ uint16_t *src = CONVERT_TO_SHORTPTR(src_ptr); \
+ uint16_t *ref = CONVERT_TO_SHORTPTR(ref_ptr); \
+ highbd_variance_##w##xh_neon(src, src_stride, ref, ref_stride, h, \
+ &sse_long, &sum_long); \
+ *sse = (uint32_t)sse_long; \
+ sum = (int)sum_long; \
+ return *sse - (uint32_t)(((int64_t)sum * sum) / (w * h)); \
}
-VAR_FN(128, 128, 16, 14)
-VAR_FN(128, 64, 16, 13)
-VAR_FN(64, 128, 16, 13)
-VAR_FN(64, 64, 16, 12)
-VAR_FN(64, 32, 16, 11)
-VAR_FN(32, 64, 16, 11)
-VAR_FN(32, 32, 16, 10)
-VAR_FN(32, 16, 16, 9)
-VAR_FN(16, 32, 16, 9)
-VAR_FN(16, 16, 16, 8)
-VAR_FN(16, 8, 8, 7)
-VAR_FN(8, 16, 8, 7)
-VAR_FN(8, 8, 8, 6)
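+// For 10-bit input the statistics are scaled back to an 8-bit range before
+// computing the variance: the sum is rounded down by 2 bits and the sse by
+// 4 bits (the 12-bit variants below use 4 and 8 bits).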
+#define HBD_VARIANCE_WXH_10_NEON(w, h) \
+ uint32_t aom_highbd_10_variance##w##x##h##_neon( \
+ const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, \
+ int ref_stride, uint32_t *sse) { \
+ int sum; \
+ int64_t var; \
+ uint64_t sse_long = 0; \
+ int64_t sum_long = 0; \
+ uint16_t *src = CONVERT_TO_SHORTPTR(src_ptr); \
+ uint16_t *ref = CONVERT_TO_SHORTPTR(ref_ptr); \
+ highbd_variance_##w##xh_neon(src, src_stride, ref, ref_stride, h, \
+ &sse_long, &sum_long); \
+ *sse = (uint32_t)ROUND_POWER_OF_TWO(sse_long, 4); \
+ sum = (int)ROUND_POWER_OF_TWO(sum_long, 2); \
+ var = (int64_t)(*sse) - (((int64_t)sum * sum) / (w * h)); \
+ return (var >= 0) ? (uint32_t)var : 0; \
+ }
-VAR_FN(16, 4, 4, 6)
-VAR_FN(4, 16, 4, 6)
+#define HBD_VARIANCE_WXH_12_NEON(w, h) \
+ uint32_t aom_highbd_12_variance##w##x##h##_neon( \
+ const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, \
+ int ref_stride, uint32_t *sse) { \
+ int sum; \
+ int64_t var; \
+ uint64_t sse_long = 0; \
+ int64_t sum_long = 0; \
+ uint16_t *src = CONVERT_TO_SHORTPTR(src_ptr); \
+ uint16_t *ref = CONVERT_TO_SHORTPTR(ref_ptr); \
+ highbd_variance_##w##xh_neon(src, src_stride, ref, ref_stride, h, \
+ &sse_long, &sum_long); \
+ *sse = (uint32_t)ROUND_POWER_OF_TWO(sse_long, 8); \
+ sum = (int)ROUND_POWER_OF_TWO(sum_long, 4); \
+ var = (int64_t)(*sse) - (((int64_t)sum * sum) / (w * h)); \
+ return (var >= 0) ? (uint32_t)var : 0; \
+ }
-VAR_FN(8, 4, 4, 5)
-VAR_FN(4, 8, 4, 5)
-VAR_FN(4, 4, 4, 4)
+#define HBD_VARIANCE_WXH_12_XLARGE_NEON(w, h) \
+ uint32_t aom_highbd_12_variance##w##x##h##_neon( \
+ const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, \
+ int ref_stride, uint32_t *sse) { \
+ int sum; \
+ int64_t var; \
+ uint64_t sse_long = 0; \
+ int64_t sum_long = 0; \
+ uint16_t *src = CONVERT_TO_SHORTPTR(src_ptr); \
+ uint16_t *ref = CONVERT_TO_SHORTPTR(ref_ptr); \
+ highbd_variance_##w##xh_xlarge_neon(src, src_stride, ref, ref_stride, h, \
+ &sse_long, &sum_long); \
+ *sse = (uint32_t)ROUND_POWER_OF_TWO(sse_long, 8); \
+ sum = (int)ROUND_POWER_OF_TWO(sum_long, 4); \
+ var = (int64_t)(*sse) - (((int64_t)sum * sum) / (w * h)); \
+ return (var >= 0) ? (uint32_t)var : 0; \
+ }
+
+// 8-bit
+HBD_VARIANCE_WXH_8_NEON(4, 4)
+HBD_VARIANCE_WXH_8_NEON(4, 8)
+
+HBD_VARIANCE_WXH_8_NEON(8, 4)
+HBD_VARIANCE_WXH_8_NEON(8, 8)
+HBD_VARIANCE_WXH_8_NEON(8, 16)
+
+HBD_VARIANCE_WXH_8_NEON(16, 8)
+HBD_VARIANCE_WXH_8_NEON(16, 16)
+HBD_VARIANCE_WXH_8_NEON(16, 32)
+
+HBD_VARIANCE_WXH_8_NEON(32, 16)
+HBD_VARIANCE_WXH_8_NEON(32, 32)
+HBD_VARIANCE_WXH_8_NEON(32, 64)
+
+HBD_VARIANCE_WXH_8_NEON(64, 32)
+HBD_VARIANCE_WXH_8_NEON(64, 64)
+HBD_VARIANCE_WXH_8_NEON(64, 128)
+
+HBD_VARIANCE_WXH_8_NEON(128, 64)
+HBD_VARIANCE_WXH_8_NEON(128, 128)
+
+// 10-bit
+HBD_VARIANCE_WXH_10_NEON(4, 4)
+HBD_VARIANCE_WXH_10_NEON(4, 8)
+
+HBD_VARIANCE_WXH_10_NEON(8, 4)
+HBD_VARIANCE_WXH_10_NEON(8, 8)
+HBD_VARIANCE_WXH_10_NEON(8, 16)
+
+HBD_VARIANCE_WXH_10_NEON(16, 8)
+HBD_VARIANCE_WXH_10_NEON(16, 16)
+HBD_VARIANCE_WXH_10_NEON(16, 32)
+
+HBD_VARIANCE_WXH_10_NEON(32, 16)
+HBD_VARIANCE_WXH_10_NEON(32, 32)
+HBD_VARIANCE_WXH_10_NEON(32, 64)
+
+HBD_VARIANCE_WXH_10_NEON(64, 32)
+HBD_VARIANCE_WXH_10_NEON(64, 64)
+HBD_VARIANCE_WXH_10_NEON(64, 128)
+
+HBD_VARIANCE_WXH_10_NEON(128, 64)
+HBD_VARIANCE_WXH_10_NEON(128, 128)
+
+// 12-bit
+HBD_VARIANCE_WXH_12_NEON(4, 4)
+HBD_VARIANCE_WXH_12_NEON(4, 8)
+
+HBD_VARIANCE_WXH_12_NEON(8, 4)
+HBD_VARIANCE_WXH_12_NEON(8, 8)
+HBD_VARIANCE_WXH_12_NEON(8, 16)
+
+HBD_VARIANCE_WXH_12_NEON(16, 8)
+HBD_VARIANCE_WXH_12_NEON(16, 16)
+HBD_VARIANCE_WXH_12_NEON(16, 32)
+
+HBD_VARIANCE_WXH_12_NEON(32, 16)
+HBD_VARIANCE_WXH_12_NEON(32, 32)
+HBD_VARIANCE_WXH_12_XLARGE_NEON(32, 64)
+
+HBD_VARIANCE_WXH_12_XLARGE_NEON(64, 32)
+HBD_VARIANCE_WXH_12_XLARGE_NEON(64, 64)
+HBD_VARIANCE_WXH_12_XLARGE_NEON(64, 128)
+
+HBD_VARIANCE_WXH_12_XLARGE_NEON(128, 64)
+HBD_VARIANCE_WXH_12_XLARGE_NEON(128, 128)
#if !CONFIG_REALTIME_ONLY
-VAR_FN(64, 16, 16, 10)
-VAR_FN(16, 64, 16, 10)
-VAR_FN(8, 32, 8, 8)
-VAR_FN(32, 8, 8, 8)
+// 8-bit
+HBD_VARIANCE_WXH_8_NEON(4, 16)
+
+HBD_VARIANCE_WXH_8_NEON(8, 32)
+
+HBD_VARIANCE_WXH_8_NEON(16, 4)
+HBD_VARIANCE_WXH_8_NEON(16, 64)
+
+HBD_VARIANCE_WXH_8_NEON(32, 8)
+
+HBD_VARIANCE_WXH_8_NEON(64, 16)
+
+// 10-bit
+HBD_VARIANCE_WXH_10_NEON(4, 16)
+
+HBD_VARIANCE_WXH_10_NEON(8, 32)
+
+HBD_VARIANCE_WXH_10_NEON(16, 4)
+HBD_VARIANCE_WXH_10_NEON(16, 64)
+
+HBD_VARIANCE_WXH_10_NEON(32, 8)
+
+HBD_VARIANCE_WXH_10_NEON(64, 16)
+
+// 12-bit
+HBD_VARIANCE_WXH_12_NEON(4, 16)
+
+HBD_VARIANCE_WXH_12_NEON(8, 32)
+
+HBD_VARIANCE_WXH_12_NEON(16, 4)
+HBD_VARIANCE_WXH_12_NEON(16, 64)
+
+HBD_VARIANCE_WXH_12_NEON(32, 8)
+
+HBD_VARIANCE_WXH_12_NEON(64, 16)
+
#endif // !CONFIG_REALTIME_ONLY
-#undef VAR_FN
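+// Squared 12-bit differences are at most 4095 * 4095 < 2^24, and the MSE
+// block sizes here are at most 16x16 (32 products per 32-bit lane), so
+// unsigned 32-bit accumulators cannot overflow.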
+static INLINE uint32_t highbd_mse_wxh_neon(const uint16_t *src_ptr,
+ int src_stride,
+ const uint16_t *ref_ptr,
+ int ref_stride, int w, int h,
+ unsigned int *sse) {
+ uint32x4_t sse_u32[2] = { vdupq_n_u32(0), vdupq_n_u32(0) };
+
+ int i = h;
+ do {
+ int j = 0;
+ do {
+ uint16x8_t s = vld1q_u16(src_ptr + j);
+ uint16x8_t r = vld1q_u16(ref_ptr + j);
+
+ uint16x8_t diff = vabdq_u16(s, r);
+
+ sse_u32[0] =
+ vmlal_u16(sse_u32[0], vget_low_u16(diff), vget_low_u16(diff));
+ sse_u32[1] =
+ vmlal_u16(sse_u32[1], vget_high_u16(diff), vget_high_u16(diff));
+
+ j += 8;
+ } while (j < w);
+
+ src_ptr += src_stride;
+ ref_ptr += ref_stride;
+ } while (--i != 0);
+
+ *sse = horizontal_add_u32x4(vaddq_u32(sse_u32[0], sse_u32[1]));
+ return *sse;
+}
+
+#if defined(__ARM_FEATURE_DOTPROD)
+
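+// For aom_highbd_8_mse* the input samples are 8-bit values held in 16-bit
+// containers, so they can be narrowed to u8 and a single dot-product
+// instruction can accumulate 16 squared differences per call.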
+static INLINE uint32_t highbd_mse8_8xh_neon(const uint16_t *src_ptr,
+ int src_stride,
+ const uint16_t *ref_ptr,
+ int ref_stride, int h,
+ unsigned int *sse) {
+ uint32x4_t sse_u32 = vdupq_n_u32(0);
+
+ int i = h / 2;
+ do {
+ uint16x8_t s0 = vld1q_u16(src_ptr);
+ src_ptr += src_stride;
+ uint16x8_t s1 = vld1q_u16(src_ptr);
+ src_ptr += src_stride;
+ uint16x8_t r0 = vld1q_u16(ref_ptr);
+ ref_ptr += ref_stride;
+ uint16x8_t r1 = vld1q_u16(ref_ptr);
+ ref_ptr += ref_stride;
+
+ uint8x16_t s = vcombine_u8(vmovn_u16(s0), vmovn_u16(s1));
+ uint8x16_t r = vcombine_u8(vmovn_u16(r0), vmovn_u16(r1));
+
+ uint8x16_t diff = vabdq_u8(s, r);
+ sse_u32 = vdotq_u32(sse_u32, diff, diff);
+ } while (--i != 0);
+
+ *sse = horizontal_add_u32x4(sse_u32);
+ return *sse;
+}
+
+static INLINE uint32_t highbd_mse8_16xh_neon(const uint16_t *src_ptr,
+ int src_stride,
+ const uint16_t *ref_ptr,
+ int ref_stride, int h,
+ unsigned int *sse) {
+ uint32x4_t sse_u32 = vdupq_n_u32(0);
+
+ int i = h;
+ do {
+ uint16x8_t s0 = vld1q_u16(src_ptr);
+ uint16x8_t s1 = vld1q_u16(src_ptr + 8);
+ uint16x8_t r0 = vld1q_u16(ref_ptr);
+ uint16x8_t r1 = vld1q_u16(ref_ptr + 8);
+
+ uint8x16_t s = vcombine_u8(vmovn_u16(s0), vmovn_u16(s1));
+ uint8x16_t r = vcombine_u8(vmovn_u16(r0), vmovn_u16(r1));
+
+ uint8x16_t diff = vabdq_u8(s, r);
+ sse_u32 = vdotq_u32(sse_u32, diff, diff);
+
+ src_ptr += src_stride;
+ ref_ptr += ref_stride;
+ } while (--i != 0);
+
+ *sse = horizontal_add_u32x4(sse_u32);
+ return *sse;
+}
+
+#else // !defined(__ARM_FEATURE_DOTPROD)
+
+static INLINE uint32_t highbd_mse8_8xh_neon(const uint16_t *src_ptr,
+ int src_stride,
+ const uint16_t *ref_ptr,
+ int ref_stride, int h,
+ unsigned int *sse) {
+ return highbd_mse_wxh_neon(src_ptr, src_stride, ref_ptr, ref_stride, 8, h,
+ sse);
+}
+
+static INLINE uint32_t highbd_mse8_16xh_neon(const uint16_t *src_ptr,
+ int src_stride,
+ const uint16_t *ref_ptr,
+ int ref_stride, int h,
+ unsigned int *sse) {
+ return highbd_mse_wxh_neon(src_ptr, src_stride, ref_ptr, ref_stride, 16, h,
+ sse);
+}
+
+#endif // defined(__ARM_FEATURE_DOTPROD)
+
+#define HIGHBD_MSE_WXH_NEON(w, h) \
+ uint32_t aom_highbd_8_mse##w##x##h##_neon( \
+ const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, \
+ int ref_stride, uint32_t *sse) { \
+ uint16_t *src = CONVERT_TO_SHORTPTR(src_ptr); \
+ uint16_t *ref = CONVERT_TO_SHORTPTR(ref_ptr); \
+ highbd_mse8_##w##xh_neon(src, src_stride, ref, ref_stride, h, sse); \
+ return *sse; \
+ } \
+ \
+ uint32_t aom_highbd_10_mse##w##x##h##_neon( \
+ const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, \
+ int ref_stride, uint32_t *sse) { \
+ uint16_t *src = CONVERT_TO_SHORTPTR(src_ptr); \
+ uint16_t *ref = CONVERT_TO_SHORTPTR(ref_ptr); \
+ highbd_mse_wxh_neon(src, src_stride, ref, ref_stride, w, h, sse); \
+ *sse = ROUND_POWER_OF_TWO(*sse, 4); \
+ return *sse; \
+ } \
+ \
+ uint32_t aom_highbd_12_mse##w##x##h##_neon( \
+ const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, \
+ int ref_stride, uint32_t *sse) { \
+ uint16_t *src = CONVERT_TO_SHORTPTR(src_ptr); \
+ uint16_t *ref = CONVERT_TO_SHORTPTR(ref_ptr); \
+ highbd_mse_wxh_neon(src, src_stride, ref, ref_stride, w, h, sse); \
+ *sse = ROUND_POWER_OF_TWO(*sse, 8); \
+ return *sse; \
+ }
+
+HIGHBD_MSE_WXH_NEON(16, 16)
+HIGHBD_MSE_WXH_NEON(16, 8)
+HIGHBD_MSE_WXH_NEON(8, 16)
+HIGHBD_MSE_WXH_NEON(8, 8)
+
+#undef HIGHBD_MSE_WXH_NEON
diff --git a/aom_dsp/arm/intrapred_neon.c b/aom_dsp/arm/intrapred_neon.c
index 8e6dc12..2161378 100644
--- a/aom_dsp/arm/intrapred_neon.c
+++ b/aom_dsp/arm/intrapred_neon.c
@@ -17,518 +17,1029 @@
#include "aom/aom_integer.h"
#include "aom_dsp/arm/mem_neon.h"
+#include "aom_dsp/arm/sum_neon.h"
#include "aom_dsp/intrapred_common.h"
//------------------------------------------------------------------------------
// DC 4x4
-// 'do_above' and 'do_left' facilitate branch removal when inlined.
-static INLINE void dc_4x4(uint8_t *dst, ptrdiff_t stride, const uint8_t *above,
- const uint8_t *left, int do_above, int do_left) {
- uint16x8_t sum_top;
- uint16x8_t sum_left;
- uint8x8_t dc0;
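+// Sum the four edge pixels with a pairwise-add cascade; the total ends up in
+// lane 0 of the returned vector.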
+static INLINE uint16x8_t dc_load_sum_4(const uint8_t *in) {
+ const uint8x8_t a = load_u8_4x1_lane0(in);
+ const uint16x4_t p0 = vpaddl_u8(a);
+ const uint16x4_t p1 = vpadd_u16(p0, p0);
+ return vcombine_u16(p1, vdup_n_u16(0));
+}
- if (do_above) {
- const uint8x8_t A = vld1_u8(above); // top row
- const uint16x4_t p0 = vpaddl_u8(A); // cascading summation of the top
- const uint16x4_t p1 = vpadd_u16(p0, p0);
- sum_top = vcombine_u16(p1, p1);
- }
-
- if (do_left) {
- const uint8x8_t L = vld1_u8(left); // left border
- const uint16x4_t p0 = vpaddl_u8(L); // cascading summation of the left
- const uint16x4_t p1 = vpadd_u16(p0, p0);
- sum_left = vcombine_u16(p1, p1);
- }
-
- if (do_above && do_left) {
- const uint16x8_t sum = vaddq_u16(sum_left, sum_top);
- dc0 = vrshrn_n_u16(sum, 3);
- } else if (do_above) {
- dc0 = vrshrn_n_u16(sum_top, 2);
- } else if (do_left) {
- dc0 = vrshrn_n_u16(sum_left, 2);
- } else {
- dc0 = vdup_n_u8(0x80);
- }
-
- {
- const uint8x8_t dc = vdup_lane_u8(dc0, 0);
- int i;
- for (i = 0; i < 4; ++i) {
- vst1_lane_u32((uint32_t *)(dst + i * stride), vreinterpret_u32_u8(dc), 0);
- }
+static INLINE void dc_store_4xh(uint8_t *dst, ptrdiff_t stride, int h,
+ uint8x8_t dc) {
+ for (int i = 0; i < h; ++i) {
+ store_u8_4x1(dst + i * stride, dc, 0);
}
}
void aom_dc_predictor_4x4_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above, const uint8_t *left) {
- dc_4x4(dst, stride, above, left, 1, 1);
+ const uint16x8_t sum_top = dc_load_sum_4(above);
+ const uint16x8_t sum_left = dc_load_sum_4(left);
+ const uint16x8_t sum = vaddq_u16(sum_left, sum_top);
+ const uint8x8_t dc0 = vrshrn_n_u16(sum, 3);
+ dc_store_4xh(dst, stride, 4, vdup_lane_u8(dc0, 0));
}
void aom_dc_left_predictor_4x4_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above, const uint8_t *left) {
+ const uint16x8_t sum_left = dc_load_sum_4(left);
+ const uint8x8_t dc0 = vrshrn_n_u16(sum_left, 2);
(void)above;
- dc_4x4(dst, stride, NULL, left, 0, 1);
+ dc_store_4xh(dst, stride, 4, vdup_lane_u8(dc0, 0));
}
void aom_dc_top_predictor_4x4_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above, const uint8_t *left) {
+ const uint16x8_t sum_top = dc_load_sum_4(above);
+ const uint8x8_t dc0 = vrshrn_n_u16(sum_top, 2);
(void)left;
- dc_4x4(dst, stride, above, NULL, 1, 0);
+ dc_store_4xh(dst, stride, 4, vdup_lane_u8(dc0, 0));
}
void aom_dc_128_predictor_4x4_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above, const uint8_t *left) {
+ const uint8x8_t dc0 = vdup_n_u8(0x80);
(void)above;
(void)left;
- dc_4x4(dst, stride, NULL, NULL, 0, 0);
+ dc_store_4xh(dst, stride, 4, dc0);
}
//------------------------------------------------------------------------------
// DC 8x8
-// 'do_above' and 'do_left' facilitate branch removal when inlined.
-static INLINE void dc_8x8(uint8_t *dst, ptrdiff_t stride, const uint8_t *above,
- const uint8_t *left, int do_above, int do_left) {
- uint16x8_t sum_top;
- uint16x8_t sum_left;
- uint8x8_t dc0;
+static INLINE uint16x8_t dc_load_sum_8(const uint8_t *in) {
+ // This isn't used in the case where we want to load both above and left
+ // vectors, since we want to avoid performing the reduction twice.
+ const uint8x8_t a = vld1_u8(in);
+ const uint16x4_t p0 = vpaddl_u8(a);
+ const uint16x4_t p1 = vpadd_u16(p0, p0);
+ const uint16x4_t p2 = vpadd_u16(p1, p1);
+ return vcombine_u16(p2, vdup_n_u16(0));
+}
- if (do_above) {
- const uint8x8_t A = vld1_u8(above); // top row
- const uint16x4_t p0 = vpaddl_u8(A); // cascading summation of the top
- const uint16x4_t p1 = vpadd_u16(p0, p0);
- const uint16x4_t p2 = vpadd_u16(p1, p1);
- sum_top = vcombine_u16(p2, p2);
- }
+static INLINE uint16x8_t horizontal_add_and_broadcast_u16x8(uint16x8_t a) {
+#if AOM_ARCH_AARCH64
+ // On AArch64 we could also use vdupq_n_u16(vaddvq_u16(a)) here to save an
+ // instruction, however the addv instruction is usually slightly more
+ // expensive than a pairwise addition, so the need for immediately
+ // broadcasting the result again seems to negate any benefit.
+ const uint16x8_t b = vpaddq_u16(a, a);
+ const uint16x8_t c = vpaddq_u16(b, b);
+ return vpaddq_u16(c, c);
+#else
+ const uint16x4_t b = vadd_u16(vget_low_u16(a), vget_high_u16(a));
+ const uint16x4_t c = vpadd_u16(b, b);
+ const uint16x4_t d = vpadd_u16(c, c);
+ return vcombine_u16(d, d);
+#endif
+}
- if (do_left) {
- const uint8x8_t L = vld1_u8(left); // left border
- const uint16x4_t p0 = vpaddl_u8(L); // cascading summation of the left
- const uint16x4_t p1 = vpadd_u16(p0, p0);
- const uint16x4_t p2 = vpadd_u16(p1, p1);
- sum_left = vcombine_u16(p2, p2);
- }
-
- if (do_above && do_left) {
- const uint16x8_t sum = vaddq_u16(sum_left, sum_top);
- dc0 = vrshrn_n_u16(sum, 4);
- } else if (do_above) {
- dc0 = vrshrn_n_u16(sum_top, 3);
- } else if (do_left) {
- dc0 = vrshrn_n_u16(sum_left, 3);
- } else {
- dc0 = vdup_n_u8(0x80);
- }
-
- {
- const uint8x8_t dc = vdup_lane_u8(dc0, 0);
- int i;
- for (i = 0; i < 8; ++i) {
- vst1_u32((uint32_t *)(dst + i * stride), vreinterpret_u32_u8(dc));
- }
+static INLINE void dc_store_8xh(uint8_t *dst, ptrdiff_t stride, int h,
+ uint8x8_t dc) {
+ for (int i = 0; i < h; ++i) {
+ vst1_u8(dst + i * stride, dc);
}
}
void aom_dc_predictor_8x8_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above, const uint8_t *left) {
- dc_8x8(dst, stride, above, left, 1, 1);
+ const uint8x8_t sum_top = vld1_u8(above);
+ const uint8x8_t sum_left = vld1_u8(left);
+ uint16x8_t sum = vaddl_u8(sum_left, sum_top);
+ sum = horizontal_add_and_broadcast_u16x8(sum);
+ const uint8x8_t dc0 = vrshrn_n_u16(sum, 4);
+ dc_store_8xh(dst, stride, 8, vdup_lane_u8(dc0, 0));
}
void aom_dc_left_predictor_8x8_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above, const uint8_t *left) {
+ const uint16x8_t sum_left = dc_load_sum_8(left);
+ const uint8x8_t dc0 = vrshrn_n_u16(sum_left, 3);
(void)above;
- dc_8x8(dst, stride, NULL, left, 0, 1);
+ dc_store_8xh(dst, stride, 8, vdup_lane_u8(dc0, 0));
}
void aom_dc_top_predictor_8x8_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above, const uint8_t *left) {
+ const uint16x8_t sum_top = dc_load_sum_8(above);
+ const uint8x8_t dc0 = vrshrn_n_u16(sum_top, 3);
(void)left;
- dc_8x8(dst, stride, above, NULL, 1, 0);
+ dc_store_8xh(dst, stride, 8, vdup_lane_u8(dc0, 0));
}
void aom_dc_128_predictor_8x8_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above, const uint8_t *left) {
+ const uint8x8_t dc0 = vdup_n_u8(0x80);
(void)above;
(void)left;
- dc_8x8(dst, stride, NULL, NULL, 0, 0);
+ dc_store_8xh(dst, stride, 8, dc0);
}
//------------------------------------------------------------------------------
// DC 16x16
-// 'do_above' and 'do_left' facilitate branch removal when inlined.
-static INLINE void dc_16x16(uint8_t *dst, ptrdiff_t stride,
- const uint8_t *above, const uint8_t *left,
- int do_above, int do_left) {
- uint16x8_t sum_top;
- uint16x8_t sum_left;
- uint8x8_t dc0;
+static INLINE uint16x8_t dc_load_partial_sum_16(const uint8_t *in) {
+ const uint8x16_t a = vld1q_u8(in);
+  // Delay the remainder of the reduction until
+  // horizontal_add_and_broadcast_u16x8, since we want to do it once rather
+  // than twice in the case we are loading both above and left.
+ return vpaddlq_u8(a);
+}
- if (do_above) {
- const uint8x16_t A = vld1q_u8(above); // top row
- const uint16x8_t p0 = vpaddlq_u8(A); // cascading summation of the top
- const uint16x4_t p1 = vadd_u16(vget_low_u16(p0), vget_high_u16(p0));
- const uint16x4_t p2 = vpadd_u16(p1, p1);
- const uint16x4_t p3 = vpadd_u16(p2, p2);
- sum_top = vcombine_u16(p3, p3);
- }
+static INLINE uint16x8_t dc_load_sum_16(const uint8_t *in) {
+ return horizontal_add_and_broadcast_u16x8(dc_load_partial_sum_16(in));
+}
- if (do_left) {
- const uint8x16_t L = vld1q_u8(left); // left row
- const uint16x8_t p0 = vpaddlq_u8(L); // cascading summation of the left
- const uint16x4_t p1 = vadd_u16(vget_low_u16(p0), vget_high_u16(p0));
- const uint16x4_t p2 = vpadd_u16(p1, p1);
- const uint16x4_t p3 = vpadd_u16(p2, p2);
- sum_left = vcombine_u16(p3, p3);
- }
-
- if (do_above && do_left) {
- const uint16x8_t sum = vaddq_u16(sum_left, sum_top);
- dc0 = vrshrn_n_u16(sum, 5);
- } else if (do_above) {
- dc0 = vrshrn_n_u16(sum_top, 4);
- } else if (do_left) {
- dc0 = vrshrn_n_u16(sum_left, 4);
- } else {
- dc0 = vdup_n_u8(0x80);
- }
-
- {
- const uint8x16_t dc = vdupq_lane_u8(dc0, 0);
- int i;
- for (i = 0; i < 16; ++i) {
- vst1q_u8(dst + i * stride, dc);
- }
+static INLINE void dc_store_16xh(uint8_t *dst, ptrdiff_t stride, int h,
+ uint8x16_t dc) {
+ for (int i = 0; i < h; ++i) {
+ vst1q_u8(dst + i * stride, dc);
}
}
void aom_dc_predictor_16x16_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above, const uint8_t *left) {
- dc_16x16(dst, stride, above, left, 1, 1);
+ const uint16x8_t sum_top = dc_load_partial_sum_16(above);
+ const uint16x8_t sum_left = dc_load_partial_sum_16(left);
+ uint16x8_t sum = vaddq_u16(sum_left, sum_top);
+ sum = horizontal_add_and_broadcast_u16x8(sum);
+ const uint8x8_t dc0 = vrshrn_n_u16(sum, 5);
+ dc_store_16xh(dst, stride, 16, vdupq_lane_u8(dc0, 0));
}
void aom_dc_left_predictor_16x16_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above,
const uint8_t *left) {
+ const uint16x8_t sum_left = dc_load_sum_16(left);
+ const uint8x8_t dc0 = vrshrn_n_u16(sum_left, 4);
(void)above;
- dc_16x16(dst, stride, NULL, left, 0, 1);
+ dc_store_16xh(dst, stride, 16, vdupq_lane_u8(dc0, 0));
}
void aom_dc_top_predictor_16x16_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above,
const uint8_t *left) {
+ const uint16x8_t sum_top = dc_load_sum_16(above);
+ const uint8x8_t dc0 = vrshrn_n_u16(sum_top, 4);
(void)left;
- dc_16x16(dst, stride, above, NULL, 1, 0);
+ dc_store_16xh(dst, stride, 16, vdupq_lane_u8(dc0, 0));
}
void aom_dc_128_predictor_16x16_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above,
const uint8_t *left) {
+ const uint8x16_t dc0 = vdupq_n_u8(0x80);
(void)above;
(void)left;
- dc_16x16(dst, stride, NULL, NULL, 0, 0);
+ dc_store_16xh(dst, stride, 16, dc0);
}
//------------------------------------------------------------------------------
// DC 32x32
-// 'do_above' and 'do_left' facilitate branch removal when inlined.
-static INLINE void dc_32x32(uint8_t *dst, ptrdiff_t stride,
- const uint8_t *above, const uint8_t *left,
- int do_above, int do_left) {
- uint16x8_t sum_top;
- uint16x8_t sum_left;
- uint8x8_t dc0;
+static INLINE uint16x8_t dc_load_partial_sum_32(const uint8_t *in) {
+ const uint8x16_t a0 = vld1q_u8(in);
+ const uint8x16_t a1 = vld1q_u8(in + 16);
+  // Delay the remainder of the reduction until
+  // horizontal_add_and_broadcast_u16x8, since we want to do it once rather
+  // than twice in the case we are loading both above and left.
+ return vpadalq_u8(vpaddlq_u8(a0), a1);
+}
- if (do_above) {
- const uint8x16_t A0 = vld1q_u8(above); // top row
- const uint8x16_t A1 = vld1q_u8(above + 16);
- const uint16x8_t p0 = vpaddlq_u8(A0); // cascading summation of the top
- const uint16x8_t p1 = vpaddlq_u8(A1);
- const uint16x8_t p2 = vaddq_u16(p0, p1);
- const uint16x4_t p3 = vadd_u16(vget_low_u16(p2), vget_high_u16(p2));
- const uint16x4_t p4 = vpadd_u16(p3, p3);
- const uint16x4_t p5 = vpadd_u16(p4, p4);
- sum_top = vcombine_u16(p5, p5);
- }
+static INLINE uint16x8_t dc_load_sum_32(const uint8_t *in) {
+ return horizontal_add_and_broadcast_u16x8(dc_load_partial_sum_32(in));
+}
- if (do_left) {
- const uint8x16_t L0 = vld1q_u8(left); // left row
- const uint8x16_t L1 = vld1q_u8(left + 16);
- const uint16x8_t p0 = vpaddlq_u8(L0); // cascading summation of the left
- const uint16x8_t p1 = vpaddlq_u8(L1);
- const uint16x8_t p2 = vaddq_u16(p0, p1);
- const uint16x4_t p3 = vadd_u16(vget_low_u16(p2), vget_high_u16(p2));
- const uint16x4_t p4 = vpadd_u16(p3, p3);
- const uint16x4_t p5 = vpadd_u16(p4, p4);
- sum_left = vcombine_u16(p5, p5);
- }
-
- if (do_above && do_left) {
- const uint16x8_t sum = vaddq_u16(sum_left, sum_top);
- dc0 = vrshrn_n_u16(sum, 6);
- } else if (do_above) {
- dc0 = vrshrn_n_u16(sum_top, 5);
- } else if (do_left) {
- dc0 = vrshrn_n_u16(sum_left, 5);
- } else {
- dc0 = vdup_n_u8(0x80);
- }
-
- {
- const uint8x16_t dc = vdupq_lane_u8(dc0, 0);
- int i;
- for (i = 0; i < 32; ++i) {
- vst1q_u8(dst + i * stride, dc);
- vst1q_u8(dst + i * stride + 16, dc);
- }
+static INLINE void dc_store_32xh(uint8_t *dst, ptrdiff_t stride, int h,
+ uint8x16_t dc) {
+ for (int i = 0; i < h; ++i) {
+ vst1q_u8(dst + i * stride, dc);
+ vst1q_u8(dst + i * stride + 16, dc);
}
}
void aom_dc_predictor_32x32_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above, const uint8_t *left) {
- dc_32x32(dst, stride, above, left, 1, 1);
+ const uint16x8_t sum_top = dc_load_partial_sum_32(above);
+ const uint16x8_t sum_left = dc_load_partial_sum_32(left);
+ uint16x8_t sum = vaddq_u16(sum_left, sum_top);
+ sum = horizontal_add_and_broadcast_u16x8(sum);
+ const uint8x8_t dc0 = vrshrn_n_u16(sum, 6);
+ dc_store_32xh(dst, stride, 32, vdupq_lane_u8(dc0, 0));
}
void aom_dc_left_predictor_32x32_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above,
const uint8_t *left) {
+ const uint16x8_t sum_left = dc_load_sum_32(left);
+ const uint8x8_t dc0 = vrshrn_n_u16(sum_left, 5);
(void)above;
- dc_32x32(dst, stride, NULL, left, 0, 1);
+ dc_store_32xh(dst, stride, 32, vdupq_lane_u8(dc0, 0));
}
void aom_dc_top_predictor_32x32_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above,
const uint8_t *left) {
+ const uint16x8_t sum_top = dc_load_sum_32(above);
+ const uint8x8_t dc0 = vrshrn_n_u16(sum_top, 5);
(void)left;
- dc_32x32(dst, stride, above, NULL, 1, 0);
+ dc_store_32xh(dst, stride, 32, vdupq_lane_u8(dc0, 0));
}
void aom_dc_128_predictor_32x32_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above,
const uint8_t *left) {
+ const uint8x16_t dc0 = vdupq_n_u8(0x80);
(void)above;
(void)left;
- dc_32x32(dst, stride, NULL, NULL, 0, 0);
+ dc_store_32xh(dst, stride, 32, dc0);
}
+//------------------------------------------------------------------------------
+// DC 64x64
+
+static INLINE uint16x8_t dc_load_partial_sum_64(const uint8_t *in) {
+ const uint8x16_t a0 = vld1q_u8(in);
+ const uint8x16_t a1 = vld1q_u8(in + 16);
+ const uint8x16_t a2 = vld1q_u8(in + 32);
+ const uint8x16_t a3 = vld1q_u8(in + 48);
+ const uint16x8_t p01 = vpadalq_u8(vpaddlq_u8(a0), a1);
+ const uint16x8_t p23 = vpadalq_u8(vpaddlq_u8(a2), a3);
+  // Delay the remainder of the reduction until
+  // horizontal_add_and_broadcast_u16x8, since we want to do it once rather
+  // than twice in the case we are loading both above and left.
+ return vaddq_u16(p01, p23);
+}
+
+static INLINE uint16x8_t dc_load_sum_64(const uint8_t *in) {
+ return horizontal_add_and_broadcast_u16x8(dc_load_partial_sum_64(in));
+}
+
+static INLINE void dc_store_64xh(uint8_t *dst, ptrdiff_t stride, int h,
+ uint8x16_t dc) {
+ for (int i = 0; i < h; ++i) {
+ vst1q_u8(dst + i * stride, dc);
+ vst1q_u8(dst + i * stride + 16, dc);
+ vst1q_u8(dst + i * stride + 32, dc);
+ vst1q_u8(dst + i * stride + 48, dc);
+ }
+}
+
+void aom_dc_predictor_64x64_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ const uint16x8_t sum_top = dc_load_partial_sum_64(above);
+ const uint16x8_t sum_left = dc_load_partial_sum_64(left);
+ uint16x8_t sum = vaddq_u16(sum_left, sum_top);
+ sum = horizontal_add_and_broadcast_u16x8(sum);
+ const uint8x8_t dc0 = vrshrn_n_u16(sum, 7);
+ dc_store_64xh(dst, stride, 64, vdupq_lane_u8(dc0, 0));
+}
+
+void aom_dc_left_predictor_64x64_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above,
+ const uint8_t *left) {
+ const uint16x8_t sum_left = dc_load_sum_64(left);
+ const uint8x8_t dc0 = vrshrn_n_u16(sum_left, 6);
+ (void)above;
+ dc_store_64xh(dst, stride, 64, vdupq_lane_u8(dc0, 0));
+}
+
+void aom_dc_top_predictor_64x64_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above,
+ const uint8_t *left) {
+ const uint16x8_t sum_top = dc_load_sum_64(above);
+ const uint8x8_t dc0 = vrshrn_n_u16(sum_top, 6);
+ (void)left;
+ dc_store_64xh(dst, stride, 64, vdupq_lane_u8(dc0, 0));
+}
+
+void aom_dc_128_predictor_64x64_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above,
+ const uint8_t *left) {
+ const uint8x16_t dc0 = vdupq_n_u8(0x80);
+ (void)above;
+ (void)left;
+ dc_store_64xh(dst, stride, 64, dc0);
+}
+
+//------------------------------------------------------------------------------
+// DC rectangular cases
+
+#define DC_MULTIPLIER_1X2 0x5556
+#define DC_MULTIPLIER_1X4 0x3334
+
+#define DC_SHIFT2 16
+
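+// Rectangular DC averages divide by w + h, which is 3 or 5 times a power of
+// two. 'shift1' removes the log2(min(w, h)) factor and a fixed-point
+// reciprocal (0x5556 ~= 2^16 / 3, 0x3334 ~= 2^16 / 5) handles the remaining
+// division by 3 or 5.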
+static INLINE int divide_using_multiply_shift(int num, int shift1,
+ int multiplier, int shift2) {
+ const int interm = num >> shift1;
+ return interm * multiplier >> shift2;
+}
+
+static INLINE int calculate_dc_from_sum(int bw, int bh, uint32_t sum,
+ int shift1, int multiplier) {
+ const int expected_dc = divide_using_multiply_shift(
+ sum + ((bw + bh) >> 1), shift1, multiplier, DC_SHIFT2);
+ assert(expected_dc < (1 << 8));
+ return expected_dc;
+}
+
+#undef DC_SHIFT2
+
+void aom_dc_predictor_4x8_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ uint8x8_t a = load_u8_4x1_lane0(above);
+ uint8x8_t l = vld1_u8(left);
+ uint32_t sum = horizontal_add_u16x8(vaddl_u8(a, l));
+ uint32_t dc = calculate_dc_from_sum(4, 8, sum, 2, DC_MULTIPLIER_1X2);
+ dc_store_4xh(dst, stride, 8, vdup_n_u8(dc));
+}
+
+void aom_dc_predictor_8x4_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ uint8x8_t a = vld1_u8(above);
+ uint8x8_t l = load_u8_4x1_lane0(left);
+ uint32_t sum = horizontal_add_u16x8(vaddl_u8(a, l));
+ uint32_t dc = calculate_dc_from_sum(8, 4, sum, 2, DC_MULTIPLIER_1X2);
+ dc_store_8xh(dst, stride, 4, vdup_n_u8(dc));
+}
+
+void aom_dc_predictor_4x16_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ uint8x8_t a = load_u8_4x1_lane0(above);
+ uint8x16_t l = vld1q_u8(left);
+ uint16x8_t sum_al = vaddw_u8(vpaddlq_u8(l), a);
+ uint32_t sum = horizontal_add_u16x8(sum_al);
+ uint32_t dc = calculate_dc_from_sum(4, 16, sum, 2, DC_MULTIPLIER_1X4);
+ dc_store_4xh(dst, stride, 16, vdup_n_u8(dc));
+}
+
+void aom_dc_predictor_16x4_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ uint8x16_t a = vld1q_u8(above);
+ uint8x8_t l = load_u8_4x1_lane0(left);
+ uint16x8_t sum_al = vaddw_u8(vpaddlq_u8(a), l);
+ uint32_t sum = horizontal_add_u16x8(sum_al);
+ uint32_t dc = calculate_dc_from_sum(16, 4, sum, 2, DC_MULTIPLIER_1X4);
+ dc_store_16xh(dst, stride, 4, vdupq_n_u8(dc));
+}
+
+void aom_dc_predictor_8x16_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ uint8x8_t a = vld1_u8(above);
+ uint8x16_t l = vld1q_u8(left);
+ uint16x8_t sum_al = vaddw_u8(vpaddlq_u8(l), a);
+ uint32_t sum = horizontal_add_u16x8(sum_al);
+ uint32_t dc = calculate_dc_from_sum(8, 16, sum, 3, DC_MULTIPLIER_1X2);
+ dc_store_8xh(dst, stride, 16, vdup_n_u8(dc));
+}
+
+void aom_dc_predictor_16x8_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ uint8x16_t a = vld1q_u8(above);
+ uint8x8_t l = vld1_u8(left);
+ uint16x8_t sum_al = vaddw_u8(vpaddlq_u8(a), l);
+ uint32_t sum = horizontal_add_u16x8(sum_al);
+ uint32_t dc = calculate_dc_from_sum(16, 8, sum, 3, DC_MULTIPLIER_1X2);
+ dc_store_16xh(dst, stride, 8, vdupq_n_u8(dc));
+}
+
+void aom_dc_predictor_8x32_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ uint8x8_t a = vld1_u8(above);
+ uint16x8_t sum_left = dc_load_partial_sum_32(left);
+ uint16x8_t sum_al = vaddw_u8(sum_left, a);
+ uint32_t sum = horizontal_add_u16x8(sum_al);
+ uint32_t dc = calculate_dc_from_sum(8, 32, sum, 3, DC_MULTIPLIER_1X4);
+ dc_store_8xh(dst, stride, 32, vdup_n_u8(dc));
+}
+
+void aom_dc_predictor_32x8_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ uint16x8_t sum_top = dc_load_partial_sum_32(above);
+ uint8x8_t l = vld1_u8(left);
+ uint16x8_t sum_al = vaddw_u8(sum_top, l);
+ uint32_t sum = horizontal_add_u16x8(sum_al);
+ uint32_t dc = calculate_dc_from_sum(32, 8, sum, 3, DC_MULTIPLIER_1X4);
+ dc_store_32xh(dst, stride, 8, vdupq_n_u8(dc));
+}
+
+void aom_dc_predictor_16x32_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ uint16x8_t sum_above = dc_load_partial_sum_16(above);
+ uint16x8_t sum_left = dc_load_partial_sum_32(left);
+ uint16x8_t sum_al = vaddq_u16(sum_left, sum_above);
+ uint32_t sum = horizontal_add_u16x8(sum_al);
+ uint32_t dc = calculate_dc_from_sum(16, 32, sum, 4, DC_MULTIPLIER_1X2);
+ dc_store_16xh(dst, stride, 32, vdupq_n_u8(dc));
+}
+
+void aom_dc_predictor_32x16_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ uint16x8_t sum_above = dc_load_partial_sum_32(above);
+ uint16x8_t sum_left = dc_load_partial_sum_16(left);
+ uint16x8_t sum_al = vaddq_u16(sum_left, sum_above);
+ uint32_t sum = horizontal_add_u16x8(sum_al);
+ uint32_t dc = calculate_dc_from_sum(32, 16, sum, 4, DC_MULTIPLIER_1X2);
+ dc_store_32xh(dst, stride, 16, vdupq_n_u8(dc));
+}
+
+void aom_dc_predictor_16x64_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ uint16x8_t sum_above = dc_load_partial_sum_16(above);
+ uint16x8_t sum_left = dc_load_partial_sum_64(left);
+ uint16x8_t sum_al = vaddq_u16(sum_left, sum_above);
+ uint32_t sum = horizontal_add_u16x8(sum_al);
+ uint32_t dc = calculate_dc_from_sum(16, 64, sum, 4, DC_MULTIPLIER_1X4);
+ dc_store_16xh(dst, stride, 64, vdupq_n_u8(dc));
+}
+
+void aom_dc_predictor_64x16_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ uint16x8_t sum_above = dc_load_partial_sum_64(above);
+ uint16x8_t sum_left = dc_load_partial_sum_16(left);
+ uint16x8_t sum_al = vaddq_u16(sum_above, sum_left);
+ uint32_t sum = horizontal_add_u16x8(sum_al);
+ uint32_t dc = calculate_dc_from_sum(64, 16, sum, 4, DC_MULTIPLIER_1X4);
+ dc_store_64xh(dst, stride, 16, vdupq_n_u8(dc));
+}
+
+void aom_dc_predictor_32x64_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ uint16x8_t sum_above = dc_load_partial_sum_32(above);
+ uint16x8_t sum_left = dc_load_partial_sum_64(left);
+ uint16x8_t sum_al = vaddq_u16(sum_above, sum_left);
+ uint32_t sum = horizontal_add_u16x8(sum_al);
+ uint32_t dc = calculate_dc_from_sum(32, 64, sum, 5, DC_MULTIPLIER_1X2);
+ dc_store_32xh(dst, stride, 64, vdupq_n_u8(dc));
+}
+
+void aom_dc_predictor_64x32_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ uint16x8_t sum_above = dc_load_partial_sum_64(above);
+ uint16x8_t sum_left = dc_load_partial_sum_32(left);
+ uint16x8_t sum_al = vaddq_u16(sum_above, sum_left);
+ uint32_t sum = horizontal_add_u16x8(sum_al);
+ uint32_t dc = calculate_dc_from_sum(64, 32, sum, 5, DC_MULTIPLIER_1X2);
+ dc_store_64xh(dst, stride, 32, vdupq_n_u8(dc));
+}
+
+#undef DC_MULTIPLIER_1X2
+#undef DC_MULTIPLIER_1X4
+
+#define DC_PREDICTOR_128(w, h, q) \
+ void aom_dc_128_predictor_##w##x##h##_neon(uint8_t *dst, ptrdiff_t stride, \
+ const uint8_t *above, \
+ const uint8_t *left) { \
+ (void)above; \
+ (void)left; \
+ dc_store_##w##xh(dst, stride, (h), vdup##q##_n_u8(0x80)); \
+ }
+
+DC_PREDICTOR_128(4, 8, )
+DC_PREDICTOR_128(4, 16, )
+DC_PREDICTOR_128(8, 4, )
+DC_PREDICTOR_128(8, 16, )
+DC_PREDICTOR_128(8, 32, )
+DC_PREDICTOR_128(16, 4, q)
+DC_PREDICTOR_128(16, 8, q)
+DC_PREDICTOR_128(16, 32, q)
+DC_PREDICTOR_128(16, 64, q)
+DC_PREDICTOR_128(32, 8, q)
+DC_PREDICTOR_128(32, 16, q)
+DC_PREDICTOR_128(32, 64, q)
+DC_PREDICTOR_128(64, 32, q)
+DC_PREDICTOR_128(64, 16, q)
+
+#undef DC_PREDICTOR_128
+
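+// 'shift' is log2(h): the rounding shift that averages the h summed left
+// pixels.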
+#define DC_PREDICTOR_LEFT(w, h, shift, q) \
+ void aom_dc_left_predictor_##w##x##h##_neon(uint8_t *dst, ptrdiff_t stride, \
+ const uint8_t *above, \
+ const uint8_t *left) { \
+ (void)above; \
+ const uint16x8_t sum = dc_load_sum_##h(left); \
+ const uint8x8_t dc0 = vrshrn_n_u16(sum, (shift)); \
+ dc_store_##w##xh(dst, stride, (h), vdup##q##_lane_u8(dc0, 0)); \
+ }
+
+DC_PREDICTOR_LEFT(4, 8, 3, )
+DC_PREDICTOR_LEFT(8, 4, 2, )
+DC_PREDICTOR_LEFT(8, 16, 4, )
+DC_PREDICTOR_LEFT(16, 8, 3, q)
+DC_PREDICTOR_LEFT(16, 32, 5, q)
+DC_PREDICTOR_LEFT(32, 16, 4, q)
+DC_PREDICTOR_LEFT(32, 64, 6, q)
+DC_PREDICTOR_LEFT(64, 32, 5, q)
+DC_PREDICTOR_LEFT(4, 16, 4, )
+DC_PREDICTOR_LEFT(16, 4, 2, q)
+DC_PREDICTOR_LEFT(8, 32, 5, )
+DC_PREDICTOR_LEFT(32, 8, 3, q)
+DC_PREDICTOR_LEFT(16, 64, 6, q)
+DC_PREDICTOR_LEFT(64, 16, 4, q)
+
+#undef DC_PREDICTOR_LEFT
+
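+// 'shift' is log2(w): the rounding shift that averages the w summed top
+// pixels.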
+#define DC_PREDICTOR_TOP(w, h, shift, q) \
+ void aom_dc_top_predictor_##w##x##h##_neon(uint8_t *dst, ptrdiff_t stride, \
+ const uint8_t *above, \
+ const uint8_t *left) { \
+ (void)left; \
+ const uint16x8_t sum = dc_load_sum_##w(above); \
+ const uint8x8_t dc0 = vrshrn_n_u16(sum, (shift)); \
+ dc_store_##w##xh(dst, stride, (h), vdup##q##_lane_u8(dc0, 0)); \
+ }
+
+DC_PREDICTOR_TOP(4, 8, 2, )
+DC_PREDICTOR_TOP(4, 16, 2, )
+DC_PREDICTOR_TOP(8, 4, 3, )
+DC_PREDICTOR_TOP(8, 16, 3, )
+DC_PREDICTOR_TOP(8, 32, 3, )
+DC_PREDICTOR_TOP(16, 4, 4, q)
+DC_PREDICTOR_TOP(16, 8, 4, q)
+DC_PREDICTOR_TOP(16, 32, 4, q)
+DC_PREDICTOR_TOP(16, 64, 4, q)
+DC_PREDICTOR_TOP(32, 8, 5, q)
+DC_PREDICTOR_TOP(32, 16, 5, q)
+DC_PREDICTOR_TOP(32, 64, 5, q)
+DC_PREDICTOR_TOP(64, 16, 6, q)
+DC_PREDICTOR_TOP(64, 32, 6, q)
+
+#undef DC_PREDICTOR_TOP
+
// -----------------------------------------------------------------------------
-void aom_d135_predictor_4x4_neon(uint8_t *dst, ptrdiff_t stride,
- const uint8_t *above, const uint8_t *left) {
- const uint8x8_t XABCD_u8 = vld1_u8(above - 1);
- const uint64x1_t XABCD = vreinterpret_u64_u8(XABCD_u8);
- const uint64x1_t ____XABC = vshl_n_u64(XABCD, 32);
- const uint32x2_t zero = vdup_n_u32(0);
- const uint32x2_t IJKL = vld1_lane_u32((const uint32_t *)left, zero, 0);
- const uint8x8_t IJKL_u8 = vreinterpret_u8_u32(IJKL);
- const uint64x1_t LKJI____ = vreinterpret_u64_u8(vrev32_u8(IJKL_u8));
- const uint64x1_t LKJIXABC = vorr_u64(LKJI____, ____XABC);
- const uint8x8_t KJIXABC_ = vreinterpret_u8_u64(vshr_n_u64(LKJIXABC, 8));
- const uint8x8_t JIXABC__ = vreinterpret_u8_u64(vshr_n_u64(LKJIXABC, 16));
- const uint8_t D = vget_lane_u8(XABCD_u8, 4);
- const uint8x8_t JIXABCD_ = vset_lane_u8(D, JIXABC__, 6);
- const uint8x8_t LKJIXABC_u8 = vreinterpret_u8_u64(LKJIXABC);
- const uint8x8_t avg1 = vhadd_u8(JIXABCD_, LKJIXABC_u8);
- const uint8x8_t avg2 = vrhadd_u8(avg1, KJIXABC_);
- const uint64x1_t avg2_u64 = vreinterpret_u64_u8(avg2);
- const uint32x2_t r3 = vreinterpret_u32_u8(avg2);
- const uint32x2_t r2 = vreinterpret_u32_u64(vshr_n_u64(avg2_u64, 8));
- const uint32x2_t r1 = vreinterpret_u32_u64(vshr_n_u64(avg2_u64, 16));
- const uint32x2_t r0 = vreinterpret_u32_u64(vshr_n_u64(avg2_u64, 24));
- vst1_lane_u32((uint32_t *)(dst + 0 * stride), r0, 0);
- vst1_lane_u32((uint32_t *)(dst + 1 * stride), r1, 0);
- vst1_lane_u32((uint32_t *)(dst + 2 * stride), r2, 0);
- vst1_lane_u32((uint32_t *)(dst + 3 * stride), r3, 0);
+static INLINE void v_store_4xh(uint8_t *dst, ptrdiff_t stride, int h,
+ uint8x8_t d0) {
+ for (int i = 0; i < h; ++i) {
+ store_u8_4x1(dst + i * stride, d0, 0);
+ }
+}
+
+static INLINE void v_store_8xh(uint8_t *dst, ptrdiff_t stride, int h,
+ uint8x8_t d0) {
+ for (int i = 0; i < h; ++i) {
+ vst1_u8(dst + i * stride, d0);
+ }
+}
+
+static INLINE void v_store_16xh(uint8_t *dst, ptrdiff_t stride, int h,
+ uint8x16_t d0) {
+ for (int i = 0; i < h; ++i) {
+ vst1q_u8(dst + i * stride, d0);
+ }
+}
+
+static INLINE void v_store_32xh(uint8_t *dst, ptrdiff_t stride, int h,
+ uint8x16_t d0, uint8x16_t d1) {
+ for (int i = 0; i < h; ++i) {
+ vst1q_u8(dst + 0, d0);
+ vst1q_u8(dst + 16, d1);
+ dst += stride;
+ }
+}
+
+static INLINE void v_store_64xh(uint8_t *dst, ptrdiff_t stride, int h,
+ uint8x16_t d0, uint8x16_t d1, uint8x16_t d2,
+ uint8x16_t d3) {
+ for (int i = 0; i < h; ++i) {
+ vst1q_u8(dst + 0, d0);
+ vst1q_u8(dst + 16, d1);
+ vst1q_u8(dst + 32, d2);
+ vst1q_u8(dst + 48, d3);
+ dst += stride;
+ }
}
void aom_v_predictor_4x4_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above, const uint8_t *left) {
- int i;
- uint32x2_t d0u32 = vdup_n_u32(0);
(void)left;
-
- d0u32 = vld1_lane_u32((const uint32_t *)above, d0u32, 0);
- for (i = 0; i < 4; i++, dst += stride)
- vst1_lane_u32((uint32_t *)dst, d0u32, 0);
+ v_store_4xh(dst, stride, 4, load_u8_4x1_lane0(above));
}
void aom_v_predictor_8x8_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above, const uint8_t *left) {
- int i;
- uint8x8_t d0u8 = vdup_n_u8(0);
(void)left;
-
- d0u8 = vld1_u8(above);
- for (i = 0; i < 8; i++, dst += stride) vst1_u8(dst, d0u8);
+ v_store_8xh(dst, stride, 8, vld1_u8(above));
}
void aom_v_predictor_16x16_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above, const uint8_t *left) {
- int i;
- uint8x16_t q0u8 = vdupq_n_u8(0);
(void)left;
-
- q0u8 = vld1q_u8(above);
- for (i = 0; i < 16; i++, dst += stride) vst1q_u8(dst, q0u8);
+ v_store_16xh(dst, stride, 16, vld1q_u8(above));
}
void aom_v_predictor_32x32_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above, const uint8_t *left) {
- int i;
- uint8x16_t q0u8 = vdupq_n_u8(0);
- uint8x16_t q1u8 = vdupq_n_u8(0);
+ const uint8x16_t d0 = vld1q_u8(above);
+ const uint8x16_t d1 = vld1q_u8(above + 16);
(void)left;
+ v_store_32xh(dst, stride, 32, d0, d1);
+}
- q0u8 = vld1q_u8(above);
- q1u8 = vld1q_u8(above + 16);
- for (i = 0; i < 32; i++, dst += stride) {
- vst1q_u8(dst, q0u8);
- vst1q_u8(dst + 16, q1u8);
- }
+void aom_v_predictor_4x8_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ (void)left;
+ v_store_4xh(dst, stride, 8, load_u8_4x1_lane0(above));
+}
+
+void aom_v_predictor_4x16_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ (void)left;
+ v_store_4xh(dst, stride, 16, load_u8_4x1_lane0(above));
+}
+
+void aom_v_predictor_8x4_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ (void)left;
+ v_store_8xh(dst, stride, 4, vld1_u8(above));
+}
+
+void aom_v_predictor_8x16_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ (void)left;
+ v_store_8xh(dst, stride, 16, vld1_u8(above));
+}
+
+void aom_v_predictor_8x32_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ (void)left;
+ v_store_8xh(dst, stride, 32, vld1_u8(above));
+}
+
+void aom_v_predictor_16x4_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ (void)left;
+ v_store_16xh(dst, stride, 4, vld1q_u8(above));
+}
+
+void aom_v_predictor_16x8_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ (void)left;
+ v_store_16xh(dst, stride, 8, vld1q_u8(above));
+}
+
+void aom_v_predictor_16x32_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ (void)left;
+ v_store_16xh(dst, stride, 32, vld1q_u8(above));
+}
+
+void aom_v_predictor_16x64_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ (void)left;
+ v_store_16xh(dst, stride, 64, vld1q_u8(above));
+}
+
+void aom_v_predictor_32x8_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ const uint8x16_t d0 = vld1q_u8(above);
+ const uint8x16_t d1 = vld1q_u8(above + 16);
+ (void)left;
+ v_store_32xh(dst, stride, 8, d0, d1);
+}
+
+void aom_v_predictor_32x16_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ const uint8x16_t d0 = vld1q_u8(above);
+ const uint8x16_t d1 = vld1q_u8(above + 16);
+ (void)left;
+ v_store_32xh(dst, stride, 16, d0, d1);
+}
+
+void aom_v_predictor_32x64_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ const uint8x16_t d0 = vld1q_u8(above);
+ const uint8x16_t d1 = vld1q_u8(above + 16);
+ (void)left;
+ v_store_32xh(dst, stride, 64, d0, d1);
+}
+
+void aom_v_predictor_64x16_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ const uint8x16_t d0 = vld1q_u8(above);
+ const uint8x16_t d1 = vld1q_u8(above + 16);
+ const uint8x16_t d2 = vld1q_u8(above + 32);
+ const uint8x16_t d3 = vld1q_u8(above + 48);
+ (void)left;
+ v_store_64xh(dst, stride, 16, d0, d1, d2, d3);
+}
+
+void aom_v_predictor_64x32_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ const uint8x16_t d0 = vld1q_u8(above);
+ const uint8x16_t d1 = vld1q_u8(above + 16);
+ const uint8x16_t d2 = vld1q_u8(above + 32);
+ const uint8x16_t d3 = vld1q_u8(above + 48);
+ (void)left;
+ v_store_64xh(dst, stride, 32, d0, d1, d2, d3);
+}
+
+void aom_v_predictor_64x64_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ const uint8x16_t d0 = vld1q_u8(above);
+ const uint8x16_t d1 = vld1q_u8(above + 16);
+ const uint8x16_t d2 = vld1q_u8(above + 32);
+ const uint8x16_t d3 = vld1q_u8(above + 48);
+ (void)left;
+ v_store_64xh(dst, stride, 64, d0, d1, d2, d3);
+}
+
+// -----------------------------------------------------------------------------
+
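+// Horizontal prediction broadcasts each left-column pixel across its whole
+// row. The helpers below unroll eight rows at a time, one vdup(q)_lane per
+// row.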
+static INLINE void h_store_4x8(uint8_t *dst, ptrdiff_t stride, uint8x8_t d0) {
+ store_u8_4x1(dst + 0 * stride, vdup_lane_u8(d0, 0), 0);
+ store_u8_4x1(dst + 1 * stride, vdup_lane_u8(d0, 1), 0);
+ store_u8_4x1(dst + 2 * stride, vdup_lane_u8(d0, 2), 0);
+ store_u8_4x1(dst + 3 * stride, vdup_lane_u8(d0, 3), 0);
+ store_u8_4x1(dst + 4 * stride, vdup_lane_u8(d0, 4), 0);
+ store_u8_4x1(dst + 5 * stride, vdup_lane_u8(d0, 5), 0);
+ store_u8_4x1(dst + 6 * stride, vdup_lane_u8(d0, 6), 0);
+ store_u8_4x1(dst + 7 * stride, vdup_lane_u8(d0, 7), 0);
+}
+
+static INLINE void h_store_8x8(uint8_t *dst, ptrdiff_t stride, uint8x8_t d0) {
+ vst1_u8(dst + 0 * stride, vdup_lane_u8(d0, 0));
+ vst1_u8(dst + 1 * stride, vdup_lane_u8(d0, 1));
+ vst1_u8(dst + 2 * stride, vdup_lane_u8(d0, 2));
+ vst1_u8(dst + 3 * stride, vdup_lane_u8(d0, 3));
+ vst1_u8(dst + 4 * stride, vdup_lane_u8(d0, 4));
+ vst1_u8(dst + 5 * stride, vdup_lane_u8(d0, 5));
+ vst1_u8(dst + 6 * stride, vdup_lane_u8(d0, 6));
+ vst1_u8(dst + 7 * stride, vdup_lane_u8(d0, 7));
+}
+
+static INLINE void h_store_16x8(uint8_t *dst, ptrdiff_t stride, uint8x8_t d0) {
+ vst1q_u8(dst + 0 * stride, vdupq_lane_u8(d0, 0));
+ vst1q_u8(dst + 1 * stride, vdupq_lane_u8(d0, 1));
+ vst1q_u8(dst + 2 * stride, vdupq_lane_u8(d0, 2));
+ vst1q_u8(dst + 3 * stride, vdupq_lane_u8(d0, 3));
+ vst1q_u8(dst + 4 * stride, vdupq_lane_u8(d0, 4));
+ vst1q_u8(dst + 5 * stride, vdupq_lane_u8(d0, 5));
+ vst1q_u8(dst + 6 * stride, vdupq_lane_u8(d0, 6));
+ vst1q_u8(dst + 7 * stride, vdupq_lane_u8(d0, 7));
+}
+
+static INLINE void h_store_32x8(uint8_t *dst, ptrdiff_t stride, uint8x8_t d0) {
+ vst1q_u8(dst + 0, vdupq_lane_u8(d0, 0));
+ vst1q_u8(dst + 16, vdupq_lane_u8(d0, 0));
+ dst += stride;
+ vst1q_u8(dst + 0, vdupq_lane_u8(d0, 1));
+ vst1q_u8(dst + 16, vdupq_lane_u8(d0, 1));
+ dst += stride;
+ vst1q_u8(dst + 0, vdupq_lane_u8(d0, 2));
+ vst1q_u8(dst + 16, vdupq_lane_u8(d0, 2));
+ dst += stride;
+ vst1q_u8(dst + 0, vdupq_lane_u8(d0, 3));
+ vst1q_u8(dst + 16, vdupq_lane_u8(d0, 3));
+ dst += stride;
+ vst1q_u8(dst + 0, vdupq_lane_u8(d0, 4));
+ vst1q_u8(dst + 16, vdupq_lane_u8(d0, 4));
+ dst += stride;
+ vst1q_u8(dst + 0, vdupq_lane_u8(d0, 5));
+ vst1q_u8(dst + 16, vdupq_lane_u8(d0, 5));
+ dst += stride;
+ vst1q_u8(dst + 0, vdupq_lane_u8(d0, 6));
+ vst1q_u8(dst + 16, vdupq_lane_u8(d0, 6));
+ dst += stride;
+ vst1q_u8(dst + 0, vdupq_lane_u8(d0, 7));
+ vst1q_u8(dst + 16, vdupq_lane_u8(d0, 7));
+}
+
+static INLINE void h_store_64x8(uint8_t *dst, ptrdiff_t stride, uint8x8_t d0) {
+ vst1q_u8(dst + 0, vdupq_lane_u8(d0, 0));
+ vst1q_u8(dst + 16, vdupq_lane_u8(d0, 0));
+ vst1q_u8(dst + 32, vdupq_lane_u8(d0, 0));
+ vst1q_u8(dst + 48, vdupq_lane_u8(d0, 0));
+ dst += stride;
+ vst1q_u8(dst + 0, vdupq_lane_u8(d0, 1));
+ vst1q_u8(dst + 16, vdupq_lane_u8(d0, 1));
+ vst1q_u8(dst + 32, vdupq_lane_u8(d0, 1));
+ vst1q_u8(dst + 48, vdupq_lane_u8(d0, 1));
+ dst += stride;
+ vst1q_u8(dst + 0, vdupq_lane_u8(d0, 2));
+ vst1q_u8(dst + 16, vdupq_lane_u8(d0, 2));
+ vst1q_u8(dst + 32, vdupq_lane_u8(d0, 2));
+ vst1q_u8(dst + 48, vdupq_lane_u8(d0, 2));
+ dst += stride;
+ vst1q_u8(dst + 0, vdupq_lane_u8(d0, 3));
+ vst1q_u8(dst + 16, vdupq_lane_u8(d0, 3));
+ vst1q_u8(dst + 32, vdupq_lane_u8(d0, 3));
+ vst1q_u8(dst + 48, vdupq_lane_u8(d0, 3));
+ dst += stride;
+ vst1q_u8(dst + 0, vdupq_lane_u8(d0, 4));
+ vst1q_u8(dst + 16, vdupq_lane_u8(d0, 4));
+ vst1q_u8(dst + 32, vdupq_lane_u8(d0, 4));
+ vst1q_u8(dst + 48, vdupq_lane_u8(d0, 4));
+ dst += stride;
+ vst1q_u8(dst + 0, vdupq_lane_u8(d0, 5));
+ vst1q_u8(dst + 16, vdupq_lane_u8(d0, 5));
+ vst1q_u8(dst + 32, vdupq_lane_u8(d0, 5));
+ vst1q_u8(dst + 48, vdupq_lane_u8(d0, 5));
+ dst += stride;
+ vst1q_u8(dst + 0, vdupq_lane_u8(d0, 6));
+ vst1q_u8(dst + 16, vdupq_lane_u8(d0, 6));
+ vst1q_u8(dst + 32, vdupq_lane_u8(d0, 6));
+ vst1q_u8(dst + 48, vdupq_lane_u8(d0, 6));
+ dst += stride;
+ vst1q_u8(dst + 0, vdupq_lane_u8(d0, 7));
+ vst1q_u8(dst + 16, vdupq_lane_u8(d0, 7));
+ vst1q_u8(dst + 32, vdupq_lane_u8(d0, 7));
+ vst1q_u8(dst + 48, vdupq_lane_u8(d0, 7));
}
void aom_h_predictor_4x4_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above, const uint8_t *left) {
- uint8x8_t d0u8 = vdup_n_u8(0);
- uint32x2_t d1u32 = vdup_n_u32(0);
+ const uint8x8_t d0 = load_u8_4x1_lane0(left);
(void)above;
-
- d1u32 = vld1_lane_u32((const uint32_t *)left, d1u32, 0);
-
- d0u8 = vdup_lane_u8(vreinterpret_u8_u32(d1u32), 0);
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(d0u8), 0);
- dst += stride;
- d0u8 = vdup_lane_u8(vreinterpret_u8_u32(d1u32), 1);
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(d0u8), 0);
- dst += stride;
- d0u8 = vdup_lane_u8(vreinterpret_u8_u32(d1u32), 2);
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(d0u8), 0);
- dst += stride;
- d0u8 = vdup_lane_u8(vreinterpret_u8_u32(d1u32), 3);
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(d0u8), 0);
+ store_u8_4x1(dst + 0 * stride, vdup_lane_u8(d0, 0), 0);
+ store_u8_4x1(dst + 1 * stride, vdup_lane_u8(d0, 1), 0);
+ store_u8_4x1(dst + 2 * stride, vdup_lane_u8(d0, 2), 0);
+ store_u8_4x1(dst + 3 * stride, vdup_lane_u8(d0, 3), 0);
}
void aom_h_predictor_8x8_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above, const uint8_t *left) {
- uint8x8_t d0u8 = vdup_n_u8(0);
- uint64x1_t d1u64 = vdup_n_u64(0);
+ const uint8x8_t d0 = vld1_u8(left);
(void)above;
-
- d1u64 = vld1_u64((const uint64_t *)left);
-
- d0u8 = vdup_lane_u8(vreinterpret_u8_u64(d1u64), 0);
- vst1_u8(dst, d0u8);
- dst += stride;
- d0u8 = vdup_lane_u8(vreinterpret_u8_u64(d1u64), 1);
- vst1_u8(dst, d0u8);
- dst += stride;
- d0u8 = vdup_lane_u8(vreinterpret_u8_u64(d1u64), 2);
- vst1_u8(dst, d0u8);
- dst += stride;
- d0u8 = vdup_lane_u8(vreinterpret_u8_u64(d1u64), 3);
- vst1_u8(dst, d0u8);
- dst += stride;
- d0u8 = vdup_lane_u8(vreinterpret_u8_u64(d1u64), 4);
- vst1_u8(dst, d0u8);
- dst += stride;
- d0u8 = vdup_lane_u8(vreinterpret_u8_u64(d1u64), 5);
- vst1_u8(dst, d0u8);
- dst += stride;
- d0u8 = vdup_lane_u8(vreinterpret_u8_u64(d1u64), 6);
- vst1_u8(dst, d0u8);
- dst += stride;
- d0u8 = vdup_lane_u8(vreinterpret_u8_u64(d1u64), 7);
- vst1_u8(dst, d0u8);
+ h_store_8x8(dst, stride, d0);
}
void aom_h_predictor_16x16_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above, const uint8_t *left) {
- int j;
- uint8x8_t d2u8 = vdup_n_u8(0);
- uint8x16_t q0u8 = vdupq_n_u8(0);
- uint8x16_t q1u8 = vdupq_n_u8(0);
+ const uint8x16_t d0 = vld1q_u8(left);
(void)above;
-
- q1u8 = vld1q_u8(left);
- d2u8 = vget_low_u8(q1u8);
- for (j = 0; j < 2; j++, d2u8 = vget_high_u8(q1u8)) {
- q0u8 = vdupq_lane_u8(d2u8, 0);
- vst1q_u8(dst, q0u8);
- dst += stride;
- q0u8 = vdupq_lane_u8(d2u8, 1);
- vst1q_u8(dst, q0u8);
- dst += stride;
- q0u8 = vdupq_lane_u8(d2u8, 2);
- vst1q_u8(dst, q0u8);
- dst += stride;
- q0u8 = vdupq_lane_u8(d2u8, 3);
- vst1q_u8(dst, q0u8);
- dst += stride;
- q0u8 = vdupq_lane_u8(d2u8, 4);
- vst1q_u8(dst, q0u8);
- dst += stride;
- q0u8 = vdupq_lane_u8(d2u8, 5);
- vst1q_u8(dst, q0u8);
- dst += stride;
- q0u8 = vdupq_lane_u8(d2u8, 6);
- vst1q_u8(dst, q0u8);
- dst += stride;
- q0u8 = vdupq_lane_u8(d2u8, 7);
- vst1q_u8(dst, q0u8);
- dst += stride;
- }
+ h_store_16x8(dst, stride, vget_low_u8(d0));
+ h_store_16x8(dst + 8 * stride, stride, vget_high_u8(d0));
}
void aom_h_predictor_32x32_neon(uint8_t *dst, ptrdiff_t stride,
const uint8_t *above, const uint8_t *left) {
- int j, k;
- uint8x8_t d2u8 = vdup_n_u8(0);
- uint8x16_t q0u8 = vdupq_n_u8(0);
- uint8x16_t q1u8 = vdupq_n_u8(0);
+ const uint8x16_t d0 = vld1q_u8(left);
+ const uint8x16_t d1 = vld1q_u8(left + 16);
(void)above;
+ h_store_32x8(dst + 0 * stride, stride, vget_low_u8(d0));
+ h_store_32x8(dst + 8 * stride, stride, vget_high_u8(d0));
+ h_store_32x8(dst + 16 * stride, stride, vget_low_u8(d1));
+ h_store_32x8(dst + 24 * stride, stride, vget_high_u8(d1));
+}
- for (k = 0; k < 2; k++, left += 16) {
- q1u8 = vld1q_u8(left);
- d2u8 = vget_low_u8(q1u8);
- for (j = 0; j < 2; j++, d2u8 = vget_high_u8(q1u8)) {
- q0u8 = vdupq_lane_u8(d2u8, 0);
- vst1q_u8(dst, q0u8);
- vst1q_u8(dst + 16, q0u8);
- dst += stride;
- q0u8 = vdupq_lane_u8(d2u8, 1);
- vst1q_u8(dst, q0u8);
- vst1q_u8(dst + 16, q0u8);
- dst += stride;
- q0u8 = vdupq_lane_u8(d2u8, 2);
- vst1q_u8(dst, q0u8);
- vst1q_u8(dst + 16, q0u8);
- dst += stride;
- q0u8 = vdupq_lane_u8(d2u8, 3);
- vst1q_u8(dst, q0u8);
- vst1q_u8(dst + 16, q0u8);
- dst += stride;
- q0u8 = vdupq_lane_u8(d2u8, 4);
- vst1q_u8(dst, q0u8);
- vst1q_u8(dst + 16, q0u8);
- dst += stride;
- q0u8 = vdupq_lane_u8(d2u8, 5);
- vst1q_u8(dst, q0u8);
- vst1q_u8(dst + 16, q0u8);
- dst += stride;
- q0u8 = vdupq_lane_u8(d2u8, 6);
- vst1q_u8(dst, q0u8);
- vst1q_u8(dst + 16, q0u8);
- dst += stride;
- q0u8 = vdupq_lane_u8(d2u8, 7);
- vst1q_u8(dst, q0u8);
- vst1q_u8(dst + 16, q0u8);
- dst += stride;
- }
+void aom_h_predictor_4x8_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ const uint8x8_t d0 = vld1_u8(left);
+ (void)above;
+ h_store_4x8(dst, stride, d0);
+}
+
+void aom_h_predictor_4x16_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ const uint8x16_t d0 = vld1q_u8(left);
+ (void)above;
+ h_store_4x8(dst + 0 * stride, stride, vget_low_u8(d0));
+ h_store_4x8(dst + 8 * stride, stride, vget_high_u8(d0));
+}
+
+void aom_h_predictor_8x4_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ const uint8x8_t d0 = load_u8_4x1_lane0(left);
+ (void)above;
+ vst1_u8(dst + 0 * stride, vdup_lane_u8(d0, 0));
+ vst1_u8(dst + 1 * stride, vdup_lane_u8(d0, 1));
+ vst1_u8(dst + 2 * stride, vdup_lane_u8(d0, 2));
+ vst1_u8(dst + 3 * stride, vdup_lane_u8(d0, 3));
+}
+
+void aom_h_predictor_8x16_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ const uint8x16_t d0 = vld1q_u8(left);
+ (void)above;
+ h_store_8x8(dst + 0 * stride, stride, vget_low_u8(d0));
+ h_store_8x8(dst + 8 * stride, stride, vget_high_u8(d0));
+}
+
+void aom_h_predictor_8x32_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ const uint8x16_t d0 = vld1q_u8(left);
+ const uint8x16_t d1 = vld1q_u8(left + 16);
+ (void)above;
+ h_store_8x8(dst + 0 * stride, stride, vget_low_u8(d0));
+ h_store_8x8(dst + 8 * stride, stride, vget_high_u8(d0));
+ h_store_8x8(dst + 16 * stride, stride, vget_low_u8(d1));
+ h_store_8x8(dst + 24 * stride, stride, vget_high_u8(d1));
+}
+
+void aom_h_predictor_16x4_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ const uint8x8_t d0 = load_u8_4x1_lane0(left);
+ (void)above;
+ vst1q_u8(dst + 0 * stride, vdupq_lane_u8(d0, 0));
+ vst1q_u8(dst + 1 * stride, vdupq_lane_u8(d0, 1));
+ vst1q_u8(dst + 2 * stride, vdupq_lane_u8(d0, 2));
+ vst1q_u8(dst + 3 * stride, vdupq_lane_u8(d0, 3));
+}
+
+void aom_h_predictor_16x8_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ const uint8x8_t d0 = vld1_u8(left);
+ (void)above;
+ h_store_16x8(dst, stride, d0);
+}
+
+void aom_h_predictor_16x32_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ const uint8x16_t d0 = vld1q_u8(left);
+ const uint8x16_t d1 = vld1q_u8(left + 16);
+ (void)above;
+ h_store_16x8(dst + 0 * stride, stride, vget_low_u8(d0));
+ h_store_16x8(dst + 8 * stride, stride, vget_high_u8(d0));
+ h_store_16x8(dst + 16 * stride, stride, vget_low_u8(d1));
+ h_store_16x8(dst + 24 * stride, stride, vget_high_u8(d1));
+}
+
+void aom_h_predictor_16x64_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ const uint8x16_t d0 = vld1q_u8(left);
+ const uint8x16_t d1 = vld1q_u8(left + 16);
+ const uint8x16_t d2 = vld1q_u8(left + 32);
+ const uint8x16_t d3 = vld1q_u8(left + 48);
+ (void)above;
+ h_store_16x8(dst + 0 * stride, stride, vget_low_u8(d0));
+ h_store_16x8(dst + 8 * stride, stride, vget_high_u8(d0));
+ h_store_16x8(dst + 16 * stride, stride, vget_low_u8(d1));
+ h_store_16x8(dst + 24 * stride, stride, vget_high_u8(d1));
+ h_store_16x8(dst + 32 * stride, stride, vget_low_u8(d2));
+ h_store_16x8(dst + 40 * stride, stride, vget_high_u8(d2));
+ h_store_16x8(dst + 48 * stride, stride, vget_low_u8(d3));
+ h_store_16x8(dst + 56 * stride, stride, vget_high_u8(d3));
+}
+
+void aom_h_predictor_32x8_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ const uint8x8_t d0 = vld1_u8(left);
+ (void)above;
+ h_store_32x8(dst, stride, d0);
+}
+
+void aom_h_predictor_32x16_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ const uint8x16_t d0 = vld1q_u8(left);
+ (void)above;
+ h_store_32x8(dst + 0 * stride, stride, vget_low_u8(d0));
+ h_store_32x8(dst + 8 * stride, stride, vget_high_u8(d0));
+}
+
+void aom_h_predictor_32x64_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ const uint8x16_t d0 = vld1q_u8(left + 0);
+ const uint8x16_t d1 = vld1q_u8(left + 16);
+ const uint8x16_t d2 = vld1q_u8(left + 32);
+ const uint8x16_t d3 = vld1q_u8(left + 48);
+ (void)above;
+ h_store_32x8(dst + 0 * stride, stride, vget_low_u8(d0));
+ h_store_32x8(dst + 8 * stride, stride, vget_high_u8(d0));
+ h_store_32x8(dst + 16 * stride, stride, vget_low_u8(d1));
+ h_store_32x8(dst + 24 * stride, stride, vget_high_u8(d1));
+ h_store_32x8(dst + 32 * stride, stride, vget_low_u8(d2));
+ h_store_32x8(dst + 40 * stride, stride, vget_high_u8(d2));
+ h_store_32x8(dst + 48 * stride, stride, vget_low_u8(d3));
+ h_store_32x8(dst + 56 * stride, stride, vget_high_u8(d3));
+}
+
+void aom_h_predictor_64x16_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ const uint8x16_t d0 = vld1q_u8(left);
+ (void)above;
+ h_store_64x8(dst + 0 * stride, stride, vget_low_u8(d0));
+ h_store_64x8(dst + 8 * stride, stride, vget_high_u8(d0));
+}
+
+void aom_h_predictor_64x32_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ (void)above;
+ for (int i = 0; i < 2; ++i) {
+ const uint8x16_t d0 = vld1q_u8(left);
+ h_store_64x8(dst + 0 * stride, stride, vget_low_u8(d0));
+ h_store_64x8(dst + 8 * stride, stride, vget_high_u8(d0));
+ left += 16;
+ dst += 16 * stride;
+ }
+}
+
+void aom_h_predictor_64x64_neon(uint8_t *dst, ptrdiff_t stride,
+ const uint8_t *above, const uint8_t *left) {
+ (void)above;
+ for (int i = 0; i < 4; ++i) {
+ const uint8x16_t d0 = vld1q_u8(left);
+ h_store_64x8(dst + 0 * stride, stride, vget_low_u8(d0));
+ h_store_64x8(dst + 8 * stride, stride, vget_high_u8(d0));
+ left += 16;
+ dst += 16 * stride;
}
}
@@ -638,7 +1149,6 @@
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
};
-/* clang-format on */
static AOM_FORCE_INLINE void dr_prediction_z1_HxW_internal_neon_64(
int H, int W, uint8x8_t *dst, const uint8_t *above, int upsample_above,
int dx) {
@@ -653,23 +1163,12 @@
// final pixels will be calculated as:
// (above[x] * 32 + 16 + (above[x+1] - above[x]) * shift) >> 5
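+  // Since above[x] * 32 has no bits set below bit 5, this is equivalent to
+  // above[x] + round((above[x+1] - above[x]) * shift / 32), with shift in
+  // [0, 31] selecting the 1/32-pel interpolation position.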
- uint16x8_t a0, a1;
- uint16x8_t diff, a32;
- uint16x8_t a16;
- uint8x8_t a_mbase_x;
-
- a16 = vdupq_n_u16(16);
- a_mbase_x = vdup_n_u8(above[max_base_x]);
- uint16x8_t v_32 = vdupq_n_u16(32);
- int16x8_t v_upsample_above = vdupq_n_s16(upsample_above);
- uint16x8_t c3f = vdupq_n_u16(0x3f);
+ const uint16x8_t a16 = vdupq_n_u16(16);
+ const uint8x8_t a_mbase_x = vdup_n_u8(above[max_base_x]);
+ const uint8x8_t v_32 = vdup_n_u8(32);
int x = dx;
for (int r = 0; r < W; r++) {
- uint16x8_t res;
- uint16x8_t shift;
- uint8x8x2_t v_tmp_a0_128;
-
int base = x >> frac_bits;
int base_max_diff = (max_base_x - base) >> upsample_above;
if (base_max_diff <= 0) {
@@ -681,24 +1180,22 @@
if (base_max_diff > H) base_max_diff = H;
+ uint8x8x2_t a01_128;
+ uint16x8_t shift;
if (upsample_above) {
- v_tmp_a0_128 = vld2_u8(above + base);
- shift = vshrq_n_u16(
- vandq_u16(vshlq_u16(vdupq_n_u16(x), v_upsample_above), c3f), 1);
+ a01_128 = vld2_u8(above + base);
+ shift = vdupq_n_u16(((x << upsample_above) & 0x3f) >> 1);
} else {
- v_tmp_a0_128.val[0] = vld1_u8(above + base);
- v_tmp_a0_128.val[1] = vld1_u8(above + base + 1);
- shift = vshrq_n_u16(vandq_u16(vdupq_n_u16(x), c3f), 1);
+ a01_128.val[0] = vld1_u8(above + base);
+ a01_128.val[1] = vld1_u8(above + base + 1);
+ shift = vdupq_n_u16((x & 0x3f) >> 1);
}
- a0 = vmovl_u8(v_tmp_a0_128.val[0]);
- a1 = vmovl_u8(v_tmp_a0_128.val[1]);
- diff = vsubq_u16(a1, a0); // a[x+1] - a[x]
- a32 = vmlaq_u16(a16, a0, v_32); // a[x] * 32 + 16
- res = vmlaq_u16(a32, diff, shift);
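+    // vsubl/vmlal widen from u8 to u16 as part of the arithmetic, so no
+    // separate vmovl conversions are needed.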
+ uint16x8_t diff = vsubl_u8(a01_128.val[1], a01_128.val[0]);
+ uint16x8_t a32 = vmlal_u8(a16, a01_128.val[0], v_32);
+ uint16x8_t res = vmlaq_u16(a32, diff, shift);
uint8x8_t mask = vld1_u8(BaseMask[base_max_diff]);
- dst[r] =
- vorr_u8(vand_u8(mask, vshrn_n_u16(res, 5)), vbic_u8(a_mbase_x, mask));
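+    // Single bit-select blend: lanes at or beyond max_base_x take the
+    // replicated above[max_base_x] value.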
+ dst[r] = vbsl_u8(mask, vshrn_n_u16(res, 5), a_mbase_x);
x += dx;
}
@@ -743,17 +1240,10 @@
// final pixels will be calculated as:
// (above[x] * 32 + 16 + (above[x+1] - above[x]) * shift) >> 5
- uint8x16x2_t a0, a1;
- uint16x8x2_t diff, a32;
- uint16x8_t a16, c3f;
- uint8x16_t a_mbase_x;
-
- a16 = vdupq_n_u16(16);
- a_mbase_x = vdupq_n_u8(above[max_base_x]);
- c3f = vdupq_n_u16(0x3f);
- uint16x8_t v_32 = vdupq_n_u16(32);
- uint8x16_t v_zero = vdupq_n_u8(0);
- int16x8_t v_upsample_above = vdupq_n_s16(upsample_above);
+ const uint16x8_t a16 = vdupq_n_u16(16);
+ const uint8x16_t a_mbase_x = vdupq_n_u8(above[max_base_x]);
+ const uint8x8_t v_32 = vdup_n_u8(32);
+ const uint8x16_t v_zero = vdupq_n_u8(0);
int x = dx;
for (int r = 0; r < W; r++) {
@@ -776,30 +1266,24 @@
uint8x8x2_t v_tmp_a0_128 = vld2_u8(above + base);
a0_128 = vcombine_u8(v_tmp_a0_128.val[0], v_tmp_a0_128.val[1]);
a1_128 = vextq_u8(a0_128, v_zero, 8);
- shift = vshrq_n_u16(
- vandq_u16(vshlq_u16(vdupq_n_u16(x), v_upsample_above), c3f), 1);
+ shift = vdupq_n_u16(((x << upsample_above) & 0x3f) >> 1);
} else {
a0_128 = vld1q_u8(above + base);
a1_128 = vld1q_u8(above + base + 1);
- shift = vshrq_n_u16(vandq_u16(vdupq_n_u16(x), c3f), 1);
+ shift = vdupq_n_u16((x & 0x3f) >> 1);
}
- a0 = vzipq_u8(a0_128, v_zero);
- a1 = vzipq_u8(a1_128, v_zero);
- diff.val[0] = vsubq_u16(vreinterpretq_u16_u8(a1.val[0]),
- vreinterpretq_u16_u8(a0.val[0])); // a[x+1] - a[x]
- diff.val[1] = vsubq_u16(vreinterpretq_u16_u8(a1.val[1]),
- vreinterpretq_u16_u8(a0.val[1])); // a[x+1] - a[x]
- a32.val[0] = vmlaq_u16(a16, vreinterpretq_u16_u8(a0.val[0]),
- v_32); // a[x] * 32 + 16
- a32.val[1] = vmlaq_u16(a16, vreinterpretq_u16_u8(a0.val[1]),
- v_32); // a[x] * 32 + 16
+ uint16x8x2_t diff, a32;
+ diff.val[0] = vsubl_u8(vget_low_u8(a1_128), vget_low_u8(a0_128));
+ diff.val[1] = vsubl_u8(vget_high_u8(a1_128), vget_high_u8(a0_128));
+ a32.val[0] = vmlal_u8(a16, vget_low_u8(a0_128), v_32);
+ a32.val[1] = vmlal_u8(a16, vget_high_u8(a0_128), v_32);
res.val[0] = vmlaq_u16(a32.val[0], diff.val[0], shift);
res.val[1] = vmlaq_u16(a32.val[1], diff.val[1], shift);
uint8x16_t v_temp =
vcombine_u8(vshrn_n_u16(res.val[0], 5), vshrn_n_u16(res.val[1], 5));
uint8x16_t mask = vld1q_u8(BaseMask[base_max_diff]);
- dst[r] = vorrq_u8(vandq_u8(mask, v_temp), vbicq_u8(a_mbase_x, mask));
+ dst[r] = vbslq_u8(mask, v_temp, a_mbase_x);
x += dx;
}
@@ -831,22 +1315,13 @@
// final pixels will be calculated as:
// (above[x] * 32 + 16 + (above[x+1] - above[x]) * shift) >> 5
- uint8x16_t a_mbase_x;
- uint8x16x2_t a0, a1;
- uint16x8x2_t diff, a32;
- uint16x8_t a16, c3f;
-
- a_mbase_x = vdupq_n_u8(above[max_base_x]);
- a16 = vdupq_n_u16(16);
- c3f = vdupq_n_u16(0x3f);
- uint16x8_t v_32 = vdupq_n_u16(32);
- uint8x16_t v_zero = vdupq_n_u8(0);
+ const uint8x16_t a_mbase_x = vdupq_n_u8(above[max_base_x]);
+ const uint16x8_t a16 = vdupq_n_u16(16);
+ const uint8x8_t v_32 = vdup_n_u8(32);
int x = dx;
for (int r = 0; r < N; r++) {
- uint16x8x2_t res;
uint8x16_t res16[2];
- uint8x16_t a0_128, a1_128;
int base = x >> frac_bits;
int base_max_diff = (max_base_x - base);
@@ -859,27 +1334,21 @@
}
if (base_max_diff > 32) base_max_diff = 32;
- uint16x8_t shift = vshrq_n_u16(vandq_u16(vdupq_n_u16(x), c3f), 1);
+ uint16x8_t shift = vdupq_n_u16((x & 0x3f) >> 1);
for (int j = 0, jj = 0; j < 32; j += 16, jj++) {
int mdiff = base_max_diff - j;
if (mdiff <= 0) {
res16[jj] = a_mbase_x;
} else {
+ uint16x8x2_t a32, diff, res;
+ uint8x16_t a0_128, a1_128;
a0_128 = vld1q_u8(above + base + j);
a1_128 = vld1q_u8(above + base + j + 1);
- a0 = vzipq_u8(a0_128, v_zero);
- a1 = vzipq_u8(a1_128, v_zero);
- diff.val[0] =
- vsubq_u16(vreinterpretq_u16_u8(a1.val[0]),
- vreinterpretq_u16_u8(a0.val[0])); // a[x+1] - a[x]
- diff.val[1] =
- vsubq_u16(vreinterpretq_u16_u8(a1.val[1]),
- vreinterpretq_u16_u8(a0.val[1])); // a[x+1] - a[x]
- a32.val[0] = vmlaq_u16(a16, vreinterpretq_u16_u8(a0.val[0]),
- v_32); // a[x] * 32 + 16
- a32.val[1] = vmlaq_u16(a16, vreinterpretq_u16_u8(a0.val[1]),
- v_32); // a[x] * 32 + 16
+ diff.val[0] = vsubl_u8(vget_low_u8(a1_128), vget_low_u8(a0_128));
+ diff.val[1] = vsubl_u8(vget_high_u8(a1_128), vget_high_u8(a0_128));
+ a32.val[0] = vmlal_u8(a16, vget_low_u8(a0_128), v_32);
+ a32.val[1] = vmlal_u8(a16, vget_high_u8(a0_128), v_32);
res.val[0] = vmlaq_u16(a32.val[0], diff.val[0], shift);
res.val[1] = vmlaq_u16(a32.val[1], diff.val[1], shift);
@@ -892,10 +1361,8 @@
mask.val[0] = vld1q_u8(BaseMask[base_max_diff]);
mask.val[1] = vld1q_u8(BaseMask[base_max_diff] + 16);
- dstvec[r].val[0] = vorrq_u8(vandq_u8(mask.val[0], res16[0]),
- vbicq_u8(a_mbase_x, mask.val[0]));
- dstvec[r].val[1] = vorrq_u8(vandq_u8(mask.val[1], res16[1]),
- vbicq_u8(a_mbase_x, mask.val[1]));
+ dstvec[r].val[0] = vbslq_u8(mask.val[0], res16[0], a_mbase_x);
+ dstvec[r].val[1] = vbslq_u8(mask.val[1], res16[1], a_mbase_x);
x += dx;
}
}
@@ -927,23 +1394,15 @@
// final pixels will be calculated as:
// (above[x] * 32 + 16 + (above[x+1] - above[x]) * shift) >> 5
- uint8x16x2_t a0, a1;
- uint16x8x2_t a32, diff;
- uint16x8_t a16, c3f;
- uint8x16_t a_mbase_x, max_base_x128, mask128;
-
- a16 = vdupq_n_u16(16);
- a_mbase_x = vdupq_n_u8(above[max_base_x]);
- max_base_x128 = vdupq_n_u8(max_base_x);
- c3f = vdupq_n_u16(0x3f);
- uint16x8_t v_32 = vdupq_n_u16(32);
- uint8x16_t v_zero = vdupq_n_u8(0);
- uint8x16_t step = vdupq_n_u8(16);
+ const uint16x8_t a16 = vdupq_n_u16(16);
+ const uint8x16_t a_mbase_x = vdupq_n_u8(above[max_base_x]);
+ const uint8x16_t max_base_x128 = vdupq_n_u8(max_base_x);
+ const uint8x8_t v_32 = vdup_n_u8(32);
+ const uint8x16_t v_zero = vdupq_n_u8(0);
+ const uint8x16_t step = vdupq_n_u8(16);
int x = dx;
for (int r = 0; r < N; r++, dst += stride) {
- uint16x8x2_t res;
-
int base = x >> frac_bits;
if (base >= max_base_x) {
for (int i = r; i < N; ++i) {
@@ -956,8 +1415,7 @@
return;
}
- uint16x8_t shift = vshrq_n_u16(vandq_u16(vdupq_n_u16(x), c3f), 1);
- uint8x16_t a0_128, a1_128, res128;
+ uint16x8_t shift = vdupq_n_u16((x & 0x3f) >> 1);
uint8x16_t base_inc128 =
vaddq_u8(vdupq_n_u8(base), vcombine_u8(vcreate_u8(0x0706050403020100),
vcreate_u8(0x0F0E0D0C0B0A0908)));
@@ -967,28 +1425,21 @@
if (mdif <= 0) {
vst1q_u8(dst + j, a_mbase_x);
} else {
+ uint16x8x2_t a32, diff, res;
+ uint8x16_t a0_128, a1_128, mask128, res128;
a0_128 = vld1q_u8(above + base + j);
a1_128 = vld1q_u8(above + base + 1 + j);
- a0 = vzipq_u8(a0_128, v_zero);
- a1 = vzipq_u8(a1_128, v_zero);
- diff.val[0] =
- vsubq_u16(vreinterpretq_u16_u8(a1.val[0]),
- vreinterpretq_u16_u8(a0.val[0])); // a[x+1] - a[x]
- diff.val[1] =
- vsubq_u16(vreinterpretq_u16_u8(a1.val[1]),
- vreinterpretq_u16_u8(a0.val[1])); // a[x+1] - a[x]
- a32.val[0] = vmlaq_u16(a16, vreinterpretq_u16_u8(a0.val[0]),
- v_32); // a[x] * 32 + 16
- a32.val[1] = vmlaq_u16(a16, vreinterpretq_u16_u8(a0.val[1]),
- v_32); // a[x] * 32 + 16
+ diff.val[0] = vsubl_u8(vget_low_u8(a1_128), vget_low_u8(a0_128));
+ diff.val[1] = vsubl_u8(vget_high_u8(a1_128), vget_high_u8(a0_128));
+ a32.val[0] = vmlal_u8(a16, vget_low_u8(a0_128), v_32);
+ a32.val[1] = vmlal_u8(a16, vget_high_u8(a0_128), v_32);
res.val[0] = vmlaq_u16(a32.val[0], diff.val[0], shift);
res.val[1] = vmlaq_u16(a32.val[1], diff.val[1], shift);
uint8x16_t v_temp =
vcombine_u8(vshrn_n_u16(res.val[0], 5), vshrn_n_u16(res.val[1], 5));
mask128 = vcgtq_u8(vqsubq_u8(max_base_x128, base_inc128), v_zero);
- res128 =
- vorrq_u8(vandq_u8(mask128, v_temp), vbicq_u8(a_mbase_x, mask128));
+ res128 = vbslq_u8(mask128, v_temp, a_mbase_x);
vst1q_u8(dst + j, res128);
base_inc128 = vaddq_u8(base_inc128, step);
@@ -1023,7 +1474,6 @@
break;
default: break;
}
- return;
}
/* ---------------------P R E D I C T I O N Z 2--------------------------- */
@@ -1077,11 +1527,17 @@
int16x4_t dy64 = vdup_n_s16(dy);
int16x4_t v_frac_bits_y = vdup_n_s16(-frac_bits_y);
int16x4_t min_base_y64 = vdup_n_s16(min_base_y);
- int16x4_t v_one = vdup_lane_s16(v_1234, 0);
+
+#if AOM_ARCH_AARCH64
+ // Use ext rather than loading left + 14 directly to avoid over-read.
+ const uint8x16_t left_m2 = vld1q_u8(left - 2);
+ const uint8x16_t left_0 = vld1q_u8(left);
+ const uint8x16_t left_14 = vextq_u8(left_0, left_0, 14);
+ const uint8x16x2_t left_vals = { { left_m2, left_14 } };
+#endif // AOM_ARCH_AARCH64
for (int r = 0; r < N; r++) {
uint16x8_t res, shift;
- uint16x4_t ydx;
uint8x8_t resx, resy;
uint16x4x2_t v_shift;
v_shift.val[1] = vdup_n_u16(0);
@@ -1105,7 +1561,7 @@
v_shift.val[0] = vreinterpret_u16_u8(v_zero_u8);
v_shift.val[1] = vreinterpret_u16_u8(v_zero_u8);
} else {
- ydx = vdup_n_u16(y * dx);
+ uint16x4_t ydx = vdup_n_u16(y * dx);
if (upsample_above) {
uint8x8x2_t v_tmp;
@@ -1128,29 +1584,39 @@
}
// y calc
- uint8x8_t a0_y, a1_y;
if (base_x < min_base_x) {
- DECLARE_ALIGNED(32, int16_t, base_y_c[4]);
int16x4_t v_r6 = vdup_n_s16(r << 6);
int16x4_t y_c64 = vmls_s16(v_r6, v_1234, dy64);
int16x4_t base_y_c64 = vshl_s16(y_c64, v_frac_bits_y);
uint16x4_t mask64 = vcgt_s16(min_base_y64, base_y_c64);
+ // Values in base_y_c64 range from -2 through 14 inclusive.
base_y_c64 = vbic_s16(base_y_c64, vreinterpret_s16_u16(mask64));
+
+#if AOM_ARCH_AARCH64
+ uint8x8_t left_idx0 = vreinterpret_u8_s16(base_y_c64 + 2); // [0, 16]
+ uint8x8_t left_idx1 = vreinterpret_u8_s16(base_y_c64 + 3); // [1, 17]
+
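+      // vqtbl2 gathers the left pixels in one lookup; interleaving with
+      // zero via vtrn1 leaves one pixel per 16-bit lane, the same layout
+      // the non-AArch64 path below produces with its lane loads.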
+ uint8x8_t a0_y = vtrn1_u8(vqtbl2_u8(left_vals, left_idx0), v_zero_u8);
+ uint8x8_t a1_y = vtrn1_u8(vqtbl2_u8(left_vals, left_idx1), v_zero_u8);
+#else // !AOM_ARCH_AARCH64
+ DECLARE_ALIGNED(32, int16_t, base_y_c[4]);
+
vst1_s16(base_y_c, base_y_c64);
- a0_y = v_zero_u8;
+ uint8x8_t a0_y = vdup_n_u8(0);
a0_y = vld1_lane_u8(left + base_y_c[0], a0_y, 0);
a0_y = vld1_lane_u8(left + base_y_c[1], a0_y, 2);
a0_y = vld1_lane_u8(left + base_y_c[2], a0_y, 4);
a0_y = vld1_lane_u8(left + base_y_c[3], a0_y, 6);
- base_y_c64 = vadd_s16(base_y_c64, v_one);
+ base_y_c64 = vadd_s16(base_y_c64, vdup_n_s16(1));
vst1_s16(base_y_c, base_y_c64);
- a1_y = v_zero_u8;
+ uint8x8_t a1_y = vdup_n_u8(0);
a1_y = vld1_lane_u8(left + base_y_c[0], a1_y, 0);
a1_y = vld1_lane_u8(left + base_y_c[1], a1_y, 2);
a1_y = vld1_lane_u8(left + base_y_c[2], a1_y, 4);
a1_y = vld1_lane_u8(left + base_y_c[3], a1_y, 6);
+#endif // AOM_ARCH_AARCH64
if (upsample_left) {
v_shift.val[1] = vshr_n_u16(
@@ -1173,7 +1639,7 @@
resy = vext_u8(resx, v_zero_u8, 4);
uint8x8_t mask = vld1_u8(BaseMask[base_min_diff]);
- uint8x8_t v_resxy = vorr_u8(vand_u8(mask, resy), vbic_u8(resx, mask));
+ uint8x8_t v_resxy = vbsl_u8(mask, resy, resx);
vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(v_resxy), 0);
dst += stride;
@@ -1217,27 +1683,31 @@
// above[x+1] - above[x]
// final pixels will be calculated as:
// (above[x] * 32 + 16 + (above[x+1] - above[x]) * shift) >> 5
- uint8x16x2_t a0_x, a1_x;
uint16x8x2_t diff, a32;
- uint16x8_t c1234, a16, c3f;
- uint8x16_t a0_x128, a1_x128;
- int16x8_t min_base_y128, dy128;
- uint16x8_t v_32 = vdupq_n_u16(32);
uint8x16_t v_zero = vdupq_n_u8(0);
int16x8_t v_upsample_left = vdupq_n_s16(upsample_left);
int16x8_t v_upsample_above = vdupq_n_s16(upsample_above);
int16x8_t v_frac_bits_y = vdupq_n_s16(-frac_bits_y);
- a16 = vdupq_n_u16(16);
- c3f = vdupq_n_u16(0x3f);
- min_base_y128 = vdupq_n_s16(min_base_y);
- dy128 = vdupq_n_s16(dy);
- c1234 = vcombine_u16(vcreate_u16(0x0004000300020001),
- vcreate_u16(0x0008000700060005));
+ uint16x8_t a16 = vdupq_n_u16(16);
+ uint16x8_t c3f = vdupq_n_u16(0x3f);
+ int16x8_t min_base_y128 = vdupq_n_s16(min_base_y);
+ int16x8_t dy128 = vdupq_n_s16(dy);
+ uint16x8_t c1234 = vcombine_u16(vcreate_u16(0x0004000300020001),
+ vcreate_u16(0x0008000700060005));
+
+#if AOM_ARCH_AARCH64
+ // Use ext rather than loading left + 30 directly to avoid over-read.
+ const uint8x16_t left_m2 = vld1q_u8(left - 2);
+ const uint8x16_t left_0 = vld1q_u8(left + 0);
+ const uint8x16_t left_16 = vld1q_u8(left + 16);
+ const uint8x16_t left_14 = vextq_u8(left_0, left_16, 14);
+ const uint8x16_t left_30 = vextq_u8(left_16, left_16, 14);
+ const uint8x16x3_t left_vals = { { left_m2, left_14, left_30 } };
+#endif // AOM_ARCH_AARCH64
for (int r = 0; r < N; r++) {
uint8x8_t resx, resy, resxy;
- uint16x8_t r6, ydx;
uint16x8x2_t res, shift;
shift.val[1] = vdupq_n_u16(0);
@@ -1255,16 +1725,16 @@
if (base_min_diff < 0) base_min_diff = 0;
}
+ uint8x8_t a0_x0, a1_x0;
if (base_shift > 7) {
- a0_x.val[0] = v_zero;
- a0_x.val[1] = v_zero;
- a1_x.val[0] = v_zero;
- a1_x.val[1] = v_zero;
+ a0_x0 = vdup_n_u8(0);
+ a1_x0 = vdup_n_u8(0);
shift.val[0] = vreinterpretq_u16_u8(v_zero);
shift.val[1] = vreinterpretq_u16_u8(v_zero);
} else {
- ydx = vdupq_n_u16(y * dx);
- r6 = vshlq_n_u16(vextq_u16(c1234, vreinterpretq_u16_u8(v_zero), 2), 6);
+ uint16x8_t ydx = vdupq_n_u16(y * dx);
+ uint16x8_t r6 =
+ vshlq_n_u16(vextq_u16(c1234, vreinterpretq_u16_u8(v_zero), 2), 6);
if (upsample_above) {
uint8x8x2_t v_tmp;
@@ -1274,24 +1744,27 @@
uint8x8_t v_index_high = vld1_u8(EvenOddMaskx[base_shift] + 8);
shift.val[0] = vshrq_n_u16(
vandq_u16(vshlq_u16(vsubq_u16(r6, ydx), v_upsample_above), c3f), 1);
- a0_x.val[0] =
- vreinterpretq_u8_u16(vmovl_u8(vtbl2_u8(v_tmp, v_index_low)));
- a1_x.val[0] =
- vreinterpretq_u8_u16(vmovl_u8(vtbl2_u8(v_tmp, v_index_high)));
+ a0_x0 = vtbl2_u8(v_tmp, v_index_low);
+ a1_x0 = vtbl2_u8(v_tmp, v_index_high);
} else {
+ uint8x16_t a0_x128, a1_x128;
a0_x128 = vld1q_u8(above + base_x + base_shift);
a1_x128 = vextq_u8(a0_x128, v_zero, 1);
vector_shuffle(&a0_x128, &v_zero, base_shift);
vector_shuffle(&a1_x128, &v_zero, base_shift);
shift.val[0] = vshrq_n_u16(vandq_u16(vsubq_u16(r6, ydx), c3f), 1);
- a0_x.val[0] = vreinterpretq_u8_u16(vmovl_u8(vget_low_u8(a0_x128)));
- a1_x.val[0] = vreinterpretq_u8_u16(vmovl_u8(vget_low_u8(a1_x128)));
+ a0_x0 = vget_low_u8(a0_x128);
+ a1_x0 = vget_low_u8(a1_x128);
}
}
+ diff.val[0] = vsubl_u8(a1_x0, a0_x0); // a[x+1] - a[x]
+ a32.val[0] = vmlal_u8(a16, a0_x0, vdup_n_u8(32)); // a[x] * 32 + 16
+ res.val[0] = vmlaq_u16(a32.val[0], diff.val[0], shift.val[0]);
+ resx = vshrn_n_u16(res.val[0], 5);
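+    // resx now holds the full x-direction (above-row) prediction for this
+    // row; when base_x >= min_base_x it is stored as-is and the left-column
+    // calculation below is skipped.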
+
// y calc
if (base_x < min_base_x) {
- DECLARE_ALIGNED(32, int16_t, base_y_c[16]);
int16x8_t y_c128, base_y_c128;
uint16x8_t mask128;
int16x8_t v_r6 = vdupq_n_s16(r << 6);
@@ -1300,30 +1773,43 @@
base_y_c128 = vshlq_s16(y_c128, v_frac_bits_y);
mask128 = vcgtq_s16(min_base_y128, base_y_c128);
+ // Values in base_y_c128 range from -2 through 31 inclusive.
base_y_c128 = vbicq_s16(base_y_c128, vreinterpretq_s16_u16(mask128));
- vst1q_s16(base_y_c, base_y_c128);
- a0_x.val[1] = v_zero;
- a0_x.val[1] = vld1q_lane_u8(left + base_y_c[0], a0_x.val[1], 0);
- a0_x.val[1] = vld1q_lane_u8(left + base_y_c[1], a0_x.val[1], 2);
- a0_x.val[1] = vld1q_lane_u8(left + base_y_c[2], a0_x.val[1], 4);
- a0_x.val[1] = vld1q_lane_u8(left + base_y_c[3], a0_x.val[1], 6);
- a0_x.val[1] = vld1q_lane_u8(left + base_y_c[4], a0_x.val[1], 8);
- a0_x.val[1] = vld1q_lane_u8(left + base_y_c[5], a0_x.val[1], 10);
- a0_x.val[1] = vld1q_lane_u8(left + base_y_c[6], a0_x.val[1], 12);
- a0_x.val[1] = vld1q_lane_u8(left + base_y_c[7], a0_x.val[1], 14);
- base_y_c128 =
- vaddq_s16(base_y_c128, vreinterpretq_s16_u16(vshrq_n_u16(a16, 4)));
+#if AOM_ARCH_AARCH64
+ uint8x16_t left_idx0 = vreinterpretq_u8_s16(base_y_c128 + 2); // [0, 33]
+ uint8x16_t left_idx1 = vreinterpretq_u8_s16(base_y_c128 + 3); // [1, 34]
+ uint8x16_t left_idx01 = vuzp1q_u8(left_idx0, left_idx1);
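+      // vuzp1 keeps the low byte of each 16-bit index, packing the a0
+      // indices into the low half and the a1 indices into the high half, so
+      // one vqtbl3 lookup gathers both rows at once.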
+
+ uint8x16_t a01_x = vqtbl3q_u8(left_vals, left_idx01);
+ uint8x8_t a0_x1 = vget_low_u8(a01_x);
+ uint8x8_t a1_x1 = vget_high_u8(a01_x);
+#else // !AOM_ARCH_AARCH64
+ DECLARE_ALIGNED(32, int16_t, base_y_c[16]);
+
vst1q_s16(base_y_c, base_y_c128);
- a1_x.val[1] = v_zero;
- a1_x.val[1] = vld1q_lane_u8(left + base_y_c[0], a1_x.val[1], 0);
- a1_x.val[1] = vld1q_lane_u8(left + base_y_c[1], a1_x.val[1], 2);
- a1_x.val[1] = vld1q_lane_u8(left + base_y_c[2], a1_x.val[1], 4);
- a1_x.val[1] = vld1q_lane_u8(left + base_y_c[3], a1_x.val[1], 6);
- a1_x.val[1] = vld1q_lane_u8(left + base_y_c[4], a1_x.val[1], 8);
- a1_x.val[1] = vld1q_lane_u8(left + base_y_c[5], a1_x.val[1], 10);
- a1_x.val[1] = vld1q_lane_u8(left + base_y_c[6], a1_x.val[1], 12);
- a1_x.val[1] = vld1q_lane_u8(left + base_y_c[7], a1_x.val[1], 14);
+ uint8x8_t a0_x1 = vdup_n_u8(0);
+ a0_x1 = vld1_lane_u8(left + base_y_c[0], a0_x1, 0);
+ a0_x1 = vld1_lane_u8(left + base_y_c[1], a0_x1, 1);
+ a0_x1 = vld1_lane_u8(left + base_y_c[2], a0_x1, 2);
+ a0_x1 = vld1_lane_u8(left + base_y_c[3], a0_x1, 3);
+ a0_x1 = vld1_lane_u8(left + base_y_c[4], a0_x1, 4);
+ a0_x1 = vld1_lane_u8(left + base_y_c[5], a0_x1, 5);
+ a0_x1 = vld1_lane_u8(left + base_y_c[6], a0_x1, 6);
+ a0_x1 = vld1_lane_u8(left + base_y_c[7], a0_x1, 7);
+
+ base_y_c128 = vaddq_s16(base_y_c128, vdupq_n_s16(1));
+ vst1q_s16(base_y_c, base_y_c128);
+ uint8x8_t a1_x1 = vdup_n_u8(0);
+ a1_x1 = vld1_lane_u8(left + base_y_c[0], a1_x1, 0);
+ a1_x1 = vld1_lane_u8(left + base_y_c[1], a1_x1, 1);
+ a1_x1 = vld1_lane_u8(left + base_y_c[2], a1_x1, 2);
+ a1_x1 = vld1_lane_u8(left + base_y_c[3], a1_x1, 3);
+ a1_x1 = vld1_lane_u8(left + base_y_c[4], a1_x1, 4);
+ a1_x1 = vld1_lane_u8(left + base_y_c[5], a1_x1, 5);
+ a1_x1 = vld1_lane_u8(left + base_y_c[6], a1_x1, 6);
+ a1_x1 = vld1_lane_u8(left + base_y_c[7], a1_x1, 7);
+#endif // AOM_ARCH_AARCH64
if (upsample_left) {
shift.val[1] = vshrq_n_u16(
@@ -1334,26 +1820,18 @@
shift.val[1] =
vshrq_n_u16(vandq_u16(vreinterpretq_u16_s16(y_c128), c3f), 1);
}
+
+ diff.val[1] = vsubl_u8(a1_x1, a0_x1);
+ a32.val[1] = vmlal_u8(a16, a0_x1, vdup_n_u8(32));
+ res.val[1] = vmlaq_u16(a32.val[1], diff.val[1], shift.val[1]);
+ resy = vshrn_n_u16(res.val[1], 5);
+ uint8x8_t mask = vld1_u8(BaseMask[base_min_diff]);
+ resxy = vbsl_u8(mask, resy, resx);
+ vst1_u8(dst, resxy);
+ } else {
+ vst1_u8(dst, resx);
}
- diff.val[0] =
- vsubq_u16(vreinterpretq_u16_u8(a1_x.val[0]),
- vreinterpretq_u16_u8(a0_x.val[0])); // a[x+1] - a[x]
- diff.val[1] =
- vsubq_u16(vreinterpretq_u16_u8(a1_x.val[1]),
- vreinterpretq_u16_u8(a0_x.val[1])); // a[x+1] - a[x]
- a32.val[0] = vmlaq_u16(a16, vreinterpretq_u16_u8(a0_x.val[0]),
- v_32); // a[x] * 32 + 16
- a32.val[1] = vmlaq_u16(a16, vreinterpretq_u16_u8(a0_x.val[1]),
- v_32); // a[x] * 32 + 16
- res.val[0] = vmlaq_u16(a32.val[0], diff.val[0], shift.val[0]);
- res.val[1] = vmlaq_u16(a32.val[1], diff.val[1], shift.val[1]);
- resx = vshrn_n_u16(res.val[0], 5);
- resy = vshrn_n_u16(res.val[1], 5);
- uint8x8_t mask = vld1_u8(BaseMask[base_min_diff]);
-
- resxy = vorr_u8(vand_u8(mask, resy), vbic_u8(resx, mask));
- vst1_u8(dst, resxy);
dst += stride;
}
}
@@ -1371,22 +1849,17 @@
const int frac_bits_x = 6;
const int frac_bits_y = 6;
- uint16x8_t a16, c1, c3f;
- int16x8_t min_base_y256, dy256;
uint16x8x2_t a32, c0123, c1234, diff, shifty;
- uint8x16x2_t a0_x, a1_x, a0_y, a1_y;
- uint8x16_t a0_x128, a1_x128;
+ uint8x16x2_t a0_x, a1_x;
uint16x8_t v_32 = vdupq_n_u16(32);
uint8x16_t v_zero = vdupq_n_u8(0);
int16x8_t v_frac_bits_y = vdupq_n_s16(-frac_bits_y);
- DECLARE_ALIGNED(32, int16_t, base_y_c[16]);
-
- a16 = vdupq_n_u16(16);
- c1 = vshrq_n_u16(a16, 4);
- min_base_y256 = vdupq_n_s16(min_base_y);
- c3f = vdupq_n_u16(0x3f);
- dy256 = vdupq_n_s16(dy);
+ uint16x8_t a16 = vdupq_n_u16(16);
+ uint16x8_t c1 = vshrq_n_u16(a16, 4);
+ int16x8_t min_base_y256 = vdupq_n_s16(min_base_y);
+ uint16x8_t c3f = vdupq_n_u16(0x3f);
+ int16x8_t dy256 = vdupq_n_s16(dy);
c0123.val[0] = vcombine_u16(vcreate_u16(0x0003000200010000),
vcreate_u16(0x0007000600050004));
c0123.val[1] = vcombine_u16(vcreate_u16(0x000B000A00090008),
@@ -1394,12 +1867,25 @@
c1234.val[0] = vaddq_u16(c0123.val[0], c1);
c1234.val[1] = vaddq_u16(c0123.val[1], c1);
+#if AOM_ARCH_AARCH64
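+  // Build tbl lookup tables covering left[-1] through left[62]: left_vals0
+  // is offset by -1 and left_vals1 by 0, so the same indices (base_y + 1)
+  // gather left[base_y] and left[base_y + 1] respectively.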
+ const uint8x16_t left_m1 = vld1q_u8(left - 1);
+ const uint8x16_t left_0 = vld1q_u8(left + 0);
+ const uint8x16_t left_16 = vld1q_u8(left + 16);
+ const uint8x16_t left_32 = vld1q_u8(left + 32);
+ const uint8x16_t left_48 = vld1q_u8(left + 48);
+ const uint8x16_t left_15 = vextq_u8(left_0, left_16, 15);
+ const uint8x16_t left_31 = vextq_u8(left_16, left_32, 15);
+ const uint8x16_t left_47 = vextq_u8(left_32, left_48, 15);
+ const uint8x16x4_t left_vals0 = { { left_m1, left_15, left_31, left_47 } };
+ const uint8x16x4_t left_vals1 = { { left_0, left_16, left_32, left_48 } };
+#endif // AOM_ARCH_AARCH64
+
for (int r = 0; r < H; r++) {
uint16x8x2_t res, r6, shift;
- uint16x8_t ydx, j256;
+ uint16x8_t j256;
uint8x16_t resx, resy, resxy;
int y = r + 1;
- ydx = vdupq_n_u16((uint16_t)(y * dx));
+ uint16x8_t ydx = vdupq_n_u16((uint16_t)(y * dx));
int base_x = (-y * dx) >> frac_bits_x;
for (int j = 0; j < W; j += 16) {
@@ -1417,6 +1903,7 @@
}
if (base_shift < 16) {
+ uint8x16_t a0_x128, a1_x128;
a0_x128 = vld1q_u8(above + base_x + base_shift + j);
a1_x128 = vld1q_u8(above + base_x + base_shift + 1 + j);
vector_shuffle(&a0_x128, &v_zero, base_shift);
@@ -1471,19 +1958,20 @@
mask256.val[0] = vcgtq_s16(min_base_y256, base_y_c256.val[0]);
mask256.val[1] = vcgtq_s16(min_base_y256, base_y_c256.val[1]);
- base_y_c256.val[0] = vorrq_s16(
- vandq_s16(vreinterpretq_s16_u16(mask256.val[0]), min_base_y256),
- vbicq_s16(base_y_c256.val[0],
- vreinterpretq_s16_u16(mask256.val[0])));
- base_y_c256.val[1] = vorrq_s16(
- vandq_s16(vreinterpretq_s16_u16(mask256.val[1]), min_base_y256),
- vbicq_s16(base_y_c256.val[1],
- vreinterpretq_s16_u16(mask256.val[1])));
+ base_y_c256.val[0] =
+ vbslq_s16(mask256.val[0], min_base_y256, base_y_c256.val[0]);
+ base_y_c256.val[1] =
+ vbslq_s16(mask256.val[1], min_base_y256, base_y_c256.val[1]);
int16_t min_y = vgetq_lane_s16(base_y_c256.val[1], 7);
int16_t max_y = vgetq_lane_s16(base_y_c256.val[0], 0);
int16_t offset_diff = max_y - min_y;
+ uint8x8_t a0_y0;
+ uint8x8_t a0_y1;
+ uint8x8_t a1_y0;
+ uint8x8_t a1_y1;
+
if (offset_diff < 16) {
assert(offset_diff >= 0);
int16x8_t min_y256 =
@@ -1503,7 +1991,7 @@
a0_y128 = vandq_u8(a0_y128, v_loadmaskz2);
a1_y128 = vld1q_u8(left + min_y + 1);
a1_y128 = vandq_u8(a1_y128, v_loadmaskz2);
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
a0_y128 = vqtbl1q_u8(a0_y128, vreinterpretq_u8_s8(base_y_offset128));
a1_y128 = vqtbl1q_u8(a1_y128, vreinterpretq_u8_s8(base_y_offset128));
#else
@@ -1524,73 +2012,91 @@
v_res.val[1] = vtbl2_u8(v_tmp, v_index_high);
a1_y128 = vcombine_u8(v_res.val[0], v_res.val[1]);
#endif
- a0_y = vzipq_u8(a0_y128, v_zero);
- a1_y = vzipq_u8(a1_y128, v_zero);
+ a0_y0 = vget_low_u8(a0_y128);
+ a0_y1 = vget_high_u8(a0_y128);
+ a1_y0 = vget_low_u8(a1_y128);
+ a1_y1 = vget_high_u8(a1_y128);
} else {
+ // Values in base_y_c256 range from -1 through 62 inclusive.
base_y_c256.val[0] = vbicq_s16(base_y_c256.val[0],
vreinterpretq_s16_u16(mask256.val[0]));
base_y_c256.val[1] = vbicq_s16(base_y_c256.val[1],
vreinterpretq_s16_u16(mask256.val[1]));
+
+#if AOM_ARCH_AARCH64
+ // Values in left_idx{0,1} range from 0 through 63 inclusive.
+ uint8x16_t left_idx0 = vreinterpretq_u8_s16(base_y_c256.val[0] + 1);
+ uint8x16_t left_idx1 = vreinterpretq_u8_s16(base_y_c256.val[1] + 1);
+
+ uint8x16_t left_idx01 = vuzp1q_u8(left_idx0, left_idx1);
+
+ uint8x16_t a0_y01 = vqtbl4q_u8(left_vals0, left_idx01);
+ uint8x16_t a1_y01 = vqtbl4q_u8(left_vals1, left_idx01);
+
+ a0_y0 = vget_low_u8(a0_y01);
+ a0_y1 = vget_high_u8(a0_y01);
+ a1_y0 = vget_low_u8(a1_y01);
+ a1_y1 = vget_high_u8(a1_y01);
+#else // !AOM_ARCH_AARCH64
+ DECLARE_ALIGNED(32, int16_t, base_y_c[16]);
+
vst1q_s16(base_y_c, base_y_c256.val[0]);
vst1q_s16(base_y_c + 8, base_y_c256.val[1]);
- a0_y.val[0] = v_zero;
- a0_y.val[1] = v_zero;
- a0_y.val[0] = vld1q_lane_u8(left + base_y_c[0], a0_y.val[0], 0);
- a0_y.val[0] = vld1q_lane_u8(left + base_y_c[1], a0_y.val[0], 2);
- a0_y.val[0] = vld1q_lane_u8(left + base_y_c[2], a0_y.val[0], 4);
- a0_y.val[0] = vld1q_lane_u8(left + base_y_c[3], a0_y.val[0], 6);
- a0_y.val[0] = vld1q_lane_u8(left + base_y_c[4], a0_y.val[0], 8);
- a0_y.val[0] = vld1q_lane_u8(left + base_y_c[5], a0_y.val[0], 10);
- a0_y.val[0] = vld1q_lane_u8(left + base_y_c[6], a0_y.val[0], 12);
- a0_y.val[0] = vld1q_lane_u8(left + base_y_c[7], a0_y.val[0], 14);
- a0_y.val[1] = vld1q_lane_u8(left + base_y_c[8], a0_y.val[1], 0);
- a0_y.val[1] = vld1q_lane_u8(left + base_y_c[9], a0_y.val[1], 2);
- a0_y.val[1] = vld1q_lane_u8(left + base_y_c[10], a0_y.val[1], 4);
- a0_y.val[1] = vld1q_lane_u8(left + base_y_c[11], a0_y.val[1], 6);
- a0_y.val[1] = vld1q_lane_u8(left + base_y_c[12], a0_y.val[1], 8);
- a0_y.val[1] = vld1q_lane_u8(left + base_y_c[13], a0_y.val[1], 10);
- a0_y.val[1] = vld1q_lane_u8(left + base_y_c[14], a0_y.val[1], 12);
- a0_y.val[1] = vld1q_lane_u8(left + base_y_c[15], a0_y.val[1], 14);
+ a0_y0 = vdup_n_u8(0);
+ a0_y0 = vld1_lane_u8(left + base_y_c[0], a0_y0, 0);
+ a0_y0 = vld1_lane_u8(left + base_y_c[1], a0_y0, 1);
+ a0_y0 = vld1_lane_u8(left + base_y_c[2], a0_y0, 2);
+ a0_y0 = vld1_lane_u8(left + base_y_c[3], a0_y0, 3);
+ a0_y0 = vld1_lane_u8(left + base_y_c[4], a0_y0, 4);
+ a0_y0 = vld1_lane_u8(left + base_y_c[5], a0_y0, 5);
+ a0_y0 = vld1_lane_u8(left + base_y_c[6], a0_y0, 6);
+ a0_y0 = vld1_lane_u8(left + base_y_c[7], a0_y0, 7);
+ a0_y1 = vdup_n_u8(0);
+ a0_y1 = vld1_lane_u8(left + base_y_c[8], a0_y1, 0);
+ a0_y1 = vld1_lane_u8(left + base_y_c[9], a0_y1, 1);
+ a0_y1 = vld1_lane_u8(left + base_y_c[10], a0_y1, 2);
+ a0_y1 = vld1_lane_u8(left + base_y_c[11], a0_y1, 3);
+ a0_y1 = vld1_lane_u8(left + base_y_c[12], a0_y1, 4);
+ a0_y1 = vld1_lane_u8(left + base_y_c[13], a0_y1, 5);
+ a0_y1 = vld1_lane_u8(left + base_y_c[14], a0_y1, 6);
+ a0_y1 = vld1_lane_u8(left + base_y_c[15], a0_y1, 7);
base_y_c256.val[0] =
vaddq_s16(base_y_c256.val[0], vreinterpretq_s16_u16(c1));
base_y_c256.val[1] =
vaddq_s16(base_y_c256.val[1], vreinterpretq_s16_u16(c1));
+
vst1q_s16(base_y_c, base_y_c256.val[0]);
vst1q_s16(base_y_c + 8, base_y_c256.val[1]);
- a1_y.val[0] = v_zero;
- a1_y.val[1] = v_zero;
- a1_y.val[0] = vld1q_lane_u8(left + base_y_c[0], a1_y.val[0], 0);
- a1_y.val[0] = vld1q_lane_u8(left + base_y_c[1], a1_y.val[0], 2);
- a1_y.val[0] = vld1q_lane_u8(left + base_y_c[2], a1_y.val[0], 4);
- a1_y.val[0] = vld1q_lane_u8(left + base_y_c[3], a1_y.val[0], 6);
- a1_y.val[0] = vld1q_lane_u8(left + base_y_c[4], a1_y.val[0], 8);
- a1_y.val[0] = vld1q_lane_u8(left + base_y_c[5], a1_y.val[0], 10);
- a1_y.val[0] = vld1q_lane_u8(left + base_y_c[6], a1_y.val[0], 12);
- a1_y.val[0] = vld1q_lane_u8(left + base_y_c[7], a1_y.val[0], 14);
- a1_y.val[1] = vld1q_lane_u8(left + base_y_c[8], a1_y.val[1], 0);
- a1_y.val[1] = vld1q_lane_u8(left + base_y_c[9], a1_y.val[1], 2);
- a1_y.val[1] = vld1q_lane_u8(left + base_y_c[10], a1_y.val[1], 4);
- a1_y.val[1] = vld1q_lane_u8(left + base_y_c[11], a1_y.val[1], 6);
- a1_y.val[1] = vld1q_lane_u8(left + base_y_c[12], a1_y.val[1], 8);
- a1_y.val[1] = vld1q_lane_u8(left + base_y_c[13], a1_y.val[1], 10);
- a1_y.val[1] = vld1q_lane_u8(left + base_y_c[14], a1_y.val[1], 12);
- a1_y.val[1] = vld1q_lane_u8(left + base_y_c[15], a1_y.val[1], 14);
+ a1_y0 = vdup_n_u8(0);
+ a1_y0 = vld1_lane_u8(left + base_y_c[0], a1_y0, 0);
+ a1_y0 = vld1_lane_u8(left + base_y_c[1], a1_y0, 1);
+ a1_y0 = vld1_lane_u8(left + base_y_c[2], a1_y0, 2);
+ a1_y0 = vld1_lane_u8(left + base_y_c[3], a1_y0, 3);
+ a1_y0 = vld1_lane_u8(left + base_y_c[4], a1_y0, 4);
+ a1_y0 = vld1_lane_u8(left + base_y_c[5], a1_y0, 5);
+ a1_y0 = vld1_lane_u8(left + base_y_c[6], a1_y0, 6);
+ a1_y0 = vld1_lane_u8(left + base_y_c[7], a1_y0, 7);
+ a1_y1 = vdup_n_u8(0);
+ a1_y1 = vld1_lane_u8(left + base_y_c[8], a1_y1, 0);
+ a1_y1 = vld1_lane_u8(left + base_y_c[9], a1_y1, 1);
+ a1_y1 = vld1_lane_u8(left + base_y_c[10], a1_y1, 2);
+ a1_y1 = vld1_lane_u8(left + base_y_c[11], a1_y1, 3);
+ a1_y1 = vld1_lane_u8(left + base_y_c[12], a1_y1, 4);
+ a1_y1 = vld1_lane_u8(left + base_y_c[13], a1_y1, 5);
+ a1_y1 = vld1_lane_u8(left + base_y_c[14], a1_y1, 6);
+ a1_y1 = vld1_lane_u8(left + base_y_c[15], a1_y1, 7);
+#endif // AOM_ARCH_AARCH64
}
+
shifty.val[0] = vshrq_n_u16(
vandq_u16(vreinterpretq_u16_s16(y_c256.val[0]), c3f), 1);
shifty.val[1] = vshrq_n_u16(
vandq_u16(vreinterpretq_u16_s16(y_c256.val[1]), c3f), 1);
- diff.val[0] =
- vsubq_u16(vreinterpretq_u16_u8(a1_y.val[0]),
- vreinterpretq_u16_u8(a0_y.val[0])); // a[x+1] - a[x]
- diff.val[1] =
- vsubq_u16(vreinterpretq_u16_u8(a1_y.val[1]),
- vreinterpretq_u16_u8(a0_y.val[1])); // a[x+1] - a[x]
- a32.val[0] = vmlaq_u16(a16, vreinterpretq_u16_u8(a0_y.val[0]),
- v_32); // a[x] * 32 + 16
- a32.val[1] = vmlaq_u16(a16, vreinterpretq_u16_u8(a0_y.val[1]),
- v_32); // a[x] * 32 + 16
+ diff.val[0] = vsubl_u8(a1_y0, a0_y0); // a[x+1] - a[x]
+ diff.val[1] = vsubl_u8(a1_y1, a0_y1); // a[x+1] - a[x]
+ a32.val[0] = vmlal_u8(a16, a0_y0, vdup_n_u8(32)); // a[x] * 32 + 16
+ a32.val[1] = vmlal_u8(a16, a0_y1, vdup_n_u8(32)); // a[x] * 32 + 16
res.val[0] = vmlaq_u16(a32.val[0], diff.val[0], shifty.val[0]);
res.val[1] = vmlaq_u16(a32.val[1], diff.val[1], shifty.val[1]);
@@ -1600,7 +2106,7 @@
resy = v_zero;
}
uint8x16_t mask = vld1q_u8(BaseMask[base_min_diff]);
- resxy = vorrq_u8(vandq_u8(mask, resy), vbicq_u8(resx, mask));
+ resxy = vbslq_u8(mask, resy, resx);
vst1q_u8(dst + j, resxy);
} // for j
dst += stride;
@@ -1629,7 +2135,6 @@
upsample_above, upsample_left, dx, dy);
break;
}
- return;
}
/* ---------------------P R E D I C T I O N Z 3--------------------------- */
@@ -1813,7 +2318,7 @@
w11 = vzipq_u32(vreinterpretq_u32_u16(w6.val[1]),
vreinterpretq_u32_u16(w7.val[1]));
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
d[0] = vzip1q_u64(vreinterpretq_u64_u32(w8.val[0]),
vreinterpretq_u64_u32(w9.val[0]));
d[1] = vzip2q_u64(vreinterpretq_u64_u32(w8.val[0]),
@@ -1883,7 +2388,7 @@
w15 = vzipq_u32(vreinterpretq_u32_u16(w10.val[1]),
vreinterpretq_u32_u16(w11.val[1]));
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
d[0] = vzip1q_u64(vreinterpretq_u64_u32(w12.val[0]),
vreinterpretq_u64_u32(w13.val[0]));
d[1] = vzip2q_u64(vreinterpretq_u64_u32(w12.val[0]),
@@ -1938,7 +2443,7 @@
w15 = vzipq_u32(vreinterpretq_u32_u16(w10.val[1]),
vreinterpretq_u32_u16(w11.val[1]));
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
d[8] = vzip1q_u64(vreinterpretq_u64_u32(w12.val[0]),
vreinterpretq_u64_u32(w13.val[0]));
d[9] = vzip2q_u64(vreinterpretq_u64_u32(w12.val[0]),
@@ -2011,7 +2516,7 @@
// Store first 4-line result
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
d[0].val[0] = vzip1q_u64(vreinterpretq_u64_u32(w6.val[0]),
vreinterpretq_u64_u32(w14.val[0]));
d[0].val[1] = vzip2q_u64(vreinterpretq_u64_u32(w6.val[0]),
@@ -2067,7 +2572,7 @@
// Store second 4-line result
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
d[4].val[0] = vzip1q_u64(vreinterpretq_u64_u32(w6.val[0]),
vreinterpretq_u64_u32(w14.val[0]));
d[4].val[1] = vzip2q_u64(vreinterpretq_u64_u32(w6.val[0]),
@@ -2134,7 +2639,7 @@
// Store first 4-line result
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
d[8].val[0] = vzip1q_u64(vreinterpretq_u64_u32(w6.val[0]),
vreinterpretq_u64_u32(w14.val[0]));
d[8].val[1] = vzip2q_u64(vreinterpretq_u64_u32(w6.val[0]),
@@ -2190,7 +2695,7 @@
// Store second 4-line result
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
d[12].val[0] = vzip1q_u64(vreinterpretq_u64_u32(w6.val[0]),
vreinterpretq_u64_u32(w14.val[0]));
d[12].val[1] = vzip2q_u64(vreinterpretq_u64_u32(w6.val[0]),
@@ -3212,7 +3717,7 @@
int width, int height) {
const uint8x8_t top_left = vdup_n_u8(top_row[-1]);
const uint16x8_t top_left_x2 = vdupq_n_u16(top_row[-1] + top_row[-1]);
- uint8x8_t top;
+ uint8x8_t UNINITIALIZED_IS_SAFE(top);
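+  // top is fully written by one of the two branches below; the macro only
+  // suppresses spurious -Wmaybe-uninitialized warnings about the
+  // partial-lane load in the width == 4 case.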
if (width == 4) {
load_u8_4x1(top_row, &top, 0);
} else { // width == 8
diff --git a/aom_dsp/arm/loopfilter_neon.c b/aom_dsp/arm/loopfilter_neon.c
index f3f86a2..8fc7ccb 100644
--- a/aom_dsp/arm/loopfilter_neon.c
+++ b/aom_dsp/arm/loopfilter_neon.c
@@ -628,7 +628,7 @@
// row1: x p6 p5 p4 p3 p2 p1 p0 | q0 q1 q2 q3 q4 q5 q6 y
// row2: x p6 p5 p4 p3 p2 p1 p0 | q0 q1 q2 q3 q4 q5 q6 y
// row3: x p6 p5 p4 p3 p2 p1 p0 | q0 q1 q2 q3 q4 q5 q6 y
- load_u8_8x16(src - 8, stride, &row0, &row1, &row2, &row3);
+ load_u8_16x4(src - 8, stride, &row0, &row1, &row2, &row3);
pxp3 = vget_low_u8(row0);
p6p2 = vget_low_u8(row1);
@@ -841,8 +841,7 @@
// row1: p1 p0 | q0 q1
// row2: p1 p0 | q0 q1
// row3: p1 p0 | q0 q1
- load_unaligned_u8_4x4(src - 2, stride, (uint32x2_t *)&p1p0,
- (uint32x2_t *)&q0q1);
+ load_unaligned_u8_4x4(src - 2, stride, &p1p0, &q0q1);
transpose_u8_4x4(&p1p0, &q0q1);
@@ -1037,7 +1036,7 @@
void aom_lpf_horizontal_4_neon(uint8_t *src, int stride, const uint8_t *blimit,
const uint8_t *limit, const uint8_t *thresh) {
- uint8x8_t p0q0, UNINITIALIZED_IS_SAFE(p1q1);
+ uint8x8_t UNINITIALIZED_IS_SAFE(p0q0), UNINITIALIZED_IS_SAFE(p1q1);
load_u8_4x1(src - 2 * stride, &p1q1, 0);
load_u8_4x1(src - 1 * stride, &p0q0, 0);
diff --git a/aom_dsp/arm/masked_sad4d_neon.c b/aom_dsp/arm/masked_sad4d_neon.c
new file mode 100644
index 0000000..98daeda
--- /dev/null
+++ b/aom_dsp/arm/masked_sad4d_neon.c
@@ -0,0 +1,563 @@
+/*
+ * Copyright (c) 2023, Alliance for Open Media. All rights reserved
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#include <arm_neon.h>
+
+#include "config/aom_config.h"
+#include "config/aom_dsp_rtcd.h"
+#include "aom/aom_integer.h"
+#include "aom_dsp/blend.h"
+#include "mem_neon.h"
+#include "sum_neon.h"
+
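+// Blend one row of 16 pixels from a0 and b0 with the 6-bit alpha mask m0,
+// pred = (m0 * a0 + (64 - m0) * b0 + 32) >> 6, and accumulate the absolute
+// differences against the source row s0 pairwise into the eight 16-bit
+// lanes of sad.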
+static INLINE uint16x8_t masked_sad_16x1_neon(uint16x8_t sad,
+ const uint8x16_t s0,
+ const uint8x16_t a0,
+ const uint8x16_t b0,
+ const uint8x16_t m0) {
+ uint8x16_t m0_inv = vsubq_u8(vdupq_n_u8(AOM_BLEND_A64_MAX_ALPHA), m0);
+ uint16x8_t blend_u16_lo = vmull_u8(vget_low_u8(m0), vget_low_u8(a0));
+ uint16x8_t blend_u16_hi = vmull_u8(vget_high_u8(m0), vget_high_u8(a0));
+ blend_u16_lo = vmlal_u8(blend_u16_lo, vget_low_u8(m0_inv), vget_low_u8(b0));
+ blend_u16_hi = vmlal_u8(blend_u16_hi, vget_high_u8(m0_inv), vget_high_u8(b0));
+
+ uint8x8_t blend_u8_lo = vrshrn_n_u16(blend_u16_lo, AOM_BLEND_A64_ROUND_BITS);
+ uint8x8_t blend_u8_hi = vrshrn_n_u16(blend_u16_hi, AOM_BLEND_A64_ROUND_BITS);
+ uint8x16_t blend_u8 = vcombine_u8(blend_u8_lo, blend_u8_hi);
+ return vpadalq_u8(sad, vabdq_u8(blend_u8, s0));
+}
+
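+// Masked 4D SAD for large blocks, with the mask weighting second_pred and
+// its complement weighting the reference (the inverted-mask case). Row sums
+// are kept in 16-bit lanes, which can only absorb h_overflow rows of
+// absolute differences without overflowing, so they are flushed into the
+// 32-bit accumulators every h_overflow rows (32 rows at width 128, 64 rows
+// at width 64, as set by the wrappers below).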
+static INLINE void masked_inv_sadwxhx4d_large_neon(
+ const uint8_t *src, int src_stride, const uint8_t *const ref[4],
+ int ref_stride, const uint8_t *second_pred, const uint8_t *mask,
+ int mask_stride, uint32_t res[4], int width, int height, int h_overflow) {
+ uint32x4_t sum[4] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0),
+ vdupq_n_u32(0) };
+ int h_limit = height > h_overflow ? h_overflow : height;
+
+ int ref_offset = 0;
+ int i = 0;
+ do {
+ uint16x8_t sum_lo[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0) };
+ uint16x8_t sum_hi[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0) };
+
+ do {
+ int j = 0;
+ do {
+ uint8x16_t s0 = vld1q_u8(src + j);
+ uint8x16_t p0 = vld1q_u8(second_pred + j);
+ uint8x16_t m0 = vld1q_u8(mask + j);
+ sum_lo[0] = masked_sad_16x1_neon(sum_lo[0], s0, p0,
+ vld1q_u8(ref[0] + ref_offset + j), m0);
+ sum_lo[1] = masked_sad_16x1_neon(sum_lo[1], s0, p0,
+ vld1q_u8(ref[1] + ref_offset + j), m0);
+ sum_lo[2] = masked_sad_16x1_neon(sum_lo[2], s0, p0,
+ vld1q_u8(ref[2] + ref_offset + j), m0);
+ sum_lo[3] = masked_sad_16x1_neon(sum_lo[3], s0, p0,
+ vld1q_u8(ref[3] + ref_offset + j), m0);
+
+ uint8x16_t s1 = vld1q_u8(src + j + 16);
+ uint8x16_t p1 = vld1q_u8(second_pred + j + 16);
+ uint8x16_t m1 = vld1q_u8(mask + j + 16);
+ sum_hi[0] = masked_sad_16x1_neon(
+ sum_hi[0], s1, p1, vld1q_u8(ref[0] + ref_offset + j + 16), m1);
+ sum_hi[1] = masked_sad_16x1_neon(
+ sum_hi[1], s1, p1, vld1q_u8(ref[1] + ref_offset + j + 16), m1);
+ sum_hi[2] = masked_sad_16x1_neon(
+ sum_hi[2], s1, p1, vld1q_u8(ref[2] + ref_offset + j + 16), m1);
+ sum_hi[3] = masked_sad_16x1_neon(
+ sum_hi[3], s1, p1, vld1q_u8(ref[3] + ref_offset + j + 16), m1);
+
+ j += 32;
+ } while (j < width);
+
+ src += src_stride;
+ ref_offset += ref_stride;
+ second_pred += width;
+ mask += mask_stride;
+ } while (++i < h_limit);
+
+ sum[0] = vpadalq_u16(sum[0], sum_lo[0]);
+ sum[0] = vpadalq_u16(sum[0], sum_hi[0]);
+ sum[1] = vpadalq_u16(sum[1], sum_lo[1]);
+ sum[1] = vpadalq_u16(sum[1], sum_hi[1]);
+ sum[2] = vpadalq_u16(sum[2], sum_lo[2]);
+ sum[2] = vpadalq_u16(sum[2], sum_hi[2]);
+ sum[3] = vpadalq_u16(sum[3], sum_lo[3]);
+ sum[3] = vpadalq_u16(sum[3], sum_hi[3]);
+
+ h_limit += h_overflow;
+ } while (i < height);
+
+ vst1q_u32(res, horizontal_add_4d_u32x4(sum));
+}
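
The h_overflow chunking above bounds the 16-bit accumulators: each masked_sad_16x1_neon call adds at most 2 * 255 = 510 to a uint16 lane (vpadalq_u8 of two absolute differences), and for width 128 the j-loop touches each sum_lo/sum_hi lane four times per row, i.e. at most 4 * 510 = 2040 per row. Flushing into the 32-bit sums every h_overflow = 32 rows therefore caps a lane at 32 * 2040 = 65280 < 65535; the width-64 variant below uses h_overflow = 64 for the same bound (64 * 2 * 510 = 65280).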
+
+static INLINE void masked_inv_sad128xhx4d_neon(
+ const uint8_t *src, int src_stride, const uint8_t *const ref[4],
+ int ref_stride, const uint8_t *second_pred, const uint8_t *mask,
+ int mask_stride, uint32_t res[4], int h) {
+ masked_inv_sadwxhx4d_large_neon(src, src_stride, ref, ref_stride, second_pred,
+ mask, mask_stride, res, 128, h, 32);
+}
+
+static INLINE void masked_inv_sad64xhx4d_neon(
+ const uint8_t *src, int src_stride, const uint8_t *const ref[4],
+ int ref_stride, const uint8_t *second_pred, const uint8_t *mask,
+ int mask_stride, uint32_t res[4], int h) {
+ masked_inv_sadwxhx4d_large_neon(src, src_stride, ref, ref_stride, second_pred,
+ mask, mask_stride, res, 64, h, 64);
+}
+
+static INLINE void masked_sadwxhx4d_large_neon(
+ const uint8_t *src, int src_stride, const uint8_t *const ref[4],
+ int ref_stride, const uint8_t *second_pred, const uint8_t *mask,
+ int mask_stride, uint32_t res[4], int width, int height, int h_overflow) {
+ uint32x4_t sum[4] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0),
+ vdupq_n_u32(0) };
+ int h_limit = height > h_overflow ? h_overflow : height;
+
+ int ref_offset = 0;
+ int i = 0;
+ do {
+ uint16x8_t sum_lo[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0) };
+ uint16x8_t sum_hi[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0) };
+
+ do {
+ int j = 0;
+ do {
+ uint8x16_t s0 = vld1q_u8(src + j);
+ uint8x16_t p0 = vld1q_u8(second_pred + j);
+ uint8x16_t m0 = vld1q_u8(mask + j);
+ sum_lo[0] = masked_sad_16x1_neon(
+ sum_lo[0], s0, vld1q_u8(ref[0] + ref_offset + j), p0, m0);
+ sum_lo[1] = masked_sad_16x1_neon(
+ sum_lo[1], s0, vld1q_u8(ref[1] + ref_offset + j), p0, m0);
+ sum_lo[2] = masked_sad_16x1_neon(
+ sum_lo[2], s0, vld1q_u8(ref[2] + ref_offset + j), p0, m0);
+ sum_lo[3] = masked_sad_16x1_neon(
+ sum_lo[3], s0, vld1q_u8(ref[3] + ref_offset + j), p0, m0);
+
+ uint8x16_t s1 = vld1q_u8(src + j + 16);
+ uint8x16_t p1 = vld1q_u8(second_pred + j + 16);
+ uint8x16_t m1 = vld1q_u8(mask + j + 16);
+ sum_hi[0] = masked_sad_16x1_neon(
+ sum_hi[0], s1, vld1q_u8(ref[0] + ref_offset + j + 16), p1, m1);
+ sum_hi[1] = masked_sad_16x1_neon(
+ sum_hi[1], s1, vld1q_u8(ref[1] + ref_offset + j + 16), p1, m1);
+ sum_hi[2] = masked_sad_16x1_neon(
+ sum_hi[2], s1, vld1q_u8(ref[2] + ref_offset + j + 16), p1, m1);
+ sum_hi[3] = masked_sad_16x1_neon(
+ sum_hi[3], s1, vld1q_u8(ref[3] + ref_offset + j + 16), p1, m1);
+
+ j += 32;
+ } while (j < width);
+
+ src += src_stride;
+ ref_offset += ref_stride;
+ second_pred += width;
+ mask += mask_stride;
+ } while (++i < h_limit);
+
+ sum[0] = vpadalq_u16(sum[0], sum_lo[0]);
+ sum[0] = vpadalq_u16(sum[0], sum_hi[0]);
+ sum[1] = vpadalq_u16(sum[1], sum_lo[1]);
+ sum[1] = vpadalq_u16(sum[1], sum_hi[1]);
+ sum[2] = vpadalq_u16(sum[2], sum_lo[2]);
+ sum[2] = vpadalq_u16(sum[2], sum_hi[2]);
+ sum[3] = vpadalq_u16(sum[3], sum_lo[3]);
+ sum[3] = vpadalq_u16(sum[3], sum_hi[3]);
+
+ h_limit += h_overflow;
+ } while (i < height);
+
+ vst1q_u32(res, horizontal_add_4d_u32x4(sum));
+}
+
+static INLINE void masked_sad128xhx4d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4],
+ int ref_stride,
+ const uint8_t *second_pred,
+ const uint8_t *mask, int mask_stride,
+ uint32_t res[4], int h) {
+ masked_sadwxhx4d_large_neon(src, src_stride, ref, ref_stride, second_pred,
+ mask, mask_stride, res, 128, h, 32);
+}
+
+static INLINE void masked_sad64xhx4d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4],
+ int ref_stride,
+ const uint8_t *second_pred,
+ const uint8_t *mask, int mask_stride,
+ uint32_t res[4], int h) {
+ masked_sadwxhx4d_large_neon(src, src_stride, ref, ref_stride, second_pred,
+ mask, mask_stride, res, 64, h, 64);
+}
+
+static INLINE void masked_inv_sad32xhx4d_neon(
+ const uint8_t *src, int src_stride, const uint8_t *const ref[4],
+ int ref_stride, const uint8_t *second_pred, const uint8_t *mask,
+ int mask_stride, uint32_t res[4], int h) {
+ uint16x8_t sum_lo[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0) };
+ uint16x8_t sum_hi[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0) };
+
+ int ref_offset = 0;
+ int i = h;
+ do {
+ uint8x16_t s0 = vld1q_u8(src);
+ uint8x16_t p0 = vld1q_u8(second_pred);
+ uint8x16_t m0 = vld1q_u8(mask);
+ sum_lo[0] = masked_sad_16x1_neon(sum_lo[0], s0, p0,
+ vld1q_u8(ref[0] + ref_offset), m0);
+ sum_lo[1] = masked_sad_16x1_neon(sum_lo[1], s0, p0,
+ vld1q_u8(ref[1] + ref_offset), m0);
+ sum_lo[2] = masked_sad_16x1_neon(sum_lo[2], s0, p0,
+ vld1q_u8(ref[2] + ref_offset), m0);
+ sum_lo[3] = masked_sad_16x1_neon(sum_lo[3], s0, p0,
+ vld1q_u8(ref[3] + ref_offset), m0);
+
+ uint8x16_t s1 = vld1q_u8(src + 16);
+ uint8x16_t p1 = vld1q_u8(second_pred + 16);
+ uint8x16_t m1 = vld1q_u8(mask + 16);
+ sum_hi[0] = masked_sad_16x1_neon(sum_hi[0], s1, p1,
+ vld1q_u8(ref[0] + ref_offset + 16), m1);
+ sum_hi[1] = masked_sad_16x1_neon(sum_hi[1], s1, p1,
+ vld1q_u8(ref[1] + ref_offset + 16), m1);
+ sum_hi[2] = masked_sad_16x1_neon(sum_hi[2], s1, p1,
+ vld1q_u8(ref[2] + ref_offset + 16), m1);
+ sum_hi[3] = masked_sad_16x1_neon(sum_hi[3], s1, p1,
+ vld1q_u8(ref[3] + ref_offset + 16), m1);
+
+ src += src_stride;
+ ref_offset += ref_stride;
+ second_pred += 32;
+ mask += mask_stride;
+ } while (--i != 0);
+
+ vst1q_u32(res, horizontal_long_add_4d_u16x8(sum_lo, sum_hi));
+}
+
+static INLINE void masked_sad32xhx4d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4],
+ int ref_stride,
+ const uint8_t *second_pred,
+ const uint8_t *mask, int mask_stride,
+ uint32_t res[4], int h) {
+ uint16x8_t sum_lo[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0) };
+ uint16x8_t sum_hi[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0) };
+
+ int ref_offset = 0;
+ int i = h;
+ do {
+ uint8x16_t s0 = vld1q_u8(src);
+ uint8x16_t p0 = vld1q_u8(second_pred);
+ uint8x16_t m0 = vld1q_u8(mask);
+ sum_lo[0] = masked_sad_16x1_neon(sum_lo[0], s0,
+ vld1q_u8(ref[0] + ref_offset), p0, m0);
+ sum_lo[1] = masked_sad_16x1_neon(sum_lo[1], s0,
+ vld1q_u8(ref[1] + ref_offset), p0, m0);
+ sum_lo[2] = masked_sad_16x1_neon(sum_lo[2], s0,
+ vld1q_u8(ref[2] + ref_offset), p0, m0);
+ sum_lo[3] = masked_sad_16x1_neon(sum_lo[3], s0,
+ vld1q_u8(ref[3] + ref_offset), p0, m0);
+
+ uint8x16_t s1 = vld1q_u8(src + 16);
+ uint8x16_t p1 = vld1q_u8(second_pred + 16);
+ uint8x16_t m1 = vld1q_u8(mask + 16);
+ sum_hi[0] = masked_sad_16x1_neon(
+ sum_hi[0], s1, vld1q_u8(ref[0] + ref_offset + 16), p1, m1);
+ sum_hi[1] = masked_sad_16x1_neon(
+ sum_hi[1], s1, vld1q_u8(ref[1] + ref_offset + 16), p1, m1);
+ sum_hi[2] = masked_sad_16x1_neon(
+ sum_hi[2], s1, vld1q_u8(ref[2] + ref_offset + 16), p1, m1);
+ sum_hi[3] = masked_sad_16x1_neon(
+ sum_hi[3], s1, vld1q_u8(ref[3] + ref_offset + 16), p1, m1);
+
+ src += src_stride;
+ ref_offset += ref_stride;
+ second_pred += 32;
+ mask += mask_stride;
+ } while (--i != 0);
+
+ vst1q_u32(res, horizontal_long_add_4d_u16x8(sum_lo, sum_hi));
+}
+
+static INLINE void masked_inv_sad16xhx4d_neon(
+ const uint8_t *src, int src_stride, const uint8_t *const ref[4],
+ int ref_stride, const uint8_t *second_pred, const uint8_t *mask,
+ int mask_stride, uint32_t res[4], int h) {
+ uint16x8_t sum_u16[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0) };
+ uint32x4_t sum_u32[4];
+
+ int ref_offset = 0;
+ int i = h;
+ do {
+ uint8x16_t s0 = vld1q_u8(src);
+ uint8x16_t p0 = vld1q_u8(second_pred);
+ uint8x16_t m0 = vld1q_u8(mask);
+ sum_u16[0] = masked_sad_16x1_neon(sum_u16[0], s0, p0,
+ vld1q_u8(ref[0] + ref_offset), m0);
+ sum_u16[1] = masked_sad_16x1_neon(sum_u16[1], s0, p0,
+ vld1q_u8(ref[1] + ref_offset), m0);
+ sum_u16[2] = masked_sad_16x1_neon(sum_u16[2], s0, p0,
+ vld1q_u8(ref[2] + ref_offset), m0);
+ sum_u16[3] = masked_sad_16x1_neon(sum_u16[3], s0, p0,
+ vld1q_u8(ref[3] + ref_offset), m0);
+
+ src += src_stride;
+ ref_offset += ref_stride;
+ second_pred += 16;
+ mask += mask_stride;
+ } while (--i != 0);
+
+ sum_u32[0] = vpaddlq_u16(sum_u16[0]);
+ sum_u32[1] = vpaddlq_u16(sum_u16[1]);
+ sum_u32[2] = vpaddlq_u16(sum_u16[2]);
+ sum_u32[3] = vpaddlq_u16(sum_u16[3]);
+
+ vst1q_u32(res, horizontal_add_4d_u32x4(sum_u32));
+}
+
+static INLINE void masked_sad16xhx4d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4],
+ int ref_stride,
+ const uint8_t *second_pred,
+ const uint8_t *mask, int mask_stride,
+ uint32_t res[4], int h) {
+ uint16x8_t sum_u16[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0) };
+ uint32x4_t sum_u32[4];
+
+ int ref_offset = 0;
+ int i = h;
+ do {
+ uint8x16_t s0 = vld1q_u8(src);
+ uint8x16_t p0 = vld1q_u8(second_pred);
+ uint8x16_t m0 = vld1q_u8(mask);
+ sum_u16[0] = masked_sad_16x1_neon(sum_u16[0], s0,
+ vld1q_u8(ref[0] + ref_offset), p0, m0);
+ sum_u16[1] = masked_sad_16x1_neon(sum_u16[1], s0,
+ vld1q_u8(ref[1] + ref_offset), p0, m0);
+ sum_u16[2] = masked_sad_16x1_neon(sum_u16[2], s0,
+ vld1q_u8(ref[2] + ref_offset), p0, m0);
+ sum_u16[3] = masked_sad_16x1_neon(sum_u16[3], s0,
+ vld1q_u8(ref[3] + ref_offset), p0, m0);
+
+ src += src_stride;
+ ref_offset += ref_stride;
+ second_pred += 16;
+ mask += mask_stride;
+ } while (--i != 0);
+
+ sum_u32[0] = vpaddlq_u16(sum_u16[0]);
+ sum_u32[1] = vpaddlq_u16(sum_u16[1]);
+ sum_u32[2] = vpaddlq_u16(sum_u16[2]);
+ sum_u32[3] = vpaddlq_u16(sum_u16[3]);
+
+ vst1q_u32(res, horizontal_add_4d_u32x4(sum_u32));
+}
+
+static INLINE uint16x8_t masked_sad_8x1_neon(uint16x8_t sad, const uint8x8_t s0,
+ const uint8x8_t a0,
+ const uint8x8_t b0,
+ const uint8x8_t m0) {
+ uint8x8_t m0_inv = vsub_u8(vdup_n_u8(AOM_BLEND_A64_MAX_ALPHA), m0);
+ uint16x8_t blend_u16 = vmull_u8(m0, a0);
+ blend_u16 = vmlal_u8(blend_u16, m0_inv, b0);
+
+ uint8x8_t blend_u8 = vrshrn_n_u16(blend_u16, AOM_BLEND_A64_ROUND_BITS);
+ return vabal_u8(sad, blend_u8, s0);
+}
+
+static INLINE void masked_inv_sad8xhx4d_neon(
+ const uint8_t *src, int src_stride, const uint8_t *const ref[4],
+ int ref_stride, const uint8_t *second_pred, const uint8_t *mask,
+ int mask_stride, uint32_t res[4], int h) {
+ uint16x8_t sum[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0) };
+
+ int ref_offset = 0;
+ int i = h;
+ do {
+ uint8x8_t s0 = vld1_u8(src);
+ uint8x8_t p0 = vld1_u8(second_pred);
+ uint8x8_t m0 = vld1_u8(mask);
+ sum[0] =
+ masked_sad_8x1_neon(sum[0], s0, p0, vld1_u8(ref[0] + ref_offset), m0);
+ sum[1] =
+ masked_sad_8x1_neon(sum[1], s0, p0, vld1_u8(ref[1] + ref_offset), m0);
+ sum[2] =
+ masked_sad_8x1_neon(sum[2], s0, p0, vld1_u8(ref[2] + ref_offset), m0);
+ sum[3] =
+ masked_sad_8x1_neon(sum[3], s0, p0, vld1_u8(ref[3] + ref_offset), m0);
+
+ src += src_stride;
+ ref_offset += ref_stride;
+ second_pred += 8;
+ mask += mask_stride;
+ } while (--i != 0);
+
+ vst1q_u32(res, horizontal_add_4d_u16x8(sum));
+}
+
+static INLINE void masked_sad8xhx4d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4],
+ int ref_stride,
+ const uint8_t *second_pred,
+ const uint8_t *mask, int mask_stride,
+ uint32_t res[4], int h) {
+ uint16x8_t sum[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0) };
+
+ int ref_offset = 0;
+ int i = h;
+ do {
+ uint8x8_t s0 = vld1_u8(src);
+ uint8x8_t p0 = vld1_u8(second_pred);
+ uint8x8_t m0 = vld1_u8(mask);
+
+ sum[0] =
+ masked_sad_8x1_neon(sum[0], s0, vld1_u8(ref[0] + ref_offset), p0, m0);
+ sum[1] =
+ masked_sad_8x1_neon(sum[1], s0, vld1_u8(ref[1] + ref_offset), p0, m0);
+ sum[2] =
+ masked_sad_8x1_neon(sum[2], s0, vld1_u8(ref[2] + ref_offset), p0, m0);
+ sum[3] =
+ masked_sad_8x1_neon(sum[3], s0, vld1_u8(ref[3] + ref_offset), p0, m0);
+
+ src += src_stride;
+ ref_offset += ref_stride;
+ second_pred += 8;
+ mask += mask_stride;
+ } while (--i != 0);
+
+ vst1q_u32(res, horizontal_add_4d_u16x8(sum));
+}
+
+static INLINE void masked_inv_sad4xhx4d_neon(
+ const uint8_t *src, int src_stride, const uint8_t *const ref[4],
+ int ref_stride, const uint8_t *second_pred, const uint8_t *mask,
+ int mask_stride, uint32_t res[4], int h) {
+ uint16x8_t sum[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0) };
+
+ int ref_offset = 0;
+ int i = h / 2;
+ do {
+ uint8x8_t s = load_unaligned_u8(src, src_stride);
+ uint8x8_t r0 = load_unaligned_u8(ref[0] + ref_offset, ref_stride);
+ uint8x8_t r1 = load_unaligned_u8(ref[1] + ref_offset, ref_stride);
+ uint8x8_t r2 = load_unaligned_u8(ref[2] + ref_offset, ref_stride);
+ uint8x8_t r3 = load_unaligned_u8(ref[3] + ref_offset, ref_stride);
+ uint8x8_t p0 = vld1_u8(second_pred);
+ uint8x8_t m0 = load_unaligned_u8(mask, mask_stride);
+
+ sum[0] = masked_sad_8x1_neon(sum[0], s, p0, r0, m0);
+ sum[1] = masked_sad_8x1_neon(sum[1], s, p0, r1, m0);
+ sum[2] = masked_sad_8x1_neon(sum[2], s, p0, r2, m0);
+ sum[3] = masked_sad_8x1_neon(sum[3], s, p0, r3, m0);
+
+ src += 2 * src_stride;
+ ref_offset += 2 * ref_stride;
+ second_pred += 2 * 4;
+ mask += 2 * mask_stride;
+ } while (--i != 0);
+
+ vst1q_u32(res, horizontal_add_4d_u16x8(sum));
+}
+
+static INLINE void masked_sad4xhx4d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4],
+ int ref_stride,
+ const uint8_t *second_pred,
+ const uint8_t *mask, int mask_stride,
+ uint32_t res[4], int h) {
+ uint16x8_t sum[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0) };
+
+ int ref_offset = 0;
+ int i = h / 2;
+ do {
+ uint8x8_t s = load_unaligned_u8(src, src_stride);
+ uint8x8_t r0 = load_unaligned_u8(ref[0] + ref_offset, ref_stride);
+ uint8x8_t r1 = load_unaligned_u8(ref[1] + ref_offset, ref_stride);
+ uint8x8_t r2 = load_unaligned_u8(ref[2] + ref_offset, ref_stride);
+ uint8x8_t r3 = load_unaligned_u8(ref[3] + ref_offset, ref_stride);
+ uint8x8_t p0 = vld1_u8(second_pred);
+ uint8x8_t m0 = load_unaligned_u8(mask, mask_stride);
+
+ sum[0] = masked_sad_8x1_neon(sum[0], s, r0, p0, m0);
+ sum[1] = masked_sad_8x1_neon(sum[1], s, r1, p0, m0);
+ sum[2] = masked_sad_8x1_neon(sum[2], s, r2, p0, m0);
+ sum[3] = masked_sad_8x1_neon(sum[3], s, r3, p0, m0);
+
+ src += 2 * src_stride;
+ ref_offset += 2 * ref_stride;
+ second_pred += 2 * 4;
+ mask += 2 * mask_stride;
+ } while (--i != 0);
+
+ vst1q_u32(res, horizontal_add_4d_u16x8(sum));
+}
+
+#define MASKED_SAD4D_WXH_NEON(w, h) \
+ void aom_masked_sad##w##x##h##x4d_neon( \
+ const uint8_t *src, int src_stride, const uint8_t *ref[4], \
+ int ref_stride, const uint8_t *second_pred, const uint8_t *msk, \
+ int msk_stride, int invert_mask, uint32_t res[4]) { \
+ if (invert_mask) { \
+ return masked_inv_sad##w##xhx4d_neon(src, src_stride, ref, ref_stride, \
+ second_pred, msk, msk_stride, res, \
+ h); \
+ } else { \
+ return masked_sad##w##xhx4d_neon(src, src_stride, ref, ref_stride, \
+ second_pred, msk, msk_stride, res, h); \
+ } \
+ }
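
A hypothetical usage sketch of one generated function (all buffer names here are illustrative), evaluating four candidate references against the same source block in a single pass:

    uint32_t sad[4];
    const uint8_t *refs[4] = { ref0, ref1, ref2, ref3 };
    aom_masked_sad16x16x4d_neon(src, src_stride, refs, ref_stride, second_pred,
                                mask, mask_stride, /*invert_mask=*/0, sad);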
+
+MASKED_SAD4D_WXH_NEON(4, 8)
+MASKED_SAD4D_WXH_NEON(4, 4)
+
+MASKED_SAD4D_WXH_NEON(8, 16)
+MASKED_SAD4D_WXH_NEON(8, 8)
+MASKED_SAD4D_WXH_NEON(8, 4)
+
+MASKED_SAD4D_WXH_NEON(16, 32)
+MASKED_SAD4D_WXH_NEON(16, 16)
+MASKED_SAD4D_WXH_NEON(16, 8)
+
+MASKED_SAD4D_WXH_NEON(32, 64)
+MASKED_SAD4D_WXH_NEON(32, 32)
+MASKED_SAD4D_WXH_NEON(32, 16)
+
+MASKED_SAD4D_WXH_NEON(64, 128)
+MASKED_SAD4D_WXH_NEON(64, 64)
+MASKED_SAD4D_WXH_NEON(64, 32)
+
+MASKED_SAD4D_WXH_NEON(128, 128)
+MASKED_SAD4D_WXH_NEON(128, 64)
+
+#if !CONFIG_REALTIME_ONLY
+MASKED_SAD4D_WXH_NEON(4, 16)
+MASKED_SAD4D_WXH_NEON(16, 4)
+MASKED_SAD4D_WXH_NEON(8, 32)
+MASKED_SAD4D_WXH_NEON(32, 8)
+MASKED_SAD4D_WXH_NEON(16, 64)
+MASKED_SAD4D_WXH_NEON(64, 16)
+#endif
diff --git a/aom_dsp/arm/masked_sad_neon.c b/aom_dsp/arm/masked_sad_neon.c
new file mode 100644
index 0000000..340df05
--- /dev/null
+++ b/aom_dsp/arm/masked_sad_neon.c
@@ -0,0 +1,257 @@
+/*
+ * Copyright (c) 2023, Alliance for Open Media. All rights reserved
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#include <arm_neon.h>
+
+#include "config/aom_config.h"
+#include "config/aom_dsp_rtcd.h"
+
+#include "aom/aom_integer.h"
+#include "aom_dsp/blend.h"
+#include "mem_neon.h"
+#include "sum_neon.h"
+
+static INLINE uint16x8_t masked_sad_16x1_neon(uint16x8_t sad,
+ const uint8_t *src,
+ const uint8_t *a,
+ const uint8_t *b,
+ const uint8_t *m) {
+ uint8x16_t m0 = vld1q_u8(m);
+ uint8x16_t a0 = vld1q_u8(a);
+ uint8x16_t b0 = vld1q_u8(b);
+ uint8x16_t s0 = vld1q_u8(src);
+
+ uint8x16_t m0_inv = vsubq_u8(vdupq_n_u8(AOM_BLEND_A64_MAX_ALPHA), m0);
+ uint16x8_t blend_u16_lo = vmull_u8(vget_low_u8(m0), vget_low_u8(a0));
+ uint16x8_t blend_u16_hi = vmull_u8(vget_high_u8(m0), vget_high_u8(a0));
+ blend_u16_lo = vmlal_u8(blend_u16_lo, vget_low_u8(m0_inv), vget_low_u8(b0));
+ blend_u16_hi = vmlal_u8(blend_u16_hi, vget_high_u8(m0_inv), vget_high_u8(b0));
+
+ uint8x8_t blend_u8_lo = vrshrn_n_u16(blend_u16_lo, AOM_BLEND_A64_ROUND_BITS);
+ uint8x8_t blend_u8_hi = vrshrn_n_u16(blend_u16_hi, AOM_BLEND_A64_ROUND_BITS);
+ uint8x16_t blend_u8 = vcombine_u8(blend_u8_lo, blend_u8_hi);
+
+ return vpadalq_u8(sad, vabdq_u8(blend_u8, s0));
+}
+
+static INLINE unsigned masked_sad_128xh_neon(const uint8_t *src, int src_stride,
+ const uint8_t *a, int a_stride,
+ const uint8_t *b, int b_stride,
+ const uint8_t *m, int m_stride,
+ int height) {
+ // Eight accumulator vectors are required to avoid overflow in the 128x128
+ // case.
+ assert(height <= 128);
+ uint16x8_t sad[] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0), vdupq_n_u16(0) };
+
+ do {
+ sad[0] = masked_sad_16x1_neon(sad[0], &src[0], &a[0], &b[0], &m[0]);
+ sad[1] = masked_sad_16x1_neon(sad[1], &src[16], &a[16], &b[16], &m[16]);
+ sad[2] = masked_sad_16x1_neon(sad[2], &src[32], &a[32], &b[32], &m[32]);
+ sad[3] = masked_sad_16x1_neon(sad[3], &src[48], &a[48], &b[48], &m[48]);
+ sad[4] = masked_sad_16x1_neon(sad[4], &src[64], &a[64], &b[64], &m[64]);
+ sad[5] = masked_sad_16x1_neon(sad[5], &src[80], &a[80], &b[80], &m[80]);
+ sad[6] = masked_sad_16x1_neon(sad[6], &src[96], &a[96], &b[96], &m[96]);
+ sad[7] = masked_sad_16x1_neon(sad[7], &src[112], &a[112], &b[112], &m[112]);
+
+ src += src_stride;
+ a += a_stride;
+ b += b_stride;
+ m += m_stride;
+ height--;
+ } while (height != 0);
+
+ return horizontal_long_add_u16x8(sad[0], sad[1]) +
+ horizontal_long_add_u16x8(sad[2], sad[3]) +
+ horizontal_long_add_u16x8(sad[4], sad[5]) +
+ horizontal_long_add_u16x8(sad[6], sad[7]);
+}
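
Worked bound for the eight accumulators above: vpadalq_u8 adds at most 2 * 255 = 510 to a uint16 lane per row, so one accumulator per 16-pixel column holds at most 128 * 510 = 65280 < 65535 over the maximum height of 128 rows, and 128 / 16 = 8 columns need eight vectors.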
+
+static INLINE unsigned masked_sad_64xh_neon(const uint8_t *src, int src_stride,
+ const uint8_t *a, int a_stride,
+ const uint8_t *b, int b_stride,
+ const uint8_t *m, int m_stride,
+ int height) {
+ // Four accumulator vectors are required to avoid overflow in the 64x128 case.
+ assert(height <= 128);
+ uint16x8_t sad[] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0) };
+
+ do {
+ sad[0] = masked_sad_16x1_neon(sad[0], &src[0], &a[0], &b[0], &m[0]);
+ sad[1] = masked_sad_16x1_neon(sad[1], &src[16], &a[16], &b[16], &m[16]);
+ sad[2] = masked_sad_16x1_neon(sad[2], &src[32], &a[32], &b[32], &m[32]);
+ sad[3] = masked_sad_16x1_neon(sad[3], &src[48], &a[48], &b[48], &m[48]);
+
+ src += src_stride;
+ a += a_stride;
+ b += b_stride;
+ m += m_stride;
+ height--;
+ } while (height != 0);
+
+ return horizontal_long_add_u16x8(sad[0], sad[1]) +
+ horizontal_long_add_u16x8(sad[2], sad[3]);
+}
+
+static INLINE unsigned masked_sad_32xh_neon(const uint8_t *src, int src_stride,
+ const uint8_t *a, int a_stride,
+ const uint8_t *b, int b_stride,
+ const uint8_t *m, int m_stride,
+ int height) {
+  // A single accumulator is sufficient up to height=64 without overflow.
+ assert(height <= 64);
+ uint16x8_t sad = vdupq_n_u16(0);
+
+ do {
+ sad = masked_sad_16x1_neon(sad, &src[0], &a[0], &b[0], &m[0]);
+ sad = masked_sad_16x1_neon(sad, &src[16], &a[16], &b[16], &m[16]);
+
+ src += src_stride;
+ a += a_stride;
+ b += b_stride;
+ m += m_stride;
+ height--;
+ } while (height != 0);
+
+ return horizontal_add_u16x8(sad);
+}
+
+static INLINE unsigned masked_sad_16xh_neon(const uint8_t *src, int src_stride,
+ const uint8_t *a, int a_stride,
+ const uint8_t *b, int b_stride,
+ const uint8_t *m, int m_stride,
+ int height) {
+  // A single accumulator is sufficient up to height=128 without overflow.
+ assert(height <= 128);
+ uint16x8_t sad = vdupq_n_u16(0);
+
+ do {
+ sad = masked_sad_16x1_neon(sad, src, a, b, m);
+
+ src += src_stride;
+ a += a_stride;
+ b += b_stride;
+ m += m_stride;
+ height--;
+ } while (height != 0);
+
+ return horizontal_add_u16x8(sad);
+}
+
+static INLINE unsigned masked_sad_8xh_neon(const uint8_t *src, int src_stride,
+ const uint8_t *a, int a_stride,
+ const uint8_t *b, int b_stride,
+ const uint8_t *m, int m_stride,
+ int height) {
+  // A single accumulator is sufficient up to height=128 without overflow.
+ assert(height <= 128);
+ uint16x4_t sad = vdup_n_u16(0);
+
+ do {
+ uint8x8_t m0 = vld1_u8(m);
+ uint8x8_t a0 = vld1_u8(a);
+ uint8x8_t b0 = vld1_u8(b);
+ uint8x8_t s0 = vld1_u8(src);
+
+ uint8x8_t m0_inv = vsub_u8(vdup_n_u8(AOM_BLEND_A64_MAX_ALPHA), m0);
+ uint16x8_t blend_u16 = vmull_u8(m0, a0);
+ blend_u16 = vmlal_u8(blend_u16, m0_inv, b0);
+ uint8x8_t blend_u8 = vrshrn_n_u16(blend_u16, AOM_BLEND_A64_ROUND_BITS);
+
+ sad = vpadal_u8(sad, vabd_u8(blend_u8, s0));
+
+ src += src_stride;
+ a += a_stride;
+ b += b_stride;
+ m += m_stride;
+ height--;
+ } while (height != 0);
+
+ return horizontal_add_u16x4(sad);
+}
+
+static INLINE unsigned masked_sad_4xh_neon(const uint8_t *src, int src_stride,
+ const uint8_t *a, int a_stride,
+ const uint8_t *b, int b_stride,
+ const uint8_t *m, int m_stride,
+ int height) {
+ // Process two rows per loop iteration.
+ assert(height % 2 == 0);
+
+  // A single accumulator is sufficient up to height=256 without overflow.
+ assert(height <= 256);
+ uint16x4_t sad = vdup_n_u16(0);
+
+ do {
+ uint8x8_t m0 = load_unaligned_u8(m, m_stride);
+ uint8x8_t a0 = load_unaligned_u8(a, a_stride);
+ uint8x8_t b0 = load_unaligned_u8(b, b_stride);
+ uint8x8_t s0 = load_unaligned_u8(src, src_stride);
+
+ uint8x8_t m0_inv = vsub_u8(vdup_n_u8(AOM_BLEND_A64_MAX_ALPHA), m0);
+ uint16x8_t blend_u16 = vmull_u8(m0, a0);
+ blend_u16 = vmlal_u8(blend_u16, m0_inv, b0);
+ uint8x8_t blend_u8 = vrshrn_n_u16(blend_u16, AOM_BLEND_A64_ROUND_BITS);
+
+ sad = vpadal_u8(sad, vabd_u8(blend_u8, s0));
+
+ src += 2 * src_stride;
+ a += 2 * a_stride;
+ b += 2 * b_stride;
+ m += 2 * m_stride;
+ height -= 2;
+ } while (height != 0);
+
+ return horizontal_add_u16x4(sad);
+}
+
+#define MASKED_SAD_WXH_NEON(width, height) \
+ unsigned aom_masked_sad##width##x##height##_neon( \
+ const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, \
+ const uint8_t *second_pred, const uint8_t *msk, int msk_stride, \
+ int invert_mask) { \
+ if (!invert_mask) \
+ return masked_sad_##width##xh_neon(src, src_stride, ref, ref_stride, \
+ second_pred, width, msk, msk_stride, \
+ height); \
+ else \
+ return masked_sad_##width##xh_neon(src, src_stride, second_pred, width, \
+ ref, ref_stride, msk, msk_stride, \
+ height); \
+ }
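
The invert_mask branch reuses the same kernel with ref and second_pred swapped, relying on the blend identity

    AOM_BLEND_A64(64 - m, a, b) == AOM_BLEND_A64(m, b, a)

since m*a + (64 - m)*b turns into (64 - m)*a + m*b when the operand roles are exchanged.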
+
+MASKED_SAD_WXH_NEON(4, 4)
+MASKED_SAD_WXH_NEON(4, 8)
+MASKED_SAD_WXH_NEON(8, 4)
+MASKED_SAD_WXH_NEON(8, 8)
+MASKED_SAD_WXH_NEON(8, 16)
+MASKED_SAD_WXH_NEON(16, 8)
+MASKED_SAD_WXH_NEON(16, 16)
+MASKED_SAD_WXH_NEON(16, 32)
+MASKED_SAD_WXH_NEON(32, 16)
+MASKED_SAD_WXH_NEON(32, 32)
+MASKED_SAD_WXH_NEON(32, 64)
+MASKED_SAD_WXH_NEON(64, 32)
+MASKED_SAD_WXH_NEON(64, 64)
+MASKED_SAD_WXH_NEON(64, 128)
+MASKED_SAD_WXH_NEON(128, 64)
+MASKED_SAD_WXH_NEON(128, 128)
+#if !CONFIG_REALTIME_ONLY
+MASKED_SAD_WXH_NEON(4, 16)
+MASKED_SAD_WXH_NEON(16, 4)
+MASKED_SAD_WXH_NEON(8, 32)
+MASKED_SAD_WXH_NEON(32, 8)
+MASKED_SAD_WXH_NEON(16, 64)
+MASKED_SAD_WXH_NEON(64, 16)
+#endif
diff --git a/aom_dsp/arm/mem_neon.h b/aom_dsp/arm/mem_neon.h
index 73a5127..16d44c5 100644
--- a/aom_dsp/arm/mem_neon.h
+++ b/aom_dsp/arm/mem_neon.h
@@ -73,14 +73,18 @@
#endif // __GNUC__ < 9
#endif // defined(__GNUC__) && !defined(__clang__)
-static INLINE void store_row2_u8_8x8(uint8_t *s, int p, const uint8x8_t s0,
- const uint8x8_t s1) {
+static INLINE void store_u8_8x2(uint8_t *s, ptrdiff_t p, const uint8x8_t s0,
+ const uint8x8_t s1) {
vst1_u8(s, s0);
s += p;
vst1_u8(s, s1);
s += p;
}
+static INLINE uint8x16_t load_u8_8x2(const uint8_t *s, ptrdiff_t p) {
+ return vcombine_u8(vld1_u8(s), vld1_u8(s + p));
+}
+
/* These intrinsics require immediate values, so we must use #defines
to enforce that. */
#define load_u8_4x1(s, s0, lane) \
@@ -89,6 +93,13 @@
vld1_lane_u32((uint32_t *)(s), vreinterpret_u32_u8(*(s0)), lane)); \
} while (0)
+// Load four bytes into the low half of a uint8x8_t, zero the upper half.
+static INLINE uint8x8_t load_u8_4x1_lane0(const uint8_t *p) {
+ uint8x8_t ret = vdup_n_u8(0);
+ load_u8_4x1(p, &ret, 0);
+ return ret;
+}
+
static INLINE void load_u8_8x8(const uint8_t *s, ptrdiff_t p,
uint8x8_t *const s0, uint8x8_t *const s1,
uint8x8_t *const s2, uint8x8_t *const s3,
@@ -111,16 +122,24 @@
*s7 = vld1_u8(s);
}
-static INLINE void load_u8_8x16(const uint8_t *s, ptrdiff_t p,
- uint8x16_t *const s0, uint8x16_t *const s1,
- uint8x16_t *const s2, uint8x16_t *const s3) {
- *s0 = vld1q_u8(s);
+static INLINE void load_u8_8x7(const uint8_t *s, ptrdiff_t p,
+ uint8x8_t *const s0, uint8x8_t *const s1,
+ uint8x8_t *const s2, uint8x8_t *const s3,
+ uint8x8_t *const s4, uint8x8_t *const s5,
+ uint8x8_t *const s6) {
+ *s0 = vld1_u8(s);
s += p;
- *s1 = vld1q_u8(s);
+ *s1 = vld1_u8(s);
s += p;
- *s2 = vld1q_u8(s);
+ *s2 = vld1_u8(s);
s += p;
- *s3 = vld1q_u8(s);
+ *s3 = vld1_u8(s);
+ s += p;
+ *s4 = vld1_u8(s);
+ s += p;
+ *s5 = vld1_u8(s);
+ s += p;
+ *s6 = vld1_u8(s);
}
static INLINE void load_u8_8x4(const uint8_t *s, const ptrdiff_t p,
@@ -148,6 +167,40 @@
s += p;
}
+static INLINE void load_u16_4x7(const uint16_t *s, ptrdiff_t p,
+ uint16x4_t *const s0, uint16x4_t *const s1,
+ uint16x4_t *const s2, uint16x4_t *const s3,
+ uint16x4_t *const s4, uint16x4_t *const s5,
+ uint16x4_t *const s6) {
+ *s0 = vld1_u16(s);
+ s += p;
+ *s1 = vld1_u16(s);
+ s += p;
+ *s2 = vld1_u16(s);
+ s += p;
+ *s3 = vld1_u16(s);
+ s += p;
+ *s4 = vld1_u16(s);
+ s += p;
+ *s5 = vld1_u16(s);
+ s += p;
+ *s6 = vld1_u16(s);
+}
+
+static INLINE void load_s16_8x2(const int16_t *s, const ptrdiff_t p,
+ int16x8_t *const s0, int16x8_t *const s1) {
+ *s0 = vld1q_s16(s);
+ s += p;
+ *s1 = vld1q_s16(s);
+}
+
+static INLINE void load_u16_8x2(const uint16_t *s, const ptrdiff_t p,
+ uint16x8_t *const s0, uint16x8_t *const s1) {
+ *s0 = vld1q_u16(s);
+ s += p;
+ *s1 = vld1q_u16(s);
+}
+
static INLINE void load_u16_8x4(const uint16_t *s, const ptrdiff_t p,
uint16x8_t *const s0, uint16x8_t *const s1,
uint16x8_t *const s2, uint16x8_t *const s3) {
@@ -161,6 +214,66 @@
s += p;
}
+static INLINE void load_s16_4x11(const int16_t *s, ptrdiff_t p,
+ int16x4_t *const s0, int16x4_t *const s1,
+ int16x4_t *const s2, int16x4_t *const s3,
+ int16x4_t *const s4, int16x4_t *const s5,
+ int16x4_t *const s6, int16x4_t *const s7,
+ int16x4_t *const s8, int16x4_t *const s9,
+ int16x4_t *const s10) {
+ *s0 = vld1_s16(s);
+ s += p;
+ *s1 = vld1_s16(s);
+ s += p;
+ *s2 = vld1_s16(s);
+ s += p;
+ *s3 = vld1_s16(s);
+ s += p;
+ *s4 = vld1_s16(s);
+ s += p;
+ *s5 = vld1_s16(s);
+ s += p;
+ *s6 = vld1_s16(s);
+ s += p;
+ *s7 = vld1_s16(s);
+ s += p;
+ *s8 = vld1_s16(s);
+ s += p;
+ *s9 = vld1_s16(s);
+ s += p;
+ *s10 = vld1_s16(s);
+}
+
+static INLINE void load_u16_4x11(const uint16_t *s, ptrdiff_t p,
+ uint16x4_t *const s0, uint16x4_t *const s1,
+ uint16x4_t *const s2, uint16x4_t *const s3,
+ uint16x4_t *const s4, uint16x4_t *const s5,
+ uint16x4_t *const s6, uint16x4_t *const s7,
+ uint16x4_t *const s8, uint16x4_t *const s9,
+ uint16x4_t *const s10) {
+ *s0 = vld1_u16(s);
+ s += p;
+ *s1 = vld1_u16(s);
+ s += p;
+ *s2 = vld1_u16(s);
+ s += p;
+ *s3 = vld1_u16(s);
+ s += p;
+ *s4 = vld1_u16(s);
+ s += p;
+ *s5 = vld1_u16(s);
+ s += p;
+ *s6 = vld1_u16(s);
+ s += p;
+ *s7 = vld1_u16(s);
+ s += p;
+ *s8 = vld1_u16(s);
+ s += p;
+ *s9 = vld1_u16(s);
+ s += p;
+ *s10 = vld1_u16(s);
+}
+
static INLINE void load_s16_4x8(const int16_t *s, ptrdiff_t p,
int16x4_t *const s0, int16x4_t *const s1,
int16x4_t *const s2, int16x4_t *const s3,
@@ -183,6 +296,88 @@
*s7 = vld1_s16(s);
}
+static INLINE void load_s16_4x7(const int16_t *s, ptrdiff_t p,
+ int16x4_t *const s0, int16x4_t *const s1,
+ int16x4_t *const s2, int16x4_t *const s3,
+ int16x4_t *const s4, int16x4_t *const s5,
+ int16x4_t *const s6) {
+ *s0 = vld1_s16(s);
+ s += p;
+ *s1 = vld1_s16(s);
+ s += p;
+ *s2 = vld1_s16(s);
+ s += p;
+ *s3 = vld1_s16(s);
+ s += p;
+ *s4 = vld1_s16(s);
+ s += p;
+ *s5 = vld1_s16(s);
+ s += p;
+ *s6 = vld1_s16(s);
+}
+
+static INLINE void load_s16_4x5(const int16_t *s, ptrdiff_t p,
+ int16x4_t *const s0, int16x4_t *const s1,
+ int16x4_t *const s2, int16x4_t *const s3,
+ int16x4_t *const s4) {
+ *s0 = vld1_s16(s);
+ s += p;
+ *s1 = vld1_s16(s);
+ s += p;
+ *s2 = vld1_s16(s);
+ s += p;
+ *s3 = vld1_s16(s);
+ s += p;
+ *s4 = vld1_s16(s);
+}
+
+static INLINE void load_u16_4x5(const uint16_t *s, const ptrdiff_t p,
+ uint16x4_t *const s0, uint16x4_t *const s1,
+ uint16x4_t *const s2, uint16x4_t *const s3,
+ uint16x4_t *const s4) {
+ *s0 = vld1_u16(s);
+ s += p;
+ *s1 = vld1_u16(s);
+ s += p;
+ *s2 = vld1_u16(s);
+ s += p;
+ *s3 = vld1_u16(s);
+ s += p;
+ *s4 = vld1_u16(s);
+ s += p;
+}
+
+static INLINE void load_u8_8x5(const uint8_t *s, ptrdiff_t p,
+ uint8x8_t *const s0, uint8x8_t *const s1,
+ uint8x8_t *const s2, uint8x8_t *const s3,
+ uint8x8_t *const s4) {
+ *s0 = vld1_u8(s);
+ s += p;
+ *s1 = vld1_u8(s);
+ s += p;
+ *s2 = vld1_u8(s);
+ s += p;
+ *s3 = vld1_u8(s);
+ s += p;
+ *s4 = vld1_u8(s);
+}
+
+static INLINE void load_u16_8x5(const uint16_t *s, const ptrdiff_t p,
+ uint16x8_t *const s0, uint16x8_t *const s1,
+ uint16x8_t *const s2, uint16x8_t *const s3,
+ uint16x8_t *const s4) {
+ *s0 = vld1q_u16(s);
+ s += p;
+ *s1 = vld1q_u16(s);
+ s += p;
+ *s2 = vld1q_u16(s);
+ s += p;
+ *s3 = vld1q_u16(s);
+ s += p;
+ *s4 = vld1q_u16(s);
+ s += p;
+}
+
static INLINE void load_s16_4x4(const int16_t *s, ptrdiff_t p,
int16x4_t *const s0, int16x4_t *const s1,
int16x4_t *const s2, int16x4_t *const s3) {
@@ -197,6 +392,11 @@
/* These intrinsics require immediate values, so we must use #defines
to enforce that. */
+#define store_u8_2x1(s, s0, lane) \
+ do { \
+ vst1_lane_u16((uint16_t *)(s), vreinterpret_u16_u8(s0), lane); \
+ } while (0)
+
#define store_u8_4x1(s, s0, lane) \
do { \
vst1_lane_u32((uint32_t *)(s), vreinterpret_u32_u8(s0), lane); \
@@ -282,6 +482,13 @@
vst1_u16(s, s3);
}
+static INLINE void store_u16_8x2(uint16_t *s, ptrdiff_t dst_stride,
+ const uint16x8_t s0, const uint16x8_t s1) {
+ vst1q_u16(s, s0);
+ s += dst_stride;
+ vst1q_u16(s, s1);
+}
+
static INLINE void store_u16_8x4(uint16_t *s, ptrdiff_t dst_stride,
const uint16x8_t s0, const uint16x8_t s1,
const uint16x8_t s2, const uint16x8_t s3) {
@@ -328,6 +535,21 @@
vst1_s16(s, s3);
}
+/* These intrinsics require immediate values, so we must use #defines
+ to enforce that. */
+#define store_s16_2x1(s, s0, lane) \
+ do { \
+ vst1_lane_s32((int32_t *)(s), vreinterpret_s32_s16(s0), lane); \
+ } while (0)
+#define store_u16_2x1(s, s0, lane) \
+ do { \
+ vst1_lane_u32((uint32_t *)(s), vreinterpret_u32_u16(s0), lane); \
+ } while (0)
+#define store_u16q_2x1(s, s0, lane) \
+ do { \
+ vst1q_lane_u32((uint32_t *)(s), vreinterpretq_u32_u16(s0), lane); \
+ } while (0)
+
static INLINE void store_s16_8x4(int16_t *s, ptrdiff_t dst_stride,
const int16x8_t s0, const int16x8_t s1,
const int16x8_t s2, const int16x8_t s3) {
@@ -340,6 +562,96 @@
vst1q_s16(s, s3);
}
+static INLINE void load_u8_8x11(const uint8_t *s, ptrdiff_t p,
+ uint8x8_t *const s0, uint8x8_t *const s1,
+ uint8x8_t *const s2, uint8x8_t *const s3,
+ uint8x8_t *const s4, uint8x8_t *const s5,
+ uint8x8_t *const s6, uint8x8_t *const s7,
+ uint8x8_t *const s8, uint8x8_t *const s9,
+ uint8x8_t *const s10) {
+ *s0 = vld1_u8(s);
+ s += p;
+ *s1 = vld1_u8(s);
+ s += p;
+ *s2 = vld1_u8(s);
+ s += p;
+ *s3 = vld1_u8(s);
+ s += p;
+ *s4 = vld1_u8(s);
+ s += p;
+ *s5 = vld1_u8(s);
+ s += p;
+ *s6 = vld1_u8(s);
+ s += p;
+ *s7 = vld1_u8(s);
+ s += p;
+ *s8 = vld1_u8(s);
+ s += p;
+ *s9 = vld1_u8(s);
+ s += p;
+ *s10 = vld1_u8(s);
+}
+
+static INLINE void load_s16_8x11(const int16_t *s, ptrdiff_t p,
+ int16x8_t *const s0, int16x8_t *const s1,
+ int16x8_t *const s2, int16x8_t *const s3,
+ int16x8_t *const s4, int16x8_t *const s5,
+ int16x8_t *const s6, int16x8_t *const s7,
+ int16x8_t *const s8, int16x8_t *const s9,
+ int16x8_t *const s10) {
+ *s0 = vld1q_s16(s);
+ s += p;
+ *s1 = vld1q_s16(s);
+ s += p;
+ *s2 = vld1q_s16(s);
+ s += p;
+ *s3 = vld1q_s16(s);
+ s += p;
+ *s4 = vld1q_s16(s);
+ s += p;
+ *s5 = vld1q_s16(s);
+ s += p;
+ *s6 = vld1q_s16(s);
+ s += p;
+ *s7 = vld1q_s16(s);
+ s += p;
+ *s8 = vld1q_s16(s);
+ s += p;
+ *s9 = vld1q_s16(s);
+ s += p;
+ *s10 = vld1q_s16(s);
+}
+
+static INLINE void load_u16_8x11(const uint16_t *s, ptrdiff_t p,
+ uint16x8_t *const s0, uint16x8_t *const s1,
+ uint16x8_t *const s2, uint16x8_t *const s3,
+ uint16x8_t *const s4, uint16x8_t *const s5,
+ uint16x8_t *const s6, uint16x8_t *const s7,
+ uint16x8_t *const s8, uint16x8_t *const s9,
+ uint16x8_t *const s10) {
+ *s0 = vld1q_u16(s);
+ s += p;
+ *s1 = vld1q_u16(s);
+ s += p;
+ *s2 = vld1q_u16(s);
+ s += p;
+ *s3 = vld1q_u16(s);
+ s += p;
+ *s4 = vld1q_u16(s);
+ s += p;
+ *s5 = vld1q_u16(s);
+ s += p;
+ *s6 = vld1q_u16(s);
+ s += p;
+ *s7 = vld1q_u16(s);
+ s += p;
+ *s8 = vld1q_u16(s);
+ s += p;
+ *s9 = vld1q_u16(s);
+ s += p;
+ *s10 = vld1q_u16(s);
+}
+
static INLINE void load_s16_8x8(const int16_t *s, ptrdiff_t p,
int16x8_t *const s0, int16x8_t *const s1,
int16x8_t *const s2, int16x8_t *const s3,
@@ -362,6 +674,61 @@
*s7 = vld1q_s16(s);
}
+static INLINE void load_u16_8x7(const uint16_t *s, ptrdiff_t p,
+ uint16x8_t *const s0, uint16x8_t *const s1,
+ uint16x8_t *const s2, uint16x8_t *const s3,
+ uint16x8_t *const s4, uint16x8_t *const s5,
+ uint16x8_t *const s6) {
+ *s0 = vld1q_u16(s);
+ s += p;
+ *s1 = vld1q_u16(s);
+ s += p;
+ *s2 = vld1q_u16(s);
+ s += p;
+ *s3 = vld1q_u16(s);
+ s += p;
+ *s4 = vld1q_u16(s);
+ s += p;
+ *s5 = vld1q_u16(s);
+ s += p;
+ *s6 = vld1q_u16(s);
+}
+
+static INLINE void load_s16_8x7(const int16_t *s, ptrdiff_t p,
+ int16x8_t *const s0, int16x8_t *const s1,
+ int16x8_t *const s2, int16x8_t *const s3,
+ int16x8_t *const s4, int16x8_t *const s5,
+ int16x8_t *const s6) {
+ *s0 = vld1q_s16(s);
+ s += p;
+ *s1 = vld1q_s16(s);
+ s += p;
+ *s2 = vld1q_s16(s);
+ s += p;
+ *s3 = vld1q_s16(s);
+ s += p;
+ *s4 = vld1q_s16(s);
+ s += p;
+ *s5 = vld1q_s16(s);
+ s += p;
+ *s6 = vld1q_s16(s);
+}
+
+static INLINE void load_s16_8x5(const int16_t *s, ptrdiff_t p,
+ int16x8_t *const s0, int16x8_t *const s1,
+ int16x8_t *const s2, int16x8_t *const s3,
+ int16x8_t *const s4) {
+ *s0 = vld1q_s16(s);
+ s += p;
+ *s1 = vld1q_s16(s);
+ s += p;
+ *s2 = vld1q_s16(s);
+ s += p;
+ *s3 = vld1q_s16(s);
+ s += p;
+ *s4 = vld1q_s16(s);
+}
+
static INLINE void load_s16_8x4(const int16_t *s, ptrdiff_t p,
int16x8_t *const s0, int16x8_t *const s1,
int16x8_t *const s2, int16x8_t *const s3) {
@@ -404,71 +771,61 @@
return vreinterpretq_u8_u32(a_u32);
}
-static INLINE void load_unaligned_u8_4x8(const uint8_t *buf, int stride,
- uint32x2_t *tu0, uint32x2_t *tu1,
- uint32x2_t *tu2, uint32x2_t *tu3) {
+static INLINE uint8x8_t load_unaligned_u8_2x2(const uint8_t *buf, int stride) {
+ uint16_t a;
+ uint16x4_t a_u16;
+
+ memcpy(&a, buf, 2);
+ buf += stride;
+ a_u16 = vdup_n_u16(a);
+ memcpy(&a, buf, 2);
+ a_u16 = vset_lane_u16(a, a_u16, 1);
+ return vreinterpret_u8_u16(a_u16);
+}
+
+static INLINE uint8x8_t load_unaligned_u8_4x1(const uint8_t *buf) {
uint32_t a;
+ uint32x2_t a_u32;
+
+ memcpy(&a, buf, 4);
+ a_u32 = vdup_n_u32(0);
+ a_u32 = vset_lane_u32(a, a_u32, 0);
+ return vreinterpret_u8_u32(a_u32);
+}
+
+static INLINE uint8x8_t load_unaligned_u8_4x2(const uint8_t *buf, int stride) {
+ uint32_t a;
+ uint32x2_t a_u32;
memcpy(&a, buf, 4);
buf += stride;
- *tu0 = vdup_n_u32(a);
+ a_u32 = vdup_n_u32(a);
memcpy(&a, buf, 4);
- buf += stride;
- *tu0 = vset_lane_u32(a, *tu0, 1);
- memcpy(&a, buf, 4);
- buf += stride;
- *tu1 = vdup_n_u32(a);
- memcpy(&a, buf, 4);
- buf += stride;
- *tu1 = vset_lane_u32(a, *tu1, 1);
- memcpy(&a, buf, 4);
- buf += stride;
- *tu2 = vdup_n_u32(a);
- memcpy(&a, buf, 4);
- buf += stride;
- *tu2 = vset_lane_u32(a, *tu2, 1);
- memcpy(&a, buf, 4);
- buf += stride;
- *tu3 = vdup_n_u32(a);
- memcpy(&a, buf, 4);
- *tu3 = vset_lane_u32(a, *tu3, 1);
+ a_u32 = vset_lane_u32(a, a_u32, 1);
+ return vreinterpret_u8_u32(a_u32);
}
static INLINE void load_unaligned_u8_4x4(const uint8_t *buf, int stride,
- uint32x2_t *tu0, uint32x2_t *tu1) {
- uint32_t a;
-
- memcpy(&a, buf, 4);
- buf += stride;
- *tu0 = vdup_n_u32(a);
- memcpy(&a, buf, 4);
- buf += stride;
- *tu0 = vset_lane_u32(a, *tu0, 1);
- memcpy(&a, buf, 4);
- buf += stride;
- *tu1 = vdup_n_u32(a);
- memcpy(&a, buf, 4);
- *tu1 = vset_lane_u32(a, *tu1, 1);
+ uint8x8_t *tu0, uint8x8_t *tu1) {
+ *tu0 = load_unaligned_u8_4x2(buf, stride);
+ buf += 2 * stride;
+ *tu1 = load_unaligned_u8_4x2(buf, stride);
}
-static INLINE void load_unaligned_u8_4x1(const uint8_t *buf, int stride,
- uint32x2_t *tu0) {
- uint32_t a;
-
- memcpy(&a, buf, 4);
- buf += stride;
- *tu0 = vset_lane_u32(a, *tu0, 0);
+static INLINE void load_unaligned_u8_3x8(const uint8_t *buf, int stride,
+ uint8x8_t *tu0, uint8x8_t *tu1,
+ uint8x8_t *tu2) {
+ load_unaligned_u8_4x4(buf, stride, tu0, tu1);
+ buf += 4 * stride;
+ *tu2 = load_unaligned_u8_4x2(buf, stride);
}
-static INLINE void load_unaligned_u8_4x2(const uint8_t *buf, int stride,
- uint32x2_t *tu0) {
- uint32_t a;
-
- memcpy(&a, buf, 4);
- buf += stride;
- *tu0 = vdup_n_u32(a);
- memcpy(&a, buf, 4);
- *tu0 = vset_lane_u32(a, *tu0, 1);
+static INLINE void load_unaligned_u8_4x8(const uint8_t *buf, int stride,
+ uint8x8_t *tu0, uint8x8_t *tu1,
+ uint8x8_t *tu2, uint8x8_t *tu3) {
+ load_unaligned_u8_4x4(buf, stride, tu0, tu1);
+ buf += 4 * stride;
+ load_unaligned_u8_4x4(buf, stride, tu2, tu3);
}
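
These unaligned-load helpers go through memcpy rather than dereferencing a widened pointer: a byte-wise copy is the portable way to express an unaligned load (no alignment or strict-aliasing violations), and compilers typically reduce it to a single load instruction. A minimal sketch of the idiom (with <string.h> and <stdint.h>):

    uint32_t v;
    memcpy(&v, buf, 4);               // safe at any alignment
    // vs. *(const uint32_t *)buf;    // undefined if buf is misaligned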
/* These intrinsics require immediate values, so we must use #defines
@@ -487,17 +844,6 @@
memcpy(dst, &a, 2); \
} while (0)
-static INLINE void load_unaligned_u8_2x2(const uint8_t *buf, int stride,
- uint16x4_t *tu0) {
- uint16_t a;
-
- memcpy(&a, buf, 2);
- buf += stride;
- *tu0 = vdup_n_u16(a);
- memcpy(&a, buf, 2);
- *tu0 = vset_lane_u16(a, *tu0, 1);
-}
-
static INLINE void load_u8_16x8(const uint8_t *s, ptrdiff_t p,
uint8x16_t *const s0, uint8x16_t *const s1,
uint8x16_t *const s2, uint8x16_t *const s3,
@@ -532,21 +878,65 @@
*s3 = vld1q_u8(s);
}
-static INLINE void load_unaligned_u16_4x4(const uint16_t *buf, uint32_t stride,
- uint64x2_t *tu0, uint64x2_t *tu1) {
+static INLINE void load_u16_8x8(const uint16_t *s, const ptrdiff_t p,
+ uint16x8_t *s0, uint16x8_t *s1, uint16x8_t *s2,
+ uint16x8_t *s3, uint16x8_t *s4, uint16x8_t *s5,
+ uint16x8_t *s6, uint16x8_t *s7) {
+ *s0 = vld1q_u16(s);
+ s += p;
+ *s1 = vld1q_u16(s);
+ s += p;
+ *s2 = vld1q_u16(s);
+ s += p;
+ *s3 = vld1q_u16(s);
+ s += p;
+ *s4 = vld1q_u16(s);
+ s += p;
+ *s5 = vld1q_u16(s);
+ s += p;
+ *s6 = vld1q_u16(s);
+ s += p;
+ *s7 = vld1q_u16(s);
+}
+
+static INLINE void load_u16_16x4(const uint16_t *s, ptrdiff_t p,
+ uint16x8_t *const s0, uint16x8_t *const s1,
+ uint16x8_t *const s2, uint16x8_t *const s3,
+ uint16x8_t *const s4, uint16x8_t *const s5,
+ uint16x8_t *const s6, uint16x8_t *const s7) {
+ *s0 = vld1q_u16(s);
+ *s1 = vld1q_u16(s + 8);
+ s += p;
+ *s2 = vld1q_u16(s);
+ *s3 = vld1q_u16(s + 8);
+ s += p;
+ *s4 = vld1q_u16(s);
+ *s5 = vld1q_u16(s + 8);
+ s += p;
+ *s6 = vld1q_u16(s);
+ *s7 = vld1q_u16(s + 8);
+}
+
+static INLINE uint16x8_t load_unaligned_u16_4x2(const uint16_t *buf,
+ uint32_t stride) {
uint64_t a;
+ uint64x2_t a_u64;
memcpy(&a, buf, 8);
buf += stride;
- *tu0 = vdupq_n_u64(a);
+ a_u64 = vdupq_n_u64(0);
+ a_u64 = vsetq_lane_u64(a, a_u64, 0);
memcpy(&a, buf, 8);
buf += stride;
- *tu0 = vsetq_lane_u64(a, *tu0, 1);
- memcpy(&a, buf, 8);
- buf += stride;
- *tu1 = vdupq_n_u64(a);
- memcpy(&a, buf, 8);
- *tu1 = vsetq_lane_u64(a, *tu1, 1);
+ a_u64 = vsetq_lane_u64(a, a_u64, 1);
+ return vreinterpretq_u16_u64(a_u64);
+}
+
+static INLINE void load_unaligned_u16_4x4(const uint16_t *buf, uint32_t stride,
+ uint16x8_t *tu0, uint16x8_t *tu1) {
+ *tu0 = load_unaligned_u16_4x2(buf, stride);
+ buf += 2 * stride;
+ *tu1 = load_unaligned_u16_4x2(buf, stride);
}
static INLINE void load_s32_4x4(int32_t *s, int32_t p, int32x4_t *s1,
@@ -609,17 +999,9 @@
vst1q_s32(buf + 4, v1);
}
-// Stores the second result at an offset of 8 (instead of 4) to match the output
-// with that of C implementation and the function is similar to
-// store_s16q_to_tran_low(). The offset in the function name signifies that
-// pointer should be incremented by at least 4 in the calling function after
-// store_s16q_to_tran_low_offset_4() call.
-static INLINE void store_s16q_to_tran_low_offset_4(tran_low_t *buf,
- const int16x8_t a) {
- const int32x4_t v0 = vmovl_s16(vget_low_s16(a));
- const int32x4_t v1 = vmovl_s16(vget_high_s16(a));
+static INLINE void store_s16_to_tran_low(tran_low_t *buf, const int16x4_t a) {
+ const int32x4_t v0 = vmovl_s16(a);
vst1q_s32(buf, v0);
- vst1q_s32(buf + 8, v1);
}
#endif // AOM_AOM_DSP_ARM_MEM_NEON_H_
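
store_s16_to_tran_low widens int16 lanes to 32 bits before storing because tran_low_t is a 32-bit type in libaom. A scalar model of the new helper (assuming typedef int32_t tran_low_t):

    static void store_s16_to_tran_low_c(tran_low_t *buf, const int16_t a[4]) {
      for (int i = 0; i < 4; ++i) buf[i] = (int32_t)a[i];  // sign-extend 16 -> 32
    }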
diff --git a/aom_dsp/arm/obmc_sad_neon.c b/aom_dsp/arm/obmc_sad_neon.c
new file mode 100644
index 0000000..a692cbb
--- /dev/null
+++ b/aom_dsp/arm/obmc_sad_neon.c
@@ -0,0 +1,250 @@
+/*
+ * Copyright (c) 2023, Alliance for Open Media. All rights reserved
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#include <arm_neon.h>
+#include "config/aom_config.h"
+#include "config/aom_dsp_rtcd.h"
+#include "mem_neon.h"
+#include "sum_neon.h"
+
+static INLINE void obmc_sad_8x1_s16_neon(int16x8_t ref_s16, const int32_t *mask,
+ const int32_t *wsrc, uint32x4_t *sum) {
+ int32x4_t wsrc_lo = vld1q_s32(wsrc);
+ int32x4_t wsrc_hi = vld1q_s32(wsrc + 4);
+
+ int32x4_t mask_lo = vld1q_s32(mask);
+ int32x4_t mask_hi = vld1q_s32(mask + 4);
+
+ int16x8_t mask_s16 =
+ vuzpq_s16(vreinterpretq_s16_s32(mask_lo), vreinterpretq_s16_s32(mask_hi))
+ .val[0];
+
+ int32x4_t pre_lo = vmull_s16(vget_low_s16(ref_s16), vget_low_s16(mask_s16));
+ int32x4_t pre_hi = vmull_s16(vget_high_s16(ref_s16), vget_high_s16(mask_s16));
+
+ uint32x4_t abs_lo = vreinterpretq_u32_s32(vabdq_s32(wsrc_lo, pre_lo));
+ uint32x4_t abs_hi = vreinterpretq_u32_s32(vabdq_s32(wsrc_hi, pre_hi));
+
+ *sum = vrsraq_n_u32(*sum, abs_lo, 12);
+ *sum = vrsraq_n_u32(*sum, abs_hi, 12);
+}
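
A scalar model of obmc_sad_8x1_s16_neon; the 12-bit rounding shift mirrors the vrsraq_n_u32(., ., 12) accumulate above, and names here are illustrative:

    #include <stdint.h>
    static unsigned obmc_sad_8x1_c(const uint8_t *ref, const int32_t *wsrc,
                                   const int32_t *mask) {
      unsigned sad = 0;
      for (int i = 0; i < 8; ++i) {
        const int32_t diff = wsrc[i] - mask[i] * ref[i];
        const uint32_t a = (uint32_t)(diff < 0 ? -diff : diff);
        sad += (a + (1u << 11)) >> 12;  // rounding shift right by 12
      }
      return sad;
    }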
+
+#if AOM_ARCH_AARCH64
+
+// Use tbl for doing a double-width zero extension from 8->32 bits since we can
+// do this in one instruction rather than two (indices out of range (255 here)
+// are set to zero by tbl).
+DECLARE_ALIGNED(16, static const uint8_t, obmc_variance_permute_idx[]) = {
+ 0, 255, 255, 255, 1, 255, 255, 255, 2, 255, 255, 255, 3, 255, 255, 255,
+ 4, 255, 255, 255, 5, 255, 255, 255, 6, 255, 255, 255, 7, 255, 255, 255,
+ 8, 255, 255, 255, 9, 255, 255, 255, 10, 255, 255, 255, 11, 255, 255, 255,
+ 12, 255, 255, 255, 13, 255, 255, 255, 14, 255, 255, 255, 15, 255, 255, 255
+};
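
For example, applying pre_idx0 to r = { r0, r1, ..., r15 } selects the bytes { r0, 0, 0, 0, r1, 0, 0, 0, r2, 0, 0, 0, r3, 0, 0, 0 }, which on a little-endian target reinterprets as the uint32x4_t { r0, r1, r2, r3 }: four lanes zero-extended from 8 to 32 bits by a single vqtbl1q_u8.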
+
+static INLINE void obmc_sad_8x1_s32_neon(uint32x4_t ref_u32_lo,
+ uint32x4_t ref_u32_hi,
+ const int32_t *mask,
+ const int32_t *wsrc,
+ uint32x4_t sum[2]) {
+ int32x4_t wsrc_lo = vld1q_s32(wsrc);
+ int32x4_t wsrc_hi = vld1q_s32(wsrc + 4);
+ int32x4_t mask_lo = vld1q_s32(mask);
+ int32x4_t mask_hi = vld1q_s32(mask + 4);
+
+ int32x4_t pre_lo = vmulq_s32(vreinterpretq_s32_u32(ref_u32_lo), mask_lo);
+ int32x4_t pre_hi = vmulq_s32(vreinterpretq_s32_u32(ref_u32_hi), mask_hi);
+
+ uint32x4_t abs_lo = vreinterpretq_u32_s32(vabdq_s32(wsrc_lo, pre_lo));
+ uint32x4_t abs_hi = vreinterpretq_u32_s32(vabdq_s32(wsrc_hi, pre_hi));
+
+ sum[0] = vrsraq_n_u32(sum[0], abs_lo, 12);
+ sum[1] = vrsraq_n_u32(sum[1], abs_hi, 12);
+}
+
+static INLINE unsigned int obmc_sad_large_neon(const uint8_t *ref,
+ int ref_stride,
+ const int32_t *wsrc,
+ const int32_t *mask, int width,
+ int height) {
+ uint32x4_t sum[2] = { vdupq_n_u32(0), vdupq_n_u32(0) };
+
+ // Use tbl for doing a double-width zero extension from 8->32 bits since we
+ // can do this in one instruction rather than two.
+ uint8x16_t pre_idx0 = vld1q_u8(&obmc_variance_permute_idx[0]);
+ uint8x16_t pre_idx1 = vld1q_u8(&obmc_variance_permute_idx[16]);
+ uint8x16_t pre_idx2 = vld1q_u8(&obmc_variance_permute_idx[32]);
+ uint8x16_t pre_idx3 = vld1q_u8(&obmc_variance_permute_idx[48]);
+
+ int h = height;
+ do {
+ int w = width;
+ const uint8_t *ref_ptr = ref;
+ do {
+ uint8x16_t r = vld1q_u8(ref_ptr);
+
+ uint32x4_t ref_u32_lo = vreinterpretq_u32_u8(vqtbl1q_u8(r, pre_idx0));
+ uint32x4_t ref_u32_hi = vreinterpretq_u32_u8(vqtbl1q_u8(r, pre_idx1));
+ obmc_sad_8x1_s32_neon(ref_u32_lo, ref_u32_hi, mask, wsrc, sum);
+
+ ref_u32_lo = vreinterpretq_u32_u8(vqtbl1q_u8(r, pre_idx2));
+ ref_u32_hi = vreinterpretq_u32_u8(vqtbl1q_u8(r, pre_idx3));
+ obmc_sad_8x1_s32_neon(ref_u32_lo, ref_u32_hi, mask + 8, wsrc + 8, sum);
+
+ ref_ptr += 16;
+ wsrc += 16;
+ mask += 16;
+ w -= 16;
+ } while (w != 0);
+
+ ref += ref_stride;
+ } while (--h != 0);
+
+ return horizontal_add_u32x4(vaddq_u32(sum[0], sum[1]));
+}
+
+#else // !AOM_ARCH_AARCH64
+
+static INLINE unsigned int obmc_sad_large_neon(const uint8_t *ref,
+ int ref_stride,
+ const int32_t *wsrc,
+ const int32_t *mask, int width,
+ int height) {
+ uint32x4_t sum = vdupq_n_u32(0);
+
+ int h = height;
+ do {
+ int w = width;
+ const uint8_t *ref_ptr = ref;
+ do {
+ uint8x16_t r = vld1q_u8(ref_ptr);
+
+ int16x8_t ref_s16 = vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(r)));
+ obmc_sad_8x1_s16_neon(ref_s16, mask, wsrc, &sum);
+
+ ref_s16 = vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(r)));
+ obmc_sad_8x1_s16_neon(ref_s16, mask + 8, wsrc + 8, &sum);
+
+ ref_ptr += 16;
+ wsrc += 16;
+ mask += 16;
+ w -= 16;
+ } while (w != 0);
+
+ ref += ref_stride;
+ } while (--h != 0);
+
+ return horizontal_add_u32x4(sum);
+}
+
+#endif // AOM_ARCH_AARCH64
+
+static INLINE unsigned int obmc_sad_128xh_neon(const uint8_t *ref,
+ int ref_stride,
+ const int32_t *wsrc,
+ const int32_t *mask, int h) {
+ return obmc_sad_large_neon(ref, ref_stride, wsrc, mask, 128, h);
+}
+
+static INLINE unsigned int obmc_sad_64xh_neon(const uint8_t *ref,
+ int ref_stride,
+ const int32_t *wsrc,
+ const int32_t *mask, int h) {
+ return obmc_sad_large_neon(ref, ref_stride, wsrc, mask, 64, h);
+}
+
+static INLINE unsigned int obmc_sad_32xh_neon(const uint8_t *ref,
+ int ref_stride,
+ const int32_t *wsrc,
+ const int32_t *mask, int h) {
+ return obmc_sad_large_neon(ref, ref_stride, wsrc, mask, 32, h);
+}
+
+static INLINE unsigned int obmc_sad_16xh_neon(const uint8_t *ref,
+ int ref_stride,
+ const int32_t *wsrc,
+ const int32_t *mask, int h) {
+ return obmc_sad_large_neon(ref, ref_stride, wsrc, mask, 16, h);
+}
+
+static INLINE unsigned int obmc_sad_8xh_neon(const uint8_t *ref, int ref_stride,
+ const int32_t *wsrc,
+ const int32_t *mask, int height) {
+ uint32x4_t sum = vdupq_n_u32(0);
+
+ int h = height;
+ do {
+ uint8x8_t r = vld1_u8(ref);
+
+ int16x8_t ref_s16 = vreinterpretq_s16_u16(vmovl_u8(r));
+ obmc_sad_8x1_s16_neon(ref_s16, mask, wsrc, &sum);
+
+ ref += ref_stride;
+ wsrc += 8;
+ mask += 8;
+ } while (--h != 0);
+
+ return horizontal_add_u32x4(sum);
+}
+
+static INLINE unsigned int obmc_sad_4xh_neon(const uint8_t *ref, int ref_stride,
+ const int32_t *wsrc,
+ const int32_t *mask, int height) {
+ uint32x4_t sum = vdupq_n_u32(0);
+
+ int h = height / 2;
+ do {
+ uint8x8_t r = load_unaligned_u8(ref, ref_stride);
+
+ int16x8_t ref_s16 = vreinterpretq_s16_u16(vmovl_u8(r));
+ obmc_sad_8x1_s16_neon(ref_s16, mask, wsrc, &sum);
+
+ ref += 2 * ref_stride;
+ wsrc += 8;
+ mask += 8;
+ } while (--h != 0);
+
+ return horizontal_add_u32x4(sum);
+}
+
+#define OBMC_SAD_WXH_NEON(w, h) \
+ unsigned int aom_obmc_sad##w##x##h##_neon( \
+ const uint8_t *ref, int ref_stride, const int32_t *wsrc, \
+ const int32_t *mask) { \
+ return obmc_sad_##w##xh_neon(ref, ref_stride, wsrc, mask, h); \
+ }
+
+OBMC_SAD_WXH_NEON(4, 4)
+OBMC_SAD_WXH_NEON(4, 8)
+OBMC_SAD_WXH_NEON(4, 16)
+
+OBMC_SAD_WXH_NEON(8, 4)
+OBMC_SAD_WXH_NEON(8, 8)
+OBMC_SAD_WXH_NEON(8, 16)
+OBMC_SAD_WXH_NEON(8, 32)
+
+OBMC_SAD_WXH_NEON(16, 4)
+OBMC_SAD_WXH_NEON(16, 8)
+OBMC_SAD_WXH_NEON(16, 16)
+OBMC_SAD_WXH_NEON(16, 32)
+OBMC_SAD_WXH_NEON(16, 64)
+
+OBMC_SAD_WXH_NEON(32, 8)
+OBMC_SAD_WXH_NEON(32, 16)
+OBMC_SAD_WXH_NEON(32, 32)
+OBMC_SAD_WXH_NEON(32, 64)
+
+OBMC_SAD_WXH_NEON(64, 16)
+OBMC_SAD_WXH_NEON(64, 32)
+OBMC_SAD_WXH_NEON(64, 64)
+OBMC_SAD_WXH_NEON(64, 128)
+
+OBMC_SAD_WXH_NEON(128, 64)
+OBMC_SAD_WXH_NEON(128, 128)
diff --git a/aom_dsp/arm/obmc_variance_neon.c b/aom_dsp/arm/obmc_variance_neon.c
new file mode 100644
index 0000000..50cd5f3
--- /dev/null
+++ b/aom_dsp/arm/obmc_variance_neon.c
@@ -0,0 +1,290 @@
+/*
+ * Copyright (c) 2023, Alliance for Open Media. All rights reserved
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#include <arm_neon.h>
+
+#include "config/aom_config.h"
+#include "config/aom_dsp_rtcd.h"
+#include "mem_neon.h"
+#include "sum_neon.h"
+
+static INLINE void obmc_variance_8x1_s16_neon(int16x8_t pre_s16,
+ const int32_t *wsrc,
+ const int32_t *mask,
+ int32x4_t *ssev,
+ int32x4_t *sumv) {
+ // For 4xh and 8xh we observe it is faster to avoid the double-widening of
+ // pre. Instead we do a single widening step and narrow the mask to 16-bits
+ // to allow us to perform a widening multiply. Widening multiply
+ // instructions have better throughput on some micro-architectures but for
+ // the larger block sizes this benefit is outweighed by the additional
+ // instruction needed to first narrow the mask vectors.
+
+ int32x4_t wsrc_s32_lo = vld1q_s32(&wsrc[0]);
+ int32x4_t wsrc_s32_hi = vld1q_s32(&wsrc[4]);
+ int16x8_t mask_s16 = vuzpq_s16(vreinterpretq_s16_s32(vld1q_s32(&mask[0])),
+ vreinterpretq_s16_s32(vld1q_s32(&mask[4])))
+ .val[0];
+
+ int32x4_t diff_s32_lo =
+ vmlsl_s16(wsrc_s32_lo, vget_low_s16(pre_s16), vget_low_s16(mask_s16));
+ int32x4_t diff_s32_hi =
+ vmlsl_s16(wsrc_s32_hi, vget_high_s16(pre_s16), vget_high_s16(mask_s16));
+
+ // ROUND_POWER_OF_TWO_SIGNED(value, 12) rounds to nearest with ties away
+ // from zero, however vrshrq_n_s32 rounds to nearest with ties rounded up.
+ // This difference only affects the bit patterns at the rounding breakpoints
+ // exactly, so we can add -1 to all negative numbers to move the breakpoint
+ // one value across and into the correct rounding region.
+ diff_s32_lo = vsraq_n_s32(diff_s32_lo, diff_s32_lo, 31);
+ diff_s32_hi = vsraq_n_s32(diff_s32_hi, diff_s32_hi, 31);
+ int32x4_t round_s32_lo = vrshrq_n_s32(diff_s32_lo, 12);
+ int32x4_t round_s32_hi = vrshrq_n_s32(diff_s32_hi, 12);
+
+ *sumv = vrsraq_n_s32(*sumv, diff_s32_lo, 12);
+ *sumv = vrsraq_n_s32(*sumv, diff_s32_hi, 12);
+ *ssev = vmlaq_s32(*ssev, round_s32_lo, round_s32_lo);
+ *ssev = vmlaq_s32(*ssev, round_s32_hi, round_s32_hi);
+}
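
Worked example of the breakpoint fix, taking diff = -2048: ROUND_POWER_OF_TWO_SIGNED(-2048, 12) = -((2048 + 2048) >> 12) = -1, but a plain vrshrq_n_s32 would give (-2048 + 2048) >> 12 = 0. vsraq_n_s32(diff, diff, 31) adds diff >> 31 = -1 to negative lanes first, so the rounding shift then yields (-2049 + 2048) >> 12 = -1, matching the C reference. Non-negative lanes are unchanged since diff >> 31 = 0 for them.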
+
+#if AOM_ARCH_AARCH64
+
+// Use tbl for doing a double-width zero extension from 8->32 bits since we can
+// do this in one instruction rather than two (indices out of range (255 here)
+// are set to zero by tbl).
+DECLARE_ALIGNED(16, static const uint8_t, obmc_variance_permute_idx[]) = {
+ 0, 255, 255, 255, 1, 255, 255, 255, 2, 255, 255, 255, 3, 255, 255, 255,
+ 4, 255, 255, 255, 5, 255, 255, 255, 6, 255, 255, 255, 7, 255, 255, 255,
+ 8, 255, 255, 255, 9, 255, 255, 255, 10, 255, 255, 255, 11, 255, 255, 255,
+ 12, 255, 255, 255, 13, 255, 255, 255, 14, 255, 255, 255, 15, 255, 255, 255
+};
+
+static INLINE void obmc_variance_8x1_s32_neon(
+ int32x4_t pre_lo, int32x4_t pre_hi, const int32_t *wsrc,
+ const int32_t *mask, int32x4_t *ssev, int32x4_t *sumv) {
+ int32x4_t wsrc_lo = vld1q_s32(&wsrc[0]);
+ int32x4_t wsrc_hi = vld1q_s32(&wsrc[4]);
+ int32x4_t mask_lo = vld1q_s32(&mask[0]);
+ int32x4_t mask_hi = vld1q_s32(&mask[4]);
+
+ int32x4_t diff_lo = vmlsq_s32(wsrc_lo, pre_lo, mask_lo);
+ int32x4_t diff_hi = vmlsq_s32(wsrc_hi, pre_hi, mask_hi);
+
+ // ROUND_POWER_OF_TWO_SIGNED(value, 12) rounds to nearest with ties away from
+ // zero, however vrshrq_n_s32 rounds to nearest with ties rounded up. This
+ // difference only affects the bit patterns at the rounding breakpoints
+ // exactly, so we can add -1 to all negative numbers to move the breakpoint
+ // one value across and into the correct rounding region.
+ diff_lo = vsraq_n_s32(diff_lo, diff_lo, 31);
+ diff_hi = vsraq_n_s32(diff_hi, diff_hi, 31);
+ int32x4_t round_lo = vrshrq_n_s32(diff_lo, 12);
+ int32x4_t round_hi = vrshrq_n_s32(diff_hi, 12);
+
+ *sumv = vrsraq_n_s32(*sumv, diff_lo, 12);
+ *sumv = vrsraq_n_s32(*sumv, diff_hi, 12);
+ *ssev = vmlaq_s32(*ssev, round_lo, round_lo);
+ *ssev = vmlaq_s32(*ssev, round_hi, round_hi);
+}
+
+static INLINE void obmc_variance_large_neon(const uint8_t *pre, int pre_stride,
+ const int32_t *wsrc,
+ const int32_t *mask, int width,
+ int height, unsigned *sse,
+ int *sum) {
+ assert(width % 16 == 0);
+
+ // Use tbl for doing a double-width zero extension from 8->32 bits since we
+ // can do this in one instruction rather than two.
+ uint8x16_t pre_idx0 = vld1q_u8(&obmc_variance_permute_idx[0]);
+ uint8x16_t pre_idx1 = vld1q_u8(&obmc_variance_permute_idx[16]);
+ uint8x16_t pre_idx2 = vld1q_u8(&obmc_variance_permute_idx[32]);
+ uint8x16_t pre_idx3 = vld1q_u8(&obmc_variance_permute_idx[48]);
+
+ int32x4_t ssev = vdupq_n_s32(0);
+ int32x4_t sumv = vdupq_n_s32(0);
+
+ int h = height;
+ do {
+ int w = width;
+ do {
+ uint8x16_t pre_u8 = vld1q_u8(pre);
+
+ int32x4_t pre_s32_lo = vreinterpretq_s32_u8(vqtbl1q_u8(pre_u8, pre_idx0));
+ int32x4_t pre_s32_hi = vreinterpretq_s32_u8(vqtbl1q_u8(pre_u8, pre_idx1));
+ obmc_variance_8x1_s32_neon(pre_s32_lo, pre_s32_hi, &wsrc[0], &mask[0],
+ &ssev, &sumv);
+
+ pre_s32_lo = vreinterpretq_s32_u8(vqtbl1q_u8(pre_u8, pre_idx2));
+ pre_s32_hi = vreinterpretq_s32_u8(vqtbl1q_u8(pre_u8, pre_idx3));
+ obmc_variance_8x1_s32_neon(pre_s32_lo, pre_s32_hi, &wsrc[8], &mask[8],
+ &ssev, &sumv);
+
+ wsrc += 16;
+ mask += 16;
+ pre += 16;
+ w -= 16;
+ } while (w != 0);
+
+ pre += pre_stride - width;
+ } while (--h != 0);
+
+ *sse = horizontal_add_s32x4(ssev);
+ *sum = horizontal_add_s32x4(sumv);
+}
+
+#else // !AOM_ARCH_AARCH64
+
+static INLINE void obmc_variance_large_neon(const uint8_t *pre, int pre_stride,
+ const int32_t *wsrc,
+ const int32_t *mask, int width,
+ int height, unsigned *sse,
+ int *sum) {
+ // Non-aarch64 targets do not have a 128-bit tbl instruction, so use the
+ // widening version of the core kernel instead.
+
+ assert(width % 16 == 0);
+
+ int32x4_t ssev = vdupq_n_s32(0);
+ int32x4_t sumv = vdupq_n_s32(0);
+
+ int h = height;
+ do {
+ int w = width;
+ do {
+ uint8x16_t pre_u8 = vld1q_u8(pre);
+
+ int16x8_t pre_s16 = vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(pre_u8)));
+ obmc_variance_8x1_s16_neon(pre_s16, &wsrc[0], &mask[0], &ssev, &sumv);
+
+ pre_s16 = vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(pre_u8)));
+ obmc_variance_8x1_s16_neon(pre_s16, &wsrc[8], &mask[8], &ssev, &sumv);
+
+ wsrc += 16;
+ mask += 16;
+ pre += 16;
+ w -= 16;
+ } while (w != 0);
+
+ pre += pre_stride - width;
+ } while (--h != 0);
+
+ *sse = horizontal_add_s32x4(ssev);
+ *sum = horizontal_add_s32x4(sumv);
+}
+
+#endif // AOM_ARCH_AARCH64
+
+static INLINE void obmc_variance_neon_128xh(const uint8_t *pre, int pre_stride,
+ const int32_t *wsrc,
+ const int32_t *mask, int h,
+ unsigned *sse, int *sum) {
+ obmc_variance_large_neon(pre, pre_stride, wsrc, mask, 128, h, sse, sum);
+}
+
+static INLINE void obmc_variance_neon_64xh(const uint8_t *pre, int pre_stride,
+ const int32_t *wsrc,
+ const int32_t *mask, int h,
+ unsigned *sse, int *sum) {
+ obmc_variance_large_neon(pre, pre_stride, wsrc, mask, 64, h, sse, sum);
+}
+
+static INLINE void obmc_variance_neon_32xh(const uint8_t *pre, int pre_stride,
+ const int32_t *wsrc,
+ const int32_t *mask, int h,
+ unsigned *sse, int *sum) {
+ obmc_variance_large_neon(pre, pre_stride, wsrc, mask, 32, h, sse, sum);
+}
+
+static INLINE void obmc_variance_neon_16xh(const uint8_t *pre, int pre_stride,
+ const int32_t *wsrc,
+ const int32_t *mask, int h,
+ unsigned *sse, int *sum) {
+ obmc_variance_large_neon(pre, pre_stride, wsrc, mask, 16, h, sse, sum);
+}
+
+static INLINE void obmc_variance_neon_8xh(const uint8_t *pre, int pre_stride,
+ const int32_t *wsrc,
+ const int32_t *mask, int h,
+ unsigned *sse, int *sum) {
+ int32x4_t ssev = vdupq_n_s32(0);
+ int32x4_t sumv = vdupq_n_s32(0);
+
+ do {
+ uint8x8_t pre_u8 = vld1_u8(pre);
+ int16x8_t pre_s16 = vreinterpretq_s16_u16(vmovl_u8(pre_u8));
+
+ obmc_variance_8x1_s16_neon(pre_s16, wsrc, mask, &ssev, &sumv);
+
+ pre += pre_stride;
+ wsrc += 8;
+ mask += 8;
+ } while (--h != 0);
+
+ *sse = horizontal_add_s32x4(ssev);
+ *sum = horizontal_add_s32x4(sumv);
+}
+
+static INLINE void obmc_variance_neon_4xh(const uint8_t *pre, int pre_stride,
+ const int32_t *wsrc,
+ const int32_t *mask, int h,
+ unsigned *sse, int *sum) {
+ assert(h % 2 == 0);
+
+ int32x4_t ssev = vdupq_n_s32(0);
+ int32x4_t sumv = vdupq_n_s32(0);
+
+ do {
+ uint8x8_t pre_u8 = load_unaligned_u8(pre, pre_stride);
+ int16x8_t pre_s16 = vreinterpretq_s16_u16(vmovl_u8(pre_u8));
+
+ obmc_variance_8x1_s16_neon(pre_s16, wsrc, mask, &ssev, &sumv);
+
+ pre += 2 * pre_stride;
+ wsrc += 8;
+ mask += 8;
+ h -= 2;
+ } while (h != 0);
+
+ *sse = horizontal_add_s32x4(ssev);
+ *sum = horizontal_add_s32x4(sumv);
+}
+
+#define OBMC_VARIANCE_WXH_NEON(W, H) \
+ unsigned aom_obmc_variance##W##x##H##_neon( \
+ const uint8_t *pre, int pre_stride, const int32_t *wsrc, \
+ const int32_t *mask, unsigned *sse) { \
+ int sum; \
+ obmc_variance_neon_##W##xh(pre, pre_stride, wsrc, mask, H, sse, &sum); \
+ return *sse - (unsigned)(((int64_t)sum * sum) / (W * H)); \
+ }
+
+OBMC_VARIANCE_WXH_NEON(4, 4)
+OBMC_VARIANCE_WXH_NEON(4, 8)
+OBMC_VARIANCE_WXH_NEON(8, 4)
+OBMC_VARIANCE_WXH_NEON(8, 8)
+OBMC_VARIANCE_WXH_NEON(8, 16)
+OBMC_VARIANCE_WXH_NEON(16, 8)
+OBMC_VARIANCE_WXH_NEON(16, 16)
+OBMC_VARIANCE_WXH_NEON(16, 32)
+OBMC_VARIANCE_WXH_NEON(32, 16)
+OBMC_VARIANCE_WXH_NEON(32, 32)
+OBMC_VARIANCE_WXH_NEON(32, 64)
+OBMC_VARIANCE_WXH_NEON(64, 32)
+OBMC_VARIANCE_WXH_NEON(64, 64)
+OBMC_VARIANCE_WXH_NEON(64, 128)
+OBMC_VARIANCE_WXH_NEON(128, 64)
+OBMC_VARIANCE_WXH_NEON(128, 128)
+OBMC_VARIANCE_WXH_NEON(4, 16)
+OBMC_VARIANCE_WXH_NEON(16, 4)
+OBMC_VARIANCE_WXH_NEON(8, 32)
+OBMC_VARIANCE_WXH_NEON(32, 8)
+OBMC_VARIANCE_WXH_NEON(16, 64)
+OBMC_VARIANCE_WXH_NEON(64, 16)
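
The vsraq_n_s32(diff, diff, 31) adjustment in the kernels above can be checked
against a scalar model. A minimal sketch (hypothetical helper names; right
shifts of negative values are assumed to be arithmetic, as they are on the
compilers this code targets):

    #include <assert.h>
    #include <stdint.h>

    // Round to nearest with ties away from zero, as
    // ROUND_POWER_OF_TWO_SIGNED(value, 12) does.
    static int32_t round_ties_away(int32_t v, int n) {
      return (v < 0) ? -(int32_t)((-(int64_t)v + (1 << (n - 1))) >> n)
                     : (int32_t)(((int64_t)v + (1 << (n - 1))) >> n);
    }

    // The NEON trick: v += v >> 31 adds -1 to negative values only, after
    // which a rounding shift with ties rounded up (vrshrq_n_s32) matches.
    static int32_t round_neon_model(int32_t v, int n) {
      v += v >> 31;
      return (int32_t)(((int64_t)v + (1 << (n - 1))) >> n);
    }

    int main(void) {
      for (int32_t v = -(1 << 20); v <= (1 << 20); ++v) {
        assert(round_ties_away(v, 12) == round_neon_model(v, 12));
      }
      return 0;
    }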
diff --git a/aom_dsp/arm/sad4d_neon.c b/aom_dsp/arm/sad4d_neon.c
deleted file mode 100644
index e1eccc3..0000000
--- a/aom_dsp/arm/sad4d_neon.c
+++ /dev/null
@@ -1,534 +0,0 @@
-/*
- * Copyright (c) 2016, Alliance for Open Media. All rights reserved
- *
- * This source code is subject to the terms of the BSD 2 Clause License and
- * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
- * was not distributed with this source code in the LICENSE file, you can
- * obtain it at www.aomedia.org/license/software. If the Alliance for Open
- * Media Patent License 1.0 was not distributed with this source code in the
- * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
- */
-
-#include <arm_neon.h>
-
-#include "config/aom_config.h"
-#include "config/aom_dsp_rtcd.h"
-
-#include "aom/aom_integer.h"
-#include "aom_dsp/arm/sum_neon.h"
-
-#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
-
-static INLINE void sad16_neon(uint8x16_t src, uint8x16_t ref,
- uint32x4_t *const sad_sum) {
- uint8x16_t abs_diff = vabdq_u8(src, ref);
- *sad_sum = vdotq_u32(*sad_sum, abs_diff, vdupq_n_u8(1));
-}
-
-static INLINE void sad128xhx4d_neon(const uint8_t *src, int src_stride,
- const uint8_t *const ref[4], int ref_stride,
- uint32_t res[4], int h) {
- uint32x4_t sum_lo[4] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0),
- vdupq_n_u32(0) };
- uint32x4_t sum_hi[4] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0),
- vdupq_n_u32(0) };
-
- int i = 0;
- do {
- const uint8x16_t s0 = vld1q_u8(src + i * src_stride);
- sad16_neon(s0, vld1q_u8(ref[0] + i * ref_stride), &sum_lo[0]);
- sad16_neon(s0, vld1q_u8(ref[1] + i * ref_stride), &sum_lo[1]);
- sad16_neon(s0, vld1q_u8(ref[2] + i * ref_stride), &sum_lo[2]);
- sad16_neon(s0, vld1q_u8(ref[3] + i * ref_stride), &sum_lo[3]);
-
- const uint8x16_t s1 = vld1q_u8(src + i * src_stride + 16);
- sad16_neon(s1, vld1q_u8(ref[0] + i * ref_stride + 16), &sum_hi[0]);
- sad16_neon(s1, vld1q_u8(ref[1] + i * ref_stride + 16), &sum_hi[1]);
- sad16_neon(s1, vld1q_u8(ref[2] + i * ref_stride + 16), &sum_hi[2]);
- sad16_neon(s1, vld1q_u8(ref[3] + i * ref_stride + 16), &sum_hi[3]);
-
- const uint8x16_t s2 = vld1q_u8(src + i * src_stride + 32);
- sad16_neon(s2, vld1q_u8(ref[0] + i * ref_stride + 32), &sum_lo[0]);
- sad16_neon(s2, vld1q_u8(ref[1] + i * ref_stride + 32), &sum_lo[1]);
- sad16_neon(s2, vld1q_u8(ref[2] + i * ref_stride + 32), &sum_lo[2]);
- sad16_neon(s2, vld1q_u8(ref[3] + i * ref_stride + 32), &sum_lo[3]);
-
- const uint8x16_t s3 = vld1q_u8(src + i * src_stride + 48);
- sad16_neon(s3, vld1q_u8(ref[0] + i * ref_stride + 48), &sum_hi[0]);
- sad16_neon(s3, vld1q_u8(ref[1] + i * ref_stride + 48), &sum_hi[1]);
- sad16_neon(s3, vld1q_u8(ref[2] + i * ref_stride + 48), &sum_hi[2]);
- sad16_neon(s3, vld1q_u8(ref[3] + i * ref_stride + 48), &sum_hi[3]);
-
- const uint8x16_t s4 = vld1q_u8(src + i * src_stride + 64);
- sad16_neon(s4, vld1q_u8(ref[0] + i * ref_stride + 64), &sum_lo[0]);
- sad16_neon(s4, vld1q_u8(ref[1] + i * ref_stride + 64), &sum_lo[1]);
- sad16_neon(s4, vld1q_u8(ref[2] + i * ref_stride + 64), &sum_lo[2]);
- sad16_neon(s4, vld1q_u8(ref[3] + i * ref_stride + 64), &sum_lo[3]);
-
- const uint8x16_t s5 = vld1q_u8(src + i * src_stride + 80);
- sad16_neon(s5, vld1q_u8(ref[0] + i * ref_stride + 80), &sum_hi[0]);
- sad16_neon(s5, vld1q_u8(ref[1] + i * ref_stride + 80), &sum_hi[1]);
- sad16_neon(s5, vld1q_u8(ref[2] + i * ref_stride + 80), &sum_hi[2]);
- sad16_neon(s5, vld1q_u8(ref[3] + i * ref_stride + 80), &sum_hi[3]);
-
- const uint8x16_t s6 = vld1q_u8(src + i * src_stride + 96);
- sad16_neon(s6, vld1q_u8(ref[0] + i * ref_stride + 96), &sum_lo[0]);
- sad16_neon(s6, vld1q_u8(ref[1] + i * ref_stride + 96), &sum_lo[1]);
- sad16_neon(s6, vld1q_u8(ref[2] + i * ref_stride + 96), &sum_lo[2]);
- sad16_neon(s6, vld1q_u8(ref[3] + i * ref_stride + 96), &sum_lo[3]);
-
- const uint8x16_t s7 = vld1q_u8(src + i * src_stride + 112);
- sad16_neon(s7, vld1q_u8(ref[0] + i * ref_stride + 112), &sum_hi[0]);
- sad16_neon(s7, vld1q_u8(ref[1] + i * ref_stride + 112), &sum_hi[1]);
- sad16_neon(s7, vld1q_u8(ref[2] + i * ref_stride + 112), &sum_hi[2]);
- sad16_neon(s7, vld1q_u8(ref[3] + i * ref_stride + 112), &sum_hi[3]);
-
- i++;
- } while (i < h);
-
- uint32x4_t res0 = vpaddq_u32(vaddq_u32(sum_lo[0], sum_hi[0]),
- vaddq_u32(sum_lo[1], sum_hi[1]));
- uint32x4_t res1 = vpaddq_u32(vaddq_u32(sum_lo[2], sum_hi[2]),
- vaddq_u32(sum_lo[3], sum_hi[3]));
- vst1q_u32(res, vpaddq_u32(res0, res1));
-}
-
-static INLINE void sad64xhx4d_neon(const uint8_t *src, int src_stride,
- const uint8_t *const ref[4], int ref_stride,
- uint32_t res[4], int h) {
- uint32x4_t sum_lo[4] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0),
- vdupq_n_u32(0) };
- uint32x4_t sum_hi[4] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0),
- vdupq_n_u32(0) };
-
- int i = 0;
- do {
- const uint8x16_t s0 = vld1q_u8(src + i * src_stride);
- sad16_neon(s0, vld1q_u8(ref[0] + i * ref_stride), &sum_lo[0]);
- sad16_neon(s0, vld1q_u8(ref[1] + i * ref_stride), &sum_lo[1]);
- sad16_neon(s0, vld1q_u8(ref[2] + i * ref_stride), &sum_lo[2]);
- sad16_neon(s0, vld1q_u8(ref[3] + i * ref_stride), &sum_lo[3]);
-
- const uint8x16_t s1 = vld1q_u8(src + i * src_stride + 16);
- sad16_neon(s1, vld1q_u8(ref[0] + i * ref_stride + 16), &sum_hi[0]);
- sad16_neon(s1, vld1q_u8(ref[1] + i * ref_stride + 16), &sum_hi[1]);
- sad16_neon(s1, vld1q_u8(ref[2] + i * ref_stride + 16), &sum_hi[2]);
- sad16_neon(s1, vld1q_u8(ref[3] + i * ref_stride + 16), &sum_hi[3]);
-
- const uint8x16_t s2 = vld1q_u8(src + i * src_stride + 32);
- sad16_neon(s2, vld1q_u8(ref[0] + i * ref_stride + 32), &sum_lo[0]);
- sad16_neon(s2, vld1q_u8(ref[1] + i * ref_stride + 32), &sum_lo[1]);
- sad16_neon(s2, vld1q_u8(ref[2] + i * ref_stride + 32), &sum_lo[2]);
- sad16_neon(s2, vld1q_u8(ref[3] + i * ref_stride + 32), &sum_lo[3]);
-
- const uint8x16_t s3 = vld1q_u8(src + i * src_stride + 48);
- sad16_neon(s3, vld1q_u8(ref[0] + i * ref_stride + 48), &sum_hi[0]);
- sad16_neon(s3, vld1q_u8(ref[1] + i * ref_stride + 48), &sum_hi[1]);
- sad16_neon(s3, vld1q_u8(ref[2] + i * ref_stride + 48), &sum_hi[2]);
- sad16_neon(s3, vld1q_u8(ref[3] + i * ref_stride + 48), &sum_hi[3]);
-
- i++;
- } while (i < h);
-
- uint32x4_t res0 = vpaddq_u32(vaddq_u32(sum_lo[0], sum_hi[0]),
- vaddq_u32(sum_lo[1], sum_hi[1]));
- uint32x4_t res1 = vpaddq_u32(vaddq_u32(sum_lo[2], sum_hi[2]),
- vaddq_u32(sum_lo[3], sum_hi[3]));
- vst1q_u32(res, vpaddq_u32(res0, res1));
-}
-
-static INLINE void sad32xhx4d_neon(const uint8_t *src, int src_stride,
- const uint8_t *const ref[4], int ref_stride,
- uint32_t res[4], int h) {
- uint32x4_t sum_lo[4] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0),
- vdupq_n_u32(0) };
- uint32x4_t sum_hi[4] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0),
- vdupq_n_u32(0) };
-
- int i = 0;
- do {
- const uint8x16_t s0 = vld1q_u8(src + i * src_stride);
- sad16_neon(s0, vld1q_u8(ref[0] + i * ref_stride), &sum_lo[0]);
- sad16_neon(s0, vld1q_u8(ref[1] + i * ref_stride), &sum_lo[1]);
- sad16_neon(s0, vld1q_u8(ref[2] + i * ref_stride), &sum_lo[2]);
- sad16_neon(s0, vld1q_u8(ref[3] + i * ref_stride), &sum_lo[3]);
-
- const uint8x16_t s1 = vld1q_u8(src + i * src_stride + 16);
- sad16_neon(s1, vld1q_u8(ref[0] + i * ref_stride + 16), &sum_hi[0]);
- sad16_neon(s1, vld1q_u8(ref[1] + i * ref_stride + 16), &sum_hi[1]);
- sad16_neon(s1, vld1q_u8(ref[2] + i * ref_stride + 16), &sum_hi[2]);
- sad16_neon(s1, vld1q_u8(ref[3] + i * ref_stride + 16), &sum_hi[3]);
-
- i++;
- } while (i < h);
-
- uint32x4_t res0 = vpaddq_u32(vaddq_u32(sum_lo[0], sum_hi[0]),
- vaddq_u32(sum_lo[1], sum_hi[1]));
- uint32x4_t res1 = vpaddq_u32(vaddq_u32(sum_lo[2], sum_hi[2]),
- vaddq_u32(sum_lo[3], sum_hi[3]));
- vst1q_u32(res, vpaddq_u32(res0, res1));
-}
-
-static INLINE void sad16xhx4d_neon(const uint8_t *src, int src_stride,
- const uint8_t *const ref[4], int ref_stride,
- uint32_t res[4], int h) {
- uint32x4_t sum[4] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0),
- vdupq_n_u32(0) };
-
- int i = 0;
- do {
- const uint8x16_t s = vld1q_u8(src + i * src_stride);
- sad16_neon(s, vld1q_u8(ref[0] + i * ref_stride), &sum[0]);
- sad16_neon(s, vld1q_u8(ref[1] + i * ref_stride), &sum[1]);
- sad16_neon(s, vld1q_u8(ref[2] + i * ref_stride), &sum[2]);
- sad16_neon(s, vld1q_u8(ref[3] + i * ref_stride), &sum[3]);
-
- i++;
- } while (i < h);
-
- uint32x4_t res0 = vpaddq_u32(sum[0], sum[1]);
- uint32x4_t res1 = vpaddq_u32(sum[2], sum[3]);
- vst1q_u32(res, vpaddq_u32(res0, res1));
-}
-
-#else // !(defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD))
-
-static INLINE void sad16_neon(uint8x16_t src, uint8x16_t ref,
- uint16x8_t *const sad_sum) {
- uint8x16_t abs_diff = vabdq_u8(src, ref);
- *sad_sum = vpadalq_u8(*sad_sum, abs_diff);
-}
-
-static INLINE void sad128xhx4d_neon(const uint8_t *src, int src_stride,
- const uint8_t *const ref[4], int ref_stride,
- uint32_t res[4], int h) {
- vst1q_u32(res, vdupq_n_u32(0));
- int h_tmp = h > 32 ? 32 : h;
-
- int i = 0;
- do {
- uint16x8_t sum_lo[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
- vdupq_n_u16(0) };
- uint16x8_t sum_hi[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
- vdupq_n_u16(0) };
-
- do {
- const uint8x16_t s0 = vld1q_u8(src + i * src_stride);
- sad16_neon(s0, vld1q_u8(ref[0] + i * ref_stride), &sum_lo[0]);
- sad16_neon(s0, vld1q_u8(ref[1] + i * ref_stride), &sum_lo[1]);
- sad16_neon(s0, vld1q_u8(ref[2] + i * ref_stride), &sum_lo[2]);
- sad16_neon(s0, vld1q_u8(ref[3] + i * ref_stride), &sum_lo[3]);
-
- const uint8x16_t s1 = vld1q_u8(src + i * src_stride + 16);
- sad16_neon(s1, vld1q_u8(ref[0] + i * ref_stride + 16), &sum_hi[0]);
- sad16_neon(s1, vld1q_u8(ref[1] + i * ref_stride + 16), &sum_hi[1]);
- sad16_neon(s1, vld1q_u8(ref[2] + i * ref_stride + 16), &sum_hi[2]);
- sad16_neon(s1, vld1q_u8(ref[3] + i * ref_stride + 16), &sum_hi[3]);
-
- const uint8x16_t s2 = vld1q_u8(src + i * src_stride + 32);
- sad16_neon(s2, vld1q_u8(ref[0] + i * ref_stride + 32), &sum_lo[0]);
- sad16_neon(s2, vld1q_u8(ref[1] + i * ref_stride + 32), &sum_lo[1]);
- sad16_neon(s2, vld1q_u8(ref[2] + i * ref_stride + 32), &sum_lo[2]);
- sad16_neon(s2, vld1q_u8(ref[3] + i * ref_stride + 32), &sum_lo[3]);
-
- const uint8x16_t s3 = vld1q_u8(src + i * src_stride + 48);
- sad16_neon(s3, vld1q_u8(ref[0] + i * ref_stride + 48), &sum_hi[0]);
- sad16_neon(s3, vld1q_u8(ref[1] + i * ref_stride + 48), &sum_hi[1]);
- sad16_neon(s3, vld1q_u8(ref[2] + i * ref_stride + 48), &sum_hi[2]);
- sad16_neon(s3, vld1q_u8(ref[3] + i * ref_stride + 48), &sum_hi[3]);
-
- const uint8x16_t s4 = vld1q_u8(src + i * src_stride + 64);
- sad16_neon(s4, vld1q_u8(ref[0] + i * ref_stride + 64), &sum_lo[0]);
- sad16_neon(s4, vld1q_u8(ref[1] + i * ref_stride + 64), &sum_lo[1]);
- sad16_neon(s4, vld1q_u8(ref[2] + i * ref_stride + 64), &sum_lo[2]);
- sad16_neon(s4, vld1q_u8(ref[3] + i * ref_stride + 64), &sum_lo[3]);
-
- const uint8x16_t s5 = vld1q_u8(src + i * src_stride + 80);
- sad16_neon(s5, vld1q_u8(ref[0] + i * ref_stride + 80), &sum_hi[0]);
- sad16_neon(s5, vld1q_u8(ref[1] + i * ref_stride + 80), &sum_hi[1]);
- sad16_neon(s5, vld1q_u8(ref[2] + i * ref_stride + 80), &sum_hi[2]);
- sad16_neon(s5, vld1q_u8(ref[3] + i * ref_stride + 80), &sum_hi[3]);
-
- const uint8x16_t s6 = vld1q_u8(src + i * src_stride + 96);
- sad16_neon(s6, vld1q_u8(ref[0] + i * ref_stride + 96), &sum_lo[0]);
- sad16_neon(s6, vld1q_u8(ref[1] + i * ref_stride + 96), &sum_lo[1]);
- sad16_neon(s6, vld1q_u8(ref[2] + i * ref_stride + 96), &sum_lo[2]);
- sad16_neon(s6, vld1q_u8(ref[3] + i * ref_stride + 96), &sum_lo[3]);
-
- const uint8x16_t s7 = vld1q_u8(src + i * src_stride + 112);
- sad16_neon(s7, vld1q_u8(ref[0] + i * ref_stride + 112), &sum_hi[0]);
- sad16_neon(s7, vld1q_u8(ref[1] + i * ref_stride + 112), &sum_hi[1]);
- sad16_neon(s7, vld1q_u8(ref[2] + i * ref_stride + 112), &sum_hi[2]);
- sad16_neon(s7, vld1q_u8(ref[3] + i * ref_stride + 112), &sum_hi[3]);
-
- i++;
- } while (i < h_tmp);
-
- res[0] += horizontal_long_add_u16x8(sum_lo[0], sum_hi[0]);
- res[1] += horizontal_long_add_u16x8(sum_lo[1], sum_hi[1]);
- res[2] += horizontal_long_add_u16x8(sum_lo[2], sum_hi[2]);
- res[3] += horizontal_long_add_u16x8(sum_lo[3], sum_hi[3]);
-
- h_tmp += 32;
- } while (i < h);
-}
-
-static INLINE void sad64xhx4d_neon(const uint8_t *src, int src_stride,
- const uint8_t *const ref[4], int ref_stride,
- uint32_t res[4], int h) {
- vst1q_u32(res, vdupq_n_u32(0));
- int h_tmp = h > 64 ? 64 : h;
-
- int i = 0;
- do {
- uint16x8_t sum_lo[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
- vdupq_n_u16(0) };
- uint16x8_t sum_hi[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
- vdupq_n_u16(0) };
-
- do {
- const uint8x16_t s0 = vld1q_u8(src + i * src_stride);
- sad16_neon(s0, vld1q_u8(ref[0] + i * ref_stride), &sum_lo[0]);
- sad16_neon(s0, vld1q_u8(ref[1] + i * ref_stride), &sum_lo[1]);
- sad16_neon(s0, vld1q_u8(ref[2] + i * ref_stride), &sum_lo[2]);
- sad16_neon(s0, vld1q_u8(ref[3] + i * ref_stride), &sum_lo[3]);
-
- const uint8x16_t s1 = vld1q_u8(src + i * src_stride + 16);
- sad16_neon(s1, vld1q_u8(ref[0] + i * ref_stride + 16), &sum_hi[0]);
- sad16_neon(s1, vld1q_u8(ref[1] + i * ref_stride + 16), &sum_hi[1]);
- sad16_neon(s1, vld1q_u8(ref[2] + i * ref_stride + 16), &sum_hi[2]);
- sad16_neon(s1, vld1q_u8(ref[3] + i * ref_stride + 16), &sum_hi[3]);
-
- const uint8x16_t s2 = vld1q_u8(src + i * src_stride + 32);
- sad16_neon(s2, vld1q_u8(ref[0] + i * ref_stride + 32), &sum_lo[0]);
- sad16_neon(s2, vld1q_u8(ref[1] + i * ref_stride + 32), &sum_lo[1]);
- sad16_neon(s2, vld1q_u8(ref[2] + i * ref_stride + 32), &sum_lo[2]);
- sad16_neon(s2, vld1q_u8(ref[3] + i * ref_stride + 32), &sum_lo[3]);
-
- const uint8x16_t s3 = vld1q_u8(src + i * src_stride + 48);
- sad16_neon(s3, vld1q_u8(ref[0] + i * ref_stride + 48), &sum_hi[0]);
- sad16_neon(s3, vld1q_u8(ref[1] + i * ref_stride + 48), &sum_hi[1]);
- sad16_neon(s3, vld1q_u8(ref[2] + i * ref_stride + 48), &sum_hi[2]);
- sad16_neon(s3, vld1q_u8(ref[3] + i * ref_stride + 48), &sum_hi[3]);
-
- i++;
- } while (i < h_tmp);
-
- res[0] += horizontal_long_add_u16x8(sum_lo[0], sum_hi[0]);
- res[1] += horizontal_long_add_u16x8(sum_lo[1], sum_hi[1]);
- res[2] += horizontal_long_add_u16x8(sum_lo[2], sum_hi[2]);
- res[3] += horizontal_long_add_u16x8(sum_lo[3], sum_hi[3]);
-
- h_tmp += 64;
- } while (i < h);
-}
-
-static INLINE void sad32xhx4d_neon(const uint8_t *src, int src_stride,
- const uint8_t *const ref[4], int ref_stride,
- uint32_t res[4], int h) {
- uint16x8_t sum_lo[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
- vdupq_n_u16(0) };
- uint16x8_t sum_hi[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
- vdupq_n_u16(0) };
-
- int i = 0;
- do {
- const uint8x16_t s0 = vld1q_u8(src + i * src_stride);
- sad16_neon(s0, vld1q_u8(ref[0] + i * ref_stride), &sum_lo[0]);
- sad16_neon(s0, vld1q_u8(ref[1] + i * ref_stride), &sum_lo[1]);
- sad16_neon(s0, vld1q_u8(ref[2] + i * ref_stride), &sum_lo[2]);
- sad16_neon(s0, vld1q_u8(ref[3] + i * ref_stride), &sum_lo[3]);
-
- const uint8x16_t s1 = vld1q_u8(src + i * src_stride + 16);
- sad16_neon(s1, vld1q_u8(ref[0] + i * ref_stride + 16), &sum_hi[0]);
- sad16_neon(s1, vld1q_u8(ref[1] + i * ref_stride + 16), &sum_hi[1]);
- sad16_neon(s1, vld1q_u8(ref[2] + i * ref_stride + 16), &sum_hi[2]);
- sad16_neon(s1, vld1q_u8(ref[3] + i * ref_stride + 16), &sum_hi[3]);
-
- i++;
- } while (i < h);
-
- res[0] = horizontal_long_add_u16x8(sum_lo[0], sum_hi[0]);
- res[1] = horizontal_long_add_u16x8(sum_lo[1], sum_hi[1]);
- res[2] = horizontal_long_add_u16x8(sum_lo[2], sum_hi[2]);
- res[3] = horizontal_long_add_u16x8(sum_lo[3], sum_hi[3]);
-}
-
-static INLINE void sad16xhx4d_neon(const uint8_t *src, int src_stride,
- const uint8_t *const ref[4], int ref_stride,
- uint32_t res[4], int h) {
- uint16x8_t sum[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
- vdupq_n_u16(0) };
-
- int i = 0;
- do {
- const uint8x16_t s = vld1q_u8(src + i * src_stride);
- sad16_neon(s, vld1q_u8(ref[0] + i * ref_stride), &sum[0]);
- sad16_neon(s, vld1q_u8(ref[1] + i * ref_stride), &sum[1]);
- sad16_neon(s, vld1q_u8(ref[2] + i * ref_stride), &sum[2]);
- sad16_neon(s, vld1q_u8(ref[3] + i * ref_stride), &sum[3]);
-
- i++;
- } while (i < h);
-
- res[0] = horizontal_add_u16x8(sum[0]);
- res[1] = horizontal_add_u16x8(sum[1]);
- res[2] = horizontal_add_u16x8(sum[2]);
- res[3] = horizontal_add_u16x8(sum[3]);
-}
-
-#endif // defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
-
-static INLINE void sad8_neon(uint8x8_t src, uint8x8_t ref,
- uint16x8_t *const sad_sum) {
- uint8x8_t abs_diff = vabd_u8(src, ref);
- *sad_sum = vaddw_u8(*sad_sum, abs_diff);
-}
-
-static INLINE void sad8xhx4d_neon(const uint8_t *src, int src_stride,
- const uint8_t *const ref[4], int ref_stride,
- uint32_t res[4], int h) {
- uint16x8_t sum[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
- vdupq_n_u16(0) };
-
- int i = 0;
- do {
- const uint8x8_t s = vld1_u8(src + i * src_stride);
- sad8_neon(s, vld1_u8(ref[0] + i * ref_stride), &sum[0]);
- sad8_neon(s, vld1_u8(ref[1] + i * ref_stride), &sum[1]);
- sad8_neon(s, vld1_u8(ref[2] + i * ref_stride), &sum[2]);
- sad8_neon(s, vld1_u8(ref[3] + i * ref_stride), &sum[3]);
-
- i++;
- } while (i < h);
-
- res[0] = horizontal_add_u16x8(sum[0]);
- res[1] = horizontal_add_u16x8(sum[1]);
- res[2] = horizontal_add_u16x8(sum[2]);
- res[3] = horizontal_add_u16x8(sum[3]);
-}
-
-static INLINE void sad4xhx4d_neon(const uint8_t *src, int src_stride,
- const uint8_t *const ref[4], int ref_stride,
- uint32_t res[4], int h) {
- uint16x8_t sum[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
- vdupq_n_u16(0) };
-
- int i = 0;
- do {
- uint32x2_t s, r0, r1, r2, r3;
- uint32_t s_lo, s_hi, r0_lo, r0_hi, r1_lo, r1_hi, r2_lo, r2_hi, r3_lo, r3_hi;
-
- memcpy(&s_lo, src + i * src_stride, 4);
- memcpy(&r0_lo, ref[0] + i * ref_stride, 4);
- memcpy(&r1_lo, ref[1] + i * ref_stride, 4);
- memcpy(&r2_lo, ref[2] + i * ref_stride, 4);
- memcpy(&r3_lo, ref[3] + i * ref_stride, 4);
- s = vdup_n_u32(s_lo);
- r0 = vdup_n_u32(r0_lo);
- r1 = vdup_n_u32(r1_lo);
- r2 = vdup_n_u32(r2_lo);
- r3 = vdup_n_u32(r3_lo);
-
- memcpy(&s_hi, src + (i + 1) * src_stride, 4);
- memcpy(&r0_hi, ref[0] + (i + 1) * ref_stride, 4);
- memcpy(&r1_hi, ref[1] + (i + 1) * ref_stride, 4);
- memcpy(&r2_hi, ref[2] + (i + 1) * ref_stride, 4);
- memcpy(&r3_hi, ref[3] + (i + 1) * ref_stride, 4);
- s = vset_lane_u32(s_hi, s, 1);
- r0 = vset_lane_u32(r0_hi, r0, 1);
- r1 = vset_lane_u32(r1_hi, r1, 1);
- r2 = vset_lane_u32(r2_hi, r2, 1);
- r3 = vset_lane_u32(r3_hi, r3, 1);
-
- sad8_neon(vreinterpret_u8_u32(s), vreinterpret_u8_u32(r0), &sum[0]);
- sad8_neon(vreinterpret_u8_u32(s), vreinterpret_u8_u32(r1), &sum[1]);
- sad8_neon(vreinterpret_u8_u32(s), vreinterpret_u8_u32(r2), &sum[2]);
- sad8_neon(vreinterpret_u8_u32(s), vreinterpret_u8_u32(r3), &sum[3]);
-
- i += 2;
- } while (i < h);
-
- res[0] = horizontal_add_u16x8(sum[0]);
- res[1] = horizontal_add_u16x8(sum[1]);
- res[2] = horizontal_add_u16x8(sum[2]);
- res[3] = horizontal_add_u16x8(sum[3]);
-}
-
-#define SAD_WXH_4D_NEON(w, h) \
- void aom_sad##w##x##h##x4d_neon(const uint8_t *src, int src_stride, \
- const uint8_t *const ref[4], int ref_stride, \
- uint32_t res[4]) { \
- sad##w##xhx4d_neon(src, src_stride, ref, ref_stride, res, (h)); \
- }
-
-SAD_WXH_4D_NEON(4, 4)
-SAD_WXH_4D_NEON(4, 8)
-SAD_WXH_4D_NEON(4, 16)
-SAD_WXH_4D_NEON(4, 32)
-
-SAD_WXH_4D_NEON(8, 4)
-SAD_WXH_4D_NEON(8, 8)
-SAD_WXH_4D_NEON(8, 16)
-SAD_WXH_4D_NEON(8, 32)
-
-SAD_WXH_4D_NEON(16, 4)
-SAD_WXH_4D_NEON(16, 8)
-SAD_WXH_4D_NEON(16, 16)
-SAD_WXH_4D_NEON(16, 32)
-SAD_WXH_4D_NEON(16, 64)
-
-SAD_WXH_4D_NEON(32, 8)
-SAD_WXH_4D_NEON(32, 16)
-SAD_WXH_4D_NEON(32, 32)
-SAD_WXH_4D_NEON(32, 64)
-
-SAD_WXH_4D_NEON(64, 16)
-SAD_WXH_4D_NEON(64, 32)
-SAD_WXH_4D_NEON(64, 64)
-SAD_WXH_4D_NEON(64, 128)
-
-SAD_WXH_4D_NEON(128, 64)
-SAD_WXH_4D_NEON(128, 128)
-
-#undef SAD_WXH_4D_NEON
-
-#define SAD_SKIP_WXH_4D_NEON(w, h) \
- void aom_sad_skip_##w##x##h##x4d_neon(const uint8_t *src, int src_stride, \
- const uint8_t *const ref[4], \
- int ref_stride, uint32_t res[4]) { \
- sad##w##xhx4d_neon(src, 2 * src_stride, ref, 2 * ref_stride, res, \
- ((h) >> 1)); \
- res[0] <<= 1; \
- res[1] <<= 1; \
- res[2] <<= 1; \
- res[3] <<= 1; \
- }
-
-SAD_SKIP_WXH_4D_NEON(4, 8)
-SAD_SKIP_WXH_4D_NEON(4, 16)
-SAD_SKIP_WXH_4D_NEON(4, 32)
-
-SAD_SKIP_WXH_4D_NEON(8, 8)
-SAD_SKIP_WXH_4D_NEON(8, 16)
-SAD_SKIP_WXH_4D_NEON(8, 32)
-
-SAD_SKIP_WXH_4D_NEON(16, 8)
-SAD_SKIP_WXH_4D_NEON(16, 16)
-SAD_SKIP_WXH_4D_NEON(16, 32)
-SAD_SKIP_WXH_4D_NEON(16, 64)
-
-SAD_SKIP_WXH_4D_NEON(32, 8)
-SAD_SKIP_WXH_4D_NEON(32, 16)
-SAD_SKIP_WXH_4D_NEON(32, 32)
-SAD_SKIP_WXH_4D_NEON(32, 64)
-
-SAD_SKIP_WXH_4D_NEON(64, 16)
-SAD_SKIP_WXH_4D_NEON(64, 32)
-SAD_SKIP_WXH_4D_NEON(64, 64)
-SAD_SKIP_WXH_4D_NEON(64, 128)
-
-SAD_SKIP_WXH_4D_NEON(128, 64)
-SAD_SKIP_WXH_4D_NEON(128, 128)
-
-#undef SAD_SKIP_WXH_4D_NEON
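
The 4d kernels from this file move to the new aom_dsp/arm/sadxd_neon.c below,
which also adds x3d variants. The contract all of these kernels implement is
the plain scalar reduction sketched here (an illustration, not the library's C
fallback):

    #include <stdint.h>

    // Scalar model of aom_sadWxHx4d: one SAD per candidate reference block.
    static void sad_wxh_x4d_model(const uint8_t *src, int src_stride,
                                  const uint8_t *const ref[4], int ref_stride,
                                  int w, int h, uint32_t res[4]) {
      for (int k = 0; k < 4; ++k) {
        uint32_t sad = 0;
        for (int i = 0; i < h; ++i) {
          for (int j = 0; j < w; ++j) {
            const int d = src[i * src_stride + j] - ref[k][i * ref_stride + j];
            sad += (uint32_t)(d < 0 ? -d : d);
          }
        }
        res[k] = sad;
      }
    }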
diff --git a/aom_dsp/arm/sad_neon.c b/aom_dsp/arm/sad_neon.c
index 5ba7f10..60efef8 100644
--- a/aom_dsp/arm/sad_neon.c
+++ b/aom_dsp/arm/sad_neon.c
@@ -10,9 +10,12 @@
*/
#include <arm_neon.h>
+
#include "config/aom_config.h"
#include "config/aom_dsp_rtcd.h"
+
#include "aom/aom_integer.h"
+#include "aom_dsp/arm/mem_neon.h"
#include "aom_dsp/arm/sum_neon.h"
#if defined(__ARM_FEATURE_DOTPROD)
@@ -289,24 +292,13 @@
int i = h / 2;
do {
- uint32x2_t s, r;
- uint32_t s0, s1, r0, r1;
+ uint8x8_t s = load_unaligned_u8(src_ptr, src_stride);
+ uint8x8_t r = load_unaligned_u8(ref_ptr, ref_stride);
- memcpy(&s0, src_ptr, 4);
- memcpy(&r0, ref_ptr, 4);
- s = vdup_n_u32(s0);
- r = vdup_n_u32(r0);
- src_ptr += src_stride;
- ref_ptr += ref_stride;
+ sum = vabal_u8(sum, s, r);
- memcpy(&s1, src_ptr, 4);
- memcpy(&r1, ref_ptr, 4);
- s = vset_lane_u32(s1, s, 1);
- r = vset_lane_u32(r1, r, 1);
- src_ptr += src_stride;
- ref_ptr += ref_stride;
-
- sum = vabal_u8(sum, vreinterpret_u8_u32(s), vreinterpret_u8_u32(r));
+ src_ptr += 2 * src_stride;
+ ref_ptr += 2 * ref_stride;
} while (--i != 0);
return horizontal_add_u16x8(sum);
@@ -320,25 +312,19 @@
SAD_WXH_NEON(4, 4)
SAD_WXH_NEON(4, 8)
-SAD_WXH_NEON(4, 16)
SAD_WXH_NEON(8, 4)
SAD_WXH_NEON(8, 8)
SAD_WXH_NEON(8, 16)
-SAD_WXH_NEON(8, 32)
-SAD_WXH_NEON(16, 4)
SAD_WXH_NEON(16, 8)
SAD_WXH_NEON(16, 16)
SAD_WXH_NEON(16, 32)
-SAD_WXH_NEON(16, 64)
-SAD_WXH_NEON(32, 8)
SAD_WXH_NEON(32, 16)
SAD_WXH_NEON(32, 32)
SAD_WXH_NEON(32, 64)
-SAD_WXH_NEON(64, 16)
SAD_WXH_NEON(64, 32)
SAD_WXH_NEON(64, 64)
SAD_WXH_NEON(64, 128)
@@ -346,6 +332,15 @@
SAD_WXH_NEON(128, 64)
SAD_WXH_NEON(128, 128)
+#if !CONFIG_REALTIME_ONLY
+SAD_WXH_NEON(4, 16)
+SAD_WXH_NEON(8, 32)
+SAD_WXH_NEON(16, 4)
+SAD_WXH_NEON(16, 64)
+SAD_WXH_NEON(32, 8)
+SAD_WXH_NEON(64, 16)
+#endif // !CONFIG_REALTIME_ONLY
+
#undef SAD_WXH_NEON
#define SAD_SKIP_WXH_NEON(w, h) \
@@ -356,24 +351,21 @@
sad##w##xh_neon(src, 2 * src_stride, ref, 2 * ref_stride, (h) / 2); \
}
+SAD_SKIP_WXH_NEON(4, 4)
SAD_SKIP_WXH_NEON(4, 8)
-SAD_SKIP_WXH_NEON(4, 16)
+SAD_SKIP_WXH_NEON(8, 4)
SAD_SKIP_WXH_NEON(8, 8)
SAD_SKIP_WXH_NEON(8, 16)
-SAD_SKIP_WXH_NEON(8, 32)
SAD_SKIP_WXH_NEON(16, 8)
SAD_SKIP_WXH_NEON(16, 16)
SAD_SKIP_WXH_NEON(16, 32)
-SAD_SKIP_WXH_NEON(16, 64)
-SAD_SKIP_WXH_NEON(32, 8)
SAD_SKIP_WXH_NEON(32, 16)
SAD_SKIP_WXH_NEON(32, 32)
SAD_SKIP_WXH_NEON(32, 64)
-SAD_SKIP_WXH_NEON(64, 16)
SAD_SKIP_WXH_NEON(64, 32)
SAD_SKIP_WXH_NEON(64, 64)
SAD_SKIP_WXH_NEON(64, 128)
@@ -381,6 +373,15 @@
SAD_SKIP_WXH_NEON(128, 64)
SAD_SKIP_WXH_NEON(128, 128)
+#if !CONFIG_REALTIME_ONLY
+SAD_SKIP_WXH_NEON(4, 16)
+SAD_SKIP_WXH_NEON(8, 32)
+SAD_SKIP_WXH_NEON(16, 4)
+SAD_SKIP_WXH_NEON(16, 64)
+SAD_SKIP_WXH_NEON(32, 8)
+SAD_SKIP_WXH_NEON(64, 16)
+#endif // !CONFIG_REALTIME_ONLY
+
#undef SAD_SKIP_WXH_NEON
#if defined(__ARM_FEATURE_DOTPROD)
@@ -732,28 +733,15 @@
int i = h / 2;
do {
- uint32x2_t s, r;
- uint32_t s0, s1, r0, r1;
- uint8x8_t p, avg;
+ uint8x8_t s = load_unaligned_u8(src_ptr, src_stride);
+ uint8x8_t r = load_unaligned_u8(ref_ptr, ref_stride);
+ uint8x8_t p = vld1_u8(second_pred);
- memcpy(&s0, src_ptr, 4);
- memcpy(&r0, ref_ptr, 4);
- s = vdup_n_u32(s0);
- r = vdup_n_u32(r0);
- src_ptr += src_stride;
- ref_ptr += ref_stride;
+ uint8x8_t avg = vrhadd_u8(r, p);
+ sum = vabal_u8(sum, s, avg);
- memcpy(&s1, src_ptr, 4);
- memcpy(&r1, ref_ptr, 4);
- s = vset_lane_u32(s1, s, 1);
- r = vset_lane_u32(r1, r, 1);
- src_ptr += src_stride;
- ref_ptr += ref_stride;
-
- p = vld1_u8(second_pred);
- avg = vrhadd_u8(vreinterpret_u8_u32(r), p);
-
- sum = vabal_u8(sum, vreinterpret_u8_u32(s), avg);
+ src_ptr += 2 * src_stride;
+ ref_ptr += 2 * ref_stride;
second_pred += 8;
} while (--i != 0);
@@ -770,25 +758,19 @@
SAD_WXH_AVG_NEON(4, 4)
SAD_WXH_AVG_NEON(4, 8)
-SAD_WXH_AVG_NEON(4, 16)
SAD_WXH_AVG_NEON(8, 4)
SAD_WXH_AVG_NEON(8, 8)
SAD_WXH_AVG_NEON(8, 16)
-SAD_WXH_AVG_NEON(8, 32)
-SAD_WXH_AVG_NEON(16, 4)
SAD_WXH_AVG_NEON(16, 8)
SAD_WXH_AVG_NEON(16, 16)
SAD_WXH_AVG_NEON(16, 32)
-SAD_WXH_AVG_NEON(16, 64)
-SAD_WXH_AVG_NEON(32, 8)
SAD_WXH_AVG_NEON(32, 16)
SAD_WXH_AVG_NEON(32, 32)
SAD_WXH_AVG_NEON(32, 64)
-SAD_WXH_AVG_NEON(64, 16)
SAD_WXH_AVG_NEON(64, 32)
SAD_WXH_AVG_NEON(64, 64)
SAD_WXH_AVG_NEON(64, 128)
@@ -796,4 +778,13 @@
SAD_WXH_AVG_NEON(128, 64)
SAD_WXH_AVG_NEON(128, 128)
+#if !CONFIG_REALTIME_ONLY
+SAD_WXH_AVG_NEON(4, 16)
+SAD_WXH_AVG_NEON(8, 32)
+SAD_WXH_AVG_NEON(16, 4)
+SAD_WXH_AVG_NEON(16, 64)
+SAD_WXH_AVG_NEON(32, 8)
+SAD_WXH_AVG_NEON(64, 16)
+#endif // !CONFIG_REALTIME_ONLY
+
#undef SAD_WXH_AVG_NEON
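
The 4xh paths above now rely on load_unaligned_u8 from aom_dsp/arm/mem_neon.h,
which packs two 4-byte rows into a single 8-byte vector. A sketch of its
behaviour, mirroring the memcpy/vset_lane sequence it replaces (hypothetical
name for the model function):

    #include <arm_neon.h>
    #include <stdint.h>
    #include <string.h>

    // Gather 4 bytes from each of two consecutive rows into one uint8x8_t.
    static inline uint8x8_t load_unaligned_u8_model(const uint8_t *buf,
                                                    int stride) {
      uint32_t lo, hi;
      memcpy(&lo, buf, 4);           // row 0 -> lane 0
      memcpy(&hi, buf + stride, 4);  // row 1 -> lane 1
      uint32x2_t v = vdup_n_u32(lo);
      v = vset_lane_u32(hi, v, 1);
      return vreinterpret_u8_u32(v);
    }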
diff --git a/aom_dsp/arm/sadxd_neon.c b/aom_dsp/arm/sadxd_neon.c
new file mode 100644
index 0000000..81803b1
--- /dev/null
+++ b/aom_dsp/arm/sadxd_neon.c
@@ -0,0 +1,688 @@
+/*
+ * Copyright (c) 2016, Alliance for Open Media. All rights reserved
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#include <arm_neon.h>
+#include <assert.h>
+
+#include "config/aom_config.h"
+#include "config/aom_dsp_rtcd.h"
+
+#include "aom/aom_integer.h"
+#include "aom_dsp/arm/mem_neon.h"
+#include "aom_dsp/arm/sum_neon.h"
+
+#if defined(__ARM_FEATURE_DOTPROD)
+
+static INLINE void sad16_neon(uint8x16_t src, uint8x16_t ref,
+ uint32x4_t *const sad_sum) {
+ uint8x16_t abs_diff = vabdq_u8(src, ref);
+ *sad_sum = vdotq_u32(*sad_sum, abs_diff, vdupq_n_u8(1));
+}
+
+static INLINE void sadwxhx3d_large_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4],
+ int ref_stride, uint32_t res[4], int w,
+ int h) {
+ uint32x4_t sum_lo[3] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0) };
+ uint32x4_t sum_hi[3] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0) };
+
+ int ref_offset = 0;
+ int i = h;
+ do {
+ int j = 0;
+ do {
+ const uint8x16_t s0 = vld1q_u8(src + j);
+ sad16_neon(s0, vld1q_u8(ref[0] + ref_offset + j), &sum_lo[0]);
+ sad16_neon(s0, vld1q_u8(ref[1] + ref_offset + j), &sum_lo[1]);
+ sad16_neon(s0, vld1q_u8(ref[2] + ref_offset + j), &sum_lo[2]);
+
+ const uint8x16_t s1 = vld1q_u8(src + j + 16);
+ sad16_neon(s1, vld1q_u8(ref[0] + ref_offset + j + 16), &sum_hi[0]);
+ sad16_neon(s1, vld1q_u8(ref[1] + ref_offset + j + 16), &sum_hi[1]);
+ sad16_neon(s1, vld1q_u8(ref[2] + ref_offset + j + 16), &sum_hi[2]);
+
+ j += 32;
+ } while (j < w);
+
+ src += src_stride;
+ ref_offset += ref_stride;
+ } while (--i != 0);
+
+ res[0] = horizontal_add_u32x4(vaddq_u32(sum_lo[0], sum_hi[0]));
+ res[1] = horizontal_add_u32x4(vaddq_u32(sum_lo[1], sum_hi[1]));
+ res[2] = horizontal_add_u32x4(vaddq_u32(sum_lo[2], sum_hi[2]));
+}
+
+static INLINE void sad128xhx3d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4], int ref_stride,
+ uint32_t res[4], int h) {
+ sadwxhx3d_large_neon(src, src_stride, ref, ref_stride, res, 128, h);
+}
+
+static INLINE void sad64xhx3d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4], int ref_stride,
+ uint32_t res[4], int h) {
+ sadwxhx3d_large_neon(src, src_stride, ref, ref_stride, res, 64, h);
+}
+
+static INLINE void sad32xhx3d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4], int ref_stride,
+ uint32_t res[4], int h) {
+ sadwxhx3d_large_neon(src, src_stride, ref, ref_stride, res, 32, h);
+}
+
+static INLINE void sad16xhx3d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4], int ref_stride,
+ uint32_t res[4], int h) {
+ uint32x4_t sum[3] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0) };
+
+ int ref_offset = 0;
+ int i = h;
+ do {
+ const uint8x16_t s = vld1q_u8(src);
+ sad16_neon(s, vld1q_u8(ref[0] + ref_offset), &sum[0]);
+ sad16_neon(s, vld1q_u8(ref[1] + ref_offset), &sum[1]);
+ sad16_neon(s, vld1q_u8(ref[2] + ref_offset), &sum[2]);
+
+ src += src_stride;
+ ref_offset += ref_stride;
+ } while (--i != 0);
+
+ res[0] = horizontal_add_u32x4(sum[0]);
+ res[1] = horizontal_add_u32x4(sum[1]);
+ res[2] = horizontal_add_u32x4(sum[2]);
+}
+
+#else // !(defined(__ARM_FEATURE_DOTPROD))
+
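+// Without udot, widen by pairwise addition: vpadalq_u8 sums adjacent pairs of
+// 8-bit absolute differences into eight 16-bit accumulator lanes.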
+static INLINE void sad16_neon(uint8x16_t src, uint8x16_t ref,
+ uint16x8_t *const sad_sum) {
+ uint8x16_t abs_diff = vabdq_u8(src, ref);
+ *sad_sum = vpadalq_u8(*sad_sum, abs_diff);
+}
+
+static INLINE void sadwxhx3d_large_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[3],
+ int ref_stride, uint32_t res[3], int w,
+ int h, int h_overflow) {
+ uint32x4_t sum[3] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0) };
+ int h_limit = h > h_overflow ? h_overflow : h;
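+  // The 16-bit lane accumulators hold only a limited number of rows of
+  // absolute differences before overflowing, so widen the partial sums into
+  // the 32-bit accumulators every h_overflow rows.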
+
+ int ref_offset = 0;
+ int i = 0;
+ do {
+ uint16x8_t sum_lo[3] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0) };
+ uint16x8_t sum_hi[3] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0) };
+
+ do {
+ int j = 0;
+ do {
+ const uint8x16_t s0 = vld1q_u8(src + j);
+ sad16_neon(s0, vld1q_u8(ref[0] + ref_offset + j), &sum_lo[0]);
+ sad16_neon(s0, vld1q_u8(ref[1] + ref_offset + j), &sum_lo[1]);
+ sad16_neon(s0, vld1q_u8(ref[2] + ref_offset + j), &sum_lo[2]);
+
+ const uint8x16_t s1 = vld1q_u8(src + j + 16);
+ sad16_neon(s1, vld1q_u8(ref[0] + ref_offset + j + 16), &sum_hi[0]);
+ sad16_neon(s1, vld1q_u8(ref[1] + ref_offset + j + 16), &sum_hi[1]);
+ sad16_neon(s1, vld1q_u8(ref[2] + ref_offset + j + 16), &sum_hi[2]);
+
+ j += 32;
+ } while (j < w);
+
+ src += src_stride;
+ ref_offset += ref_stride;
+ } while (++i < h_limit);
+
+ sum[0] = vpadalq_u16(sum[0], sum_lo[0]);
+ sum[0] = vpadalq_u16(sum[0], sum_hi[0]);
+ sum[1] = vpadalq_u16(sum[1], sum_lo[1]);
+ sum[1] = vpadalq_u16(sum[1], sum_hi[1]);
+ sum[2] = vpadalq_u16(sum[2], sum_lo[2]);
+ sum[2] = vpadalq_u16(sum[2], sum_hi[2]);
+
+ h_limit += h_overflow;
+ } while (i < h);
+
+ res[0] = horizontal_add_u32x4(sum[0]);
+ res[1] = horizontal_add_u32x4(sum[1]);
+ res[2] = horizontal_add_u32x4(sum[2]);
+}
+
+static INLINE void sad128xhx3d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[3], int ref_stride,
+ uint32_t res[3], int h) {
+ sadwxhx3d_large_neon(src, src_stride, ref, ref_stride, res, 128, h, 32);
+}
+
+static INLINE void sad64xhx3d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[3], int ref_stride,
+ uint32_t res[3], int h) {
+ sadwxhx3d_large_neon(src, src_stride, ref, ref_stride, res, 64, h, 64);
+}
+
+static INLINE void sad32xhx3d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[3], int ref_stride,
+ uint32_t res[3], int h) {
+ uint16x8_t sum_lo[3] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0) };
+ uint16x8_t sum_hi[3] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0) };
+
+ int ref_offset = 0;
+ int i = h;
+ do {
+ const uint8x16_t s0 = vld1q_u8(src);
+ sad16_neon(s0, vld1q_u8(ref[0] + ref_offset), &sum_lo[0]);
+ sad16_neon(s0, vld1q_u8(ref[1] + ref_offset), &sum_lo[1]);
+ sad16_neon(s0, vld1q_u8(ref[2] + ref_offset), &sum_lo[2]);
+
+ const uint8x16_t s1 = vld1q_u8(src + 16);
+ sad16_neon(s1, vld1q_u8(ref[0] + ref_offset + 16), &sum_hi[0]);
+ sad16_neon(s1, vld1q_u8(ref[1] + ref_offset + 16), &sum_hi[1]);
+ sad16_neon(s1, vld1q_u8(ref[2] + ref_offset + 16), &sum_hi[2]);
+
+ src += src_stride;
+ ref_offset += ref_stride;
+ } while (--i != 0);
+
+ res[0] = horizontal_long_add_u16x8(sum_lo[0], sum_hi[0]);
+ res[1] = horizontal_long_add_u16x8(sum_lo[1], sum_hi[1]);
+ res[2] = horizontal_long_add_u16x8(sum_lo[2], sum_hi[2]);
+}
+
+static INLINE void sad16xhx3d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[3], int ref_stride,
+ uint32_t res[3], int h) {
+ uint16x8_t sum[3] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0) };
+
+ int ref_offset = 0;
+ int i = h;
+ do {
+ const uint8x16_t s = vld1q_u8(src);
+ sad16_neon(s, vld1q_u8(ref[0] + ref_offset), &sum[0]);
+ sad16_neon(s, vld1q_u8(ref[1] + ref_offset), &sum[1]);
+ sad16_neon(s, vld1q_u8(ref[2] + ref_offset), &sum[2]);
+
+ src += src_stride;
+ ref_offset += ref_stride;
+ } while (--i != 0);
+
+ res[0] = horizontal_add_u16x8(sum[0]);
+ res[1] = horizontal_add_u16x8(sum[1]);
+ res[2] = horizontal_add_u16x8(sum[2]);
+}
+
+#endif // defined(__ARM_FEATURE_DOTPROD)
+
+static INLINE void sad8xhx3d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[3], int ref_stride,
+ uint32_t res[3], int h) {
+ uint16x8_t sum[3];
+
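+  // Peel the first row: vabdl_u8 initializes the accumulators from the first
+  // absolute differences, avoiding an explicit zeroing pass.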
+ uint8x8_t s = vld1_u8(src);
+ sum[0] = vabdl_u8(s, vld1_u8(ref[0]));
+ sum[1] = vabdl_u8(s, vld1_u8(ref[1]));
+ sum[2] = vabdl_u8(s, vld1_u8(ref[2]));
+
+ src += src_stride;
+ int ref_offset = ref_stride;
+ int i = h - 1;
+ do {
+ s = vld1_u8(src);
+ sum[0] = vabal_u8(sum[0], s, vld1_u8(ref[0] + ref_offset));
+ sum[1] = vabal_u8(sum[1], s, vld1_u8(ref[1] + ref_offset));
+ sum[2] = vabal_u8(sum[2], s, vld1_u8(ref[2] + ref_offset));
+
+ src += src_stride;
+ ref_offset += ref_stride;
+ } while (--i != 0);
+
+ res[0] = horizontal_add_u16x8(sum[0]);
+ res[1] = horizontal_add_u16x8(sum[1]);
+ res[2] = horizontal_add_u16x8(sum[2]);
+}
+
+static INLINE void sad4xhx3d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[3], int ref_stride,
+ uint32_t res[3], int h) {
+ assert(h % 2 == 0);
+ uint16x8_t sum[3];
+
+ uint8x8_t s = load_unaligned_u8(src, src_stride);
+ uint8x8_t r0 = load_unaligned_u8(ref[0], ref_stride);
+ uint8x8_t r1 = load_unaligned_u8(ref[1], ref_stride);
+ uint8x8_t r2 = load_unaligned_u8(ref[2], ref_stride);
+
+ sum[0] = vabdl_u8(s, r0);
+ sum[1] = vabdl_u8(s, r1);
+ sum[2] = vabdl_u8(s, r2);
+
+ src += 2 * src_stride;
+ int ref_offset = 2 * ref_stride;
+ int i = (h / 2) - 1;
+ do {
+ s = load_unaligned_u8(src, src_stride);
+ r0 = load_unaligned_u8(ref[0] + ref_offset, ref_stride);
+ r1 = load_unaligned_u8(ref[1] + ref_offset, ref_stride);
+ r2 = load_unaligned_u8(ref[2] + ref_offset, ref_stride);
+
+ sum[0] = vabal_u8(sum[0], s, r0);
+ sum[1] = vabal_u8(sum[1], s, r1);
+ sum[2] = vabal_u8(sum[2], s, r2);
+
+ src += 2 * src_stride;
+ ref_offset += 2 * ref_stride;
+ } while (--i != 0);
+
+ res[0] = horizontal_add_u16x8(sum[0]);
+ res[1] = horizontal_add_u16x8(sum[1]);
+ res[2] = horizontal_add_u16x8(sum[2]);
+}
+
+#define SAD_WXH_3D_NEON(w, h) \
+ void aom_sad##w##x##h##x3d_neon(const uint8_t *src, int src_stride, \
+ const uint8_t *const ref[4], int ref_stride, \
+ uint32_t res[4]) { \
+ sad##w##xhx3d_neon(src, src_stride, ref, ref_stride, res, (h)); \
+ }
+
+SAD_WXH_3D_NEON(4, 4)
+SAD_WXH_3D_NEON(4, 8)
+
+SAD_WXH_3D_NEON(8, 4)
+SAD_WXH_3D_NEON(8, 8)
+SAD_WXH_3D_NEON(8, 16)
+
+SAD_WXH_3D_NEON(16, 8)
+SAD_WXH_3D_NEON(16, 16)
+SAD_WXH_3D_NEON(16, 32)
+
+SAD_WXH_3D_NEON(32, 16)
+SAD_WXH_3D_NEON(32, 32)
+SAD_WXH_3D_NEON(32, 64)
+
+SAD_WXH_3D_NEON(64, 32)
+SAD_WXH_3D_NEON(64, 64)
+SAD_WXH_3D_NEON(64, 128)
+
+SAD_WXH_3D_NEON(128, 64)
+SAD_WXH_3D_NEON(128, 128)
+
+#if !CONFIG_REALTIME_ONLY
+SAD_WXH_3D_NEON(4, 16)
+SAD_WXH_3D_NEON(8, 32)
+SAD_WXH_3D_NEON(16, 4)
+SAD_WXH_3D_NEON(16, 64)
+SAD_WXH_3D_NEON(32, 8)
+SAD_WXH_3D_NEON(64, 16)
+#endif // !CONFIG_REALTIME_ONLY
+
+#undef SAD_WXH_3D_NEON
+
+#if defined(__ARM_FEATURE_DOTPROD)
+
+static INLINE void sadwxhx4d_large_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4],
+ int ref_stride, uint32_t res[4], int w,
+ int h) {
+ uint32x4_t sum_lo[4] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0),
+ vdupq_n_u32(0) };
+ uint32x4_t sum_hi[4] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0),
+ vdupq_n_u32(0) };
+ uint32x4_t sum[4];
+
+ int ref_offset = 0;
+ int i = h;
+ do {
+ int j = 0;
+ do {
+ const uint8x16_t s0 = vld1q_u8(src + j);
+ sad16_neon(s0, vld1q_u8(ref[0] + ref_offset + j), &sum_lo[0]);
+ sad16_neon(s0, vld1q_u8(ref[1] + ref_offset + j), &sum_lo[1]);
+ sad16_neon(s0, vld1q_u8(ref[2] + ref_offset + j), &sum_lo[2]);
+ sad16_neon(s0, vld1q_u8(ref[3] + ref_offset + j), &sum_lo[3]);
+
+ const uint8x16_t s1 = vld1q_u8(src + j + 16);
+ sad16_neon(s1, vld1q_u8(ref[0] + ref_offset + j + 16), &sum_hi[0]);
+ sad16_neon(s1, vld1q_u8(ref[1] + ref_offset + j + 16), &sum_hi[1]);
+ sad16_neon(s1, vld1q_u8(ref[2] + ref_offset + j + 16), &sum_hi[2]);
+ sad16_neon(s1, vld1q_u8(ref[3] + ref_offset + j + 16), &sum_hi[3]);
+
+ j += 32;
+ } while (j < w);
+
+ src += src_stride;
+ ref_offset += ref_stride;
+ } while (--i != 0);
+
+ sum[0] = vaddq_u32(sum_lo[0], sum_hi[0]);
+ sum[1] = vaddq_u32(sum_lo[1], sum_hi[1]);
+ sum[2] = vaddq_u32(sum_lo[2], sum_hi[2]);
+ sum[3] = vaddq_u32(sum_lo[3], sum_hi[3]);
+
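+  // Reduce each accumulator to a scalar and pack the four results so that a
+  // single vst1q_u32 stores all four SADs.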
+ vst1q_u32(res, horizontal_add_4d_u32x4(sum));
+}
+
+static INLINE void sad128xhx4d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4], int ref_stride,
+ uint32_t res[4], int h) {
+ sadwxhx4d_large_neon(src, src_stride, ref, ref_stride, res, 128, h);
+}
+
+static INLINE void sad64xhx4d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4], int ref_stride,
+ uint32_t res[4], int h) {
+ sadwxhx4d_large_neon(src, src_stride, ref, ref_stride, res, 64, h);
+}
+
+static INLINE void sad32xhx4d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4], int ref_stride,
+ uint32_t res[4], int h) {
+ sadwxhx4d_large_neon(src, src_stride, ref, ref_stride, res, 32, h);
+}
+
+static INLINE void sad16xhx4d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4], int ref_stride,
+ uint32_t res[4], int h) {
+ uint32x4_t sum[4] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0),
+ vdupq_n_u32(0) };
+
+ int ref_offset = 0;
+ int i = h;
+ do {
+ const uint8x16_t s = vld1q_u8(src);
+ sad16_neon(s, vld1q_u8(ref[0] + ref_offset), &sum[0]);
+ sad16_neon(s, vld1q_u8(ref[1] + ref_offset), &sum[1]);
+ sad16_neon(s, vld1q_u8(ref[2] + ref_offset), &sum[2]);
+ sad16_neon(s, vld1q_u8(ref[3] + ref_offset), &sum[3]);
+
+ src += src_stride;
+ ref_offset += ref_stride;
+ } while (--i != 0);
+
+ vst1q_u32(res, horizontal_add_4d_u32x4(sum));
+}
+
+#else // !(defined(__ARM_FEATURE_DOTPROD))
+
+static INLINE void sadwxhx4d_large_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4],
+ int ref_stride, uint32_t res[4], int w,
+ int h, int h_overflow) {
+ uint32x4_t sum[4] = { vdupq_n_u32(0), vdupq_n_u32(0), vdupq_n_u32(0),
+ vdupq_n_u32(0) };
+ int h_limit = h > h_overflow ? h_overflow : h;
+
+ int ref_offset = 0;
+ int i = 0;
+ do {
+ uint16x8_t sum_lo[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0) };
+ uint16x8_t sum_hi[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0) };
+
+ do {
+ int j = 0;
+ do {
+ const uint8x16_t s0 = vld1q_u8(src + j);
+ sad16_neon(s0, vld1q_u8(ref[0] + ref_offset + j), &sum_lo[0]);
+ sad16_neon(s0, vld1q_u8(ref[1] + ref_offset + j), &sum_lo[1]);
+ sad16_neon(s0, vld1q_u8(ref[2] + ref_offset + j), &sum_lo[2]);
+ sad16_neon(s0, vld1q_u8(ref[3] + ref_offset + j), &sum_lo[3]);
+
+ const uint8x16_t s1 = vld1q_u8(src + j + 16);
+ sad16_neon(s1, vld1q_u8(ref[0] + ref_offset + j + 16), &sum_hi[0]);
+ sad16_neon(s1, vld1q_u8(ref[1] + ref_offset + j + 16), &sum_hi[1]);
+ sad16_neon(s1, vld1q_u8(ref[2] + ref_offset + j + 16), &sum_hi[2]);
+ sad16_neon(s1, vld1q_u8(ref[3] + ref_offset + j + 16), &sum_hi[3]);
+
+ j += 32;
+ } while (j < w);
+
+ src += src_stride;
+ ref_offset += ref_stride;
+ } while (++i < h_limit);
+
+ sum[0] = vpadalq_u16(sum[0], sum_lo[0]);
+ sum[0] = vpadalq_u16(sum[0], sum_hi[0]);
+ sum[1] = vpadalq_u16(sum[1], sum_lo[1]);
+ sum[1] = vpadalq_u16(sum[1], sum_hi[1]);
+ sum[2] = vpadalq_u16(sum[2], sum_lo[2]);
+ sum[2] = vpadalq_u16(sum[2], sum_hi[2]);
+ sum[3] = vpadalq_u16(sum[3], sum_lo[3]);
+ sum[3] = vpadalq_u16(sum[3], sum_hi[3]);
+
+ h_limit += h_overflow;
+ } while (i < h);
+
+ vst1q_u32(res, horizontal_add_4d_u32x4(sum));
+}
+
+static INLINE void sad128xhx4d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4], int ref_stride,
+ uint32_t res[4], int h) {
+ sadwxhx4d_large_neon(src, src_stride, ref, ref_stride, res, 128, h, 32);
+}
+
+static INLINE void sad64xhx4d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4], int ref_stride,
+ uint32_t res[4], int h) {
+ sadwxhx4d_large_neon(src, src_stride, ref, ref_stride, res, 64, h, 64);
+}
+
+static INLINE void sad32xhx4d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4], int ref_stride,
+ uint32_t res[4], int h) {
+ uint16x8_t sum_lo[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0) };
+ uint16x8_t sum_hi[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0) };
+
+ int ref_offset = 0;
+ int i = h;
+ do {
+ const uint8x16_t s0 = vld1q_u8(src);
+ sad16_neon(s0, vld1q_u8(ref[0] + ref_offset), &sum_lo[0]);
+ sad16_neon(s0, vld1q_u8(ref[1] + ref_offset), &sum_lo[1]);
+ sad16_neon(s0, vld1q_u8(ref[2] + ref_offset), &sum_lo[2]);
+ sad16_neon(s0, vld1q_u8(ref[3] + ref_offset), &sum_lo[3]);
+
+ const uint8x16_t s1 = vld1q_u8(src + 16);
+ sad16_neon(s1, vld1q_u8(ref[0] + ref_offset + 16), &sum_hi[0]);
+ sad16_neon(s1, vld1q_u8(ref[1] + ref_offset + 16), &sum_hi[1]);
+ sad16_neon(s1, vld1q_u8(ref[2] + ref_offset + 16), &sum_hi[2]);
+ sad16_neon(s1, vld1q_u8(ref[3] + ref_offset + 16), &sum_hi[3]);
+
+ src += src_stride;
+ ref_offset += ref_stride;
+ } while (--i != 0);
+
+ vst1q_u32(res, horizontal_long_add_4d_u16x8(sum_lo, sum_hi));
+}
+
+static INLINE void sad16xhx4d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4], int ref_stride,
+ uint32_t res[4], int h) {
+ uint16x8_t sum_u16[4] = { vdupq_n_u16(0), vdupq_n_u16(0), vdupq_n_u16(0),
+ vdupq_n_u16(0) };
+ uint32x4_t sum_u32[4];
+
+ int ref_offset = 0;
+ int i = h;
+ do {
+ const uint8x16_t s = vld1q_u8(src);
+ sad16_neon(s, vld1q_u8(ref[0] + ref_offset), &sum_u16[0]);
+ sad16_neon(s, vld1q_u8(ref[1] + ref_offset), &sum_u16[1]);
+ sad16_neon(s, vld1q_u8(ref[2] + ref_offset), &sum_u16[2]);
+ sad16_neon(s, vld1q_u8(ref[3] + ref_offset), &sum_u16[3]);
+
+ src += src_stride;
+ ref_offset += ref_stride;
+ } while (--i != 0);
+
+ sum_u32[0] = vpaddlq_u16(sum_u16[0]);
+ sum_u32[1] = vpaddlq_u16(sum_u16[1]);
+ sum_u32[2] = vpaddlq_u16(sum_u16[2]);
+ sum_u32[3] = vpaddlq_u16(sum_u16[3]);
+
+ vst1q_u32(res, horizontal_add_4d_u32x4(sum_u32));
+}
+
+#endif // defined(__ARM_FEATURE_DOTPROD)
+
+static INLINE void sad8xhx4d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4], int ref_stride,
+ uint32_t res[4], int h) {
+ uint16x8_t sum[4];
+
+ uint8x8_t s = vld1_u8(src);
+ sum[0] = vabdl_u8(s, vld1_u8(ref[0]));
+ sum[1] = vabdl_u8(s, vld1_u8(ref[1]));
+ sum[2] = vabdl_u8(s, vld1_u8(ref[2]));
+ sum[3] = vabdl_u8(s, vld1_u8(ref[3]));
+
+ src += src_stride;
+ int ref_offset = ref_stride;
+ int i = h - 1;
+ do {
+ s = vld1_u8(src);
+ sum[0] = vabal_u8(sum[0], s, vld1_u8(ref[0] + ref_offset));
+ sum[1] = vabal_u8(sum[1], s, vld1_u8(ref[1] + ref_offset));
+ sum[2] = vabal_u8(sum[2], s, vld1_u8(ref[2] + ref_offset));
+ sum[3] = vabal_u8(sum[3], s, vld1_u8(ref[3] + ref_offset));
+
+ src += src_stride;
+ ref_offset += ref_stride;
+ } while (--i != 0);
+
+ vst1q_u32(res, horizontal_add_4d_u16x8(sum));
+}
+
+static INLINE void sad4xhx4d_neon(const uint8_t *src, int src_stride,
+ const uint8_t *const ref[4], int ref_stride,
+ uint32_t res[4], int h) {
+ uint16x8_t sum[4];
+
+ uint8x8_t s = load_unaligned_u8(src, src_stride);
+ uint8x8_t r0 = load_unaligned_u8(ref[0], ref_stride);
+ uint8x8_t r1 = load_unaligned_u8(ref[1], ref_stride);
+ uint8x8_t r2 = load_unaligned_u8(ref[2], ref_stride);
+ uint8x8_t r3 = load_unaligned_u8(ref[3], ref_stride);
+
+ sum[0] = vabdl_u8(s, r0);
+ sum[1] = vabdl_u8(s, r1);
+ sum[2] = vabdl_u8(s, r2);
+ sum[3] = vabdl_u8(s, r3);
+
+ src += 2 * src_stride;
+ int ref_offset = 2 * ref_stride;
+ int i = h / 2;
+ while (--i != 0) {
+ s = load_unaligned_u8(src, src_stride);
+ r0 = load_unaligned_u8(ref[0] + ref_offset, ref_stride);
+ r1 = load_unaligned_u8(ref[1] + ref_offset, ref_stride);
+ r2 = load_unaligned_u8(ref[2] + ref_offset, ref_stride);
+ r3 = load_unaligned_u8(ref[3] + ref_offset, ref_stride);
+
+ sum[0] = vabal_u8(sum[0], s, r0);
+ sum[1] = vabal_u8(sum[1], s, r1);
+ sum[2] = vabal_u8(sum[2], s, r2);
+ sum[3] = vabal_u8(sum[3], s, r3);
+
+ src += 2 * src_stride;
+ ref_offset += 2 * ref_stride;
+ }
+
+ vst1q_u32(res, horizontal_add_4d_u16x8(sum));
+}
+
+#define SAD_WXH_4D_NEON(w, h) \
+ void aom_sad##w##x##h##x4d_neon(const uint8_t *src, int src_stride, \
+ const uint8_t *const ref[4], int ref_stride, \
+ uint32_t res[4]) { \
+ sad##w##xhx4d_neon(src, src_stride, ref, ref_stride, res, (h)); \
+ }
+
+SAD_WXH_4D_NEON(4, 4)
+SAD_WXH_4D_NEON(4, 8)
+
+SAD_WXH_4D_NEON(8, 4)
+SAD_WXH_4D_NEON(8, 8)
+SAD_WXH_4D_NEON(8, 16)
+
+SAD_WXH_4D_NEON(16, 8)
+SAD_WXH_4D_NEON(16, 16)
+SAD_WXH_4D_NEON(16, 32)
+
+SAD_WXH_4D_NEON(32, 16)
+SAD_WXH_4D_NEON(32, 32)
+SAD_WXH_4D_NEON(32, 64)
+
+SAD_WXH_4D_NEON(64, 32)
+SAD_WXH_4D_NEON(64, 64)
+SAD_WXH_4D_NEON(64, 128)
+
+SAD_WXH_4D_NEON(128, 64)
+SAD_WXH_4D_NEON(128, 128)
+
+#if !CONFIG_REALTIME_ONLY
+SAD_WXH_4D_NEON(4, 16)
+SAD_WXH_4D_NEON(8, 32)
+SAD_WXH_4D_NEON(16, 4)
+SAD_WXH_4D_NEON(16, 64)
+SAD_WXH_4D_NEON(32, 8)
+SAD_WXH_4D_NEON(64, 16)
+#endif // !CONFIG_REALTIME_ONLY
+
+#undef SAD_WXH_4D_NEON
+
+#define SAD_SKIP_WXH_4D_NEON(w, h) \
+ void aom_sad_skip_##w##x##h##x4d_neon(const uint8_t *src, int src_stride, \
+ const uint8_t *const ref[4], \
+ int ref_stride, uint32_t res[4]) { \
+ sad##w##xhx4d_neon(src, 2 * src_stride, ref, 2 * ref_stride, res, \
+ ((h) >> 1)); \
+ res[0] <<= 1; \
+ res[1] <<= 1; \
+ res[2] <<= 1; \
+ res[3] <<= 1; \
+ }
+
+SAD_SKIP_WXH_4D_NEON(4, 4)
+SAD_SKIP_WXH_4D_NEON(4, 8)
+
+SAD_SKIP_WXH_4D_NEON(8, 4)
+SAD_SKIP_WXH_4D_NEON(8, 8)
+SAD_SKIP_WXH_4D_NEON(8, 16)
+
+SAD_SKIP_WXH_4D_NEON(16, 8)
+SAD_SKIP_WXH_4D_NEON(16, 16)
+SAD_SKIP_WXH_4D_NEON(16, 32)
+
+SAD_SKIP_WXH_4D_NEON(32, 16)
+SAD_SKIP_WXH_4D_NEON(32, 32)
+SAD_SKIP_WXH_4D_NEON(32, 64)
+
+SAD_SKIP_WXH_4D_NEON(64, 32)
+SAD_SKIP_WXH_4D_NEON(64, 64)
+SAD_SKIP_WXH_4D_NEON(64, 128)
+
+SAD_SKIP_WXH_4D_NEON(128, 64)
+SAD_SKIP_WXH_4D_NEON(128, 128)
+
+#if !CONFIG_REALTIME_ONLY
+SAD_SKIP_WXH_4D_NEON(4, 16)
+SAD_SKIP_WXH_4D_NEON(8, 32)
+SAD_SKIP_WXH_4D_NEON(16, 4)
+SAD_SKIP_WXH_4D_NEON(16, 64)
+SAD_SKIP_WXH_4D_NEON(32, 8)
+SAD_SKIP_WXH_4D_NEON(64, 16)
+#endif // !CONFIG_REALTIME_ONLY
+
+#undef SAD_SKIP_WXH_4D_NEON
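
The skip variants trade accuracy for speed by sampling every other row and
doubling the result, as the macro above shows. A scalar model (illustrative
only):

    #include <stdint.h>

    // Approximate a full WxH SAD from h/2 rows at doubled strides.
    static uint32_t sad_skip_model(const uint8_t *src, int src_stride,
                                   const uint8_t *ref, int ref_stride, int w,
                                   int h) {
      uint32_t sad = 0;
      for (int i = 0; i < h; i += 2) {
        for (int j = 0; j < w; ++j) {
          const int d = src[i * src_stride + j] - ref[i * ref_stride + j];
          sad += (uint32_t)(d < 0 ? -d : d);
        }
      }
      return 2 * sad;  // matches the res[k] <<= 1 in the macro
    }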
diff --git a/aom_dsp/arm/sse_neon.c b/aom_dsp/arm/sse_neon.c
index 2c988dc..d1d3d93 100644
--- a/aom_dsp/arm/sse_neon.c
+++ b/aom_dsp/arm/sse_neon.c
@@ -348,7 +348,8 @@
int64_t aom_highbd_sse_neon(const uint8_t *a8, int a_stride, const uint8_t *b8,
int b_stride, int width, int height) {
- const uint16x8_t q0 = { 0, 1, 2, 3, 4, 5, 6, 7 };
+ static const uint16_t k01234567[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
+ const uint16x8_t q0 = vld1q_u16(k01234567);
int64_t sse = 0;
uint16_t *a = CONVERT_TO_SHORTPTR(a8);
uint16_t *b = CONVERT_TO_SHORTPTR(b8);
diff --git a/aom_dsp/arm/subpel_variance_neon.c b/aom_dsp/arm/subpel_variance_neon.c
index a058860..9599ae0 100644
--- a/aom_dsp/arm/subpel_variance_neon.c
+++ b/aom_dsp/arm/subpel_variance_neon.c
@@ -549,3 +549,239 @@
#undef SUBPEL_AVG_VARIANCE_WXH_NEON
#undef SPECIALIZED_SUBPEL_AVG_VARIANCE_WXH_NEON
+
+#if !CONFIG_REALTIME_ONLY
+
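+// The kernels below apply an eighth-pel bilinear filter in up to two passes:
+// horizontal (pixel_step == 1) over h + padding rows, then vertical
+// (pixel_step == w) over the h output rows. A scalar sketch of one output
+// pixel (assuming taps (8 - offset, offset), which sum to 8):
+//   dst[j] = (src[j] * (8 - offset) + src[j + pixel_step] * offset + 4) >> 3;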
+#define OBMC_SUBPEL_VARIANCE_WXH_NEON(w, h, padding) \
+ unsigned int aom_obmc_sub_pixel_variance##w##x##h##_neon( \
+ const uint8_t *pre, int pre_stride, int xoffset, int yoffset, \
+ const int32_t *wsrc, const int32_t *mask, unsigned int *sse) { \
+ uint8_t tmp0[w * (h + padding)]; \
+ uint8_t tmp1[w * h]; \
+ var_filter_block2d_bil_w##w(pre, tmp0, pre_stride, 1, h + padding, \
+ xoffset); \
+ var_filter_block2d_bil_w##w(tmp0, tmp1, w, w, h, yoffset); \
+ return aom_obmc_variance##w##x##h(tmp1, w, wsrc, mask, sse); \
+ }
+
+#define SPECIALIZED_OBMC_SUBPEL_VARIANCE_WXH_NEON(w, h, padding) \
+ unsigned int aom_obmc_sub_pixel_variance##w##x##h##_neon( \
+ const uint8_t *pre, int pre_stride, int xoffset, int yoffset, \
+ const int32_t *wsrc, const int32_t *mask, unsigned int *sse) { \
+ if (xoffset == 0) { \
+ if (yoffset == 0) { \
+ return aom_obmc_variance##w##x##h##_neon(pre, pre_stride, wsrc, mask, \
+ sse); \
+ } else if (yoffset == 4) { \
+ uint8_t tmp[w * h]; \
+ var_filter_block2d_avg(pre, tmp, pre_stride, pre_stride, w, h); \
+ return aom_obmc_variance##w##x##h##_neon(tmp, w, wsrc, mask, sse); \
+ } else { \
+ uint8_t tmp[w * h]; \
+ var_filter_block2d_bil_w##w(pre, tmp, pre_stride, pre_stride, h, \
+ yoffset); \
+ return aom_obmc_variance##w##x##h##_neon(tmp, w, wsrc, mask, sse); \
+ } \
+ } else if (xoffset == 4) { \
+ uint8_t tmp0[w * (h + padding)]; \
+ if (yoffset == 0) { \
+ var_filter_block2d_avg(pre, tmp0, pre_stride, 1, w, h); \
+ return aom_obmc_variance##w##x##h##_neon(tmp0, w, wsrc, mask, sse); \
+ } else if (yoffset == 4) { \
+ uint8_t tmp1[w * (h + padding)]; \
+ var_filter_block2d_avg(pre, tmp0, pre_stride, 1, w, h + padding); \
+ var_filter_block2d_avg(tmp0, tmp1, w, w, w, h); \
+ return aom_obmc_variance##w##x##h##_neon(tmp1, w, wsrc, mask, sse); \
+ } else { \
+ uint8_t tmp1[w * (h + padding)]; \
+ var_filter_block2d_avg(pre, tmp0, pre_stride, 1, w, h + padding); \
+ var_filter_block2d_bil_w##w(tmp0, tmp1, w, w, h, yoffset); \
+ return aom_obmc_variance##w##x##h##_neon(tmp1, w, wsrc, mask, sse); \
+ } \
+ } else { \
+ uint8_t tmp0[w * (h + padding)]; \
+ if (yoffset == 0) { \
+ var_filter_block2d_bil_w##w(pre, tmp0, pre_stride, 1, h, xoffset); \
+ return aom_obmc_variance##w##x##h##_neon(tmp0, w, wsrc, mask, sse); \
+ } else if (yoffset == 4) { \
+ uint8_t tmp1[w * h]; \
+ var_filter_block2d_bil_w##w(pre, tmp0, pre_stride, 1, h + padding, \
+ xoffset); \
+ var_filter_block2d_avg(tmp0, tmp1, w, w, w, h); \
+ return aom_obmc_variance##w##x##h##_neon(tmp1, w, wsrc, mask, sse); \
+ } else { \
+ uint8_t tmp1[w * h]; \
+ var_filter_block2d_bil_w##w(pre, tmp0, pre_stride, 1, h + padding, \
+ xoffset); \
+ var_filter_block2d_bil_w##w(tmp0, tmp1, w, w, h, yoffset); \
+ return aom_obmc_variance##w##x##h##_neon(tmp1, w, wsrc, mask, sse); \
+ } \
+ } \
+ }
+
+OBMC_SUBPEL_VARIANCE_WXH_NEON(4, 4, 2)
+OBMC_SUBPEL_VARIANCE_WXH_NEON(4, 8, 2)
+OBMC_SUBPEL_VARIANCE_WXH_NEON(4, 16, 2)
+
+OBMC_SUBPEL_VARIANCE_WXH_NEON(8, 4, 1)
+OBMC_SUBPEL_VARIANCE_WXH_NEON(8, 8, 1)
+OBMC_SUBPEL_VARIANCE_WXH_NEON(8, 16, 1)
+OBMC_SUBPEL_VARIANCE_WXH_NEON(8, 32, 1)
+
+OBMC_SUBPEL_VARIANCE_WXH_NEON(16, 4, 1)
+OBMC_SUBPEL_VARIANCE_WXH_NEON(16, 8, 1)
+SPECIALIZED_OBMC_SUBPEL_VARIANCE_WXH_NEON(16, 16, 1)
+SPECIALIZED_OBMC_SUBPEL_VARIANCE_WXH_NEON(16, 32, 1)
+SPECIALIZED_OBMC_SUBPEL_VARIANCE_WXH_NEON(16, 64, 1)
+
+SPECIALIZED_OBMC_SUBPEL_VARIANCE_WXH_NEON(32, 8, 1)
+SPECIALIZED_OBMC_SUBPEL_VARIANCE_WXH_NEON(32, 16, 1)
+SPECIALIZED_OBMC_SUBPEL_VARIANCE_WXH_NEON(32, 32, 1)
+SPECIALIZED_OBMC_SUBPEL_VARIANCE_WXH_NEON(32, 64, 1)
+
+SPECIALIZED_OBMC_SUBPEL_VARIANCE_WXH_NEON(64, 16, 1)
+SPECIALIZED_OBMC_SUBPEL_VARIANCE_WXH_NEON(64, 32, 1)
+SPECIALIZED_OBMC_SUBPEL_VARIANCE_WXH_NEON(64, 64, 1)
+SPECIALIZED_OBMC_SUBPEL_VARIANCE_WXH_NEON(64, 128, 1)
+
+SPECIALIZED_OBMC_SUBPEL_VARIANCE_WXH_NEON(128, 64, 1)
+SPECIALIZED_OBMC_SUBPEL_VARIANCE_WXH_NEON(128, 128, 1)
+
+#undef OBMC_SUBPEL_VARIANCE_WXH_NEON
+#undef SPECIALIZED_OBMC_SUBPEL_VARIANCE_WXH_NEON
+#endif // !CONFIG_REALTIME_ONLY
+
+#define MASKED_SUBPEL_VARIANCE_WXH_NEON(w, h, padding) \
+ unsigned int aom_masked_sub_pixel_variance##w##x##h##_neon( \
+ const uint8_t *src, int src_stride, int xoffset, int yoffset, \
+ const uint8_t *ref, int ref_stride, const uint8_t *second_pred, \
+ const uint8_t *msk, int msk_stride, int invert_mask, \
+ unsigned int *sse) { \
+ uint8_t tmp0[w * (h + padding)]; \
+ uint8_t tmp1[w * h]; \
+ uint8_t tmp2[w * h]; \
+ var_filter_block2d_bil_w##w(src, tmp0, src_stride, 1, (h + padding), \
+ xoffset); \
+ var_filter_block2d_bil_w##w(tmp0, tmp1, w, w, h, yoffset); \
+ aom_comp_mask_pred_neon(tmp2, second_pred, w, h, tmp1, w, msk, msk_stride, \
+ invert_mask); \
+ return aom_variance##w##x##h##_neon(tmp2, w, ref, ref_stride, sse); \
+ }
+
+#define SPECIALIZED_MASKED_SUBPEL_VARIANCE_WXH_NEON(w, h, padding) \
+ unsigned int aom_masked_sub_pixel_variance##w##x##h##_neon( \
+ const uint8_t *src, int src_stride, int xoffset, int yoffset, \
+ const uint8_t *ref, int ref_stride, const uint8_t *second_pred, \
+ const uint8_t *msk, int msk_stride, int invert_mask, \
+ unsigned int *sse) { \
+ if (xoffset == 0) { \
+ uint8_t tmp0[w * h]; \
+ if (yoffset == 0) { \
+ aom_comp_mask_pred_neon(tmp0, second_pred, w, h, src, src_stride, msk, \
+ msk_stride, invert_mask); \
+ return aom_variance##w##x##h##_neon(tmp0, w, ref, ref_stride, sse); \
+ } else if (yoffset == 4) { \
+ uint8_t tmp1[w * h]; \
+ var_filter_block2d_avg(src, tmp0, src_stride, src_stride, w, h); \
+ aom_comp_mask_pred_neon(tmp1, second_pred, w, h, tmp0, w, msk, \
+ msk_stride, invert_mask); \
+ return aom_variance##w##x##h##_neon(tmp1, w, ref, ref_stride, sse); \
+ } else { \
+ uint8_t tmp1[w * h]; \
+ var_filter_block2d_bil_w##w(src, tmp0, src_stride, src_stride, h, \
+ yoffset); \
+ aom_comp_mask_pred_neon(tmp1, second_pred, w, h, tmp0, w, msk, \
+ msk_stride, invert_mask); \
+ return aom_variance##w##x##h##_neon(tmp1, w, ref, ref_stride, sse); \
+ } \
+ } else if (xoffset == 4) { \
+ uint8_t tmp0[w * (h + padding)]; \
+ if (yoffset == 0) { \
+ uint8_t tmp1[w * h]; \
+ var_filter_block2d_avg(src, tmp0, src_stride, 1, w, h); \
+ aom_comp_mask_pred_neon(tmp1, second_pred, w, h, tmp0, w, msk, \
+ msk_stride, invert_mask); \
+ return aom_variance##w##x##h##_neon(tmp1, w, ref, ref_stride, sse); \
+ } else if (yoffset == 4) { \
+ uint8_t tmp1[w * h]; \
+ uint8_t tmp2[w * h]; \
+ var_filter_block2d_avg(src, tmp0, src_stride, 1, w, (h + padding)); \
+ var_filter_block2d_avg(tmp0, tmp1, w, w, w, h); \
+ aom_comp_mask_pred_neon(tmp2, second_pred, w, h, tmp1, w, msk, \
+ msk_stride, invert_mask); \
+ return aom_variance##w##x##h##_neon(tmp2, w, ref, ref_stride, sse); \
+ } else { \
+ uint8_t tmp1[w * h]; \
+ uint8_t tmp2[w * h]; \
+ var_filter_block2d_avg(src, tmp0, src_stride, 1, w, (h + padding)); \
+ var_filter_block2d_bil_w##w(tmp0, tmp1, w, w, h, yoffset); \
+ aom_comp_mask_pred_neon(tmp2, second_pred, w, h, tmp1, w, msk, \
+ msk_stride, invert_mask); \
+ return aom_variance##w##x##h##_neon(tmp2, w, ref, ref_stride, sse); \
+ } \
+ } else { \
+ if (yoffset == 0) { \
+ uint8_t tmp0[w * h]; \
+ uint8_t tmp1[w * h]; \
+ var_filter_block2d_bil_w##w(src, tmp0, src_stride, 1, h, xoffset); \
+ aom_comp_mask_pred_neon(tmp1, second_pred, w, h, tmp0, w, msk, \
+ msk_stride, invert_mask); \
+ return aom_variance##w##x##h##_neon(tmp1, w, ref, ref_stride, sse); \
+ } else if (yoffset == 4) { \
+ uint8_t tmp0[w * (h + padding)]; \
+ uint8_t tmp1[w * h]; \
+ uint8_t tmp2[w * h]; \
+ var_filter_block2d_bil_w##w(src, tmp0, src_stride, 1, (h + padding), \
+ xoffset); \
+ var_filter_block2d_avg(tmp0, tmp1, w, w, w, h); \
+ aom_comp_mask_pred_neon(tmp2, second_pred, w, h, tmp1, w, msk, \
+ msk_stride, invert_mask); \
+ return aom_variance##w##x##h##_neon(tmp2, w, ref, ref_stride, sse); \
+ } else { \
+ uint8_t tmp0[w * (h + padding)]; \
+ uint8_t tmp1[w * (h + padding)]; \
+ uint8_t tmp2[w * h]; \
+ var_filter_block2d_bil_w##w(src, tmp0, src_stride, 1, (h + padding), \
+ xoffset); \
+ var_filter_block2d_bil_w##w(tmp0, tmp1, w, w, h, yoffset); \
+ aom_comp_mask_pred_neon(tmp2, second_pred, w, h, tmp1, w, msk, \
+ msk_stride, invert_mask); \
+ return aom_variance##w##x##h##_neon(tmp2, w, ref, ref_stride, sse); \
+ } \
+ } \
+ }
+
+MASKED_SUBPEL_VARIANCE_WXH_NEON(4, 4, 2)
+MASKED_SUBPEL_VARIANCE_WXH_NEON(4, 8, 2)
+
+MASKED_SUBPEL_VARIANCE_WXH_NEON(8, 4, 1)
+MASKED_SUBPEL_VARIANCE_WXH_NEON(8, 8, 1)
+MASKED_SUBPEL_VARIANCE_WXH_NEON(8, 16, 1)
+
+MASKED_SUBPEL_VARIANCE_WXH_NEON(16, 8, 1)
+SPECIALIZED_MASKED_SUBPEL_VARIANCE_WXH_NEON(16, 16, 1)
+SPECIALIZED_MASKED_SUBPEL_VARIANCE_WXH_NEON(16, 32, 1)
+
+SPECIALIZED_MASKED_SUBPEL_VARIANCE_WXH_NEON(32, 16, 1)
+SPECIALIZED_MASKED_SUBPEL_VARIANCE_WXH_NEON(32, 32, 1)
+SPECIALIZED_MASKED_SUBPEL_VARIANCE_WXH_NEON(32, 64, 1)
+
+SPECIALIZED_MASKED_SUBPEL_VARIANCE_WXH_NEON(64, 32, 1)
+SPECIALIZED_MASKED_SUBPEL_VARIANCE_WXH_NEON(64, 64, 1)
+SPECIALIZED_MASKED_SUBPEL_VARIANCE_WXH_NEON(64, 128, 1)
+
+SPECIALIZED_MASKED_SUBPEL_VARIANCE_WXH_NEON(128, 64, 1)
+SPECIALIZED_MASKED_SUBPEL_VARIANCE_WXH_NEON(128, 128, 1)
+
+// Realtime mode doesn't use the 4:1/1:4 aspect-ratio rectangular blocks.
+#if !CONFIG_REALTIME_ONLY
+MASKED_SUBPEL_VARIANCE_WXH_NEON(4, 16, 2)
+MASKED_SUBPEL_VARIANCE_WXH_NEON(8, 32, 1)
+MASKED_SUBPEL_VARIANCE_WXH_NEON(16, 4, 1)
+SPECIALIZED_MASKED_SUBPEL_VARIANCE_WXH_NEON(16, 64, 1)
+SPECIALIZED_MASKED_SUBPEL_VARIANCE_WXH_NEON(32, 8, 1)
+SPECIALIZED_MASKED_SUBPEL_VARIANCE_WXH_NEON(64, 16, 1)
+#endif // !CONFIG_REALTIME_ONLY
+
+#undef MASKED_SUBPEL_VARIANCE_WXH_NEON
+#undef SPECIALIZED_MASKED_SUBPEL_VARIANCE_WXH_NEON
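
For orientation, the masked pipeline above composes three steps: bilinear filtering of src, a mask-weighted blend against second_pred, and a plain variance against ref. Below is a hypothetical scalar helper sketching the blend step, assuming the usual A64 mask convention (mask values in [0, 64], round to nearest); which of the two inputs receives the mask weight flips with invert_mask:

static uint8_t blend_a64_sketch(uint8_t p0, uint8_t p1, uint8_t m,
                                int invert_mask) {
  if (invert_mask) {  // Swap which prediction the mask weights.
    uint8_t t = p0;
    p0 = p1;
    p1 = t;
  }
  return (uint8_t)((m * p0 + (64 - m) * p1 + 32) >> 6);  // /64, rounded
}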
diff --git a/aom_dsp/arm/sum_neon.h b/aom_dsp/arm/sum_neon.h
index 855edf6..ff68c12 100644
--- a/aom_dsp/arm/sum_neon.h
+++ b/aom_dsp/arm/sum_neon.h
@@ -15,7 +15,7 @@
#include "aom_ports/mem.h"
static INLINE int horizontal_add_s16x8(const int16x8_t a) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vaddlvq_s16(a);
#else
const int32x4_t b = vpaddlq_s16(a);
@@ -27,7 +27,7 @@
}
static INLINE int horizontal_add_s32x4(const int32x4_t a) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vaddvq_s32(a);
#else
const int64x2_t b = vpaddlq_s32(a);
@@ -37,8 +37,16 @@
#endif
}
+static INLINE int64_t horizontal_add_s64x2(const int64x2_t a) {
+#if AOM_ARCH_AARCH64
+ return vaddvq_s64(a);
+#else
+ return vgetq_lane_s64(a, 0) + vgetq_lane_s64(a, 1);
+#endif
+}
+
static INLINE uint64_t horizontal_add_u64x2(const uint64x2_t a) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vaddvq_u64(a);
#else
return vgetq_lane_u64(a, 0) + vgetq_lane_u64(a, 1);
@@ -46,7 +54,7 @@
}
static INLINE uint64_t horizontal_long_add_u32x4(const uint32x4_t a) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vaddlvq_u32(a);
#else
const uint64x2_t b = vpaddlq_u32(a);
@@ -55,7 +63,7 @@
}
static INLINE unsigned int horizontal_add_u32x4(const uint32x4_t a) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vaddvq_u32(a);
#else
const uint64x2_t b = vpaddlq_u32(a);
@@ -65,9 +73,24 @@
#endif
}
+static INLINE uint32x4_t horizontal_add_4d_u32x4(const uint32x4_t sum[4]) {
+#if AOM_ARCH_AARCH64
+ uint32x4_t res01 = vpaddq_u32(sum[0], sum[1]);
+ uint32x4_t res23 = vpaddq_u32(sum[2], sum[3]);
+ return vpaddq_u32(res01, res23);
+#else
+ uint32x4_t res = vdupq_n_u32(0);
+ res = vsetq_lane_u32(horizontal_add_u32x4(sum[0]), res, 0);
+ res = vsetq_lane_u32(horizontal_add_u32x4(sum[1]), res, 1);
+ res = vsetq_lane_u32(horizontal_add_u32x4(sum[2]), res, 2);
+ res = vsetq_lane_u32(horizontal_add_u32x4(sum[3]), res, 3);
+ return res;
+#endif
+}
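
Lane i of the returned vector is the horizontal sum of input vector i, so four independent reductions collapse into three vpaddq_u32 instructions on AArch64. A scalar reference of the contract (hypothetical helper, arrays standing in for the vectors):

static void horizontal_add_4d_u32x4_ref(const uint32_t sum[4][4],
                                        uint32_t res[4]) {
  for (int i = 0; i < 4; i++) {
    // res lane i == horizontal sum of input vector i.
    res[i] = sum[i][0] + sum[i][1] + sum[i][2] + sum[i][3];
  }
}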
+
static INLINE uint32_t horizontal_long_add_u16x8(const uint16x8_t vec_lo,
const uint16x8_t vec_hi) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vaddlvq_u16(vec_lo) + vaddlvq_u16(vec_hi);
#else
const uint32x4_t vec_l_lo =
@@ -82,8 +105,33 @@
#endif
}
+static INLINE uint32x4_t horizontal_long_add_4d_u16x8(
+ const uint16x8_t sum_lo[4], const uint16x8_t sum_hi[4]) {
+ const uint32x4_t a0 = vpaddlq_u16(sum_lo[0]);
+ const uint32x4_t a1 = vpaddlq_u16(sum_lo[1]);
+ const uint32x4_t a2 = vpaddlq_u16(sum_lo[2]);
+ const uint32x4_t a3 = vpaddlq_u16(sum_lo[3]);
+ const uint32x4_t b0 = vpadalq_u16(a0, sum_hi[0]);
+ const uint32x4_t b1 = vpadalq_u16(a1, sum_hi[1]);
+ const uint32x4_t b2 = vpadalq_u16(a2, sum_hi[2]);
+ const uint32x4_t b3 = vpadalq_u16(a3, sum_hi[3]);
+#if AOM_ARCH_AARCH64
+ const uint32x4_t c0 = vpaddq_u32(b0, b1);
+ const uint32x4_t c1 = vpaddq_u32(b2, b3);
+ return vpaddq_u32(c0, c1);
+#else
+ const uint32x2_t c0 = vadd_u32(vget_low_u32(b0), vget_high_u32(b0));
+ const uint32x2_t c1 = vadd_u32(vget_low_u32(b1), vget_high_u32(b1));
+ const uint32x2_t c2 = vadd_u32(vget_low_u32(b2), vget_high_u32(b2));
+ const uint32x2_t c3 = vadd_u32(vget_low_u32(b3), vget_high_u32(b3));
+ const uint32x2_t d0 = vpadd_u32(c0, c1);
+ const uint32x2_t d1 = vpadd_u32(c2, c3);
+ return vcombine_u32(d0, d1);
+#endif
+}
+
static INLINE uint32_t horizontal_add_u16x8(const uint16x8_t a) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vaddlvq_u16(a);
#else
const uint32x4_t b = vpaddlq_u16(a);
@@ -94,8 +142,25 @@
#endif
}
+static INLINE uint32x4_t horizontal_add_4d_u16x8(const uint16x8_t sum[4]) {
+#if AOM_ARCH_AARCH64
+ const uint16x8_t a0 = vpaddq_u16(sum[0], sum[1]);
+ const uint16x8_t a1 = vpaddq_u16(sum[2], sum[3]);
+ const uint16x8_t b0 = vpaddq_u16(a0, a1);
+ return vpaddlq_u16(b0);
+#else
+ const uint16x4_t a0 = vadd_u16(vget_low_u16(sum[0]), vget_high_u16(sum[0]));
+ const uint16x4_t a1 = vadd_u16(vget_low_u16(sum[1]), vget_high_u16(sum[1]));
+ const uint16x4_t a2 = vadd_u16(vget_low_u16(sum[2]), vget_high_u16(sum[2]));
+ const uint16x4_t a3 = vadd_u16(vget_low_u16(sum[3]), vget_high_u16(sum[3]));
+ const uint16x4_t b0 = vpadd_u16(a0, a1);
+ const uint16x4_t b1 = vpadd_u16(a2, a3);
+ return vpaddlq_u16(vcombine_u16(b0, b1));
+#endif
+}
+
static INLINE uint32_t horizontal_add_u32x2(const uint32x2_t a) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vaddv_u32(a);
#else
const uint64x1_t b = vpaddl_u32(a);
@@ -103,8 +168,17 @@
#endif
}
+static INLINE uint64_t horizontal_long_add_u32x2(const uint32x2_t a) {
+#if AOM_ARCH_AARCH64
+ return vaddlv_u32(a);
+#else
+ const uint64x1_t b = vpaddl_u32(a);
+ return vget_lane_u64(b, 0);
+#endif
+}
+
static INLINE uint32_t horizontal_add_u16x4(const uint16x4_t a) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vaddlv_u16(a);
#else
const uint32x2_t b = vpaddl_u16(a);
diff --git a/aom_dsp/arm/sum_squares_neon.c b/aom_dsp/arm/sum_squares_neon.c
index bf212a9..626cf21 100644
--- a/aom_dsp/arm/sum_squares_neon.c
+++ b/aom_dsp/arm/sum_squares_neon.c
@@ -35,7 +35,7 @@
int stride, int height) {
int32x4_t sum_squares[2] = { vdupq_n_s32(0), vdupq_n_s32(0) };
- int h = 0;
+ int h = height;
do {
int16x4_t s0 = vld1_s16(src + 0 * stride);
int16x4_t s1 = vld1_s16(src + 1 * stride);
@@ -48,8 +48,8 @@
sum_squares[1] = vmlal_s16(sum_squares[1], s3, s3);
src += 4 * stride;
- h += 4;
- } while (h < height);
+ h -= 4;
+ } while (h != 0);
return horizontal_long_add_u32x4(
vreinterpretq_u32_s32(vaddq_s32(sum_squares[0], sum_squares[1])));
@@ -60,7 +60,7 @@
int height) {
uint64x2_t sum_squares = vdupq_n_u64(0);
- int h = 0;
+ int h = height;
do {
int32x4_t ss_row[2] = { vdupq_n_s32(0), vdupq_n_s32(0) };
int w = 0;
@@ -86,8 +86,8 @@
sum_squares, vreinterpretq_u32_s32(vaddq_s32(ss_row[0], ss_row[1])));
src += 4 * stride;
- h += 4;
- } while (h < height);
+ h -= 4;
+ } while (h != 0);
return horizontal_add_u64x2(sum_squares);
}
@@ -134,7 +134,7 @@
int32x4_t sse[2] = { vdupq_n_s32(0), vdupq_n_s32(0) };
int32x2_t sum_acc[2] = { vdup_n_s32(0), vdup_n_s32(0) };
- int h = 0;
+ int h = height;
do {
int16x4_t s0 = vld1_s16(src + 0 * stride);
int16x4_t s1 = vld1_s16(src + 1 * stride);
@@ -152,8 +152,8 @@
sum_acc[1] = vpadal_s16(sum_acc[1], s3);
src += 4 * stride;
- h += 4;
- } while (h < height);
+ h -= 4;
+ } while (h != 0);
*sum += horizontal_add_s32x4(vcombine_s32(sum_acc[0], sum_acc[1]));
return horizontal_long_add_u32x4(
@@ -166,7 +166,7 @@
uint64x2_t sse = vdupq_n_u64(0);
int32x4_t sum_acc = vdupq_n_s32(0);
- int h = 0;
+ int h = height;
do {
int32x4_t sse_row[2] = { vdupq_n_s32(0), vdupq_n_s32(0) };
int w = 0;
@@ -198,8 +198,8 @@
vreinterpretq_u32_s32(vaddq_s32(sse_row[0], sse_row[1])));
src += 4 * stride;
- h += 4;
- } while (h < height);
+ h -= 4;
+ } while (h != 0);
*sum += horizontal_add_s32x4(sum_acc);
return horizontal_add_u64x2(sse);
@@ -223,3 +223,478 @@
return sse;
}
+
+static INLINE uint64_t aom_sum_squares_i16_4xn_neon(const int16_t *src,
+ uint32_t n) {
+ uint64x2_t sum_u64 = vdupq_n_u64(0);
+
+ int i = n;
+ do {
+ uint32x4_t sum;
+ int16x4_t s0 = vld1_s16(src);
+
+ sum = vreinterpretq_u32_s32(vmull_s16(s0, s0));
+
+ sum_u64 = vpadalq_u32(sum_u64, sum);
+
+ src += 4;
+ i -= 4;
+ } while (i >= 4);
+
+ if (i > 0) {
+ return horizontal_add_u64x2(sum_u64) + aom_sum_squares_i16_c(src, i);
+ }
+ return horizontal_add_u64x2(sum_u64);
+}
+
+static INLINE uint64_t aom_sum_squares_i16_8xn_neon(const int16_t *src,
+ uint32_t n) {
+ uint64x2_t sum_u64[2] = { vdupq_n_u64(0), vdupq_n_u64(0) };
+
+ int i = n;
+ do {
+ uint32x4_t sum[2];
+ int16x8_t s0 = vld1q_s16(src);
+
+ sum[0] =
+ vreinterpretq_u32_s32(vmull_s16(vget_low_s16(s0), vget_low_s16(s0)));
+ sum[1] =
+ vreinterpretq_u32_s32(vmull_s16(vget_high_s16(s0), vget_high_s16(s0)));
+
+ sum_u64[0] = vpadalq_u32(sum_u64[0], sum[0]);
+ sum_u64[1] = vpadalq_u32(sum_u64[1], sum[1]);
+
+ src += 8;
+ i -= 8;
+ } while (i >= 8);
+
+ if (i > 0) {
+ return horizontal_add_u64x2(vaddq_u64(sum_u64[0], sum_u64[1])) +
+ aom_sum_squares_i16_c(src, i);
+ }
+ return horizontal_add_u64x2(vaddq_u64(sum_u64[0], sum_u64[1]));
+}
+
+uint64_t aom_sum_squares_i16_neon(const int16_t *src, uint32_t n) {
+ // This function seems to be called only for values of N >= 64. See
+ // av1/encoder/compound_type.c.
+ if (LIKELY(n >= 8)) {
+ return aom_sum_squares_i16_8xn_neon(src, n);
+ }
+ if (n >= 4) {
+ return aom_sum_squares_i16_4xn_neon(src, n);
+ }
+ return aom_sum_squares_i16_c(src, n);
+}
+
+#if defined(__ARM_FEATURE_DOTPROD)
+
+static INLINE uint64_t aom_var_2d_u8_4xh_neon(uint8_t *src, int src_stride,
+ int width, int height) {
+ uint64_t sum = 0;
+ uint64_t sse = 0;
+ uint32x2_t sum_u32 = vdup_n_u32(0);
+ uint32x2_t sse_u32 = vdup_n_u32(0);
+
+ int h = height / 2;
+ do {
+ int w = width;
+ uint8_t *src_ptr = src;
+ do {
+ uint8x8_t s0 = load_unaligned_u8(src_ptr, src_stride);
+
+ sum_u32 = vdot_u32(sum_u32, s0, vdup_n_u8(1));
+
+ sse_u32 = vdot_u32(sse_u32, s0, s0);
+
+ src_ptr += 8;
+ w -= 8;
+ } while (w >= 8);
+
+ // Process remaining columns in the row using C.
+ while (w > 0) {
+ int idx = width - w;
+ const uint8_t v = src[idx];
+ sum += v;
+ sse += v * v;
+ w--;
+ }
+
+ src += 2 * src_stride;
+ } while (--h != 0);
+
+ sum += horizontal_long_add_u32x2(sum_u32);
+ sse += horizontal_long_add_u32x2(sse_u32);
+
+ return sse - sum * sum / (width * height);
+}
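
The return expression in these helpers is the one-pass variance identity with N = width * height:

  sum_i (x_i - mean)^2 = sum_i x_i^2 - (sum_i x_i)^2 / N = sse - sum * sum / N

The integer division truncates, which is presumably also how the aom_var_2d_u8_c fallback computes it, so the NEON and C paths agree.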
+
+static INLINE uint64_t aom_var_2d_u8_8xh_neon(uint8_t *src, int src_stride,
+ int width, int height) {
+ uint64_t sum = 0;
+ uint64_t sse = 0;
+ uint32x2_t sum_u32 = vdup_n_u32(0);
+ uint32x2_t sse_u32 = vdup_n_u32(0);
+
+ int h = height;
+ do {
+ int w = width;
+ uint8_t *src_ptr = src;
+ do {
+ uint8x8_t s0 = vld1_u8(src_ptr);
+
+ sum_u32 = vdot_u32(sum_u32, s0, vdup_n_u8(1));
+
+ sse_u32 = vdot_u32(sse_u32, s0, s0);
+
+ src_ptr += 8;
+ w -= 8;
+ } while (w >= 8);
+
+ // Process remaining columns in the row using C.
+ while (w > 0) {
+ int idx = width - w;
+ const uint8_t v = src[idx];
+ sum += v;
+ sse += v * v;
+ w--;
+ }
+
+ src += src_stride;
+ } while (--h != 0);
+
+ sum += horizontal_long_add_u32x2(sum_u32);
+ sse += horizontal_long_add_u32x2(sse_u32);
+
+ return sse - sum * sum / (width * height);
+}
+
+static INLINE uint64_t aom_var_2d_u8_16xh_neon(uint8_t *src, int src_stride,
+ int width, int height) {
+ uint64_t sum = 0;
+ uint64_t sse = 0;
+ uint32x4_t sum_u32 = vdupq_n_u32(0);
+ uint32x4_t sse_u32 = vdupq_n_u32(0);
+
+ int h = height;
+ do {
+ int w = width;
+ uint8_t *src_ptr = src;
+ do {
+ uint8x16_t s0 = vld1q_u8(src_ptr);
+
+ sum_u32 = vdotq_u32(sum_u32, s0, vdupq_n_u8(1));
+
+ sse_u32 = vdotq_u32(sse_u32, s0, s0);
+
+ src_ptr += 16;
+ w -= 16;
+ } while (w >= 16);
+
+ // Process remaining columns in the row using C.
+ while (w > 0) {
+ int idx = width - w;
+ const uint8_t v = src[idx];
+ sum += v;
+ sse += v * v;
+ w--;
+ }
+
+ src += src_stride;
+ } while (--h != 0);
+
+ sum += horizontal_long_add_u32x4(sum_u32);
+ sse += horizontal_long_add_u32x4(sse_u32);
+
+ return sse - sum * sum / (width * height);
+}
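
The UDOT idiom above extracts both statistics from one widening instruction each: a dot product against an all-ones vector is the plain sum, and a dot product of a vector with itself is the sum of squares. In scalar form (hypothetical reference helper):

static void dot_trick_ref(const uint8_t *s, int n, uint32_t *sum,
                          uint32_t *sse) {
  *sum = 0;
  *sse = 0;
  for (int i = 0; i < n; i++) {
    *sum += s[i];         // models vdotq_u32(sum, s, vdupq_n_u8(1))
    *sse += s[i] * s[i];  // models vdotq_u32(sse, s, s)
  }
}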
+
+#else // !defined(__ARM_FEATURE_DOTPROD)
+
+static INLINE uint64_t aom_var_2d_u8_4xh_neon(uint8_t *src, int src_stride,
+ int width, int height) {
+ uint64_t sum = 0;
+ uint64_t sse = 0;
+ uint32x2_t sum_u32 = vdup_n_u32(0);
+ uint32x4_t sse_u32 = vdupq_n_u32(0);
+
+ // 255*256 = 65280, so we can accumulate up to 256 8-bit elements in a 16-bit
+ // element before we need to accumulate to 32-bit elements. Since we're
+ // accumulating in uint16x4_t vectors, this means we can accumulate up to 4
+ // rows of 256 elements. Therefore the limit can be computed as: h_limit = (4
+ // * 256) / width.
+ int h_limit = (4 * 256) / width;
+ int h_tmp = height > h_limit ? h_limit : height;
+
+ int h = 0;
+ do {
+ uint16x4_t sum_u16 = vdup_n_u16(0);
+ do {
+ uint8_t *src_ptr = src;
+ int w = width;
+ do {
+ uint8x8_t s0 = load_unaligned_u8(src_ptr, src_stride);
+
+ sum_u16 = vpadal_u8(sum_u16, s0);
+
+ uint16x8_t sse_u16 = vmull_u8(s0, s0);
+
+ sse_u32 = vpadalq_u16(sse_u32, sse_u16);
+
+ src_ptr += 8;
+ w -= 8;
+ } while (w >= 8);
+
+ // Process remaining columns in the row using C.
+ while (w > 0) {
+ int idx = width - w;
+ const uint8_t v = src[idx];
+ sum += v;
+ sse += v * v;
+ w--;
+ }
+
+ src += 2 * src_stride;
+ h += 2;
+ } while (h < h_tmp && h < height);
+
+ sum_u32 = vpadal_u16(sum_u32, sum_u16);
+ h_tmp += h_limit;
+ } while (h < height);
+
+ sum += horizontal_long_add_u32x2(sum_u32);
+ sse += horizontal_long_add_u32x4(sse_u32);
+
+ return sse - sum * sum / (width * height);
+}
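
To make the bound in the comment concrete: 255 * 256 = 65280 <= 65535, so one 16-bit lane can absorb 256 maximal pixels before it must be widened, and the 4-lane uint16x4_t accumulator therefore holds 4 * 256 = 1024 pixels in total. At width 4 that gives h_limit = 1024 / 4 = 256 rows between vpadal_u16 widening steps; the 16-wide variant below accumulates into a uint16x8_t and so doubles the budget to 8 * 256 pixels.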
+
+static INLINE uint64_t aom_var_2d_u8_8xh_neon(uint8_t *src, int src_stride,
+ int width, int height) {
+ uint64_t sum = 0;
+ uint64_t sse = 0;
+ uint32x2_t sum_u32 = vdup_n_u32(0);
+ uint32x4_t sse_u32 = vdupq_n_u32(0);
+
+ // 255*256 = 65280, so we can accumulate up to 256 8-bit elements in a 16-bit
+ // element before we need to accumulate to 32-bit elements. Since we're
+ // accumulating in uint16x4_t vectors, this means we can accumulate up to 4
+ // rows of 256 elements. Therefore the limit can be computed as: h_limit = (4
+ // * 256) / width.
+ int h_limit = (4 * 256) / width;
+ int h_tmp = height > h_limit ? h_limit : height;
+
+ int h = 0;
+ do {
+ uint16x4_t sum_u16 = vdup_n_u16(0);
+ do {
+ uint8_t *src_ptr = src;
+ int w = width;
+ do {
+ uint8x8_t s0 = vld1_u8(src_ptr);
+
+ sum_u16 = vpadal_u8(sum_u16, s0);
+
+ uint16x8_t sse_u16 = vmull_u8(s0, s0);
+
+ sse_u32 = vpadalq_u16(sse_u32, sse_u16);
+
+ src_ptr += 8;
+ w -= 8;
+ } while (w >= 8);
+
+ // Process remaining columns in the row using C.
+ while (w > 0) {
+ int idx = width - w;
+ const uint8_t v = src[idx];
+ sum += v;
+ sse += v * v;
+ w--;
+ }
+
+ src += src_stride;
+ ++h;
+ } while (h < h_tmp && h < height);
+
+ sum_u32 = vpadal_u16(sum_u32, sum_u16);
+ h_tmp += h_limit;
+ } while (h < height);
+
+ sum += horizontal_long_add_u32x2(sum_u32);
+ sse += horizontal_long_add_u32x4(sse_u32);
+
+ return sse - sum * sum / (width * height);
+}
+
+static INLINE uint64_t aom_var_2d_u8_16xh_neon(uint8_t *src, int src_stride,
+ int width, int height) {
+ uint64_t sum = 0;
+ uint64_t sse = 0;
+ uint32x4_t sum_u32 = vdupq_n_u32(0);
+ uint32x4_t sse_u32[2] = { vdupq_n_u32(0), vdupq_n_u32(0) };
+
+ // 255*256 = 65280, so we can accumulate up to 256 8-bit elements in a 16-bit
+ // element before we need to accumulate to 32-bit elements. Since we're
+ // accumulating in uint16x8_t vectors, this means we can accumulate up to 8
+ // rows of 256 elements. Therefore the limit can be computed as: h_limit = (8
+ // * 256) / width.
+ int h_limit = (8 * 256) / width;
+ int h_tmp = height > h_limit ? h_limit : height;
+
+ int h = 0;
+ do {
+ uint16x8_t sum_u16 = vdupq_n_u16(0);
+ do {
+ int w = width;
+ uint8_t *src_ptr = src;
+ do {
+ uint8x16_t s0 = vld1q_u8(src_ptr);
+
+ sum_u16 = vpadalq_u8(sum_u16, s0);
+
+ uint16x8_t sse_u16_lo = vmull_u8(vget_low_u8(s0), vget_low_u8(s0));
+ uint16x8_t sse_u16_hi = vmull_u8(vget_high_u8(s0), vget_high_u8(s0));
+
+ sse_u32[0] = vpadalq_u16(sse_u32[0], sse_u16_lo);
+ sse_u32[1] = vpadalq_u16(sse_u32[1], sse_u16_hi);
+
+ src_ptr += 16;
+ w -= 16;
+ } while (w >= 16);
+
+ // Process remaining columns in the row using C.
+ while (w > 0) {
+ int idx = width - w;
+ const uint8_t v = src[idx];
+ sum += v;
+ sse += v * v;
+ w--;
+ }
+
+ src += src_stride;
+ ++h;
+ } while (h < h_tmp && h < height);
+
+ sum_u32 = vpadalq_u16(sum_u32, sum_u16);
+ h_tmp += h_limit;
+ } while (h < height);
+
+ sum += horizontal_long_add_u32x4(sum_u32);
+ sse += horizontal_long_add_u32x4(vaddq_u32(sse_u32[0], sse_u32[1]));
+
+ return sse - sum * sum / (width * height);
+}
+
+#endif // defined(__ARM_FEATURE_DOTPROD)
+
+uint64_t aom_var_2d_u8_neon(uint8_t *src, int src_stride, int width,
+ int height) {
+ if (width >= 16) {
+ return aom_var_2d_u8_16xh_neon(src, src_stride, width, height);
+ }
+ if (width >= 8) {
+ return aom_var_2d_u8_8xh_neon(src, src_stride, width, height);
+ }
+ if (width >= 4 && height % 2 == 0) {
+ return aom_var_2d_u8_4xh_neon(src, src_stride, width, height);
+ }
+ return aom_var_2d_u8_c(src, src_stride, width, height);
+}
+
+static INLINE uint64_t aom_var_2d_u16_4xh_neon(uint8_t *src, int src_stride,
+ int width, int height) {
+ uint16_t *src_u16 = CONVERT_TO_SHORTPTR(src);
+ uint64_t sum = 0;
+ uint64_t sse = 0;
+ uint32x2_t sum_u32 = vdup_n_u32(0);
+ uint64x2_t sse_u64 = vdupq_n_u64(0);
+
+ int h = height;
+ do {
+ int w = width;
+ uint16_t *src_ptr = src_u16;
+ do {
+ uint16x4_t s0 = vld1_u16(src_ptr);
+
+ sum_u32 = vpadal_u16(sum_u32, s0);
+
+ uint32x4_t sse_u32 = vmull_u16(s0, s0);
+
+ sse_u64 = vpadalq_u32(sse_u64, sse_u32);
+
+ src_ptr += 4;
+ w -= 4;
+ } while (w >= 4);
+
+ // Process remaining columns in the row using C.
+ while (w > 0) {
+ int idx = width - w;
+ const uint16_t v = src_u16[idx];
+ sum += v;
+ sse += v * v;
+ w--;
+ }
+
+ src_u16 += src_stride;
+ } while (--h != 0);
+
+ sum += horizontal_long_add_u32x2(sum_u32);
+ sse += horizontal_add_u64x2(sse_u64);
+
+ return sse - sum * sum / (width * height);
+}
+
+static INLINE uint64_t aom_var_2d_u16_8xh_neon(uint8_t *src, int src_stride,
+ int width, int height) {
+ uint16_t *src_u16 = CONVERT_TO_SHORTPTR(src);
+ uint64_t sum = 0;
+ uint64_t sse = 0;
+ uint32x4_t sum_u32 = vdupq_n_u32(0);
+ uint64x2_t sse_u64[2] = { vdupq_n_u64(0), vdupq_n_u64(0) };
+
+ int h = height;
+ do {
+ int w = width;
+ uint16_t *src_ptr = src_u16;
+ do {
+ uint16x8_t s0 = vld1q_u16(src_ptr);
+
+ sum_u32 = vpadalq_u16(sum_u32, s0);
+
+ uint32x4_t sse_u32_lo = vmull_u16(vget_low_u16(s0), vget_low_u16(s0));
+ uint32x4_t sse_u32_hi = vmull_u16(vget_high_u16(s0), vget_high_u16(s0));
+
+ sse_u64[0] = vpadalq_u32(sse_u64[0], sse_u32_lo);
+ sse_u64[1] = vpadalq_u32(sse_u64[1], sse_u32_hi);
+
+ src_ptr += 8;
+ w -= 8;
+ } while (w >= 8);
+
+ // Process remaining columns in the row using C.
+ while (w > 0) {
+ int idx = width - w;
+ const uint16_t v = src_u16[idx];
+ sum += v;
+ sse += v * v;
+ w--;
+ }
+
+ src_u16 += src_stride;
+ } while (--h != 0);
+
+ sum += horizontal_long_add_u32x4(sum_u32);
+ sse += horizontal_add_u64x2(vaddq_u64(sse_u64[0], sse_u64[1]));
+
+ return sse - sum * sum / (width * height);
+}
+
+uint64_t aom_var_2d_u16_neon(uint8_t *src, int src_stride, int width,
+ int height) {
+ if (width >= 8) {
+ return aom_var_2d_u16_8xh_neon(src, src_stride, width, height);
+ }
+ if (width >= 4) {
+ return aom_var_2d_u16_4xh_neon(src, src_stride, width, height);
+ }
+ return aom_var_2d_u16_c(src, src_stride, width, height);
+}
diff --git a/aom_dsp/arm/transpose_neon.h b/aom_dsp/arm/transpose_neon.h
index 26fc1fd..8218140 100644
--- a/aom_dsp/arm/transpose_neon.h
+++ b/aom_dsp/arm/transpose_neon.h
@@ -13,6 +13,8 @@
#include <arm_neon.h>
+#include "config/aom_config.h"
+
// Swap high and low halves.
static INLINE uint16x8_t transpose64_u16q(const uint16x8_t a) {
return vextq_u16(a, a, 4);
@@ -258,13 +260,19 @@
a[3] = vreinterpretq_u16_u32(c1.val[1]);
}
-static INLINE uint16x8x2_t aom_vtrnq_u64_to_u16(const uint32x4_t a0,
- const uint32x4_t a1) {
+static INLINE uint16x8x2_t aom_vtrnq_u64_to_u16(uint32x4_t a0, uint32x4_t a1) {
uint16x8x2_t b0;
+#if AOM_ARCH_AARCH64
+ b0.val[0] = vreinterpretq_u16_u64(
+ vtrn1q_u64(vreinterpretq_u64_u32(a0), vreinterpretq_u64_u32(a1)));
+ b0.val[1] = vreinterpretq_u16_u64(
+ vtrn2q_u64(vreinterpretq_u64_u32(a0), vreinterpretq_u64_u32(a1)));
+#else
b0.val[0] = vcombine_u16(vreinterpret_u16_u32(vget_low_u32(a0)),
vreinterpret_u16_u32(vget_low_u32(a1)));
b0.val[1] = vcombine_u16(vreinterpret_u16_u32(vget_high_u32(a0)),
vreinterpret_u16_u32(vget_high_u32(a1)));
+#endif
return b0;
}
@@ -343,7 +351,7 @@
uint16x4_t *a6, uint16x4_t *a7,
uint16x8_t *o0, uint16x8_t *o1,
uint16x8_t *o2, uint16x8_t *o3) {
- // Swap 16 bit elements. Goes from:
+ // Combine rows. Goes from:
// a0: 00 01 02 03
// a1: 10 11 12 13
// a2: 20 21 22 23
@@ -353,53 +361,40 @@
// a6: 60 61 62 63
// a7: 70 71 72 73
// to:
- // b0.val[0]: 00 10 02 12
- // b0.val[1]: 01 11 03 13
- // b1.val[0]: 20 30 22 32
- // b1.val[1]: 21 31 23 33
- // b2.val[0]: 40 50 42 52
- // b2.val[1]: 41 51 43 53
- // b3.val[0]: 60 70 62 72
- // b3.val[1]: 61 71 63 73
+ // b0: 00 01 02 03 40 41 42 43
+ // b1: 10 11 12 13 50 51 52 53
+ // b2: 20 21 22 23 60 61 62 63
+ // b3: 30 31 32 33 70 71 72 73
- uint16x4x2_t b0 = vtrn_u16(*a0, *a1);
- uint16x4x2_t b1 = vtrn_u16(*a2, *a3);
- uint16x4x2_t b2 = vtrn_u16(*a4, *a5);
- uint16x4x2_t b3 = vtrn_u16(*a6, *a7);
+ const uint16x8_t b0 = vcombine_u16(*a0, *a4);
+ const uint16x8_t b1 = vcombine_u16(*a1, *a5);
+ const uint16x8_t b2 = vcombine_u16(*a2, *a6);
+ const uint16x8_t b3 = vcombine_u16(*a3, *a7);
+
+ // Swap 16 bit elements resulting in:
+ // c0.val[0]: 00 10 02 12 40 50 42 52
+ // c0.val[1]: 01 11 03 13 41 51 43 53
+ // c1.val[0]: 20 30 22 32 60 70 62 72
+ // c1.val[1]: 21 31 23 33 61 71 63 73
+
+ const uint16x8x2_t c0 = vtrnq_u16(b0, b1);
+ const uint16x8x2_t c1 = vtrnq_u16(b2, b3);
// Swap 32 bit elements resulting in:
- // c0.val[0]: 00 10 20 30
- // c0.val[1]: 02 12 22 32
- // c1.val[0]: 01 11 21 31
- // c1.val[1]: 03 13 23 33
- // c2.val[0]: 40 50 60 70
- // c2.val[1]: 42 52 62 72
- // c3.val[0]: 41 51 61 71
- // c3.val[1]: 43 53 63 73
+ // d0.val[0]: 00 10 20 30 40 50 60 70
+ // d0.val[1]: 02 12 22 32 42 52 62 72
+ // d1.val[0]: 01 11 21 31 41 51 61 71
+ // d1.val[1]: 03 13 23 33 43 53 63 73
- uint32x2x2_t c0 = vtrn_u32(vreinterpret_u32_u16(b0.val[0]),
- vreinterpret_u32_u16(b1.val[0]));
- uint32x2x2_t c1 = vtrn_u32(vreinterpret_u32_u16(b0.val[1]),
- vreinterpret_u32_u16(b1.val[1]));
- uint32x2x2_t c2 = vtrn_u32(vreinterpret_u32_u16(b2.val[0]),
- vreinterpret_u32_u16(b3.val[0]));
- uint32x2x2_t c3 = vtrn_u32(vreinterpret_u32_u16(b2.val[1]),
- vreinterpret_u32_u16(b3.val[1]));
+ const uint32x4x2_t d0 = vtrnq_u32(vreinterpretq_u32_u16(c0.val[0]),
+ vreinterpretq_u32_u16(c1.val[0]));
+ const uint32x4x2_t d1 = vtrnq_u32(vreinterpretq_u32_u16(c0.val[1]),
+ vreinterpretq_u32_u16(c1.val[1]));
- // Swap 64 bit elements resulting in:
- // o0: 00 10 20 30 40 50 60 70
- // o1: 01 11 21 31 41 51 61 71
- // o2: 02 12 22 32 42 52 62 72
- // o3: 03 13 23 33 43 53 63 73
-
- *o0 = vcombine_u16(vreinterpret_u16_u32(c0.val[0]),
- vreinterpret_u16_u32(c2.val[0]));
- *o1 = vcombine_u16(vreinterpret_u16_u32(c1.val[0]),
- vreinterpret_u16_u32(c3.val[0]));
- *o2 = vcombine_u16(vreinterpret_u16_u32(c0.val[1]),
- vreinterpret_u16_u32(c2.val[1]));
- *o3 = vcombine_u16(vreinterpret_u16_u32(c1.val[1]),
- vreinterpret_u16_u32(c3.val[1]));
+ *o0 = vreinterpretq_u16_u32(d0.val[0]);
+ *o1 = vreinterpretq_u16_u32(d1.val[0]);
+ *o2 = vreinterpretq_u16_u32(d0.val[1]);
+ *o3 = vreinterpretq_u16_u32(d1.val[1]);
}
static INLINE void transpose_s16_4x8(int16x4_t *a0, int16x4_t *a1,
@@ -408,7 +403,7 @@
int16x4_t *a6, int16x4_t *a7,
int16x8_t *o0, int16x8_t *o1,
int16x8_t *o2, int16x8_t *o3) {
- // Swap 16 bit elements. Goes from:
+ // Combine rows. Goes from:
// a0: 00 01 02 03
// a1: 10 11 12 13
// a2: 20 21 22 23
@@ -418,53 +413,40 @@
// a6: 60 61 62 63
// a7: 70 71 72 73
// to:
- // b0.val[0]: 00 10 02 12
- // b0.val[1]: 01 11 03 13
- // b1.val[0]: 20 30 22 32
- // b1.val[1]: 21 31 23 33
- // b2.val[0]: 40 50 42 52
- // b2.val[1]: 41 51 43 53
- // b3.val[0]: 60 70 62 72
- // b3.val[1]: 61 71 63 73
+ // b0: 00 01 02 03 40 41 42 43
+ // b1: 10 11 12 13 50 51 52 53
+ // b2: 20 21 22 23 60 61 62 63
+ // b3: 30 31 32 33 70 71 72 73
- int16x4x2_t b0 = vtrn_s16(*a0, *a1);
- int16x4x2_t b1 = vtrn_s16(*a2, *a3);
- int16x4x2_t b2 = vtrn_s16(*a4, *a5);
- int16x4x2_t b3 = vtrn_s16(*a6, *a7);
+ const int16x8_t b0 = vcombine_s16(*a0, *a4);
+ const int16x8_t b1 = vcombine_s16(*a1, *a5);
+ const int16x8_t b2 = vcombine_s16(*a2, *a6);
+ const int16x8_t b3 = vcombine_s16(*a3, *a7);
+
+ // Swap 16 bit elements resulting in:
+ // c0.val[0]: 00 10 02 12 40 50 42 52
+ // c0.val[1]: 01 11 03 13 41 51 43 53
+ // c1.val[0]: 20 30 22 32 60 70 62 72
+ // c1.val[1]: 21 31 23 33 61 71 63 73
+
+ const int16x8x2_t c0 = vtrnq_s16(b0, b1);
+ const int16x8x2_t c1 = vtrnq_s16(b2, b3);
// Swap 32 bit elements resulting in:
- // c0.val[0]: 00 10 20 30
- // c0.val[1]: 02 12 22 32
- // c1.val[0]: 01 11 21 31
- // c1.val[1]: 03 13 23 33
- // c2.val[0]: 40 50 60 70
- // c2.val[1]: 42 52 62 72
- // c3.val[0]: 41 51 61 71
- // c3.val[1]: 43 53 63 73
+ // d0.val[0]: 00 10 20 30 40 50 60 70
+ // d0.val[1]: 02 12 22 32 42 52 62 72
+ // d1.val[0]: 01 11 21 31 41 51 61 71
+ // d1.val[1]: 03 13 23 33 43 53 63 73
- int32x2x2_t c0 = vtrn_s32(vreinterpret_s32_s16(b0.val[0]),
- vreinterpret_s32_s16(b1.val[0]));
- int32x2x2_t c1 = vtrn_s32(vreinterpret_s32_s16(b0.val[1]),
- vreinterpret_s32_s16(b1.val[1]));
- int32x2x2_t c2 = vtrn_s32(vreinterpret_s32_s16(b2.val[0]),
- vreinterpret_s32_s16(b3.val[0]));
- int32x2x2_t c3 = vtrn_s32(vreinterpret_s32_s16(b2.val[1]),
- vreinterpret_s32_s16(b3.val[1]));
+ const int32x4x2_t d0 = vtrnq_s32(vreinterpretq_s32_s16(c0.val[0]),
+ vreinterpretq_s32_s16(c1.val[0]));
+ const int32x4x2_t d1 = vtrnq_s32(vreinterpretq_s32_s16(c0.val[1]),
+ vreinterpretq_s32_s16(c1.val[1]));
- // Swap 64 bit elements resulting in:
- // o0: 00 10 20 30 40 50 60 70
- // o1: 01 11 21 31 41 51 61 71
- // o2: 02 12 22 32 42 52 62 72
- // o3: 03 13 23 33 43 53 63 73
-
- *o0 = vcombine_s16(vreinterpret_s16_s32(c0.val[0]),
- vreinterpret_s16_s32(c2.val[0]));
- *o1 = vcombine_s16(vreinterpret_s16_s32(c1.val[0]),
- vreinterpret_s16_s32(c3.val[0]));
- *o2 = vcombine_s16(vreinterpret_s16_s32(c0.val[1]),
- vreinterpret_s16_s32(c2.val[1]));
- *o3 = vcombine_s16(vreinterpret_s16_s32(c1.val[1]),
- vreinterpret_s16_s32(c3.val[1]));
+ *o0 = vreinterpretq_s16_s32(d0.val[0]);
+ *o1 = vreinterpretq_s16_s32(d1.val[0]);
+ *o2 = vreinterpretq_s16_s32(d0.val[1]);
+ *o3 = vreinterpretq_s16_s32(d1.val[1]);
}
static INLINE void transpose_u16_8x8(uint16x8_t *a0, uint16x8_t *a1,
@@ -514,25 +496,45 @@
const uint32x4x2_t c3 = vtrnq_u32(vreinterpretq_u32_u16(b2.val[1]),
vreinterpretq_u32_u16(b3.val[1]));
- *a0 = vcombine_u16(vget_low_u16(vreinterpretq_u16_u32(c0.val[0])),
- vget_low_u16(vreinterpretq_u16_u32(c2.val[0])));
- *a4 = vcombine_u16(vget_high_u16(vreinterpretq_u16_u32(c0.val[0])),
- vget_high_u16(vreinterpretq_u16_u32(c2.val[0])));
+ // Swap 64 bit elements resulting in:
+ // d0.val[0]: 00 10 20 30 40 50 60 70
+ // d0.val[1]: 04 14 24 34 44 54 64 74
+ // d1.val[0]: 01 11 21 31 41 51 61 71
+ // d1.val[1]: 05 15 25 35 45 55 65 75
+ // d2.val[0]: 02 12 22 32 42 52 62 72
+ // d2.val[1]: 06 16 26 36 46 56 66 76
+ // d3.val[0]: 03 13 23 33 43 53 63 73
+ // d3.val[1]: 07 17 27 37 47 57 67 77
- *a2 = vcombine_u16(vget_low_u16(vreinterpretq_u16_u32(c0.val[1])),
- vget_low_u16(vreinterpretq_u16_u32(c2.val[1])));
- *a6 = vcombine_u16(vget_high_u16(vreinterpretq_u16_u32(c0.val[1])),
- vget_high_u16(vreinterpretq_u16_u32(c2.val[1])));
+ const uint16x8x2_t d0 = aom_vtrnq_u64_to_u16(c0.val[0], c2.val[0]);
+ const uint16x8x2_t d1 = aom_vtrnq_u64_to_u16(c1.val[0], c3.val[0]);
+ const uint16x8x2_t d2 = aom_vtrnq_u64_to_u16(c0.val[1], c2.val[1]);
+ const uint16x8x2_t d3 = aom_vtrnq_u64_to_u16(c1.val[1], c3.val[1]);
- *a1 = vcombine_u16(vget_low_u16(vreinterpretq_u16_u32(c1.val[0])),
- vget_low_u16(vreinterpretq_u16_u32(c3.val[0])));
- *a5 = vcombine_u16(vget_high_u16(vreinterpretq_u16_u32(c1.val[0])),
- vget_high_u16(vreinterpretq_u16_u32(c3.val[0])));
+ *a0 = d0.val[0];
+ *a1 = d1.val[0];
+ *a2 = d2.val[0];
+ *a3 = d3.val[0];
+ *a4 = d0.val[1];
+ *a5 = d1.val[1];
+ *a6 = d2.val[1];
+ *a7 = d3.val[1];
+}
- *a3 = vcombine_u16(vget_low_u16(vreinterpretq_u16_u32(c1.val[1])),
- vget_low_u16(vreinterpretq_u16_u32(c3.val[1])));
- *a7 = vcombine_u16(vget_high_u16(vreinterpretq_u16_u32(c1.val[1])),
- vget_high_u16(vreinterpretq_u16_u32(c3.val[1])));
+static INLINE int16x8x2_t aom_vtrnq_s64_to_s16(int32x4_t a0, int32x4_t a1) {
+ int16x8x2_t b0;
+#if AOM_ARCH_AARCH64
+ b0.val[0] = vreinterpretq_s16_s64(
+ vtrn1q_s64(vreinterpretq_s64_s32(a0), vreinterpretq_s64_s32(a1)));
+ b0.val[1] = vreinterpretq_s16_s64(
+ vtrn2q_s64(vreinterpretq_s64_s32(a0), vreinterpretq_s64_s32(a1)));
+#else
+ b0.val[0] = vcombine_s16(vreinterpret_s16_s32(vget_low_s32(a0)),
+ vreinterpret_s16_s32(vget_low_s32(a1)));
+ b0.val[1] = vcombine_s16(vreinterpret_s16_s32(vget_high_s32(a0)),
+ vreinterpret_s16_s32(vget_high_s32(a1)));
+#endif
+ return b0;
}
static INLINE void transpose_s16_8x8(int16x8_t *a0, int16x8_t *a1,
@@ -582,37 +584,32 @@
const int32x4x2_t c3 = vtrnq_s32(vreinterpretq_s32_s16(b2.val[1]),
vreinterpretq_s32_s16(b3.val[1]));
- *a0 = vcombine_s16(vget_low_s16(vreinterpretq_s16_s32(c0.val[0])),
- vget_low_s16(vreinterpretq_s16_s32(c2.val[0])));
- *a4 = vcombine_s16(vget_high_s16(vreinterpretq_s16_s32(c0.val[0])),
- vget_high_s16(vreinterpretq_s16_s32(c2.val[0])));
+ // Swap 64 bit elements resulting in:
+ // d0.val[0]: 00 10 20 30 40 50 60 70
+ // d0.val[1]: 04 14 24 34 44 54 64 74
+ // d1.val[0]: 01 11 21 31 41 51 61 71
+ // d1.val[1]: 05 15 25 35 45 55 65 75
+ // d2.val[0]: 02 12 22 32 42 52 62 72
+ // d2.val[1]: 06 16 26 36 46 56 66 76
+ // d3.val[0]: 03 13 23 33 43 53 63 73
+ // d3.val[1]: 07 17 27 37 47 57 67 77
- *a2 = vcombine_s16(vget_low_s16(vreinterpretq_s16_s32(c0.val[1])),
- vget_low_s16(vreinterpretq_s16_s32(c2.val[1])));
- *a6 = vcombine_s16(vget_high_s16(vreinterpretq_s16_s32(c0.val[1])),
- vget_high_s16(vreinterpretq_s16_s32(c2.val[1])));
+ const int16x8x2_t d0 = aom_vtrnq_s64_to_s16(c0.val[0], c2.val[0]);
+ const int16x8x2_t d1 = aom_vtrnq_s64_to_s16(c1.val[0], c3.val[0]);
+ const int16x8x2_t d2 = aom_vtrnq_s64_to_s16(c0.val[1], c2.val[1]);
+ const int16x8x2_t d3 = aom_vtrnq_s64_to_s16(c1.val[1], c3.val[1]);
- *a1 = vcombine_s16(vget_low_s16(vreinterpretq_s16_s32(c1.val[0])),
- vget_low_s16(vreinterpretq_s16_s32(c3.val[0])));
- *a5 = vcombine_s16(vget_high_s16(vreinterpretq_s16_s32(c1.val[0])),
- vget_high_s16(vreinterpretq_s16_s32(c3.val[0])));
-
- *a3 = vcombine_s16(vget_low_s16(vreinterpretq_s16_s32(c1.val[1])),
- vget_low_s16(vreinterpretq_s16_s32(c3.val[1])));
- *a7 = vcombine_s16(vget_high_s16(vreinterpretq_s16_s32(c1.val[1])),
- vget_high_s16(vreinterpretq_s16_s32(c3.val[1])));
+ *a0 = d0.val[0];
+ *a1 = d1.val[0];
+ *a2 = d2.val[0];
+ *a3 = d3.val[0];
+ *a4 = d0.val[1];
+ *a5 = d1.val[1];
+ *a6 = d2.val[1];
+ *a7 = d3.val[1];
}
-static INLINE int16x8x2_t aom_vtrnq_s64_to_s16(int32x4_t a0, int32x4_t a1) {
- int16x8x2_t b0;
- b0.val[0] = vcombine_s16(vreinterpret_s16_s32(vget_low_s32(a0)),
- vreinterpret_s16_s32(vget_low_s32(a1)));
- b0.val[1] = vcombine_s16(vreinterpret_s16_s32(vget_high_s32(a0)),
- vreinterpret_s16_s32(vget_high_s32(a1)));
- return b0;
-}
-
-static INLINE void transpose_s16_8x8q(int16x8_t *a0, int16x8_t *out) {
+static INLINE void transpose_s16_8x8q(int16x8_t *a, int16x8_t *out) {
// Swap 16 bit elements. Goes from:
// a0: 00 01 02 03 04 05 06 07
// a1: 10 11 12 13 14 15 16 17
@@ -632,10 +629,10 @@
// b3.val[0]: 60 70 62 72 64 74 66 76
// b3.val[1]: 61 71 63 73 65 75 67 77
- const int16x8x2_t b0 = vtrnq_s16(*a0, *(a0 + 1));
- const int16x8x2_t b1 = vtrnq_s16(*(a0 + 2), *(a0 + 3));
- const int16x8x2_t b2 = vtrnq_s16(*(a0 + 4), *(a0 + 5));
- const int16x8x2_t b3 = vtrnq_s16(*(a0 + 6), *(a0 + 7));
+ const int16x8x2_t b0 = vtrnq_s16(a[0], a[1]);
+ const int16x8x2_t b1 = vtrnq_s16(a[2], a[3]);
+ const int16x8x2_t b2 = vtrnq_s16(a[4], a[5]);
+ const int16x8x2_t b3 = vtrnq_s16(a[6], a[7]);
// Swap 32 bit elements resulting in:
// c0.val[0]: 00 10 20 30 04 14 24 34
@@ -665,19 +662,53 @@
// d2.val[1]: 06 16 26 36 46 56 66 76
// d3.val[0]: 03 13 23 33 43 53 63 73
// d3.val[1]: 07 17 27 37 47 57 67 77
+
const int16x8x2_t d0 = aom_vtrnq_s64_to_s16(c0.val[0], c2.val[0]);
const int16x8x2_t d1 = aom_vtrnq_s64_to_s16(c1.val[0], c3.val[0]);
const int16x8x2_t d2 = aom_vtrnq_s64_to_s16(c0.val[1], c2.val[1]);
const int16x8x2_t d3 = aom_vtrnq_s64_to_s16(c1.val[1], c3.val[1]);
- *out = d0.val[0];
- *(out + 1) = d1.val[0];
- *(out + 2) = d2.val[0];
- *(out + 3) = d3.val[0];
- *(out + 4) = d0.val[1];
- *(out + 5) = d1.val[1];
- *(out + 6) = d2.val[1];
- *(out + 7) = d3.val[1];
+ out[0] = d0.val[0];
+ out[1] = d1.val[0];
+ out[2] = d2.val[0];
+ out[3] = d3.val[0];
+ out[4] = d0.val[1];
+ out[5] = d1.val[1];
+ out[6] = d2.val[1];
+ out[7] = d3.val[1];
+}
+
+static INLINE void transpose_u16_4x4d(uint16x4_t *a0, uint16x4_t *a1,
+ uint16x4_t *a2, uint16x4_t *a3) {
+ // Swap 16 bit elements. Goes from:
+ // a0: 00 01 02 03
+ // a1: 10 11 12 13
+ // a2: 20 21 22 23
+ // a3: 30 31 32 33
+ // to:
+ // b0.val[0]: 00 10 02 12
+ // b0.val[1]: 01 11 03 13
+ // b1.val[0]: 20 30 22 32
+ // b1.val[1]: 21 31 23 33
+
+ const uint16x4x2_t b0 = vtrn_u16(*a0, *a1);
+ const uint16x4x2_t b1 = vtrn_u16(*a2, *a3);
+
+ // Swap 32 bit elements resulting in:
+ // c0.val[0]: 00 10 20 30
+ // c0.val[1]: 02 12 22 32
+ // c1.val[0]: 01 11 21 31
+ // c1.val[1]: 03 13 23 33
+
+ const uint32x2x2_t c0 = vtrn_u32(vreinterpret_u32_u16(b0.val[0]),
+ vreinterpret_u32_u16(b1.val[0]));
+ const uint32x2x2_t c1 = vtrn_u32(vreinterpret_u32_u16(b0.val[1]),
+ vreinterpret_u32_u16(b1.val[1]));
+
+ *a0 = vreinterpret_u16_u32(c0.val[0]);
+ *a1 = vreinterpret_u16_u32(c1.val[0]);
+ *a2 = vreinterpret_u16_u32(c0.val[1]);
+ *a3 = vreinterpret_u16_u32(c1.val[1]);
}
static INLINE void transpose_s16_4x4d(int16x4_t *a0, int16x4_t *a1,
@@ -715,8 +746,15 @@
static INLINE int32x4x2_t aom_vtrnq_s64_to_s32(int32x4_t a0, int32x4_t a1) {
int32x4x2_t b0;
+#if AOM_ARCH_AARCH64
+ b0.val[0] = vreinterpretq_s32_s64(
+ vtrn1q_s64(vreinterpretq_s64_s32(a0), vreinterpretq_s64_s32(a1)));
+ b0.val[1] = vreinterpretq_s32_s64(
+ vtrn2q_s64(vreinterpretq_s64_s32(a0), vreinterpretq_s64_s32(a1)));
+#else
b0.val[0] = vcombine_s32(vget_low_s32(a0), vget_low_s32(a1));
b0.val[1] = vcombine_s32(vget_high_s32(a0), vget_high_s32(a1));
+#endif
return b0;
}
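
Both paths of these aom_vtrnq_*64_to_* helpers compute the same interleave; the AArch64 branch simply maps it onto single TRN1/TRN2 instructions at 64-bit granularity. A scalar reference (hypothetical helper), viewing each 128-bit vector as two 64-bit halves:

#include <string.h>

static void vtrnq_s64_to_s32_ref(const int32_t a0[4], const int32_t a1[4],
                                 int32_t out0[4], int32_t out1[4]) {
  memcpy(&out0[0], &a0[0], 2 * sizeof(int32_t));  // low half of a0
  memcpy(&out0[2], &a1[0], 2 * sizeof(int32_t));  // low half of a1
  memcpy(&out1[0], &a0[2], 2 * sizeof(int32_t));  // high half of a0
  memcpy(&out1[2], &a1[2], 2 * sizeof(int32_t));  // high half of a1
}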
diff --git a/aom_dsp/arm/variance_neon.c b/aom_dsp/arm/variance_neon.c
index 40e40f0..5e33996 100644
--- a/aom_dsp/arm/variance_neon.c
+++ b/aom_dsp/arm/variance_neon.c
@@ -27,7 +27,7 @@
uint32x4_t ref_sum = vdupq_n_u32(0);
uint32x4_t sse_u32 = vdupq_n_u32(0);
- int i = 0;
+ int i = h;
do {
uint8x16_t s = load_unaligned_u8q(src, src_stride);
uint8x16_t r = load_unaligned_u8q(ref, ref_stride);
@@ -40,8 +40,8 @@
src += 4 * src_stride;
ref += 4 * ref_stride;
- i += 4;
- } while (i < h);
+ i -= 4;
+ } while (i != 0);
int32x4_t sum_diff =
vsubq_s32(vreinterpretq_s32_u32(src_sum), vreinterpretq_s32_u32(ref_sum));
@@ -56,7 +56,7 @@
uint32x4_t ref_sum = vdupq_n_u32(0);
uint32x4_t sse_u32 = vdupq_n_u32(0);
- int i = 0;
+ int i = h;
do {
uint8x16_t s = vcombine_u8(vld1_u8(src), vld1_u8(src + src_stride));
uint8x16_t r = vcombine_u8(vld1_u8(ref), vld1_u8(ref + ref_stride));
@@ -69,8 +69,8 @@
src += 2 * src_stride;
ref += 2 * ref_stride;
- i += 2;
- } while (i < h);
+ i -= 2;
+ } while (i != 0);
int32x4_t sum_diff =
vsubq_s32(vreinterpretq_s32_u32(src_sum), vreinterpretq_s32_u32(ref_sum));
@@ -85,7 +85,7 @@
uint32x4_t ref_sum = vdupq_n_u32(0);
uint32x4_t sse_u32 = vdupq_n_u32(0);
- int i = 0;
+ int i = h;
do {
uint8x16_t s = vld1q_u8(src);
uint8x16_t r = vld1q_u8(ref);
@@ -98,8 +98,7 @@
src += src_stride;
ref += ref_stride;
- i++;
- } while (i < h);
+ } while (--i != 0);
int32x4_t sum_diff =
vsubq_s32(vreinterpretq_s32_u32(src_sum), vreinterpretq_s32_u32(ref_sum));
@@ -114,7 +113,7 @@
uint32x4_t ref_sum = vdupq_n_u32(0);
uint32x4_t sse_u32 = vdupq_n_u32(0);
- int i = 0;
+ int i = h;
do {
int j = 0;
do {
@@ -132,8 +131,7 @@
src += src_stride;
ref += ref_stride;
- i++;
- } while (i < h);
+ } while (--i != 0);
int32x4_t sum_diff =
vsubq_s32(vreinterpretq_s32_u32(src_sum), vreinterpretq_s32_u32(ref_sum));
@@ -171,7 +169,7 @@
// 32767 / 255 ~= 128, but we use an 8-wide accumulator; so 256 4-wide rows.
assert(h <= 256);
- int i = 0;
+ int i = h;
do {
uint8x8_t s = load_unaligned_u8(src, src_stride);
uint8x8_t r = load_unaligned_u8(ref, ref_stride);
@@ -184,8 +182,8 @@
src += 2 * src_stride;
ref += 2 * ref_stride;
- i += 2;
- } while (i < h);
+ i -= 2;
+ } while (i != 0);
*sum = horizontal_add_s16x8(sum_s16);
*sse = (uint32_t)horizontal_add_s32x4(sse_s32);
@@ -201,7 +199,7 @@
// 32767 / 255 ~= 128
assert(h <= 128);
- int i = 0;
+ int i = h;
do {
uint8x8_t s = vld1_u8(src);
uint8x8_t r = vld1_u8(ref);
@@ -215,8 +213,7 @@
src += src_stride;
ref += ref_stride;
- i++;
- } while (i < h);
+ } while (--i != 0);
*sum = horizontal_add_s16x8(sum_s16);
*sse = (uint32_t)horizontal_add_s32x4(vaddq_s32(sse_s32[0], sse_s32[1]));
@@ -232,7 +229,7 @@
// 32767 / 255 ~= 128, so 128 16-wide rows.
assert(h <= 128);
- int i = 0;
+ int i = h;
do {
uint8x16_t s = vld1q_u8(src);
uint8x16_t r = vld1q_u8(ref);
@@ -256,8 +253,7 @@
src += src_stride;
ref += ref_stride;
- i++;
- } while (i < h);
+ } while (--i != 0);
*sum = horizontal_add_s16x8(vaddq_s16(sum_s16[0], sum_s16[1]));
*sse = (uint32_t)horizontal_add_s32x4(vaddq_s32(sse_s32[0], sse_s32[1]));
@@ -378,17 +374,6 @@
#undef VARIANCE_WXH_NEON
-void aom_get8x8var_neon(const uint8_t *src, int src_stride, const uint8_t *ref,
- int ref_stride, unsigned int *sse, int *sum) {
- variance_8xh_neon(src, src_stride, ref, ref_stride, 8, sse, sum);
-}
-
-void aom_get16x16var_neon(const uint8_t *src, int src_stride,
- const uint8_t *ref, int ref_stride, unsigned int *sse,
- int *sum) {
- variance_16xh_neon(src, src_stride, ref, ref_stride, 16, sse, sum);
-}
-
// TODO(yunqingwang): Perform variance of two/four 8x8 blocks similar to that of
// AVX2. Also, implement the NEON for variance computation present in this
// function.
@@ -409,6 +394,25 @@
var8x8[i] = sse8x8[i] - (uint32_t)(((int64_t)sum8x8[i] * sum8x8[i]) >> 6);
}
+void aom_get_var_sse_sum_16x16_dual_neon(const uint8_t *src, int src_stride,
+ const uint8_t *ref, int ref_stride,
+ uint32_t *sse16x16,
+ unsigned int *tot_sse, int *tot_sum,
+ uint32_t *var16x16) {
+ int sum16x16[2] = { 0 };
+  // Loop over two horizontally adjacent 16x16 blocks, i.e. one 32x16 region.
+ for (int k = 0; k < 2; k++) {
+ variance_16xh_neon(src + (k * 16), src_stride, ref + (k * 16), ref_stride,
+ 16, &sse16x16[k], &sum16x16[k]);
+ }
+
+ *tot_sse += sse16x16[0] + sse16x16[1];
+ *tot_sum += sum16x16[0] + sum16x16[1];
+ for (int i = 0; i < 2; i++)
+ var16x16[i] =
+ sse16x16[i] - (uint32_t)(((int64_t)sum16x16[i] * sum16x16[i]) >> 8);
+}
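
Each block is 16x16 = 256 pixels, so the mean correction sum * sum / 256 becomes the >> 8 above; the 8x8 helper earlier in this function's neighborhood uses >> 6 for the same reason (64 pixels). In short:

  var = sse - sum^2 / N, with N a power of two, implemented as a right shift.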
+
#if defined(__ARM_FEATURE_DOTPROD)
static INLINE unsigned int mse8xh_neon(const uint8_t *src, int src_stride,
@@ -416,7 +420,7 @@
unsigned int *sse, int h) {
uint32x4_t sse_u32 = vdupq_n_u32(0);
- int i = 0;
+ int i = h;
do {
uint8x16_t s = vcombine_u8(vld1_u8(src), vld1_u8(src + src_stride));
uint8x16_t r = vcombine_u8(vld1_u8(ref), vld1_u8(ref + ref_stride));
@@ -427,8 +431,8 @@
src += 2 * src_stride;
ref += 2 * ref_stride;
- i += 2;
- } while (i < h);
+ i -= 2;
+ } while (i != 0);
*sse = horizontal_add_u32x4(sse_u32);
return horizontal_add_u32x4(sse_u32);
@@ -439,7 +443,7 @@
unsigned int *sse, int h) {
uint32x4_t sse_u32[2] = { vdupq_n_u32(0), vdupq_n_u32(0) };
- int i = 0;
+ int i = h;
do {
uint8x16_t s0 = vld1q_u8(src);
uint8x16_t s1 = vld1q_u8(src + src_stride);
@@ -454,25 +458,13 @@
src += 2 * src_stride;
ref += 2 * ref_stride;
- i += 2;
- } while (i < h);
+ i -= 2;
+ } while (i != 0);
*sse = horizontal_add_u32x4(vaddq_u32(sse_u32[0], sse_u32[1]));
return horizontal_add_u32x4(vaddq_u32(sse_u32[0], sse_u32[1]));
}
-unsigned int aom_get4x4sse_cs_neon(const uint8_t *src, int src_stride,
- const uint8_t *ref, int ref_stride) {
- uint8x16_t s = load_unaligned_u8q(src, src_stride);
- uint8x16_t r = load_unaligned_u8q(ref, ref_stride);
-
- uint8x16_t abs_diff = vabdq_u8(s, r);
-
- uint32x4_t sse = vdotq_u32(vdupq_n_u32(0), abs_diff, abs_diff);
-
- return horizontal_add_u32x4(sse);
-}
-
#else // !defined(__ARM_FEATURE_DOTPROD)
static INLINE unsigned int mse8xh_neon(const uint8_t *src, int src_stride,
@@ -483,7 +475,7 @@
uint16x8_t diff[2];
int32x4_t sse_s32[2] = { vdupq_n_s32(0), vdupq_n_s32(0) };
- int i = 0;
+ int i = h;
do {
s[0] = vld1_u8(src);
src += src_stride;
@@ -507,8 +499,8 @@
sse_s32[0] = vmlal_s16(sse_s32[0], diff_hi[0], diff_hi[0]);
sse_s32[1] = vmlal_s16(sse_s32[1], diff_hi[1], diff_hi[1]);
- i += 2;
- } while (i < h);
+ i -= 2;
+ } while (i != 0);
sse_s32[0] = vaddq_s32(sse_s32[0], sse_s32[1]);
@@ -525,7 +517,7 @@
int32x4_t sse_s32[4] = { vdupq_n_s32(0), vdupq_n_s32(0), vdupq_n_s32(0),
vdupq_n_s32(0) };
- int i = 0;
+ int i = h;
do {
s[0] = vld1q_u8(src);
src += src_stride;
@@ -561,8 +553,8 @@
sse_s32[2] = vmlal_s16(sse_s32[2], diff_hi[2], diff_hi[2]);
sse_s32[3] = vmlal_s16(sse_s32[3], diff_hi[3], diff_hi[3]);
- i += 2;
- } while (i < h);
+ i -= 2;
+ } while (i != 0);
sse_s32[0] = vaddq_s32(sse_s32[0], sse_s32[1]);
sse_s32[2] = vaddq_s32(sse_s32[2], sse_s32[3]);
@@ -572,40 +564,6 @@
return horizontal_add_u32x4(vreinterpretq_u32_s32(sse_s32[0]));
}
-unsigned int aom_get4x4sse_cs_neon(const uint8_t *src, int src_stride,
- const uint8_t *ref, int ref_stride) {
- uint8x8_t s[4], r[4];
- int16x4_t diff[4];
- int32x4_t sse;
-
- s[0] = vld1_u8(src);
- src += src_stride;
- r[0] = vld1_u8(ref);
- ref += ref_stride;
- s[1] = vld1_u8(src);
- src += src_stride;
- r[1] = vld1_u8(ref);
- ref += ref_stride;
- s[2] = vld1_u8(src);
- src += src_stride;
- r[2] = vld1_u8(ref);
- ref += ref_stride;
- s[3] = vld1_u8(src);
- r[3] = vld1_u8(ref);
-
- diff[0] = vget_low_s16(vreinterpretq_s16_u16(vsubl_u8(s[0], r[0])));
- diff[1] = vget_low_s16(vreinterpretq_s16_u16(vsubl_u8(s[1], r[1])));
- diff[2] = vget_low_s16(vreinterpretq_s16_u16(vsubl_u8(s[2], r[2])));
- diff[3] = vget_low_s16(vreinterpretq_s16_u16(vsubl_u8(s[3], r[3])));
-
- sse = vmull_s16(diff[0], diff[0]);
- sse = vmlal_s16(sse, diff[1], diff[1]);
- sse = vmlal_s16(sse, diff[2], diff[2]);
- sse = vmlal_s16(sse, diff[3], diff[3]);
-
- return horizontal_add_u32x4(vreinterpretq_u32_s32(sse));
-}
-
#endif // defined(__ARM_FEATURE_DOTPROD)
#define MSE_WXH_NEON(w, h) \
@@ -647,7 +605,7 @@
int h) {
uint64x2_t square_result = vdupq_n_u64(0);
uint32_t d0, d1;
- int i = 0;
+ int i = h;
uint8_t *dst_ptr = dst;
uint16_t *src_ptr = src;
do {
@@ -678,8 +636,8 @@
const uint16x8_t src_16x8 = vcombine_u16(src0_16x4, src1_16x4);
COMPUTE_MSE_16BIT(src_16x8, dst_16x8)
- i += 2;
- } while (i < h);
+ i -= 2;
+ } while (i != 0);
uint64x1_t sum =
vadd_u64(vget_high_u64(square_result), vget_low_u64(square_result));
return vget_lane_u64(sum, 0);
@@ -689,16 +647,18 @@
uint16_t *src, int sstride,
int h) {
uint64x2_t square_result = vdupq_n_u64(0);
- int i = 0;
+ int i = h;
do {
// d7 d6 d5 d4 d3 d2 d1 d0 - 8 bit
- const uint16x8_t dst_16x8 = vmovl_u8(vld1_u8(&dst[i * dstride]));
+ const uint16x8_t dst_16x8 = vmovl_u8(vld1_u8(dst));
// s7 s6 s5 s4 s3 s2 s1 s0 - 16 bit
- const uint16x8_t src_16x8 = vld1q_u16(&src[i * sstride]);
+ const uint16x8_t src_16x8 = vld1q_u16(src);
COMPUTE_MSE_16BIT(src_16x8, dst_16x8)
- i++;
- } while (i < h);
+
+ dst += dstride;
+ src += sstride;
+ } while (--i != 0);
uint64x1_t sum =
vadd_u64(vget_high_u64(square_result), vget_low_u64(square_result));
return vget_lane_u64(sum, 0);
diff --git a/aom_dsp/avg.c b/aom_dsp/avg.c
index ceb1026..7b36bf3 100644
--- a/aom_dsp/avg.c
+++ b/aom_dsp/avg.c
@@ -87,7 +87,7 @@
int i, j;
const uint16_t *s = CONVERT_TO_SHORTPTR(s8);
const uint16_t *d = CONVERT_TO_SHORTPTR(d8);
- *min = 255;
+ *min = 65535;
*max = 0;
for (i = 0; i < 8; ++i, s += p, d += dp) {
for (j = 0; j < 8; ++j) {
@@ -99,14 +99,6 @@
}
#endif // CONFIG_AV1_HIGHBITDEPTH
-void aom_pixel_scale_c(const int16_t *src_diff, ptrdiff_t src_stride,
- int16_t *coeff, int log_scale, int h8, int w8) {
- for (int idy = 0; idy < h8 * 8; ++idy)
- for (int idx = 0; idx < w8 * 8; ++idx)
- coeff[idy * (h8 * 8) + idx] = src_diff[idy * src_stride + idx]
- << log_scale;
-}
-
static void hadamard_col4(const int16_t *src_diff, ptrdiff_t src_stride,
int16_t *coeff) {
int16_t b0 = (src_diff[0 * src_stride] + src_diff[1 * src_stride]) >> 1;
@@ -333,19 +325,19 @@
aom_hadamard_16x16_c(src_ptr, src_stride, coeff + idx * 256);
}
- // coeff: 15 bit, dynamic range [-16320, 16320]
+ // coeff: 16 bit, dynamic range [-32768, 32767]
for (idx = 0; idx < 256; ++idx) {
tran_low_t a0 = coeff[0];
tran_low_t a1 = coeff[256];
tran_low_t a2 = coeff[512];
tran_low_t a3 = coeff[768];
- tran_low_t b0 = (a0 + a1) >> 2; // (a0 + a1): 16 bit, [-32640, 32640]
+ tran_low_t b0 = (a0 + a1) >> 2; // (a0 + a1): 17 bit, [-65536, 65535]
tran_low_t b1 = (a0 - a1) >> 2; // b0-b3: 15 bit, dynamic range
- tran_low_t b2 = (a2 + a3) >> 2; // [-16320, 16320]
+ tran_low_t b2 = (a2 + a3) >> 2; // [-16384, 16383]
tran_low_t b3 = (a2 - a3) >> 2;
- coeff[0] = b0 + b2; // 16 bit, [-32640, 32640]
+ coeff[0] = b0 + b2; // 16 bit, [-32768, 32767]
coeff[256] = b1 + b3;
coeff[512] = b0 - b2;
coeff[768] = b1 - b3;
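
A quick check of the corrected dynamic-range comments, as a worked chain:

  a0, a1 in [-32768, 32767]  ->  a0 + a1 in [-65536, 65534]  (17 bit)
  (a0 + a1) >> 2             ->  [-16384, 16383]             (15 bit)
  b0 + b2                    ->  [-32768, 32766]             (16 bit)

so the final combine still fits within a 16-bit lane, which is what SIMD versions of this 32x32 stage rely on.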
diff --git a/aom_dsp/entdec.c b/aom_dsp/entdec.c
index da43e8a..5bbcdda 100644
--- a/aom_dsp/entdec.c
+++ b/aom_dsp/entdec.c
@@ -205,14 +205,14 @@
assert(dif >> (OD_EC_WINDOW_SIZE - 16) < r);
assert(icdf[nsyms - 1] == OD_ICDF(CDF_PROB_TOP));
assert(32768U <= r);
- assert(7 - EC_PROB_SHIFT - CDF_SHIFT >= 0);
+ assert(7 - EC_PROB_SHIFT >= 0);
c = (unsigned)(dif >> (OD_EC_WINDOW_SIZE - 16));
v = r;
ret = -1;
do {
u = v;
v = ((r >> 8) * (uint32_t)(icdf[++ret] >> EC_PROB_SHIFT) >>
- (7 - EC_PROB_SHIFT - CDF_SHIFT));
+ (7 - EC_PROB_SHIFT));
v += EC_MIN_PROB * (N - ret);
} while (c < v);
assert(v < u);
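
The update above approximates v = r * cdf / 32768 with a cheap narrow multiply. As a hedged numeric check (assuming EC_PROB_SHIFT is 6, as defined in entcode.h): with r = 40000 and icdf[ret] = 16384 (half the probability mass remaining), (40000 >> 8) * (16384 >> 6) >> 1 = 156 * 256 >> 1 = 19968, close to the exact r / 2 = 20000. The EC_MIN_PROB * (N - ret) term then guarantees every remaining symbol keeps a nonzero slice of the range even when its coded probability rounds to zero.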
diff --git a/aom_dsp/entenc.c b/aom_dsp/entenc.c
index 2fd4493..dfc1624 100644
--- a/aom_dsp/entenc.c
+++ b/aom_dsp/entenc.c
@@ -49,11 +49,11 @@
}*/
/*Takes updated low and range values, renormalizes them so that
- 32768 <= rng < 65536 (flushing bytes from low to the pre-carry buffer if
+ 32768 <= rng < 65536 (flushing bytes from low to the output buffer if
necessary), and stores them back in the encoder context.
low: The new value of low.
rng: The new value of the range.*/
-static void od_ec_enc_normalize(od_ec_enc *enc, od_ec_window low,
+static void od_ec_enc_normalize(od_ec_enc *enc, od_ec_enc_window low,
unsigned rng) {
int d;
int c;
@@ -63,44 +63,59 @@
/*The number of leading zeros in the 16-bit binary representation of rng.*/
d = 16 - OD_ILOG_NZ(rng);
s = c + d;
- /*TODO: Right now we flush every time we have at least one byte available.
- Instead we should use an od_ec_window and flush right before we're about to
- shift bits off the end of the window.
- For a 32-bit window this is about the same amount of work, but for a 64-bit
- window it should be a fair win.*/
- if (s >= 0) {
- uint16_t *buf;
- uint32_t storage;
- uint32_t offs;
- unsigned m;
- buf = enc->precarry_buf;
- storage = enc->precarry_storage;
- offs = enc->offs;
- if (offs + 2 > storage) {
- storage = 2 * storage + 2;
- buf = (uint16_t *)realloc(buf, sizeof(*buf) * storage);
- if (buf == NULL) {
+
+  /* We flush every time "low" cannot safely and efficiently accommodate any
+     more data. Overall, c must not exceed 63 at the time the bytes are
+     flushed out. To guarantee this, "s" cannot exceed 56 bits because we
+     have to keep 1 byte for the carry. We also subtract 16 to keep room for
+     the next symbol's worth of "d" bits (max 15). An alternative condition
+     would be (e < d), where e is the number of leading zeros in "low",
+     indicating there is not enough room to accommodate "rng" worth of "d"
+     bits in "low". However, that approach needs additional computation:
+     (i) compute "e", (ii) push the leading 0x00's as a special case.
+  */
+ if (s >= 40) { // 56 - 16
+ unsigned char *out = enc->buf;
+ uint32_t storage = enc->storage;
+ uint32_t offs = enc->offs;
+ if (offs + 8 > storage) {
+ storage = 2 * storage + 8;
+ out = (unsigned char *)realloc(out, sizeof(*out) * storage);
+ if (out == NULL) {
enc->error = -1;
enc->offs = 0;
return;
}
- enc->precarry_buf = buf;
- enc->precarry_storage = storage;
+ enc->buf = out;
+ enc->storage = storage;
}
- c += 16;
- m = (1 << c) - 1;
- if (s >= 8) {
- assert(offs < storage);
- buf[offs++] = (uint16_t)(low >> c);
- low &= m;
- c -= 8;
- m >>= 8;
- }
- assert(offs < storage);
- buf[offs++] = (uint16_t)(low >> c);
+    // Add 1 byte here since enc->cnt is biased to count 1 byte less than is
+    // actually pending (enc->cnt is initialized to -9), which keeps the
+    // bookkeeping correct.
+ uint8_t num_bytes_ready = (s >> 3) + 1;
+
+ // Update "c" to contain the number of non-ready bits in "low". Since "low"
+ // has 64-bit capacity, we need to add the (64 - 40) cushion bits and take
+ // off the number of ready bits.
+ c += 24 - (num_bytes_ready << 3);
+
+ // Prepare "output" and update "low"
+ uint64_t output = low >> c;
+ low = low & (((uint64_t)1 << c) - 1);
+
+ // Prepare data and carry mask
+ uint64_t mask = (uint64_t)1 << (num_bytes_ready << 3);
+ uint64_t carry = output & mask;
+
+ mask = mask - 0x01;
+ output = output & mask;
+
+ // Write data in a single operation
+ write_enc_data_to_out_buf(out, offs, output, carry, &enc->offs,
+ num_bytes_ready);
+
+      // Update the encoder state so that enc->cnt will hold the number of
+      // residual bits.
s = c + d - 24;
- low &= m;
- enc->offs = offs;
}
enc->low = low << d;
enc->rng = rng << d;
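
Compared with the old byte-at-a-time pre-carry path, the rewritten normalize batches output through the 64-bit od_ec_enc_window: flushing waits until at least 40 bits are pending (56 usable bits minus 16 reserved for the next symbol, with one byte kept free for the carry). For example, with s = 42 pending bits, num_bytes_ready = (42 >> 3) + 1 = 6, so six bytes leave the window in a single write_enc_data_to_out_buf call rather than six separate flushes through a uint16_t pre-carry buffer.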
@@ -117,12 +132,6 @@
enc->storage = 0;
enc->error = -1;
}
- enc->precarry_buf = (uint16_t *)malloc(sizeof(*enc->precarry_buf) * size);
- enc->precarry_storage = size;
- if (size > 0 && enc->precarry_buf == NULL) {
- enc->precarry_storage = 0;
- enc->error = -1;
- }
}
/*Reinitializes the encoder.*/
@@ -141,21 +150,16 @@
}
/*Frees the buffers used by the encoder.*/
-void od_ec_enc_clear(od_ec_enc *enc) {
- free(enc->precarry_buf);
- free(enc->buf);
-}
+void od_ec_enc_clear(od_ec_enc *enc) { free(enc->buf); }
/*Encodes a symbol given its frequency in Q15.
fl: CDF_PROB_TOP minus the cumulative frequency of all symbols that come
- before the
- one to be encoded.
+ before the one to be encoded.
fh: CDF_PROB_TOP minus the cumulative frequency of all symbols up to and
- including
- the one to be encoded.*/
+ including the one to be encoded.*/
static void od_ec_encode_q15(od_ec_enc *enc, unsigned fl, unsigned fh, int s,
int nsyms) {
- od_ec_window l;
+ od_ec_enc_window l;
unsigned r;
unsigned u;
unsigned v;
@@ -164,20 +168,17 @@
assert(32768U <= r);
assert(fh <= fl);
assert(fl <= 32768U);
- assert(7 - EC_PROB_SHIFT - CDF_SHIFT >= 0);
+ assert(7 - EC_PROB_SHIFT >= 0);
const int N = nsyms - 1;
if (fl < CDF_PROB_TOP) {
- u = ((r >> 8) * (uint32_t)(fl >> EC_PROB_SHIFT) >>
- (7 - EC_PROB_SHIFT - CDF_SHIFT)) +
+ u = ((r >> 8) * (uint32_t)(fl >> EC_PROB_SHIFT) >> (7 - EC_PROB_SHIFT)) +
EC_MIN_PROB * (N - (s - 1));
- v = ((r >> 8) * (uint32_t)(fh >> EC_PROB_SHIFT) >>
- (7 - EC_PROB_SHIFT - CDF_SHIFT)) +
+ v = ((r >> 8) * (uint32_t)(fh >> EC_PROB_SHIFT) >> (7 - EC_PROB_SHIFT)) +
EC_MIN_PROB * (N - (s + 0));
l += r - u;
r = u - v;
} else {
- r -= ((r >> 8) * (uint32_t)(fh >> EC_PROB_SHIFT) >>
- (7 - EC_PROB_SHIFT - CDF_SHIFT)) +
+ r -= ((r >> 8) * (uint32_t)(fh >> EC_PROB_SHIFT) >> (7 - EC_PROB_SHIFT)) +
EC_MIN_PROB * (N - (s + 0));
}
od_ec_enc_normalize(enc, l, r);
@@ -191,7 +192,7 @@
val: The value to encode (0 or 1).
f: The probability that the val is one, scaled by 32768.*/
void od_ec_encode_bool_q15(od_ec_enc *enc, int val, unsigned f) {
- od_ec_window l;
+ od_ec_enc_window l;
unsigned r;
unsigned v;
assert(0 < f);
@@ -251,12 +252,11 @@
mask = ((1U << nbits) - 1) << shift;
if (enc->offs > 0) {
/*The first byte has been finalized.*/
- enc->precarry_buf[0] =
- (uint16_t)((enc->precarry_buf[0] & ~mask) | val << shift);
+ enc->buf[0] = (unsigned char)((enc->buf[0] & ~mask) | val << shift);
} else if (9 + enc->cnt + (enc->rng == 0x8000) > nbits) {
/*The first byte has yet to be output.*/
- enc->low = (enc->low & ~((od_ec_window)mask << (16 + enc->cnt))) |
- (od_ec_window)val << (16 + enc->cnt + shift);
+ enc->low = (enc->low & ~((od_ec_enc_window)mask << (16 + enc->cnt))) |
+ (od_ec_enc_window)val << (16 + enc->cnt + shift);
} else {
/*The encoder hasn't even encoded _nbits of data yet.*/
enc->error = -1;
@@ -276,11 +276,10 @@
unsigned char *od_ec_enc_done(od_ec_enc *enc, uint32_t *nbytes) {
unsigned char *out;
uint32_t storage;
- uint16_t *buf;
uint32_t offs;
- od_ec_window m;
- od_ec_window e;
- od_ec_window l;
+ od_ec_enc_window m;
+ od_ec_enc_window e;
+ od_ec_enc_window l;
int c;
int s;
if (enc->error) return NULL;
@@ -295,8 +294,7 @@
(double)tell / enc->nb_symbols);
}
#endif
- /*We output the minimum number of bits that ensures that the symbols encoded
- thus far will be decoded correctly regardless of the bits that follow.*/
+
l = enc->low;
c = enc->cnt;
s = 10;
@@ -304,36 +302,14 @@
e = ((l + m) & ~m) | (m + 1);
s += c;
offs = enc->offs;
- buf = enc->precarry_buf;
- if (s > 0) {
- unsigned n;
- storage = enc->precarry_storage;
- if (offs + ((s + 7) >> 3) > storage) {
- storage = storage * 2 + ((s + 7) >> 3);
- buf = (uint16_t *)realloc(buf, sizeof(*buf) * storage);
- if (buf == NULL) {
- enc->error = -1;
- return NULL;
- }
- enc->precarry_buf = buf;
- enc->precarry_storage = storage;
- }
- n = (1 << (c + 16)) - 1;
- do {
- assert(offs < storage);
- buf[offs++] = (uint16_t)(e >> (c + 16));
- e &= n;
- s -= 8;
- c -= 8;
- n >>= 8;
- } while (s > 0);
- }
+
/*Make sure there's enough room for the entropy-coded bits.*/
out = enc->buf;
storage = enc->storage;
- c = OD_MAXI((s + 7) >> 3, 0);
- if (offs + c > storage) {
- storage = offs + c;
+ const int s_bits = (s + 7) >> 3;
+ int b = OD_MAXI(s_bits, 0);
+ if (offs + b > storage) {
+ storage = offs + b;
out = (unsigned char *)realloc(out, sizeof(*out) * storage);
if (out == NULL) {
enc->error = -1;
@@ -342,23 +318,30 @@
enc->buf = out;
enc->storage = storage;
}
- *nbytes = offs;
- /*Perform carry propagation.*/
- assert(offs <= storage);
- out = out + storage - offs;
- c = 0;
- while (offs > 0) {
- offs--;
- c = buf[offs] + c;
- out[offs] = (unsigned char)c;
- c >>= 8;
+
+ /*We output the minimum number of bits that ensures that the symbols encoded
+ thus far will be decoded correctly regardless of the bits that follow.*/
+ if (s > 0) {
+ uint64_t n;
+ n = ((uint64_t)1 << (c + 16)) - 1;
+ do {
+ assert(offs < storage);
+ uint16_t val = (uint16_t)(e >> (c + 16));
+ out[offs] = (unsigned char)(val & 0x00FF);
+ if (val & 0x0100) {
+ assert(offs > 0);
+ propagate_carry_bwd(out, offs - 1);
+ }
+ offs++;
+
+ e &= n;
+ s -= 8;
+ c -= 8;
+ n >>= 8;
+ } while (s > 0);
}
- /*Note: Unless there's an allocation error, if you keep encoding into the
- current buffer and call this function again later, everything will work
- just fine (you won't get a new packet out, but you will get a single
- buffer with the new data appended to the old).
- However, this function is O(N) where N is the amount of data coded so far,
- so calling it more than once for a given packet is a bad idea.*/
+ *nbytes = offs;
+
return out;
}
@@ -407,17 +390,10 @@
void od_ec_enc_rollback(od_ec_enc *dst, const od_ec_enc *src) {
unsigned char *buf;
uint32_t storage;
- uint16_t *precarry_buf;
- uint32_t precarry_storage;
assert(dst->storage >= src->storage);
- assert(dst->precarry_storage >= src->precarry_storage);
buf = dst->buf;
storage = dst->storage;
- precarry_buf = dst->precarry_buf;
- precarry_storage = dst->precarry_storage;
OD_COPY(dst, src, 1);
dst->buf = buf;
dst->storage = storage;
- dst->precarry_buf = precarry_buf;
- dst->precarry_storage = precarry_storage;
}
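
The rewrite above removes the separate 16-bit pre-carry buffer: bytes are now
written to the output buffer as soon as they are ready, and a carry generated
by a later byte is rippled backwards through the bytes already written. A
minimal standalone sketch of that backward ripple (illustrative only, not the
patch's exact code):

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Add a carry into buf[offs] and let it ripple backwards while bytes
     * overflow, mirroring the propagate_carry_bwd() helper this patch adds. */
    static void add_carry(uint8_t *buf, uint32_t offs) {
      unsigned carry = 1;
      do {
        const unsigned sum = buf[offs] + carry;
        buf[offs--] = (uint8_t)sum;
        carry = sum >> 8;
      } while (carry);
    }

    int main(void) {
      uint8_t buf[3] = { 0x12, 0xff, 0xff };
      /* A carry landing on buf[2] ripples through both 0xff bytes:
       * 12 ff ff -> 13 00 00. */
      add_carry(buf, 2);
      assert(buf[0] == 0x13 && buf[1] == 0x00 && buf[2] == 0x00);
      printf("%02x %02x %02x\n", buf[0], buf[1], buf[2]);
      return 0;
    }

A single carry can touch every byte written so far, but runs of 0xff bytes are
rare in typical streams, so this is cheaper on average than the old
unconditional carry-resolution pass over the whole buffer in od_ec_enc_done().
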
diff --git a/aom_dsp/entenc.h b/aom_dsp/entenc.h
index 3551d42..467e47b 100644
--- a/aom_dsp/entenc.h
+++ b/aom_dsp/entenc.h
@@ -13,11 +13,14 @@
#define AOM_AOM_DSP_ENTENC_H_
#include <stddef.h>
#include "aom_dsp/entcode.h"
+#include "aom_ports/bitops.h"
#ifdef __cplusplus
extern "C" {
#endif
+typedef uint64_t od_ec_enc_window;
+
typedef struct od_ec_enc od_ec_enc;
#define OD_MEASURE_EC_OVERHEAD (0)
@@ -30,14 +33,10 @@
unsigned char *buf;
/*The size of the buffer.*/
uint32_t storage;
- /*A buffer for output bytes with their associated carry flags.*/
- uint16_t *precarry_buf;
- /*The size of the pre-carry buffer.*/
- uint32_t precarry_storage;
/*The offset at which the next entropy-coded byte will be written.*/
uint32_t offs;
/*The low end of the current range.*/
- od_ec_window low;
+ od_ec_enc_window low;
/*The number of values in the current range.*/
uint16_t rng;
/*The number of bits of data in the current value.*/
@@ -78,6 +77,32 @@
void od_ec_enc_checkpoint(od_ec_enc *dst, const od_ec_enc *src);
void od_ec_enc_rollback(od_ec_enc *dst, const od_ec_enc *src);
+// buf is the frame bitbuffer; offs is the offset into buf at which the carry is to be added
+static AOM_INLINE void propagate_carry_bwd(unsigned char *buf, uint32_t offs) {
+ uint16_t sum, carry = 1;
+ do {
+ sum = (uint16_t)buf[offs] + 1;
+ buf[offs--] = (unsigned char)sum;
+ carry = sum >> 8;
+ } while (carry);
+}
+
+// Reverse the byte order and write the data to the buffer, adding the carry bit
+static AOM_INLINE void write_enc_data_to_out_buf(unsigned char *out,
+ uint32_t offs, uint64_t output,
+ uint64_t carry,
+ uint32_t *enc_offs,
+ uint8_t num_bytes_ready) {
+ const uint64_t reg = get_byteswap64(output) >> ((8 - num_bytes_ready) << 3);
+ memcpy(&out[offs], &reg, 8);
+ // Propagate the carry backwards if one exists
+ if (carry) {
+ assert(offs > 0);
+ propagate_carry_bwd(out, offs - 1);
+ }
+ *enc_offs = offs + num_bytes_ready;
+}
+
#ifdef __cplusplus
} // extern "C"
#endif
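
write_enc_data_to_out_buf() above leans on get_byteswap64() from
aom_ports/bitops.h to reverse the byte order of the 64-bit accumulator, so a
single 8-byte memcpy emits the ready bytes most-significant first. A
standalone sketch of the trick, assuming a little-endian host and using a
portable byteswap stand-in (hypothetical name):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Portable stand-in for get_byteswap64(). */
    static uint64_t byteswap64(uint64_t x) {
      uint64_t r = 0;
      for (int i = 0; i < 8; i++) r = (r << 8) | ((x >> (8 * i)) & 0xff);
      return r;
    }

    int main(void) {
      const uint64_t output = 0xabcdefULL;  /* 3 bytes ready */
      const int num_bytes_ready = 3;
      uint8_t out[16] = { 0 };
      const uint64_t reg = byteswap64(output) >> ((8 - num_bytes_ready) << 3);
      /* Like the patch, this always copies 8 bytes, so the destination must
       * have at least 8 bytes of headroom past the write offset. */
      memcpy(out, &reg, 8);
      printf("%02x %02x %02x\n", out[0], out[1], out[2]);  /* ab cd ef */
      return 0;
    }
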
diff --git a/aom_dsp/flow_estimation/corner_detect.c b/aom_dsp/flow_estimation/corner_detect.c
index c49e3fa..7848295 100644
--- a/aom_dsp/flow_estimation/corner_detect.c
+++ b/aom_dsp/flow_estimation/corner_detect.c
@@ -17,21 +17,149 @@
#include "third_party/fastfeat/fast.h"
+#include "aom_dsp/aom_dsp_common.h"
#include "aom_dsp/flow_estimation/corner_detect.h"
+#include "aom_mem/aom_mem.h"
+#include "av1/common/common.h"
-// Fast_9 wrapper
#define FAST_BARRIER 18
-int av1_fast_corner_detect(unsigned char *buf, int width, int height,
- int stride, int *points, int max_points) {
- int num_points;
- xy *const frm_corners_xy = aom_fast9_detect_nonmax(buf, width, height, stride,
- FAST_BARRIER, &num_points);
- num_points = (num_points <= max_points ? num_points : max_points);
- if (num_points > 0 && frm_corners_xy) {
- memcpy(points, frm_corners_xy, sizeof(*frm_corners_xy) * num_points);
- free(frm_corners_xy);
- return num_points;
+
+size_t av1_get_corner_list_size() { return sizeof(CornerList); }
+
+CornerList *av1_alloc_corner_list() {
+ CornerList *corners = (CornerList *)aom_calloc(1, sizeof(CornerList));
+ if (!corners) {
+ return NULL;
}
- free(frm_corners_xy);
- return 0;
+
+ corners->valid = false;
+#if CONFIG_MULTITHREAD
+ pthread_mutex_init(&corners->mutex, NULL);
+#endif // CONFIG_MULTITHREAD
+ return corners;
+}
+
+void compute_corner_list(const ImagePyramid *pyr, CornerList *corners) {
+ const uint8_t *buf = pyr->layers[0].buffer;
+ int width = pyr->layers[0].width;
+ int height = pyr->layers[0].height;
+ int stride = pyr->layers[0].stride;
+
+ int *scores = NULL;
+ int num_corners;
+ xy *const frame_corners_xy = aom_fast9_detect_nonmax(
+ buf, width, height, stride, FAST_BARRIER, &scores, &num_corners);
+
+ if (num_corners <= 0) {
+ // Some error occurred, so no corners are available
+ corners->num_corners = 0;
+ } else if (num_corners <= MAX_CORNERS) {
+ // Use all detected corners
+ memcpy(corners->corners, frame_corners_xy,
+ sizeof(*frame_corners_xy) * num_corners);
+ corners->num_corners = num_corners;
+ } else {
+ // There are more than MAX_CORNERS corners available, so pick out a subset
+ // of the sharpest corners, as these will be the most useful for flow
+ // estimation
+ int histogram[256];
+ av1_zero(histogram);
+ for (int i = 0; i < num_corners; i++) {
+ assert(FAST_BARRIER <= scores[i] && scores[i] <= 255);
+ histogram[scores[i]] += 1;
+ }
+
+ int threshold = -1;
+ int found_corners = 0;
+ for (int bucket = 255; bucket >= 0; bucket--) {
+ if (found_corners + histogram[bucket] > MAX_CORNERS) {
+ // Set threshold here
+ threshold = bucket;
+ break;
+ }
+ found_corners += histogram[bucket];
+ }
+ assert(threshold != -1 && "Failed to select a valid threshold");
+
+ int copied_corners = 0;
+ for (int i = 0; i < num_corners; i++) {
+ if (scores[i] > threshold) {
+ assert(copied_corners < MAX_CORNERS);
+ corners->corners[2 * copied_corners + 0] = frame_corners_xy[i].x;
+ corners->corners[2 * copied_corners + 1] = frame_corners_xy[i].y;
+ copied_corners += 1;
+ }
+ }
+ assert(copied_corners == found_corners);
+ corners->num_corners = copied_corners;
+ }
+
+ free(scores);
+ free(frame_corners_xy);
+}
+
+void av1_compute_corner_list(const ImagePyramid *pyr, CornerList *corners) {
+ assert(corners);
+
+#if CONFIG_MULTITHREAD
+ pthread_mutex_lock(&corners->mutex);
+#endif // CONFIG_MULTITHREAD
+
+ if (!corners->valid) {
+ compute_corner_list(pyr, corners);
+ corners->valid = true;
+ }
+
+#if CONFIG_MULTITHREAD
+ pthread_mutex_unlock(&corners->mutex);
+#endif // CONFIG_MULTITHREAD
+}
+
+#ifndef NDEBUG
+// Check if a corner list has already been computed.
+// This is mostly a debug helper - as it is necessary to hold corners->mutex
+// while reading the valid flag, we cannot just write:
+// assert(corners->valid);
+// This function allows the check to be correctly written as:
+// assert(aom_is_corner_list_valid(corners));
+bool aom_is_corner_list_valid(CornerList *corners) {
+ assert(corners);
+
+ // Per the comments in the CornerList struct, we must take this mutex
+ // before reading or writing the "valid" flag, and hold it while computing
+ // the corner list, to ensure proper behaviour if multiple threads call
+ // this function simultaneously
+#if CONFIG_MULTITHREAD
+ pthread_mutex_lock(&corners->mutex);
+#endif // CONFIG_MULTITHREAD
+
+ bool valid = corners->valid;
+
+#if CONFIG_MULTITHREAD
+ pthread_mutex_unlock(&corners->mutex);
+#endif // CONFIG_MULTITHREAD
+
+ return valid;
+}
+#endif
+
+void av1_invalidate_corner_list(CornerList *corners) {
+ if (corners) {
+#if CONFIG_MULTITHREAD
+ pthread_mutex_lock(&corners->mutex);
+#endif // CONFIG_MULTITHREAD
+ corners->valid = false;
+#if CONFIG_MULTITHREAD
+ pthread_mutex_unlock(&corners->mutex);
+#endif // CONFIG_MULTITHREAD
+ }
+}
+
+void av1_free_corner_list(CornerList *corners) {
+ if (corners) {
+#if CONFIG_MULTITHREAD
+ pthread_mutex_destroy(&corners->mutex);
+#endif // CONFIG_MULTITHREAD
+ aom_free(corners);
+ }
}
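
compute_corner_list() above caps the corner count with a one-pass score
histogram: buckets are consumed from the highest score downwards until taking
another whole bucket would exceed MAX_CORNERS, and only corners scoring
strictly above that bucket are kept. A self-contained sketch of the same
selection (hypothetical names, cap of 4 for brevity):

    #include <stdio.h>

    /* Keep at most max_keep items, preferring the highest scores, using a
     * 256-bin histogram threshold as in compute_corner_list(). */
    static int pick_threshold(const int *scores, int n, int max_keep) {
      int histogram[256] = { 0 };
      for (int i = 0; i < n; i++) histogram[scores[i]]++;
      int kept = 0;
      for (int bucket = 255; bucket >= 0; bucket--) {
        if (kept + histogram[bucket] > max_keep) return bucket;
        kept += histogram[bucket];
      }
      return -1;  /* n <= max_keep: keep everything */
    }

    int main(void) {
      const int scores[7] = { 20, 90, 41, 90, 77, 30, 55 };
      const int threshold = pick_threshold(scores, 7, 4);
      for (int i = 0; i < 7; i++)
        if (scores[i] > threshold) printf("keep score %d\n", scores[i]);
      /* threshold == 41, keeping 90, 90, 77 and 55 */
      return 0;
    }

As in the patch, corners whose score ties the threshold bucket are dropped, so
slightly fewer than MAX_CORNERS may be kept when many scores tie.
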
diff --git a/aom_dsp/flow_estimation/corner_detect.h b/aom_dsp/flow_estimation/corner_detect.h
index 4481c4e..c77813e 100644
--- a/aom_dsp/flow_estimation/corner_detect.h
+++ b/aom_dsp/flow_estimation/corner_detect.h
@@ -14,14 +14,64 @@
#include <stdio.h>
#include <stdlib.h>
+#include <stdbool.h>
#include <memory.h>
+#include "aom_dsp/pyramid.h"
+#include "aom_util/aom_thread.h"
+
#ifdef __cplusplus
extern "C" {
#endif
-int av1_fast_corner_detect(unsigned char *buf, int width, int height,
- int stride, int *points, int max_points);
+#define MAX_CORNERS 4096
+
+typedef struct corner_list {
+#if CONFIG_MULTITHREAD
+ // Mutex which is used to prevent the corner list from being computed twice
+ // at the same time
+ //
+ // Semantics:
+ // * This mutex must be held whenever reading or writing the `valid` flag
+ //
+ // * This mutex must also be held while computing the corner list,
+ // to ensure that only one thread may do so at a time.
+ //
+ // * However, once you have read the valid flag and seen a true value,
+ // it is safe to drop the mutex and read from the remaining fields.
+ // This is because, once the corner list is computed, its contents
+ // will not be changed until the parent frame buffer is recycled,
+ // which will not happen until there are no more outstanding references
+ // to the frame buffer.
+ pthread_mutex_t mutex;
+#endif // CONFIG_MULTITHREAD
+ // Flag indicating whether the corner list contains valid data
+ bool valid;
+ // Number of corners found
+ int num_corners;
+ // (x, y) coordinates of each corner
+ int corners[2 * MAX_CORNERS];
+} CornerList;
+
+size_t av1_get_corner_list_size();
+
+CornerList *av1_alloc_corner_list();
+
+void av1_compute_corner_list(const ImagePyramid *pyr, CornerList *corners);
+
+#ifndef NDEBUG
+// Check if a corner list has already been computed.
+// This is mostly a debug helper - as it is necessary to hold corners->mutex
+// while reading the valid flag, we cannot just write:
+// assert(corners->valid);
+// This function allows the check to be correctly written as:
+// assert(aom_is_corner_list_valid(corners));
+bool aom_is_corner_list_valid(CornerList *corners);
+#endif
+
+void av1_invalidate_corner_list(CornerList *corners);
+
+void av1_free_corner_list(CornerList *corners);
#ifdef __cplusplus
}
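
The mutex-plus-valid-flag arrangement described above is a compute-at-most-once
pattern: every caller takes the lock, but only the first one through does the
expensive work. A minimal pthread sketch of the same pattern (illustrative
only, hypothetical names):

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
      pthread_mutex_t mutex;
      bool valid;  /* guarded by mutex */
      int value;   /* stable once valid has been observed true */
    } LazyResult;

    static void ensure_computed(LazyResult *r) {
      pthread_mutex_lock(&r->mutex);
      if (!r->valid) {
        r->value = 42;  /* the expensive computation runs exactly once */
        r->valid = true;
      }
      pthread_mutex_unlock(&r->mutex);
      /* Having observed valid == true under the mutex, callers may read
       * r->value without the lock, exactly as the CornerList comment
       * describes: the data is immutable until invalidated. */
    }

    int main(void) {
      LazyResult r = { PTHREAD_MUTEX_INITIALIZER, false, 0 };
      ensure_computed(&r);
      ensure_computed(&r);  /* second call is a cheap flag check */
      printf("%d\n", r.value);
      return 0;
    }
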
diff --git a/aom_dsp/flow_estimation/corner_match.c b/aom_dsp/flow_estimation/corner_match.c
index f675604..f34178e 100644
--- a/aom_dsp/flow_estimation/corner_match.c
+++ b/aom_dsp/flow_estimation/corner_match.c
@@ -19,6 +19,7 @@
#include "aom_dsp/flow_estimation/corner_match.h"
#include "aom_dsp/flow_estimation/flow_estimation.h"
#include "aom_dsp/flow_estimation/ransac.h"
+#include "aom_dsp/pyramid.h"
#include "aom_scale/yv12config.h"
#define SEARCH_SZ 9
@@ -26,30 +27,32 @@
#define THRESHOLD_NCC 0.75
-/* Compute var(im) * MATCH_SZ_SQ over a MATCH_SZ by MATCH_SZ window of im,
+/* Compute var(frame) * MATCH_SZ_SQ over a MATCH_SZ by MATCH_SZ window of frame,
centered at (x, y).
*/
-static double compute_variance(unsigned char *im, int stride, int x, int y) {
+static double compute_variance(const unsigned char *frame, int stride, int x,
+ int y) {
int sum = 0;
int sumsq = 0;
int var;
int i, j;
for (i = 0; i < MATCH_SZ; ++i)
for (j = 0; j < MATCH_SZ; ++j) {
- sum += im[(i + y - MATCH_SZ_BY2) * stride + (j + x - MATCH_SZ_BY2)];
- sumsq += im[(i + y - MATCH_SZ_BY2) * stride + (j + x - MATCH_SZ_BY2)] *
- im[(i + y - MATCH_SZ_BY2) * stride + (j + x - MATCH_SZ_BY2)];
+ sum += frame[(i + y - MATCH_SZ_BY2) * stride + (j + x - MATCH_SZ_BY2)];
+ sumsq += frame[(i + y - MATCH_SZ_BY2) * stride + (j + x - MATCH_SZ_BY2)] *
+ frame[(i + y - MATCH_SZ_BY2) * stride + (j + x - MATCH_SZ_BY2)];
}
var = sumsq * MATCH_SZ_SQ - sum * sum;
return (double)var;
}
-/* Compute corr(im1, im2) * MATCH_SZ * stddev(im1), where the
+/* Compute corr(frame1, frame2) * MATCH_SZ * stddev(frame1), where the
correlation/standard deviation are taken over MATCH_SZ by MATCH_SZ windows
of each image, centered at (x1, y1) and (x2, y2) respectively.
*/
-double av1_compute_cross_correlation_c(unsigned char *im1, int stride1, int x1,
- int y1, unsigned char *im2, int stride2,
+double av1_compute_cross_correlation_c(const unsigned char *frame1, int stride1,
+ int x1, int y1,
+ const unsigned char *frame2, int stride2,
int x2, int y2) {
int v1, v2;
int sum1 = 0;
@@ -60,8 +63,8 @@
int i, j;
for (i = 0; i < MATCH_SZ; ++i)
for (j = 0; j < MATCH_SZ; ++j) {
- v1 = im1[(i + y1 - MATCH_SZ_BY2) * stride1 + (j + x1 - MATCH_SZ_BY2)];
- v2 = im2[(i + y2 - MATCH_SZ_BY2) * stride2 + (j + x2 - MATCH_SZ_BY2)];
+ v1 = frame1[(i + y1 - MATCH_SZ_BY2) * stride1 + (j + x1 - MATCH_SZ_BY2)];
+ v2 = frame2[(i + y2 - MATCH_SZ_BY2) * stride2 + (j + x2 - MATCH_SZ_BY2)];
sum1 += v1;
sum2 += v2;
sumsq2 += v2 * v2;
@@ -84,28 +87,30 @@
(point1y - point2y) * (point1y - point2y)) <= thresh * thresh;
}
-static void improve_correspondence(unsigned char *frm, unsigned char *ref,
- int width, int height, int frm_stride,
- int ref_stride,
+static void improve_correspondence(const unsigned char *src,
+ const unsigned char *ref, int width,
+ int height, int src_stride, int ref_stride,
Correspondence *correspondences,
int num_correspondences) {
int i;
for (i = 0; i < num_correspondences; ++i) {
int x, y, best_x = 0, best_y = 0;
double best_match_ncc = 0.0;
+ // For this algorithm, all points have integer coordinates.
+ // It's a little more efficient to convert them to ints once,
+ // before the inner loops
+ int x0 = (int)correspondences[i].x;
+ int y0 = (int)correspondences[i].y;
+ int rx0 = (int)correspondences[i].rx;
+ int ry0 = (int)correspondences[i].ry;
for (y = -SEARCH_SZ_BY2; y <= SEARCH_SZ_BY2; ++y) {
for (x = -SEARCH_SZ_BY2; x <= SEARCH_SZ_BY2; ++x) {
double match_ncc;
- if (!is_eligible_point(correspondences[i].rx + x,
- correspondences[i].ry + y, width, height))
+ if (!is_eligible_point(rx0 + x, ry0 + y, width, height)) continue;
+ if (!is_eligible_distance(x0, y0, rx0 + x, ry0 + y, width, height))
continue;
- if (!is_eligible_distance(correspondences[i].x, correspondences[i].y,
- correspondences[i].rx + x,
- correspondences[i].ry + y, width, height))
- continue;
- match_ncc = av1_compute_cross_correlation(
- frm, frm_stride, correspondences[i].x, correspondences[i].y, ref,
- ref_stride, correspondences[i].rx + x, correspondences[i].ry + y);
+ match_ncc = av1_compute_cross_correlation(src, src_stride, x0, y0, ref,
+ ref_stride, rx0 + x, ry0 + y);
if (match_ncc > best_match_ncc) {
best_match_ncc = match_ncc;
best_y = y;
@@ -119,19 +124,18 @@
for (i = 0; i < num_correspondences; ++i) {
int x, y, best_x = 0, best_y = 0;
double best_match_ncc = 0.0;
+ int x0 = (int)correspondences[i].x;
+ int y0 = (int)correspondences[i].y;
+ int rx0 = (int)correspondences[i].rx;
+ int ry0 = (int)correspondences[i].ry;
for (y = -SEARCH_SZ_BY2; y <= SEARCH_SZ_BY2; ++y)
for (x = -SEARCH_SZ_BY2; x <= SEARCH_SZ_BY2; ++x) {
double match_ncc;
- if (!is_eligible_point(correspondences[i].x + x,
- correspondences[i].y + y, width, height))
- continue;
- if (!is_eligible_distance(
- correspondences[i].x + x, correspondences[i].y + y,
- correspondences[i].rx, correspondences[i].ry, width, height))
+ if (!is_eligible_point(x0 + x, y0 + y, width, height)) continue;
+ if (!is_eligible_distance(x0 + x, y0 + y, rx0, ry0, width, height))
continue;
match_ncc = av1_compute_cross_correlation(
- ref, ref_stride, correspondences[i].rx, correspondences[i].ry, frm,
- frm_stride, correspondences[i].x + x, correspondences[i].y + y);
+ ref, ref_stride, rx0, ry0, src, src_stride, x0 + x, y0 + y);
if (match_ncc > best_match_ncc) {
best_match_ncc = match_ncc;
best_y = y;
@@ -143,14 +147,15 @@
}
}
-int aom_determine_correspondence(unsigned char *src, int *src_corners,
- int num_src_corners, unsigned char *ref,
- int *ref_corners, int num_ref_corners,
+int aom_determine_correspondence(const unsigned char *src,
+ const int *src_corners, int num_src_corners,
+ const unsigned char *ref,
+ const int *ref_corners, int num_ref_corners,
int width, int height, int src_stride,
- int ref_stride, int *correspondence_pts) {
+ int ref_stride,
+ Correspondence *correspondences) {
// TODO(sarahparker) Improve this to include 2-way match
int i, j;
- Correspondence *correspondences = (Correspondence *)correspondence_pts;
int num_correspondences = 0;
for (i = 0; i < num_src_corners; ++i) {
double best_match_ncc = 0.0;
@@ -195,71 +200,44 @@
return num_correspondences;
}
-static bool get_inliers_from_indices(MotionModel *params,
- int *correspondences) {
- int *inliers_tmp = (int *)aom_calloc(2 * MAX_CORNERS, sizeof(*inliers_tmp));
- if (!inliers_tmp) return false;
-
- for (int i = 0; i < params->num_inliers; i++) {
- int index = params->inliers[i];
- inliers_tmp[2 * i] = correspondences[4 * index];
- inliers_tmp[2 * i + 1] = correspondences[4 * index + 1];
- }
- memcpy(params->inliers, inliers_tmp, sizeof(*inliers_tmp) * 2 * MAX_CORNERS);
- aom_free(inliers_tmp);
- return true;
-}
-
-int av1_compute_global_motion_feature_based(
- TransformationType type, unsigned char *src_buffer, int src_width,
- int src_height, int src_stride, int *src_corners, int num_src_corners,
- YV12_BUFFER_CONFIG *ref, int bit_depth, int *num_inliers_by_motion,
- MotionModel *params_by_motion, int num_motions) {
- int i;
- int num_ref_corners;
+bool av1_compute_global_motion_feature_match(
+ TransformationType type, YV12_BUFFER_CONFIG *src, YV12_BUFFER_CONFIG *ref,
+ int bit_depth, MotionModel *motion_models, int num_motion_models) {
int num_correspondences;
- int *correspondences;
- int ref_corners[2 * MAX_CORNERS];
- unsigned char *ref_buffer = ref->y_buffer;
- RansacFunc ransac = av1_get_ransac_type(type);
+ Correspondence *correspondences;
+ ImagePyramid *src_pyramid = src->y_pyramid;
+ CornerList *src_corners = src->corners;
+ ImagePyramid *ref_pyramid = ref->y_pyramid;
+ CornerList *ref_corners = ref->corners;
- if (ref->flags & YV12_FLAG_HIGHBITDEPTH) {
- ref_buffer = av1_downconvert_frame(ref, bit_depth);
- }
+ // Precompute information we will need about each frame
+ aom_compute_pyramid(src, bit_depth, src_pyramid);
+ av1_compute_corner_list(src_pyramid, src_corners);
+ aom_compute_pyramid(ref, bit_depth, ref_pyramid);
+ av1_compute_corner_list(ref_pyramid, ref_corners);
- num_ref_corners =
- av1_fast_corner_detect(ref_buffer, ref->y_width, ref->y_height,
- ref->y_stride, ref_corners, MAX_CORNERS);
+ const uint8_t *src_buffer = src_pyramid->layers[0].buffer;
+ const int src_width = src_pyramid->layers[0].width;
+ const int src_height = src_pyramid->layers[0].height;
+ const int src_stride = src_pyramid->layers[0].stride;
+
+ const uint8_t *ref_buffer = ref_pyramid->layers[0].buffer;
+ assert(ref_pyramid->layers[0].width == src_width);
+ assert(ref_pyramid->layers[0].height == src_height);
+ const int ref_stride = ref_pyramid->layers[0].stride;
// find correspondences between the two images
- correspondences =
- (int *)aom_malloc(num_src_corners * 4 * sizeof(*correspondences));
- if (!correspondences) return 0;
+ correspondences = (Correspondence *)aom_malloc(src_corners->num_corners *
+ sizeof(*correspondences));
+ if (!correspondences) return false;
num_correspondences = aom_determine_correspondence(
- src_buffer, (int *)src_corners, num_src_corners, ref_buffer,
- (int *)ref_corners, num_ref_corners, src_width, src_height, src_stride,
- ref->y_stride, correspondences);
+ src_buffer, src_corners->corners, src_corners->num_corners, ref_buffer,
+ ref_corners->corners, ref_corners->num_corners, src_width, src_height,
+ src_stride, ref_stride, correspondences);
- ransac(correspondences, num_correspondences, num_inliers_by_motion,
- params_by_motion, num_motions);
-
- // Set num_inliers = 0 for motions with too few inliers so they are ignored.
- for (i = 0; i < num_motions; ++i) {
- if (num_inliers_by_motion[i] < MIN_INLIER_PROB * num_correspondences ||
- num_correspondences == 0) {
- num_inliers_by_motion[i] = 0;
- } else if (!get_inliers_from_indices(&params_by_motion[i],
- correspondences)) {
- aom_free(correspondences);
- return 0;
- }
- }
+ bool result = ransac(correspondences, num_correspondences, type,
+ motion_models, num_motion_models);
aom_free(correspondences);
-
- // Return true if any one of the motions has inliers.
- for (i = 0; i < num_motions; ++i) {
- if (num_inliers_by_motion[i] > 0) return 1;
- }
- return 0;
+ return result;
}
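
For reference, the sums accumulated by compute_variance() and
av1_compute_cross_correlation_c() above are the ingredients of a mean-removed
correlation score over a MATCH_SZ x MATCH_SZ window. A small double-precision
sketch of plain normalized cross-correlation between two patches (simplified;
the library keeps scaled integer sums rather than normalizing both sides):

    #include <math.h>
    #include <stdio.h>

    #define WIN 3  /* tiny window for illustration; the library uses MATCH_SZ */

    /* Plain NCC between two WIN x WIN patches; 1.0 means perfectly
     * correlated. */
    static double ncc(const unsigned char *a, int stride_a,
                      const unsigned char *b, int stride_b) {
      double sum_a = 0, sum_b = 0, sum_aa = 0, sum_bb = 0, sum_ab = 0;
      const int n = WIN * WIN;
      for (int i = 0; i < WIN; i++)
        for (int j = 0; j < WIN; j++) {
          const double va = a[i * stride_a + j];
          const double vb = b[i * stride_b + j];
          sum_a += va;
          sum_b += vb;
          sum_aa += va * va;
          sum_bb += vb * vb;
          sum_ab += va * vb;
        }
      const double cov = n * sum_ab - sum_a * sum_b;
      const double var_a = n * sum_aa - sum_a * sum_a;
      const double var_b = n * sum_bb - sum_b * sum_b;
      if (var_a <= 0 || var_b <= 0) return 0;
      return cov / sqrt(var_a * var_b);
    }

    int main(void) {
      const unsigned char p[9] = { 10, 20, 30, 40, 50, 60, 70, 80, 90 };
      unsigned char q[9];
      for (int i = 0; i < 9; i++) q[i] = (unsigned char)(p[i] + 5);
      printf("ncc = %.3f\n", ncc(p, 3, q, 3));  /* ~1.000 */
      return 0;
    }

Because the window means are removed and the score is normalized by the
standard deviations, a uniform brightness offset between the two frames, as in
this example, barely changes the score.
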
diff --git a/aom_dsp/flow_estimation/corner_match.h b/aom_dsp/flow_estimation/corner_match.h
index 71afadf..bb69944 100644
--- a/aom_dsp/flow_estimation/corner_match.h
+++ b/aom_dsp/flow_estimation/corner_match.h
@@ -12,10 +12,12 @@
#ifndef AOM_AOM_DSP_FLOW_ESTIMATION_CORNER_MATCH_H_
#define AOM_AOM_DSP_FLOW_ESTIMATION_CORNER_MATCH_H_
+#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <memory.h>
+#include "aom_dsp/flow_estimation/corner_detect.h"
#include "aom_dsp/flow_estimation/flow_estimation.h"
#include "aom_scale/yv12config.h"
@@ -27,22 +29,17 @@
#define MATCH_SZ_BY2 ((MATCH_SZ - 1) / 2)
#define MATCH_SZ_SQ (MATCH_SZ * MATCH_SZ)
-typedef struct {
- int x, y;
- int rx, ry;
-} Correspondence;
-
-int aom_determine_correspondence(unsigned char *src, int *src_corners,
- int num_src_corners, unsigned char *ref,
- int *ref_corners, int num_ref_corners,
+int aom_determine_correspondence(const unsigned char *src,
+ const int *src_corners, int num_src_corners,
+ const unsigned char *ref,
+ const int *ref_corners, int num_ref_corners,
int width, int height, int src_stride,
- int ref_stride, int *correspondence_pts);
+ int ref_stride,
+ Correspondence *correspondences);
-int av1_compute_global_motion_feature_based(
- TransformationType type, unsigned char *src_buffer, int src_width,
- int src_height, int src_stride, int *src_corners, int num_src_corners,
- YV12_BUFFER_CONFIG *ref, int bit_depth, int *num_inliers_by_motion,
- MotionModel *params_by_motion, int num_motions);
+bool av1_compute_global_motion_feature_match(
+ TransformationType type, YV12_BUFFER_CONFIG *src, YV12_BUFFER_CONFIG *ref,
+ int bit_depth, MotionModel *motion_models, int num_motion_models);
#ifdef __cplusplus
}
diff --git a/aom_dsp/flow_estimation/disflow.c b/aom_dsp/flow_estimation/disflow.c
index 2a6ad4b..a8e7b06 100644
--- a/aom_dsp/flow_estimation/disflow.c
+++ b/aom_dsp/flow_estimation/disflow.c
@@ -9,626 +9,643 @@
* PATENTS file, you can obtain it at www.aomedia.org/license/patent.
*/
-#include <stdbool.h>
-#include <stddef.h>
-#include <stdint.h>
+// Dense Inverse Search flow algorithm
+// Paper: https://arxiv.org/abs/1603.03590
+#include <assert.h>
+#include <math.h>
+
+#include "aom_dsp/aom_dsp_common.h"
+#include "aom_dsp/flow_estimation/corner_detect.h"
#include "aom_dsp/flow_estimation/disflow.h"
-#include "aom_dsp/flow_estimation/flow_estimation.h"
#include "aom_dsp/flow_estimation/ransac.h"
+#include "aom_dsp/pyramid.h"
+#include "aom_mem/aom_mem.h"
-#include "aom_scale/yv12config.h"
+#include "config/aom_dsp_rtcd.h"
+// TODO(rachelbarker):
+// Implement specialized functions for upscaling flow fields,
+// replacing av1_upscale_plane_double_prec().
+// Then we can avoid needing to include code from av1/
#include "av1/common/resize.h"
-// Number of pyramid levels in disflow computation
-#define N_LEVELS 2
-// Size of square patches in the disflow dense grid
-#define PATCH_SIZE 8
-// Center point of square patch
-#define PATCH_CENTER ((PATCH_SIZE + 1) >> 1)
-// Step size between patches, lower value means greater patch overlap
-#define PATCH_STEP 1
-// Minimum size of border padding for disflow
-#define MIN_PAD 7
-// Warp error convergence threshold for disflow
-#define DISFLOW_ERROR_TR 0.01
-// Max number of iterations if warp convergence is not found
-#define DISFLOW_MAX_ITR 10
+// Amount to downsample the flow field by.
+// eg. DOWNSAMPLE_SHIFT = 2 (DOWNSAMPLE_FACTOR == 4) means we calculate
+// one flow point for each 4x4 pixel region of the frame
+// Must be a power of 2
+#define DOWNSAMPLE_SHIFT 3
+#define DOWNSAMPLE_FACTOR (1 << DOWNSAMPLE_SHIFT)
+// Number of outermost flow field entries (on each edge) which can't be
+// computed, because the patch they correspond to extends outside of the
+// frame
+// The border is (DISFLOW_PATCH_SIZE >> 1) pixels wide, which corresponds to
+// ((DISFLOW_PATCH_SIZE >> 1) >> DOWNSAMPLE_SHIFT) flow field entries
+#define FLOW_BORDER ((DISFLOW_PATCH_SIZE >> 1) >> DOWNSAMPLE_SHIFT)
+// When downsampling the flow field, each flow field entry covers a square
+// region of pixels in the image pyramid. This value is equal to the position
+// of the center of that region, as an offset from the top/left edge.
+//
+// Note: Using ((DOWNSAMPLE_FACTOR - 1) / 2) is equivalent to the more
+// natural expression ((DOWNSAMPLE_FACTOR / 2) - 1),
+// unless DOWNSAMPLE_FACTOR == 1 (ie, no downsampling), in which case
+// this gives the correct offset of 0 instead of -1.
+#define UPSAMPLE_CENTER_OFFSET ((DOWNSAMPLE_FACTOR - 1) / 2)
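
To make the macro arithmetic concrete: with DOWNSAMPLE_SHIFT == 3, each flow
field entry covers an 8x8 block of pixels whose center sits 3 pixels in from
the block's top-left corner. A quick compile-and-run check of the two center
formulas compared in the note above (not part of the patch):

    #include <assert.h>
    #include <stdio.h>

    int main(void) {
      const int shift = 3;
      const int factor = 1 << shift;  /* 8 */
      /* The two formulas agree whenever factor > 1 ... */
      assert((factor - 1) / 2 == (factor / 2) - 1);
      /* ... but only (factor - 1) / 2 gives the correct 0 for factor == 1,
       * rather than -1. */
      assert((1 - 1) / 2 == 0);
      printf("center offset = %d\n", (factor - 1) / 2);  /* 3 */
      return 0;
    }
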
-// Struct for an image pyramid
-typedef struct {
- int n_levels;
- int pad_size;
- int has_gradient;
- int widths[N_LEVELS];
- int heights[N_LEVELS];
- int strides[N_LEVELS];
- int level_loc[N_LEVELS];
- unsigned char *level_buffer;
- double *level_dx_buffer;
- double *level_dy_buffer;
-} ImagePyramid;
-
-// Don't use points around the frame border since they are less reliable
-static INLINE int valid_point(int x, int y, int width, int height) {
- return (x > (PATCH_SIZE + PATCH_CENTER)) &&
- (x < (width - PATCH_SIZE - PATCH_CENTER)) &&
- (y > (PATCH_SIZE + PATCH_CENTER)) &&
- (y < (height - PATCH_SIZE - PATCH_CENTER));
+static INLINE void get_cubic_kernel_dbl(double x, double *kernel) {
+ assert(0 <= x && x < 1);
+ double x2 = x * x;
+ double x3 = x2 * x;
+ kernel[0] = -0.5 * x + x2 - 0.5 * x3;
+ kernel[1] = 1.0 - 2.5 * x2 + 1.5 * x3;
+ kernel[2] = 0.5 * x + 2.0 * x2 - 1.5 * x3;
+ kernel[3] = -0.5 * x2 + 0.5 * x3;
}
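
These coefficients are the Catmull-Rom cubic interpolation weights: for any
0 <= x < 1 the four taps sum to exactly 1, and at x == 0 they reduce to
{0, 1, 0, 0}, passing the center sample through unchanged. A quick standalone
check:

    #include <assert.h>
    #include <math.h>
    #include <stdio.h>

    static void get_cubic_kernel_dbl(double x, double *kernel) {
      const double x2 = x * x;
      const double x3 = x2 * x;
      kernel[0] = -0.5 * x + x2 - 0.5 * x3;
      kernel[1] = 1.0 - 2.5 * x2 + 1.5 * x3;
      kernel[2] = 0.5 * x + 2.0 * x2 - 1.5 * x3;
      kernel[3] = -0.5 * x2 + 0.5 * x3;
    }

    int main(void) {
      double k[4];
      for (double x = 0.0; x < 1.0; x += 0.125) {
        get_cubic_kernel_dbl(x, k);
        /* Partition of unity: interpolating a constant signal returns
         * that constant. */
        assert(fabs(k[0] + k[1] + k[2] + k[3] - 1.0) < 1e-12);
      }
      get_cubic_kernel_dbl(0.5, k);
      /* Prints -0.0625 0.5625 0.5625 -0.0625 */
      printf("%g %g %g %g\n", k[0], k[1], k[2], k[3]);
      return 0;
    }
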
-static int determine_disflow_correspondence(int *frm_corners,
- int num_frm_corners, double *flow_u,
- double *flow_v, int width,
- int height, int stride,
- double *correspondences) {
+static INLINE void get_cubic_kernel_int(double x, int *kernel) {
+ double kernel_dbl[4];
+ get_cubic_kernel_dbl(x, kernel_dbl);
+
+ kernel[0] = (int)rint(kernel_dbl[0] * (1 << DISFLOW_INTERP_BITS));
+ kernel[1] = (int)rint(kernel_dbl[1] * (1 << DISFLOW_INTERP_BITS));
+ kernel[2] = (int)rint(kernel_dbl[2] * (1 << DISFLOW_INTERP_BITS));
+ kernel[3] = (int)rint(kernel_dbl[3] * (1 << DISFLOW_INTERP_BITS));
+}
+
+static INLINE double get_cubic_value_dbl(const double *p,
+ const double *kernel) {
+ return kernel[0] * p[0] + kernel[1] * p[1] + kernel[2] * p[2] +
+ kernel[3] * p[3];
+}
+
+static INLINE int get_cubic_value_int(const int *p, const int *kernel) {
+ return kernel[0] * p[0] + kernel[1] * p[1] + kernel[2] * p[2] +
+ kernel[3] * p[3];
+}
+
+static INLINE double bicubic_interp_one(const double *arr, int stride,
+ double *h_kernel, double *v_kernel) {
+ double tmp[1 * 4];
+
+ // Horizontal convolution
+ for (int i = -1; i < 3; ++i) {
+ tmp[i + 1] = get_cubic_value_dbl(&arr[i * stride - 1], h_kernel);
+ }
+
+ // Vertical convolution
+ return get_cubic_value_dbl(tmp, v_kernel);
+}
+
+static int determine_disflow_correspondence(CornerList *corners,
+ const FlowField *flow,
+ Correspondence *correspondences) {
+ const int width = flow->width;
+ const int height = flow->height;
+ const int stride = flow->stride;
+
int num_correspondences = 0;
- int x, y;
- for (int i = 0; i < num_frm_corners; ++i) {
- x = frm_corners[2 * i];
- y = frm_corners[2 * i + 1];
- if (valid_point(x, y, width, height)) {
- correspondences[4 * num_correspondences] = x;
- correspondences[4 * num_correspondences + 1] = y;
- correspondences[4 * num_correspondences + 2] = x + flow_u[y * stride + x];
- correspondences[4 * num_correspondences + 3] = y + flow_v[y * stride + x];
- num_correspondences++;
- }
+ for (int i = 0; i < corners->num_corners; ++i) {
+ const int x0 = corners->corners[2 * i];
+ const int y0 = corners->corners[2 * i + 1];
+
+ // Offset points, to compensate for the fact that (say) a flow field entry
+ // at horizontal index i is nominally associated with the pixel at
+ // horizontal coordinate (i << DOWNSAMPLE_SHIFT) + UPSAMPLE_CENTER_OFFSET.
+ // This offset must be applied before we split the coordinate into integer
+ // and fractional parts, in order for the interpolation to be correct.
+ const int x = x0 - UPSAMPLE_CENTER_OFFSET;
+ const int y = y0 - UPSAMPLE_CENTER_OFFSET;
+
+ // Split the pixel coordinates into integer flow field coordinates and
+ // an offset for interpolation
+ const int flow_x = x >> DOWNSAMPLE_SHIFT;
+ const double flow_sub_x =
+ (x & (DOWNSAMPLE_FACTOR - 1)) / (double)DOWNSAMPLE_FACTOR;
+ const int flow_y = y >> DOWNSAMPLE_SHIFT;
+ const double flow_sub_y =
+ (y & (DOWNSAMPLE_FACTOR - 1)) / (double)DOWNSAMPLE_FACTOR;
+
+ // Make sure that bicubic interpolation won't read outside of the flow field
+ if (flow_x < 1 || (flow_x + 2) >= width) continue;
+ if (flow_y < 1 || (flow_y + 2) >= height) continue;
+
+ double h_kernel[4];
+ double v_kernel[4];
+ get_cubic_kernel_dbl(flow_sub_x, h_kernel);
+ get_cubic_kernel_dbl(flow_sub_y, v_kernel);
+
+ const double flow_u = bicubic_interp_one(&flow->u[flow_y * stride + flow_x],
+ stride, h_kernel, v_kernel);
+ const double flow_v = bicubic_interp_one(&flow->v[flow_y * stride + flow_x],
+ stride, h_kernel, v_kernel);
+
+ // Use original points (without offsets) when filling in correspondence
+ // array
+ correspondences[num_correspondences].x = x0;
+ correspondences[num_correspondences].y = y0;
+ correspondences[num_correspondences].rx = x0 + flow_u;
+ correspondences[num_correspondences].ry = y0 + flow_v;
+ num_correspondences++;
}
return num_correspondences;
}
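
A worked example of the coordinate split above, with DOWNSAMPLE_SHIFT == 3: a
corner at pixel x0 = 30 becomes x = 30 - 3 = 27, giving flow_x = 27 >> 3 = 3
and flow_sub_x = (27 & 7) / 8 = 0.375, so the flow vector is interpolated
around flow field column 3. As a standalone sketch:

    #include <stdio.h>

    #define DOWNSAMPLE_SHIFT 3
    #define DOWNSAMPLE_FACTOR (1 << DOWNSAMPLE_SHIFT)
    #define UPSAMPLE_CENTER_OFFSET ((DOWNSAMPLE_FACTOR - 1) / 2)

    int main(void) {
      const int x0 = 30;                          /* corner, in pixels */
      const int x = x0 - UPSAMPLE_CENTER_OFFSET;  /* align to entry centers */
      const int flow_x = x >> DOWNSAMPLE_SHIFT;   /* integer field index */
      const double flow_sub_x =
          (x & (DOWNSAMPLE_FACTOR - 1)) / (double)DOWNSAMPLE_FACTOR;
      printf("flow_x = %d, flow_sub_x = %.3f\n", flow_x, flow_sub_x);
      /* Prints: flow_x = 3, flow_sub_x = 0.375 */
      return 0;
    }
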
-static double getCubicValue(double p[4], double x) {
- return p[1] + 0.5 * x *
- (p[2] - p[0] +
- x * (2.0 * p[0] - 5.0 * p[1] + 4.0 * p[2] - p[3] +
- x * (3.0 * (p[1] - p[2]) + p[3] - p[0])));
-}
+// Compare two patches of DISFLOW_PATCH_SIZE x DISFLOW_PATCH_SIZE pixels, one
+// rooted at position (x, y) in src and the other at (x + u, y + v) in ref.
+// This function computes the per-pixel differences between the two patches
+// and stores them in dt, scaled to match the dx/dy gradient arrays.
+static INLINE void compute_flow_error(const uint8_t *src, const uint8_t *ref,
+ int width, int height, int stride, int x,
+ int y, double u, double v, int16_t *dt) {
+ // Split offset into integer and fractional parts, and compute cubic
+ // interpolation kernels
+ const int u_int = (int)floor(u);
+ const int v_int = (int)floor(v);
+ const double u_frac = u - floor(u);
+ const double v_frac = v - floor(v);
-static void get_subcolumn(unsigned char *ref, double col[4], int stride, int x,
- int y_start) {
- int i;
- for (i = 0; i < 4; ++i) {
- col[i] = ref[(i + y_start) * stride + x];
- }
-}
+ int h_kernel[4];
+ int v_kernel[4];
+ get_cubic_kernel_int(u_frac, h_kernel);
+ get_cubic_kernel_int(v_frac, v_kernel);
-static double bicubic(unsigned char *ref, double x, double y, int stride) {
- double arr[4];
- int k;
- int i = (int)x;
- int j = (int)y;
- for (k = 0; k < 4; ++k) {
- double arr_temp[4];
- get_subcolumn(ref, arr_temp, stride, i + k - 1, j - 1);
- arr[k] = getCubicValue(arr_temp, y - j);
- }
- return getCubicValue(arr, x - i);
-}
+ // Storage for intermediate values between the two convolution directions
+ int tmp_[DISFLOW_PATCH_SIZE * (DISFLOW_PATCH_SIZE + 3)];
+ int *tmp = tmp_ + DISFLOW_PATCH_SIZE; // Offset by one row
-// Interpolate a warped block using bicubic interpolation when possible
-static unsigned char interpolate(unsigned char *ref, double x, double y,
- int width, int height, int stride) {
- if (x < 0 && y < 0)
- return ref[0];
- else if (x < 0 && y > height - 1)
- return ref[(height - 1) * stride];
- else if (x > width - 1 && y < 0)
- return ref[width - 1];
- else if (x > width - 1 && y > height - 1)
- return ref[(height - 1) * stride + (width - 1)];
- else if (x < 0) {
- int v;
- int i = (int)y;
- double a = y - i;
- if (y > 1 && y < height - 2) {
- double arr[4];
- get_subcolumn(ref, arr, stride, 0, i - 1);
- return clamp((int)(getCubicValue(arr, a) + 0.5), 0, 255);
- }
- v = (int)(ref[i * stride] * (1 - a) + ref[(i + 1) * stride] * a + 0.5);
- return clamp(v, 0, 255);
- } else if (y < 0) {
- int v;
- int j = (int)x;
- double b = x - j;
- if (x > 1 && x < width - 2) {
- double arr[4] = { ref[j - 1], ref[j], ref[j + 1], ref[j + 2] };
- return clamp((int)(getCubicValue(arr, b) + 0.5), 0, 255);
- }
- v = (int)(ref[j] * (1 - b) + ref[j + 1] * b + 0.5);
- return clamp(v, 0, 255);
- } else if (x > width - 1) {
- int v;
- int i = (int)y;
- double a = y - i;
- if (y > 1 && y < height - 2) {
- double arr[4];
- get_subcolumn(ref, arr, stride, width - 1, i - 1);
- return clamp((int)(getCubicValue(arr, a) + 0.5), 0, 255);
- }
- v = (int)(ref[i * stride + width - 1] * (1 - a) +
- ref[(i + 1) * stride + width - 1] * a + 0.5);
- return clamp(v, 0, 255);
- } else if (y > height - 1) {
- int v;
- int j = (int)x;
- double b = x - j;
- if (x > 1 && x < width - 2) {
- int row = (height - 1) * stride;
- double arr[4] = { ref[row + j - 1], ref[row + j], ref[row + j + 1],
- ref[row + j + 2] };
- return clamp((int)(getCubicValue(arr, b) + 0.5), 0, 255);
- }
- v = (int)(ref[(height - 1) * stride + j] * (1 - b) +
- ref[(height - 1) * stride + j + 1] * b + 0.5);
- return clamp(v, 0, 255);
- } else if (x > 1 && y > 1 && x < width - 2 && y < height - 2) {
- return clamp((int)(bicubic(ref, x, y, stride) + 0.5), 0, 255);
- } else {
- int i = (int)y;
- int j = (int)x;
- double a = y - i;
- double b = x - j;
- int v = (int)(ref[i * stride + j] * (1 - a) * (1 - b) +
- ref[i * stride + j + 1] * (1 - a) * b +
- ref[(i + 1) * stride + j] * a * (1 - b) +
- ref[(i + 1) * stride + j + 1] * a * b);
- return clamp(v, 0, 255);
- }
-}
+ // Clamp coordinates so that all pixels we fetch will remain within the
+ // allocated border region, but allow them to go far enough out that
+ // the border pixels' values do not change.
+ // Since we are calculating an 8x8 block, the bottom-right pixel
+ // in the block has coordinates (x0 + 7, y0 + 7). Then, the cubic
+ // interpolation has 4 taps, meaning that the output of pixel
+ // (x_w, y_w) depends on the pixels in the range
+ // ([x_w - 1, x_w + 2], [y_w - 1, y_w + 2]).
+ //
+ // Thus the most extreme coordinates which will be fetched are
+ // (x0 - 1, y0 - 1) and (x0 + 9, y0 + 9).
+ const int x0 = clamp(x + u_int, -9, width);
+ const int y0 = clamp(y + v_int, -9, height);
-// Warps a block using flow vector [u, v] and computes the mse
-static double compute_warp_and_error(unsigned char *ref, unsigned char *frm,
- int width, int height, int stride, int x,
- int y, double u, double v, int16_t *dt) {
- int i, j;
- unsigned char warped;
- double x_w, y_w;
- double mse = 0;
- int16_t err = 0;
- for (i = y; i < y + PATCH_SIZE; ++i)
- for (j = x; j < x + PATCH_SIZE; ++j) {
- x_w = (double)j + u;
- y_w = (double)i + v;
- warped = interpolate(ref, x_w, y_w, width, height, stride);
- err = warped - frm[j + i * stride];
- mse += err * err;
- dt[(i - y) * PATCH_SIZE + (j - x)] = err;
- }
+ // Horizontal convolution
+ for (int i = -1; i < DISFLOW_PATCH_SIZE + 2; ++i) {
+ const int y_w = y0 + i;
+ for (int j = 0; j < DISFLOW_PATCH_SIZE; ++j) {
+ const int x_w = x0 + j;
+ int arr[4];
- mse /= (PATCH_SIZE * PATCH_SIZE);
- return mse;
-}
+ arr[0] = (int)ref[y_w * stride + (x_w - 1)];
+ arr[1] = (int)ref[y_w * stride + (x_w + 0)];
+ arr[2] = (int)ref[y_w * stride + (x_w + 1)];
+ arr[3] = (int)ref[y_w * stride + (x_w + 2)];
-// Computes the components of the system of equations used to solve for
-// a flow vector. This includes:
-// 1.) The hessian matrix for optical flow. This matrix is in the
-// form of:
-//
-// M = |sum(dx * dx) sum(dx * dy)|
-// |sum(dx * dy) sum(dy * dy)|
-//
-// 2.) b = |sum(dx * dt)|
-// |sum(dy * dt)|
-// Where the sums are computed over a square window of PATCH_SIZE.
-static INLINE void compute_flow_system(const double *dx, int dx_stride,
- const double *dy, int dy_stride,
- const int16_t *dt, int dt_stride,
- double *M, double *b) {
- for (int i = 0; i < PATCH_SIZE; i++) {
- for (int j = 0; j < PATCH_SIZE; j++) {
- M[0] += dx[i * dx_stride + j] * dx[i * dx_stride + j];
- M[1] += dx[i * dx_stride + j] * dy[i * dy_stride + j];
- M[3] += dy[i * dy_stride + j] * dy[i * dy_stride + j];
-
- b[0] += dx[i * dx_stride + j] * dt[i * dt_stride + j];
- b[1] += dy[i * dy_stride + j] * dt[i * dt_stride + j];
+ // Apply kernel and round, keeping 6 extra bits of precision.
+ //
+ // 6 is the maximum allowable number of extra bits which will avoid
+ // the intermediate values overflowing an int16_t. The most extreme
+ // intermediate value occurs when:
+ // * The input pixels are [0, 255, 255, 0]
+ // * u_frac = 0.5
+ // In this case, the un-scaled output is 255 * 1.125 = 286.875.
+ // As an integer with 6 fractional bits, that is 18360, which fits
+ // in an int16_t. But with 7 fractional bits it would be 36720,
+ // which is too large.
+ tmp[i * DISFLOW_PATCH_SIZE + j] = ROUND_POWER_OF_TWO(
+ get_cubic_value_int(arr, h_kernel), DISFLOW_INTERP_BITS - 6);
}
}
- M[2] = M[1];
-}
+ // Vertical convolution
+ for (int i = 0; i < DISFLOW_PATCH_SIZE; ++i) {
+ for (int j = 0; j < DISFLOW_PATCH_SIZE; ++j) {
+ const int *p = &tmp[i * DISFLOW_PATCH_SIZE + j];
+ const int arr[4] = { p[-DISFLOW_PATCH_SIZE], p[0], p[DISFLOW_PATCH_SIZE],
+ p[2 * DISFLOW_PATCH_SIZE] };
+ const int result = get_cubic_value_int(arr, v_kernel);
-// Solves a general Mx = b where M is a 2x2 matrix and b is a 2x1 matrix
-static INLINE void solve_2x2_system(const double *M, const double *b,
- double *output_vec) {
- double M_0 = M[0];
- double M_3 = M[3];
- double det = (M_0 * M_3) - (M[1] * M[2]);
- if (det < 1e-5) {
- // Handle singular matrix
- // TODO(sarahparker) compare results using pseudo inverse instead
- M_0 += 1e-10;
- M_3 += 1e-10;
- det = (M_0 * M_3) - (M[1] * M[2]);
- }
- const double det_inv = 1 / det;
- const double mult_b0 = det_inv * b[0];
- const double mult_b1 = det_inv * b[1];
- output_vec[0] = M_3 * mult_b0 - M[1] * mult_b1;
- output_vec[1] = -M[2] * mult_b0 + M_0 * mult_b1;
-}
-
-/*
-static INLINE void image_difference(const uint8_t *src, int src_stride,
- const uint8_t *ref, int ref_stride,
- int16_t *dst, int dst_stride, int height,
- int width) {
- const int block_unit = 8;
- // Take difference in 8x8 blocks to make use of optimized diff function
- for (int i = 0; i < height; i += block_unit) {
- for (int j = 0; j < width; j += block_unit) {
- aom_subtract_block(block_unit, block_unit, dst + i * dst_stride + j,
- dst_stride, src + i * src_stride + j, src_stride,
- ref + i * ref_stride + j, ref_stride);
+ // Apply kernel and round.
+ // This time, we have to round off the 6 extra bits which were kept
+ // earlier, but we also want to keep DISFLOW_DERIV_SCALE_LOG2 extra bits
+ // of precision to match the scale of the dx and dy arrays.
+ const int round_bits = DISFLOW_INTERP_BITS + 6 - DISFLOW_DERIV_SCALE_LOG2;
+ const int warped = ROUND_POWER_OF_TWO(result, round_bits);
+ const int src_px = src[(x + j) + (y + i) * stride] << 3;
+ const int err = warped - src_px;
+ dt[i * DISFLOW_PATCH_SIZE + j] = err;
}
}
}
-*/
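
The 6-extra-bits bound quoted in compute_flow_error() above can be checked
directly: with the worst-case inputs named in the comment, the unscaled
intermediate is 255 * 1.125 = 286.875, which is 18360 at 6 fractional bits
(within int16_t range) but 36720 at 7 (overflow). A standalone check:

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
      /* Worst case of the horizontal cubic pass: pixels { 0, 255, 255, 0 }
       * with u_frac = 0.5, whose kernel weights are
       * { -0.0625, 0.5625, 0.5625, -0.0625 }. */
      const double worst = 255 * (0.5625 + 0.5625);     /* 286.875 */
      const int with_6_bits = (int)(worst * (1 << 6));  /* 18360 */
      const int with_7_bits = (int)(worst * (1 << 7));  /* 36720 */
      assert(with_6_bits <= INT16_MAX);  /* fits */
      assert(with_7_bits > INT16_MAX);   /* would overflow */
      printf("%d %d\n", with_6_bits, with_7_bits);
      return 0;
    }
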
-static INLINE void convolve_2d_sobel_y(const uint8_t *src, int src_stride,
- double *dst, int dst_stride, int w,
- int h, int dir, double norm) {
- int16_t im_block[(MAX_SB_SIZE + MAX_FILTER_TAP - 1) * MAX_SB_SIZE];
- DECLARE_ALIGNED(256, static const int16_t, sobel_a[3]) = { 1, 0, -1 };
- DECLARE_ALIGNED(256, static const int16_t, sobel_b[3]) = { 1, 2, 1 };
+static INLINE void sobel_filter(const uint8_t *src, int src_stride,
+ int16_t *dst, int dst_stride, int dir) {
+ int16_t tmp_[DISFLOW_PATCH_SIZE * (DISFLOW_PATCH_SIZE + 2)];
+ int16_t *tmp = tmp_ + DISFLOW_PATCH_SIZE;
+
+ // Sobel filter kernel
+ // This must have an overall scale factor equal to DISFLOW_DERIV_SCALE,
+ // in order to produce correctly scaled outputs.
+ // To work out the scale factor, we multiply two factors:
+ //
+ // * For the derivative filter (sobel_a), comparing our filter
+ // image[x - 1] - image[x + 1]
+ // to the standard form
+ // d/dx image[x] = image[x+1] - image[x]
+ // tells us that we're actually calculating -2 * d/dx image[x]
+ //
+ // * For the smoothing filter (sobel_b), all coefficients are positive
+ // so the scale factor is just the sum of the coefficients
+ //
+ // Thus we need to make sure that DISFLOW_DERIV_SCALE = 2 * sum(sobel_b)
+ // (and take care of the - sign from sobel_a elsewhere)
+ static const int16_t sobel_a[3] = { 1, 0, -1 };
+ static const int16_t sobel_b[3] = { 1, 2, 1 };
const int taps = 3;
- int im_h = h + taps - 1;
- int im_stride = w;
- const int fo_vert = 1;
- const int fo_horiz = 1;
// horizontal filter
- const uint8_t *src_horiz = src - fo_vert * src_stride;
- const int16_t *x_filter = dir ? sobel_a : sobel_b;
- for (int y = 0; y < im_h; ++y) {
- for (int x = 0; x < w; ++x) {
- int16_t sum = 0;
+ const int16_t *h_kernel = dir ? sobel_a : sobel_b;
+
+ for (int y = -1; y < DISFLOW_PATCH_SIZE + 1; ++y) {
+ for (int x = 0; x < DISFLOW_PATCH_SIZE; ++x) {
+ int sum = 0;
for (int k = 0; k < taps; ++k) {
- sum += x_filter[k] * src_horiz[y * src_stride + x - fo_horiz + k];
+ sum += h_kernel[k] * src[y * src_stride + (x + k - 1)];
}
- im_block[y * im_stride + x] = sum;
+ tmp[y * DISFLOW_PATCH_SIZE + x] = sum;
}
}
// vertical filter
- int16_t *src_vert = im_block + fo_vert * im_stride;
- const int16_t *y_filter = dir ? sobel_b : sobel_a;
- for (int y = 0; y < h; ++y) {
- for (int x = 0; x < w; ++x) {
- int16_t sum = 0;
+ const int16_t *v_kernel = dir ? sobel_b : sobel_a;
+
+ for (int y = 0; y < DISFLOW_PATCH_SIZE; ++y) {
+ for (int x = 0; x < DISFLOW_PATCH_SIZE; ++x) {
+ int sum = 0;
for (int k = 0; k < taps; ++k) {
- sum += y_filter[k] * src_vert[(y - fo_vert + k) * im_stride + x];
+ sum += v_kernel[k] * tmp[(y + k - 1) * DISFLOW_PATCH_SIZE + x];
}
- dst[y * dst_stride + x] = sum * norm;
+ dst[y * dst_stride + x] = sum;
}
}
}
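
The scale-factor reasoning above can be verified on a simple ramp image: the
derivative tap {1, 0, -1} applied to image[x] = x yields (x - 1) - (x + 1) =
-2 per pixel, and the smoothing tap {1, 2, 1} multiplies that by 4, so the
filter reports -8 times the true unit gradient, matching the requirement that
DISFLOW_DERIV_SCALE = 2 * sum(sobel_b) (with the sign compensated elsewhere,
as the comment notes). A standalone check:

    #include <stdio.h>

    int main(void) {
      const int a[3] = { 1, 0, -1 };  /* derivative tap */
      const int b[3] = { 1, 2, 1 };   /* smoothing tap */
      int result = 0;
      for (int i = 0; i < 3; i++)    /* smoothing, across rows */
        for (int j = 0; j < 3; j++)  /* derivative, across columns */
          result += b[i] * a[j] * (100 + j - 1);  /* pixel value = x */
      printf("filtered value = %d\n", result);  /* -8 */
      return 0;
    }
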
-// Compute an image gradient using a sobel filter.
-// If dir == 1, compute the x gradient. If dir == 0, compute y. This function
-// assumes the images have been padded so that they can be processed in units
-// of 8.
-static INLINE void sobel_xy_image_gradient(const uint8_t *src, int src_stride,
- double *dst, int dst_stride,
- int height, int width, int dir) {
- double norm = 1.0;
- // TODO(sarahparker) experiment with doing this over larger block sizes
- const int block_unit = 8;
- // Filter in 8x8 blocks to eventually make use of optimized convolve function
- for (int i = 0; i < height; i += block_unit) {
- for (int j = 0; j < width; j += block_unit) {
- convolve_2d_sobel_y(src + i * src_stride + j, src_stride,
- dst + i * dst_stride + j, dst_stride, block_unit,
- block_unit, dir, norm);
+// Computes the components of the system of equations used to solve for
+// a flow vector.
+//
+// The flow equations are a least-squares system, derived as follows:
+//
+// For each pixel in the patch, we calculate the current error `dt`,
+// and the x and y gradients `dx` and `dy` of the source patch.
+// This means that, to first order, the squared error for this pixel is
+//
+// (dt + u * dx + v * dy)^2
+//
+// where (u, v) are the incremental changes to the flow vector.
+//
+// We then want to find the values of u and v which minimize the sum
+// of the squared error across all pixels. Conveniently, this fits exactly
+// into the form of a least squares problem, with one equation
+//
+// u * dx + v * dy = -dt
+//
+// for each pixel.
+//
+// Summing across all pixels in a square window of size DISFLOW_PATCH_SIZE,
+// and absorbing the - sign elsewhere, this results in the least squares system
+//
+// M = |sum(dx * dx) sum(dx * dy)|
+// |sum(dx * dy) sum(dy * dy)|
+//
+// b = |sum(dx * dt)|
+// |sum(dy * dt)|
+static INLINE void compute_flow_matrix(const int16_t *dx, int dx_stride,
+ const int16_t *dy, int dy_stride,
+ double *M) {
+ int tmp[4] = { 0 };
+
+ for (int i = 0; i < DISFLOW_PATCH_SIZE; i++) {
+ for (int j = 0; j < DISFLOW_PATCH_SIZE; j++) {
+ tmp[0] += dx[i * dx_stride + j] * dx[i * dx_stride + j];
+ tmp[1] += dx[i * dx_stride + j] * dy[i * dy_stride + j];
+ // Don't compute tmp[2], as it should be equal to tmp[1]
+ tmp[3] += dy[i * dy_stride + j] * dy[i * dy_stride + j];
+ }
+ }
+
+ // Apply regularization
+ // We follow the standard regularization method of adding `k * I` before
+ // inverting. This ensures that the matrix will be invertible.
+ //
+ // Setting the regularization strength k to 1 seems to work well here, as
+ // typical values coming from the other equations are very large (1e5 to
+ // 1e6, with an upper limit of around 6e7, at the time of writing).
+ // It also preserves the property that all matrix values are whole numbers,
+ // which is convenient for integerized SIMD implementation.
+ tmp[0] += 1;
+ tmp[3] += 1;
+
+ tmp[2] = tmp[1];
+
+ M[0] = (double)tmp[0];
+ M[1] = (double)tmp[1];
+ M[2] = (double)tmp[2];
+ M[3] = (double)tmp[3];
+}
+
+static INLINE void compute_flow_vector(const int16_t *dx, int dx_stride,
+ const int16_t *dy, int dy_stride,
+ const int16_t *dt, int dt_stride,
+ int *b) {
+ memset(b, 0, 2 * sizeof(*b));
+
+ for (int i = 0; i < DISFLOW_PATCH_SIZE; i++) {
+ for (int j = 0; j < DISFLOW_PATCH_SIZE; j++) {
+ b[0] += dx[i * dx_stride + j] * dt[i * dt_stride + j];
+ b[1] += dy[i * dy_stride + j] * dt[i * dt_stride + j];
}
}
}
-static void free_pyramid(ImagePyramid *pyr) {
- aom_free(pyr->level_buffer);
- if (pyr->has_gradient) {
- aom_free(pyr->level_dx_buffer);
- aom_free(pyr->level_dy_buffer);
- }
- aom_free(pyr);
+// Try to invert the matrix M
+// Note: Due to the nature of how a least-squares matrix is constructed, all of
+// the eigenvalues will be >= 0, and therefore det M >= 0 as well.
+// The regularization term `+ k * I` further ensures that det M >= k^2.
+// As mentioned in compute_flow_matrix(), here we use k = 1, so det M >= 1.
+// So we don't have to worry about non-invertible matrices here.
+static INLINE void invert_2x2(const double *M, double *M_inv) {
+ double det = (M[0] * M[3]) - (M[1] * M[2]);
+ assert(det >= 1);
+ const double det_inv = 1 / det;
+
+ M_inv[0] = M[3] * det_inv;
+ M_inv[1] = -M[1] * det_inv;
+ M_inv[2] = -M[2] * det_inv;
+ M_inv[3] = M[0] * det_inv;
}
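
Putting compute_flow_matrix(), compute_flow_vector() and invert_2x2() together,
each refinement step solves a regularized 2x2 least-squares system. A
standalone sketch with toy numbers, checking that the solved step reproduces b:

    #include <assert.h>
    #include <math.h>
    #include <stdio.h>

    static void invert_2x2(const double *M, double *M_inv) {
      const double det = (M[0] * M[3]) - (M[1] * M[2]);
      assert(det >= 1);
      const double det_inv = 1 / det;
      M_inv[0] = M[3] * det_inv;
      M_inv[1] = -M[1] * det_inv;
      M_inv[2] = -M[2] * det_inv;
      M_inv[3] = M[0] * det_inv;
    }

    int main(void) {
      /* Toy system M * step = b, with M already regularized (+1 on the
       * diagonal) so that det >= 1 and the inverse always exists. */
      const double M[4] = { 5.0, 2.0, 2.0, 3.0 };  /* det = 11 */
      const double b[2] = { 9.0, 7.0 };
      double M_inv[4];
      invert_2x2(M, M_inv);
      const double step_u = M_inv[0] * b[0] + M_inv[1] * b[1];
      const double step_v = M_inv[2] * b[0] + M_inv[3] * b[1];
      /* Check: M * (step_u, step_v) reproduces b. */
      assert(fabs(M[0] * step_u + M[1] * step_v - b[0]) < 1e-12);
      assert(fabs(M[2] * step_u + M[3] * step_v - b[1]) < 1e-12);
      printf("step = (%f, %f)\n", step_u, step_v);  /* (13/11, 17/11) */
      return 0;
    }
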
-static ImagePyramid *alloc_pyramid(int width, int height, int pad_size,
- int compute_gradient) {
- ImagePyramid *pyr = aom_calloc(1, sizeof(*pyr));
- if (!pyr) return NULL;
- pyr->has_gradient = compute_gradient;
- // 2 * width * height is the upper bound for a buffer that fits
- // all pyramid levels + padding for each level
- const int buffer_size = sizeof(*pyr->level_buffer) * 2 * width * height +
- (width + 2 * pad_size) * 2 * pad_size * N_LEVELS;
- pyr->level_buffer = aom_malloc(buffer_size);
- if (!pyr->level_buffer) {
- free_pyramid(pyr);
- return NULL;
- }
- memset(pyr->level_buffer, 0, buffer_size);
+void aom_compute_flow_at_point_c(const uint8_t *src, const uint8_t *ref, int x,
+ int y, int width, int height, int stride,
+ double *u, double *v) {
+ double M[4];
+ double M_inv[4];
+ int b[2];
+ int16_t dt[DISFLOW_PATCH_SIZE * DISFLOW_PATCH_SIZE];
+ int16_t dx[DISFLOW_PATCH_SIZE * DISFLOW_PATCH_SIZE];
+ int16_t dy[DISFLOW_PATCH_SIZE * DISFLOW_PATCH_SIZE];
- if (compute_gradient) {
- const int gradient_size =
- sizeof(*pyr->level_dx_buffer) * 2 * width * height +
- (width + 2 * pad_size) * 2 * pad_size * N_LEVELS;
- pyr->level_dx_buffer = aom_calloc(1, gradient_size);
- pyr->level_dy_buffer = aom_calloc(1, gradient_size);
- if (!(pyr->level_dx_buffer && pyr->level_dy_buffer)) {
- free_pyramid(pyr);
- return NULL;
- }
- }
- return pyr;
-}
+ // Compute gradients within this patch
+ const uint8_t *src_patch = &src[y * stride + x];
+ sobel_filter(src_patch, stride, dx, DISFLOW_PATCH_SIZE, 1);
+ sobel_filter(src_patch, stride, dy, DISFLOW_PATCH_SIZE, 0);
-static INLINE void update_level_dims(ImagePyramid *frm_pyr, int level) {
- frm_pyr->widths[level] = frm_pyr->widths[level - 1] >> 1;
- frm_pyr->heights[level] = frm_pyr->heights[level - 1] >> 1;
- frm_pyr->strides[level] = frm_pyr->widths[level] + 2 * frm_pyr->pad_size;
- // Point the beginning of the next level buffer to the correct location inside
- // the padded border
- frm_pyr->level_loc[level] =
- frm_pyr->level_loc[level - 1] +
- frm_pyr->strides[level - 1] *
- (2 * frm_pyr->pad_size + frm_pyr->heights[level - 1]);
-}
-
-// Compute coarse to fine pyramids for a frame
-static void compute_flow_pyramids(unsigned char *frm, const int frm_width,
- const int frm_height, const int frm_stride,
- int n_levels, int pad_size, int compute_grad,
- ImagePyramid *frm_pyr) {
- int cur_width, cur_height, cur_stride, cur_loc;
- assert((frm_width >> n_levels) > 0);
- assert((frm_height >> n_levels) > 0);
-
- // Initialize first level
- frm_pyr->n_levels = n_levels;
- frm_pyr->pad_size = pad_size;
- frm_pyr->widths[0] = frm_width;
- frm_pyr->heights[0] = frm_height;
- frm_pyr->strides[0] = frm_width + 2 * frm_pyr->pad_size;
- // Point the beginning of the level buffer to the location inside
- // the padded border
- frm_pyr->level_loc[0] =
- frm_pyr->strides[0] * frm_pyr->pad_size + frm_pyr->pad_size;
- // This essentially copies the original buffer into the pyramid buffer
- // without the original padding
- av1_resize_plane(frm, frm_height, frm_width, frm_stride,
- frm_pyr->level_buffer + frm_pyr->level_loc[0],
- frm_pyr->heights[0], frm_pyr->widths[0],
- frm_pyr->strides[0]);
-
- if (compute_grad) {
- cur_width = frm_pyr->widths[0];
- cur_height = frm_pyr->heights[0];
- cur_stride = frm_pyr->strides[0];
- cur_loc = frm_pyr->level_loc[0];
- assert(frm_pyr->has_gradient && frm_pyr->level_dx_buffer != NULL &&
- frm_pyr->level_dy_buffer != NULL);
- // Computation x gradient
- sobel_xy_image_gradient(frm_pyr->level_buffer + cur_loc, cur_stride,
- frm_pyr->level_dx_buffer + cur_loc, cur_stride,
- cur_height, cur_width, 1);
-
- // Computation y gradient
- sobel_xy_image_gradient(frm_pyr->level_buffer + cur_loc, cur_stride,
- frm_pyr->level_dy_buffer + cur_loc, cur_stride,
- cur_height, cur_width, 0);
- }
-
- // Start at the finest level and resize down to the coarsest level
- for (int level = 1; level < n_levels; ++level) {
- update_level_dims(frm_pyr, level);
- cur_width = frm_pyr->widths[level];
- cur_height = frm_pyr->heights[level];
- cur_stride = frm_pyr->strides[level];
- cur_loc = frm_pyr->level_loc[level];
-
- av1_resize_plane(frm_pyr->level_buffer + frm_pyr->level_loc[level - 1],
- frm_pyr->heights[level - 1], frm_pyr->widths[level - 1],
- frm_pyr->strides[level - 1],
- frm_pyr->level_buffer + cur_loc, cur_height, cur_width,
- cur_stride);
-
- if (compute_grad) {
- assert(frm_pyr->has_gradient && frm_pyr->level_dx_buffer != NULL &&
- frm_pyr->level_dy_buffer != NULL);
- // Computation x gradient
- sobel_xy_image_gradient(frm_pyr->level_buffer + cur_loc, cur_stride,
- frm_pyr->level_dx_buffer + cur_loc, cur_stride,
- cur_height, cur_width, 1);
-
- // Computation y gradient
- sobel_xy_image_gradient(frm_pyr->level_buffer + cur_loc, cur_stride,
- frm_pyr->level_dy_buffer + cur_loc, cur_stride,
- cur_height, cur_width, 0);
- }
- }
-}
-
-static INLINE void compute_flow_at_point(unsigned char *frm, unsigned char *ref,
- double *dx, double *dy, int x, int y,
- int width, int height, int stride,
- double *u, double *v) {
- double M[4] = { 0 };
- double b[2] = { 0 };
- double tmp_output_vec[2] = { 0 };
- double error = 0;
- int16_t dt[PATCH_SIZE * PATCH_SIZE];
- double o_u = *u;
- double o_v = *v;
+ compute_flow_matrix(dx, DISFLOW_PATCH_SIZE, dy, DISFLOW_PATCH_SIZE, M);
+ invert_2x2(M, M_inv);
for (int itr = 0; itr < DISFLOW_MAX_ITR; itr++) {
- error = compute_warp_and_error(ref, frm, width, height, stride, x, y, *u,
- *v, dt);
- if (error <= DISFLOW_ERROR_TR) break;
- compute_flow_system(dx, stride, dy, stride, dt, PATCH_SIZE, M, b);
- solve_2x2_system(M, b, tmp_output_vec);
- *u += tmp_output_vec[0];
- *v += tmp_output_vec[1];
+ compute_flow_error(src, ref, width, height, stride, x, y, *u, *v, dt);
+ compute_flow_vector(dx, DISFLOW_PATCH_SIZE, dy, DISFLOW_PATCH_SIZE, dt,
+ DISFLOW_PATCH_SIZE, b);
+
+ // Solve flow equations to find a better estimate for the flow vector
+ // at this point
+ const double step_u = M_inv[0] * b[0] + M_inv[1] * b[1];
+ const double step_v = M_inv[2] * b[0] + M_inv[3] * b[1];
+ *u += fclamp(step_u * DISFLOW_STEP_SIZE, -2, 2);
+ *v += fclamp(step_v * DISFLOW_STEP_SIZE, -2, 2);
+
+ if (fabs(step_u) + fabs(step_v) < DISFLOW_STEP_SIZE_THRESOLD) {
+ // Stop iteration when we're close to convergence
+ break;
+ }
}
- if (fabs(*u - o_u) > PATCH_SIZE || fabs(*v - o_u) > PATCH_SIZE) {
- *u = o_u;
- *v = o_v;
+}
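
One detail of the refinement loop above deserves emphasis: the per-iteration
update is scaled by DISFLOW_STEP_SIZE and clamped to +/- 2 pixels, so a single
noisy gradient estimate cannot throw the flow vector far off. A trivial sketch
of that damped update (DISFLOW_STEP_SIZE lives in disflow.h and is not shown in
this hunk; the 0.1 below is only a placeholder):

    #include <stdio.h>

    static double fclamp(double x, double lo, double hi) {
      return x < lo ? lo : (x > hi ? hi : x);
    }

    int main(void) {
      const double step_size = 0.1;    /* placeholder for DISFLOW_STEP_SIZE */
      const double raw_step_u = 35.0;  /* large, unreliable raw step */
      double u = 0.0;
      u += fclamp(raw_step_u * step_size, -2, 2);
      printf("u after one iteration = %.2f\n", u);  /* 2.00: clamped */
      return 0;
    }
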
+
+static void fill_flow_field_borders(double *flow, int width, int height,
+ int stride) {
+ // Calculate the bounds of the rectangle which was filled in by
+ // compute_flow_field() before calling this function.
+ // These indices are inclusive on both ends.
+ const int left_index = FLOW_BORDER;
+ const int right_index = (width - FLOW_BORDER - 1);
+ const int top_index = FLOW_BORDER;
+ const int bottom_index = (height - FLOW_BORDER - 1);
+
+ // Left area
+ for (int i = top_index; i <= bottom_index; i += 1) {
+ double *row = flow + i * stride;
+ const double left = row[left_index];
+ for (int j = 0; j < left_index; j++) {
+ row[j] = left;
+ }
+ }
+
+ // Right area
+ for (int i = top_index; i <= bottom_index; i += 1) {
+ double *row = flow + i * stride;
+ const double right = row[right_index];
+ for (int j = right_index + 1; j < width; j++) {
+ row[j] = right;
+ }
+ }
+
+ // Top area
+ const double *top_row = flow + top_index * stride;
+ for (int i = 0; i < top_index; i++) {
+ double *row = flow + i * stride;
+ memcpy(row, top_row, width * sizeof(*row));
+ }
+
+ // Bottom area
+ const double *bottom_row = flow + bottom_index * stride;
+ for (int i = bottom_index + 1; i < height; i++) {
+ double *row = flow + i * stride;
+ memcpy(row, bottom_row, width * sizeof(*row));
}
}
// make sure flow_u and flow_v start at 0
-static bool compute_flow_field(ImagePyramid *frm_pyr, ImagePyramid *ref_pyr,
- double *flow_u, double *flow_v) {
- int cur_width, cur_height, cur_stride, cur_loc, patch_loc, patch_center;
- double *u_upscale =
- aom_malloc(frm_pyr->strides[0] * frm_pyr->heights[0] * sizeof(*flow_u));
- double *v_upscale =
- aom_malloc(frm_pyr->strides[0] * frm_pyr->heights[0] * sizeof(*flow_v));
- if (!(u_upscale && v_upscale)) {
- aom_free(u_upscale);
- aom_free(v_upscale);
- return false;
- }
+static void compute_flow_field(const ImagePyramid *src_pyr,
+ const ImagePyramid *ref_pyr, FlowField *flow) {
+ assert(src_pyr->n_levels == ref_pyr->n_levels);
- assert(frm_pyr->n_levels == ref_pyr->n_levels);
+ double *flow_u = flow->u;
+ double *flow_v = flow->v;
+
+ const size_t flow_size = flow->stride * (size_t)flow->height;
+ double *u_upscale = aom_malloc(flow_size * sizeof(*u_upscale));
+ double *v_upscale = aom_malloc(flow_size * sizeof(*v_upscale));
// Compute flow field from coarsest to finest level of the pyramid
- for (int level = frm_pyr->n_levels - 1; level >= 0; --level) {
- cur_width = frm_pyr->widths[level];
- cur_height = frm_pyr->heights[level];
- cur_stride = frm_pyr->strides[level];
- cur_loc = frm_pyr->level_loc[level];
+ for (int level = src_pyr->n_levels - 1; level >= 0; --level) {
+ const PyramidLayer *cur_layer = &src_pyr->layers[level];
+ const int cur_width = cur_layer->width;
+ const int cur_height = cur_layer->height;
+ const int cur_stride = cur_layer->stride;
- for (int i = PATCH_SIZE; i < cur_height - PATCH_SIZE; i += PATCH_STEP) {
- for (int j = PATCH_SIZE; j < cur_width - PATCH_SIZE; j += PATCH_STEP) {
- patch_loc = i * cur_stride + j;
- patch_center = patch_loc + PATCH_CENTER * cur_stride + PATCH_CENTER;
- compute_flow_at_point(frm_pyr->level_buffer + cur_loc,
- ref_pyr->level_buffer + cur_loc,
- frm_pyr->level_dx_buffer + cur_loc + patch_loc,
- frm_pyr->level_dy_buffer + cur_loc + patch_loc, j,
- i, cur_width, cur_height, cur_stride,
- flow_u + patch_center, flow_v + patch_center);
+ const uint8_t *src_buffer = cur_layer->buffer;
+ const uint8_t *ref_buffer = ref_pyr->layers[level].buffer;
+
+ const int cur_flow_width = cur_width >> DOWNSAMPLE_SHIFT;
+ const int cur_flow_height = cur_height >> DOWNSAMPLE_SHIFT;
+ const int cur_flow_stride = flow->stride;
+
+ for (int i = FLOW_BORDER; i < cur_flow_height - FLOW_BORDER; i += 1) {
+ for (int j = FLOW_BORDER; j < cur_flow_width - FLOW_BORDER; j += 1) {
+ const int flow_field_idx = i * cur_flow_stride + j;
+
+ // Calculate the position of a patch of size DISFLOW_PATCH_SIZE pixels,
+ // which is centered on the region covered by this flow field entry
+ const int patch_center_x =
+ (j << DOWNSAMPLE_SHIFT) + UPSAMPLE_CENTER_OFFSET; // In pixels
+ const int patch_center_y =
+ (i << DOWNSAMPLE_SHIFT) + UPSAMPLE_CENTER_OFFSET; // In pixels
+ const int patch_tl_x = patch_center_x - DISFLOW_PATCH_CENTER;
+ const int patch_tl_y = patch_center_y - DISFLOW_PATCH_CENTER;
+ assert(patch_tl_x >= 0);
+ assert(patch_tl_y >= 0);
+
+ aom_compute_flow_at_point(src_buffer, ref_buffer, patch_tl_x,
+ patch_tl_y, cur_width, cur_height, cur_stride,
+ &flow_u[flow_field_idx],
+ &flow_v[flow_field_idx]);
}
}
- // TODO(sarahparker) Replace this with upscale function in resize.c
+
+ // Fill in the areas which we haven't explicitly computed, with copies
+ // of the outermost values which we did compute
+ fill_flow_field_borders(flow_u, cur_flow_width, cur_flow_height,
+ cur_flow_stride);
+ fill_flow_field_borders(flow_v, cur_flow_width, cur_flow_height,
+ cur_flow_stride);
+
if (level > 0) {
- int h_upscale = frm_pyr->heights[level - 1];
- int w_upscale = frm_pyr->widths[level - 1];
- int s_upscale = frm_pyr->strides[level - 1];
- for (int i = 0; i < h_upscale; ++i) {
- for (int j = 0; j < w_upscale; ++j) {
- u_upscale[j + i * s_upscale] =
- flow_u[(int)(j >> 1) + (int)(i >> 1) * cur_stride];
- v_upscale[j + i * s_upscale] =
- flow_v[(int)(j >> 1) + (int)(i >> 1) * cur_stride];
+ const int upscale_flow_width = cur_flow_width << 1;
+ const int upscale_flow_height = cur_flow_height << 1;
+ const int upscale_stride = flow->stride;
+
+ av1_upscale_plane_double_prec(
+ flow_u, cur_flow_height, cur_flow_width, cur_flow_stride, u_upscale,
+ upscale_flow_height, upscale_flow_width, upscale_stride);
+ av1_upscale_plane_double_prec(
+ flow_v, cur_flow_height, cur_flow_width, cur_flow_stride, v_upscale,
+ upscale_flow_height, upscale_flow_width, upscale_stride);
+
+ // Multiply all flow vectors by 2.
+ // When we move down a pyramid level, the image resolution doubles.
+ // Thus we need to double all vectors in order for them to represent
+ // the same translation at the next level down
+ for (int i = 0; i < upscale_flow_height; i++) {
+ for (int j = 0; j < upscale_flow_width; j++) {
+ const int index = i * upscale_stride + j;
+ flow_u[index] = u_upscale[index] * 2.0;
+ flow_v[index] = v_upscale[index] * 2.0;
}
}
- memcpy(flow_u, u_upscale,
- frm_pyr->strides[0] * frm_pyr->heights[0] * sizeof(*flow_u));
- memcpy(flow_v, v_upscale,
- frm_pyr->strides[0] * frm_pyr->heights[0] * sizeof(*flow_v));
+
+ // If we didn't fill in the rightmost column or bottommost row during
+ // upsampling (in order to keep the ratio at exactly 2), fill them
+ // in here by copying the next closest column/row
+ const PyramidLayer *next_layer = &src_pyr->layers[level - 1];
+ const int next_flow_width = next_layer->width >> DOWNSAMPLE_SHIFT;
+ const int next_flow_height = next_layer->height >> DOWNSAMPLE_SHIFT;
+
+ // Rightmost column
+ if (next_flow_width > upscale_flow_width) {
+ assert(next_flow_width == upscale_flow_width + 1);
+ for (int i = 0; i < upscale_flow_height; i++) {
+ const int index = i * upscale_stride + upscale_flow_width;
+ flow_u[index] = flow_u[index - 1];
+ flow_v[index] = flow_v[index - 1];
+ }
+ }
+
+ // Bottommost row
+ if (next_flow_height > upscale_flow_height) {
+ assert(next_flow_height == upscale_flow_height + 1);
+ for (int j = 0; j < next_flow_width; j++) {
+ const int index = upscale_flow_height * upscale_stride + j;
+ flow_u[index] = flow_u[index - upscale_stride];
+ flow_v[index] = flow_v[index - upscale_stride];
+ }
+ }
}
}
aom_free(u_upscale);
aom_free(v_upscale);
- return true;
}
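The doubling step admits a one-line sanity check: a displacement of $(u, v)$ pixels measured at level $k$ spans twice as many pixels at level $k-1$, so after the spatial upsampling each vector is scaled as $(u_{k-1}, v_{k-1}) = 2\,(u_k, v_k)$; a coarse-level vector of (1.5, -2.0), for example, becomes (3.0, -4.0) one level down.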
-int av1_compute_global_motion_disflow_based(
- TransformationType type, unsigned char *frm_buffer, int frm_width,
- int frm_height, int frm_stride, int *frm_corners, int num_frm_corners,
- YV12_BUFFER_CONFIG *ref, int bit_depth, int *num_inliers_by_motion,
- MotionModel *params_by_motion, int num_motions) {
- unsigned char *ref_buffer = ref->y_buffer;
- const int ref_width = ref->y_width;
- const int ref_height = ref->y_height;
- const int pad_size = AOMMAX(PATCH_SIZE, MIN_PAD);
- int num_correspondences;
- double *correspondences;
- RansacFuncDouble ransac = av1_get_ransac_double_prec_type(type);
- assert(frm_width == ref_width);
- assert(frm_height == ref_height);
+static FlowField *alloc_flow_field(int frame_width, int frame_height) {
+ FlowField *flow = (FlowField *)aom_malloc(sizeof(FlowField));
+ if (flow == NULL) return NULL;
- // Ensure the number of pyramid levels will work with the frame resolution
- const int msb =
- frm_width < frm_height ? get_msb(frm_width) : get_msb(frm_height);
- const int n_levels = AOMMIN(msb, N_LEVELS);
+ // Calculate the size of the bottom (largest) layer of the flow pyramid
+ flow->width = frame_width >> DOWNSAMPLE_SHIFT;
+ flow->height = frame_height >> DOWNSAMPLE_SHIFT;
+ flow->stride = flow->width;
- if (ref->flags & YV12_FLAG_HIGHBITDEPTH) {
- ref_buffer = av1_downconvert_frame(ref, bit_depth);
+ const size_t flow_size = flow->stride * (size_t)flow->height;
+ flow->u = aom_calloc(flow_size, sizeof(*flow->u));
+ flow->v = aom_calloc(flow_size, sizeof(*flow->v));
+
+ if (flow->u == NULL || flow->v == NULL) {
+ aom_free(flow->u);
+ aom_free(flow->v);
+ aom_free(flow);
+ return NULL;
}
- // TODO(sarahparker) We will want to do the source pyramid computation
- // outside of this function so it doesn't get recomputed for every
- // reference. We also don't need to compute every pyramid level for the
- // reference in advance, since lower levels can be overwritten once their
- // flow field is computed and upscaled. I'll add these optimizations
- // once the full implementation is working.
- // Allocate frm image pyramids
- int compute_gradient = 1;
- ImagePyramid *frm_pyr =
- alloc_pyramid(frm_width, frm_height, pad_size, compute_gradient);
- if (!frm_pyr) return 0;
- compute_flow_pyramids(frm_buffer, frm_width, frm_height, frm_stride, n_levels,
- pad_size, compute_gradient, frm_pyr);
- // Allocate ref image pyramids
- compute_gradient = 0;
- ImagePyramid *ref_pyr =
- alloc_pyramid(ref_width, ref_height, pad_size, compute_gradient);
- if (!ref_pyr) {
- free_pyramid(frm_pyr);
- return 0;
- }
- compute_flow_pyramids(ref_buffer, ref_width, ref_height, ref->y_stride,
- n_levels, pad_size, compute_gradient, ref_pyr);
+ return flow;
+}
- int ret = 0;
- double *flow_u =
- aom_malloc(frm_pyr->strides[0] * frm_pyr->heights[0] * sizeof(*flow_u));
- double *flow_v =
- aom_malloc(frm_pyr->strides[0] * frm_pyr->heights[0] * sizeof(*flow_v));
- if (!(flow_u && flow_v)) goto Error;
+static void free_flow_field(FlowField *flow) {
+ aom_free(flow->u);
+ aom_free(flow->v);
+ aom_free(flow);
+}
- memset(flow_u, 0,
- frm_pyr->strides[0] * frm_pyr->heights[0] * sizeof(*flow_u));
- memset(flow_v, 0,
- frm_pyr->strides[0] * frm_pyr->heights[0] * sizeof(*flow_v));
+// Compute flow field between `src` and `ref`, and then use that flow to
+// compute a global motion model relating the two frames.
+//
+// Following the convention in flow_estimation.h, the flow vectors are computed
+// at fixed points in `src` and point to the corresponding locations in `ref`,
+// regardless of the temporal ordering of the frames.
+bool av1_compute_global_motion_disflow(TransformationType type,
+ YV12_BUFFER_CONFIG *src,
+ YV12_BUFFER_CONFIG *ref, int bit_depth,
+ MotionModel *motion_models,
+ int num_motion_models) {
+ // Precompute information we will need about each frame
+ ImagePyramid *src_pyramid = src->y_pyramid;
+ CornerList *src_corners = src->corners;
+ ImagePyramid *ref_pyramid = ref->y_pyramid;
+ aom_compute_pyramid(src, bit_depth, src_pyramid);
+ av1_compute_corner_list(src_pyramid, src_corners);
+ aom_compute_pyramid(ref, bit_depth, ref_pyramid);
- if (!compute_flow_field(frm_pyr, ref_pyr, flow_u, flow_v)) goto Error;
+ const int src_width = src_pyramid->layers[0].width;
+ const int src_height = src_pyramid->layers[0].height;
+ assert(ref_pyramid->layers[0].width == src_width);
+ assert(ref_pyramid->layers[0].height == src_height);
+
+ FlowField *flow = alloc_flow_field(src_width, src_height);
+ if (!flow) return false;
+
+ compute_flow_field(src_pyramid, ref_pyramid, flow);
// find correspondences between the two images using the flow field
- correspondences = aom_malloc(num_frm_corners * 4 * sizeof(*correspondences));
- if (!correspondences) goto Error;
- num_correspondences = determine_disflow_correspondence(
- frm_corners, num_frm_corners, flow_u, flow_v, frm_width, frm_height,
- frm_pyr->strides[0], correspondences);
- ransac(correspondences, num_correspondences, num_inliers_by_motion,
- params_by_motion, num_motions);
-
- // Set num_inliers = 0 for motions with too few inliers so they are ignored.
- for (int i = 0; i < num_motions; ++i) {
- if (num_inliers_by_motion[i] < MIN_INLIER_PROB * num_correspondences) {
- num_inliers_by_motion[i] = 0;
- }
+ Correspondence *correspondences =
+ aom_malloc(src_corners->num_corners * sizeof(*correspondences));
+ if (!correspondences) {
+ free_flow_field(flow);
+ return false;
}
- // Return true if any one of the motions has inliers.
- for (int i = 0; i < num_motions; ++i) {
- if (num_inliers_by_motion[i] > 0) {
- ret = 1;
- break;
- }
- }
+ const int num_correspondences =
+ determine_disflow_correspondence(src_corners, flow, correspondences);
+
+ bool result = ransac(correspondences, num_correspondences, type,
+ motion_models, num_motion_models);
aom_free(correspondences);
-Error:
- free_pyramid(frm_pyr);
- free_pyramid(ref_pyr);
- aom_free(flow_u);
- aom_free(flow_v);
- return ret;
+ free_flow_field(flow);
+ return result;
}
diff --git a/aom_dsp/flow_estimation/disflow.h b/aom_dsp/flow_estimation/disflow.h
index 52fb261..2e97ba2 100644
--- a/aom_dsp/flow_estimation/disflow.h
+++ b/aom_dsp/flow_estimation/disflow.h
@@ -12,18 +12,88 @@
#ifndef AOM_AOM_DSP_FLOW_ESTIMATION_DISFLOW_H_
#define AOM_AOM_DSP_FLOW_ESTIMATION_DISFLOW_H_
+#include <stdbool.h>
+
#include "aom_dsp/flow_estimation/flow_estimation.h"
+#include "aom_dsp/rect.h"
#include "aom_scale/yv12config.h"
#ifdef __cplusplus
extern "C" {
#endif
-int av1_compute_global_motion_disflow_based(
- TransformationType type, unsigned char *frm_buffer, int frm_width,
- int frm_height, int frm_stride, int *frm_corners, int num_frm_corners,
- YV12_BUFFER_CONFIG *ref, int bit_depth, int *num_inliers_by_motion,
- MotionModel *params_by_motion, int num_motions);
+// Number of pyramid levels in disflow computation
+#define DISFLOW_PYRAMID_LEVELS 12
+
+// Size of square patches in the disflow dense grid
+// Must be a power of 2
+#define DISFLOW_PATCH_SIZE_LOG2 3
+#define DISFLOW_PATCH_SIZE (1 << DISFLOW_PATCH_SIZE_LOG2)
+// Center point of square patch
+#define DISFLOW_PATCH_CENTER ((DISFLOW_PATCH_SIZE / 2) - 1)
+
+// Overall scale of the `dx`, `dy` and `dt` arrays in the disflow code
+// In other words, the various derivatives are calculated with an internal
+// precision of (8 + DISFLOW_DERIV_SCALE_LOG2) bits, from an 8-bit input.
+//
+// This must be carefully synchronized with the code in sobel_filter()
+// (which fills the dx and dy arrays) and compute_flow_error() (which
+// fills dt); see the comments in those functions for more details
+#define DISFLOW_DERIV_SCALE_LOG2 3
+#define DISFLOW_DERIV_SCALE (1 << DISFLOW_DERIV_SCALE_LOG2)
+
+// Scale factor applied to each step in the main refinement loop
+//
+// This should be <= 1.0 to avoid overshoot. Values below 1.0
+// may help in some cases, but they slow convergence overall and
+// so require careful tuning.
+// TODO(rachelbarker): Tune this value
+#define DISFLOW_STEP_SIZE 1.0
+
+// Step size at which we should terminate iteration
+// The idea here is that, if we take a step which is much smaller than 1px in
+// size, then the values won't change much from iteration to iteration, so
+// many future steps will also be small, and that won't have much effect
+// on the ultimate result. So we can terminate early.
+//
+// To look at it another way, when we take a small step, that means that
+// either we're near to convergence (so can stop), or we're stuck in a
+// shallow valley and will take many iterations to get unstuck.
+//
+// Solving the latter properly requires fancier methods, such as "gradient
+// descent with momentum". For now, we terminate to avoid wasting a ton of
+// time on points which are either nearly-converged or stuck.
+//
+// Terminating at 1/8 px seems to give good results for global motion estimation
+#define DISFLOW_STEP_SIZE_THRESHOLD (1. / 8.)
+
+// Max number of iterations if warp convergence is not found
+#define DISFLOW_MAX_ITR 4
+
+// Internal precision of cubic interpolation filters
+// The limiting factor here is that:
+// * Before integerizing, the maximum value of any kernel tap is 1.0
+// * After integerizing, each tap must fit into an int16_t.
+// Thus the largest multiplier we can get away with is 2^14 = 16384,
+// as 2^15 = 32768 is too large to fit in an int16_t.
+#define DISFLOW_INTERP_BITS 14
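A quick arithmetic check of that bound, as a sketch (integerize_tap is a hypothetical helper; rint comes from <math.h>):

    /* With DISFLOW_INTERP_BITS == 14, a maximal tap of 1.0 integerizes to
       1 << 14 == 16384, which fits in int16_t (max 32767); at 15 bits it
       would become 32768, which does not. */
    int16_t integerize_tap(double kernel_tap) {
      return (int16_t)rint(kernel_tap * (1 << DISFLOW_INTERP_BITS));
    }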
+
+typedef struct {
+ // x and y directions of flow, per patch
+ double *u;
+ double *v;
+
+ // Sizes of the above arrays
+ int width;
+ int height;
+ int stride;
+} FlowField;
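A hedged sketch of how a FlowField entry maps back to source pixels; DOWNSAMPLE_SHIFT lives in disflow.c, so this is illustrative rather than compilable against this header alone:

    /* Sketch (hypothetical helper): fetch the flow vector covering source
       pixel (x, y), mirroring the indexing in compute_flow_field(). */
    static void flow_at_pixel(const FlowField *flow, int x, int y,
                              double *u, double *v) {
      const int i = y >> DOWNSAMPLE_SHIFT;  /* flow-field row */
      const int j = x >> DOWNSAMPLE_SHIFT;  /* flow-field column */
      *u = flow->u[i * flow->stride + j];
      *v = flow->v[i * flow->stride + j];
    }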
+
+bool av1_compute_global_motion_disflow(TransformationType type,
+ YV12_BUFFER_CONFIG *src,
+ YV12_BUFFER_CONFIG *ref, int bit_depth,
+ MotionModel *motion_models,
+ int num_motion_models);
#ifdef __cplusplus
}
diff --git a/aom_dsp/flow_estimation/flow_estimation.c b/aom_dsp/flow_estimation/flow_estimation.c
index d8cf8bd..a6bf942 100644
--- a/aom_dsp/flow_estimation/flow_estimation.c
+++ b/aom_dsp/flow_estimation/flow_estimation.c
@@ -11,49 +11,48 @@
#include <assert.h>
+#include "aom_dsp/flow_estimation/corner_detect.h"
#include "aom_dsp/flow_estimation/corner_match.h"
#include "aom_dsp/flow_estimation/disflow.h"
#include "aom_dsp/flow_estimation/flow_estimation.h"
#include "aom_ports/mem.h"
#include "aom_scale/yv12config.h"
-int aom_compute_global_motion(TransformationType type,
- unsigned char *src_buffer, int src_width,
- int src_height, int src_stride, int *src_corners,
- int num_src_corners, YV12_BUFFER_CONFIG *ref,
- int bit_depth,
- GlobalMotionEstimationType gm_estimation_type,
- int *num_inliers_by_motion,
- MotionModel *params_by_motion, int num_motions) {
- switch (gm_estimation_type) {
- case GLOBAL_MOTION_FEATURE_BASED:
- return av1_compute_global_motion_feature_based(
- type, src_buffer, src_width, src_height, src_stride, src_corners,
- num_src_corners, ref, bit_depth, num_inliers_by_motion,
- params_by_motion, num_motions);
- case GLOBAL_MOTION_DISFLOW_BASED:
- return av1_compute_global_motion_disflow_based(
- type, src_buffer, src_width, src_height, src_stride, src_corners,
- num_src_corners, ref, bit_depth, num_inliers_by_motion,
- params_by_motion, num_motions);
+// For each global motion method, how many pyramid levels should we allocate?
+// Note that this is a maximum, and fewer levels will be allocated if the frame
+// is not large enough to need all of the specified levels
+const int global_motion_pyr_levels[GLOBAL_MOTION_METHODS] = {
+ 1, // GLOBAL_MOTION_METHOD_FEATURE_MATCH
+ 16, // GLOBAL_MOTION_METHOD_DISFLOW
+};
+
+// clang-format off
+const double kIdentityParams[MAX_PARAMDIM] = {
+ 0.0, 0.0, 1.0, 0.0, 0.0, 1.0
+};
+// clang-format on
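The six-parameter layout matches project_points_affine() in ransac.c: translation in params[0..1], then the 2x2 matrix in row-major order in params[2..5]. A small sketch of applying a model to a point (apply_model is a hypothetical helper):

    static void apply_model(const double *p, double x, double y,
                            double *rx, double *ry) {
      *rx = p[2] * x + p[3] * y + p[0];
      *ry = p[4] * x + p[5] * y + p[1];
    }
    /* With kIdentityParams = { 0, 0, 1, 0, 0, 1 } this returns (x, y). */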
+
+// Compute a global motion model between the given source and ref frames.
+//
+// As is standard for video codecs, the resulting model maps from (x, y)
+// coordinates in `src` to the corresponding points in `ref`, regardless
+// of the temporal order of the two frames.
+//
+// Returns true if global motion estimation succeeded, false if not.
+// The output models should only be used if this function succeeds.
+bool aom_compute_global_motion(TransformationType type, YV12_BUFFER_CONFIG *src,
+ YV12_BUFFER_CONFIG *ref, int bit_depth,
+ GlobalMotionMethod gm_method,
+ MotionModel *motion_models,
+ int num_motion_models) {
+ switch (gm_method) {
+ case GLOBAL_MOTION_METHOD_FEATURE_MATCH:
+ return av1_compute_global_motion_feature_match(
+ type, src, ref, bit_depth, motion_models, num_motion_models);
+ case GLOBAL_MOTION_METHOD_DISFLOW:
+ return av1_compute_global_motion_disflow(
+ type, src, ref, bit_depth, motion_models, num_motion_models);
default: assert(0 && "Unknown global motion estimation type");
}
return 0;
}
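A hedged usage sketch of the new interface (error handling elided; assumes MAX_CORNERS is now provided by corner_detect.h, and that the caller owns the inliers buffer, since ransac() writes inlier (x, y) pairs into motion_models[i].inliers):

    MotionModel model = { 0 };
    model.inliers = aom_malloc(2 * MAX_CORNERS * sizeof(*model.inliers));
    if (model.inliers &&
        aom_compute_global_motion(ROTZOOM, src, ref, bit_depth,
                                  GLOBAL_MOTION_METHOD_DISFLOW, &model, 1)) {
      /* model.params[] holds the fitted model; model.num_inliers > 0. */
    }
    aom_free(model.inliers);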
-
-unsigned char *av1_downconvert_frame(YV12_BUFFER_CONFIG *frm, int bit_depth) {
- int i, j;
- uint16_t *orig_buf = CONVERT_TO_SHORTPTR(frm->y_buffer);
- uint8_t *buf_8bit = frm->y_buffer_8bit;
- assert(buf_8bit);
- if (!frm->buf_8bit_valid) {
- for (i = 0; i < frm->y_height; ++i) {
- for (j = 0; j < frm->y_width; ++j) {
- buf_8bit[i * frm->y_stride + j] =
- orig_buf[i * frm->y_stride + j] >> (bit_depth - 8);
- }
- }
- frm->buf_8bit_valid = 1;
- }
- return buf_8bit;
-}
diff --git a/aom_dsp/flow_estimation/flow_estimation.h b/aom_dsp/flow_estimation/flow_estimation.h
index ab9d328..4f2192c 100644
--- a/aom_dsp/flow_estimation/flow_estimation.h
+++ b/aom_dsp/flow_estimation/flow_estimation.h
@@ -12,6 +12,8 @@
#ifndef AOM_AOM_DSP_FLOW_ESTIMATION_H_
#define AOM_AOM_DSP_FLOW_ESTIMATION_H_
+#include "aom_dsp/pyramid.h"
+#include "aom_dsp/flow_estimation/corner_detect.h"
#include "aom_ports/mem.h"
#include "aom_scale/yv12config.h"
@@ -19,8 +21,7 @@
extern "C" {
#endif
-#define MAX_PARAMDIM 9
-#define MAX_CORNERS 4096
+#define MAX_PARAMDIM 6
#define MIN_INLIER_PROB 0.1
/* clang-format off */
@@ -36,27 +37,56 @@
// number of parameters used by each transformation in TransformationTypes
static const int trans_model_params[TRANS_TYPES] = { 0, 2, 4, 6 };
+// Available methods which can be used for global motion estimation
typedef enum {
- GLOBAL_MOTION_FEATURE_BASED,
- GLOBAL_MOTION_DISFLOW_BASED,
-} GlobalMotionEstimationType;
+ GLOBAL_MOTION_METHOD_FEATURE_MATCH,
+ GLOBAL_MOTION_METHOD_DISFLOW,
+ GLOBAL_MOTION_METHOD_LAST = GLOBAL_MOTION_METHOD_DISFLOW,
+ GLOBAL_MOTION_METHODS
+} GlobalMotionMethod;
typedef struct {
- double params[MAX_PARAMDIM - 1];
+ double params[MAX_PARAMDIM];
int *inliers;
int num_inliers;
} MotionModel;
-int aom_compute_global_motion(TransformationType type,
- unsigned char *src_buffer, int src_width,
- int src_height, int src_stride, int *src_corners,
- int num_src_corners, YV12_BUFFER_CONFIG *ref,
- int bit_depth,
- GlobalMotionEstimationType gm_estimation_type,
- int *num_inliers_by_motion,
- MotionModel *params_by_motion, int num_motions);
+// Data structure to store a single correspondence point during global
+// motion search.
+//
+// A correspondence (x, y) -> (rx, ry) means that point (x, y) in the
+// source frame corresponds to point (rx, ry) in the ref frame.
+typedef struct {
+ double x, y;
+ double rx, ry;
+} Correspondence;
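Given that convention, a correspondence is presumably assembled as in this sketch (the real logic is in determine_disflow_correspondence(), outside this excerpt):

    /* A corner at (x, y) in the source frame, with flow (u, v) sampled at
       that position, yields (x, y) -> (x + u, y + v). */
    Correspondence c;
    c.x = x;
    c.y = y;
    c.rx = x + u;
    c.ry = y + v;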
-unsigned char *av1_downconvert_frame(YV12_BUFFER_CONFIG *frm, int bit_depth);
+// For each global motion method, how many pyramid levels should we allocate?
+// Note that this is a maximum, and fewer levels will be allocated if the frame
+// is not large enough to need all of the specified levels
+extern const int global_motion_pyr_levels[GLOBAL_MOTION_METHODS];
+
+// Which global motion method should we use in practice?
+// Disflow is faster and gives better results than feature matching in
+// practically all cases, so we use disflow by default
+static const GlobalMotionMethod default_global_motion_method =
+ GLOBAL_MOTION_METHOD_DISFLOW;
+
+extern const double kIdentityParams[MAX_PARAMDIM];
+
+// Compute a global motion model between the given source and ref frames.
+//
+// As is standard for video codecs, the resulting model maps from (x, y)
+// coordinates in `src` to the corresponding points in `ref`, regardless
+// of the temporal order of the two frames.
+//
+// Returns true if global motion estimation succeeded, false if not.
+// The output models should only be used if this function succeeds.
+bool aom_compute_global_motion(TransformationType type, YV12_BUFFER_CONFIG *src,
+ YV12_BUFFER_CONFIG *ref, int bit_depth,
+ GlobalMotionMethod gm_method,
+ MotionModel *motion_models,
+ int num_motion_models);
#ifdef __cplusplus
}
diff --git a/aom_dsp/flow_estimation/ransac.c b/aom_dsp/flow_estimation/ransac.c
index 8ffc30d..81c5f2c 100644
--- a/aom_dsp/flow_estimation/ransac.c
+++ b/aom_dsp/flow_estimation/ransac.c
@@ -13,37 +13,54 @@
#include <math.h>
#include <time.h>
#include <stdio.h>
-#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
#include <assert.h>
#include "aom_dsp/flow_estimation/ransac.h"
#include "aom_dsp/mathutils.h"
+#include "aom_mem/aom_mem.h"
// TODO(rachelbarker): Remove dependence on code in av1/encoder/
#include "av1/encoder/random.h"
#define MAX_MINPTS 4
-#define MAX_DEGENERATE_ITER 10
#define MINPTS_MULTIPLIER 5
#define INLIER_THRESHOLD 1.25
-#define MIN_TRIALS 20
+#define INLIER_THRESHOLD_SQUARED (INLIER_THRESHOLD * INLIER_THRESHOLD)
+#define NUM_TRIALS 20
+
+// Flag to enable functions for finding TRANSLATION type models.
+//
+// These modes are not considered currently due to a spec bug (see comments
+// in gm_get_motion_vector() in av1/common/mv.h). Thus we don't need to compile
+// the corresponding search functions, but it is nice to keep the source
+// around, disabled, for completeness.
+#define ALLOW_TRANSLATION_MODELS 0
////////////////////////////////////////////////////////////////////////////////
// ransac
-typedef int (*IsDegenerateFunc)(double *p);
-typedef void (*NormalizeFunc)(double *p, int np, double *T);
-typedef void (*DenormalizeFunc)(double *params, double *T1, double *T2);
-typedef int (*FindTransformationFunc)(int points, double *points1,
- double *points2, double *params);
-typedef void (*ProjectPointsDoubleFunc)(double *mat, double *points,
- double *proj, int n, int stride_points,
- int stride_proj);
+typedef bool (*IsDegenerateFunc)(double *p);
+typedef bool (*FindTransformationFunc)(int points, const double *points1,
+ const double *points2, double *params);
+typedef void (*ProjectPointsFunc)(const double *mat, const double *points,
+ double *proj, int n, int stride_points,
+ int stride_proj);
-static void project_points_double_translation(double *mat, double *points,
- double *proj, int n,
- int stride_points,
- int stride_proj) {
+// vtable-like structure which stores all of the information needed by RANSAC
+// for a particular model type
+typedef struct {
+ IsDegenerateFunc is_degenerate;
+ FindTransformationFunc find_transformation;
+ ProjectPointsFunc project_points;
+ int minpts;
+} RansacModelInfo;
+
+#if ALLOW_TRANSLATION_MODELS
+static void project_points_translation(const double *mat, const double *points,
+ double *proj, int n, int stride_points,
+ int stride_proj) {
int i;
for (i = 0; i < n; ++i) {
const double x = *(points++), y = *(points++);
@@ -53,23 +70,11 @@
proj += stride_proj - 2;
}
}
+#endif // ALLOW_TRANSLATION_MODELS
-static void project_points_double_rotzoom(double *mat, double *points,
- double *proj, int n,
- int stride_points, int stride_proj) {
- int i;
- for (i = 0; i < n; ++i) {
- const double x = *(points++), y = *(points++);
- *(proj++) = mat[2] * x + mat[3] * y + mat[0];
- *(proj++) = -mat[3] * x + mat[2] * y + mat[1];
- points += stride_points - 2;
- proj += stride_proj - 2;
- }
-}
-
-static void project_points_double_affine(double *mat, double *points,
- double *proj, int n, int stride_points,
- int stride_proj) {
+static void project_points_affine(const double *mat, const double *points,
+ double *proj, int n, int stride_points,
+ int stride_proj) {
int i;
for (i = 0; i < n; ++i) {
const double x = *(points++), y = *(points++);
@@ -80,261 +85,135 @@
}
}
-static void normalize_homography(double *pts, int n, double *T) {
- double *p = pts;
- double mean[2] = { 0, 0 };
- double msqe = 0;
- double scale;
- int i;
+#if ALLOW_TRANSLATION_MODELS
+static bool find_translation(int np, const double *pts1, const double *pts2,
+ double *params) {
+ double sumx = 0;
+ double sumy = 0;
- assert(n > 0);
- for (i = 0; i < n; ++i, p += 2) {
- mean[0] += p[0];
- mean[1] += p[1];
- }
- mean[0] /= n;
- mean[1] /= n;
- for (p = pts, i = 0; i < n; ++i, p += 2) {
- p[0] -= mean[0];
- p[1] -= mean[1];
- msqe += sqrt(p[0] * p[0] + p[1] * p[1]);
- }
- msqe /= n;
- scale = (msqe == 0 ? 1.0 : sqrt(2) / msqe);
- T[0] = scale;
- T[1] = 0;
- T[2] = -scale * mean[0];
- T[3] = 0;
- T[4] = scale;
- T[5] = -scale * mean[1];
- T[6] = 0;
- T[7] = 0;
- T[8] = 1;
- for (p = pts, i = 0; i < n; ++i, p += 2) {
- p[0] *= scale;
- p[1] *= scale;
- }
-}
-
-static void invnormalize_mat(double *T, double *iT) {
- double is = 1.0 / T[0];
- double m0 = -T[2] * is;
- double m1 = -T[5] * is;
- iT[0] = is;
- iT[1] = 0;
- iT[2] = m0;
- iT[3] = 0;
- iT[4] = is;
- iT[5] = m1;
- iT[6] = 0;
- iT[7] = 0;
- iT[8] = 1;
-}
-
-static void denormalize_homography(double *params, double *T1, double *T2) {
- double iT2[9];
- double params2[9];
- invnormalize_mat(T2, iT2);
- multiply_mat(params, T1, params2, 3, 3, 3);
- multiply_mat(iT2, params2, params, 3, 3, 3);
-}
-
-static void denormalize_affine_reorder(double *params, double *T1, double *T2) {
- double params_denorm[MAX_PARAMDIM];
- params_denorm[0] = params[0];
- params_denorm[1] = params[1];
- params_denorm[2] = params[4];
- params_denorm[3] = params[2];
- params_denorm[4] = params[3];
- params_denorm[5] = params[5];
- params_denorm[6] = params_denorm[7] = 0;
- params_denorm[8] = 1;
- denormalize_homography(params_denorm, T1, T2);
- params[0] = params_denorm[2];
- params[1] = params_denorm[5];
- params[2] = params_denorm[0];
- params[3] = params_denorm[1];
- params[4] = params_denorm[3];
- params[5] = params_denorm[4];
- params[6] = params[7] = 0;
-}
-
-static void denormalize_rotzoom_reorder(double *params, double *T1,
- double *T2) {
- double params_denorm[MAX_PARAMDIM];
- params_denorm[0] = params[0];
- params_denorm[1] = params[1];
- params_denorm[2] = params[2];
- params_denorm[3] = -params[1];
- params_denorm[4] = params[0];
- params_denorm[5] = params[3];
- params_denorm[6] = params_denorm[7] = 0;
- params_denorm[8] = 1;
- denormalize_homography(params_denorm, T1, T2);
- params[0] = params_denorm[2];
- params[1] = params_denorm[5];
- params[2] = params_denorm[0];
- params[3] = params_denorm[1];
- params[4] = -params[3];
- params[5] = params[2];
- params[6] = params[7] = 0;
-}
-
-static void denormalize_translation_reorder(double *params, double *T1,
- double *T2) {
- double params_denorm[MAX_PARAMDIM];
- params_denorm[0] = 1;
- params_denorm[1] = 0;
- params_denorm[2] = params[0];
- params_denorm[3] = 0;
- params_denorm[4] = 1;
- params_denorm[5] = params[1];
- params_denorm[6] = params_denorm[7] = 0;
- params_denorm[8] = 1;
- denormalize_homography(params_denorm, T1, T2);
- params[0] = params_denorm[2];
- params[1] = params_denorm[5];
- params[2] = params[5] = 1;
- params[3] = params[4] = 0;
- params[6] = params[7] = 0;
-}
-
-static int find_translation(int np, double *pts1, double *pts2, double *mat) {
- int i;
- double sx, sy, dx, dy;
- double sumx, sumy;
-
- double T1[9], T2[9];
- normalize_homography(pts1, np, T1);
- normalize_homography(pts2, np, T2);
-
- sumx = 0;
- sumy = 0;
- for (i = 0; i < np; ++i) {
- dx = *(pts2++);
- dy = *(pts2++);
- sx = *(pts1++);
- sy = *(pts1++);
+ for (int i = 0; i < np; ++i) {
+ double dx = *(pts2++);
+ double dy = *(pts2++);
+ double sx = *(pts1++);
+ double sy = *(pts1++);
sumx += dx - sx;
sumy += dy - sy;
}
- mat[0] = sumx / np;
- mat[1] = sumy / np;
- denormalize_translation_reorder(mat, T1, T2);
- return 0;
+
+ params[0] = sumx / np;
+ params[1] = sumy / np;
+ params[2] = 1;
+ params[3] = 0;
+ params[4] = 0;
+ params[5] = 1;
+ return true;
+}
+#endif // ALLOW_TRANSLATION_MODELS
+
+static bool find_rotzoom(int np, const double *pts1, const double *pts2,
+ double *params) {
+ const int n = 4; // Size of least-squares problem
+ double mat[4 * 4]; // Accumulator for A'A
+ double y[4]; // Accumulator for A'b
+ double a[4]; // Single row of A
+ double b; // Single element of b
+
+ least_squares_init(mat, y, n);
+ for (int i = 0; i < np; ++i) {
+ double dx = *(pts2++);
+ double dy = *(pts2++);
+ double sx = *(pts1++);
+ double sy = *(pts1++);
+
+ a[0] = 1;
+ a[1] = 0;
+ a[2] = sx;
+ a[3] = sy;
+ b = dx;
+ least_squares_accumulate(mat, y, a, b, n);
+
+ a[0] = 0;
+ a[1] = 1;
+ a[2] = sy;
+ a[3] = -sx;
+ b = dy;
+ least_squares_accumulate(mat, y, a, b, n);
+ }
+
+ // Fill in params[0] .. params[3] with output model
+ if (!least_squares_solve(mat, y, params, n)) {
+ return false;
+ }
+
+ // Fill in remaining parameters
+ params[4] = -params[3];
+ params[5] = params[2];
+
+ return true;
}
-static int find_rotzoom(int np, double *pts1, double *pts2, double *mat) {
- const int np2 = np * 2;
- double *a = (double *)aom_malloc(sizeof(*a) * (np2 * 5 + 20));
- if (a == NULL) return 1;
- double *b = a + np2 * 4;
- double *temp = b + np2;
- int i;
- double sx, sy, dx, dy;
+static bool find_affine(int np, const double *pts1, const double *pts2,
+ double *params) {
+ // Note: The least squares problem for affine models is 6-dimensional,
+ // but it splits into two independent 3-dimensional subproblems.
+ // Solving these two subproblems separately and recombining at the end
+ // results in less total computation than solving the 6-dimensional
+ // problem directly.
+ //
+ // The two subproblems correspond to all the parameters which contribute
+ // to the x output of the model, and all the parameters which contribute
+ // to the y output, respectively.
- double T1[9], T2[9];
- normalize_homography(pts1, np, T1);
- normalize_homography(pts2, np, T2);
+ const int n = 3; // Size of each least-squares problem
+ double mat[2][3 * 3]; // Accumulator for A'A
+ double y[2][3]; // Accumulator for A'b
+ double x[2][3]; // Output vector
+ double a[2][3]; // Single row of A
+ double b[2]; // Single element of b
- for (i = 0; i < np; ++i) {
- dx = *(pts2++);
- dy = *(pts2++);
- sx = *(pts1++);
- sy = *(pts1++);
+ least_squares_init(mat[0], y[0], n);
+ least_squares_init(mat[1], y[1], n);
+ for (int i = 0; i < np; ++i) {
+ double dx = *(pts2++);
+ double dy = *(pts2++);
+ double sx = *(pts1++);
+ double sy = *(pts1++);
- a[i * 2 * 4 + 0] = sx;
- a[i * 2 * 4 + 1] = sy;
- a[i * 2 * 4 + 2] = 1;
- a[i * 2 * 4 + 3] = 0;
- a[(i * 2 + 1) * 4 + 0] = sy;
- a[(i * 2 + 1) * 4 + 1] = -sx;
- a[(i * 2 + 1) * 4 + 2] = 0;
- a[(i * 2 + 1) * 4 + 3] = 1;
+ a[0][0] = 1;
+ a[0][1] = sx;
+ a[0][2] = sy;
+ b[0] = dx;
+ least_squares_accumulate(mat[0], y[0], a[0], b[0], n);
- b[2 * i] = dx;
- b[2 * i + 1] = dy;
+ a[1][0] = 1;
+ a[1][1] = sx;
+ a[1][2] = sy;
+ b[1] = dy;
+ least_squares_accumulate(mat[1], y[1], a[1], b[1], n);
}
- if (!least_squares(4, a, np2, 4, b, temp, mat)) {
- aom_free(a);
- return 1;
+
+ if (!least_squares_solve(mat[0], y[0], x[0], n)) {
+ return false;
}
- denormalize_rotzoom_reorder(mat, T1, T2);
- aom_free(a);
- return 0;
-}
-
-static int find_affine(int np, double *pts1, double *pts2, double *mat) {
- assert(np > 0);
- const int np2 = np * 2;
- double *a = (double *)aom_malloc(sizeof(*a) * (np2 * 7 + 42));
- if (a == NULL) return 1;
- double *b = a + np2 * 6;
- double *temp = b + np2;
- int i;
- double sx, sy, dx, dy;
-
- double T1[9], T2[9];
- normalize_homography(pts1, np, T1);
- normalize_homography(pts2, np, T2);
-
- for (i = 0; i < np; ++i) {
- dx = *(pts2++);
- dy = *(pts2++);
- sx = *(pts1++);
- sy = *(pts1++);
-
- a[i * 2 * 6 + 0] = sx;
- a[i * 2 * 6 + 1] = sy;
- a[i * 2 * 6 + 2] = 0;
- a[i * 2 * 6 + 3] = 0;
- a[i * 2 * 6 + 4] = 1;
- a[i * 2 * 6 + 5] = 0;
- a[(i * 2 + 1) * 6 + 0] = 0;
- a[(i * 2 + 1) * 6 + 1] = 0;
- a[(i * 2 + 1) * 6 + 2] = sx;
- a[(i * 2 + 1) * 6 + 3] = sy;
- a[(i * 2 + 1) * 6 + 4] = 0;
- a[(i * 2 + 1) * 6 + 5] = 1;
-
- b[2 * i] = dx;
- b[2 * i + 1] = dy;
+ if (!least_squares_solve(mat[1], y[1], x[1], n)) {
+ return false;
}
- if (!least_squares(6, a, np2, 6, b, temp, mat)) {
- aom_free(a);
- return 1;
- }
- denormalize_affine_reorder(mat, T1, T2);
- aom_free(a);
- return 0;
-}
-static int get_rand_indices(int npoints, int minpts, int *indices,
- unsigned int *seed) {
- int i, j;
- int ptr = lcg_rand16(seed) % npoints;
- if (minpts > npoints) return 0;
- indices[0] = ptr;
- ptr = (ptr == npoints - 1 ? 0 : ptr + 1);
- i = 1;
- while (i < minpts) {
- int index = lcg_rand16(seed) % npoints;
- while (index) {
- ptr = (ptr == npoints - 1 ? 0 : ptr + 1);
- for (j = 0; j < i; ++j) {
- if (indices[j] == ptr) break;
- }
- if (j == i) index--;
- }
- indices[i++] = ptr;
- }
- return 1;
+ // Rearrange least squares result to form output model
+ params[0] = x[0][0];
+ params[1] = x[1][0];
+ params[2] = x[0][1];
+ params[3] = x[0][2];
+ params[4] = x[1][1];
+ params[5] = x[1][2];
+
+ return true;
}
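find_rotzoom() and find_affine() both lean on the least_squares_* helpers from aom_dsp/mathutils.h. A sketch of their presumed semantics (not the actual implementation): the accumulate step builds the normal equations one row of A at a time, and the solve step then solves (A'A) x = A'b:

    /* Presumed effect of least_squares_accumulate(mat, y, a, b, n):
       mat accumulates A'A (n x n, row-major) and y accumulates A'b. */
    for (int r = 0; r < n; ++r) {
      for (int c = 0; c < n; ++c) mat[r * n + c] += a[r] * a[c];
      y[r] += a[r] * b;
    }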
typedef struct {
int num_inliers;
- double variance;
+ double sse; // Sum of squared errors of inliers
int *inlier_indices;
} RANSAC_MOTION;
@@ -345,13 +224,13 @@
if (motion_a->num_inliers > motion_b->num_inliers) return -1;
if (motion_a->num_inliers < motion_b->num_inliers) return 1;
- if (motion_a->variance < motion_b->variance) return -1;
- if (motion_a->variance > motion_b->variance) return 1;
+ if (motion_a->sse < motion_b->sse) return -1;
+ if (motion_a->sse > motion_b->sse) return 1;
return 0;
}
-static int is_better_motion(const RANSAC_MOTION *motion_a,
- const RANSAC_MOTION *motion_b) {
+static bool is_better_motion(const RANSAC_MOTION *motion_a,
+ const RANSAC_MOTION *motion_b) {
return compare_motions(motion_a, motion_b) < 0;
}
@@ -364,24 +243,14 @@
}
}
-static const double kInfiniteVariance = 1e12;
-
-static void clear_motion(RANSAC_MOTION *motion, int num_points) {
- motion->num_inliers = 0;
- motion->variance = kInfiniteVariance;
- memset(motion->inlier_indices, 0,
- sizeof(*motion->inlier_indices) * num_points);
-}
-
-static int ransac(const int *matched_points, int npoints,
- int *num_inliers_by_motion, MotionModel *params_by_motion,
- int num_desired_motions, int minpts,
- IsDegenerateFunc is_degenerate,
- FindTransformationFunc find_transformation,
- ProjectPointsDoubleFunc projectpoints) {
- int trial_count = 0;
+// Returns true on success, false on error
+static bool ransac_internal(const Correspondence *matched_points, int npoints,
+ MotionModel *motion_models, int num_desired_motions,
+ const RansacModelInfo *model_info) {
+ assert(npoints >= 0);
int i = 0;
- int ret_val = 0;
+ int minpts = model_info->minpts;
+ bool ret_val = true;
unsigned int seed = (unsigned int)npoints;
@@ -389,7 +258,7 @@
double *points1, *points2;
double *corners1, *corners2;
- double *image1_coord;
+ double *projected_corners;
// Store information for the num_desired_motions best transformations found
// and the worst motion among them, as well as the motion currently under
@@ -401,123 +270,115 @@
// currently under consideration.
double params_this_motion[MAX_PARAMDIM];
- double *cnp1, *cnp2;
-
- for (i = 0; i < num_desired_motions; ++i) {
- num_inliers_by_motion[i] = 0;
- }
if (npoints < minpts * MINPTS_MULTIPLIER || npoints == 0) {
- return 1;
+ return false;
}
+ int min_inliers = AOMMAX((int)(MIN_INLIER_PROB * npoints), minpts);
+
points1 = (double *)aom_malloc(sizeof(*points1) * npoints * 2);
points2 = (double *)aom_malloc(sizeof(*points2) * npoints * 2);
corners1 = (double *)aom_malloc(sizeof(*corners1) * npoints * 2);
corners2 = (double *)aom_malloc(sizeof(*corners2) * npoints * 2);
- image1_coord = (double *)aom_malloc(sizeof(*image1_coord) * npoints * 2);
+ projected_corners =
+ (double *)aom_malloc(sizeof(*projected_corners) * npoints * 2);
motions =
(RANSAC_MOTION *)aom_calloc(num_desired_motions, sizeof(RANSAC_MOTION));
- current_motion.inlier_indices =
- (int *)aom_malloc(sizeof(*current_motion.inlier_indices) * npoints);
- if (!(points1 && points2 && corners1 && corners2 && image1_coord && motions &&
- current_motion.inlier_indices)) {
- ret_val = 1;
+
+ // Allocate one large buffer which will be carved up to store the inlier
+ // indices for the current motion plus each of the num_desired_motions
+ // output models
+ // This allows us to keep the allocation/deallocation logic simple, without
+ // having to (for example) check that `motions` is non-null before allocating
+ // the inlier arrays
+ int *inlier_buffer = (int *)aom_malloc(sizeof(*inlier_buffer) * npoints *
+ (num_desired_motions + 1));
+
+ if (!(points1 && points2 && corners1 && corners2 && projected_corners &&
+ motions && inlier_buffer)) {
+ ret_val = false;
goto finish_ransac;
}
- for (i = 0; i < num_desired_motions; ++i) {
- motions[i].inlier_indices =
- (int *)aom_malloc(sizeof(*motions->inlier_indices) * npoints);
- if (!motions[i].inlier_indices) {
- ret_val = 1;
- goto finish_ransac;
- }
- clear_motion(motions + i, npoints);
- }
- clear_motion(&current_motion, npoints);
-
+ // Once all our allocations are known-good, we can fill in our structures
worst_kept_motion = motions;
- cnp1 = corners1;
- cnp2 = corners2;
+ for (i = 0; i < num_desired_motions; ++i) {
+ motions[i].inlier_indices = inlier_buffer + i * npoints;
+ }
+ memset(&current_motion, 0, sizeof(current_motion));
+ current_motion.inlier_indices = inlier_buffer + num_desired_motions * npoints;
+
for (i = 0; i < npoints; ++i) {
- *(cnp1++) = *(matched_points++);
- *(cnp1++) = *(matched_points++);
- *(cnp2++) = *(matched_points++);
- *(cnp2++) = *(matched_points++);
+ corners1[2 * i + 0] = matched_points[i].x;
+ corners1[2 * i + 1] = matched_points[i].y;
+ corners2[2 * i + 0] = matched_points[i].rx;
+ corners2[2 * i + 1] = matched_points[i].ry;
}
- while (MIN_TRIALS > trial_count) {
- double sum_distance = 0.0;
- double sum_distance_squared = 0.0;
+ for (int trial_count = 0; trial_count < NUM_TRIALS; trial_count++) {
+ lcg_pick(npoints, minpts, indices, &seed);
- clear_motion(&current_motion, npoints);
+ copy_points_at_indices(points1, corners1, indices, minpts);
+ copy_points_at_indices(points2, corners2, indices, minpts);
- int degenerate = 1;
- int num_degenerate_iter = 0;
-
- while (degenerate) {
- num_degenerate_iter++;
- if (!get_rand_indices(npoints, minpts, indices, &seed)) {
- ret_val = 1;
- goto finish_ransac;
- }
-
- copy_points_at_indices(points1, corners1, indices, minpts);
- copy_points_at_indices(points2, corners2, indices, minpts);
-
- degenerate = is_degenerate(points1);
- if (num_degenerate_iter > MAX_DEGENERATE_ITER) {
- ret_val = 1;
- goto finish_ransac;
- }
- }
-
- if (find_transformation(minpts, points1, points2, params_this_motion)) {
- trial_count++;
+ if (model_info->is_degenerate(points1)) {
continue;
}
- projectpoints(params_this_motion, corners1, image1_coord, npoints, 2, 2);
+ if (!model_info->find_transformation(minpts, points1, points2,
+ params_this_motion)) {
+ continue;
+ }
+ model_info->project_points(params_this_motion, corners1, projected_corners,
+ npoints, 2, 2);
+
+ current_motion.num_inliers = 0;
+ double sse = 0.0;
for (i = 0; i < npoints; ++i) {
- double dx = image1_coord[i * 2] - corners2[i * 2];
- double dy = image1_coord[i * 2 + 1] - corners2[i * 2 + 1];
- double distance = sqrt(dx * dx + dy * dy);
+ double dx = projected_corners[i * 2] - corners2[i * 2];
+ double dy = projected_corners[i * 2 + 1] - corners2[i * 2 + 1];
+ double squared_error = dx * dx + dy * dy;
- if (distance < INLIER_THRESHOLD) {
+ if (squared_error < INLIER_THRESHOLD_SQUARED) {
current_motion.inlier_indices[current_motion.num_inliers++] = i;
- sum_distance += distance;
- sum_distance_squared += distance * distance;
+ sse += squared_error;
}
}
- if (current_motion.num_inliers >= worst_kept_motion->num_inliers &&
- current_motion.num_inliers > 1) {
- double mean_distance;
- mean_distance = sum_distance / ((double)current_motion.num_inliers);
- current_motion.variance =
- sum_distance_squared / ((double)current_motion.num_inliers - 1.0) -
- mean_distance * mean_distance * ((double)current_motion.num_inliers) /
- ((double)current_motion.num_inliers - 1.0);
- if (is_better_motion(&current_motion, worst_kept_motion)) {
- // This motion is better than the worst currently kept motion. Remember
- // the inlier points and variance. The parameters for each kept motion
- // will be recomputed later using only the inliers.
- worst_kept_motion->num_inliers = current_motion.num_inliers;
- worst_kept_motion->variance = current_motion.variance;
- memcpy(worst_kept_motion->inlier_indices, current_motion.inlier_indices,
- sizeof(*current_motion.inlier_indices) * npoints);
- assert(npoints > 0);
- // Determine the new worst kept motion and its num_inliers and variance.
- for (i = 0; i < num_desired_motions; ++i) {
- if (is_better_motion(worst_kept_motion, &motions[i])) {
- worst_kept_motion = &motions[i];
- }
+ if (current_motion.num_inliers < min_inliers) {
+ // Reject models with too few inliers
+ continue;
+ }
+
+ current_motion.sse = sse;
+ if (is_better_motion(&current_motion, worst_kept_motion)) {
+ // This motion is better than the worst currently kept motion. Remember
+ // the inlier points and sse. The parameters for each kept motion
+ // will be recomputed later using only the inliers.
+ worst_kept_motion->num_inliers = current_motion.num_inliers;
+ worst_kept_motion->sse = current_motion.sse;
+
+ // Rather than copying the (potentially many) inlier indices from
+ // current_motion.inlier_indices to worst_kept_motion->inlier_indices,
+ // we can swap the underlying pointers.
+ //
+ // This is okay because current_motion.inlier_indices is not used again
+ // until the next trial, where its previous contents are ignored
+ // anyway. And both arrays will be deallocated together at the
+ // end of this function, so there are no lifetime issues.
+ int *tmp = worst_kept_motion->inlier_indices;
+ worst_kept_motion->inlier_indices = current_motion.inlier_indices;
+ current_motion.inlier_indices = tmp;
+
+ // Determine the new worst kept motion and its num_inliers and sse.
+ for (i = 0; i < num_desired_motions; ++i) {
+ if (is_better_motion(worst_kept_motion, &motions[i])) {
+ worst_kept_motion = &motions[i];
}
}
}
- trial_count++;
}
// Sort the motions, best first.
@@ -525,310 +386,96 @@
// Recompute the motions using only the inliers.
for (i = 0; i < num_desired_motions; ++i) {
- if (motions[i].num_inliers >= minpts) {
+ int num_inliers = motions[i].num_inliers;
+ if (num_inliers > 0) {
+ assert(num_inliers >= minpts);
+
copy_points_at_indices(points1, corners1, motions[i].inlier_indices,
- motions[i].num_inliers);
+ num_inliers);
copy_points_at_indices(points2, corners2, motions[i].inlier_indices,
- motions[i].num_inliers);
+ num_inliers);
- find_transformation(motions[i].num_inliers, points1, points2,
- params_by_motion[i].params);
+ if (!model_info->find_transformation(num_inliers, points1, points2,
+ motion_models[i].params)) {
+ // In the unlikely event that this model fitting fails,
+ // we don't have a good fallback. So just clear the output
+ // model and move on
+ memcpy(motion_models[i].params, kIdentityParams,
+ MAX_PARAMDIM * sizeof(*(motion_models[i].params)));
+ motion_models[i].num_inliers = 0;
+ continue;
+ }
- params_by_motion[i].num_inliers = motions[i].num_inliers;
- memcpy(params_by_motion[i].inliers, motions[i].inlier_indices,
- sizeof(*motions[i].inlier_indices) * npoints);
- num_inliers_by_motion[i] = motions[i].num_inliers;
+ // Populate inliers array
+ for (int j = 0; j < num_inliers; j++) {
+ int index = motions[i].inlier_indices[j];
+ const Correspondence *corr = &matched_points[index];
+ motion_models[i].inliers[2 * j + 0] = (int)rint(corr->x);
+ motion_models[i].inliers[2 * j + 1] = (int)rint(corr->y);
+ }
+ motion_models[i].num_inliers = num_inliers;
+ } else {
+ memcpy(motion_models[i].params, kIdentityParams,
+ MAX_PARAMDIM * sizeof(*(motion_models[i].params)));
+ motion_models[i].num_inliers = 0;
}
}
finish_ransac:
- aom_free(points1);
- aom_free(points2);
- aom_free(corners1);
+ aom_free(inlier_buffer);
+ aom_free(motions);
+ aom_free(projected_corners);
aom_free(corners2);
- aom_free(image1_coord);
- aom_free(current_motion.inlier_indices);
- if (motions) {
- for (i = 0; i < num_desired_motions; ++i) {
- aom_free(motions[i].inlier_indices);
- }
- aom_free(motions);
- }
+ aom_free(corners1);
+ aom_free(points2);
+ aom_free(points1);
return ret_val;
}
-static int ransac_double_prec(const double *matched_points, int npoints,
- int *num_inliers_by_motion,
- MotionModel *params_by_motion,
- int num_desired_motions, int minpts,
- IsDegenerateFunc is_degenerate,
- FindTransformationFunc find_transformation,
- ProjectPointsDoubleFunc projectpoints) {
- int trial_count = 0;
- int i = 0;
- int ret_val = 0;
-
- unsigned int seed = (unsigned int)npoints;
-
- int indices[MAX_MINPTS] = { 0 };
-
- double *points1, *points2;
- double *corners1, *corners2;
- double *image1_coord;
-
- // Store information for the num_desired_motions best transformations found
- // and the worst motion among them, as well as the motion currently under
- // consideration.
- RANSAC_MOTION *motions, *worst_kept_motion = NULL;
- RANSAC_MOTION current_motion;
-
- // Store the parameters and the indices of the inlier points for the motion
- // currently under consideration.
- double params_this_motion[MAX_PARAMDIM];
-
- double *cnp1, *cnp2;
-
- for (i = 0; i < num_desired_motions; ++i) {
- num_inliers_by_motion[i] = 0;
- }
- if (npoints < minpts * MINPTS_MULTIPLIER || npoints == 0) {
- return 1;
- }
-
- points1 = (double *)aom_malloc(sizeof(*points1) * npoints * 2);
- points2 = (double *)aom_malloc(sizeof(*points2) * npoints * 2);
- corners1 = (double *)aom_malloc(sizeof(*corners1) * npoints * 2);
- corners2 = (double *)aom_malloc(sizeof(*corners2) * npoints * 2);
- image1_coord = (double *)aom_malloc(sizeof(*image1_coord) * npoints * 2);
- motions =
- (RANSAC_MOTION *)aom_calloc(num_desired_motions, sizeof(RANSAC_MOTION));
- current_motion.inlier_indices =
- (int *)aom_malloc(sizeof(*current_motion.inlier_indices) * npoints);
- if (!(points1 && points2 && corners1 && corners2 && image1_coord && motions &&
- current_motion.inlier_indices)) {
- ret_val = 1;
- goto finish_ransac;
- }
-
- for (i = 0; i < num_desired_motions; ++i) {
- motions[i].inlier_indices =
- (int *)aom_malloc(sizeof(*motions->inlier_indices) * npoints);
- if (!motions[i].inlier_indices) {
- ret_val = 1;
- goto finish_ransac;
- }
- clear_motion(motions + i, npoints);
- }
- clear_motion(&current_motion, npoints);
-
- worst_kept_motion = motions;
-
- cnp1 = corners1;
- cnp2 = corners2;
- for (i = 0; i < npoints; ++i) {
- *(cnp1++) = *(matched_points++);
- *(cnp1++) = *(matched_points++);
- *(cnp2++) = *(matched_points++);
- *(cnp2++) = *(matched_points++);
- }
-
- while (MIN_TRIALS > trial_count) {
- double sum_distance = 0.0;
- double sum_distance_squared = 0.0;
-
- clear_motion(&current_motion, npoints);
-
- int degenerate = 1;
- int num_degenerate_iter = 0;
-
- while (degenerate) {
- num_degenerate_iter++;
- if (!get_rand_indices(npoints, minpts, indices, &seed)) {
- ret_val = 1;
- goto finish_ransac;
- }
-
- copy_points_at_indices(points1, corners1, indices, minpts);
- copy_points_at_indices(points2, corners2, indices, minpts);
-
- degenerate = is_degenerate(points1);
- if (num_degenerate_iter > MAX_DEGENERATE_ITER) {
- ret_val = 1;
- goto finish_ransac;
- }
- }
-
- if (find_transformation(minpts, points1, points2, params_this_motion)) {
- trial_count++;
- continue;
- }
-
- projectpoints(params_this_motion, corners1, image1_coord, npoints, 2, 2);
-
- for (i = 0; i < npoints; ++i) {
- double dx = image1_coord[i * 2] - corners2[i * 2];
- double dy = image1_coord[i * 2 + 1] - corners2[i * 2 + 1];
- double distance = sqrt(dx * dx + dy * dy);
-
- if (distance < INLIER_THRESHOLD) {
- current_motion.inlier_indices[current_motion.num_inliers++] = i;
- sum_distance += distance;
- sum_distance_squared += distance * distance;
- }
- }
-
- if (current_motion.num_inliers >= worst_kept_motion->num_inliers &&
- current_motion.num_inliers > 1) {
- double mean_distance;
- mean_distance = sum_distance / ((double)current_motion.num_inliers);
- current_motion.variance =
- sum_distance_squared / ((double)current_motion.num_inliers - 1.0) -
- mean_distance * mean_distance * ((double)current_motion.num_inliers) /
- ((double)current_motion.num_inliers - 1.0);
- if (is_better_motion(&current_motion, worst_kept_motion)) {
- // This motion is better than the worst currently kept motion. Remember
- // the inlier points and variance. The parameters for each kept motion
- // will be recomputed later using only the inliers.
- worst_kept_motion->num_inliers = current_motion.num_inliers;
- worst_kept_motion->variance = current_motion.variance;
- memcpy(worst_kept_motion->inlier_indices, current_motion.inlier_indices,
- sizeof(*current_motion.inlier_indices) * npoints);
- assert(npoints > 0);
- // Determine the new worst kept motion and its num_inliers and variance.
- for (i = 0; i < num_desired_motions; ++i) {
- if (is_better_motion(worst_kept_motion, &motions[i])) {
- worst_kept_motion = &motions[i];
- }
- }
- }
- }
- trial_count++;
- }
-
- // Sort the motions, best first.
- qsort(motions, num_desired_motions, sizeof(RANSAC_MOTION), compare_motions);
-
- // Recompute the motions using only the inliers.
- for (i = 0; i < num_desired_motions; ++i) {
- if (motions[i].num_inliers >= minpts) {
- copy_points_at_indices(points1, corners1, motions[i].inlier_indices,
- motions[i].num_inliers);
- copy_points_at_indices(points2, corners2, motions[i].inlier_indices,
- motions[i].num_inliers);
-
- find_transformation(motions[i].num_inliers, points1, points2,
- params_by_motion[i].params);
- memcpy(params_by_motion[i].inliers, motions[i].inlier_indices,
- sizeof(*motions[i].inlier_indices) * npoints);
- }
- num_inliers_by_motion[i] = motions[i].num_inliers;
- }
-
-finish_ransac:
- aom_free(points1);
- aom_free(points2);
- aom_free(corners1);
- aom_free(corners2);
- aom_free(image1_coord);
- aom_free(current_motion.inlier_indices);
- if (motions) {
- for (i = 0; i < num_desired_motions; ++i) {
- aom_free(motions[i].inlier_indices);
- }
- aom_free(motions);
- }
-
- return ret_val;
-}
-
-static int is_collinear3(double *p1, double *p2, double *p3) {
+static bool is_collinear3(double *p1, double *p2, double *p3) {
static const double collinear_eps = 1e-3;
const double v =
(p2[0] - p1[0]) * (p3[1] - p1[1]) - (p2[1] - p1[1]) * (p3[0] - p1[0]);
return fabs(v) < collinear_eps;
}
-static int is_degenerate_translation(double *p) {
+#if ALLOW_TRANSLATION_MODELS
+static bool is_degenerate_translation(double *p) {
return (p[0] - p[2]) * (p[0] - p[2]) + (p[1] - p[3]) * (p[1] - p[3]) <= 2;
}
+#endif // ALLOW_TRANSLATION_MODELS
-static int is_degenerate_affine(double *p) {
+static bool is_degenerate_affine(double *p) {
return is_collinear3(p, p + 2, p + 4);
}
-static int ransac_translation(int *matched_points, int npoints,
- int *num_inliers_by_motion,
- MotionModel *params_by_motion,
- int num_desired_motions) {
- return ransac(matched_points, npoints, num_inliers_by_motion,
- params_by_motion, num_desired_motions, 3,
- is_degenerate_translation, find_translation,
- project_points_double_translation);
-}
+static const RansacModelInfo ransac_model_info[TRANS_TYPES] = {
+ // IDENTITY
+ { NULL, NULL, NULL, 0 },
+// TRANSLATION
+#if ALLOW_TRANSLATION_MODELS
+ { is_degenerate_translation, find_translation, project_points_translation,
+ 3 },
+#else
+ { NULL, NULL, NULL, 0 },
+#endif
+ // ROTZOOM
+ { is_degenerate_affine, find_rotzoom, project_points_affine, 3 },
+ // AFFINE
+ { is_degenerate_affine, find_affine, project_points_affine, 3 },
+};
-static int ransac_rotzoom(int *matched_points, int npoints,
- int *num_inliers_by_motion,
- MotionModel *params_by_motion,
- int num_desired_motions) {
- return ransac(matched_points, npoints, num_inliers_by_motion,
- params_by_motion, num_desired_motions, 3, is_degenerate_affine,
- find_rotzoom, project_points_double_rotzoom);
-}
+// Returns true on success, false on error
+bool ransac(const Correspondence *matched_points, int npoints,
+ TransformationType type, MotionModel *motion_models,
+ int num_desired_motions) {
+#if ALLOW_TRANSLATION_MODELS
+ assert(type > IDENTITY && type < TRANS_TYPES);
+#else
+ assert(type > TRANSLATION && type < TRANS_TYPES);
+#endif // ALLOW_TRANSLATION_MODELS
-static int ransac_affine(int *matched_points, int npoints,
- int *num_inliers_by_motion,
- MotionModel *params_by_motion,
- int num_desired_motions) {
- return ransac(matched_points, npoints, num_inliers_by_motion,
- params_by_motion, num_desired_motions, 3, is_degenerate_affine,
- find_affine, project_points_double_affine);
-}
-
-RansacFunc av1_get_ransac_type(TransformationType type) {
- switch (type) {
- case AFFINE: return ransac_affine;
- case ROTZOOM: return ransac_rotzoom;
- case TRANSLATION: return ransac_translation;
- default: assert(0); return NULL;
- }
-}
-
-static int ransac_translation_double_prec(double *matched_points, int npoints,
- int *num_inliers_by_motion,
- MotionModel *params_by_motion,
- int num_desired_motions) {
- return ransac_double_prec(matched_points, npoints, num_inliers_by_motion,
- params_by_motion, num_desired_motions, 3,
- is_degenerate_translation, find_translation,
- project_points_double_translation);
-}
-
-static int ransac_rotzoom_double_prec(double *matched_points, int npoints,
- int *num_inliers_by_motion,
- MotionModel *params_by_motion,
- int num_desired_motions) {
- return ransac_double_prec(matched_points, npoints, num_inliers_by_motion,
- params_by_motion, num_desired_motions, 3,
- is_degenerate_affine, find_rotzoom,
- project_points_double_rotzoom);
-}
-
-static int ransac_affine_double_prec(double *matched_points, int npoints,
- int *num_inliers_by_motion,
- MotionModel *params_by_motion,
- int num_desired_motions) {
- return ransac_double_prec(matched_points, npoints, num_inliers_by_motion,
- params_by_motion, num_desired_motions, 3,
- is_degenerate_affine, find_affine,
- project_points_double_affine);
-}
-
-RansacFuncDouble av1_get_ransac_double_prec_type(TransformationType type) {
- switch (type) {
- case AFFINE: return ransac_affine_double_prec;
- case ROTZOOM: return ransac_rotzoom_double_prec;
- case TRANSLATION: return ransac_translation_double_prec;
- default: assert(0); return NULL;
- }
+ return ransac_internal(matched_points, npoints, motion_models,
+ num_desired_motions, &ransac_model_info[type]);
}
diff --git a/aom_dsp/flow_estimation/ransac.h b/aom_dsp/flow_estimation/ransac.h
index aa3a243..6047580 100644
--- a/aom_dsp/flow_estimation/ransac.h
+++ b/aom_dsp/flow_estimation/ransac.h
@@ -16,6 +16,7 @@
#include <stdlib.h>
#include <math.h>
#include <memory.h>
+#include <stdbool.h>
#include "aom_dsp/flow_estimation/flow_estimation.h"
@@ -23,14 +24,9 @@
extern "C" {
#endif
-typedef int (*RansacFunc)(int *matched_points, int npoints,
- int *num_inliers_by_motion,
- MotionModel *params_by_motion, int num_motions);
-typedef int (*RansacFuncDouble)(double *matched_points, int npoints,
- int *num_inliers_by_motion,
- MotionModel *params_by_motion, int num_motions);
-RansacFunc av1_get_ransac_type(TransformationType type);
-RansacFuncDouble av1_get_ransac_double_prec_type(TransformationType type);
+bool ransac(const Correspondence *matched_points, int npoints,
+ TransformationType type, MotionModel *motion_models,
+ int num_desired_motions);
#ifdef __cplusplus
}
diff --git a/aom_dsp/flow_estimation/x86/corner_match_avx2.c b/aom_dsp/flow_estimation/x86/corner_match_avx2.c
index 9830ad8..87c76fa 100644
--- a/aom_dsp/flow_estimation/x86/corner_match_avx2.c
+++ b/aom_dsp/flow_estimation/x86/corner_match_avx2.c
@@ -24,12 +24,13 @@
#error "Need to change byte_mask in corner_match_sse4.c if MATCH_SZ != 13"
#endif
-/* Compute corr(im1, im2) * MATCH_SZ * stddev(im1), where the
+/* Compute corr(frame1, frame2) * MATCH_SZ * stddev(frame1), where the
correlation/standard deviation are taken over MATCH_SZ by MATCH_SZ windows
of each image, centered at (x1, y1) and (x2, y2) respectively.
*/
-double av1_compute_cross_correlation_avx2(unsigned char *im1, int stride1,
- int x1, int y1, unsigned char *im2,
+double av1_compute_cross_correlation_avx2(const unsigned char *frame1,
+ int stride1, int x1, int y1,
+ const unsigned char *frame2,
int stride2, int x2, int y2) {
int i, stride1_i = 0, stride2_i = 0;
__m256i temp1, sum_vec, sumsq2_vec, cross_vec, v, v1_1, v2_1;
@@ -41,13 +42,13 @@
sumsq2_vec = zero;
cross_vec = zero;
- im1 += (y1 - MATCH_SZ_BY2) * stride1 + (x1 - MATCH_SZ_BY2);
- im2 += (y2 - MATCH_SZ_BY2) * stride2 + (x2 - MATCH_SZ_BY2);
+ frame1 += (y1 - MATCH_SZ_BY2) * stride1 + (x1 - MATCH_SZ_BY2);
+ frame2 += (y2 - MATCH_SZ_BY2) * stride2 + (x2 - MATCH_SZ_BY2);
for (i = 0; i < MATCH_SZ; ++i) {
- v1 = _mm_and_si128(_mm_loadu_si128((__m128i *)&im1[stride1_i]), mask);
+ v1 = _mm_and_si128(_mm_loadu_si128((__m128i *)&frame1[stride1_i]), mask);
v1_1 = _mm256_cvtepu8_epi16(v1);
- v2 = _mm_and_si128(_mm_loadu_si128((__m128i *)&im2[stride2_i]), mask);
+ v2 = _mm_and_si128(_mm_loadu_si128((__m128i *)&frame2[stride2_i]), mask);
v2_1 = _mm256_cvtepu8_epi16(v2);
v = _mm256_insertf128_si256(_mm256_castsi128_si256(v1), v2, 1);
diff --git a/aom_dsp/flow_estimation/x86/corner_match_sse4.c b/aom_dsp/flow_estimation/x86/corner_match_sse4.c
index 40eec6c..b3cb5bc 100644
--- a/aom_dsp/flow_estimation/x86/corner_match_sse4.c
+++ b/aom_dsp/flow_estimation/x86/corner_match_sse4.c
@@ -28,12 +28,13 @@
#error "Need to change byte_mask in corner_match_sse4.c if MATCH_SZ != 13"
#endif
-/* Compute corr(im1, im2) * MATCH_SZ * stddev(im1), where the
+/* Compute corr(frame1, frame2) * MATCH_SZ * stddev(frame1), where the
correlation/standard deviation are taken over MATCH_SZ by MATCH_SZ windows
of each image, centered at (x1, y1) and (x2, y2) respectively.
*/
-double av1_compute_cross_correlation_sse4_1(unsigned char *im1, int stride1,
- int x1, int y1, unsigned char *im2,
+double av1_compute_cross_correlation_sse4_1(const unsigned char *frame1,
+ int stride1, int x1, int y1,
+ const unsigned char *frame2,
int stride2, int x2, int y2) {
int i;
// 2 16-bit partial sums in lanes 0, 4 (== 2 32-bit partial sums in lanes 0,
@@ -47,14 +48,14 @@
const __m128i mask = _mm_load_si128((__m128i *)byte_mask);
const __m128i zero = _mm_setzero_si128();
- im1 += (y1 - MATCH_SZ_BY2) * stride1 + (x1 - MATCH_SZ_BY2);
- im2 += (y2 - MATCH_SZ_BY2) * stride2 + (x2 - MATCH_SZ_BY2);
+ frame1 += (y1 - MATCH_SZ_BY2) * stride1 + (x1 - MATCH_SZ_BY2);
+ frame2 += (y2 - MATCH_SZ_BY2) * stride2 + (x2 - MATCH_SZ_BY2);
for (i = 0; i < MATCH_SZ; ++i) {
const __m128i v1 =
- _mm_and_si128(_mm_loadu_si128((__m128i *)&im1[i * stride1]), mask);
+ _mm_and_si128(_mm_loadu_si128((__m128i *)&frame1[i * stride1]), mask);
const __m128i v2 =
- _mm_and_si128(_mm_loadu_si128((__m128i *)&im2[i * stride2]), mask);
+ _mm_and_si128(_mm_loadu_si128((__m128i *)&frame2[i * stride2]), mask);
// Using the 'sad' intrinsic here is a bit faster than adding
// v1_l + v1_r and v2_l + v2_r, plus it avoids the need for a 16->32 bit
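
Both SIMD kernels compute the same window statistic described in the comments above. As a reading aid, here is a scalar sketch of that computation, modeled loosely on the C reference in corner_match.c (treat the exact normalization as an assumption rather than a verbatim copy of the reference):

    #include <math.h>

    #define MATCH_SZ 13
    #define MATCH_SZ_BY2 ((MATCH_SZ - 1) / 2)
    #define MATCH_SZ_SQ (MATCH_SZ * MATCH_SZ)

    // Returns the covariance of the two windows divided by the second
    // window's standard deviation (all in raw-sum units), i.e. a quantity
    // proportional to corr(frame1, frame2) * stddev(frame1).
    static double cross_correlation_ref(const unsigned char *frame1,
                                        int stride1, int x1, int y1,
                                        const unsigned char *frame2,
                                        int stride2, int x2, int y2) {
      int sum1 = 0, sum2 = 0, sumsq2 = 0, cross = 0;
      for (int i = 0; i < MATCH_SZ; ++i) {
        for (int j = 0; j < MATCH_SZ; ++j) {
          const int v1 = frame1[(y1 - MATCH_SZ_BY2 + i) * stride1 +
                                (x1 - MATCH_SZ_BY2 + j)];
          const int v2 = frame2[(y2 - MATCH_SZ_BY2 + i) * stride2 +
                                (x2 - MATCH_SZ_BY2 + j)];
          sum1 += v1;
          sum2 += v2;
          sumsq2 += v2 * v2;
          cross += v1 * v2;
        }
      }
      const int var2 = sumsq2 * MATCH_SZ_SQ - sum2 * sum2;
      const int cov = cross * MATCH_SZ_SQ - sum1 * sum2;
      return cov / sqrt((double)var2);
    }
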
diff --git a/aom_dsp/flow_estimation/x86/disflow_sse4.c b/aom_dsp/flow_estimation/x86/disflow_sse4.c
new file mode 100644
index 0000000..a62e9a4
--- /dev/null
+++ b/aom_dsp/flow_estimation/x86/disflow_sse4.c
@@ -0,0 +1,560 @@
+/*
+ * Copyright (c) 2022, Alliance for Open Media. All rights reserved
+ *
+ * This source code is subject to the terms of the BSD 3-Clause Clear License
+ * and the Alliance for Open Media Patent License 1.0. If the BSD 3-Clause Clear
+ * License was not distributed with this source code in the LICENSE file, you
+ * can obtain it at aomedia.org/license/software-license/bsd-3-c-c/. If the
+ * Alliance for Open Media Patent License 1.0 was not distributed with this
+ * source code in the PATENTS file, you can obtain it at
+ * aomedia.org/license/patent-license/.
+ */
+
+#include <assert.h>
+#include <math.h>
+#include <smmintrin.h>
+
+#include "aom_dsp/aom_dsp_common.h"
+#include "aom_dsp/flow_estimation/disflow.h"
+#include "aom_dsp/x86/synonyms.h"
+
+#include "config/aom_dsp_rtcd.h"
+
+// Internal cross-check against C code
+// If you set this to 1 and compile in debug mode, then the outputs of the two
+// convolution stages will be checked against the plain C version of the code,
+// and an assertion will fire if the results differ.
+#define CHECK_RESULTS 0
+
+// Note: Max sum(+ve coefficients) = 1.125 * scale
+static INLINE void get_cubic_kernel_dbl(double x, double *kernel) {
+ assert(0 <= x && x < 1);
+ double x2 = x * x;
+ double x3 = x2 * x;
+ kernel[0] = -0.5 * x + x2 - 0.5 * x3;
+ kernel[1] = 1.0 - 2.5 * x2 + 1.5 * x3;
+ kernel[2] = 0.5 * x + 2.0 * x2 - 1.5 * x3;
+ kernel[3] = -0.5 * x2 + 0.5 * x3;
+}
+
+static INLINE void get_cubic_kernel_int(double x, int16_t *kernel) {
+ double kernel_dbl[4];
+ get_cubic_kernel_dbl(x, kernel_dbl);
+
+ kernel[0] = (int16_t)rint(kernel_dbl[0] * (1 << DISFLOW_INTERP_BITS));
+ kernel[1] = (int16_t)rint(kernel_dbl[1] * (1 << DISFLOW_INTERP_BITS));
+ kernel[2] = (int16_t)rint(kernel_dbl[2] * (1 << DISFLOW_INTERP_BITS));
+ kernel[3] = (int16_t)rint(kernel_dbl[3] * (1 << DISFLOW_INTERP_BITS));
+}
+
+#if CHECK_RESULTS
+static INLINE int get_cubic_value_int(const int *p, const int16_t *kernel) {
+ return kernel[0] * p[0] + kernel[1] * p[1] + kernel[2] * p[2] +
+ kernel[3] * p[3];
+}
+#endif // CHECK_RESULTS
+
+// Compare two regions of width x height pixels, one rooted at position
+// (x, y) in src and the other at (x + u, y + v) in ref.
+// This function returns the sum of squared pixel differences between
+// the two regions.
+//
+// TODO(rachelbarker): Test speed/quality impact of using bilinear interpolation
+// instead of bicubic interpolation
+static INLINE void compute_flow_error(const uint8_t *src, const uint8_t *ref,
+ int width, int height, int stride, int x,
+ int y, double u, double v, int16_t *dt) {
+ // This function is written to do 8x8 convolutions only
+ assert(DISFLOW_PATCH_SIZE == 8);
+
+ // Split offset into integer and fractional parts, and compute cubic
+ // interpolation kernels
+ const int u_int = (int)floor(u);
+ const int v_int = (int)floor(v);
+ const double u_frac = u - floor(u);
+ const double v_frac = v - floor(v);
+
+ int16_t h_kernel[4];
+ int16_t v_kernel[4];
+ get_cubic_kernel_int(u_frac, h_kernel);
+ get_cubic_kernel_int(v_frac, v_kernel);
+
+ // Storage for intermediate values between the two convolution directions
+ int16_t tmp_[DISFLOW_PATCH_SIZE * (DISFLOW_PATCH_SIZE + 3)];
+ int16_t *tmp = tmp_ + DISFLOW_PATCH_SIZE; // Offset by one row
+
+ // Clamp coordinates so that all pixels we fetch will remain within the
+ // allocated border region, but allow them to go far enough out that
+ // the border pixels' values do not change.
+ // Since we are calculating an 8x8 block, the bottom-right pixel
+ // in the block has coordinates (x0 + 7, y0 + 7). Then, the cubic
+ // interpolation has 4 taps, meaning that the output of pixel
+ // (x_w, y_w) depends on the pixels in the range
+ // ([x_w - 1, x_w + 2], [y_w - 1, y_w + 2]).
+ //
+ // Thus the most extreme coordinates which will be fetched are
+ // (x0 - 1, y0 - 1) and (x0 + 9, y0 + 9).
+ const int x0 = clamp(x + u_int, -9, width);
+ const int y0 = clamp(y + v_int, -9, height);
+
+ // Horizontal convolution
+
+ // Prepare the kernel vectors
+ // We split the kernel into two vectors with kernel indices:
+ // 0, 1, 0, 1, 0, 1, 0, 1, and
+ // 2, 3, 2, 3, 2, 3, 2, 3
+ __m128i h_kernel_01 = xx_set2_epi16(h_kernel[0], h_kernel[1]);
+ __m128i h_kernel_23 = xx_set2_epi16(h_kernel[2], h_kernel[3]);
+
+ __m128i round_const_h = _mm_set1_epi32(1 << (DISFLOW_INTERP_BITS - 6 - 1));
+
+ for (int i = -1; i < DISFLOW_PATCH_SIZE + 2; ++i) {
+ const int y_w = y0 + i;
+ const uint8_t *ref_row = &ref[y_w * stride + (x0 - 1)];
+ int16_t *tmp_row = &tmp[i * DISFLOW_PATCH_SIZE];
+
+ // Load this row of pixels.
+ // For an 8x8 patch, we need to load the 8 image pixels + 3 extras,
+ // for a total of 11 pixels. Here we load 16 pixels, but only use
+ // the first 11.
+ __m128i row = _mm_loadu_si128((__m128i *)ref_row);
+
+ // Expand pixels to int16s
+ __m128i px_0to7_i16 = _mm_cvtepu8_epi16(row);
+ __m128i px_4to10_i16 = _mm_cvtepu8_epi16(_mm_srli_si128(row, 4));
+
+ // The convolution sums below use _mm_madd_epi16(), which multiplies
+ // pointwise and then sums adjacent pairs.
+
+ // Compute first four outputs
+ // input pixels 0, 1, 1, 2, 2, 3, 3, 4
+ // * kernel 0, 1, 0, 1, 0, 1, 0, 1
+ __m128i px0 =
+ _mm_unpacklo_epi16(px_0to7_i16, _mm_srli_si128(px_0to7_i16, 2));
+ // input pixels 2, 3, 3, 4, 4, 5, 5, 6
+ // * kernel 2, 3, 2, 3, 2, 3, 2, 3
+ __m128i px1 = _mm_unpacklo_epi16(_mm_srli_si128(px_0to7_i16, 4),
+ _mm_srli_si128(px_0to7_i16, 6));
+ // Convolve with kernel and sum 2x2 boxes to form first 4 outputs
+ __m128i sum0 = _mm_add_epi32(_mm_madd_epi16(px0, h_kernel_01),
+ _mm_madd_epi16(px1, h_kernel_23));
+
+ __m128i out0 = _mm_srai_epi32(_mm_add_epi32(sum0, round_const_h),
+ DISFLOW_INTERP_BITS - 6);
+
+ // Compute second four outputs
+ __m128i px2 =
+ _mm_unpacklo_epi16(px_4to10_i16, _mm_srli_si128(px_4to10_i16, 2));
+ __m128i px3 = _mm_unpacklo_epi16(_mm_srli_si128(px_4to10_i16, 4),
+ _mm_srli_si128(px_4to10_i16, 6));
+ __m128i sum1 = _mm_add_epi32(_mm_madd_epi16(px2, h_kernel_01),
+ _mm_madd_epi16(px3, h_kernel_23));
+
+ // Round by just enough bits that the result is guaranteed to fit into
+ // an int16_t. Then the next stage can use 16 x 16 -> 32 bit multiplies,
+ // which should be a fair bit faster than the 32 x 32 -> 32 bit
+ // multiplies it would otherwise need.
+ // This means shifting down so we have 6 extra bits, for a maximum value
+ // of +18360, which can occur if u_frac == 0.5 and the input pixels are
+ // {0, 255, 255, 0}.
+ __m128i out1 = _mm_srai_epi32(_mm_add_epi32(sum1, round_const_h),
+ DISFLOW_INTERP_BITS - 6);
+
+ _mm_storeu_si128((__m128i *)tmp_row, _mm_packs_epi32(out0, out1));
+
+#if CHECK_RESULTS && !defined(NDEBUG)
+ // Cross-check
+ for (int j = 0; j < DISFLOW_PATCH_SIZE; ++j) {
+ const int x_w = x0 + j;
+ int arr[4];
+
+ arr[0] = (int)ref[y_w * stride + (x_w - 1)];
+ arr[1] = (int)ref[y_w * stride + (x_w + 0)];
+ arr[2] = (int)ref[y_w * stride + (x_w + 1)];
+ arr[3] = (int)ref[y_w * stride + (x_w + 2)];
+
+ // Apply kernel and round, keeping 6 extra bits of precision.
+ //
+ // 6 is the maximum allowable number of extra bits which will avoid
+ // the intermediate values overflowing an int16_t. The most extreme
+ // intermediate value occurs when:
+ // * The input pixels are [0, 255, 255, 0]
+ // * u_frac = 0.5
+ // In this case, the un-scaled output is 255 * 1.125 = 286.875.
+ // As an integer with 6 fractional bits, that is 18360, which fits
+ // in an int16_t. But with 7 fractional bits it would be 36720,
+ // which is too large.
+ const int c_value = ROUND_POWER_OF_TWO(get_cubic_value_int(arr, h_kernel),
+ DISFLOW_INTERP_BITS - 6);
+ (void)c_value; // Suppress warnings
+ assert(tmp_row[j] == c_value);
+ }
+#endif // CHECK_RESULTS
+ }
+
+ // Vertical convolution
+ const int round_bits = DISFLOW_INTERP_BITS + 6 - DISFLOW_DERIV_SCALE_LOG2;
+ __m128i round_const_v = _mm_set1_epi32(1 << (round_bits - 1));
+
+ __m128i v_kernel_01 = xx_set2_epi16(v_kernel[0], v_kernel[1]);
+ __m128i v_kernel_23 = xx_set2_epi16(v_kernel[2], v_kernel[3]);
+
+ for (int i = 0; i < DISFLOW_PATCH_SIZE; ++i) {
+ int16_t *tmp_row = &tmp[i * DISFLOW_PATCH_SIZE];
+
+ // Load 4 rows of 8 x 16-bit values
+ __m128i px0 = _mm_loadu_si128((__m128i *)(tmp_row - DISFLOW_PATCH_SIZE));
+ __m128i px1 = _mm_loadu_si128((__m128i *)tmp_row);
+ __m128i px2 = _mm_loadu_si128((__m128i *)(tmp_row + DISFLOW_PATCH_SIZE));
+ __m128i px3 =
+ _mm_loadu_si128((__m128i *)(tmp_row + 2 * DISFLOW_PATCH_SIZE));
+
+ // We want to calculate px0 * v_kernel[0] + px1 * v_kernel[1] + ... ,
+ // but each multiply expands its output to 32 bits. So we need to be
+ // a little clever about how we do this
+ __m128i sum0 = _mm_add_epi32(
+ _mm_madd_epi16(_mm_unpacklo_epi16(px0, px1), v_kernel_01),
+ _mm_madd_epi16(_mm_unpacklo_epi16(px2, px3), v_kernel_23));
+ __m128i sum1 = _mm_add_epi32(
+ _mm_madd_epi16(_mm_unpackhi_epi16(px0, px1), v_kernel_01),
+ _mm_madd_epi16(_mm_unpackhi_epi16(px2, px3), v_kernel_23));
+
+ __m128i sum0_rounded =
+ _mm_srai_epi32(_mm_add_epi32(sum0, round_const_v), round_bits);
+ __m128i sum1_rounded =
+ _mm_srai_epi32(_mm_add_epi32(sum1, round_const_v), round_bits);
+
+ __m128i warped = _mm_packs_epi32(sum0_rounded, sum1_rounded);
+ __m128i src_pixels_u8 =
+ _mm_loadl_epi64((__m128i *)&src[(y + i) * stride + x]);
+ __m128i src_pixels = _mm_slli_epi16(_mm_cvtepu8_epi16(src_pixels_u8), 3);
+
+ // Calculate delta from the target patch
+ __m128i err = _mm_sub_epi16(warped, src_pixels);
+ _mm_storeu_si128((__m128i *)&dt[i * DISFLOW_PATCH_SIZE], err);
+
+#if CHECK_RESULTS
+ for (int j = 0; j < DISFLOW_PATCH_SIZE; ++j) {
+ int16_t *p = &tmp[i * DISFLOW_PATCH_SIZE + j];
+ int arr[4] = { p[-DISFLOW_PATCH_SIZE], p[0], p[DISFLOW_PATCH_SIZE],
+ p[2 * DISFLOW_PATCH_SIZE] };
+ const int result = get_cubic_value_int(arr, v_kernel);
+
+ // Apply kernel and round.
+ // This time, we have to round off the 6 extra bits which were kept
+ // earlier, but we also want to keep DISFLOW_DERIV_SCALE_LOG2 extra bits
+ // of precision to match the scale of the dx and dy arrays.
+ const int c_warped = ROUND_POWER_OF_TWO(result, round_bits);
+ const int c_src_px = src[(x + j) + (y + i) * stride] << 3;
+ const int c_err = c_warped - c_src_px;
+ (void)c_err;
+ assert(dt[i * DISFLOW_PATCH_SIZE + j] == c_err);
+ }
+#endif // CHECK_RESULTS
+ }
+}
+
+static INLINE void sobel_filter_x(const uint8_t *src, int src_stride,
+ int16_t *dst, int dst_stride) {
+ int16_t tmp_[DISFLOW_PATCH_SIZE * (DISFLOW_PATCH_SIZE + 2)];
+ int16_t *tmp = tmp_ + DISFLOW_PATCH_SIZE;
+#if CHECK_RESULTS
+ const int taps = 3;
+#endif // CHECK_RESULTS
+
+ // Horizontal filter
+ // As the kernel is simply {1, 0, -1}, we implement this as simply
+ // out[x] = image[x-1] - image[x+1]
+ // rather than doing a "proper" convolution operation
+ for (int y = -1; y < DISFLOW_PATCH_SIZE + 1; ++y) {
+ const uint8_t *src_row = src + y * src_stride;
+ int16_t *tmp_row = tmp + y * DISFLOW_PATCH_SIZE;
+
+ // Load pixels and expand to 16 bits
+ __m128i row = _mm_loadu_si128((__m128i *)(src_row - 1));
+ __m128i px0 = _mm_cvtepu8_epi16(row);
+ __m128i px2 = _mm_cvtepu8_epi16(_mm_srli_si128(row, 2));
+
+ __m128i out = _mm_sub_epi16(px0, px2);
+
+ // Store to intermediate array
+ _mm_storeu_si128((__m128i *)tmp_row, out);
+
+#if CHECK_RESULTS
+ // Cross-check
+ static const int16_t h_kernel[3] = { 1, 0, -1 };
+ for (int x = 0; x < DISFLOW_PATCH_SIZE; ++x) {
+ int sum = 0;
+ for (int k = 0; k < taps; ++k) {
+ sum += h_kernel[k] * src_row[x + k - 1];
+ }
+ (void)sum;
+ assert(tmp_row[x] == sum);
+ }
+#endif // CHECK_RESULTS
+ }
+
+ // Vertical filter
+ // Here the kernel is {1, 2, 1}, which can be implemented
+ // with simple sums rather than multiplies and adds.
+ // In order to minimize dependency chains, we evaluate in the order
+ // (image[y - 1] + image[y + 1]) + (image[y] << 1)
+ // This way, the first addition and the shift can happen in parallel
+ for (int y = 0; y < DISFLOW_PATCH_SIZE; ++y) {
+ const int16_t *tmp_row = tmp + y * DISFLOW_PATCH_SIZE;
+ int16_t *dst_row = dst + y * dst_stride;
+
+ __m128i px0 = _mm_loadu_si128((__m128i *)(tmp_row - DISFLOW_PATCH_SIZE));
+ __m128i px1 = _mm_loadu_si128((__m128i *)tmp_row);
+ __m128i px2 = _mm_loadu_si128((__m128i *)(tmp_row + DISFLOW_PATCH_SIZE));
+
+ __m128i out =
+ _mm_add_epi16(_mm_add_epi16(px0, px2), _mm_slli_epi16(px1, 1));
+
+ _mm_storeu_si128((__m128i *)dst_row, out);
+
+#if CHECK_RESULTS
+ static const int16_t v_kernel[3] = { 1, 2, 1 };
+ for (int x = 0; x < DISFLOW_PATCH_SIZE; ++x) {
+ int sum = 0;
+ for (int k = 0; k < taps; ++k) {
+ sum += v_kernel[k] * tmp[(y + k - 1) * DISFLOW_PATCH_SIZE + x];
+ }
+ (void)sum;
+ assert(dst_row[x] == sum);
+ }
+#endif // CHECK_RESULTS
+ }
+}
+
+static INLINE void sobel_filter_y(const uint8_t *src, int src_stride,
+ int16_t *dst, int dst_stride) {
+ int16_t tmp_[DISFLOW_PATCH_SIZE * (DISFLOW_PATCH_SIZE + 2)];
+ int16_t *tmp = tmp_ + DISFLOW_PATCH_SIZE;
+#if CHECK_RESULTS
+ const int taps = 3;
+#endif // CHECK_RESULTS
+
+ // Horizontal filter
+ // Here the kernel is {1, 2, 1}, which can be implemented
+ // with simple sums rather than multiplies and adds.
+ // In order to minimize dependency chains, we evaluate in the order
+ // (image[y - 1] + image[y + 1]) + (image[y] << 1)
+ // This way, the first addition and the shift can happen in parallel
+ for (int y = -1; y < DISFLOW_PATCH_SIZE + 1; ++y) {
+ const uint8_t *src_row = src + y * src_stride;
+ int16_t *tmp_row = tmp + y * DISFLOW_PATCH_SIZE;
+
+ // Load pixels and expand to 16 bits
+ __m128i row = _mm_loadu_si128((__m128i *)(src_row - 1));
+ __m128i px0 = _mm_cvtepu8_epi16(row);
+ __m128i px1 = _mm_cvtepu8_epi16(_mm_srli_si128(row, 1));
+ __m128i px2 = _mm_cvtepu8_epi16(_mm_srli_si128(row, 2));
+
+ __m128i out =
+ _mm_add_epi16(_mm_add_epi16(px0, px2), _mm_slli_epi16(px1, 1));
+
+ // Store to intermediate array
+ _mm_storeu_si128((__m128i *)tmp_row, out);
+
+#if CHECK_RESULTS
+ // Cross-check
+ static const int16_t h_kernel[3] = { 1, 2, 1 };
+ for (int x = 0; x < DISFLOW_PATCH_SIZE; ++x) {
+ int sum = 0;
+ for (int k = 0; k < taps; ++k) {
+ sum += h_kernel[k] * src_row[x + k - 1];
+ }
+ (void)sum;
+ assert(tmp_row[x] == sum);
+ }
+#endif // CHECK_RESULTS
+ }
+
+ // Vertical filter
+ // As the kernel is simply {1, 0, -1}, we implement this as simply
+ // out[x] = image[x-1] - image[x+1]
+ // rather than doing a "proper" convolution operation
+ for (int y = 0; y < DISFLOW_PATCH_SIZE; ++y) {
+ const int16_t *tmp_row = tmp + y * DISFLOW_PATCH_SIZE;
+ int16_t *dst_row = dst + y * dst_stride;
+
+ __m128i px0 = _mm_loadu_si128((__m128i *)(tmp_row - DISFLOW_PATCH_SIZE));
+ __m128i px2 = _mm_loadu_si128((__m128i *)(tmp_row + DISFLOW_PATCH_SIZE));
+
+ __m128i out = _mm_sub_epi16(px0, px2);
+
+ _mm_storeu_si128((__m128i *)dst_row, out);
+
+#if CHECK_RESULTS
+ static const int16_t v_kernel[3] = { 1, 0, -1 };
+ for (int x = 0; x < DISFLOW_PATCH_SIZE; ++x) {
+ int sum = 0;
+ for (int k = 0; k < taps; ++k) {
+ sum += v_kernel[k] * tmp[(y + k - 1) * DISFLOW_PATCH_SIZE + x];
+ }
+ (void)sum;
+ assert(dst_row[x] == sum);
+ }
+#endif // CHECK_RESULTS
+ }
+}
+
+static INLINE void compute_flow_vector(const int16_t *dx, int dx_stride,
+ const int16_t *dy, int dy_stride,
+ const int16_t *dt, int dt_stride,
+ int *b) {
+ __m128i b0_acc = _mm_setzero_si128();
+ __m128i b1_acc = _mm_setzero_si128();
+
+ for (int i = 0; i < DISFLOW_PATCH_SIZE; i++) {
+ // Need to load 8 values of dx, 8 of dy, 8 of dt, which conveniently
+ // works out to one register each. Then just calculate dx * dt, dy * dt,
+ // and (implicitly) sum horizontally in pairs.
+ // This gives four 32-bit partial sums for each of b[0] and b[1],
+ // which can be accumulated and summed at the end.
+ __m128i dx_row = _mm_loadu_si128((__m128i *)&dx[i * dx_stride]);
+ __m128i dy_row = _mm_loadu_si128((__m128i *)&dy[i * dy_stride]);
+ __m128i dt_row = _mm_loadu_si128((__m128i *)&dt[i * dt_stride]);
+
+ b0_acc = _mm_add_epi32(b0_acc, _mm_madd_epi16(dx_row, dt_row));
+ b1_acc = _mm_add_epi32(b1_acc, _mm_madd_epi16(dy_row, dt_row));
+ }
+
+ // We need to set b[0] = sum(b0_acc), b[1] = sum(b1_acc).
+ // We might as well use a `hadd` instruction to do 4 of the additions
+ // needed here. Then that just leaves two more additions, which can be
+ // done in scalar code
+ __m128i partial_sum = _mm_hadd_epi32(b0_acc, b1_acc);
+ b[0] = _mm_extract_epi32(partial_sum, 0) + _mm_extract_epi32(partial_sum, 1);
+ b[1] = _mm_extract_epi32(partial_sum, 2) + _mm_extract_epi32(partial_sum, 3);
+
+#if CHECK_RESULTS
+ int c_result[2] = { 0 };
+
+ for (int i = 0; i < DISFLOW_PATCH_SIZE; i++) {
+ for (int j = 0; j < DISFLOW_PATCH_SIZE; j++) {
+ c_result[0] += dx[i * dx_stride + j] * dt[i * dt_stride + j];
+ c_result[1] += dy[i * dy_stride + j] * dt[i * dt_stride + j];
+ }
+ }
+
+ assert(b[0] == c_result[0]);
+ assert(b[1] == c_result[1]);
+#endif // CHECK_RESULTS
+}
+
+static INLINE void compute_flow_matrix(const int16_t *dx, int dx_stride,
+ const int16_t *dy, int dy_stride,
+ double *M) {
+ __m128i acc[4] = { 0 };
+
+ for (int i = 0; i < DISFLOW_PATCH_SIZE; i++) {
+ __m128i dx_row = _mm_loadu_si128((__m128i *)&dx[i * dx_stride]);
+ __m128i dy_row = _mm_loadu_si128((__m128i *)&dy[i * dy_stride]);
+
+ acc[0] = _mm_add_epi32(acc[0], _mm_madd_epi16(dx_row, dx_row));
+ acc[1] = _mm_add_epi32(acc[1], _mm_madd_epi16(dx_row, dy_row));
+ // Don't compute acc[2], as it should be equal to acc[1]
+ acc[3] = _mm_add_epi32(acc[3], _mm_madd_epi16(dy_row, dy_row));
+ }
+
+ // Condense sums
+ __m128i partial_sum_0 = _mm_hadd_epi32(acc[0], acc[1]);
+ __m128i partial_sum_1 = _mm_hadd_epi32(acc[1], acc[3]);
+ __m128i result = _mm_hadd_epi32(partial_sum_0, partial_sum_1);
+
+ // Apply regularization
+ // We follow the standard regularization method of adding `k * I` before
+ // inverting. This ensures that the matrix will be invertible.
+ //
+ // Setting the regularization strength k to 1 seems to work well here, as
+ // typical values coming from the other equations are very large (1e5 to
+ // 1e6, with an upper limit of around 6e7, at the time of writing).
+ // It also preserves the property that all matrix values are whole numbers,
+ // which is convenient for integerized SIMD implementation.
+ result = _mm_add_epi32(result, _mm_set_epi32(1, 0, 0, 1));
+
+#if CHECK_RESULTS
+ int tmp[4] = { 0 };
+
+ for (int i = 0; i < DISFLOW_PATCH_SIZE; i++) {
+ for (int j = 0; j < DISFLOW_PATCH_SIZE; j++) {
+ tmp[0] += dx[i * dx_stride + j] * dx[i * dx_stride + j];
+ tmp[1] += dx[i * dx_stride + j] * dy[i * dy_stride + j];
+ // Don't compute tmp[2], as it should be equal to tmp[1]
+ tmp[3] += dy[i * dy_stride + j] * dy[i * dy_stride + j];
+ }
+ }
+
+ // Apply regularization
+ tmp[0] += 1;
+ tmp[3] += 1;
+
+ tmp[2] = tmp[1];
+
+ assert(tmp[0] == _mm_extract_epi32(result, 0));
+ assert(tmp[1] == _mm_extract_epi32(result, 1));
+ assert(tmp[2] == _mm_extract_epi32(result, 2));
+ assert(tmp[3] == _mm_extract_epi32(result, 3));
+#endif // CHECK_RESULTS
+
+ // Convert results to doubles and store
+ _mm_storeu_pd(M, _mm_cvtepi32_pd(result));
+ _mm_storeu_pd(M + 2, _mm_cvtepi32_pd(_mm_srli_si128(result, 8)));
+}
+
+// Try to invert the matrix M
+// Note: Due to the nature of how a least-squares matrix is constructed, all of
+// the eigenvalues will be >= 0, and therefore det M >= 0 as well.
+// The regularization term `+ k * I` further ensures that det M >= k^2.
+// As mentioned in compute_flow_matrix(), here we use k = 1, so det M >= 1.
+// So we don't have to worry about non-invertible matrices here.
+static INLINE void invert_2x2(const double *M, double *M_inv) {
+ double det = (M[0] * M[3]) - (M[1] * M[2]);
+ assert(det >= 1);
+ const double det_inv = 1 / det;
+
+ M_inv[0] = M[3] * det_inv;
+ M_inv[1] = -M[1] * det_inv;
+ M_inv[2] = -M[2] * det_inv;
+ M_inv[3] = M[0] * det_inv;
+}
+
+void aom_compute_flow_at_point_sse4_1(const uint8_t *src, const uint8_t *ref,
+ int x, int y, int width, int height,
+ int stride, double *u, double *v) {
+ double M[4];
+ double M_inv[4];
+ int b[2];
+ int16_t dt[DISFLOW_PATCH_SIZE * DISFLOW_PATCH_SIZE];
+ int16_t dx[DISFLOW_PATCH_SIZE * DISFLOW_PATCH_SIZE];
+ int16_t dy[DISFLOW_PATCH_SIZE * DISFLOW_PATCH_SIZE];
+
+ // Compute gradients within this patch
+ const uint8_t *src_patch = &src[y * stride + x];
+ sobel_filter_x(src_patch, stride, dx, DISFLOW_PATCH_SIZE);
+ sobel_filter_y(src_patch, stride, dy, DISFLOW_PATCH_SIZE);
+
+ compute_flow_matrix(dx, DISFLOW_PATCH_SIZE, dy, DISFLOW_PATCH_SIZE, M);
+ invert_2x2(M, M_inv);
+
+ for (int itr = 0; itr < DISFLOW_MAX_ITR; itr++) {
+ compute_flow_error(src, ref, width, height, stride, x, y, *u, *v, dt);
+ compute_flow_vector(dx, DISFLOW_PATCH_SIZE, dy, DISFLOW_PATCH_SIZE, dt,
+ DISFLOW_PATCH_SIZE, b);
+
+ // Solve flow equations to find a better estimate for the flow vector
+ // at this point
+ const double step_u = M_inv[0] * b[0] + M_inv[1] * b[1];
+ const double step_v = M_inv[2] * b[0] + M_inv[3] * b[1];
+ *u += fclamp(step_u * DISFLOW_STEP_SIZE, -2, 2);
+ *v += fclamp(step_v * DISFLOW_STEP_SIZE, -2, 2);
+
+ if (fabs(step_u) + fabs(step_v) < DISFLOW_STEP_SIZE_THRESOLD) {
+ // Stop iteration when we're close to convergence
+ break;
+ }
+ }
+}
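
The 18360 bound quoted in the CHECK_RESULTS comments above can be verified with a few lines of standalone C. In this sketch DISFLOW_INTERP_BITS is assumed to be 14, matching its definition in disflow.h at the time of writing; verify against your tree:

    #include <assert.h>
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    #define INTERP_BITS 14  // assumed value of DISFLOW_INTERP_BITS

    // Same cubic (Catmull-Rom) weights as get_cubic_kernel_dbl() above.
    static void cubic_kernel(double x, double *k) {
      const double x2 = x * x, x3 = x2 * x;
      k[0] = -0.5 * x + x2 - 0.5 * x3;
      k[1] = 1.0 - 2.5 * x2 + 1.5 * x3;
      k[2] = 0.5 * x + 2.0 * x2 - 1.5 * x3;
      k[3] = -0.5 * x2 + 0.5 * x3;
    }

    int main(void) {
      double k[4];
      cubic_kernel(0.5, k);  // worst case: weights {-1/16, 9/16, 9/16, -1/16}
      int16_t ki[4];
      for (int i = 0; i < 4; ++i)
        ki[i] = (int16_t)rint(k[i] * (1 << INTERP_BITS));
      // Worst-case input {0, 255, 255, 0}, rounded to 6 fractional bits.
      const int raw = ki[1] * 255 + ki[2] * 255;
      const int out =
          (raw + (1 << (INTERP_BITS - 6 - 1))) >> (INTERP_BITS - 6);
      printf("max intermediate = %d\n", out);  // prints 18360
      assert(out <= INT16_MAX);  // fits in an int16_t, as claimed
      return 0;
    }
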
diff --git a/aom_dsp/fwd_txfm.c b/aom_dsp/fwd_txfm.c
index 3d30444..5503501 100644
--- a/aom_dsp/fwd_txfm.c
+++ b/aom_dsp/fwd_txfm.c
@@ -16,19 +16,16 @@
void aom_fdct4x4_c(const int16_t *input, tran_low_t *output, int stride) {
// The 2D transform is done with two passes which are actually pretty
// similar. In the first one, we transform the columns and transpose
- // the results. In the second one, we transform the rows. To achieve that,
- // as the first pass results are transposed, we transpose the columns (that
- // is the transposed rows) and transpose the results (so that it goes back
- // in normal/row positions).
+ // the results. In the second one, we transform the rows.
// We need an intermediate buffer between passes.
tran_low_t intermediate[4 * 4];
const tran_low_t *in_low = NULL;
tran_low_t *out = intermediate;
- // Do the two transform/transpose passes
+ // Do the two transform passes
for (int pass = 0; pass < 2; ++pass) {
- tran_high_t in_high[4]; // canbe16
- tran_high_t step[4]; // canbe16
- tran_high_t temp1, temp2; // needs32
+ tran_high_t in_high[4]; // canbe16
+ tran_high_t step[4]; // canbe16
+ tran_low_t temp[4];
for (int i = 0; i < 4; ++i) {
// Load inputs.
if (pass == 0) {
@@ -39,30 +36,40 @@
if (i == 0 && in_high[0]) {
++in_high[0];
}
+ ++input; // Next column
} else {
assert(in_low != NULL);
in_high[0] = in_low[0 * 4];
in_high[1] = in_low[1 * 4];
in_high[2] = in_low[2 * 4];
in_high[3] = in_low[3 * 4];
- ++in_low;
+ ++in_low; // Next column (which is a transposed row)
}
// Transform.
step[0] = in_high[0] + in_high[3];
step[1] = in_high[1] + in_high[2];
step[2] = in_high[1] - in_high[2];
step[3] = in_high[0] - in_high[3];
- temp1 = (step[0] + step[1]) * cospi_16_64;
- temp2 = (step[0] - step[1]) * cospi_16_64;
- out[0] = (tran_low_t)fdct_round_shift(temp1);
- out[2] = (tran_low_t)fdct_round_shift(temp2);
- temp1 = step[2] * cospi_24_64 + step[3] * cospi_8_64;
- temp2 = -step[2] * cospi_8_64 + step[3] * cospi_24_64;
- out[1] = (tran_low_t)fdct_round_shift(temp1);
- out[3] = (tran_low_t)fdct_round_shift(temp2);
- // Do next column (which is a transposed row in second/horizontal pass)
- ++input;
- out += 4;
+ temp[0] = (tran_low_t)fdct_round_shift((step[0] + step[1]) * cospi_16_64);
+ temp[2] = (tran_low_t)fdct_round_shift((step[0] - step[1]) * cospi_16_64);
+ temp[1] = (tran_low_t)fdct_round_shift(step[2] * cospi_24_64 +
+ step[3] * cospi_8_64);
+ temp[3] = (tran_low_t)fdct_round_shift(-step[2] * cospi_8_64 +
+ step[3] * cospi_24_64);
+ // Only transpose the first pass.
+ if (pass == 0) {
+ out[0] = temp[0];
+ out[1] = temp[1];
+ out[2] = temp[2];
+ out[3] = temp[3];
+ out += 4;
+ } else {
+ out[0 * 4] = temp[0];
+ out[1 * 4] = temp[1];
+ out[2 * 4] = temp[2];
+ out[3 * 4] = temp[3];
+ ++out;
+ }
}
// Setup in/out for next pass.
in_low = intermediate;
@@ -78,19 +85,16 @@
void aom_fdct4x4_lp_c(const int16_t *input, int16_t *output, int stride) {
// The 2D transform is done with two passes which are actually pretty
// similar. In the first one, we transform the columns and transpose
- // the results. In the second one, we transform the rows. To achieve that,
- // as the first pass results are transposed, we transpose the columns (that
- // is the transposed rows) and transpose the results (so that it goes back
- // in normal/row positions).
+ // the results. In the second one, we transform the rows.
// We need an intermediate buffer between passes.
int16_t intermediate[4 * 4];
const int16_t *in_low = NULL;
int16_t *out = intermediate;
- // Do the two transform/transpose passes
+ // Do the two transform passes
for (int pass = 0; pass < 2; ++pass) {
- int32_t in_high[4]; // canbe16
- int32_t step[4]; // canbe16
- int32_t temp1, temp2; // needs32
+ int32_t in_high[4]; // canbe16
+ int32_t step[4]; // canbe16
+ int16_t temp[4];
for (int i = 0; i < 4; ++i) {
// Load inputs.
if (pass == 0) {
@@ -98,6 +102,7 @@
in_high[1] = input[1 * stride] * 16;
in_high[2] = input[2 * stride] * 16;
in_high[3] = input[3 * stride] * 16;
+ ++input;
if (i == 0 && in_high[0]) {
++in_high[0];
}
@@ -114,17 +119,26 @@
step[1] = in_high[1] + in_high[2];
step[2] = in_high[1] - in_high[2];
step[3] = in_high[0] - in_high[3];
- temp1 = (step[0] + step[1]) * (int32_t)cospi_16_64;
- temp2 = (step[0] - step[1]) * (int32_t)cospi_16_64;
- out[0] = (int16_t)fdct_round_shift(temp1);
- out[2] = (int16_t)fdct_round_shift(temp2);
- temp1 = step[2] * (int32_t)cospi_24_64 + step[3] * (int32_t)cospi_8_64;
- temp2 = -step[2] * (int32_t)cospi_8_64 + step[3] * (int32_t)cospi_24_64;
- out[1] = (int16_t)fdct_round_shift(temp1);
- out[3] = (int16_t)fdct_round_shift(temp2);
- // Do next column (which is a transposed row in second/horizontal pass)
- ++input;
- out += 4;
+ temp[0] = (int16_t)fdct_round_shift((step[0] + step[1]) * cospi_16_64);
+ temp[2] = (int16_t)fdct_round_shift((step[0] - step[1]) * cospi_16_64);
+ temp[1] = (int16_t)fdct_round_shift(step[2] * cospi_24_64 +
+ step[3] * cospi_8_64);
+ temp[3] = (int16_t)fdct_round_shift(-step[2] * cospi_8_64 +
+ step[3] * cospi_24_64);
+ // Only transpose the first pass.
+ if (pass == 0) {
+ out[0] = temp[0];
+ out[1] = temp[1];
+ out[2] = temp[2];
+ out[3] = temp[3];
+ out += 4;
+ } else {
+ out[0 * 4] = temp[0];
+ out[1 * 4] = temp[1];
+ out[2 * 4] = temp[2];
+ out[3 * 4] = temp[3];
+ ++out;
+ }
}
// Setup in/out for next pass.
in_low = intermediate;
@@ -137,6 +151,7 @@
}
}
+#if CONFIG_INTERNAL_STATS
void aom_fdct8x8_c(const int16_t *input, tran_low_t *final_output, int stride) {
int i, j;
tran_low_t intermediate[64];
@@ -220,8 +235,9 @@
for (j = 0; j < 8; ++j) final_output[j + i * 8] /= 2;
}
}
+#endif // CONFIG_INTERNAL_STATS
-#if CONFIG_AV1_HIGHBITDEPTH
+#if CONFIG_AV1_HIGHBITDEPTH && CONFIG_INTERNAL_STATS
void aom_highbd_fdct8x8_c(const int16_t *input, tran_low_t *final_output,
int stride) {
aom_fdct8x8_c(input, final_output, stride);
diff --git a/aom_dsp/grain_table.h b/aom_dsp/grain_table.h
index 3f75101..49e8498 100644
--- a/aom_dsp/grain_table.h
+++ b/aom_dsp/grain_table.h
@@ -52,7 +52,7 @@
/*!\brief Add a mapping from [time_stamp, end_time) to the given grain
* parameters
*
- * \param[in/out] table The grain table
+ * \param[in,out] table The grain table
* \param[in] time_stamp The start time stamp
* \param[in] end_stamp The end time_stamp
* \param[in] grain The grain parameters
diff --git a/aom_dsp/mathutils.h b/aom_dsp/mathutils.h
index 22b0202..cbb6cf4 100644
--- a/aom_dsp/mathutils.h
+++ b/aom_dsp/mathutils.h
@@ -63,32 +63,51 @@
// Solves for n-dim x in a least squares sense to minimize |Ax - b|^2
// The solution is simply x = (A'A)^-1 A'b or simply the solution for
// the system: A'A x = A'b
-static INLINE int least_squares(int n, double *A, int rows, int stride,
- double *b, double *scratch, double *x) {
- int i, j, k;
- double *scratch_ = NULL;
- double *AtA, *Atb;
- if (!scratch) {
- scratch_ = (double *)aom_malloc(sizeof(*scratch) * n * (n + 1));
- if (!scratch_) return 0;
- scratch = scratch_;
- }
- AtA = scratch;
- Atb = scratch + n * n;
+//
+// This process is split into three steps in order to avoid needing to
+// explicitly allocate the A matrix, which may be very large if there
+// are many equations to solve.
+//
+// The process for using this is (in pseudocode):
+//
+// Allocate mat (size n*n), y (size n), a (size n), x (size n)
+// least_squares_init(mat, y, n)
+// for each equation a . x = b {
+// least_squares_accumulate(mat, y, a, b, n)
+// }
+// least_squares_solve(mat, y, x, n)
+//
+// where:
+// * mat, y are accumulators for the values A'A and A'b respectively,
+// * a, b are the coefficients of each individual equation,
+// * x is the result vector
+// * and n is the problem size
+static INLINE void least_squares_init(double *mat, double *y, int n) {
+ memset(mat, 0, n * n * sizeof(double));
+ memset(y, 0, n * sizeof(double));
+}
- for (i = 0; i < n; ++i) {
- for (j = i; j < n; ++j) {
- AtA[i * n + j] = 0.0;
- for (k = 0; k < rows; ++k)
- AtA[i * n + j] += A[k * stride + i] * A[k * stride + j];
- AtA[j * n + i] = AtA[i * n + j];
+// Round the given positive value to nearest integer
+static AOM_FORCE_INLINE int iroundpf(float x) {
+ assert(x >= 0.0);
+ return (int)(x + 0.5f);
+}
+
+static INLINE void least_squares_accumulate(double *mat, double *y,
+ const double *a, double b, int n) {
+ for (int i = 0; i < n; i++) {
+ for (int j = 0; j < n; j++) {
+ mat[i * n + j] += a[i] * a[j];
}
- Atb[i] = 0;
- for (k = 0; k < rows; ++k) Atb[i] += A[k * stride + i] * b[k];
}
- int ret = linsolve(n, AtA, n, Atb, x);
- aom_free(scratch_);
- return ret;
+ for (int i = 0; i < n; i++) {
+ y[i] += a[i] * b;
+ }
+}
+
+static INLINE int least_squares_solve(double *mat, double *y, double *x,
+ int n) {
+ return linsolve(n, mat, n, y, x);
}
// Matrix multiply
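
The pseudocode in the comment above maps directly onto code. A minimal illustrative sketch (not libaom code), fitting y = m*t + c to three points, assuming this header and its linsolve() helper are available:

    // Fit y = m*t + c to (0,1), (1,3), (2,5) with the accumulate-style API.
    static int fit_line_example(double *m_out, double *c_out) {
      double mat[2 * 2], y[2], x[2];
      least_squares_init(mat, y, 2);
      const double ts[3] = { 0.0, 1.0, 2.0 };
      const double ys[3] = { 1.0, 3.0, 5.0 };
      for (int k = 0; k < 3; ++k) {
        const double a[2] = { ts[k], 1.0 };  // one equation a . x = b
        least_squares_accumulate(mat, y, a, ys[k], 2);
      }
      if (!least_squares_solve(mat, y, x, 2)) return 0;
      *m_out = x[0];  // ~= 2.0 (slope)
      *c_out = x[1];  // ~= 1.0 (intercept)
      return 1;
    }
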
@@ -108,4 +127,19 @@
}
}
+static AOM_INLINE float approx_exp(float y) {
+#define A ((1 << 23) / 0.69314718056f) // (1 << 23) / ln(2)
+#define B 127    // Offset for the exponent per the IEEE floating point standard.
+#define C 60801  // Magic number that controls the accuracy of the approximation.
+ union {
+ float as_float;
+ int32_t as_int32;
+ } container;
+ container.as_int32 = ((int32_t)(y * A)) + ((B << 23) - C);
+ return container.as_float;
+#undef A
+#undef B
+#undef C
+}
#endif // AOM_AOM_DSP_MATHUTILS_H_
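
The new approx_exp() appears to be the classic Schraudolph bit-trick: y * A + (B << 23) builds the IEEE-754 bit pattern of 2^(y / ln 2) directly, and C trades a small bias for lower overall error. A quick accuracy probe (illustrative only; the approximation is only meaningful while y * A + (B << 23) - C stays within int32 range, roughly |y| < 87):

    #include <math.h>
    #include <stdio.h>
    #include "aom_dsp/mathutils.h"  // assumes the libaom tree is available

    int main(void) {
      for (double y = -10.0; y <= 10.0; y += 2.5) {
        const float approx = approx_exp((float)y);
        const float exact = expf((float)y);
        printf("y=%6.2f approx=%12.5g exact=%12.5g rel err=%6.3f%%\n", y,
               approx, exact, 100.0 * fabs(approx - exact) / exact);
      }
      return 0;
    }
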
diff --git a/aom_dsp/noise_model.c b/aom_dsp/noise_model.c
index 8521232..065ec9a 100644
--- a/aom_dsp/noise_model.c
+++ b/aom_dsp/noise_model.c
@@ -571,7 +571,6 @@
const int num_blocks_w = (w + block_size - 1) / block_size;
const int num_blocks_h = (h + block_size - 1) / block_size;
int num_flat = 0;
- int bx = 0, by = 0;
double *plane = (double *)aom_malloc(n * sizeof(*plane));
double *block = (double *)aom_malloc(n * sizeof(*block));
index_and_score_t *scores = (index_and_score_t *)aom_malloc(
@@ -587,19 +586,18 @@
#ifdef NOISE_MODEL_LOG_SCORE
fprintf(stderr, "score = [");
#endif
- for (by = 0; by < num_blocks_h; ++by) {
- for (bx = 0; bx < num_blocks_w; ++bx) {
+ for (int by = 0; by < num_blocks_h; ++by) {
+ for (int bx = 0; bx < num_blocks_w; ++bx) {
// Compute gradient covariance matrix.
- double Gxx = 0, Gxy = 0, Gyy = 0;
- double var = 0;
- double mean = 0;
- int xi, yi;
aom_flat_block_finder_extract_block(block_finder, data, w, h, stride,
bx * block_size, by * block_size,
plane, block);
+ double Gxx = 0, Gxy = 0, Gyy = 0;
+ double mean = 0;
+ double var = 0;
- for (yi = 1; yi < block_size - 1; ++yi) {
- for (xi = 1; xi < block_size - 1; ++xi) {
+ for (int yi = 1; yi < block_size - 1; ++yi) {
+ for (int xi = 1; xi < block_size - 1; ++xi) {
const double gx = (block[yi * block_size + xi + 1] -
block[yi * block_size + xi - 1]) /
2;
@@ -1623,6 +1621,8 @@
return 1;
}
+// TODO(aomedia:3151): Handle a monochrome image (sd->u_buffer and sd->v_buffer
+// are null pointers) correctly.
int aom_denoise_and_model_run(struct aom_denoise_and_model_t *ctx,
YV12_BUFFER_CONFIG *sd,
aom_film_grain_t *film_grain, int apply_denoise) {
@@ -1680,10 +1680,12 @@
if (apply_denoise) {
memcpy(raw_data[0], ctx->denoised[0],
(strides[0] * sd->y_height) << use_highbd);
- memcpy(raw_data[1], ctx->denoised[1],
- (strides[1] * sd->uv_height) << use_highbd);
- memcpy(raw_data[2], ctx->denoised[2],
- (strides[2] * sd->uv_height) << use_highbd);
+ if (!sd->monochrome) {
+ memcpy(raw_data[1], ctx->denoised[1],
+ (strides[1] * sd->uv_height) << use_highbd);
+ memcpy(raw_data[2], ctx->denoised[2],
+ (strides[2] * sd->uv_height) << use_highbd);
+ }
}
}
return 1;
diff --git a/aom_dsp/noise_model.h b/aom_dsp/noise_model.h
index f385251..8228aea 100644
--- a/aom_dsp/noise_model.h
+++ b/aom_dsp/noise_model.h
@@ -293,13 +293,13 @@
* parameter will be true when the input buffer was successfully denoised and
* grain was modelled. Returns false on error.
*
- * \param[in] ctx Struct allocated with
+ * \param[in] ctx Struct allocated with
* aom_denoise_and_model_alloc that holds some
* buffers for denoising and the current noise
* estimate.
- * \param[in/out] buf The raw input buffer to be denoised.
+ * \param[in,out] buf The raw input buffer to be denoised.
* \param[out] grain Output film grain parameters
- * \param[out] apply_denoise Whether or not to apply the denoising to the
+ * \param[in] apply_denoise Whether or not to apply the denoising to the
* frame that will be encoded
*/
int aom_denoise_and_model_run(struct aom_denoise_and_model_t *ctx,
diff --git a/aom_dsp/prob.h b/aom_dsp/prob.h
index 5e25b9c..5711a40 100644
--- a/aom_dsp/prob.h
+++ b/aom_dsp/prob.h
@@ -31,16 +31,12 @@
#define CDF_SIZE(x) ((x) + 1)
#define CDF_PROB_BITS 15
#define CDF_PROB_TOP (1 << CDF_PROB_BITS)
-#define CDF_INIT_TOP 32768
-#define CDF_SHIFT (15 - CDF_PROB_BITS)
/*The value stored in an iCDF is CDF_PROB_TOP minus the actual cumulative
probability (an "inverse" CDF).
This function converts from one representation to the other (and is its own
inverse).*/
#define AOM_ICDF(x) (CDF_PROB_TOP - (x))
-#if CDF_SHIFT == 0
-
#define AOM_CDF2(a0) AOM_ICDF(a0), AOM_ICDF(CDF_PROB_TOP), 0
#define AOM_CDF3(a0, a1) AOM_ICDF(a0), AOM_ICDF(a1), AOM_ICDF(CDF_PROB_TOP), 0
#define AOM_CDF4(a0, a1, a2) \
@@ -101,535 +97,6 @@
AOM_ICDF(a11), AOM_ICDF(a12), AOM_ICDF(a13), AOM_ICDF(a14), \
AOM_ICDF(CDF_PROB_TOP), 0
-#else
-#define AOM_CDF2(a0) \
- AOM_ICDF((((a0)-1) * ((CDF_INIT_TOP >> CDF_SHIFT) - 2) + \
- ((CDF_INIT_TOP - 2) >> 1)) / \
- ((CDF_INIT_TOP - 2)) + \
- 1) \
- , AOM_ICDF(CDF_PROB_TOP), 0
-#define AOM_CDF3(a0, a1) \
- AOM_ICDF((((a0)-1) * ((CDF_INIT_TOP >> CDF_SHIFT) - 3) + \
- ((CDF_INIT_TOP - 3) >> 1)) / \
- ((CDF_INIT_TOP - 3)) + \
- 1) \
- , \
- AOM_ICDF((((a1)-2) * ((CDF_INIT_TOP >> CDF_SHIFT) - 3) + \
- ((CDF_INIT_TOP - 3) >> 1)) / \
- ((CDF_INIT_TOP - 3)) + \
- 2), \
- AOM_ICDF(CDF_PROB_TOP), 0
-#define AOM_CDF4(a0, a1, a2) \
- AOM_ICDF((((a0)-1) * ((CDF_INIT_TOP >> CDF_SHIFT) - 4) + \
- ((CDF_INIT_TOP - 4) >> 1)) / \
- ((CDF_INIT_TOP - 4)) + \
- 1) \
- , \
- AOM_ICDF((((a1)-2) * ((CDF_INIT_TOP >> CDF_SHIFT) - 4) + \
- ((CDF_INIT_TOP - 4) >> 1)) / \
- ((CDF_INIT_TOP - 4)) + \
- 2), \
- AOM_ICDF((((a2)-3) * ((CDF_INIT_TOP >> CDF_SHIFT) - 4) + \
- ((CDF_INIT_TOP - 4) >> 1)) / \
- ((CDF_INIT_TOP - 4)) + \
- 3), \
- AOM_ICDF(CDF_PROB_TOP), 0
-#define AOM_CDF5(a0, a1, a2, a3) \
- AOM_ICDF((((a0)-1) * ((CDF_INIT_TOP >> CDF_SHIFT) - 5) + \
- ((CDF_INIT_TOP - 5) >> 1)) / \
- ((CDF_INIT_TOP - 5)) + \
- 1) \
- , \
- AOM_ICDF((((a1)-2) * ((CDF_INIT_TOP >> CDF_SHIFT) - 5) + \
- ((CDF_INIT_TOP - 5) >> 1)) / \
- ((CDF_INIT_TOP - 5)) + \
- 2), \
- AOM_ICDF((((a2)-3) * ((CDF_INIT_TOP >> CDF_SHIFT) - 5) + \
- ((CDF_INIT_TOP - 5) >> 1)) / \
- ((CDF_INIT_TOP - 5)) + \
- 3), \
- AOM_ICDF((((a3)-4) * ((CDF_INIT_TOP >> CDF_SHIFT) - 5) + \
- ((CDF_INIT_TOP - 5) >> 1)) / \
- ((CDF_INIT_TOP - 5)) + \
- 4), \
- AOM_ICDF(CDF_PROB_TOP), 0
-#define AOM_CDF6(a0, a1, a2, a3, a4) \
- AOM_ICDF((((a0)-1) * ((CDF_INIT_TOP >> CDF_SHIFT) - 6) + \
- ((CDF_INIT_TOP - 6) >> 1)) / \
- ((CDF_INIT_TOP - 6)) + \
- 1) \
- , \
- AOM_ICDF((((a1)-2) * ((CDF_INIT_TOP >> CDF_SHIFT) - 6) + \
- ((CDF_INIT_TOP - 6) >> 1)) / \
- ((CDF_INIT_TOP - 6)) + \
- 2), \
- AOM_ICDF((((a2)-3) * ((CDF_INIT_TOP >> CDF_SHIFT) - 6) + \
- ((CDF_INIT_TOP - 6) >> 1)) / \
- ((CDF_INIT_TOP - 6)) + \
- 3), \
- AOM_ICDF((((a3)-4) * ((CDF_INIT_TOP >> CDF_SHIFT) - 6) + \
- ((CDF_INIT_TOP - 6) >> 1)) / \
- ((CDF_INIT_TOP - 6)) + \
- 4), \
- AOM_ICDF((((a4)-5) * ((CDF_INIT_TOP >> CDF_SHIFT) - 6) + \
- ((CDF_INIT_TOP - 6) >> 1)) / \
- ((CDF_INIT_TOP - 6)) + \
- 5), \
- AOM_ICDF(CDF_PROB_TOP), 0
-#define AOM_CDF7(a0, a1, a2, a3, a4, a5) \
- AOM_ICDF((((a0)-1) * ((CDF_INIT_TOP >> CDF_SHIFT) - 7) + \
- ((CDF_INIT_TOP - 7) >> 1)) / \
- ((CDF_INIT_TOP - 7)) + \
- 1) \
- , \
- AOM_ICDF((((a1)-2) * ((CDF_INIT_TOP >> CDF_SHIFT) - 7) + \
- ((CDF_INIT_TOP - 7) >> 1)) / \
- ((CDF_INIT_TOP - 7)) + \
- 2), \
- AOM_ICDF((((a2)-3) * ((CDF_INIT_TOP >> CDF_SHIFT) - 7) + \
- ((CDF_INIT_TOP - 7) >> 1)) / \
- ((CDF_INIT_TOP - 7)) + \
- 3), \
- AOM_ICDF((((a3)-4) * ((CDF_INIT_TOP >> CDF_SHIFT) - 7) + \
- ((CDF_INIT_TOP - 7) >> 1)) / \
- ((CDF_INIT_TOP - 7)) + \
- 4), \
- AOM_ICDF((((a4)-5) * ((CDF_INIT_TOP >> CDF_SHIFT) - 7) + \
- ((CDF_INIT_TOP - 7) >> 1)) / \
- ((CDF_INIT_TOP - 7)) + \
- 5), \
- AOM_ICDF((((a5)-6) * ((CDF_INIT_TOP >> CDF_SHIFT) - 7) + \
- ((CDF_INIT_TOP - 7) >> 1)) / \
- ((CDF_INIT_TOP - 7)) + \
- 6), \
- AOM_ICDF(CDF_PROB_TOP), 0
-#define AOM_CDF8(a0, a1, a2, a3, a4, a5, a6) \
- AOM_ICDF((((a0)-1) * ((CDF_INIT_TOP >> CDF_SHIFT) - 8) + \
- ((CDF_INIT_TOP - 8) >> 1)) / \
- ((CDF_INIT_TOP - 8)) + \
- 1) \
- , \
- AOM_ICDF((((a1)-2) * ((CDF_INIT_TOP >> CDF_SHIFT) - 8) + \
- ((CDF_INIT_TOP - 8) >> 1)) / \
- ((CDF_INIT_TOP - 8)) + \
- 2), \
- AOM_ICDF((((a2)-3) * ((CDF_INIT_TOP >> CDF_SHIFT) - 8) + \
- ((CDF_INIT_TOP - 8) >> 1)) / \
- ((CDF_INIT_TOP - 8)) + \
- 3), \
- AOM_ICDF((((a3)-4) * ((CDF_INIT_TOP >> CDF_SHIFT) - 8) + \
- ((CDF_INIT_TOP - 8) >> 1)) / \
- ((CDF_INIT_TOP - 8)) + \
- 4), \
- AOM_ICDF((((a4)-5) * ((CDF_INIT_TOP >> CDF_SHIFT) - 8) + \
- ((CDF_INIT_TOP - 8) >> 1)) / \
- ((CDF_INIT_TOP - 8)) + \
- 5), \
- AOM_ICDF((((a5)-6) * ((CDF_INIT_TOP >> CDF_SHIFT) - 8) + \
- ((CDF_INIT_TOP - 8) >> 1)) / \
- ((CDF_INIT_TOP - 8)) + \
- 6), \
- AOM_ICDF((((a6)-7) * ((CDF_INIT_TOP >> CDF_SHIFT) - 8) + \
- ((CDF_INIT_TOP - 8) >> 1)) / \
- ((CDF_INIT_TOP - 8)) + \
- 7), \
- AOM_ICDF(CDF_PROB_TOP), 0
-#define AOM_CDF9(a0, a1, a2, a3, a4, a5, a6, a7) \
- AOM_ICDF((((a0)-1) * ((CDF_INIT_TOP >> CDF_SHIFT) - 9) + \
- ((CDF_INIT_TOP - 9) >> 1)) / \
- ((CDF_INIT_TOP - 9)) + \
- 1) \
- , \
- AOM_ICDF((((a1)-2) * ((CDF_INIT_TOP >> CDF_SHIFT) - 9) + \
- ((CDF_INIT_TOP - 9) >> 1)) / \
- ((CDF_INIT_TOP - 9)) + \
- 2), \
- AOM_ICDF((((a2)-3) * ((CDF_INIT_TOP >> CDF_SHIFT) - 9) + \
- ((CDF_INIT_TOP - 9) >> 1)) / \
- ((CDF_INIT_TOP - 9)) + \
- 3), \
- AOM_ICDF((((a3)-4) * ((CDF_INIT_TOP >> CDF_SHIFT) - 9) + \
- ((CDF_INIT_TOP - 9) >> 1)) / \
- ((CDF_INIT_TOP - 9)) + \
- 4), \
- AOM_ICDF((((a4)-5) * ((CDF_INIT_TOP >> CDF_SHIFT) - 9) + \
- ((CDF_INIT_TOP - 9) >> 1)) / \
- ((CDF_INIT_TOP - 9)) + \
- 5), \
- AOM_ICDF((((a5)-6) * ((CDF_INIT_TOP >> CDF_SHIFT) - 9) + \
- ((CDF_INIT_TOP - 9) >> 1)) / \
- ((CDF_INIT_TOP - 9)) + \
- 6), \
- AOM_ICDF((((a6)-7) * ((CDF_INIT_TOP >> CDF_SHIFT) - 9) + \
- ((CDF_INIT_TOP - 9) >> 1)) / \
- ((CDF_INIT_TOP - 9)) + \
- 7), \
- AOM_ICDF((((a7)-8) * ((CDF_INIT_TOP >> CDF_SHIFT) - 9) + \
- ((CDF_INIT_TOP - 9) >> 1)) / \
- ((CDF_INIT_TOP - 9)) + \
- 8), \
- AOM_ICDF(CDF_PROB_TOP), 0
-#define AOM_CDF10(a0, a1, a2, a3, a4, a5, a6, a7, a8) \
- AOM_ICDF((((a0)-1) * ((CDF_INIT_TOP >> CDF_SHIFT) - 10) + \
- ((CDF_INIT_TOP - 10) >> 1)) / \
- ((CDF_INIT_TOP - 10)) + \
- 1) \
- , \
- AOM_ICDF((((a1)-2) * ((CDF_INIT_TOP >> CDF_SHIFT) - 10) + \
- ((CDF_INIT_TOP - 10) >> 1)) / \
- ((CDF_INIT_TOP - 10)) + \
- 2), \
- AOM_ICDF((((a2)-3) * ((CDF_INIT_TOP >> CDF_SHIFT) - 10) + \
- ((CDF_INIT_TOP - 10) >> 1)) / \
- ((CDF_INIT_TOP - 10)) + \
- 3), \
- AOM_ICDF((((a3)-4) * ((CDF_INIT_TOP >> CDF_SHIFT) - 10) + \
- ((CDF_INIT_TOP - 10) >> 1)) / \
- ((CDF_INIT_TOP - 10)) + \
- 4), \
- AOM_ICDF((((a4)-5) * ((CDF_INIT_TOP >> CDF_SHIFT) - 10) + \
- ((CDF_INIT_TOP - 10) >> 1)) / \
- ((CDF_INIT_TOP - 10)) + \
- 5), \
- AOM_ICDF((((a5)-6) * ((CDF_INIT_TOP >> CDF_SHIFT) - 10) + \
- ((CDF_INIT_TOP - 10) >> 1)) / \
- ((CDF_INIT_TOP - 10)) + \
- 6), \
- AOM_ICDF((((a6)-7) * ((CDF_INIT_TOP >> CDF_SHIFT) - 10) + \
- ((CDF_INIT_TOP - 10) >> 1)) / \
- ((CDF_INIT_TOP - 10)) + \
- 7), \
- AOM_ICDF((((a7)-8) * ((CDF_INIT_TOP >> CDF_SHIFT) - 10) + \
- ((CDF_INIT_TOP - 10) >> 1)) / \
- ((CDF_INIT_TOP - 10)) + \
- 8), \
- AOM_ICDF((((a8)-9) * ((CDF_INIT_TOP >> CDF_SHIFT) - 10) + \
- ((CDF_INIT_TOP - 10) >> 1)) / \
- ((CDF_INIT_TOP - 10)) + \
- 9), \
- AOM_ICDF(CDF_PROB_TOP), 0
-#define AOM_CDF11(a0, a1, a2, a3, a4, a5, a6, a7, a8, a9) \
- AOM_ICDF((((a0)-1) * ((CDF_INIT_TOP >> CDF_SHIFT) - 11) + \
- ((CDF_INIT_TOP - 11) >> 1)) / \
- ((CDF_INIT_TOP - 11)) + \
- 1) \
- , \
- AOM_ICDF((((a1)-2) * ((CDF_INIT_TOP >> CDF_SHIFT) - 11) + \
- ((CDF_INIT_TOP - 11) >> 1)) / \
- ((CDF_INIT_TOP - 11)) + \
- 2), \
- AOM_ICDF((((a2)-3) * ((CDF_INIT_TOP >> CDF_SHIFT) - 11) + \
- ((CDF_INIT_TOP - 11) >> 1)) / \
- ((CDF_INIT_TOP - 11)) + \
- 3), \
- AOM_ICDF((((a3)-4) * ((CDF_INIT_TOP >> CDF_SHIFT) - 11) + \
- ((CDF_INIT_TOP - 11) >> 1)) / \
- ((CDF_INIT_TOP - 11)) + \
- 4), \
- AOM_ICDF((((a4)-5) * ((CDF_INIT_TOP >> CDF_SHIFT) - 11) + \
- ((CDF_INIT_TOP - 11) >> 1)) / \
- ((CDF_INIT_TOP - 11)) + \
- 5), \
- AOM_ICDF((((a5)-6) * ((CDF_INIT_TOP >> CDF_SHIFT) - 11) + \
- ((CDF_INIT_TOP - 11) >> 1)) / \
- ((CDF_INIT_TOP - 11)) + \
- 6), \
- AOM_ICDF((((a6)-7) * ((CDF_INIT_TOP >> CDF_SHIFT) - 11) + \
- ((CDF_INIT_TOP - 11) >> 1)) / \
- ((CDF_INIT_TOP - 11)) + \
- 7), \
- AOM_ICDF((((a7)-8) * ((CDF_INIT_TOP >> CDF_SHIFT) - 11) + \
- ((CDF_INIT_TOP - 11) >> 1)) / \
- ((CDF_INIT_TOP - 11)) + \
- 8), \
- AOM_ICDF((((a8)-9) * ((CDF_INIT_TOP >> CDF_SHIFT) - 11) + \
- ((CDF_INIT_TOP - 11) >> 1)) / \
- ((CDF_INIT_TOP - 11)) + \
- 9), \
- AOM_ICDF((((a9)-10) * ((CDF_INIT_TOP >> CDF_SHIFT) - 11) + \
- ((CDF_INIT_TOP - 11) >> 1)) / \
- ((CDF_INIT_TOP - 11)) + \
- 10), \
- AOM_ICDF(CDF_PROB_TOP), 0
-#define AOM_CDF12(a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10) \
- AOM_ICDF((((a0)-1) * ((CDF_INIT_TOP >> CDF_SHIFT) - 12) + \
- ((CDF_INIT_TOP - 12) >> 1)) / \
- ((CDF_INIT_TOP - 12)) + \
- 1) \
- , \
- AOM_ICDF((((a1)-2) * ((CDF_INIT_TOP >> CDF_SHIFT) - 12) + \
- ((CDF_INIT_TOP - 12) >> 1)) / \
- ((CDF_INIT_TOP - 12)) + \
- 2), \
- AOM_ICDF((((a2)-3) * ((CDF_INIT_TOP >> CDF_SHIFT) - 12) + \
- ((CDF_INIT_TOP - 12) >> 1)) / \
- ((CDF_INIT_TOP - 12)) + \
- 3), \
- AOM_ICDF((((a3)-4) * ((CDF_INIT_TOP >> CDF_SHIFT) - 12) + \
- ((CDF_INIT_TOP - 12) >> 1)) / \
- ((CDF_INIT_TOP - 12)) + \
- 4), \
- AOM_ICDF((((a4)-5) * ((CDF_INIT_TOP >> CDF_SHIFT) - 12) + \
- ((CDF_INIT_TOP - 12) >> 1)) / \
- ((CDF_INIT_TOP - 12)) + \
- 5), \
- AOM_ICDF((((a5)-6) * ((CDF_INIT_TOP >> CDF_SHIFT) - 12) + \
- ((CDF_INIT_TOP - 12) >> 1)) / \
- ((CDF_INIT_TOP - 12)) + \
- 6), \
- AOM_ICDF((((a6)-7) * ((CDF_INIT_TOP >> CDF_SHIFT) - 12) + \
- ((CDF_INIT_TOP - 12) >> 1)) / \
- ((CDF_INIT_TOP - 12)) + \
- 7), \
- AOM_ICDF((((a7)-8) * ((CDF_INIT_TOP >> CDF_SHIFT) - 12) + \
- ((CDF_INIT_TOP - 12) >> 1)) / \
- ((CDF_INIT_TOP - 12)) + \
- 8), \
- AOM_ICDF((((a8)-9) * ((CDF_INIT_TOP >> CDF_SHIFT) - 12) + \
- ((CDF_INIT_TOP - 12) >> 1)) / \
- ((CDF_INIT_TOP - 12)) + \
- 9), \
- AOM_ICDF((((a9)-10) * ((CDF_INIT_TOP >> CDF_SHIFT) - 12) + \
- ((CDF_INIT_TOP - 12) >> 1)) / \
- ((CDF_INIT_TOP - 12)) + \
- 10), \
- AOM_ICDF((((a10)-11) * ((CDF_INIT_TOP >> CDF_SHIFT) - 12) + \
- ((CDF_INIT_TOP - 12) >> 1)) / \
- ((CDF_INIT_TOP - 12)) + \
- 11), \
- AOM_ICDF(CDF_PROB_TOP), 0
-#define AOM_CDF13(a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11) \
- AOM_ICDF((((a0)-1) * ((CDF_INIT_TOP >> CDF_SHIFT) - 13) + \
- ((CDF_INIT_TOP - 13) >> 1)) / \
- ((CDF_INIT_TOP - 13)) + \
- 1) \
- , \
- AOM_ICDF((((a1)-2) * ((CDF_INIT_TOP >> CDF_SHIFT) - 13) + \
- ((CDF_INIT_TOP - 13) >> 1)) / \
- ((CDF_INIT_TOP - 13)) + \
- 2), \
- AOM_ICDF((((a2)-3) * ((CDF_INIT_TOP >> CDF_SHIFT) - 13) + \
- ((CDF_INIT_TOP - 13) >> 1)) / \
- ((CDF_INIT_TOP - 13)) + \
- 3), \
- AOM_ICDF((((a3)-4) * ((CDF_INIT_TOP >> CDF_SHIFT) - 13) + \
- ((CDF_INIT_TOP - 13) >> 1)) / \
- ((CDF_INIT_TOP - 13)) + \
- 4), \
- AOM_ICDF((((a4)-5) * ((CDF_INIT_TOP >> CDF_SHIFT) - 13) + \
- ((CDF_INIT_TOP - 13) >> 1)) / \
- ((CDF_INIT_TOP - 13)) + \
- 5), \
- AOM_ICDF((((a5)-6) * ((CDF_INIT_TOP >> CDF_SHIFT) - 13) + \
- ((CDF_INIT_TOP - 13) >> 1)) / \
- ((CDF_INIT_TOP - 13)) + \
- 6), \
- AOM_ICDF((((a6)-7) * ((CDF_INIT_TOP >> CDF_SHIFT) - 13) + \
- ((CDF_INIT_TOP - 13) >> 1)) / \
- ((CDF_INIT_TOP - 13)) + \
- 7), \
- AOM_ICDF((((a7)-8) * ((CDF_INIT_TOP >> CDF_SHIFT) - 13) + \
- ((CDF_INIT_TOP - 13) >> 1)) / \
- ((CDF_INIT_TOP - 13)) + \
- 8), \
- AOM_ICDF((((a8)-9) * ((CDF_INIT_TOP >> CDF_SHIFT) - 13) + \
- ((CDF_INIT_TOP - 13) >> 1)) / \
- ((CDF_INIT_TOP - 13)) + \
- 9), \
- AOM_ICDF((((a9)-10) * ((CDF_INIT_TOP >> CDF_SHIFT) - 13) + \
- ((CDF_INIT_TOP - 13) >> 1)) / \
- ((CDF_INIT_TOP - 13)) + \
- 10), \
- AOM_ICDF((((a10)-11) * ((CDF_INIT_TOP >> CDF_SHIFT) - 13) + \
- ((CDF_INIT_TOP - 13) >> 1)) / \
- ((CDF_INIT_TOP - 13)) + \
- 11), \
- AOM_ICDF((((a11)-12) * ((CDF_INIT_TOP >> CDF_SHIFT) - 13) + \
- ((CDF_INIT_TOP - 13) >> 1)) / \
- ((CDF_INIT_TOP - 13)) + \
- 12), \
- AOM_ICDF(CDF_PROB_TOP), 0
-#define AOM_CDF14(a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12) \
- AOM_ICDF((((a0)-1) * ((CDF_INIT_TOP >> CDF_SHIFT) - 14) + \
- ((CDF_INIT_TOP - 14) >> 1)) / \
- ((CDF_INIT_TOP - 14)) + \
- 1) \
- , \
- AOM_ICDF((((a1)-2) * ((CDF_INIT_TOP >> CDF_SHIFT) - 14) + \
- ((CDF_INIT_TOP - 14) >> 1)) / \
- ((CDF_INIT_TOP - 14)) + \
- 2), \
- AOM_ICDF((((a2)-3) * ((CDF_INIT_TOP >> CDF_SHIFT) - 14) + \
- ((CDF_INIT_TOP - 14) >> 1)) / \
- ((CDF_INIT_TOP - 14)) + \
- 3), \
- AOM_ICDF((((a3)-4) * ((CDF_INIT_TOP >> CDF_SHIFT) - 14) + \
- ((CDF_INIT_TOP - 14) >> 1)) / \
- ((CDF_INIT_TOP - 14)) + \
- 4), \
- AOM_ICDF((((a4)-5) * ((CDF_INIT_TOP >> CDF_SHIFT) - 14) + \
- ((CDF_INIT_TOP - 14) >> 1)) / \
- ((CDF_INIT_TOP - 14)) + \
- 5), \
- AOM_ICDF((((a5)-6) * ((CDF_INIT_TOP >> CDF_SHIFT) - 14) + \
- ((CDF_INIT_TOP - 14) >> 1)) / \
- ((CDF_INIT_TOP - 14)) + \
- 6), \
- AOM_ICDF((((a6)-7) * ((CDF_INIT_TOP >> CDF_SHIFT) - 14) + \
- ((CDF_INIT_TOP - 14) >> 1)) / \
- ((CDF_INIT_TOP - 14)) + \
- 7), \
- AOM_ICDF((((a7)-8) * ((CDF_INIT_TOP >> CDF_SHIFT) - 14) + \
- ((CDF_INIT_TOP - 14) >> 1)) / \
- ((CDF_INIT_TOP - 14)) + \
- 8), \
- AOM_ICDF((((a8)-9) * ((CDF_INIT_TOP >> CDF_SHIFT) - 14) + \
- ((CDF_INIT_TOP - 14) >> 1)) / \
- ((CDF_INIT_TOP - 14)) + \
- 9), \
- AOM_ICDF((((a9)-10) * ((CDF_INIT_TOP >> CDF_SHIFT) - 14) + \
- ((CDF_INIT_TOP - 14) >> 1)) / \
- ((CDF_INIT_TOP - 14)) + \
- 10), \
- AOM_ICDF((((a10)-11) * ((CDF_INIT_TOP >> CDF_SHIFT) - 14) + \
- ((CDF_INIT_TOP - 14) >> 1)) / \
- ((CDF_INIT_TOP - 14)) + \
- 11), \
- AOM_ICDF((((a11)-12) * ((CDF_INIT_TOP >> CDF_SHIFT) - 14) + \
- ((CDF_INIT_TOP - 14) >> 1)) / \
- ((CDF_INIT_TOP - 14)) + \
- 12), \
- AOM_ICDF((((a12)-13) * ((CDF_INIT_TOP >> CDF_SHIFT) - 14) + \
- ((CDF_INIT_TOP - 14) >> 1)) / \
- ((CDF_INIT_TOP - 14)) + \
- 13), \
- AOM_ICDF(CDF_PROB_TOP), 0
-#define AOM_CDF15(a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13) \
- AOM_ICDF((((a0)-1) * ((CDF_INIT_TOP >> CDF_SHIFT) - 15) + \
- ((CDF_INIT_TOP - 15) >> 1)) / \
- ((CDF_INIT_TOP - 15)) + \
- 1) \
- , \
- AOM_ICDF((((a1)-2) * ((CDF_INIT_TOP >> CDF_SHIFT) - 15) + \
- ((CDF_INIT_TOP - 15) >> 1)) / \
- ((CDF_INIT_TOP - 15)) + \
- 2), \
- AOM_ICDF((((a2)-3) * ((CDF_INIT_TOP >> CDF_SHIFT) - 15) + \
- ((CDF_INIT_TOP - 15) >> 1)) / \
- ((CDF_INIT_TOP - 15)) + \
- 3), \
- AOM_ICDF((((a3)-4) * ((CDF_INIT_TOP >> CDF_SHIFT) - 15) + \
- ((CDF_INIT_TOP - 15) >> 1)) / \
- ((CDF_INIT_TOP - 15)) + \
- 4), \
- AOM_ICDF((((a4)-5) * ((CDF_INIT_TOP >> CDF_SHIFT) - 15) + \
- ((CDF_INIT_TOP - 15) >> 1)) / \
- ((CDF_INIT_TOP - 15)) + \
- 5), \
- AOM_ICDF((((a5)-6) * ((CDF_INIT_TOP >> CDF_SHIFT) - 15) + \
- ((CDF_INIT_TOP - 15) >> 1)) / \
- ((CDF_INIT_TOP - 15)) + \
- 6), \
- AOM_ICDF((((a6)-7) * ((CDF_INIT_TOP >> CDF_SHIFT) - 15) + \
- ((CDF_INIT_TOP - 15) >> 1)) / \
- ((CDF_INIT_TOP - 15)) + \
- 7), \
- AOM_ICDF((((a7)-8) * ((CDF_INIT_TOP >> CDF_SHIFT) - 15) + \
- ((CDF_INIT_TOP - 15) >> 1)) / \
- ((CDF_INIT_TOP - 15)) + \
- 8), \
- AOM_ICDF((((a8)-9) * ((CDF_INIT_TOP >> CDF_SHIFT) - 15) + \
- ((CDF_INIT_TOP - 15) >> 1)) / \
- ((CDF_INIT_TOP - 15)) + \
- 9), \
- AOM_ICDF((((a9)-10) * ((CDF_INIT_TOP >> CDF_SHIFT) - 15) + \
- ((CDF_INIT_TOP - 15) >> 1)) / \
- ((CDF_INIT_TOP - 15)) + \
- 10), \
- AOM_ICDF((((a10)-11) * ((CDF_INIT_TOP >> CDF_SHIFT) - 15) + \
- ((CDF_INIT_TOP - 15) >> 1)) / \
- ((CDF_INIT_TOP - 15)) + \
- 11), \
- AOM_ICDF((((a11)-12) * ((CDF_INIT_TOP >> CDF_SHIFT) - 15) + \
- ((CDF_INIT_TOP - 15) >> 1)) / \
- ((CDF_INIT_TOP - 15)) + \
- 12), \
- AOM_ICDF((((a12)-13) * ((CDF_INIT_TOP >> CDF_SHIFT) - 15) + \
- ((CDF_INIT_TOP - 15) >> 1)) / \
- ((CDF_INIT_TOP - 15)) + \
- 13), \
- AOM_ICDF((((a13)-14) * ((CDF_INIT_TOP >> CDF_SHIFT) - 15) + \
- ((CDF_INIT_TOP - 15) >> 1)) / \
- ((CDF_INIT_TOP - 15)) + \
- 14), \
- AOM_ICDF(CDF_PROB_TOP), 0
-#define AOM_CDF16(a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, \
- a14) \
- AOM_ICDF((((a0)-1) * ((CDF_INIT_TOP >> CDF_SHIFT) - 16) + \
- ((CDF_INIT_TOP - 16) >> 1)) / \
- ((CDF_INIT_TOP - 16)) + \
- 1) \
- , \
- AOM_ICDF((((a1)-2) * ((CDF_INIT_TOP >> CDF_SHIFT) - 16) + \
- ((CDF_INIT_TOP - 16) >> 1)) / \
- ((CDF_INIT_TOP - 16)) + \
- 2), \
- AOM_ICDF((((a2)-3) * ((CDF_INIT_TOP >> CDF_SHIFT) - 16) + \
- ((CDF_INIT_TOP - 16) >> 1)) / \
- ((CDF_INIT_TOP - 16)) + \
- 3), \
- AOM_ICDF((((a3)-4) * ((CDF_INIT_TOP >> CDF_SHIFT) - 16) + \
- ((CDF_INIT_TOP - 16) >> 1)) / \
- ((CDF_INIT_TOP - 16)) + \
- 4), \
- AOM_ICDF((((a4)-5) * ((CDF_INIT_TOP >> CDF_SHIFT) - 16) + \
- ((CDF_INIT_TOP - 16) >> 1)) / \
- ((CDF_INIT_TOP - 16)) + \
- 5), \
- AOM_ICDF((((a5)-6) * ((CDF_INIT_TOP >> CDF_SHIFT) - 16) + \
- ((CDF_INIT_TOP - 16) >> 1)) / \
- ((CDF_INIT_TOP - 16)) + \
- 6), \
- AOM_ICDF((((a6)-7) * ((CDF_INIT_TOP >> CDF_SHIFT) - 16) + \
- ((CDF_INIT_TOP - 16) >> 1)) / \
- ((CDF_INIT_TOP - 16)) + \
- 7), \
- AOM_ICDF((((a7)-8) * ((CDF_INIT_TOP >> CDF_SHIFT) - 16) + \
- ((CDF_INIT_TOP - 16) >> 1)) / \
- ((CDF_INIT_TOP - 16)) + \
- 8), \
- AOM_ICDF((((a8)-9) * ((CDF_INIT_TOP >> CDF_SHIFT) - 16) + \
- ((CDF_INIT_TOP - 16) >> 1)) / \
- ((CDF_INIT_TOP - 16)) + \
- 9), \
- AOM_ICDF((((a9)-10) * ((CDF_INIT_TOP >> CDF_SHIFT) - 16) + \
- ((CDF_INIT_TOP - 16) >> 1)) / \
- ((CDF_INIT_TOP - 16)) + \
- 10), \
- AOM_ICDF((((a10)-11) * ((CDF_INIT_TOP >> CDF_SHIFT) - 16) + \
- ((CDF_INIT_TOP - 16) >> 1)) / \
- ((CDF_INIT_TOP - 16)) + \
- 11), \
- AOM_ICDF((((a11)-12) * ((CDF_INIT_TOP >> CDF_SHIFT) - 16) + \
- ((CDF_INIT_TOP - 16) >> 1)) / \
- ((CDF_INIT_TOP - 16)) + \
- 12), \
- AOM_ICDF((((a12)-13) * ((CDF_INIT_TOP >> CDF_SHIFT) - 16) + \
- ((CDF_INIT_TOP - 16) >> 1)) / \
- ((CDF_INIT_TOP - 16)) + \
- 13), \
- AOM_ICDF((((a13)-14) * ((CDF_INIT_TOP >> CDF_SHIFT) - 16) + \
- ((CDF_INIT_TOP - 16) >> 1)) / \
- ((CDF_INIT_TOP - 16)) + \
- 14), \
- AOM_ICDF((((a14)-15) * ((CDF_INIT_TOP >> CDF_SHIFT) - 16) + \
- ((CDF_INIT_TOP - 16) >> 1)) / \
- ((CDF_INIT_TOP - 16)) + \
- 15), \
- AOM_ICDF(CDF_PROB_TOP), 0
-
-#endif
-
static INLINE uint8_t get_prob(unsigned int num, unsigned int den) {
assert(den != 0);
{
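
With the dead CDF_SHIFT != 0 branch removed, the AOM_CDFn macros reduce to plain inverse-CDF initializers. An illustrative use (the values here are chosen for the example, not taken from libaom's tables):

    #include "aom_dsp/prob.h"

    // A uniform 4-symbol distribution: cumulative probabilities 1/4, 2/4,
    // 3/4 in Q15. AOM_CDF4 stores them inverted (CDF_PROB_TOP - cdf) and
    // appends a zeroed counter slot used during adaptation.
    static const aom_cdf_prob uniform4_cdf[CDF_SIZE(4)] = { AOM_CDF4(
        8192, 16384, 24576) };
    // Expands to { 24576, 16384, 8192, 0, 0 }.
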
diff --git a/aom_dsp/psnr.c b/aom_dsp/psnr.c
index 08fb69c..f71590c 100644
--- a/aom_dsp/psnr.c
+++ b/aom_dsp/psnr.c
@@ -44,9 +44,9 @@
}
#if CONFIG_AV1_HIGHBITDEPTH
-static int64_t encoder_highbd_8_sse(const uint8_t *a8, int a_stride,
- const uint8_t *b8, int b_stride, int w,
- int h) {
+static int64_t encoder_highbd_sse(const uint8_t *a8, int a_stride,
+ const uint8_t *b8, int b_stride, int w,
+ int h) {
const uint16_t *a = CONVERT_TO_SHORTPTR(a8);
const uint16_t *b = CONVERT_TO_SHORTPTR(b8);
int64_t sse = 0;
@@ -84,10 +84,8 @@
for (y = 0; y < height / 16; ++y) {
const uint8_t *pa = a;
const uint8_t *pb = b;
- unsigned int sse;
for (x = 0; x < width / 16; ++x) {
- aom_mse16x16(pa, a_stride, pb, b_stride, &sse);
- total_sse += sse;
+ total_sse += aom_sse(pa, a_stride, pb, b_stride, 16, 16);
pa += 16;
pb += 16;
@@ -128,22 +126,20 @@
const int dh = height % 16;
if (dw > 0) {
- total_sse += encoder_highbd_8_sse(&a[width - dw], a_stride, &b[width - dw],
- b_stride, dw, height);
+ total_sse += encoder_highbd_sse(&a[width - dw], a_stride, &b[width - dw],
+ b_stride, dw, height);
}
if (dh > 0) {
- total_sse += encoder_highbd_8_sse(&a[(height - dh) * a_stride], a_stride,
- &b[(height - dh) * b_stride], b_stride,
- width - dw, dh);
+ total_sse += encoder_highbd_sse(&a[(height - dh) * a_stride], a_stride,
+ &b[(height - dh) * b_stride], b_stride,
+ width - dw, dh);
}
for (y = 0; y < height / 16; ++y) {
const uint8_t *pa = a;
const uint8_t *pb = b;
- unsigned int sse;
for (x = 0; x < width / 16; ++x) {
- aom_highbd_8_mse16x16(pa, a_stride, pb, b_stride, &sse);
- total_sse += sse;
+ total_sse += aom_highbd_sse(pa, a_stride, pb, b_stride, 16, 16);
pa += 16;
pb += 16;
}
diff --git a/aom_dsp/pyramid.c b/aom_dsp/pyramid.c
new file mode 100644
index 0000000..a26d302
--- /dev/null
+++ b/aom_dsp/pyramid.c
@@ -0,0 +1,411 @@
+/*
+ * Copyright (c) 2022, Alliance for Open Media. All rights reserved
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#include "aom_dsp/pyramid.h"
+#include "aom_mem/aom_mem.h"
+#include "aom_ports/bitops.h"
+#include "aom_util/aom_thread.h"
+
+// TODO(rachelbarker): Move needed code from av1/ to aom_dsp/
+#include "av1/common/resize.h"
+
+#include <assert.h>
+#include <string.h>
+
+// Lifecycle:
+// * Frame buffer alloc code calls aom_get_pyramid_alloc_size()
+// to work out how much space is needed for a given number of pyramid
+// levels. This is counted in the size checked against the max allocation
+// limit
+// * Then calls aom_alloc_pyramid() to actually create the pyramid
+// * Pyramid is initially marked as invalid (no data)
+// * Whenever pyramid is needed, we check the valid flag. If set, use existing
+// data. If not set, compute full pyramid
+// * Whenever frame buffer is reused, clear the valid flag
+// * Whenever frame buffer is resized, reallocate pyramid
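+//
+// A hypothetical caller (illustration only, not part of this change) would
+// follow this lifecycle roughly as:
+//   ImagePyramid *pyr = aom_alloc_pyramid(w, h, n_levels, is_16bit);
+//   aom_compute_pyramid(frame, bit_depth, pyr);  // no-op if already valid
+//   ... read pyr->layers[level] ...
+//   aom_invalidate_pyramid(pyr);  // whenever the frame buffer is reused
+//   aom_free_pyramid(pyr);        // when the frame buffer is destroyed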
+
+size_t aom_get_pyramid_alloc_size(int width, int height, int n_levels,
+ bool image_is_16bit) {
+ // Limit number of levels on small frames
+ const int msb = get_msb(AOMMIN(width, height));
+ const int max_levels = AOMMAX(msb - MIN_PYRAMID_SIZE_LOG2, 1);
+ n_levels = AOMMIN(n_levels, max_levels);
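+  // e.g. for a 64x48 frame, get_msb(48) = 5, so n_levels is capped at
+  // 5 - MIN_PYRAMID_SIZE_LOG2 = 2, keeping the coarsest level at least
+  // MIN_PYRAMID_SIZE pixels along each side.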
+
+ size_t alloc_size = 0;
+ alloc_size += sizeof(ImagePyramid);
+ alloc_size += n_levels * sizeof(PyramidLayer);
+
+ // Calculate how much memory is needed for downscaled frame buffers
+ size_t buffer_size = 0;
+
+ // Work out if we need to allocate a few extra bytes for alignment.
+ // aom_memalign() will ensure that the start of the allocation is aligned
+ // to a multiple of PYRAMID_ALIGNMENT. But we want the first image pixel
+ // to be aligned, not the first byte of the allocation.
+ //
+ // In the loop below, we ensure that the stride of every image is a multiple
+ // of PYRAMID_ALIGNMENT. Thus the allocated size of each pyramid level will
+ // also be a multiple of PYRAMID_ALIGNMENT. Thus, as long as we can get the
+ // first pixel in the first pyramid layer aligned properly, that will
+ // automatically mean that the first pixel of every row of every layer is
+ // properly aligned too.
+ //
+ // Thus all we need to consider is the first pixel in the first layer.
+ // This is located at offset
+ // extra_bytes + level_stride * PYRAMID_PADDING + PYRAMID_PADDING
+ // bytes into the buffer. Since level_stride is a multiple of
+ // PYRAMID_ALIGNMENT, we can ignore that. So we need
+ // extra_bytes + PYRAMID_PADDING = multiple of PYRAMID_ALIGNMENT
+ //
+ // To solve this, we can round PYRAMID_PADDING up to the next multiple
+  // of PYRAMID_ALIGNMENT, then subtract the original value to calculate
+ // how many extra bytes are needed.
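+  //
+  // For example, with PYRAMID_PADDING = 16 and PYRAMID_ALIGNMENT = 32:
+  // first_px_offset = (16 + 31) & ~31 = 32, so extra_bytes = 32 - 16 = 16.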
+ size_t first_px_offset =
+ (PYRAMID_PADDING + PYRAMID_ALIGNMENT - 1) & ~(PYRAMID_ALIGNMENT - 1);
+ size_t extra_bytes = first_px_offset - PYRAMID_PADDING;
+ buffer_size += extra_bytes;
+
+ // If the original image is stored in an 8-bit buffer, then we can point the
+ // lowest pyramid level at that buffer rather than allocating a new one.
+ int first_allocated_level = image_is_16bit ? 0 : 1;
+
+ for (int level = first_allocated_level; level < n_levels; level++) {
+ int level_width = width >> level;
+ int level_height = height >> level;
+
+ // Allocate padding for each layer
+ int padded_width = level_width + 2 * PYRAMID_PADDING;
+ int padded_height = level_height + 2 * PYRAMID_PADDING;
+
+ // Align the layer stride to be a multiple of PYRAMID_ALIGNMENT
+ // This ensures that, as long as the top-left pixel in this pyramid level is
+ // properly aligned, then so will the leftmost pixel in every row of the
+ // pyramid level.
+ int level_stride =
+ (padded_width + PYRAMID_ALIGNMENT - 1) & ~(PYRAMID_ALIGNMENT - 1);
+
+ buffer_size += level_stride * padded_height;
+ }
+
+ alloc_size += buffer_size;
+
+ return alloc_size;
+}
+
+ImagePyramid *aom_alloc_pyramid(int width, int height, int n_levels,
+ bool image_is_16bit) {
+ // Limit number of levels on small frames
+ const int msb = get_msb(AOMMIN(width, height));
+ const int max_levels = AOMMAX(msb - MIN_PYRAMID_SIZE_LOG2, 1);
+ n_levels = AOMMIN(n_levels, max_levels);
+
+ ImagePyramid *pyr = aom_calloc(1, sizeof(*pyr));
+ if (!pyr) {
+ return NULL;
+ }
+
+ pyr->layers = aom_calloc(n_levels, sizeof(PyramidLayer));
+ if (!pyr->layers) {
+ aom_free(pyr);
+ return NULL;
+ }
+
+ pyr->valid = false;
+ pyr->n_levels = n_levels;
+
+ // Compute sizes and offsets for each pyramid level
+ // These are gathered up first, so that we can allocate all pyramid levels
+ // in a single buffer
+ size_t buffer_size = 0;
+ size_t *layer_offsets = aom_calloc(n_levels, sizeof(size_t));
+ if (!layer_offsets) {
+    aom_free(pyr->layers);
+    aom_free(pyr);
+ return NULL;
+ }
+
+ // Work out if we need to allocate a few extra bytes for alignment.
+ // aom_memalign() will ensure that the start of the allocation is aligned
+ // to a multiple of PYRAMID_ALIGNMENT. But we want the first image pixel
+ // to be aligned, not the first byte of the allocation.
+ //
+ // In the loop below, we ensure that the stride of every image is a multiple
+ // of PYRAMID_ALIGNMENT. Thus the allocated size of each pyramid level will
+ // also be a multiple of PYRAMID_ALIGNMENT. Thus, as long as we can get the
+ // first pixel in the first pyramid layer aligned properly, that will
+ // automatically mean that the first pixel of every row of every layer is
+ // properly aligned too.
+ //
+ // Thus all we need to consider is the first pixel in the first layer.
+ // This is located at offset
+ // extra_bytes + level_stride * PYRAMID_PADDING + PYRAMID_PADDING
+ // bytes into the buffer. Since level_stride is a multiple of
+ // PYRAMID_ALIGNMENT, we can ignore that. So we need
+ // extra_bytes + PYRAMID_PADDING = multiple of PYRAMID_ALIGNMENT
+ //
+ // To solve this, we can round PYRAMID_PADDING up to the next multiple
+  // of PYRAMID_ALIGNMENT, then subtract the original value to calculate
+ // how many extra bytes are needed.
+ size_t first_px_offset =
+ (PYRAMID_PADDING + PYRAMID_ALIGNMENT - 1) & ~(PYRAMID_ALIGNMENT - 1);
+ size_t extra_bytes = first_px_offset - PYRAMID_PADDING;
+ buffer_size += extra_bytes;
+
+ // If the original image is stored in an 8-bit buffer, then we can point the
+ // lowest pyramid level at that buffer rather than allocating a new one.
+ int first_allocated_level = image_is_16bit ? 0 : 1;
+
+ for (int level = first_allocated_level; level < n_levels; level++) {
+ PyramidLayer *layer = &pyr->layers[level];
+
+ int level_width = width >> level;
+ int level_height = height >> level;
+
+ // Allocate padding for each layer
+ int padded_width = level_width + 2 * PYRAMID_PADDING;
+ int padded_height = level_height + 2 * PYRAMID_PADDING;
+
+ // Align the layer stride to be a multiple of PYRAMID_ALIGNMENT
+ // This ensures that, as long as the top-left pixel in this pyramid level is
+ // properly aligned, then so will the leftmost pixel in every row of the
+ // pyramid level.
+ int level_stride =
+ (padded_width + PYRAMID_ALIGNMENT - 1) & ~(PYRAMID_ALIGNMENT - 1);
+
+ size_t level_alloc_start = buffer_size;
+ size_t level_start =
+ level_alloc_start + PYRAMID_PADDING * level_stride + PYRAMID_PADDING;
+
+ buffer_size += level_stride * padded_height;
+
+ layer_offsets[level] = level_start;
+ layer->width = level_width;
+ layer->height = level_height;
+ layer->stride = level_stride;
+ }
+
+ pyr->buffer_alloc =
+ aom_memalign(PYRAMID_ALIGNMENT, buffer_size * sizeof(*pyr->buffer_alloc));
+ if (!pyr->buffer_alloc) {
+    aom_free(pyr->layers);
+    aom_free(pyr);
+    aom_free(layer_offsets);
+ return NULL;
+ }
+
+ // Fill in pointers for each level
+ // If image is 8-bit, then the lowest level is left unconfigured for now,
+ // and will be set up properly when the pyramid is filled in
+ for (int level = first_allocated_level; level < n_levels; level++) {
+ PyramidLayer *layer = &pyr->layers[level];
+ layer->buffer = pyr->buffer_alloc + layer_offsets[level];
+ }
+
+#if CONFIG_MULTITHREAD
+ pthread_mutex_init(&pyr->mutex, NULL);
+#endif // CONFIG_MULTITHREAD
+
+ aom_free(layer_offsets);
+ return pyr;
+}
+
+// Fill the border region of a pyramid frame.
+// This must be called after the main image area is filled out.
+// `img_buf` should point to the first pixel in the image area,
+// i.e. it should be pyr->layers[level].buffer.
+static INLINE void fill_border(uint8_t *img_buf, const int width,
+ const int height, const int stride) {
+ // Fill left and right areas
+ for (int row = 0; row < height; row++) {
+ uint8_t *row_start = &img_buf[row * stride];
+ uint8_t left_pixel = row_start[0];
+ memset(row_start - PYRAMID_PADDING, left_pixel, PYRAMID_PADDING);
+ uint8_t right_pixel = row_start[width - 1];
+ memset(row_start + width, right_pixel, PYRAMID_PADDING);
+ }
+
+ // Fill top area
+ for (int row = -PYRAMID_PADDING; row < 0; row++) {
+ uint8_t *row_start = &img_buf[row * stride];
+ memcpy(row_start - PYRAMID_PADDING, img_buf - PYRAMID_PADDING,
+ width + 2 * PYRAMID_PADDING);
+ }
+
+ // Fill bottom area
+ uint8_t *last_row_start = &img_buf[(height - 1) * stride];
+ for (int row = height; row < height + PYRAMID_PADDING; row++) {
+ uint8_t *row_start = &img_buf[row * stride];
+ memcpy(row_start - PYRAMID_PADDING, last_row_start - PYRAMID_PADDING,
+ width + 2 * PYRAMID_PADDING);
+ }
+}
+
+// Compute the coarse-to-fine pyramid levels for a frame.
+// This must only be called while holding frame_pyr->mutex
+static INLINE void fill_pyramid(const YV12_BUFFER_CONFIG *frame, int bit_depth,
+ ImagePyramid *frame_pyr) {
+ int n_levels = frame_pyr->n_levels;
+ const int frame_width = frame->y_crop_width;
+ const int frame_height = frame->y_crop_height;
+ const int frame_stride = frame->y_stride;
+ assert((frame_width >> n_levels) >= 0);
+ assert((frame_height >> n_levels) >= 0);
+
+ PyramidLayer *first_layer = &frame_pyr->layers[0];
+ if (frame->flags & YV12_FLAG_HIGHBITDEPTH) {
+ // For frames stored in a 16-bit buffer, we need to downconvert to 8 bits
+ assert(first_layer->width == frame_width);
+ assert(first_layer->height == frame_height);
+
+ uint16_t *frame_buffer = CONVERT_TO_SHORTPTR(frame->y_buffer);
+ uint8_t *pyr_buffer = first_layer->buffer;
+ int pyr_stride = first_layer->stride;
+ for (int y = 0; y < frame_height; y++) {
+ uint16_t *frame_row = frame_buffer + y * frame_stride;
+ uint8_t *pyr_row = pyr_buffer + y * pyr_stride;
+ for (int x = 0; x < frame_width; x++) {
+ pyr_row[x] = frame_row[x] >> (bit_depth - 8);
+ }
+ }
+
+ fill_border(pyr_buffer, frame_width, frame_height, pyr_stride);
+ } else {
+ // For frames stored in an 8-bit buffer, we need to configure the first
+ // pyramid layer to point at the original image buffer
+ first_layer->buffer = frame->y_buffer;
+ first_layer->width = frame_width;
+ first_layer->height = frame_height;
+ first_layer->stride = frame_stride;
+ }
+
+ // Fill in the remaining levels through progressive downsampling
+ for (int level = 1; level < n_levels; ++level) {
+ PyramidLayer *prev_layer = &frame_pyr->layers[level - 1];
+ uint8_t *prev_buffer = prev_layer->buffer;
+ int prev_stride = prev_layer->stride;
+
+ PyramidLayer *this_layer = &frame_pyr->layers[level];
+ uint8_t *this_buffer = this_layer->buffer;
+ int this_width = this_layer->width;
+ int this_height = this_layer->height;
+ int this_stride = this_layer->stride;
+
+    // Compute this pyramid level by downsampling the previous, finer level.
+ //
+ // We downsample by a factor of exactly 2, clipping the rightmost and
+    // bottommost pixel of the source level if needed. We do this for
+ // two main reasons:
+ //
+ // 1) In the disflow code, when stepping from a higher pyramid level to a
+ // lower pyramid level, we need to not just interpolate the flow field
+ // but also to scale each flow vector by the upsampling ratio.
+ // So it is much more convenient if this ratio is simply 2.
+ //
+ // 2) Up/downsampling by a factor of 2 can be implemented much more
+ // efficiently than up/downsampling by a generic ratio.
+ // TODO(rachelbarker): Use optimized downsample-by-2 function
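+    // For example, an odd-sized 101x75 source level yields a 50x37 level
+    // here; the rightmost column and bottom row of the source are dropped.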
+ av1_resize_plane(prev_buffer, this_height << 1, this_width << 1,
+ prev_stride, this_buffer, this_height, this_width,
+ this_stride);
+ fill_border(this_buffer, this_width, this_height, this_stride);
+ }
+}
+
+// Fill out a downsampling pyramid for a given frame.
+//
+// The top level (index 0) will always be an 8-bit copy of the input frame,
+// regardless of the input bit depth. Additional levels are then downscaled
+// by powers of 2.
+//
+// For small input frames, the number of levels actually constructed
+// will be limited so that the smallest image is at least MIN_PYRAMID_SIZE
+// pixels along each side.
+//
+// However, if the input frame has a side of length < MIN_PYRAMID_SIZE,
+// we will still construct the top level.
+void aom_compute_pyramid(const YV12_BUFFER_CONFIG *frame, int bit_depth,
+ ImagePyramid *pyr) {
+ assert(pyr);
+
+ // Per the comments in the ImagePyramid struct, we must take this mutex
+ // before reading or writing the "valid" flag, and hold it while computing
+ // the pyramid, to ensure proper behaviour if multiple threads call this
+ // function simultaneously
+#if CONFIG_MULTITHREAD
+ pthread_mutex_lock(&pyr->mutex);
+#endif // CONFIG_MULTITHREAD
+
+ if (!pyr->valid) {
+ fill_pyramid(frame, bit_depth, pyr);
+ pyr->valid = true;
+ }
+
+ // At this point, the pyramid is guaranteed to be valid, and can be safely
+ // read from without holding the mutex any more
+
+#if CONFIG_MULTITHREAD
+ pthread_mutex_unlock(&pyr->mutex);
+#endif // CONFIG_MULTITHREAD
+}
+
+#ifndef NDEBUG
+// Check if a pyramid has already been computed.
+// This is mostly a debug helper - as it is necessary to hold pyr->mutex
+// while reading the valid flag, we cannot just write:
+// assert(pyr->valid);
+// This function allows the check to be correctly written as:
+// assert(aom_is_pyramid_valid(pyr));
+bool aom_is_pyramid_valid(ImagePyramid *pyr) {
+ assert(pyr);
+
+ // Per the comments in the ImagePyramid struct, we must take this mutex
+ // before reading or writing the "valid" flag, and hold it while computing
+ // the pyramid, to ensure proper behaviour if multiple threads call this
+ // function simultaneously
+#if CONFIG_MULTITHREAD
+ pthread_mutex_lock(&pyr->mutex);
+#endif // CONFIG_MULTITHREAD
+
+ bool valid = pyr->valid;
+
+#if CONFIG_MULTITHREAD
+ pthread_mutex_unlock(&pyr->mutex);
+#endif // CONFIG_MULTITHREAD
+
+ return valid;
+}
+#endif
+
+// Mark a pyramid as no longer containing valid data.
+// This must be done whenever the corresponding frame buffer is reused
+void aom_invalidate_pyramid(ImagePyramid *pyr) {
+ if (pyr) {
+#if CONFIG_MULTITHREAD
+ pthread_mutex_lock(&pyr->mutex);
+#endif // CONFIG_MULTITHREAD
+ pyr->valid = false;
+#if CONFIG_MULTITHREAD
+ pthread_mutex_unlock(&pyr->mutex);
+#endif // CONFIG_MULTITHREAD
+ }
+}
+
+// Release the memory associated with a pyramid
+void aom_free_pyramid(ImagePyramid *pyr) {
+ if (pyr) {
+#if CONFIG_MULTITHREAD
+ pthread_mutex_destroy(&pyr->mutex);
+#endif // CONFIG_MULTITHREAD
+ aom_free(pyr->buffer_alloc);
+ aom_free(pyr->layers);
+ aom_free(pyr);
+ }
+}
diff --git a/aom_dsp/pyramid.h b/aom_dsp/pyramid.h
new file mode 100644
index 0000000..812aae1
--- /dev/null
+++ b/aom_dsp/pyramid.h
@@ -0,0 +1,127 @@
+/*
+ * Copyright (c) 2022, Alliance for Open Media. All rights reserved
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#ifndef AOM_AOM_DSP_PYRAMID_H_
+#define AOM_AOM_DSP_PYRAMID_H_
+
+#include <stddef.h>
+#include <stdint.h>
+#include <stdbool.h>
+
+#include "config/aom_config.h"
+
+#include "aom_scale/yv12config.h"
+#include "aom_util/aom_thread.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+// Minimum dimensions of a downsampled image
+#define MIN_PYRAMID_SIZE_LOG2 3
+#define MIN_PYRAMID_SIZE (1 << MIN_PYRAMID_SIZE_LOG2)
+
+// Size of border around each pyramid image, in pixels
+// Similarly to the border around regular image buffers, this border is filled
+// with copies of the outermost pixels of the frame, to allow for more efficient
+// convolution code
+// TODO(rachelbarker): How many pixels do we actually need here?
+// I think we only need 9 for disflow, but how many for corner matching?
+#define PYRAMID_PADDING 16
+
+// Byte alignment of each line within the image pyramids.
+// That is, the first pixel inside the image (ie, not in the border region),
+// on each row of each pyramid level, is aligned to this byte alignment.
+// This value must be a power of 2.
+#define PYRAMID_ALIGNMENT 32
+
+typedef struct {
+ uint8_t *buffer;
+ int width;
+ int height;
+ int stride;
+} PyramidLayer;
+
+// Struct for an image pyramid
+typedef struct image_pyramid {
+#if CONFIG_MULTITHREAD
+ // Mutex which is used to prevent the pyramid being computed twice at the
+ // same time
+ //
+ // Semantics:
+ // * This mutex must be held whenever reading or writing the `valid` flag
+ //
+ // * This mutex must also be held while computing the image pyramid,
+ // to ensure that only one thread may do so at a time.
+ //
+ // * However, once you have read the valid flag and seen a true value,
+ // it is safe to drop the mutex and read from the remaining fields.
+ // This is because, once the image pyramid is computed, its contents
+ // will not be changed until the parent frame buffer is recycled,
+ // which will not happen until there are no more outstanding references
+ // to the frame buffer.
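+  //
+  // A reader thus does, in sketch form:
+  //   pthread_mutex_lock(&pyr->mutex);
+  //   bool valid = pyr->valid;
+  //   pthread_mutex_unlock(&pyr->mutex);
+  //   if (valid) { /* layers may now be read without the mutex */ }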
+ pthread_mutex_t mutex;
+#endif
+ // Flag indicating whether the pyramid contains valid data
+ bool valid;
+ // Number of allocated/filled levels in this pyramid
+ int n_levels;
+ // Pointer to allocated buffer
+ uint8_t *buffer_alloc;
+ // Data for each level
+ // The `buffer` pointers inside this array point into the region which
+ // is stored in the `buffer_alloc` field here
+ PyramidLayer *layers;
+} ImagePyramid;
+
+size_t aom_get_pyramid_alloc_size(int width, int height, int n_levels,
+ bool image_is_16bit);
+
+ImagePyramid *aom_alloc_pyramid(int width, int height, int n_levels,
+ bool image_is_16bit);
+
+// Fill out a downsampling pyramid for a given frame.
+//
+// The top level (index 0) will always be an 8-bit copy of the input frame,
+// regardless of the input bit depth. Additional levels are then downscaled
+// by powers of 2.
+//
+// For small input frames, the number of levels actually constructed
+// will be limited so that the smallest image is at least MIN_PYRAMID_SIZE
+// pixels along each side.
+//
+// However, if the input frame has a side of length < MIN_PYRAMID_SIZE,
+// we will still construct the top level.
+void aom_compute_pyramid(const YV12_BUFFER_CONFIG *frame, int bit_depth,
+ ImagePyramid *pyr);
+
+#ifndef NDEBUG
+// Check if a pyramid has already been computed.
+// This is mostly a debug helper - as it is necessary to hold pyr->mutex
+// while reading the valid flag, we cannot just write:
+// assert(pyr->valid);
+// This function allows the check to be correctly written as:
+// assert(aom_is_pyramid_valid(pyr));
+bool aom_is_pyramid_valid(ImagePyramid *pyr);
+#endif
+
+// Mark a pyramid as no longer containing valid data.
+// This must be done whenever the corresponding frame buffer is reused
+void aom_invalidate_pyramid(ImagePyramid *pyr);
+
+// Release the memory associated with a pyramid
+void aom_free_pyramid(ImagePyramid *pyr);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif // AOM_AOM_DSP_PYRAMID_H_
diff --git a/aom_dsp/sad.c b/aom_dsp/sad.c
index 5b7b0e4..341a5ff 100644
--- a/aom_dsp/sad.c
+++ b/aom_dsp/sad.c
@@ -35,13 +35,6 @@
return sad;
}
-#define SAD_MXH(m) \
- unsigned int aom_sad##m##xh_c(const uint8_t *a, int a_stride, \
- const uint8_t *b, int b_stride, int width, \
- int height) { \
- return sad(a, a_stride, b, b_stride, width, height); \
- }
-
#define SADMXN(m, n) \
unsigned int aom_sad##m##x##n##_c(const uint8_t *src, int src_stride, \
const uint8_t *ref, int ref_stride) { \
@@ -68,7 +61,6 @@
return 2 * sad(src, 2 * src_stride, ref, 2 * ref_stride, (m), (n / 2)); \
}
-#if CONFIG_REALTIME_ONLY
// Calculate sad against 4 reference locations and store each in sad_array
#define SAD_MXNX4D(m, n) \
void aom_sad##m##x##n##x4d_c(const uint8_t *src, int src_stride, \
@@ -89,37 +81,6 @@
2 * ref_stride, (m), (n / 2)); \
} \
}
-#else // !CONFIG_REALTIME_ONLY
-// Calculate sad against 4 reference locations and store each in sad_array
-#define SAD_MXNX4D(m, n) \
- void aom_sad##m##x##n##x4d_c(const uint8_t *src, int src_stride, \
- const uint8_t *const ref_array[4], \
- int ref_stride, uint32_t sad_array[4]) { \
- int i; \
- for (i = 0; i < 4; ++i) { \
- sad_array[i] = \
- aom_sad##m##x##n##_c(src, src_stride, ref_array[i], ref_stride); \
- } \
- } \
- void aom_sad##m##x##n##x4d_avg_c( \
- const uint8_t *src, int src_stride, const uint8_t *const ref_array[4], \
- int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]) { \
- int i; \
- for (i = 0; i < 4; ++i) { \
- sad_array[i] = aom_sad##m##x##n##_avg_c(src, src_stride, ref_array[i], \
- ref_stride, second_pred); \
- } \
- } \
- void aom_sad_skip_##m##x##n##x4d_c(const uint8_t *src, int src_stride, \
- const uint8_t *const ref_array[4], \
- int ref_stride, uint32_t sad_array[4]) { \
- int i; \
- for (i = 0; i < 4; ++i) { \
- sad_array[i] = 2 * sad(src, 2 * src_stride, ref_array[i], \
- 2 * ref_stride, (m), (n / 2)); \
- } \
- }
-#endif // CONFIG_REALTIME_ONLY
// Call SIMD version of aom_sad_mxnx4d if the 3d version is unavailable.
#define SAD_MXNX3D(m, n) \
void aom_sad##m##x##n##x3d_c(const uint8_t *src, int src_stride, \
@@ -208,13 +169,7 @@
SAD_MXNX4D(4, 4)
SAD_MXNX3D(4, 4)
-SAD_MXH(128)
-SAD_MXH(64)
-SAD_MXH(32)
-SAD_MXH(16)
-SAD_MXH(8)
-SAD_MXH(4)
-
+#if !CONFIG_REALTIME_ONLY
SADMXN(4, 16)
SAD_MXNX4D(4, 16)
SADMXN(16, 4)
@@ -227,7 +182,6 @@
SAD_MXNX4D(16, 64)
SADMXN(64, 16)
SAD_MXNX4D(64, 16)
-#if !CONFIG_REALTIME_ONLY
SAD_MXNX3D(4, 16)
SAD_MXNX3D(16, 4)
SAD_MXNX3D(8, 32)
diff --git a/aom_dsp/simd/v128_intrinsics_arm.h b/aom_dsp/simd/v128_intrinsics_arm.h
index 2d497f4..6488de7 100644
--- a/aom_dsp/simd/v128_intrinsics_arm.h
+++ b/aom_dsp/simd/v128_intrinsics_arm.h
@@ -14,6 +14,8 @@
#include <arm_neon.h>
+#include "config/aom_config.h"
+
#include "aom_dsp/simd/v64_intrinsics_arm.h"
typedef int64x2_t v128;
@@ -29,7 +31,7 @@
SIMD_INLINE v128 v128_from_v64(v64 a, v64 b) { return vcombine_s64(b, a); }
SIMD_INLINE v128 v128_from_64(uint64_t a, uint64_t b) {
- return vcombine_s64((int64x1_t)b, (int64x1_t)a);
+ return vcombine_s64(vcreate_s64(b), vcreate_s64(a));
}
SIMD_INLINE v128 v128_from_32(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
@@ -97,11 +99,11 @@
int16x8_t t2 = vmulq_s16(
vmovl_s8(vreinterpret_s8_s64(vget_high_s64(a))),
vreinterpretq_s16_u16(vmovl_u8(vreinterpret_u8_s64(vget_high_s64(b)))));
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vaddlvq_s16(t1) + vaddlvq_s16(t2);
#else
int64x2_t t = vpaddlq_s32(vaddq_s32(vpaddlq_s16(t1), vpaddlq_s16(t2)));
- return (int64_t)vget_high_s64(t) + (int64_t)vget_low_s64(t);
+ return vget_lane_s64(vadd_s64(vget_high_s64(t), vget_low_s64(t)), 0);
#endif
}
@@ -113,11 +115,11 @@
SIMD_INLINE int64_t v128_dotp_s32(v128 a, v128 b) {
int64x2_t t = vpaddlq_s32(
vmulq_s32(vreinterpretq_s32_s64(a), vreinterpretq_s32_s64(b)));
- return (int64_t)vget_high_s64(t) + (int64_t)vget_low_s64(t);
+ return vget_lane_s64(vadd_s64(vget_high_s64(t), vget_low_s64(t)), 0);
}
SIMD_INLINE uint64_t v128_hadd_u8(v128 x) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vaddlvq_u8(vreinterpretq_u8_s64(x));
#else
uint64x2_t t = vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(vreinterpretq_u8_s64(x))));
@@ -155,11 +157,12 @@
}
SIMD_INLINE uint32_t v128_sad_u8_sum(sad128_internal s) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vaddlvq_u16(s.hi) + vaddlvq_u16(s.lo);
#else
uint64x2_t t = vpaddlq_u32(vpaddlq_u16(vaddq_u16(s.hi, s.lo)));
- return (uint32_t)(uint64_t)(vget_high_u64(t) + vget_low_u64(t));
+ return (uint32_t)vget_lane_u64(vadd_u64(vget_high_u64(t), vget_low_u64(t)),
+ 0);
#endif
}
@@ -285,7 +288,7 @@
}
SIMD_INLINE v128 v128_mulhi_s16(v128 a, v128 b) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpretq_s64_s16(vuzp2q_s16(
vreinterpretq_s16_s32(vmull_s16(vreinterpret_s16_s64(vget_low_s64(a)),
vreinterpret_s16_s64(vget_low_s64(b)))),
@@ -303,7 +306,7 @@
}
SIMD_INLINE v128 v128_madd_s16(v128 a, v128 b) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
int32x4_t t1 = vmull_s16(vreinterpret_s16_s64(vget_low_s64(a)),
vreinterpret_s16_s64(vget_low_s64(b)));
int32x4_t t2 =
@@ -316,7 +319,7 @@
}
SIMD_INLINE v128 v128_madd_us8(v128 a, v128 b) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
int16x8_t t1 = vmulq_s16(
vreinterpretq_s16_u16(vmovl_u8(vreinterpret_u8_s64(vget_low_s64(a)))),
vmovl_s8(vreinterpret_s8_s64(vget_low_s64(b))));
@@ -368,7 +371,7 @@
SIMD_INLINE uint32_t v128_movemask_8(v128 a) {
a = vreinterpretq_s64_u8(vcltq_s8(vreinterpretq_s8_s64(a), vdupq_n_s8(0)));
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
uint8x16_t m =
vandq_u8(vreinterpretq_u8_s64(a),
vreinterpretq_u8_u64(vdupq_n_u64(0x8040201008040201ULL)));
@@ -377,8 +380,8 @@
uint64x2_t m = vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(
vandq_u8(vreinterpretq_u8_s64(a),
vreinterpretq_u8_u64(vdupq_n_u64(0x8040201008040201ULL))))));
- return v64_low_u32(
- v64_ziplo_8(v128_high_v64((v128)m), v128_low_v64((v128)m)));
+ int64x2_t s = vreinterpretq_s64_u64(m);
+ return v64_low_u32(v64_ziplo_8(vget_high_s64(s), vget_low_s64(s)));
#endif
}
@@ -413,7 +416,7 @@
}
SIMD_INLINE v128 v128_ziplo_8(v128 x, v128 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpretq_s64_u8(
vzip1q_u8(vreinterpretq_u8_s64(y), vreinterpretq_u8_s64(x)));
#else
@@ -423,7 +426,7 @@
}
SIMD_INLINE v128 v128_ziphi_8(v128 x, v128 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpretq_s64_u8(
vzip2q_u8(vreinterpretq_u8_s64(y), vreinterpretq_u8_s64(x)));
#else
@@ -438,7 +441,7 @@
}
SIMD_INLINE v128 v128_ziplo_16(v128 x, v128 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpretq_s64_u16(
vzip1q_u16(vreinterpretq_u16_s64(y), vreinterpretq_u16_s64(x)));
#else
@@ -448,7 +451,7 @@
}
SIMD_INLINE v128 v128_ziphi_16(v128 x, v128 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpretq_s64_u16(
vzip2q_u16(vreinterpretq_u16_s64(y), vreinterpretq_u16_s64(x)));
#else
@@ -463,7 +466,7 @@
}
SIMD_INLINE v128 v128_ziplo_32(v128 x, v128 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpretq_s64_u32(
vzip1q_u32(vreinterpretq_u32_s64(y), vreinterpretq_u32_s64(x)));
#else
@@ -473,7 +476,7 @@
}
SIMD_INLINE v128 v128_ziphi_32(v128 x, v128 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpretq_s64_u32(
vzip2q_u32(vreinterpretq_u32_s64(y), vreinterpretq_u32_s64(x)));
#else
@@ -488,16 +491,15 @@
}
SIMD_INLINE v128 v128_ziplo_64(v128 a, v128 b) {
- return v128_from_v64(vget_low_s64((int64x2_t)a), vget_low_s64((int64x2_t)b));
+ return v128_from_v64(vget_low_s64(a), vget_low_s64(b));
}
SIMD_INLINE v128 v128_ziphi_64(v128 a, v128 b) {
- return v128_from_v64(vget_high_s64((int64x2_t)a),
- vget_high_s64((int64x2_t)b));
+ return v128_from_v64(vget_high_s64(a), vget_high_s64(b));
}
SIMD_INLINE v128 v128_unziplo_8(v128 x, v128 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpretq_s64_u8(
vuzp1q_u8(vreinterpretq_u8_s64(y), vreinterpretq_u8_s64(x)));
#else
@@ -507,7 +509,7 @@
}
SIMD_INLINE v128 v128_unziphi_8(v128 x, v128 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpretq_s64_u8(
vuzp2q_u8(vreinterpretq_u8_s64(y), vreinterpretq_u8_s64(x)));
#else
@@ -517,7 +519,7 @@
}
SIMD_INLINE v128 v128_unziplo_16(v128 x, v128 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpretq_s64_u16(
vuzp1q_u16(vreinterpretq_u16_s64(y), vreinterpretq_u16_s64(x)));
#else
@@ -528,7 +530,7 @@
}
SIMD_INLINE v128 v128_unziphi_16(v128 x, v128 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpretq_s64_u16(
vuzp2q_u16(vreinterpretq_u16_s64(y), vreinterpretq_u16_s64(x)));
#else
@@ -539,7 +541,7 @@
}
SIMD_INLINE v128 v128_unziplo_32(v128 x, v128 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpretq_s64_u32(
vuzp1q_u32(vreinterpretq_u32_s64(y), vreinterpretq_u32_s64(x)));
#else
@@ -550,7 +552,7 @@
}
SIMD_INLINE v128 v128_unziphi_32(v128 x, v128 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpretq_s64_u32(
vuzp2q_u32(vreinterpretq_u32_s64(y), vreinterpretq_u32_s64(x)));
#else
@@ -637,16 +639,18 @@
}
SIMD_INLINE v128 v128_shuffle_8(v128 x, v128 pattern) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpretq_s64_u8(
vqtbl1q_u8(vreinterpretq_u8_s64(x), vreinterpretq_u8_s64(pattern)));
#else
uint8x8x2_t p = { { vget_low_u8(vreinterpretq_u8_s64(x)),
vget_high_u8(vreinterpretq_u8_s64(x)) } };
- return v128_from_64((uint64_t)vreinterpret_s64_u8(vtbl2_u8(
- p, vreinterpret_u8_s64(vget_high_s64(pattern)))),
- (uint64_t)vreinterpret_s64_u8(vtbl2_u8(
- p, vreinterpret_u8_s64(vget_low_s64(pattern)))));
+ uint8x8_t shuffle_hi =
+ vtbl2_u8(p, vreinterpret_u8_s64(vget_high_s64(pattern)));
+ uint8x8_t shuffle_lo =
+ vtbl2_u8(p, vreinterpret_u8_s64(vget_low_s64(pattern)));
+ return v128_from_64(vget_lane_u64(vreinterpret_u64_u8(shuffle_hi), 0),
+ vget_lane_u64(vreinterpret_u64_u8(shuffle_lo), 0));
#endif
}
@@ -697,72 +701,72 @@
SIMD_INLINE v128 v128_shl_8(v128 a, unsigned int c) {
return (c > 7) ? v128_zero()
- : vreinterpretq_s64_u8(
- vshlq_u8(vreinterpretq_u8_s64(a), vdupq_n_s8(c)));
+ : vreinterpretq_s64_u8(vshlq_u8(vreinterpretq_u8_s64(a),
+ vdupq_n_s8((int8_t)c)));
}
SIMD_INLINE v128 v128_shr_u8(v128 a, unsigned int c) {
return (c > 7) ? v128_zero()
- : vreinterpretq_s64_u8(
- vshlq_u8(vreinterpretq_u8_s64(a), vdupq_n_s8(-c)));
+ : vreinterpretq_s64_u8(vshlq_u8(vreinterpretq_u8_s64(a),
+ vdupq_n_s8(-(int8_t)c)));
}
SIMD_INLINE v128 v128_shr_s8(v128 a, unsigned int c) {
return (c > 7) ? v128_ones()
- : vreinterpretq_s64_s8(
- vshlq_s8(vreinterpretq_s8_s64(a), vdupq_n_s8(-c)));
+ : vreinterpretq_s64_s8(vshlq_s8(vreinterpretq_s8_s64(a),
+ vdupq_n_s8(-(int8_t)c)));
}
SIMD_INLINE v128 v128_shl_16(v128 a, unsigned int c) {
return (c > 15) ? v128_zero()
- : vreinterpretq_s64_u16(
- vshlq_u16(vreinterpretq_u16_s64(a), vdupq_n_s16(c)));
+ : vreinterpretq_s64_u16(vshlq_u16(vreinterpretq_u16_s64(a),
+ vdupq_n_s16((int16_t)c)));
}
SIMD_INLINE v128 v128_shr_u16(v128 a, unsigned int c) {
return (c > 15) ? v128_zero()
- : vreinterpretq_s64_u16(
- vshlq_u16(vreinterpretq_u16_s64(a), vdupq_n_s16(-c)));
+ : vreinterpretq_s64_u16(vshlq_u16(vreinterpretq_u16_s64(a),
+ vdupq_n_s16(-(int16_t)c)));
}
SIMD_INLINE v128 v128_shr_s16(v128 a, unsigned int c) {
return (c > 15) ? v128_ones()
- : vreinterpretq_s64_s16(
- vshlq_s16(vreinterpretq_s16_s64(a), vdupq_n_s16(-c)));
+ : vreinterpretq_s64_s16(vshlq_s16(vreinterpretq_s16_s64(a),
+ vdupq_n_s16(-(int16_t)c)));
}
SIMD_INLINE v128 v128_shl_32(v128 a, unsigned int c) {
return (c > 31) ? v128_zero()
- : vreinterpretq_s64_u32(
- vshlq_u32(vreinterpretq_u32_s64(a), vdupq_n_s32(c)));
+ : vreinterpretq_s64_u32(vshlq_u32(vreinterpretq_u32_s64(a),
+ vdupq_n_s32((int32_t)c)));
}
SIMD_INLINE v128 v128_shr_u32(v128 a, unsigned int c) {
return (c > 31) ? v128_zero()
- : vreinterpretq_s64_u32(
- vshlq_u32(vreinterpretq_u32_s64(a), vdupq_n_s32(-c)));
+ : vreinterpretq_s64_u32(vshlq_u32(vreinterpretq_u32_s64(a),
+ vdupq_n_s32(-(int32_t)c)));
}
SIMD_INLINE v128 v128_shr_s32(v128 a, unsigned int c) {
return (c > 31) ? v128_ones()
- : vreinterpretq_s64_s32(
- vshlq_s32(vreinterpretq_s32_s64(a), vdupq_n_s32(-c)));
+ : vreinterpretq_s64_s32(vshlq_s32(vreinterpretq_s32_s64(a),
+ vdupq_n_s32(-(int32_t)c)));
}
SIMD_INLINE v128 v128_shl_64(v128 a, unsigned int c) {
return (c > 63) ? v128_zero()
- : vreinterpretq_s64_u64(
- vshlq_u64(vreinterpretq_u64_s64(a), vdupq_n_s64(c)));
+ : vreinterpretq_s64_u64(vshlq_u64(vreinterpretq_u64_s64(a),
+ vdupq_n_s64((int64_t)c)));
}
SIMD_INLINE v128 v128_shr_u64(v128 a, unsigned int c) {
return (c > 63) ? v128_zero()
- : vreinterpretq_s64_u64(
- vshlq_u64(vreinterpretq_u64_s64(a), vdupq_n_s64(-c)));
+ : vreinterpretq_s64_u64(vshlq_u64(vreinterpretq_u64_s64(a),
+ vdupq_n_s64(-(int64_t)c)));
}
SIMD_INLINE v128 v128_shr_s64(v128 a, unsigned int c) {
- return (c > 63) ? v128_ones() : vshlq_s64(a, vdupq_n_s64(-c));
+ return (c > 63) ? v128_ones() : vshlq_s64(a, vdupq_n_s64(-(int64_t)c));
}
#if defined(__OPTIMIZE__) && __OPTIMIZE__ && !defined(__clang__)
@@ -949,8 +953,8 @@
SIMD_INLINE uint32_t v128_sad_u16_sum(sad128_internal_u16 s) {
uint64x2_t t = vpaddlq_u32(s);
- return (uint32_t)(uint64_t)vget_high_u64(t) +
- (uint32_t)(uint64_t)vget_low_u64(t);
+ return (uint32_t)vget_lane_u64(vadd_u64(vget_high_u64(t), vget_low_u64(t)),
+ 0);
}
typedef v128 ssd128_internal_s16;
diff --git a/aom_dsp/simd/v256_intrinsics_v128.h b/aom_dsp/simd/v256_intrinsics_v128.h
index 0d22667..4cd83f7 100644
--- a/aom_dsp/simd/v256_intrinsics_v128.h
+++ b/aom_dsp/simd/v256_intrinsics_v128.h
@@ -12,6 +12,8 @@
#ifndef AOM_AOM_DSP_SIMD_V256_INTRINSICS_V128_H_
#define AOM_AOM_DSP_SIMD_V256_INTRINSICS_V128_H_
+#include "config/aom_config.h"
+
#if HAVE_NEON
#include "aom_dsp/simd/v128_intrinsics_arm.h"
#elif HAVE_SSE2
@@ -614,7 +616,7 @@
SIMD_INLINE v256 v256_shuffle_8(v256 x, v256 pattern) {
#if HAVE_NEON
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
uint8x16x2_t p = { { vreinterpretq_u8_s64(x.val[0]),
vreinterpretq_u8_s64(x.val[1]) } };
return v256_from_v128(
@@ -626,15 +628,18 @@
vget_high_u8(vreinterpretq_u8_s64(x.val[0])),
vget_low_u8(vreinterpretq_u8_s64(x.val[1])),
vget_high_u8(vreinterpretq_u8_s64(x.val[1])) } };
- return v256_from_64(
- (uint64_t)vreinterpret_s64_u8(
- vtbl4_u8(p, vreinterpret_u8_s64(vget_high_s64(pattern.val[1])))),
- (uint64_t)vreinterpret_s64_u8(
- vtbl4_u8(p, vreinterpret_u8_s64(vget_low_s64(pattern.val[1])))),
- (uint64_t)vreinterpret_s64_u8(
- vtbl4_u8(p, vreinterpret_u8_s64(vget_high_s64(pattern.val[0])))),
- (uint64_t)vreinterpret_s64_u8(
- vtbl4_u8(p, vreinterpret_u8_s64(vget_low_s64(pattern.val[0])))));
+ uint8x8_t shuffle1_hi =
+ vtbl4_u8(p, vreinterpret_u8_s64(vget_high_s64(pattern.val[1])));
+ uint8x8_t shuffle1_lo =
+ vtbl4_u8(p, vreinterpret_u8_s64(vget_low_s64(pattern.val[1])));
+ uint8x8_t shuffle0_hi =
+ vtbl4_u8(p, vreinterpret_u8_s64(vget_high_s64(pattern.val[0])));
+ uint8x8_t shuffle0_lo =
+ vtbl4_u8(p, vreinterpret_u8_s64(vget_low_s64(pattern.val[0])));
+ return v256_from_64(vget_lane_u64(vreinterpret_u64_u8(shuffle1_hi), 0),
+ vget_lane_u64(vreinterpret_u64_u8(shuffle1_lo), 0),
+ vget_lane_u64(vreinterpret_u64_u8(shuffle0_hi), 0),
+ vget_lane_u64(vreinterpret_u64_u8(shuffle0_lo), 0));
#endif
#else
v128 c16 = v128_dup_8(16);
@@ -650,7 +655,7 @@
SIMD_INLINE v256 v256_wideshuffle_8(v256 x, v256 y, v256 pattern) {
#if HAVE_NEON
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
uint8x16x4_t p = { {
vreinterpretq_u8_s64(y.val[0]),
vreinterpretq_u8_s64(y.val[1]),
@@ -672,24 +677,26 @@
vget_high_u8(vreinterpretq_u8_s64(y.val[0])),
vget_low_u8(vreinterpretq_u8_s64(y.val[1])),
vget_high_u8(vreinterpretq_u8_s64(y.val[1])) } };
- v256 r1 =
- v256_from_64((uint64_t)vreinterpret_s64_u8(vtbl4_u8(
- p, vreinterpret_u8_s64(vget_high_s64(p32.val[1])))),
- (uint64_t)vreinterpret_s64_u8(vtbl4_u8(
- p, vreinterpret_u8_s64(vget_low_s64(p32.val[1])))),
- (uint64_t)vreinterpret_s64_u8(vtbl4_u8(
- p, vreinterpret_u8_s64(vget_high_s64(p32.val[0])))),
- (uint64_t)vreinterpret_s64_u8(vtbl4_u8(
- p, vreinterpret_u8_s64(vget_low_s64(p32.val[0])))));
- v256 r2 =
- v256_from_64((uint64_t)vreinterpret_s64_u8(vtbl4_u8(
- q, vreinterpret_u8_s64(vget_high_s64(pattern.val[1])))),
- (uint64_t)vreinterpret_s64_u8(vtbl4_u8(
- q, vreinterpret_u8_s64(vget_low_s64(pattern.val[1])))),
- (uint64_t)vreinterpret_s64_u8(vtbl4_u8(
- q, vreinterpret_u8_s64(vget_high_s64(pattern.val[0])))),
- (uint64_t)vreinterpret_s64_u8(vtbl4_u8(
- q, vreinterpret_u8_s64(vget_low_s64(pattern.val[0])))));
+ uint8x8_t shuffle1_hi =
+ vtbl4_u8(p, vreinterpret_u8_s64(vget_high_s64(p32.val[1])));
+ uint8x8_t shuffle1_lo =
+ vtbl4_u8(p, vreinterpret_u8_s64(vget_low_s64(p32.val[1])));
+ uint8x8_t shuffle0_hi =
+ vtbl4_u8(p, vreinterpret_u8_s64(vget_high_s64(p32.val[0])));
+ uint8x8_t shuffle0_lo =
+ vtbl4_u8(p, vreinterpret_u8_s64(vget_low_s64(p32.val[0])));
+ v256 r1 = v256_from_64(vget_lane_u64(vreinterpret_u64_u8(shuffle1_hi), 0),
+ vget_lane_u64(vreinterpret_u64_u8(shuffle1_lo), 0),
+ vget_lane_u64(vreinterpret_u64_u8(shuffle0_hi), 0),
+ vget_lane_u64(vreinterpret_u64_u8(shuffle0_lo), 0));
+ shuffle1_hi = vtbl4_u8(q, vreinterpret_u8_s64(vget_high_s64(pattern.val[1])));
+ shuffle1_lo = vtbl4_u8(q, vreinterpret_u8_s64(vget_low_s64(pattern.val[1])));
+ shuffle0_hi = vtbl4_u8(q, vreinterpret_u8_s64(vget_high_s64(pattern.val[0])));
+ shuffle0_lo = vtbl4_u8(q, vreinterpret_u8_s64(vget_low_s64(pattern.val[0])));
+ v256 r2 = v256_from_64(vget_lane_u64(vreinterpret_u64_u8(shuffle1_hi), 0),
+ vget_lane_u64(vreinterpret_u64_u8(shuffle1_lo), 0),
+ vget_lane_u64(vreinterpret_u64_u8(shuffle0_hi), 0),
+ vget_lane_u64(vreinterpret_u64_u8(shuffle0_lo), 0));
return v256_blend_8(r1, r2, v256_cmplt_s8(pattern, c32));
#endif
#else
diff --git a/aom_dsp/simd/v64_intrinsics_arm.h b/aom_dsp/simd/v64_intrinsics_arm.h
index a4ecdf4..8d07c34 100644
--- a/aom_dsp/simd/v64_intrinsics_arm.h
+++ b/aom_dsp/simd/v64_intrinsics_arm.h
@@ -13,6 +13,9 @@
#define AOM_AOM_DSP_SIMD_V64_INTRINSICS_ARM_H_
#include <arm_neon.h>
+#include <string.h>
+
+#include "config/aom_config.h"
#include "aom_dsp/simd/v64_intrinsics_arm.h"
#include "aom_ports/arm.h"
@@ -50,7 +53,7 @@
SIMD_INLINE v64 v64_from_64(uint64_t x) { return vcreate_s64(x); }
-SIMD_INLINE uint64_t v64_u64(v64 x) { return (uint64_t)x; }
+SIMD_INLINE uint64_t v64_u64(v64 x) { return (uint64_t)vget_lane_s64(x, 0); }
SIMD_INLINE uint32_t u32_load_aligned(const void *p) {
return *((uint32_t *)p);
@@ -77,8 +80,7 @@
} __attribute__((__packed__));
((struct Unaligned32Struct *)p)->value = a;
#else
- vst1_lane_u32((uint32_t *)p, vreinterpret_u32_s64((uint64x1_t)(uint64_t)a),
- 0);
+ memcpy(p, &a, 4);
#endif
}
@@ -106,7 +108,8 @@
vext_s8(vreinterpret_s8_s64(b), vreinterpret_s8_s64(a), c))
: b;
#else
- return c ? v64_from_64(((uint64_t)b >> c * 8) | ((uint64_t)a << (8 - c) * 8))
+ return c ? v64_from_64(((uint64_t)vget_lane_s64(b, 0) >> c * 8) |
+ ((uint64_t)vget_lane_s64(a, 0) << (8 - c) * 8))
: b;
#endif
}
@@ -129,35 +132,36 @@
int16x8_t t =
vmulq_s16(vmovl_s8(vreinterpret_s8_s64(x)),
vreinterpretq_s16_u16(vmovl_u8(vreinterpret_u8_s64(y))));
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vaddlvq_s16(t);
#else
int64x2_t r = vpaddlq_s32(vpaddlq_s16(t));
- return (int64_t)vadd_s64(vget_high_s64(r), vget_low_s64(r));
+ return vget_lane_s64(vadd_s64(vget_high_s64(r), vget_low_s64(r)), 0);
#endif
}
SIMD_INLINE int64_t v64_dotp_s16(v64 x, v64 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vaddlvq_s32(
vmull_s16(vreinterpret_s16_s64(x), vreinterpret_s16_s64(y)));
#else
int64x2_t r =
vpaddlq_s32(vmull_s16(vreinterpret_s16_s64(x), vreinterpret_s16_s64(y)));
- return (int64_t)(vget_high_s64(r) + vget_low_s64(r));
+ return vget_lane_s64(vadd_s64(vget_high_s64(r), vget_low_s64(r)), 0);
#endif
}
SIMD_INLINE uint64_t v64_hadd_u8(v64 x) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vaddlv_u8(vreinterpret_u8_s64(x));
#else
- return (uint64_t)vpaddl_u32(vpaddl_u16(vpaddl_u8(vreinterpret_u8_s64(x))));
+ return vget_lane_u64(
+ vpaddl_u32(vpaddl_u16(vpaddl_u8(vreinterpret_u8_s64(x)))), 0);
#endif
}
SIMD_INLINE int64_t v64_hadd_s16(v64 a) {
- return (int64_t)vpaddl_s32(vpaddl_s16(vreinterpret_s16_s64(a)));
+ return vget_lane_s64(vpaddl_s32(vpaddl_s16(vreinterpret_s16_s64(a))), 0);
}
typedef uint16x8_t sad64_internal;
@@ -171,11 +175,12 @@
}
SIMD_INLINE uint32_t v64_sad_u8_sum(sad64_internal s) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vaddlvq_u16(s);
#else
uint64x2_t r = vpaddlq_u32(vpaddlq_u16(s));
- return (uint32_t)(uint64_t)(vget_high_u64(r) + vget_low_u64(r));
+ return (uint32_t)vget_lane_u64(vadd_u64(vget_high_u64(r), vget_low_u64(r)),
+ 0);
#endif
}
@@ -191,7 +196,7 @@
}
SIMD_INLINE uint32_t v64_ssd_u8_sum(ssd64_internal s) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vaddvq_u32(s);
#else
uint64x2_t t = vpaddlq_u32(s);
@@ -287,7 +292,7 @@
}
SIMD_INLINE v64 v64_mulhi_s16(v64 x, v64 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
int16x8_t t = vreinterpretq_s16_s32(
vmull_s16(vreinterpret_s16_s64(x), vreinterpret_s16_s64(y)));
return vget_low_s64(vreinterpretq_s64_s16(vuzp2q_s16(t, t)));
@@ -367,7 +372,7 @@
}
SIMD_INLINE v64 v64_ziplo_8(v64 x, v64 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpret_s64_u8(
vzip1_u8(vreinterpret_u8_s64(y), vreinterpret_u8_s64(x)));
#else
@@ -377,7 +382,7 @@
}
SIMD_INLINE v64 v64_ziphi_8(v64 x, v64 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpret_s64_u8(
vzip2_u8(vreinterpret_u8_s64(y), vreinterpret_u8_s64(x)));
#else
@@ -387,7 +392,7 @@
}
SIMD_INLINE v64 v64_ziplo_16(v64 x, v64 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpret_s64_u16(
vzip1_u16(vreinterpret_u16_s64(y), vreinterpret_u16_s64(x)));
#else
@@ -397,7 +402,7 @@
}
SIMD_INLINE v64 v64_ziphi_16(v64 x, v64 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpret_s64_u16(
vzip2_u16(vreinterpret_u16_s64(y), vreinterpret_u16_s64(x)));
#else
@@ -407,7 +412,7 @@
}
SIMD_INLINE v64 v64_ziplo_32(v64 x, v64 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpret_s64_u32(
vzip1_u32(vreinterpret_u32_s64(y), vreinterpret_u32_s64(x)));
#else
@@ -417,7 +422,7 @@
}
SIMD_INLINE v64 v64_ziphi_32(v64 x, v64 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpret_s64_u32(
vzip2_u32(vreinterpret_u32_s64(y), vreinterpret_u32_s64(x)));
#else
@@ -463,7 +468,7 @@
}
SIMD_INLINE v64 v64_unziplo_8(v64 x, v64 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpret_s64_u8(
vuzp1_u8(vreinterpret_u8_s64(y), vreinterpret_u8_s64(x)));
#else
@@ -473,7 +478,7 @@
}
SIMD_INLINE v64 v64_unziphi_8(v64 x, v64 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpret_s64_u8(
vuzp2_u8(vreinterpret_u8_s64(y), vreinterpret_u8_s64(x)));
#else
@@ -483,7 +488,7 @@
}
SIMD_INLINE v64 v64_unziplo_16(v64 x, v64 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpret_s64_u16(
vuzp1_u16(vreinterpret_u16_s64(y), vreinterpret_u16_s64(x)));
#else
@@ -493,7 +498,7 @@
}
SIMD_INLINE v64 v64_unziphi_16(v64 x, v64 y) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vreinterpret_s64_u16(
vuzp2_u16(vreinterpret_u16_s64(y), vreinterpret_u16_s64(x)));
#else
@@ -556,43 +561,48 @@
}
SIMD_INLINE v64 v64_shl_8(v64 a, unsigned int c) {
- return vreinterpret_s64_u8(vshl_u8(vreinterpret_u8_s64(a), vdup_n_s8(c)));
+ return vreinterpret_s64_u8(
+ vshl_u8(vreinterpret_u8_s64(a), vdup_n_s8((int8_t)c)));
}
SIMD_INLINE v64 v64_shr_u8(v64 a, unsigned int c) {
- return vreinterpret_s64_u8(vshl_u8(vreinterpret_u8_s64(a), vdup_n_s8(-c)));
+ return vreinterpret_s64_u8(
+ vshl_u8(vreinterpret_u8_s64(a), vdup_n_s8(-(int8_t)c)));
}
SIMD_INLINE v64 v64_shr_s8(v64 a, unsigned int c) {
- return vreinterpret_s64_s8(vshl_s8(vreinterpret_s8_s64(a), vdup_n_s8(-c)));
+ return vreinterpret_s64_s8(
+ vshl_s8(vreinterpret_s8_s64(a), vdup_n_s8(-(int8_t)c)));
}
SIMD_INLINE v64 v64_shl_16(v64 a, unsigned int c) {
- return vreinterpret_s64_u16(vshl_u16(vreinterpret_u16_s64(a), vdup_n_s16(c)));
+ return vreinterpret_s64_u16(
+ vshl_u16(vreinterpret_u16_s64(a), vdup_n_s16((int16_t)c)));
}
SIMD_INLINE v64 v64_shr_u16(v64 a, unsigned int c) {
return vreinterpret_s64_u16(
- vshl_u16(vreinterpret_u16_s64(a), vdup_n_s16(-(int)c)));
+ vshl_u16(vreinterpret_u16_s64(a), vdup_n_s16(-(int16_t)c)));
}
SIMD_INLINE v64 v64_shr_s16(v64 a, unsigned int c) {
return vreinterpret_s64_s16(
- vshl_s16(vreinterpret_s16_s64(a), vdup_n_s16(-(int)c)));
+ vshl_s16(vreinterpret_s16_s64(a), vdup_n_s16(-(int16_t)c)));
}
SIMD_INLINE v64 v64_shl_32(v64 a, unsigned int c) {
- return vreinterpret_s64_u32(vshl_u32(vreinterpret_u32_s64(a), vdup_n_s32(c)));
+ return vreinterpret_s64_u32(
+ vshl_u32(vreinterpret_u32_s64(a), vdup_n_s32((int32_t)c)));
}
SIMD_INLINE v64 v64_shr_u32(v64 a, unsigned int c) {
return vreinterpret_s64_u32(
- vshl_u32(vreinterpret_u32_s64(a), vdup_n_s32(-(int)c)));
+ vshl_u32(vreinterpret_u32_s64(a), vdup_n_s32(-(int32_t)c)));
}
SIMD_INLINE v64 v64_shr_s32(v64 a, unsigned int c) {
return vreinterpret_s64_s32(
- vshl_s32(vreinterpret_s32_s64(a), vdup_n_s32(-(int)c)));
+ vshl_s32(vreinterpret_s32_s64(a), vdup_n_s32(-(int32_t)c)));
}
// The following functions require an immediate.
diff --git a/aom_dsp/variance.c b/aom_dsp/variance.c
index f72feea..63c1e5f 100644
--- a/aom_dsp/variance.c
+++ b/aom_dsp/variance.c
@@ -25,24 +25,6 @@
#include "av1/common/filter.h"
#include "av1/common/reconinter.h"
-uint32_t aom_get4x4sse_cs_c(const uint8_t *a, int a_stride, const uint8_t *b,
- int b_stride) {
- int distortion = 0;
- int r, c;
-
- for (r = 0; r < 4; ++r) {
- for (c = 0; c < 4; ++c) {
- int diff = a[c] - b[c];
- distortion += diff * diff;
- }
-
- a += a_stride;
- b += b_stride;
- }
-
- return distortion;
-}
-
uint32_t aom_get_mb_ss_c(const int16_t *a) {
unsigned int i, sum = 0;
@@ -198,17 +180,6 @@
return aom_variance##W##x##H(temp3, W, b, b_stride, sse); \
}
-/* Identical to the variance call except it takes an additional parameter, sum,
- * and returns that value using pass-by-reference instead of returning
- * sse - sum^2 / w*h
- */
-#define GET_VAR(W, H) \
- void aom_get##W##x##H##var_c(const uint8_t *a, int a_stride, \
- const uint8_t *b, int b_stride, uint32_t *sse, \
- int *sum) { \
- variance(a, a_stride, b, b_stride, W, H, sse, sum); \
- }
-
void aom_get_var_sse_sum_8x8_quad_c(const uint8_t *a, int a_stride,
const uint8_t *b, int b_stride,
uint32_t *sse8x8, int *sum8x8,
@@ -231,7 +202,7 @@
const uint8_t *ref_ptr, int ref_stride,
uint32_t *sse16x16, unsigned int *tot_sse,
int *tot_sum, uint32_t *var16x16) {
- int sum16x16[64] = { 0 };
+ int sum16x16[2] = { 0 };
// Loop over two consecutive 16x16 blocks and process as one 16x32 block.
for (int k = 0; k < 2; k++) {
variance(src_ptr + (k * 16), source_stride, ref_ptr + (k * 16), ref_stride,
@@ -281,9 +252,6 @@
VARIANCES(8, 4)
VARIANCES(4, 8)
VARIANCES(4, 4)
-VARIANCES(4, 2)
-VARIANCES(2, 4)
-VARIANCES(2, 2)
// Realtime mode doesn't use rectangular blocks.
#if !CONFIG_REALTIME_ONLY
@@ -295,9 +263,6 @@
VARIANCES(64, 16)
#endif
-GET_VAR(16, 16)
-GET_VAR(8, 8)
-
MSE(16, 16)
MSE(16, 8)
MSE(8, 16)
@@ -428,25 +393,6 @@
return (var >= 0) ? (uint32_t)var : 0; \
}
-#define HIGHBD_GET_VAR(S) \
- void aom_highbd_8_get##S##x##S##var_c(const uint8_t *src, int src_stride, \
- const uint8_t *ref, int ref_stride, \
- uint32_t *sse, int *sum) { \
- highbd_8_variance(src, src_stride, ref, ref_stride, S, S, sse, sum); \
- } \
- \
- void aom_highbd_10_get##S##x##S##var_c(const uint8_t *src, int src_stride, \
- const uint8_t *ref, int ref_stride, \
- uint32_t *sse, int *sum) { \
- highbd_10_variance(src, src_stride, ref, ref_stride, S, S, sse, sum); \
- } \
- \
- void aom_highbd_12_get##S##x##S##var_c(const uint8_t *src, int src_stride, \
- const uint8_t *ref, int ref_stride, \
- uint32_t *sse, int *sum) { \
- highbd_12_variance(src, src_stride, ref, ref_stride, S, S, sse, sum); \
- }
-
#define HIGHBD_MSE(W, H) \
uint32_t aom_highbd_8_mse##W##x##H##_c(const uint8_t *src, int src_stride, \
const uint8_t *ref, int ref_stride, \
@@ -706,9 +652,6 @@
HIGHBD_VARIANCES(8, 4)
HIGHBD_VARIANCES(4, 8)
HIGHBD_VARIANCES(4, 4)
-HIGHBD_VARIANCES(4, 2)
-HIGHBD_VARIANCES(2, 4)
-HIGHBD_VARIANCES(2, 2)
// Realtime mode doesn't use 4x rectangular blocks.
#if !CONFIG_REALTIME_ONLY
@@ -720,9 +663,6 @@
HIGHBD_VARIANCES(64, 16)
#endif
-HIGHBD_GET_VAR(8)
-HIGHBD_GET_VAR(16)
-
HIGHBD_MSE(16, 16)
HIGHBD_MSE(16, 8)
HIGHBD_MSE(8, 16)
diff --git a/aom_dsp/x86/aom_subpixel_8t_ssse3.asm b/aom_dsp/x86/aom_subpixel_8t_ssse3.asm
index 3ca7921..e5fafb0 100644
--- a/aom_dsp/x86/aom_subpixel_8t_ssse3.asm
+++ b/aom_dsp/x86/aom_subpixel_8t_ssse3.asm
@@ -30,7 +30,7 @@
%define LOCAL_VARS_SIZE 16*6
%macro SETUP_LOCAL_VARS 0
- ; TODO(slavarnway): using xmm registers for these on ARCH_X86_64 +
+ ; TODO(slavarnway): using xmm registers for these on AOM_ARCH_X86_64 +
; pmaddubsw has a higher latency on some platforms, this might be eased by
; interleaving the instructions.
%define k0k1 [rsp + 16*0]
@@ -52,7 +52,7 @@
mova k2k3, m1
mova k4k5, m2
mova k6k7, m3
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
%define krd m12
%define tmp0 [rsp + 16*4]
%define tmp1 [rsp + 16*5]
@@ -72,7 +72,7 @@
%endm
;-------------------------------------------------------------------------------
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
%define LOCAL_VARS_SIZE_H4 0
%else
%define LOCAL_VARS_SIZE_H4 16*4
@@ -83,7 +83,7 @@
src, sstride, dst, dstride, height, filter
mova m4, [filterq]
packsswb m4, m4
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
%define k0k1k4k5 m8
%define k2k3k6k7 m9
%define krd m10
@@ -346,7 +346,7 @@
psraw m0, 7
psraw m4, 7
%ifidn %1, h8_add_src
-%if ARCH_X86=1 && CONFIG_PIC=1
+%if AOM_ARCH_X86=1 && CONFIG_PIC=1
pcmpeqb m2, m2 ;all ones
psrlw m2, 8 ;even_byte_mask
%else
@@ -383,7 +383,7 @@
; TODO(Linfeng): Detect cpu type and choose the code with better performance.
%define X86_SUBPIX_VFILTER_PREFER_SLOW_CELERON 1
-%if ARCH_X86_64 && X86_SUBPIX_VFILTER_PREFER_SLOW_CELERON
+%if AOM_ARCH_X86_64 && X86_SUBPIX_VFILTER_PREFER_SLOW_CELERON
%define NUM_GENERAL_REG_USED 9
%else
%define NUM_GENERAL_REG_USED 6
@@ -403,9 +403,9 @@
dec heightd
-%if ARCH_X86 || X86_SUBPIX_VFILTER_PREFER_SLOW_CELERON
+%if AOM_ARCH_X86 || X86_SUBPIX_VFILTER_PREFER_SLOW_CELERON
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
%define src1q r7
%define sstride6q r8
%define dst_stride dstrideq
@@ -528,7 +528,7 @@
movx [dstq], m0
%else
- ; ARCH_X86_64
+ ; AOM_ARCH_X86_64
movx m0, [srcq ] ;A
movx m1, [srcq + sstrideq ] ;B
@@ -628,7 +628,7 @@
%endif
movx [dstq], m0
-%endif ; ARCH_X86_64
+%endif ; AOM_ARCH_X86_64
.done:
REP_RET
@@ -642,9 +642,9 @@
mova m4, [filterq]
SETUP_LOCAL_VARS
-%if ARCH_X86 || X86_SUBPIX_VFILTER_PREFER_SLOW_CELERON
+%if AOM_ARCH_X86 || X86_SUBPIX_VFILTER_PREFER_SLOW_CELERON
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
%define src1q r7
%define sstride6q r8
%define dst_stride dstrideq
@@ -724,7 +724,7 @@
REP_RET
%else
- ; ARCH_X86_64
+ ; AOM_ARCH_X86_64
dec heightd
movu m1, [srcq ] ;A
@@ -860,7 +860,7 @@
.done:
REP_RET
-%endif ; ARCH_X86_64
+%endif ; AOM_ARCH_X86_64
%endm
diff --git a/aom_dsp/x86/avg_intrin_avx2.c b/aom_dsp/x86/avg_intrin_avx2.c
index c85d8c5..49fcd72 100644
--- a/aom_dsp/x86/avg_intrin_avx2.c
+++ b/aom_dsp/x86/avg_intrin_avx2.c
@@ -16,6 +16,14 @@
#include "aom_dsp/x86/bitdepth_conversion_avx2.h"
#include "aom_ports/mem.h"
+static INLINE void sign_extend_16bit_to_32bit_avx2(__m256i in, __m256i zero,
+ __m256i *out_lo,
+ __m256i *out_hi) {
+ const __m256i sign_bits = _mm256_cmpgt_epi16(zero, in);
+ *out_lo = _mm256_unpacklo_epi16(in, sign_bits);
+ *out_hi = _mm256_unpackhi_epi16(in, sign_bits);
+}
+
static void hadamard_col8x2_avx2(__m256i *in, int iter) {
__m256i a0 = in[0];
__m256i a1 = in[1];
@@ -224,6 +232,12 @@
DECLARE_ALIGNED(32, int16_t, temp_coeff[32 * 32]);
int16_t *t_coeff = temp_coeff;
int idx;
+ __m256i coeff0_lo, coeff1_lo, coeff2_lo, coeff3_lo, b0_lo, b1_lo, b2_lo,
+ b3_lo;
+ __m256i coeff0_hi, coeff1_hi, coeff2_hi, coeff3_hi, b0_hi, b1_hi, b2_hi,
+ b3_hi;
+ __m256i b0, b1, b2, b3;
+ const __m256i zero = _mm256_setzero_si256();
for (idx = 0; idx < 4; ++idx) {
// src_diff: 9 bit, dynamic range [-255, 255]
const int16_t *src_ptr =
@@ -238,15 +252,38 @@
const __m256i coeff2 = _mm256_loadu_si256((const __m256i *)(t_coeff + 512));
const __m256i coeff3 = _mm256_loadu_si256((const __m256i *)(t_coeff + 768));
- __m256i b0 = _mm256_add_epi16(coeff0, coeff1);
- __m256i b1 = _mm256_sub_epi16(coeff0, coeff1);
- __m256i b2 = _mm256_add_epi16(coeff2, coeff3);
- __m256i b3 = _mm256_sub_epi16(coeff2, coeff3);
+    // Sign extend 16 bit to 32 bit, so the adds and subtracts below cannot
+    // overflow int16_t; results are packed back to 16 bits with saturation.
+ sign_extend_16bit_to_32bit_avx2(coeff0, zero, &coeff0_lo, &coeff0_hi);
+ sign_extend_16bit_to_32bit_avx2(coeff1, zero, &coeff1_lo, &coeff1_hi);
+ sign_extend_16bit_to_32bit_avx2(coeff2, zero, &coeff2_lo, &coeff2_hi);
+ sign_extend_16bit_to_32bit_avx2(coeff3, zero, &coeff3_lo, &coeff3_hi);
- b0 = _mm256_srai_epi16(b0, 2);
- b1 = _mm256_srai_epi16(b1, 2);
- b2 = _mm256_srai_epi16(b2, 2);
- b3 = _mm256_srai_epi16(b3, 2);
+ b0_lo = _mm256_add_epi32(coeff0_lo, coeff1_lo);
+ b0_hi = _mm256_add_epi32(coeff0_hi, coeff1_hi);
+
+ b1_lo = _mm256_sub_epi32(coeff0_lo, coeff1_lo);
+ b1_hi = _mm256_sub_epi32(coeff0_hi, coeff1_hi);
+
+ b2_lo = _mm256_add_epi32(coeff2_lo, coeff3_lo);
+ b2_hi = _mm256_add_epi32(coeff2_hi, coeff3_hi);
+
+ b3_lo = _mm256_sub_epi32(coeff2_lo, coeff3_lo);
+ b3_hi = _mm256_sub_epi32(coeff2_hi, coeff3_hi);
+
+ b0_lo = _mm256_srai_epi32(b0_lo, 2);
+ b1_lo = _mm256_srai_epi32(b1_lo, 2);
+ b2_lo = _mm256_srai_epi32(b2_lo, 2);
+ b3_lo = _mm256_srai_epi32(b3_lo, 2);
+
+ b0_hi = _mm256_srai_epi32(b0_hi, 2);
+ b1_hi = _mm256_srai_epi32(b1_hi, 2);
+ b2_hi = _mm256_srai_epi32(b2_hi, 2);
+ b3_hi = _mm256_srai_epi32(b3_hi, 2);
+
+ b0 = _mm256_packs_epi32(b0_lo, b0_hi);
+ b1 = _mm256_packs_epi32(b1_lo, b1_hi);
+ b2 = _mm256_packs_epi32(b2_lo, b2_hi);
+ b3 = _mm256_packs_epi32(b3_lo, b3_hi);
store_tran_low(_mm256_add_epi16(b0, b2), coeff);
store_tran_low(_mm256_add_epi16(b1, b3), coeff + 256);
diff --git a/aom_dsp/x86/avg_intrin_sse2.c b/aom_dsp/x86/avg_intrin_sse2.c
index 71e7028..ca2752e 100644
--- a/aom_dsp/x86/avg_intrin_sse2.c
+++ b/aom_dsp/x86/avg_intrin_sse2.c
@@ -17,6 +17,14 @@
#include "aom_dsp/x86/mem_sse2.h"
#include "aom_ports/mem.h"
+static INLINE void sign_extend_16bit_to_32bit_sse2(__m128i in, __m128i zero,
+ __m128i *out_lo,
+ __m128i *out_hi) {
+ const __m128i sign_bits = _mm_cmplt_epi16(in, zero);
+ *out_lo = _mm_unpacklo_epi16(in, sign_bits);
+ *out_hi = _mm_unpackhi_epi16(in, sign_bits);
+}
+
void aom_minmax_8x8_sse2(const uint8_t *s, int p, const uint8_t *d, int dp,
int *min, int *max) {
__m128i u0, s0, d0, diff, maxabsdiff, minabsdiff, negdiff, absdiff0, absdiff;
@@ -344,56 +352,6 @@
hadamard_8x8_sse2(src_diff, src_stride, coeff, 1);
}
-void aom_pixel_scale_sse2(const int16_t *src_diff, ptrdiff_t src_stride,
- int16_t *coeff, int log_scale, int h8, int w8) {
- __m128i src[8];
- const int16_t *org_src_diff = src_diff;
- int16_t *org_coeff = coeff;
- int coeff_stride = w8 << 3;
- for (int idy = 0; idy < h8; ++idy) {
- for (int idx = 0; idx < w8; ++idx) {
- src_diff = org_src_diff + (idx << 3);
- coeff = org_coeff + (idx << 3);
-
- src[0] = _mm_load_si128((const __m128i *)src_diff);
- src[1] = _mm_load_si128((const __m128i *)(src_diff += src_stride));
- src[2] = _mm_load_si128((const __m128i *)(src_diff += src_stride));
- src[3] = _mm_load_si128((const __m128i *)(src_diff += src_stride));
- src[4] = _mm_load_si128((const __m128i *)(src_diff += src_stride));
- src[5] = _mm_load_si128((const __m128i *)(src_diff += src_stride));
- src[6] = _mm_load_si128((const __m128i *)(src_diff += src_stride));
- src[7] = _mm_load_si128((const __m128i *)(src_diff + src_stride));
-
- src[0] = _mm_slli_epi16(src[0], log_scale);
- src[1] = _mm_slli_epi16(src[1], log_scale);
- src[2] = _mm_slli_epi16(src[2], log_scale);
- src[3] = _mm_slli_epi16(src[3], log_scale);
- src[4] = _mm_slli_epi16(src[4], log_scale);
- src[5] = _mm_slli_epi16(src[5], log_scale);
- src[6] = _mm_slli_epi16(src[6], log_scale);
- src[7] = _mm_slli_epi16(src[7], log_scale);
-
- _mm_store_si128((__m128i *)coeff, src[0]);
- coeff += coeff_stride;
- _mm_store_si128((__m128i *)coeff, src[1]);
- coeff += coeff_stride;
- _mm_store_si128((__m128i *)coeff, src[2]);
- coeff += coeff_stride;
- _mm_store_si128((__m128i *)coeff, src[3]);
- coeff += coeff_stride;
- _mm_store_si128((__m128i *)coeff, src[4]);
- coeff += coeff_stride;
- _mm_store_si128((__m128i *)coeff, src[5]);
- coeff += coeff_stride;
- _mm_store_si128((__m128i *)coeff, src[6]);
- coeff += coeff_stride;
- _mm_store_si128((__m128i *)coeff, src[7]);
- }
- org_src_diff += (src_stride << 3);
- org_coeff += (coeff_stride << 3);
- }
-}
-
static INLINE void hadamard_lp_8x8_sse2(const int16_t *src_diff,
ptrdiff_t src_stride, int16_t *coeff) {
__m128i src[8];
@@ -552,6 +510,12 @@
DECLARE_ALIGNED(32, int16_t, temp_coeff[32 * 32]);
int16_t *t_coeff = temp_coeff;
int idx;
+ __m128i coeff0_lo, coeff1_lo, coeff2_lo, coeff3_lo, b0_lo, b1_lo, b2_lo,
+ b3_lo;
+ __m128i coeff0_hi, coeff1_hi, coeff2_hi, coeff3_hi, b0_hi, b1_hi, b2_hi,
+ b3_hi;
+ __m128i b0, b1, b2, b3;
+ const __m128i zero = _mm_setzero_si128();
for (idx = 0; idx < 4; ++idx) {
const int16_t *src_ptr =
src_diff + (idx >> 1) * 16 * src_stride + (idx & 0x01) * 16;
@@ -565,15 +529,38 @@
__m128i coeff2 = _mm_load_si128((const __m128i *)(t_coeff + 512));
__m128i coeff3 = _mm_load_si128((const __m128i *)(t_coeff + 768));
- __m128i b0 = _mm_add_epi16(coeff0, coeff1);
- __m128i b1 = _mm_sub_epi16(coeff0, coeff1);
- __m128i b2 = _mm_add_epi16(coeff2, coeff3);
- __m128i b3 = _mm_sub_epi16(coeff2, coeff3);
+ // Sign extend 16 bit to 32 bit.
+ sign_extend_16bit_to_32bit_sse2(coeff0, zero, &coeff0_lo, &coeff0_hi);
+ sign_extend_16bit_to_32bit_sse2(coeff1, zero, &coeff1_lo, &coeff1_hi);
+ sign_extend_16bit_to_32bit_sse2(coeff2, zero, &coeff2_lo, &coeff2_hi);
+ sign_extend_16bit_to_32bit_sse2(coeff3, zero, &coeff3_lo, &coeff3_hi);
- b0 = _mm_srai_epi16(b0, 2);
- b1 = _mm_srai_epi16(b1, 2);
- b2 = _mm_srai_epi16(b2, 2);
- b3 = _mm_srai_epi16(b3, 2);
+ b0_lo = _mm_add_epi32(coeff0_lo, coeff1_lo);
+ b0_hi = _mm_add_epi32(coeff0_hi, coeff1_hi);
+
+ b1_lo = _mm_sub_epi32(coeff0_lo, coeff1_lo);
+ b1_hi = _mm_sub_epi32(coeff0_hi, coeff1_hi);
+
+ b2_lo = _mm_add_epi32(coeff2_lo, coeff3_lo);
+ b2_hi = _mm_add_epi32(coeff2_hi, coeff3_hi);
+
+ b3_lo = _mm_sub_epi32(coeff2_lo, coeff3_lo);
+ b3_hi = _mm_sub_epi32(coeff2_hi, coeff3_hi);
+
+ b0_lo = _mm_srai_epi32(b0_lo, 2);
+ b1_lo = _mm_srai_epi32(b1_lo, 2);
+ b2_lo = _mm_srai_epi32(b2_lo, 2);
+ b3_lo = _mm_srai_epi32(b3_lo, 2);
+
+ b0_hi = _mm_srai_epi32(b0_hi, 2);
+ b1_hi = _mm_srai_epi32(b1_hi, 2);
+ b2_hi = _mm_srai_epi32(b2_hi, 2);
+ b3_hi = _mm_srai_epi32(b3_hi, 2);
+
+ b0 = _mm_packs_epi32(b0_lo, b0_hi);
+ b1 = _mm_packs_epi32(b1_lo, b1_hi);
+ b2 = _mm_packs_epi32(b2_lo, b2_hi);
+ b3 = _mm_packs_epi32(b3_lo, b3_hi);
coeff0 = _mm_add_epi16(b0, b2);
coeff1 = _mm_add_epi16(b1, b3);
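The hunks above fix a 16-bit overflow in the Hadamard cross sums: adding or subtracting two 16-bit coefficients can exceed the int16_t range, so both the AVX2 and SSE2 paths now sign-extend to 32 bits, do the add/subtract and the >> 2 there, and saturate back to 16 bits via packs. A minimal scalar sketch of the difference (illustrative only, not libaom code; assumes a two's-complement target):

#include <stdint.h>
#include <stdio.h>

// Saturate a 32-bit value to int16_t, mirroring _mm_packs_epi32.
static int16_t saturate16(int32_t v) {
  if (v > INT16_MAX) return INT16_MAX;
  if (v < INT16_MIN) return INT16_MIN;
  return (int16_t)v;
}

int main(void) {
  const int16_t c0 = 30000, c1 = 20000;  // hypothetical coefficients
  // Old path: the sum wraps when truncated to 16 bits before the shift.
  const int16_t wrapped = (int16_t)((int16_t)(c0 + c1) >> 2);
  // New path: widen to 32 bits, shift, then pack with saturation.
  const int16_t widened = saturate16(((int32_t)c0 + c1) >> 2);
  printf("16-bit path: %d, widened path: %d\n", wrapped, widened);
  return 0;
}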
diff --git a/aom_dsp/x86/blk_sse_sum_avx2.c b/aom_dsp/x86/blk_sse_sum_avx2.c
index f7c0eb0..fdf7de3 100644
--- a/aom_dsp/x86/blk_sse_sum_avx2.c
+++ b/aom_dsp/x86/blk_sse_sum_avx2.c
@@ -31,7 +31,7 @@
out_buffer = _mm256_castsi256_si128(regx_sum);
*x_sum += _mm_cvtsi128_si32(out_buffer);
out_buffer = _mm256_castsi256_si128(regx2_sum);
-#if ARCH_X86_64
+#if AOM_ARCH_X86_64
*x2_sum += _mm_cvtsi128_si64(out_buffer);
#else
{
diff --git a/aom_dsp/x86/blk_sse_sum_sse2.c b/aom_dsp/x86/blk_sse_sum_sse2.c
index ef0a024..bf89427 100644
--- a/aom_dsp/x86/blk_sse_sum_sse2.c
+++ b/aom_dsp/x86/blk_sse_sum_sse2.c
@@ -41,7 +41,7 @@
temp_buffer2 = _mm_unpackhi_epi32(regx2_sum, _mm_setzero_si128());
regx2_sum = _mm_add_epi64(temp_buffer1, temp_buffer2);
regx2_sum = _mm_add_epi64(regx2_sum, _mm_srli_si128(regx2_sum, 8));
-#if ARCH_X86_64
+#if AOM_ARCH_X86_64
*x2_sum += _mm_cvtsi128_si64(regx2_sum);
#else
{
@@ -82,7 +82,7 @@
temp_buffer2 = _mm_unpackhi_epi32(regx2_sum, _mm_setzero_si128());
regx2_sum = _mm_add_epi64(temp_buffer1, temp_buffer2);
regx2_sum = _mm_add_epi64(regx2_sum, _mm_srli_si128(regx2_sum, 8));
-#if ARCH_X86_64
+#if AOM_ARCH_X86_64
*x2_sum += _mm_cvtsi128_si64(regx2_sum);
#else
{
diff --git a/aom_dsp/x86/convolve_avx2.h b/aom_dsp/x86/convolve_avx2.h
index a709008..f5a382c 100644
--- a/aom_dsp/x86/convolve_avx2.h
+++ b/aom_dsp/x86/convolve_avx2.h
@@ -329,20 +329,20 @@
_mm256_castsi128_si256(_mm_loadu_si128( \
(__m128i *)(&src_ptr[i * src_stride + src_stride + j]))), \
0x20); \
- const __m256i s_16l = _mm256_unpacklo_epi8(data, v_zero); \
- const __m256i s_16h = _mm256_unpackhi_epi8(data, v_zero); \
- const __m256i s_ll = _mm256_unpacklo_epi16(s_16l, s_16l); \
- const __m256i s_lh = _mm256_unpackhi_epi16(s_16l, s_16l); \
+ const __m256i s_16lo = _mm256_unpacklo_epi8(data, v_zero); \
+ const __m256i s_16hi = _mm256_unpackhi_epi8(data, v_zero); \
+ const __m256i s_lolo = _mm256_unpacklo_epi16(s_16lo, s_16lo); \
+ const __m256i s_lohi = _mm256_unpackhi_epi16(s_16lo, s_16lo); \
\
- const __m256i s_hl = _mm256_unpacklo_epi16(s_16h, s_16h); \
- const __m256i s_hh = _mm256_unpackhi_epi16(s_16h, s_16h); \
+ const __m256i s_hilo = _mm256_unpacklo_epi16(s_16hi, s_16hi); \
+ const __m256i s_hihi = _mm256_unpackhi_epi16(s_16hi, s_16hi); \
\
- s[0] = _mm256_alignr_epi8(s_lh, s_ll, 2); \
- s[1] = _mm256_alignr_epi8(s_lh, s_ll, 10); \
- s[2] = _mm256_alignr_epi8(s_hl, s_lh, 2); \
- s[3] = _mm256_alignr_epi8(s_hl, s_lh, 10); \
- s[4] = _mm256_alignr_epi8(s_hh, s_hl, 2); \
- s[5] = _mm256_alignr_epi8(s_hh, s_hl, 10); \
+ s[0] = _mm256_alignr_epi8(s_lohi, s_lolo, 2); \
+ s[1] = _mm256_alignr_epi8(s_lohi, s_lolo, 10); \
+ s[2] = _mm256_alignr_epi8(s_hilo, s_lohi, 2); \
+ s[3] = _mm256_alignr_epi8(s_hilo, s_lohi, 10); \
+ s[4] = _mm256_alignr_epi8(s_hihi, s_hilo, 2); \
+ s[5] = _mm256_alignr_epi8(s_hihi, s_hilo, 10); \
\
const __m256i res_lo = convolve_12taps(s, coeffs_h); \
\
@@ -373,21 +373,21 @@
_mm256_castsi128_si256( \
_mm_loadu_si128((__m128i *)(&src_ptr[i * src_stride + j + 4]))), \
0x20); \
- const __m256i s_16l = _mm256_unpacklo_epi8(data, v_zero); \
- const __m256i s_16h = _mm256_unpackhi_epi8(data, v_zero); \
+ const __m256i s_16lo = _mm256_unpacklo_epi8(data, v_zero); \
+ const __m256i s_16hi = _mm256_unpackhi_epi8(data, v_zero); \
\
- const __m256i s_ll = _mm256_unpacklo_epi16(s_16l, s_16l); \
- const __m256i s_lh = _mm256_unpackhi_epi16(s_16l, s_16l); \
+ const __m256i s_lolo = _mm256_unpacklo_epi16(s_16lo, s_16lo); \
+ const __m256i s_lohi = _mm256_unpackhi_epi16(s_16lo, s_16lo); \
\
- const __m256i s_hl = _mm256_unpacklo_epi16(s_16h, s_16h); \
- const __m256i s_hh = _mm256_unpackhi_epi16(s_16h, s_16h); \
+ const __m256i s_hilo = _mm256_unpacklo_epi16(s_16hi, s_16hi); \
+ const __m256i s_hihi = _mm256_unpackhi_epi16(s_16hi, s_16hi); \
\
- s[0] = _mm256_alignr_epi8(s_lh, s_ll, 2); \
- s[1] = _mm256_alignr_epi8(s_lh, s_ll, 10); \
- s[2] = _mm256_alignr_epi8(s_hl, s_lh, 2); \
- s[3] = _mm256_alignr_epi8(s_hl, s_lh, 10); \
- s[4] = _mm256_alignr_epi8(s_hh, s_hl, 2); \
- s[5] = _mm256_alignr_epi8(s_hh, s_hl, 10); \
+ s[0] = _mm256_alignr_epi8(s_lohi, s_lolo, 2); \
+ s[1] = _mm256_alignr_epi8(s_lohi, s_lolo, 10); \
+ s[2] = _mm256_alignr_epi8(s_hilo, s_lohi, 2); \
+ s[3] = _mm256_alignr_epi8(s_hilo, s_lohi, 10); \
+ s[4] = _mm256_alignr_epi8(s_hihi, s_hilo, 2); \
+ s[5] = _mm256_alignr_epi8(s_hihi, s_hilo, 10); \
\
const __m256i res_lo = convolve_12taps(s, coeffs_h); \
\
diff --git a/aom_dsp/x86/fwd_txfm_impl_sse2.h b/aom_dsp/x86/fwd_txfm_impl_sse2.h
index 89fe189..7ee8ba3 100644
--- a/aom_dsp/x86/fwd_txfm_impl_sse2.h
+++ b/aom_dsp/x86/fwd_txfm_impl_sse2.h
@@ -180,25 +180,8 @@
const __m128i w1 = _mm_srai_epi32(v1, DCT_CONST_BITS2);
const __m128i w2 = _mm_srai_epi32(v2, DCT_CONST_BITS2);
const __m128i w3 = _mm_srai_epi32(v3, DCT_CONST_BITS2);
- // w0 = [o0 o4 o8 oC]
- // w1 = [o2 o6 oA oE]
- // w2 = [o1 o5 o9 oD]
- // w3 = [o3 o7 oB oF]
- // remember the o's are numbered according to the correct output location
- const __m128i x0 = _mm_packs_epi32(w0, w1);
- const __m128i x1 = _mm_packs_epi32(w2, w3);
- {
- // x0 = [o0 o4 o8 oC o2 o6 oA oE]
- // x1 = [o1 o5 o9 oD o3 o7 oB oF]
- const __m128i y0 = _mm_unpacklo_epi16(x0, x1);
- const __m128i y1 = _mm_unpackhi_epi16(x0, x1);
- // y0 = [o0 o1 o4 o5 o8 o9 oC oD]
- // y1 = [o2 o3 o6 o7 oA oB oE oF]
- *in0 = _mm_unpacklo_epi32(y0, y1);
- // in0 = [o0 o1 o2 o3 o4 o5 o6 o7]
- *in1 = _mm_unpackhi_epi32(y0, y1);
- // in1 = [o8 o9 oA oB oC oD oE oF]
- }
+ *in0 = _mm_packs_epi32(w0, w2);
+ *in1 = _mm_packs_epi32(w1, w3);
}
}
}
@@ -230,6 +213,7 @@
_mm_storeu_si128((__m128i *)(output + 2 * 4), in1);
}
+#if CONFIG_INTERNAL_STATS
void FDCT8x8_2D(const int16_t *input, tran_low_t *output, int stride) {
int pass;
// Constants
@@ -539,6 +523,7 @@
store_output(&in7, (output + 7 * 8));
}
}
+#endif // CONFIG_INTERNAL_STATS
#undef ADD_EPI16
#undef SUB_EPI16
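The 4x4 forward-transform cleanup above removes the unpack/transpose that restored natural output order; the coefficients are now left in the order _mm_packs_epi32 produces (per the deleted comments, in0 becomes [o0 o4 o8 oC o1 o5 o9 oD]), which the remaining callers presumably tolerate. For reference, a small demo of that intrinsic's lane order and saturation (the test harness is mine, not libaom's):

#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
  // _mm_packs_epi32(lo, hi) emits the four lanes of 'lo' first, then the
  // four lanes of 'hi', saturating each 32-bit value to int16_t.
  const __m128i lo = _mm_setr_epi32(1, 2, 3, 100000);
  const __m128i hi = _mm_setr_epi32(-1, -2, -3, -100000);
  int16_t out[8];
  _mm_storeu_si128((__m128i *)out, _mm_packs_epi32(lo, hi));
  for (int i = 0; i < 8; i++) printf("%d ", out[i]);
  printf("\n");  // prints: 1 2 3 32767 -1 -2 -3 -32768
  return 0;
}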
diff --git a/aom_dsp/x86/fwd_txfm_ssse3_x86_64.asm b/aom_dsp/x86/fwd_txfm_ssse3_x86_64.asm
index c1fb259..0687904 100644
--- a/aom_dsp/x86/fwd_txfm_ssse3_x86_64.asm
+++ b/aom_dsp/x86/fwd_txfm_ssse3_x86_64.asm
@@ -45,7 +45,7 @@
SECTION .text
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
INIT_XMM ssse3
cglobal fdct8x8, 3, 5, 13, input, output, stride
diff --git a/aom_dsp/x86/highbd_sad4d_sse2.asm b/aom_dsp/x86/highbd_sad4d_sse2.asm
index 9442cd0..03839b4 100644
--- a/aom_dsp/x86/highbd_sad4d_sse2.asm
+++ b/aom_dsp/x86/highbd_sad4d_sse2.asm
@@ -221,21 +221,21 @@
; 3: If 0, then normal sad, if 2, then skip every other row
%macro HIGH_SADNXN4D 2-3 0
%if %3 == 0 ; normal sad
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
cglobal highbd_sad%1x%2x4d, 5, 8, 8, src, src_stride, ref1, ref_stride, \
res, ref2, ref3, ref4
%else
cglobal highbd_sad%1x%2x4d, 4, 7, 8, src, src_stride, ref1, ref_stride, \
ref2, ref3, ref4
-%endif ; ARCH_X86_64
+%endif ; AOM_ARCH_X86_64
%else ; %3 == 2, downsample
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
cglobal highbd_sad_skip_%1x%2x4d, 5, 8, 8, src, src_stride, ref1, ref_stride, \
res, ref2, ref3, ref4
%else
cglobal highbd_sad_skip_%1x%2x4d, 4, 7, 8, src, src_stride, ref1, ref_stride, \
ref2, ref3, ref4
-%endif ; ARCH_X86_64
+%endif ; AOM_ARCH_X86_64
%endif ; sad/avg/skip
; set m1
diff --git a/aom_dsp/x86/highbd_sad_sse2.asm b/aom_dsp/x86/highbd_sad_sse2.asm
index 48b93bf..3dc4e4e 100644
--- a/aom_dsp/x86/highbd_sad_sse2.asm
+++ b/aom_dsp/x86/highbd_sad_sse2.asm
@@ -34,11 +34,11 @@
cglobal highbd_sad%1x%2_avg, 5, 1 + %3, %5, src, src_stride, ref, ref_stride, \
second_pred, n_rows
%else ; %3 == 7
-cglobal highbd_sad%1x%2_avg, 5, ARCH_X86_64 + %3, %5, src, src_stride, \
+cglobal highbd_sad%1x%2_avg, 5, AOM_ARCH_X86_64 + %3, %5, src, src_stride, \
ref, ref_stride, \
second_pred, \
src_stride3, ref_stride3
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
%define n_rowsd r7d
%else ; x86-32
%define n_rowsd dword r0m
diff --git a/aom_dsp/x86/highbd_subpel_variance_impl_sse2.asm b/aom_dsp/x86/highbd_subpel_variance_impl_sse2.asm
index 5c78933..c0ccc18 100644
--- a/aom_dsp/x86/highbd_subpel_variance_impl_sse2.asm
+++ b/aom_dsp/x86/highbd_subpel_variance_impl_sse2.asm
@@ -81,7 +81,7 @@
%endmacro
%macro INC_SRC_BY_SRC_STRIDE 0
-%if ARCH_X86=1 && CONFIG_PIC=1
+%if AOM_ARCH_X86=1 && CONFIG_PIC=1
add srcq, src_stridemp
add srcq, src_stridemp
%else
@@ -94,7 +94,7 @@
%define filter_idx_shift 5
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
%if %2 == 1 ; avg
cglobal highbd_sub_pixel_avg_variance%1xh, 9, 10, 13, src, src_stride, \
x_offset, y_offset, \
@@ -271,11 +271,11 @@
.x_zero_y_nonhalf:
; x_offset == 0 && y_offset == bilin interpolation
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
lea bilin_filter, [GLOBAL(bilin_filter_m)]
%endif
shl y_offsetd, filter_idx_shift
-%if ARCH_X86_64 && mmsize == 16
+%if AOM_ARCH_X86_64 && mmsize == 16
mova m8, [bilin_filter+y_offsetq]
mova m9, [bilin_filter+y_offsetq+16]
mova m10, [GLOBAL(pw_8)]
@@ -283,7 +283,7 @@
%define filter_y_b m9
%define filter_rnd m10
%else ; x86-32 or mmx
-%if ARCH_X86=1 && CONFIG_PIC=1
+%if AOM_ARCH_X86=1 && CONFIG_PIC=1
; x_offset == 0, reuse x_offset reg
%define tempq x_offsetq
add y_offsetq, g_bilin_filterm
@@ -498,11 +498,11 @@
.x_half_y_nonhalf:
; x_offset == 0.5 && y_offset == bilin interpolation
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
lea bilin_filter, [GLOBAL(bilin_filter_m)]
%endif
shl y_offsetd, filter_idx_shift
-%if ARCH_X86_64 && mmsize == 16
+%if AOM_ARCH_X86_64 && mmsize == 16
mova m8, [bilin_filter+y_offsetq]
mova m9, [bilin_filter+y_offsetq+16]
mova m10, [GLOBAL(pw_8)]
@@ -510,7 +510,7 @@
%define filter_y_b m9
%define filter_rnd m10
%else ; x86_32
-%if ARCH_X86=1 && CONFIG_PIC=1
+%if AOM_ARCH_X86=1 && CONFIG_PIC=1
; x_offset == 0.5. We can reuse x_offset reg
%define tempq x_offsetq
add y_offsetq, g_bilin_filterm
@@ -620,11 +620,11 @@
jnz .x_nonhalf_y_nonzero
; x_offset == bilin interpolation && y_offset == 0
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
lea bilin_filter, [GLOBAL(bilin_filter_m)]
%endif
shl x_offsetd, filter_idx_shift
-%if ARCH_X86_64 && mmsize == 16
+%if AOM_ARCH_X86_64 && mmsize == 16
mova m8, [bilin_filter+x_offsetq]
mova m9, [bilin_filter+x_offsetq+16]
mova m10, [GLOBAL(pw_8)]
@@ -632,7 +632,7 @@
%define filter_x_b m9
%define filter_rnd m10
%else ; x86-32
-%if ARCH_X86=1 && CONFIG_PIC=1
+%if AOM_ARCH_X86=1 && CONFIG_PIC=1
; y_offset == 0. We can reuse y_offset reg.
%define tempq y_offsetq
add x_offsetq, g_bilin_filterm
@@ -719,11 +719,11 @@
jne .x_nonhalf_y_nonhalf
; x_offset == bilin interpolation && y_offset == 0.5
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
lea bilin_filter, [GLOBAL(bilin_filter_m)]
%endif
shl x_offsetd, filter_idx_shift
-%if ARCH_X86_64 && mmsize == 16
+%if AOM_ARCH_X86_64 && mmsize == 16
mova m8, [bilin_filter+x_offsetq]
mova m9, [bilin_filter+x_offsetq+16]
mova m10, [GLOBAL(pw_8)]
@@ -731,7 +731,7 @@
%define filter_x_b m9
%define filter_rnd m10
%else ; x86-32
-%if ARCH_X86=1 && CONFIG_PIC=1
+%if AOM_ARCH_X86=1 && CONFIG_PIC=1
; y_offset == 0.5. We can reuse y_offset reg.
%define tempq y_offsetq
add x_offsetq, g_bilin_filterm
@@ -846,12 +846,12 @@
.x_nonhalf_y_nonhalf:
; loading filter - this is same as in 8-bit depth
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
lea bilin_filter, [GLOBAL(bilin_filter_m)]
%endif
shl x_offsetd, filter_idx_shift ; filter_idx_shift = 5
shl y_offsetd, filter_idx_shift
-%if ARCH_X86_64 && mmsize == 16
+%if AOM_ARCH_X86_64 && mmsize == 16
mova m8, [bilin_filter+x_offsetq]
mova m9, [bilin_filter+x_offsetq+16]
mova m10, [bilin_filter+y_offsetq]
@@ -863,7 +863,7 @@
%define filter_y_b m11
%define filter_rnd m12
%else ; x86-32
-%if ARCH_X86=1 && CONFIG_PIC=1
+%if AOM_ARCH_X86=1 && CONFIG_PIC=1
; In this case there is NO unused register, so the src_stride register is
; used; src_stride must be reloaded from the stack when it is needed later.
%define tempq src_strideq
diff --git a/aom_dsp/x86/highbd_variance_sse2.c b/aom_dsp/x86/highbd_variance_sse2.c
index d45885c..e897aab 100644
--- a/aom_dsp/x86/highbd_variance_sse2.c
+++ b/aom_dsp/x86/highbd_variance_sse2.c
@@ -98,43 +98,6 @@
*sse = (uint32_t)ROUND_POWER_OF_TWO(sse_long, 8);
}
-#define HIGH_GET_VAR(S) \
- void aom_highbd_get##S##x##S##var_sse2(const uint8_t *src8, int src_stride, \
- const uint8_t *ref8, int ref_stride, \
- uint32_t *sse, int *sum) { \
- uint16_t *src = CONVERT_TO_SHORTPTR(src8); \
- uint16_t *ref = CONVERT_TO_SHORTPTR(ref8); \
- aom_highbd_calc##S##x##S##var_sse2(src, src_stride, ref, ref_stride, sse, \
- sum); \
- } \
- \
- void aom_highbd_10_get##S##x##S##var_sse2( \
- const uint8_t *src8, int src_stride, const uint8_t *ref8, \
- int ref_stride, uint32_t *sse, int *sum) { \
- uint16_t *src = CONVERT_TO_SHORTPTR(src8); \
- uint16_t *ref = CONVERT_TO_SHORTPTR(ref8); \
- aom_highbd_calc##S##x##S##var_sse2(src, src_stride, ref, ref_stride, sse, \
- sum); \
- *sum = ROUND_POWER_OF_TWO(*sum, 2); \
- *sse = ROUND_POWER_OF_TWO(*sse, 4); \
- } \
- \
- void aom_highbd_12_get##S##x##S##var_sse2( \
- const uint8_t *src8, int src_stride, const uint8_t *ref8, \
- int ref_stride, uint32_t *sse, int *sum) { \
- uint16_t *src = CONVERT_TO_SHORTPTR(src8); \
- uint16_t *ref = CONVERT_TO_SHORTPTR(ref8); \
- aom_highbd_calc##S##x##S##var_sse2(src, src_stride, ref, ref_stride, sse, \
- sum); \
- *sum = ROUND_POWER_OF_TWO(*sum, 4); \
- *sse = ROUND_POWER_OF_TWO(*sse, 8); \
- }
-
-HIGH_GET_VAR(16)
-HIGH_GET_VAR(8)
-
-#undef HIGH_GET_VAR
-
#define VAR_FN(w, h, block_size, shift) \
uint32_t aom_highbd_8_variance##w##x##h##_sse2( \
const uint8_t *src8, int src_stride, const uint8_t *ref8, \
diff --git a/aom_dsp/x86/intrapred_sse4.c b/aom_dsp/x86/intrapred_sse4.c
index 3f72dc4..fb30420 100644
--- a/aom_dsp/x86/intrapred_sse4.c
+++ b/aom_dsp/x86/intrapred_sse4.c
@@ -602,7 +602,7 @@
const __m128i c1234 = _mm_setr_epi16(1, 2, 3, 4, 5, 6, 7, 8);
for (int r = 0; r < N; r++) {
- __m128i b, res, res1, shift, shifty;
+ __m128i b, res, res1, shift;
__m128i resx, resy, resxy, r6, ydx;
int y = r + 1;
@@ -620,11 +620,7 @@
}
if (base_shift > 7) {
- a0_x = _mm_setzero_si128();
- a1_x = _mm_setzero_si128();
- a0_y = _mm_setzero_si128();
- a1_y = _mm_setzero_si128();
- shift = _mm_setzero_si128();
+ resx = _mm_setzero_si128();
} else {
a0_above = _mm_loadu_si128((__m128i *)(above + base_x + base_shift));
ydx = _mm_set1_epi16(y * dx);
@@ -649,9 +645,15 @@
}
a0_x = _mm_cvtepu8_epi16(a0_above);
a1_x = _mm_cvtepu8_epi16(a1_above);
- a0_y = _mm_setzero_si128();
- a1_y = _mm_setzero_si128();
- shifty = shift;
+
+ diff = _mm_sub_epi16(a1_x, a0_x); // a[x+1] - a[x]
+ a32 = _mm_slli_epi16(a0_x, 5); // a[x] * 32
+ a32 = _mm_add_epi16(a32, a16); // a[x] * 32 + 16
+
+ b = _mm_mullo_epi16(diff, shift);
+ res = _mm_add_epi16(a32, b);
+ res = _mm_srli_epi16(res, 5);
+ resx = _mm_packus_epi16(res, res);
}
// y calc
@@ -678,34 +680,27 @@
left[base_y_c[6]], left[base_y_c[7]]);
if (upsample_left) {
- shifty = _mm_srli_epi16(
+ shift = _mm_srli_epi16(
_mm_and_si128(_mm_slli_epi16(y_c, upsample_left), c3f), 1);
} else {
- shifty = _mm_srli_epi16(_mm_and_si128(y_c, c3f), 1);
+ shift = _mm_srli_epi16(_mm_and_si128(y_c, c3f), 1);
}
+
+ diff = _mm_sub_epi16(a1_y, a0_y); // a[x+1] - a[x]
+ a32 = _mm_slli_epi16(a0_y, 5); // a[x] * 32
+ a32 = _mm_add_epi16(a32, a16); // a[x] * 32 + 16
+
+ b = _mm_mullo_epi16(diff, shift);
+ res1 = _mm_add_epi16(a32, b);
+ res1 = _mm_srli_epi16(res1, 5);
+
+ resy = _mm_packus_epi16(res1, res1);
+ resxy = _mm_blendv_epi8(resx, resy, *(__m128i *)Mask[0][base_min_diff]);
+ _mm_storel_epi64((__m128i *)dst, resxy);
+ } else {
+ _mm_storel_epi64((__m128i *)dst, resx);
}
- diff = _mm_sub_epi16(a1_x, a0_x); // a[x+1] - a[x]
- a32 = _mm_slli_epi16(a0_x, 5); // a[x] * 32
- a32 = _mm_add_epi16(a32, a16); // a[x] * 32 + 16
-
- b = _mm_mullo_epi16(diff, shift);
- res = _mm_add_epi16(a32, b);
- res = _mm_srli_epi16(res, 5);
-
- diff = _mm_sub_epi16(a1_y, a0_y); // a[x+1] - a[x]
- a32 = _mm_slli_epi16(a0_y, 5); // a[x] * 32
- a32 = _mm_add_epi16(a32, a16); // a[x] * 32 + 16
-
- b = _mm_mullo_epi16(diff, shifty);
- res1 = _mm_add_epi16(a32, b);
- res1 = _mm_srli_epi16(res1, 5);
-
- resx = _mm_packus_epi16(res, res);
- resy = _mm_packus_epi16(res1, res1);
-
- resxy = _mm_blendv_epi8(resx, resy, *(__m128i *)Mask[0][base_min_diff]);
- _mm_storel_epi64((__m128i *)(dst), resxy);
dst += stride;
}
}
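Both the x and y branches of the restructured predictor above now compute the interpolation inline, per the comments: res = (32 * a[x] + 16 + (a[x+1] - a[x]) * shift) >> 5. A scalar sketch of that formula (illustrative only; the kernel applies it to eight lanes at once):

#include <stdint.h>

// 5-bit linear interpolation between two neighbouring reference samples,
// with shift in [0, 31]; the final clamp mirrors _mm_packus_epi16.
static uint8_t dr_interp(uint8_t a0, uint8_t a1, int shift) {
  const int diff = a1 - a0;        // a[x+1] - a[x]
  const int a32 = (a0 << 5) + 16;  // a[x] * 32 + 16
  int res = (a32 + diff * shift) >> 5;
  if (res < 0) res = 0;
  if (res > 255) res = 255;
  return (uint8_t)res;
}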
diff --git a/aom_dsp/x86/jnt_sad_ssse3.c b/aom_dsp/x86/jnt_sad_sse2.c
similarity index 66%
rename from aom_dsp/x86/jnt_sad_ssse3.c
rename to aom_dsp/x86/jnt_sad_sse2.c
index 357f70a..16d2f4b 100644
--- a/aom_dsp/x86/jnt_sad_ssse3.c
+++ b/aom_dsp/x86/jnt_sad_sse2.c
@@ -10,16 +10,16 @@
*/
#include <assert.h>
-#include <emmintrin.h> // SSE2
-#include <tmmintrin.h>
+#include <emmintrin.h>
#include "config/aom_config.h"
#include "config/aom_dsp_rtcd.h"
#include "aom_dsp/x86/synonyms.h"
-unsigned int aom_sad4xh_sse2(const uint8_t *a, int a_stride, const uint8_t *b,
- int b_stride, int width, int height) {
+static unsigned int sad4xh_sse2(const uint8_t *a, int a_stride,
+ const uint8_t *b, int b_stride, int width,
+ int height) {
int i;
assert(width == 4);
(void)width;
@@ -59,8 +59,9 @@
return res;
}
-unsigned int aom_sad8xh_sse2(const uint8_t *a, int a_stride, const uint8_t *b,
- int b_stride, int width, int height) {
+static unsigned int sad8xh_sse2(const uint8_t *a, int a_stride,
+ const uint8_t *b, int b_stride, int width,
+ int height) {
int i;
assert(width == 8);
(void)width;
@@ -91,8 +92,9 @@
return res;
}
-unsigned int aom_sad16xh_sse2(const uint8_t *a, int a_stride, const uint8_t *b,
- int b_stride, int width, int height) {
+static unsigned int sad16xh_sse2(const uint8_t *a, int a_stride,
+ const uint8_t *b, int b_stride, int width,
+ int height) {
int i;
assert(width == 16);
(void)width;
@@ -116,8 +118,9 @@
return res;
}
-unsigned int aom_sad32xh_sse2(const uint8_t *a, int a_stride, const uint8_t *b,
- int b_stride, int width, int height) {
+static unsigned int sad32xh_sse2(const uint8_t *a, int a_stride,
+ const uint8_t *b, int b_stride, int width,
+ int height) {
int i, j;
assert(width == 32);
(void)width;
@@ -143,8 +146,9 @@
return res;
}
-unsigned int aom_sad64xh_sse2(const uint8_t *a, int a_stride, const uint8_t *b,
- int b_stride, int width, int height) {
+static unsigned int sad64xh_sse2(const uint8_t *a, int a_stride,
+ const uint8_t *b, int b_stride, int width,
+ int height) {
int i, j;
assert(width == 64);
(void)width;
@@ -170,8 +174,9 @@
return res;
}
-unsigned int aom_sad128xh_sse2(const uint8_t *a, int a_stride, const uint8_t *b,
- int b_stride, int width, int height) {
+static unsigned int sad128xh_sse2(const uint8_t *a, int a_stride,
+ const uint8_t *b, int b_stride, int width,
+ int height) {
int i, j;
assert(width == 128);
(void)width;
@@ -197,47 +202,37 @@
return res;
}
-#define dist_wtd_sadMxN_sse2(m, n) \
- unsigned int aom_dist_wtd_sad##m##x##n##_avg_ssse3( \
+#define DIST_WTD_SADMXN_SSE2(m, n) \
+ unsigned int aom_dist_wtd_sad##m##x##n##_avg_sse2( \
const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, \
const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param) { \
uint8_t comp_pred[m * n]; \
aom_dist_wtd_comp_avg_pred(comp_pred, second_pred, m, n, ref, ref_stride, \
jcp_param); \
- return aom_sad##m##xh_sse2(src, src_stride, comp_pred, m, m, n); \
+ return sad##m##xh_sse2(src, src_stride, comp_pred, m, m, n); \
}
-#define dist_wtd_sadMxN_avx2(m, n) \
- unsigned int aom_dist_wtd_sad##m##x##n##_avg_avx2( \
- const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, \
- const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param) { \
- uint8_t comp_pred[m * n]; \
- aom_dist_wtd_comp_avg_pred(comp_pred, second_pred, m, n, ref, ref_stride, \
- jcp_param); \
- return aom_sad##m##xh_avx2(src, src_stride, comp_pred, m, m, n); \
- }
-
-/* clang-format off */
-dist_wtd_sadMxN_sse2(128, 128)
-dist_wtd_sadMxN_sse2(128, 64)
-dist_wtd_sadMxN_sse2(64, 128)
-dist_wtd_sadMxN_sse2(64, 64)
-dist_wtd_sadMxN_sse2(64, 32)
-dist_wtd_sadMxN_sse2(32, 64)
-dist_wtd_sadMxN_sse2(32, 32)
-dist_wtd_sadMxN_sse2(32, 16)
-dist_wtd_sadMxN_sse2(16, 32)
-dist_wtd_sadMxN_sse2(16, 16)
-dist_wtd_sadMxN_sse2(16, 8)
-dist_wtd_sadMxN_sse2(8, 16)
-dist_wtd_sadMxN_sse2(8, 8)
-dist_wtd_sadMxN_sse2(8, 4)
-dist_wtd_sadMxN_sse2(4, 8)
-dist_wtd_sadMxN_sse2(4, 4)
-dist_wtd_sadMxN_sse2(4, 16)
-dist_wtd_sadMxN_sse2(16, 4)
-dist_wtd_sadMxN_sse2(8, 32)
-dist_wtd_sadMxN_sse2(32, 8)
-dist_wtd_sadMxN_sse2(16, 64)
-dist_wtd_sadMxN_sse2(64, 16)
- /* clang-format on */
+DIST_WTD_SADMXN_SSE2(128, 128)
+DIST_WTD_SADMXN_SSE2(128, 64)
+DIST_WTD_SADMXN_SSE2(64, 128)
+DIST_WTD_SADMXN_SSE2(64, 64)
+DIST_WTD_SADMXN_SSE2(64, 32)
+DIST_WTD_SADMXN_SSE2(32, 64)
+DIST_WTD_SADMXN_SSE2(32, 32)
+DIST_WTD_SADMXN_SSE2(32, 16)
+DIST_WTD_SADMXN_SSE2(16, 32)
+DIST_WTD_SADMXN_SSE2(16, 16)
+DIST_WTD_SADMXN_SSE2(16, 8)
+DIST_WTD_SADMXN_SSE2(8, 16)
+DIST_WTD_SADMXN_SSE2(8, 8)
+DIST_WTD_SADMXN_SSE2(8, 4)
+DIST_WTD_SADMXN_SSE2(4, 8)
+DIST_WTD_SADMXN_SSE2(4, 4)
+#if !CONFIG_REALTIME_ONLY
+DIST_WTD_SADMXN_SSE2(4, 16)
+DIST_WTD_SADMXN_SSE2(16, 4)
+DIST_WTD_SADMXN_SSE2(8, 32)
+DIST_WTD_SADMXN_SSE2(32, 8)
+DIST_WTD_SADMXN_SSE2(16, 64)
+DIST_WTD_SADMXN_SSE2(64, 16)
+#endif
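After this rename the dist-weighted SAD entry points need only SSE2: each generated aom_dist_wtd_sadMxN_avg_sse2 builds a distance-weighted compound prediction and scores it with one of the static sadWxH helpers. The scalar shape of those helpers, as a reference sketch (not the vectorized code itself):

#include <stdint.h>
#include <stdlib.h>

// Sum of absolute differences over a width x height block.
static unsigned int sad_ref(const uint8_t *a, int a_stride, const uint8_t *b,
                            int b_stride, int width, int height) {
  unsigned int res = 0;
  for (int i = 0; i < height; i++) {
    for (int j = 0; j < width; j++) res += abs(a[j] - b[j]);
    a += a_stride;
    b += b_stride;
  }
  return res;
}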
diff --git a/aom_dsp/x86/obmc_intrinsic_ssse3.h b/aom_dsp/x86/obmc_intrinsic_ssse3.h
index 48486c6..27398ff 100644
--- a/aom_dsp/x86/obmc_intrinsic_ssse3.h
+++ b/aom_dsp/x86/obmc_intrinsic_ssse3.h
@@ -24,7 +24,7 @@
static INLINE int64_t xx_hsum_epi64_si64(__m128i v_q) {
v_q = _mm_add_epi64(v_q, _mm_srli_si128(v_q, 8));
-#if ARCH_X86_64
+#if AOM_ARCH_X86_64
return _mm_cvtsi128_si64(v_q);
#else
{
diff --git a/aom_dsp/x86/sad4d_sse2.asm b/aom_dsp/x86/sad4d_sse2.asm
index 6de708b..6edad99 100644
--- a/aom_dsp/x86/sad4d_sse2.asm
+++ b/aom_dsp/x86/sad4d_sse2.asm
@@ -15,13 +15,6 @@
SECTION .text
-%macro AVG_4x2x4 2
- movh m2, [second_predq]
- movlhps m2, m2
- pavgb %1, m2
- pavgb %2, m2
- lea second_predq, [second_predq+8]
-%endmacro
; 'spill_src_stride' greatly affects how the code works.
;
; When 'spill_src_stride' is false, the 'src_strideq' resides in
@@ -64,8 +57,8 @@
lea ref4q, [ref4q+ref_strideq*2]
%endmacro
-; PROCESS_4x2x4 first, do_avg
-%macro PROCESS_4x2x4 2
+; PROCESS_4x2x4 first
+%macro PROCESS_4x2x4 1
movd m0, [srcq]
HANDLE_SECOND_OFFSET
%if %1 == 1
@@ -87,9 +80,6 @@
movlhps m0, m0
movlhps m6, m4
movlhps m7, m5
-%if %2 == 1
- AVG_4x2x4 m6, m7
-%endif
psadbw m6, m0
psadbw m7, m0
%else
@@ -110,9 +100,6 @@
movlhps m0, m0
movlhps m1, m2
movlhps m3, m4
-%if %2 == 1
- AVG_4x2x4 m1, m3
-%endif
psadbw m1, m0
psadbw m3, m0
paddd m6, m1
@@ -120,8 +107,8 @@
%endif
%endmacro
-; PROCESS_8x2x4 first, do_avg
-%macro PROCESS_8x2x4 2
+; PROCESS_8x2x4 first
+%macro PROCESS_8x2x4 1
movh m0, [srcq]
HANDLE_SECOND_OFFSET
%if %1 == 1
@@ -134,14 +121,6 @@
movhps m5, [ref2q+ref_strideq]
movhps m6, [ref3q+ref_strideq]
movhps m7, [ref4q+ref_strideq]
-%if %2 == 1
- movu m3, [second_predq]
- pavgb m4, m3
- pavgb m5, m3
- pavgb m6, m3
- pavgb m7, m3
- lea second_predq, [second_predq+mmsize]
-%endif
psadbw m4, m0
psadbw m5, m0
psadbw m6, m0
@@ -152,11 +131,6 @@
movhps m0, [srcq + second_offset]
movhps m1, [ref1q+ref_strideq]
movhps m2, [ref2q+ref_strideq]
-%if %2 == 1
- movu m3, [second_predq]
- pavgb m1, m3
- pavgb m2, m3
-%endif
psadbw m1, m0
psadbw m2, m0
paddd m4, m1
@@ -166,11 +140,6 @@
movhps m1, [ref3q+ref_strideq]
movh m2, [ref4q]
movhps m2, [ref4q+ref_strideq]
-%if %2 == 1
- pavgb m1, m3
- pavgb m2, m3
- lea second_predq, [second_predq+mmsize]
-%endif
psadbw m1, m0
psadbw m2, m0
paddd m6, m1
@@ -178,37 +147,24 @@
%endif
%endmacro
-; PROCESS_FIRST_MMSIZE do_avg
-%macro PROCESS_FIRST_MMSIZE 1
+; PROCESS_FIRST_MMSIZE
+%macro PROCESS_FIRST_MMSIZE 0
mova m0, [srcq]
movu m4, [ref1q]
movu m5, [ref2q]
movu m6, [ref3q]
movu m7, [ref4q]
-%if %1 == 1
- movu m3, [second_predq]
- pavgb m4, m3
- pavgb m5, m3
- pavgb m6, m3
- pavgb m7, m3
- lea second_predq, [second_predq+mmsize]
-%endif
psadbw m4, m0
psadbw m5, m0
psadbw m6, m0
psadbw m7, m0
%endmacro
-; PROCESS_16x1x4 offset, do_avg
-%macro PROCESS_16x1x4 2
+; PROCESS_16x1x4 offset
+%macro PROCESS_16x1x4 1
mova m0, [srcq + %1]
movu m1, [ref1q + ref_offsetq + %1]
movu m2, [ref2q + ref_offsetq + %1]
-%if %2 == 1
- movu m3, [second_predq]
- pavgb m1, m3
- pavgb m2, m3
-%endif
psadbw m1, m0
psadbw m2, m0
paddd m4, m1
@@ -216,11 +172,6 @@
movu m1, [ref3q + ref_offsetq + %1]
movu m2, [ref4q + ref_offsetq + %1]
-%if %2 == 1
- pavgb m1, m3
- pavgb m2, m3
- lea second_predq, [second_predq+mmsize]
-%endif
psadbw m1, m0
psadbw m2, m0
paddd m6, m1
@@ -233,9 +184,8 @@
; Macro Arguments:
; 1: Width
; 2: Height
-; 3: If 0, then normal sad, else avg
-; 4: If 0, then normal sad, else skip rows
-%macro SADNXN4D 2-4 0,0
+; 3: If 0, then normal sad, else skip rows
+%macro SADNXN4D 2-3 0
%define spill_src_stride 0
%define spill_ref_stride 0
@@ -249,8 +199,8 @@
; Remove loops in the 4x4 and 8x4 case
%define use_loop (use_ref_offset || %2 > 4)
-%if %4 == 1 ; skip rows
-%if ARCH_X86_64
+%if %3 == 1 ; skip rows
+%if AOM_ARCH_X86_64
%if use_ref_offset
cglobal sad_skip_%1x%2x4d, 5, 10, 8, src, src_stride, ref1, ref_stride, res, \
ref2, ref3, ref4, cnt, ref_offset
@@ -276,8 +226,8 @@
ref3, ref4
%endif
%endif
-%elif %3 == 0 ; normal sad
-%if ARCH_X86_64
+%else ; normal sad
+%if AOM_ARCH_X86_64
%if use_ref_offset
cglobal sad%1x%2x4d, 5, 10, 8, src, src_stride, ref1, ref_stride, res, ref2, \
ref3, ref4, cnt, ref_offset
@@ -301,34 +251,6 @@
ref4
%endif
%endif
-%else ; avg
-%if ARCH_X86_64
-%if use_ref_offset
-cglobal sad%1x%2x4d_avg, 6, 11, 8, src, src_stride, ref1, ref_stride, \
- second_pred, res, ref2, ref3, ref4, cnt, \
- ref_offset
-%elif use_loop
-cglobal sad%1x%2x4d_avg, 6, 10, 8, src, src_stride, ref1, ref_stride, \
- second_pred, res, ref2, ref3, ref4, cnt
-%else
-cglobal sad%1x%2x4d_avg, 6, 9, 8, src, src_stride, ref1, ref_stride, \
- second_pred, res, ref2, ref3, ref4
-%endif
-%else
-%if use_ref_offset
-cglobal sad%1x%2x4d_avg, 5, 7, 8, src, ref4, ref1, ref_offset, second_pred, ref2, ref3
- %define spill_src_stride 1
- %define spill_ref_stride 1
- %define spill_cnt 1
-%elif use_loop
-cglobal sad%1x%2x4d_avg, 5, 7, 8, src, ref4, ref1, ref_stride, second_pred, ref2, ref3
- %define spill_src_stride 1
- %define spill_cnt 1
-%else
-cglobal sad%1x%2x4d_avg, 5, 7, 8, src, ref4, ref1, ref_stride, second_pred, ref2, ref3
- %define spill_src_stride 1
-%endif
-%endif
%endif
%if spill_src_stride
@@ -345,7 +267,7 @@
%define cntd word [rsp]
%endif
-%if %4 == 1
+%if %3 == 1
sal src_strided, 1
sal ref_strided, 1
%endif
@@ -362,14 +284,12 @@
%define external_loop (use_ref_offset && %1 > mmsize && %1 != %2)
%if use_ref_offset
- PROCESS_FIRST_MMSIZE %3
+ PROCESS_FIRST_MMSIZE
%if %1 > mmsize
mov ref_offsetq, 0
- mov cntd, %2 >> %4
+ mov cntd, %2 >> %3
; Jump part way into the loop for the square version of this width
%if %3 == 1
- jmp mangle(private_prefix %+ _sad%1x%1x4d_avg %+ SUFFIX).midloop
-%elif %4 == 1
jmp mangle(private_prefix %+ _sad_skip_%1x%1x4d %+ SUFFIX).midloop
%else
jmp mangle(private_prefix %+ _sad%1x%1x4d %+ SUFFIX).midloop
@@ -377,14 +297,14 @@
%else
mov ref_offsetq, ref_strideq
add srcq, src_strideq
- mov cntd, (%2 >> %4) - 1
+ mov cntd, (%2 >> %3) - 1
%endif
%if external_loop == 0
.loop:
; Unrolled horizontal loop
%assign h_offset 0
%rep %1/mmsize
- PROCESS_16x1x4 h_offset, %3
+ PROCESS_16x1x4 h_offset
%if h_offset == 0
; The first row of the first column is done outside the loop and jumps here
.midloop:
@@ -398,13 +318,13 @@
jnz .loop
%endif
%else
- PROCESS_%1x2x4 1, %3
+ PROCESS_%1x2x4 1
ADVANCE_END_OF_TWO_LINES
%if use_loop
- mov cntd, (%2/2 >> %4) - 1
+ mov cntd, (%2/2 >> %3) - 1
.loop:
%endif
- PROCESS_%1x2x4 0, %3
+ PROCESS_%1x2x4 0
%if use_loop
ADVANCE_END_OF_TWO_LINES
sub cntd, 1
@@ -421,13 +341,10 @@
%if %3 == 0
%define resultq r4
%define resultmp r4mp
-%else
- %define resultq r5
- %define resultmp r5mp
%endif
; Undo modifications on parameters on the stack
-%if %4 == 1
+%if %3 == 1
%if spill_src_stride
shr src_strided, 1
%endif
@@ -446,7 +363,7 @@
punpcklqdq m4, m6
punpckhqdq m5, m7
paddd m4, m5
-%if %4 == 1
+%if %3 == 1
pslld m4, 1
%endif
movifnidn resultq, resultmp
@@ -455,7 +372,7 @@
%else
pshufd m6, m6, 0x08
pshufd m7, m7, 0x08
-%if %4 == 1
+%if %3 == 1
pslld m6, 1
pslld m7, 1
%endif
@@ -492,7 +409,6 @@
SADNXN4D 16, 64
SADNXN4D 64, 16
%endif
-%if CONFIG_REALTIME_ONLY==0
SADNXN4D 128, 128, 1
SADNXN4D 128, 64, 1
SADNXN4D 64, 128, 1
@@ -506,39 +422,16 @@
SADNXN4D 16, 8, 1
SADNXN4D 8, 16, 1
SADNXN4D 8, 8, 1
-SADNXN4D 8, 4, 1
SADNXN4D 4, 8, 1
-SADNXN4D 4, 4, 1
+%if CONFIG_REALTIME_ONLY==0
SADNXN4D 4, 16, 1
-SADNXN4D 16, 4, 1
SADNXN4D 8, 32, 1
SADNXN4D 32, 8, 1
SADNXN4D 16, 64, 1
SADNXN4D 64, 16, 1
%endif
-SADNXN4D 128, 128, 0, 1
-SADNXN4D 128, 64, 0, 1
-SADNXN4D 64, 128, 0, 1
-SADNXN4D 64, 64, 0, 1
-SADNXN4D 64, 32, 0, 1
-SADNXN4D 32, 64, 0, 1
-SADNXN4D 32, 32, 0, 1
-SADNXN4D 32, 16, 0, 1
-SADNXN4D 16, 32, 0, 1
-SADNXN4D 16, 16, 0, 1
-SADNXN4D 16, 8, 0, 1
-SADNXN4D 8, 16, 0, 1
-SADNXN4D 8, 8, 0, 1
-SADNXN4D 4, 8, 0, 1
-%if CONFIG_REALTIME_ONLY==0
-SADNXN4D 4, 16, 0, 1
-SADNXN4D 8, 32, 0, 1
-SADNXN4D 32, 8, 0, 1
-SADNXN4D 16, 64, 0, 1
-SADNXN4D 64, 16, 0, 1
-%endif
; Different assembly is needed when the height gets subsampled to 2
-; SADNXN4D 16, 4, 0, 1
-; SADNXN4D 8, 4, 0, 1
-; SADNXN4D 4, 4, 0, 1
+; SADNXN4D 16, 4, 1
+; SADNXN4D 8, 4, 1
+; SADNXN4D 4, 4, 1
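With the unused _avg variants removed, the x4d macros keep only the normal and row-skipping forms: one source block is scored against four reference blocks at once, and the skip form halves the rows (the strides are doubled) and doubles the result, as the pslld by 1 above shows. A scalar reference for that behaviour (a sketch, not the asm):

#include <stdint.h>
#include <stdlib.h>

// Score one source block against four references; 'skip' processes every
// other row and doubles the result to approximate the full-height SAD.
static void sad_nx4d_ref(const uint8_t *src, int src_stride,
                         const uint8_t *const ref[4], int ref_stride,
                         int width, int height, int skip, uint32_t res[4]) {
  const int step = skip ? 2 : 1;
  for (int r = 0; r < 4; r++) {
    uint32_t sad = 0;
    for (int i = 0; i < height; i += step)
      for (int j = 0; j < width; j++)
        sad += abs(src[i * src_stride + j] - ref[r][i * ref_stride + j]);
    res[r] = skip ? 2 * sad : sad;
  }
}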
diff --git a/aom_dsp/x86/sad_sse2.asm b/aom_dsp/x86/sad_sse2.asm
index de9845a..dbe8ca3 100644
--- a/aom_dsp/x86/sad_sse2.asm
+++ b/aom_dsp/x86/sad_sse2.asm
@@ -42,11 +42,11 @@
cglobal sad%1x%2_avg, 5, 1 + %3, 5, src, src_stride, ref, ref_stride, \
second_pred, n_rows
%else ; %3 == 7
-cglobal sad%1x%2_avg, 5, ARCH_X86_64 + %3, 6, src, src_stride, \
+cglobal sad%1x%2_avg, 5, AOM_ARCH_X86_64 + %3, 6, src, src_stride, \
ref, ref_stride, \
second_pred, \
src_stride3, ref_stride3
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
%define n_rowsd r7d
%else ; x86-32
%define n_rowsd dword r0m
diff --git a/aom_dsp/x86/subpel_variance_sse2.asm b/aom_dsp/x86/subpel_variance_sse2.asm
index cbf2890..d1d8373 100644
--- a/aom_dsp/x86/subpel_variance_sse2.asm
+++ b/aom_dsp/x86/subpel_variance_sse2.asm
@@ -98,7 +98,7 @@
%endmacro
%macro INC_SRC_BY_SRC_STRIDE 0
-%if ARCH_X86=1 && CONFIG_PIC=1
+%if AOM_ARCH_X86=1 && CONFIG_PIC=1
add srcq, src_stridemp
%else
add srcq, src_strideq
@@ -117,7 +117,7 @@
; 11, not 13, if the registers are ordered correctly. May make a minor speed
; difference on Win64
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
%if %2 == 1 ; avg
cglobal sub_pixel_avg_variance%1xh, 9, 10, 13, src, src_stride, \
x_offset, y_offset, dst, dst_stride, \
@@ -355,11 +355,11 @@
.x_zero_y_nonhalf:
; x_offset == 0 && y_offset == bilin interpolation
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
lea bilin_filter, [GLOBAL(bilin_filter_m)]
%endif
shl y_offsetd, filter_idx_shift
-%if ARCH_X86_64 && %1 > 4
+%if AOM_ARCH_X86_64 && %1 > 4
mova m8, [bilin_filter+y_offsetq]
%if notcpuflag(ssse3) ; FIXME(rbultje) don't scatter registers on x86-64
mova m9, [bilin_filter+y_offsetq+16]
@@ -369,7 +369,7 @@
%define filter_y_b m9
%define filter_rnd m10
%else ; x86-32 or mmx
-%if ARCH_X86=1 && CONFIG_PIC=1
+%if AOM_ARCH_X86=1 && CONFIG_PIC=1
; x_offset == 0, reuse x_offset reg
%define tempq x_offsetq
add y_offsetq, g_bilin_filterm
@@ -678,11 +678,11 @@
.x_half_y_nonhalf:
; x_offset == 0.5 && y_offset == bilin interpolation
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
lea bilin_filter, [GLOBAL(bilin_filter_m)]
%endif
shl y_offsetd, filter_idx_shift
-%if ARCH_X86_64 && %1 > 4
+%if AOM_ARCH_X86_64 && %1 > 4
mova m8, [bilin_filter+y_offsetq]
%if notcpuflag(ssse3) ; FIXME(rbultje) don't scatter registers on x86-64
mova m9, [bilin_filter+y_offsetq+16]
@@ -692,7 +692,7 @@
%define filter_y_b m9
%define filter_rnd m10
%else ;x86_32
-%if ARCH_X86=1 && CONFIG_PIC=1
+%if AOM_ARCH_X86=1 && CONFIG_PIC=1
; x_offset == 0.5. We can reuse x_offset reg
%define tempq x_offsetq
add y_offsetq, g_bilin_filterm
@@ -836,11 +836,11 @@
jnz .x_nonhalf_y_nonzero
; x_offset == bilin interpolation && y_offset == 0
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
lea bilin_filter, [GLOBAL(bilin_filter_m)]
%endif
shl x_offsetd, filter_idx_shift
-%if ARCH_X86_64 && %1 > 4
+%if AOM_ARCH_X86_64 && %1 > 4
mova m8, [bilin_filter+x_offsetq]
%if notcpuflag(ssse3) ; FIXME(rbultje) don't scatter registers on x86-64
mova m9, [bilin_filter+x_offsetq+16]
@@ -850,7 +850,7 @@
%define filter_x_b m9
%define filter_rnd m10
%else ; x86-32
-%if ARCH_X86=1 && CONFIG_PIC=1
+%if AOM_ARCH_X86=1 && CONFIG_PIC=1
;y_offset == 0. We can reuse y_offset reg.
%define tempq y_offsetq
add x_offsetq, g_bilin_filterm
@@ -978,11 +978,11 @@
jne .x_nonhalf_y_nonhalf
; x_offset == bilin interpolation && y_offset == 0.5
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
lea bilin_filter, [GLOBAL(bilin_filter_m)]
%endif
shl x_offsetd, filter_idx_shift
-%if ARCH_X86_64 && %1 > 4
+%if AOM_ARCH_X86_64 && %1 > 4
mova m8, [bilin_filter+x_offsetq]
%if notcpuflag(ssse3) ; FIXME(rbultje) don't scatter registers on x86-64
mova m9, [bilin_filter+x_offsetq+16]
@@ -992,7 +992,7 @@
%define filter_x_b m9
%define filter_rnd m10
%else ; x86-32
-%if ARCH_X86=1 && CONFIG_PIC=1
+%if AOM_ARCH_X86=1 && CONFIG_PIC=1
; y_offset == 0.5. We can reuse y_offset reg.
%define tempq y_offsetq
add x_offsetq, g_bilin_filterm
@@ -1176,12 +1176,12 @@
STORE_AND_RET %1
.x_nonhalf_y_nonhalf:
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
lea bilin_filter, [GLOBAL(bilin_filter_m)]
%endif
shl x_offsetd, filter_idx_shift
shl y_offsetd, filter_idx_shift
-%if ARCH_X86_64 && %1 > 4
+%if AOM_ARCH_X86_64 && %1 > 4
mova m8, [bilin_filter+x_offsetq]
%if notcpuflag(ssse3) ; FIXME(rbultje) don't scatter registers on x86-64
mova m9, [bilin_filter+x_offsetq+16]
@@ -1197,7 +1197,7 @@
%define filter_y_b m11
%define filter_rnd m12
%else ; x86-32
-%if ARCH_X86=1 && CONFIG_PIC=1
+%if AOM_ARCH_X86=1 && CONFIG_PIC=1
; In this case there is NO unused register, so the src_stride register is
; used; src_stride must be reloaded from the stack when it is needed later.
%define tempq src_strideq
diff --git a/aom_dsp/x86/subtract_sse2.asm b/aom_dsp/x86/subtract_sse2.asm
index af38022..fd508c0 100644
--- a/aom_dsp/x86/subtract_sse2.asm
+++ b/aom_dsp/x86/subtract_sse2.asm
@@ -40,8 +40,8 @@
%macro loop16 6
mova m0, [srcq+%1]
mova m4, [srcq+%2]
- mova m1, [predq+%3]
- mova m5, [predq+%4]
+ movu m1, [predq+%3]
+ movu m5, [predq+%4]
punpckhbw m2, m0, m7
punpckhbw m3, m1, m7
punpcklbw m0, m7
diff --git a/aom_dsp/x86/sum_squares_sse2.c b/aom_dsp/x86/sum_squares_sse2.c
index 25be856..cf3ed98 100644
--- a/aom_dsp/x86/sum_squares_sse2.c
+++ b/aom_dsp/x86/sum_squares_sse2.c
@@ -23,7 +23,7 @@
}
static INLINE uint64_t xx_cvtsi128_si64(__m128i a) {
-#if ARCH_X86_64
+#if AOM_ARCH_X86_64
return (uint64_t)_mm_cvtsi128_si64(a);
#else
{
diff --git a/aom_dsp/x86/synonyms.h b/aom_dsp/x86/synonyms.h
index d538015..6744ec5 100644
--- a/aom_dsp/x86/synonyms.h
+++ b/aom_dsp/x86/synonyms.h
@@ -85,6 +85,16 @@
#endif
}
+// Fill an SSE register using an interleaved pair of values, i.e. set the
+// 8 channels to {a, b, a, b, a, b, a, b}, using the same channel ordering
+// as when a register is stored to / loaded from memory.
+//
+// This is useful for rearranging filter kernels for use with the
+// _mm_madd_epi16 instruction.
+static INLINE __m128i xx_set2_epi16(int16_t a, int16_t b) {
+  return _mm_setr_epi16(a, b, a, b, a, b, a, b);
+}
+
static INLINE __m128i xx_round_epu16(__m128i v_val_w) {
return _mm_avg_epu16(v_val_w, _mm_setzero_si128());
}
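The new xx_set2_epi16 helper broadcasts a tap pair so that _mm_madd_epi16 can apply a two-tap filter to adjacent 16-bit samples in one instruction. A self-contained usage sketch with hypothetical taps {3, 5}:

#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

static __m128i xx_set2_epi16(int16_t a, int16_t b) {
  return _mm_setr_epi16(a, b, a, b, a, b, a, b);
}

int main(void) {
  const __m128i taps = xx_set2_epi16(3, 5);
  const int16_t px[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
  const __m128i v = _mm_loadu_si128((const __m128i *)px);
  // Each 32-bit lane becomes 3 * px[2i] + 5 * px[2i + 1].
  const __m128i acc = _mm_madd_epi16(v, taps);
  int32_t out[4];
  _mm_storeu_si128((__m128i *)out, acc);
  for (int i = 0; i < 4; i++) printf("%d ", out[i]);
  printf("\n");  // prints: 13 29 45 61
  return 0;
}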
diff --git a/aom_dsp/x86/variance_avx2.c b/aom_dsp/x86/variance_avx2.c
index a475fb7..046d6f1 100644
--- a/aom_dsp/x86/variance_avx2.c
+++ b/aom_dsp/x86/variance_avx2.c
@@ -269,6 +269,95 @@
_mm256_storeu_si256((__m256i *)(comp_pred), roundA);
}
+void aom_comp_avg_pred_avx2(uint8_t *comp_pred, const uint8_t *pred, int width,
+ int height, const uint8_t *ref, int ref_stride) {
+ int row = 0;
+ if (width == 8) {
+ do {
+ const __m256i pred_0123 = _mm256_loadu_si256((const __m256i *)(pred));
+ const __m128i ref_0 = _mm_loadl_epi64((const __m128i *)(ref));
+ const __m128i ref_1 =
+ _mm_loadl_epi64((const __m128i *)(ref + ref_stride));
+ const __m128i ref_2 =
+ _mm_loadl_epi64((const __m128i *)(ref + 2 * ref_stride));
+ const __m128i ref_3 =
+ _mm_loadl_epi64((const __m128i *)(ref + 3 * ref_stride));
+ const __m128i ref_01 = _mm_unpacklo_epi64(ref_0, ref_1);
+ const __m128i ref_23 = _mm_unpacklo_epi64(ref_2, ref_3);
+
+ const __m256i ref_0123 =
+ _mm256_inserti128_si256(_mm256_castsi128_si256(ref_01), ref_23, 1);
+ const __m256i average = _mm256_avg_epu8(pred_0123, ref_0123);
+ _mm256_storeu_si256((__m256i *)(comp_pred), average);
+
+ row += 4;
+ pred += 32;
+ comp_pred += 32;
+ ref += 4 * ref_stride;
+ } while (row < height);
+ } else if (width == 16) {
+ do {
+ const __m256i pred_0 = _mm256_loadu_si256((const __m256i *)(pred));
+ const __m256i pred_1 = _mm256_loadu_si256((const __m256i *)(pred + 32));
+ const __m256i tmp0 =
+ _mm256_castsi128_si256(_mm_loadu_si128((const __m128i *)(ref)));
+ const __m256i ref_0 = _mm256_inserti128_si256(
+ tmp0, _mm_loadu_si128((const __m128i *)(ref + ref_stride)), 1);
+ const __m256i tmp1 = _mm256_castsi128_si256(
+ _mm_loadu_si128((const __m128i *)(ref + 2 * ref_stride)));
+ const __m256i ref_1 = _mm256_inserti128_si256(
+ tmp1, _mm_loadu_si128((const __m128i *)(ref + 3 * ref_stride)), 1);
+ const __m256i average_0 = _mm256_avg_epu8(pred_0, ref_0);
+ const __m256i average_1 = _mm256_avg_epu8(pred_1, ref_1);
+ _mm256_storeu_si256((__m256i *)(comp_pred), average_0);
+ _mm256_storeu_si256((__m256i *)(comp_pred + 32), average_1);
+
+ row += 4;
+ pred += 64;
+ comp_pred += 64;
+ ref += 4 * ref_stride;
+ } while (row < height);
+ } else if (width == 32) {
+ do {
+ const __m256i pred_0 = _mm256_loadu_si256((const __m256i *)(pred));
+ const __m256i pred_1 = _mm256_loadu_si256((const __m256i *)(pred + 32));
+ const __m256i ref_0 = _mm256_loadu_si256((const __m256i *)(ref));
+ const __m256i ref_1 =
+ _mm256_loadu_si256((const __m256i *)(ref + ref_stride));
+ const __m256i average_0 = _mm256_avg_epu8(pred_0, ref_0);
+ const __m256i average_1 = _mm256_avg_epu8(pred_1, ref_1);
+ _mm256_storeu_si256((__m256i *)(comp_pred), average_0);
+ _mm256_storeu_si256((__m256i *)(comp_pred + 32), average_1);
+
+ row += 2;
+ pred += 64;
+ comp_pred += 64;
+ ref += 2 * ref_stride;
+ } while (row < height);
+ } else if (width % 64 == 0) {
+ do {
+ for (int x = 0; x < width; x += 64) {
+ const __m256i pred_0 = _mm256_loadu_si256((const __m256i *)(pred + x));
+ const __m256i pred_1 =
+ _mm256_loadu_si256((const __m256i *)(pred + x + 32));
+ const __m256i ref_0 = _mm256_loadu_si256((const __m256i *)(ref + x));
+ const __m256i ref_1 =
+ _mm256_loadu_si256((const __m256i *)(ref + x + 32));
+ const __m256i average_0 = _mm256_avg_epu8(pred_0, ref_0);
+ const __m256i average_1 = _mm256_avg_epu8(pred_1, ref_1);
+ _mm256_storeu_si256((__m256i *)(comp_pred + x), average_0);
+ _mm256_storeu_si256((__m256i *)(comp_pred + x + 32), average_1);
+ }
+ row++;
+ pred += width;
+ comp_pred += width;
+ ref += ref_stride;
+ } while (row < height);
+ } else {
+ aom_comp_avg_pred_c(comp_pred, pred, width, height, ref, ref_stride);
+ }
+}
+
void aom_comp_mask_pred_avx2(uint8_t *comp_pred, const uint8_t *pred, int width,
int height, const uint8_t *ref, int ref_stride,
const uint8_t *mask, int mask_stride,
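aom_comp_avg_pred_avx2 above vectorizes the plain compound average for widths 8, 16, 32, and multiples of 64, falling back to aom_comp_avg_pred_c otherwise. The scalar operation it accelerates, for reference (_mm256_avg_epu8 computes the rounded average (a + b + 1) >> 1 per byte):

#include <stdint.h>

// Scalar sketch of aom_comp_avg_pred_c's behaviour; pred and comp_pred are
// stored contiguously at 'width' stride, ref uses its own stride.
static void comp_avg_pred_ref(uint8_t *comp_pred, const uint8_t *pred,
                              int width, int height, const uint8_t *ref,
                              int ref_stride) {
  for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++)
      comp_pred[x] = (uint8_t)((pred[x] + ref[x] + 1) >> 1);
    comp_pred += width;
    pred += width;
    ref += ref_stride;
  }
}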
diff --git a/aom_dsp/x86/variance_sse2.c b/aom_dsp/x86/variance_sse2.c
index 7d4ff4f..faec9cf 100644
--- a/aom_dsp/x86/variance_sse2.c
+++ b/aom_dsp/x86/variance_sse2.c
@@ -46,6 +46,12 @@
return _mm_unpacklo_epi8(p0, _mm_setzero_si128());
}
+static INLINE void load16_8to16_sse2(const uint8_t *const p, __m128i *out) {
+ const __m128i p0 = _mm_loadu_si128((const __m128i *)p);
+ out[0] = _mm_unpacklo_epi8(p0, _mm_setzero_si128()); // lower 8 values
+ out[1] = _mm_unpackhi_epi8(p0, _mm_setzero_si128()); // upper 8 values
+}
+
// Accumulate 4 32bit numbers in val to 1 32bit number
static INLINE unsigned int add32x4_sse2(__m128i val) {
val = _mm_add_epi32(val, _mm_srli_si128(val, 8));
@@ -232,14 +238,6 @@
}
}
-void aom_get8x8var_sse2(const uint8_t *src_ptr, int src_stride,
- const uint8_t *ref_ptr, int ref_stride,
- unsigned int *sse, int *sum) {
- __m128i vsse, vsum;
- variance8_sse2(src_ptr, src_stride, ref_ptr, ref_stride, 8, &vsse, &vsum);
- variance_final_128_pel_sse2(vsse, vsum, sse, sum);
-}
-
void aom_get_var_sse_sum_8x8_quad_sse2(const uint8_t *src_ptr, int src_stride,
const uint8_t *ref_ptr, int ref_stride,
uint32_t *sse8x8, int *sum8x8,
@@ -271,6 +269,42 @@
var8x8[i] = sse8x8[i] - (uint32_t)(((int64_t)sum8x8[i] * sum8x8[i]) >> 6);
}
+void aom_get_var_sse_sum_16x16_dual_sse2(const uint8_t *src_ptr, int src_stride,
+ const uint8_t *ref_ptr, int ref_stride,
+ uint32_t *sse16x16,
+ unsigned int *tot_sse, int *tot_sum,
+ uint32_t *var16x16) {
+ int sum16x16[2] = { 0 };
+  // Loop over the two horizontally adjacent 16x16 blocks of one 32x16 region.
+ for (int k = 0; k < 2; k++) {
+ const uint8_t *src = src_ptr;
+ const uint8_t *ref = ref_ptr;
+ __m128i vsum = _mm_setzero_si128();
+ __m128i vsse = _mm_setzero_si128();
+ for (int i = 0; i < 16; i++) {
+ __m128i s[2];
+ __m128i r[2];
+ load16_8to16_sse2(src + (k * 16), s);
+ load16_8to16_sse2(ref + (k * 16), r);
+ const __m128i diff0 = _mm_sub_epi16(s[0], r[0]);
+ const __m128i diff1 = _mm_sub_epi16(s[1], r[1]);
+ vsse = _mm_add_epi32(vsse, _mm_madd_epi16(diff0, diff0));
+ vsse = _mm_add_epi32(vsse, _mm_madd_epi16(diff1, diff1));
+ vsum = _mm_add_epi16(vsum, _mm_add_epi16(diff0, diff1));
+ src += src_stride;
+ ref += ref_stride;
+ }
+ variance_final_256_pel_sse2(vsse, vsum, &sse16x16[k], &sum16x16[k]);
+ }
+
+  // Calculate variance at the 16x16 level, plus the total sse and sum of the
+  // 32x16 region.
+ *tot_sse += sse16x16[0] + sse16x16[1];
+ *tot_sum += sum16x16[0] + sum16x16[1];
+ for (int i = 0; i < 2; i++)
+ var16x16[i] =
+ sse16x16[i] - (uint32_t)(((int64_t)sum16x16[i] * sum16x16[i]) >> 8);
+}
+
#define AOM_VAR_NO_LOOP_SSE2(bw, bh, bits, max_pixels) \
unsigned int aom_variance##bw##x##bh##_sse2( \
const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, \
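The new dual kernel above accumulates sse and sum for two side-by-side 16x16 blocks and derives each variance as sse - sum^2 / 256, matching the >> 8 in the code. A scalar reference for one block (a sketch, not the SIMD code):

#include <stdint.h>

static uint32_t variance16x16_ref(const uint8_t *src, int src_stride,
                                  const uint8_t *ref, int ref_stride) {
  int64_t sse = 0, sum = 0;
  for (int i = 0; i < 16; i++) {
    for (int j = 0; j < 16; j++) {
      const int d = src[j] - ref[j];
      sse += d * d;
      sum += d;
    }
    src += src_stride;
    ref += ref_stride;
  }
  // Population variance scaled by the 256 pixels: sse - sum^2 / 256.
  return (uint32_t)(sse - ((sum * sum) >> 8));
}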
diff --git a/aom_ports/aom_once.h b/aom_ports/aom_once.h
index 0974a3b..680120f 100644
--- a/aom_ports/aom_once.h
+++ b/aom_ports/aom_once.h
@@ -60,29 +60,6 @@
InitOnceComplete(&aom_init_once, 0, NULL);
}
-#elif CONFIG_MULTITHREAD && defined(__OS2__)
-#define INCL_DOS
-#include <os2.h>
-static void aom_once(void (*func)(void)) {
- static volatile int done;
-
- /* If the initialization is complete, return early. */
- if (done) return;
-
- /* Causes all other threads in the process to block themselves
- * and give up their time slice.
- */
- DosEnterCritSec();
-
- if (!done) {
- func();
- done = 1;
- }
-
- /* Restores normal thread dispatching for the current process. */
- DosExitCritSec();
-}
-
#elif CONFIG_MULTITHREAD && HAVE_PTHREAD_H
#include <pthread.h>
static void aom_once(void (*func)(void)) {
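With the OS/2 branch removed, the remaining non-Windows threaded path is the pthread one. Assuming it follows the usual pthread_once pattern (a sketch, not necessarily the verbatim libaom body):

#include <pthread.h>

static void aom_once_sketch(void (*func)(void)) {
  static pthread_once_t once = PTHREAD_ONCE_INIT;
  // pthread_once guarantees func runs exactly once across all threads.
  pthread_once(&once, func);
}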
diff --git a/aom_ports/aom_ports.cmake b/aom_ports/aom_ports.cmake
index 5d9f69a..e3b67e4 100644
--- a/aom_ports/aom_ports.cmake
+++ b/aom_ports/aom_ports.cmake
@@ -27,6 +27,12 @@
list(APPEND AOM_PORTS_SOURCES_ARM "${AOM_ROOT}/aom_ports/arm.h"
"${AOM_ROOT}/aom_ports/arm_cpudetect.c")
+if(CONFIG_RUNTIME_CPU_DETECT AND ANDROID_NDK)
+ include_directories(${ANDROID_NDK}/sources/android/cpufeatures)
+ list(APPEND AOM_PORTS_SOURCES_ARM
+ "${ANDROID_NDK}/sources/android/cpufeatures/cpu-features.c")
+endif()
+
list(APPEND AOM_PORTS_SOURCES_PPC "${AOM_ROOT}/aom_ports/ppc.h"
"${AOM_ROOT}/aom_ports/ppc_cpudetect.c")
@@ -43,9 +49,13 @@
#
# * The libaom target must exist before this function is called.
function(setup_aom_ports_targets)
- if(WIN32 AND "${AOM_TARGET_CPU}" STREQUAL "x86_64")
+ if(XCODE AND "${AOM_TARGET_CPU}" STREQUAL "x86_64")
add_asm_library("aom_ports" "AOM_PORTS_ASM_X86")
- set(aom_ports_asm_lib 1)
+    # Xcode is the only generator that embeds the asm sources directly in
+    # the aom targets.
+ set(aom_ports_is_embedded 1)
+ set(aom_ports_has_symbols 1)
+ elseif(WIN32 AND "${AOM_TARGET_CPU}" STREQUAL "x86_64")
+ add_asm_library("aom_ports" "AOM_PORTS_ASM_X86")
set(aom_ports_has_symbols 1)
elseif("${AOM_TARGET_CPU}" MATCHES "arm")
add_library(aom_ports OBJECT ${AOM_PORTS_SOURCES_ARM})
@@ -68,14 +78,7 @@
# libaom_srcs.*; if it becomes necessary for a particular generator another
# method should be used.
if(aom_ports_has_symbols)
- if(aom_ports_asm_lib)
- # When aom_ports is an asm library its name changes based on build
- # configuration. This handles adding sources to the correct target(s).
- target_sources(aom_ports_static PRIVATE ${AOM_PORTS_INCLUDES})
- if(BUILD_SHARED_LIBS)
- target_sources(aom_ports_shared PRIVATE ${AOM_PORTS_INCLUDES})
- endif()
- else()
+ if(NOT aom_ports_is_embedded)
target_sources(aom_ports PRIVATE ${AOM_PORTS_INCLUDES})
endif()
set(AOM_LIB_TARGETS ${AOM_LIB_TARGETS} PARENT_SCOPE)
diff --git a/aom_ports/aom_timer.h b/aom_ports/aom_timer.h
index ff58799..642c5a0 100644
--- a/aom_ports/aom_timer.h
+++ b/aom_ports/aom_timer.h
@@ -14,10 +14,11 @@
#include "config/aom_config.h"
-#include "aom/aom_integer.h"
-
#if CONFIG_OS_SUPPORT
+#include <stddef.h>
+#include <stdint.h>
+
#if defined(_WIN32)
/*
* Win32 specific includes
diff --git a/aom_ports/arm_cpudetect.c b/aom_ports/arm_cpudetect.c
index 305b22c..276ef61 100644
--- a/aom_ports/arm_cpudetect.c
+++ b/aom_ports/arm_cpudetect.c
@@ -57,12 +57,14 @@
}
#elif defined(_MSC_VER) /* end !CONFIG_RUNTIME_CPU_DETECT || __APPLE__ */
+#if HAVE_NEON && !AOM_ARCH_AARCH64
/*For GetExceptionCode() and EXCEPTION_ILLEGAL_INSTRUCTION.*/
#undef WIN32_LEAN_AND_MEAN
#define WIN32_LEAN_AND_MEAN
#undef WIN32_EXTRA_LEAN
#define WIN32_EXTRA_LEAN
#include <windows.h>
+#endif // HAVE_NEON && !AOM_ARCH_AARCH64
int aom_arm_cpu_caps(void) {
int flags;
@@ -71,6 +73,9 @@
return flags;
}
mask = arm_cpu_env_mask();
+#if AOM_ARCH_AARCH64
+ return HAS_NEON & mask;
+#else
/* MSVC has no inline __asm support for ARM, but it does let you __emit
* instructions via their assembled hex code.
* All of these instructions should be essentially nops.
@@ -85,8 +90,9 @@
/*Ignore exception.*/
}
}
-#endif /* HAVE_NEON */
+#endif /* HAVE_NEON */
return flags & mask;
+#endif // AOM_ARCH_AARCH64
}
#elif defined(__ANDROID__) /* end _MSC_VER */
diff --git a/aom_ports/bitops.h b/aom_ports/bitops.h
index 44df173..3c5b992 100644
--- a/aom_ports/bitops.h
+++ b/aom_ports/bitops.h
@@ -13,6 +13,7 @@
#define AOM_AOM_PORTS_BITOPS_H_
#include <assert.h>
+#include <stdint.h>
#include "aom_ports/msvc.h"
#include "config/aom_config.h"
@@ -32,7 +33,13 @@
// Returns (int)floor(log2(n)). n must be > 0.
// These versions of get_msb() are only valid when n != 0 because all
// of the optimized versions are undefined when n == 0.
-// https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
+
+// get_byteswap64:
+// Returns the number (uint64_t) with byte-positions reversed
+// e.g. input 0x123456789ABCDEF0 returns 0xF0DEBC9A78563412
+
+// GCC compiler: https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
+// MSVC: https://learn.microsoft.com/en-us/cpp/c-runtime-library/
// use GNU builtins where available.
#if defined(__GNUC__) && \
@@ -41,6 +48,10 @@
assert(n != 0);
return 31 ^ __builtin_clz(n);
}
+
+static INLINE uint64_t get_byteswap64(uint64_t num) {
+ return __builtin_bswap64(num);
+}
#elif defined(USE_MSC_INTRINSICS)
#pragma intrinsic(_BitScanReverse)
@@ -50,17 +61,19 @@
_BitScanReverse(&first_set_bit, n);
return first_set_bit;
}
+
+static INLINE uint64_t get_byteswap64(uint64_t num) {
+ return _byteswap_uint64(num);
+}
#undef USE_MSC_INTRINSICS
#else
static INLINE int get_msb(unsigned int n) {
int log = 0;
unsigned int value = n;
- int i;
assert(n != 0);
- for (i = 4; i >= 0; --i) {
- const int shift = (1 << i);
+ for (int shift = 16; shift != 0; shift >>= 1) {
const unsigned int x = value >> shift;
if (x != 0) {
value = x;
@@ -69,6 +82,26 @@
}
return log;
}
+
+static INLINE uint64_t get_byteswap64(uint64_t num) {
+ uint64_t out = 0x00;
+ uint64_t mask = 0xFF00000000000000;
+ int bit_shift = 56; // 7 bytes
+ // 4 ms bytes
+ do {
+ out |= (num & mask) >> bit_shift;
+ mask >>= 8;
+ bit_shift -= 16;
+ } while (bit_shift >= 0);
+ // 4 ls bytes
+ bit_shift = 8; // 1 byte
+ do {
+ out |= (num & mask) << bit_shift;
+ mask >>= 8;
+ bit_shift += 16;
+ } while (bit_shift <= 56);
+ return out;
+}
#endif
#ifdef __cplusplus
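For reference, an independent loop-based byteswap that can be asserted against the example value documented above (the harness is mine, not libaom's):

#include <assert.h>
#include <stdint.h>

// Build the output byte by byte, least-significant byte of 'num' first.
static uint64_t byteswap64_ref(uint64_t num) {
  uint64_t out = 0;
  for (int i = 0; i < 8; i++) out = (out << 8) | ((num >> (8 * i)) & 0xFF);
  return out;
}

int main(void) {
  assert(byteswap64_ref(0x123456789ABCDEF0ULL) == 0xF0DEBC9A78563412ULL);
  return 0;
}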
diff --git a/aom_ports/x86.h b/aom_ports/x86.h
index d44d386..c089984 100644
--- a/aom_ports/x86.h
+++ b/aom_ports/x86.h
@@ -44,7 +44,7 @@
} aom_cpu_t;
#if defined(__GNUC__) && __GNUC__ || defined(__ANDROID__)
-#if ARCH_X86_64
+#if AOM_ARCH_X86_64
#define cpuid(func, func2, ax, bx, cx, dx) \
__asm__ __volatile__("cpuid \n\t" \
: "=a"(ax), "=b"(bx), "=c"(cx), "=d"(dx) \
@@ -60,7 +60,7 @@
#endif
#elif defined(__SUNPRO_C) || \
defined(__SUNPRO_CC) /* end __GNUC__ or __ANDROID__*/
-#if ARCH_X86_64
+#if AOM_ARCH_X86_64
#define cpuid(func, func2, ax, bx, cx, dx) \
asm volatile( \
"xchg %rsi, %rbx \n\t" \
@@ -80,7 +80,7 @@
: "a"(func), "c"(func2))
#endif
#else /* end __SUNPRO__ */
-#if ARCH_X86_64
+#if AOM_ARCH_X86_64
#if defined(_MSC_VER) && _MSC_VER > 1500
#define cpuid(func, func2, a, b, c, d) \
do { \
@@ -258,7 +258,7 @@
asm volatile("rdtsc\n\t" : "=a"(tsc) :);
return tsc;
#else
-#if ARCH_X86_64
+#if AOM_ARCH_X86_64
return (unsigned int)__rdtsc();
#else
__asm rdtsc;
@@ -276,7 +276,7 @@
asm volatile("rdtsc\n\t" : "=a"(lo), "=d"(hi));
return ((uint64_t)hi << 32) | lo;
#else
-#if ARCH_X86_64
+#if AOM_ARCH_X86_64
return (uint64_t)__rdtsc();
#else
__asm rdtsc;
@@ -298,7 +298,7 @@
unsigned int ui;
return (unsigned int)__rdtscp(&ui);
#else
-#if ARCH_X86_64
+#if AOM_ARCH_X86_64
return (unsigned int)__rdtscp();
#else
__asm rdtscp;
@@ -336,7 +336,7 @@
#elif defined(__SUNPRO_C) || defined(__SUNPRO_CC)
#define x86_pause_hint() asm volatile("pause \n\t")
#else
-#if ARCH_X86_64
+#if AOM_ARCH_X86_64
#define x86_pause_hint() _mm_pause();
#else
#define x86_pause_hint() __asm pause
@@ -361,7 +361,7 @@
asm volatile("fstcw %0\n\t" : "=m"(*&mode) :);
return mode;
}
-#elif ARCH_X86_64
+#elif AOM_ARCH_X86_64
/* No fldcw intrinsics on Windows x64, punt to external asm */
extern void aom_winx64_fldcw(unsigned short mode);
extern unsigned short aom_winx64_fstcw(void);
diff --git a/aom_scale/aom_scale_rtcd.pl b/aom_scale/aom_scale_rtcd.pl
index e84b6f9..ae0a856 100644
--- a/aom_scale/aom_scale_rtcd.pl
+++ b/aom_scale/aom_scale_rtcd.pl
@@ -26,7 +26,7 @@
add_proto qw/void aom_vertical_band_2_1_scale_i/, "unsigned char *source, int src_pitch, unsigned char *dest, int dest_pitch, unsigned int dest_width";
}
-add_proto qw/int aom_yv12_realloc_with_new_border/, "struct yv12_buffer_config *ybf, int new_border, int byte_alignment, int num_planes";
+add_proto qw/int aom_yv12_realloc_with_new_border/, "struct yv12_buffer_config *ybf, int new_border, int byte_alignment, int num_pyramid_levels, int num_planes";
add_proto qw/void aom_yv12_extend_frame_borders/, "struct yv12_buffer_config *ybf, const int num_planes";
diff --git a/aom_scale/generic/yv12config.c b/aom_scale/generic/yv12config.c
index de56263..82376f4 100644
--- a/aom_scale/generic/yv12config.c
+++ b/aom_scale/generic/yv12config.c
@@ -12,6 +12,8 @@
#include <assert.h>
#include "aom/internal/aom_image_internal.h"
+#include "aom_dsp/pyramid.h"
+#include "aom_dsp/flow_estimation/corner_detect.h"
#include "aom_mem/aom_mem.h"
#include "aom_ports/mem.h"
#include "aom_scale/yv12config.h"
@@ -31,7 +33,14 @@
if (ybf->buffer_alloc_sz > 0) {
aom_free(ybf->buffer_alloc);
}
- if (ybf->y_buffer_8bit) aom_free(ybf->y_buffer_8bit);
+#if CONFIG_AV1_ENCODER && !CONFIG_REALTIME_ONLY
+ if (ybf->y_pyramid) {
+ aom_free_pyramid(ybf->y_pyramid);
+ }
+ if (ybf->corners) {
+ av1_free_corner_list(ybf->corners);
+ }
+#endif // CONFIG_AV1_ENCODER && !CONFIG_REALTIME_ONLY
aom_remove_metadata_from_frame_buffer(ybf);
/* buffer_alloc isn't accessed by most functions. Rather y_buffer,
u_buffer and v_buffer point to buffer_alloc and are used. Clear out
@@ -51,7 +60,7 @@
const uint64_t uvplane_size, const int aligned_width,
const int aligned_height, const int uv_width, const int uv_height,
const int uv_stride, const int uv_border_w, const int uv_border_h,
- int alloc_y_buffer_8bit, int alloc_y_plane_only) {
+ int num_pyramid_levels, int alloc_y_plane_only) {
if (ybf) {
const int aom_byte_align = (byte_alignment == 0) ? 1 : byte_alignment;
const uint64_t frame_size =
@@ -59,11 +68,24 @@
uint8_t *buf = NULL;
+#if CONFIG_REALTIME_ONLY || !CONFIG_AV1_ENCODER
+ // We should only need an 8-bit version of the source frame if we are
+ // encoding in non-realtime mode
+ (void)num_pyramid_levels;
+ assert(num_pyramid_levels == 0);
+#endif // CONFIG_REALTIME_ONLY || !CONFIG_AV1_ENCODER
+
#if defined AOM_MAX_ALLOCABLE_MEMORY
// The size of ybf->buffer_alloc.
uint64_t alloc_size = frame_size;
- // The size of ybf->y_buffer_8bit.
- if (use_highbitdepth) alloc_size += yplane_size;
+#if CONFIG_AV1_ENCODER && !CONFIG_REALTIME_ONLY
+ // The size of ybf->y_pyramid
+ if (num_pyramid_levels > 0) {
+ alloc_size += aom_get_pyramid_alloc_size(
+ width, height, num_pyramid_levels, use_highbitdepth);
+ alloc_size += av1_get_corner_list_size();
+ }
+#endif // CONFIG_AV1_ENCODER && !CONFIG_REALTIME_ONLY
// The decoder may allocate REF_FRAMES frame buffers in the frame buffer
// pool. Bound the total amount of allocated memory as if these REF_FRAMES
// frame buffers were allocated in a single allocation.
@@ -159,17 +181,21 @@
ybf->use_external_reference_buffers = 0;
- if (use_highbitdepth && alloc_y_buffer_8bit) {
- if (ybf->y_buffer_8bit) aom_free(ybf->y_buffer_8bit);
- ybf->y_buffer_8bit = (uint8_t *)aom_memalign(32, (size_t)yplane_size);
- if (!ybf->y_buffer_8bit) return AOM_CODEC_MEM_ERROR;
- } else {
- if (ybf->y_buffer_8bit) {
- aom_free(ybf->y_buffer_8bit);
- ybf->y_buffer_8bit = NULL;
- ybf->buf_8bit_valid = 0;
- }
+#if CONFIG_AV1_ENCODER && !CONFIG_REALTIME_ONLY
+ if (ybf->y_pyramid) {
+ aom_free_pyramid(ybf->y_pyramid);
+ ybf->y_pyramid = NULL;
}
+ if (ybf->corners) {
+ av1_free_corner_list(ybf->corners);
+ ybf->corners = NULL;
+ }
+ if (num_pyramid_levels > 0) {
+ ybf->y_pyramid = aom_alloc_pyramid(width, height, num_pyramid_levels,
+ use_highbitdepth);
+ ybf->corners = av1_alloc_corner_list();
+ }
+#endif // CONFIG_AV1_ENCODER && !CONFIG_REALTIME_ONLY
ybf->corrupted = 0; /* assume not corrupted by errors */
return 0;
@@ -209,7 +235,7 @@
int border, int byte_alignment,
aom_codec_frame_buffer_t *fb,
aom_get_frame_buffer_cb_fn_t cb, void *cb_priv,
- int alloc_y_buffer_8bit, int alloc_y_plane_only) {
+ int num_pyramid_levels, int alloc_y_plane_only) {
#if CONFIG_SIZE_LIMIT
if (width > DECODE_WIDTH_LIMIT || height > DECODE_HEIGHT_LIMIT)
return AOM_CODEC_MEM_ERROR;
@@ -236,19 +262,21 @@
ybf, width, height, ss_x, ss_y, use_highbitdepth, border,
byte_alignment, fb, cb, cb_priv, y_stride, yplane_size, uvplane_size,
aligned_width, aligned_height, uv_width, uv_height, uv_stride,
- uv_border_w, uv_border_h, alloc_y_buffer_8bit, alloc_y_plane_only);
+ uv_border_w, uv_border_h, num_pyramid_levels, alloc_y_plane_only);
}
return AOM_CODEC_MEM_ERROR;
}
int aom_alloc_frame_buffer(YV12_BUFFER_CONFIG *ybf, int width, int height,
int ss_x, int ss_y, int use_highbitdepth, int border,
- int byte_alignment, int alloc_y_plane_only) {
+ int byte_alignment, int num_pyramid_levels,
+ int alloc_y_plane_only) {
if (ybf) {
aom_free_frame_buffer(ybf);
return aom_realloc_frame_buffer(ybf, width, height, ss_x, ss_y,
use_highbitdepth, border, byte_alignment,
- NULL, NULL, NULL, 0, alloc_y_plane_only);
+ NULL, NULL, NULL, num_pyramid_levels,
+ alloc_y_plane_only);
}
return AOM_CODEC_MEM_ERROR;
}
diff --git a/aom_scale/generic/yv12extend.c b/aom_scale/generic/yv12extend.c
index 997ff54..5546112 100644
--- a/aom_scale/generic/yv12extend.c
+++ b/aom_scale/generic/yv12extend.c
@@ -491,7 +491,8 @@
}
int aom_yv12_realloc_with_new_border_c(YV12_BUFFER_CONFIG *ybf, int new_border,
- int byte_alignment, int num_planes) {
+ int byte_alignment,
+ int num_pyramid_levels, int num_planes) {
if (ybf) {
if (new_border == ybf->border) return 0;
YV12_BUFFER_CONFIG new_buf;
@@ -499,7 +500,7 @@
const int error = aom_alloc_frame_buffer(
&new_buf, ybf->y_crop_width, ybf->y_crop_height, ybf->subsampling_x,
ybf->subsampling_y, ybf->flags & YV12_FLAG_HIGHBITDEPTH, new_border,
- byte_alignment, 0);
+ byte_alignment, num_pyramid_levels, 0);
if (error) return error;
// Copy image buffer
aom_yv12_copy_frame(ybf, &new_buf, num_planes);
diff --git a/aom_scale/yv12config.h b/aom_scale/yv12config.h
index 581e923..f192a30 100644
--- a/aom_scale/yv12config.h
+++ b/aom_scale/yv12config.h
@@ -32,6 +32,11 @@
#define AOM_ENC_ALLINTRA_BORDER 64
#define AOM_DEC_BORDER_IN_PIXELS 64
+#if CONFIG_AV1_ENCODER && !CONFIG_REALTIME_ONLY
+struct image_pyramid;
+struct corner_list;
+#endif // CONFIG_AV1_ENCODER && !CONFIG_REALTIME_ONLY
+
/*!\endcond */
/*!
* \brief YV12 frame buffer data structure
@@ -90,10 +95,12 @@
// external reference frame is no longer used.
uint8_t *store_buf_adr[3];
- // If the frame is stored in a 16-bit buffer, this stores an 8-bit version
- // for use in global motion detection. It is allocated on-demand.
- uint8_t *y_buffer_8bit;
- int buf_8bit_valid;
+ // Global motion search data
+#if CONFIG_AV1_ENCODER && !CONFIG_REALTIME_ONLY
+ // 8-bit downsampling pyramid for the Y plane
+ struct image_pyramid *y_pyramid;
+ struct corner_list *corners;
+#endif // CONFIG_AV1_ENCODER && !CONFIG_REALTIME_ONLY
uint8_t *buffer_alloc;
size_t buffer_alloc_sz;
@@ -121,23 +128,44 @@
#define YV12_FLAG_HIGHBITDEPTH 8
+// Allocate a frame buffer
+//
+// If ybf currently contains an image, all associated memory will be freed and
+// then reallocated. In contrast, aom_realloc_frame_buffer() will reuse any
+// existing allocations where possible. So, if ybf is likely to already be
+// set up, please consider aom_realloc_frame_buffer() instead.
+//
+// See aom_realloc_frame_buffer() for the meanings of the arguments, and
+// available return values.
int aom_alloc_frame_buffer(YV12_BUFFER_CONFIG *ybf, int width, int height,
int ss_x, int ss_y, int use_highbitdepth, int border,
- int byte_alignment, int alloc_y_plane_only);
+ int byte_alignment, int num_pyramid_levels,
+ int alloc_y_plane_only);
// Updates the yv12 buffer config with the frame buffer. |byte_alignment| must
// be a power of 2, from 32 to 1024. 0 sets legacy alignment. If cb is not
// NULL, then libaom is using the frame buffer callbacks to handle memory.
// If cb is not NULL, libaom will call cb with minimum size in bytes needed
// to decode the current frame. If cb is NULL, libaom will allocate memory
-// internally to decode the current frame. Returns 0 on success. Returns < 0
-// on failure.
+// internally to decode the current frame.
+//
+// If num_pyramid_levels > 0, then an image pyramid will be allocated with
+// the specified number of levels.
+//
+// Any buffer which may become a source or ref frame buffer in the encoder
+// must have num_pyramid_levels = cpi->image_pyramid_levels. This will cause
+// an image pyramid to be allocated if one is needed.
+//
+// Any other buffers (in particular, any buffers inside the decoder)
+// must have num_pyramid_levels = 0, as a pyramid is unneeded there.
+//
+// Returns 0 on success. Returns < 0 on failure.
int aom_realloc_frame_buffer(YV12_BUFFER_CONFIG *ybf, int width, int height,
int ss_x, int ss_y, int use_highbitdepth,
int border, int byte_alignment,
aom_codec_frame_buffer_t *fb,
aom_get_frame_buffer_cb_fn_t cb, void *cb_priv,
- int alloc_y_buffer_8bit, int alloc_y_plane_only);
+ int num_pyramid_levels, int alloc_y_plane_only);
int aom_free_frame_buffer(YV12_BUFFER_CONFIG *ybf);
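For illustration, a hedged usage sketch of the reworked allocator (not part of
the patch; the dimensions and the 4-level pyramid depth are hypothetical,
standing in for cpi->image_pyramid_levels):

    #include <string.h>
    #include "aom_scale/yv12config.h"

    YV12_BUFFER_CONFIG buf;
    memset(&buf, 0, sizeof(buf));
    // Encoder-side source/ref buffer: request an image pyramid.
    if (aom_alloc_frame_buffer(&buf, 1920, 1080, /*ss_x=*/1, /*ss_y=*/1,
                               /*use_highbitdepth=*/0, AOM_BORDER_IN_PIXELS,
                               /*byte_alignment=*/0, /*num_pyramid_levels=*/4,
                               /*alloc_y_plane_only=*/0) != 0) {
      // allocation failed (AOM_CODEC_MEM_ERROR)
    }
    // Decoder-side and realtime buffers must pass num_pyramid_levels = 0.
    aom_free_frame_buffer(&buf);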
diff --git a/aom_util/aom_thread.c b/aom_util/aom_thread.c
index 3916021..2c62b24 100644
--- a/aom_util/aom_thread.c
+++ b/aom_util/aom_thread.c
@@ -62,22 +62,30 @@
pthread_setname_np(pthread_self(), thread_name);
}
#endif
- int done = 0;
- while (!done) {
- pthread_mutex_lock(&worker->impl_->mutex_);
+ pthread_mutex_lock(&worker->impl_->mutex_);
+ for (;;) {
while (worker->status_ == OK) { // wait in idling mode
pthread_cond_wait(&worker->impl_->condition_, &worker->impl_->mutex_);
}
if (worker->status_ == WORK) {
+ // When worker->status_ is WORK, the main thread doesn't change
+ // worker->status_ and will wait until the worker changes worker->status_
+ // to OK. See change_state(). So the worker can safely call execute()
+ // without holding worker->impl_->mutex_. When the worker reacquires
+ // worker->impl_->mutex_, worker->status_ must still be WORK.
+ pthread_mutex_unlock(&worker->impl_->mutex_);
execute(worker);
+ pthread_mutex_lock(&worker->impl_->mutex_);
+ assert(worker->status_ == WORK);
worker->status_ = OK;
- } else if (worker->status_ == NOT_OK) { // finish the worker
- done = 1;
+ // signal to the main thread that we're done (for sync())
+ pthread_cond_signal(&worker->impl_->condition_);
+ } else {
+ assert(worker->status_ == NOT_OK); // finish the worker
+ break;
}
- // signal to the main thread that we're done (for sync())
- pthread_cond_signal(&worker->impl_->condition_);
- pthread_mutex_unlock(&worker->impl_->mutex_);
}
+ pthread_mutex_unlock(&worker->impl_->mutex_);
return THREAD_RETURN(NULL); // Thread is finished
}
diff --git a/aom_util/aom_thread.h b/aom_util/aom_thread.h
index 7db7924..2df190f 100644
--- a/aom_util/aom_thread.h
+++ b/aom_util/aom_thread.h
@@ -143,151 +143,6 @@
ok = SleepConditionVariableCS(condition, mutex, INFINITE);
return !ok;
}
-#elif defined(__OS2__)
-#define INCL_DOS
-#include <os2.h> // NOLINT
-
-#include <errno.h> // NOLINT
-#include <stdlib.h> // NOLINT
-#include <sys/builtin.h> // NOLINT
-
-#define pthread_t TID
-#define pthread_mutex_t HMTX
-
-typedef struct {
- HEV event_sem_;
- HEV ack_sem_;
- volatile unsigned wait_count_;
-} pthread_cond_t;
-
-//------------------------------------------------------------------------------
-// simplistic pthread emulation layer
-
-#define THREADFN void *
-#define THREAD_RETURN(val) (val)
-
-typedef struct {
- void *(*start_)(void *);
- void *arg_;
-} thread_arg;
-
-static void thread_start(void *arg) {
- thread_arg targ = *(thread_arg *)arg;
- free(arg);
-
- targ.start_(targ.arg_);
-}
-
-static INLINE int pthread_create(pthread_t *const thread, const void *attr,
- void *(*start)(void *), void *arg) {
- int tid;
- thread_arg *targ = (thread_arg *)malloc(sizeof(*targ));
- if (targ == NULL) return 1;
-
- (void)attr;
-
- targ->start_ = start;
- targ->arg_ = arg;
- tid = (pthread_t)_beginthread(thread_start, NULL, 1024 * 1024, targ);
- if (tid == -1) {
- free(targ);
- return 1;
- }
-
- *thread = tid;
- return 0;
-}
-
-static INLINE int pthread_join(pthread_t thread, void **value_ptr) {
- (void)value_ptr;
- return DosWaitThread(&thread, DCWW_WAIT) != 0;
-}
-
-// Mutex
-static INLINE int pthread_mutex_init(pthread_mutex_t *const mutex,
- void *mutexattr) {
- (void)mutexattr;
- return DosCreateMutexSem(NULL, mutex, 0, FALSE) != 0;
-}
-
-static INLINE int pthread_mutex_trylock(pthread_mutex_t *const mutex) {
- return DosRequestMutexSem(*mutex, SEM_IMMEDIATE_RETURN) == 0 ? 0 : EBUSY;
-}
-
-static INLINE int pthread_mutex_lock(pthread_mutex_t *const mutex) {
- return DosRequestMutexSem(*mutex, SEM_INDEFINITE_WAIT) != 0;
-}
-
-static INLINE int pthread_mutex_unlock(pthread_mutex_t *const mutex) {
- return DosReleaseMutexSem(*mutex) != 0;
-}
-
-static INLINE int pthread_mutex_destroy(pthread_mutex_t *const mutex) {
- return DosCloseMutexSem(*mutex) != 0;
-}
-
-// Condition
-static INLINE int pthread_cond_destroy(pthread_cond_t *const condition) {
- int ok = 1;
- ok &= DosCloseEventSem(condition->event_sem_) == 0;
- ok &= DosCloseEventSem(condition->ack_sem_) == 0;
- return !ok;
-}
-
-static INLINE int pthread_cond_init(pthread_cond_t *const condition,
- void *cond_attr) {
- int ok = 1;
- (void)cond_attr;
-
- ok &=
- DosCreateEventSem(NULL, &condition->event_sem_, DCE_POSTONE, FALSE) == 0;
- ok &= DosCreateEventSem(NULL, &condition->ack_sem_, DCE_POSTONE, FALSE) == 0;
- if (!ok) {
- pthread_cond_destroy(condition);
- return 1;
- }
- condition->wait_count_ = 0;
- return 0;
-}
-
-static INLINE int pthread_cond_signal(pthread_cond_t *const condition) {
- int ok = 1;
-
- if (!__atomic_cmpxchg32(&condition->wait_count_, 0, 0)) {
- ok &= DosPostEventSem(condition->event_sem_) == 0;
- ok &= DosWaitEventSem(condition->ack_sem_, SEM_INDEFINITE_WAIT) == 0;
- }
-
- return !ok;
-}
-
-static INLINE int pthread_cond_broadcast(pthread_cond_t *const condition) {
- int ok = 1;
-
- while (!__atomic_cmpxchg32(&condition->wait_count_, 0, 0))
- ok &= pthread_cond_signal(condition) == 0;
-
- return !ok;
-}
-
-static INLINE int pthread_cond_wait(pthread_cond_t *const condition,
- pthread_mutex_t *const mutex) {
- int ok = 1;
-
- __atomic_increment(&condition->wait_count_);
-
- ok &= pthread_mutex_unlock(mutex) == 0;
-
- ok &= DosWaitEventSem(condition->event_sem_, SEM_INDEFINITE_WAIT) == 0;
-
- __atomic_decrement(&condition->wait_count_);
-
- ok &= DosPostEventSem(condition->ack_sem_) == 0;
-
- pthread_mutex_lock(mutex);
-
- return !ok;
-}
#else // _WIN32
#include <pthread.h> // NOLINT
#define THREADFN void *
diff --git a/aom_util/aom_util.cmake b/aom_util/aom_util.cmake
index 1a1bfe1..6bf4faf 100644
--- a/aom_util/aom_util.cmake
+++ b/aom_util/aom_util.cmake
@@ -15,9 +15,12 @@
list(APPEND AOM_UTIL_SOURCES "${AOM_ROOT}/aom_util/aom_thread.c"
"${AOM_ROOT}/aom_util/aom_thread.h"
- "${AOM_ROOT}/aom_util/endian_inl.h"
- "${AOM_ROOT}/aom_util/debug_util.c"
- "${AOM_ROOT}/aom_util/debug_util.h")
+ "${AOM_ROOT}/aom_util/endian_inl.h")
+
+if(CONFIG_BITSTREAM_DEBUG)
+ list(APPEND AOM_UTIL_SOURCES "${AOM_ROOT}/aom_util/debug_util.c"
+ "${AOM_ROOT}/aom_util/debug_util.h")
+endif()
# Creates the aom_util build target and makes libaom depend on it. The libaom
# target must exist before this function is called.
diff --git a/aom_util/debug_util.c b/aom_util/debug_util.c
index 3e9c314..7b24550 100644
--- a/aom_util/debug_util.c
+++ b/aom_util/debug_util.c
@@ -32,7 +32,7 @@
int aom_bitstream_queue_get_frame_read(void) { return frame_idx_r; }
#if CONFIG_BITSTREAM_DEBUG
-#define QUEUE_MAX_SIZE 2000000
+#define QUEUE_MAX_SIZE 4000000
static int result_queue[QUEUE_MAX_SIZE];
static int nsymbs_queue[QUEUE_MAX_SIZE];
static aom_cdf_prob cdf_queue[QUEUE_MAX_SIZE][16];
diff --git a/apps/aomdec.c b/apps/aomdec.c
index ab4a37f..1efc091 100644
--- a/apps/aomdec.c
+++ b/apps/aomdec.c
@@ -9,7 +9,6 @@
* PATENTS file, you can obtain it at www.aomedia.org/license/patent.
*/
-#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
@@ -118,10 +117,18 @@
};
#if CONFIG_LIBYUV
-static INLINE int libyuv_scale(aom_image_t *src, aom_image_t *dst,
+// Returns 0 on success and returns -1 on failure.
+static INLINE int libyuv_scale(const aom_image_t *src, aom_image_t *dst,
FilterModeEnum mode) {
+ if (src->fmt != dst->fmt) {
+ fprintf(stderr,
+ "%s failed to scale output frame because format changed from %s to "
+ "%s\n",
+ exec_name, image_format_to_string(dst->fmt),
+ image_format_to_string(src->fmt));
+ return -1;
+ }
if (src->fmt == AOM_IMG_FMT_I42016) {
- assert(dst->fmt == AOM_IMG_FMT_I42016);
return I420Scale_16(
(uint16_t *)src->planes[AOM_PLANE_Y], src->stride[AOM_PLANE_Y] / 2,
(uint16_t *)src->planes[AOM_PLANE_U], src->stride[AOM_PLANE_U] / 2,
@@ -131,15 +138,18 @@
dst->stride[AOM_PLANE_U] / 2, (uint16_t *)dst->planes[AOM_PLANE_V],
dst->stride[AOM_PLANE_V] / 2, dst->d_w, dst->d_h, mode);
}
- assert(src->fmt == AOM_IMG_FMT_I420);
- assert(dst->fmt == AOM_IMG_FMT_I420);
- return I420Scale(src->planes[AOM_PLANE_Y], src->stride[AOM_PLANE_Y],
- src->planes[AOM_PLANE_U], src->stride[AOM_PLANE_U],
- src->planes[AOM_PLANE_V], src->stride[AOM_PLANE_V], src->d_w,
- src->d_h, dst->planes[AOM_PLANE_Y], dst->stride[AOM_PLANE_Y],
- dst->planes[AOM_PLANE_U], dst->stride[AOM_PLANE_U],
- dst->planes[AOM_PLANE_V], dst->stride[AOM_PLANE_V], dst->d_w,
- dst->d_h, mode);
+ if (src->fmt == AOM_IMG_FMT_I420) {
+ return I420Scale(src->planes[AOM_PLANE_Y], src->stride[AOM_PLANE_Y],
+ src->planes[AOM_PLANE_U], src->stride[AOM_PLANE_U],
+ src->planes[AOM_PLANE_V], src->stride[AOM_PLANE_V],
+ src->d_w, src->d_h, dst->planes[AOM_PLANE_Y],
+ dst->stride[AOM_PLANE_Y], dst->planes[AOM_PLANE_U],
+ dst->stride[AOM_PLANE_U], dst->planes[AOM_PLANE_V],
+ dst->stride[AOM_PLANE_V], dst->d_w, dst->d_h, mode);
+ }
+ fprintf(stderr, "%s cannot scale output frame of format %s\n", exec_name,
+ image_format_to_string(src->fmt));
+ return -1;
}
#endif
@@ -371,7 +381,7 @@
case '7': snprintf(q, q_len - 1, "%07d", frame_in); break;
case '8': snprintf(q, q_len - 1, "%08d", frame_in); break;
case '9': snprintf(q, q_len - 1, "%09d", frame_in); break;
- default: die("Unrecognized pattern %%%c\n", p[1]); break;
+ default: die("Unrecognized pattern %%%c\n", p[1]);
}
pat_len = strlen(q);
@@ -878,7 +888,7 @@
if (img->d_w != scaled_img->d_w || img->d_h != scaled_img->d_h) {
#if CONFIG_LIBYUV
- libyuv_scale(img, scaled_img, kFilterBox);
+ if (libyuv_scale(img, scaled_img, kFilterBox) != 0) goto fail;
img = scaled_img;
#else
fprintf(
diff --git a/apps/aomenc.c b/apps/aomenc.c
index ef208fd..09306f2 100644
--- a/apps/aomenc.c
+++ b/apps/aomenc.c
@@ -237,6 +237,8 @@
AV1E_SET_ENABLE_TX_SIZE_SEARCH,
AV1E_SET_LOOPFILTER_CONTROL,
AV1E_SET_AUTO_INTRA_TOOLS_OFF,
+ AV1E_ENABLE_RATE_GUIDE_DELTAQ,
+ AV1E_SET_RATE_DISTRIBUTION_INFO,
0 };
const arg_def_t *main_args[] = { &g_av1_codec_arg_defs.help,
@@ -437,6 +439,8 @@
#endif
&g_av1_codec_arg_defs.dv_cost_upd_freq,
&g_av1_codec_arg_defs.partition_info_path,
+ &g_av1_codec_arg_defs.enable_rate_guide_deltaq,
+ &g_av1_codec_arg_defs.rate_distribution_info,
&g_av1_codec_arg_defs.enable_directional_intra,
&g_av1_codec_arg_defs.enable_tx_size_search,
&g_av1_codec_arg_defs.loopfilter_control,
@@ -533,6 +537,8 @@
const char *vmaf_model_path;
#endif
const char *partition_info_path;
+ unsigned int enable_rate_guide_deltaq;
+ const char *rate_distribution_info;
aom_color_range_t color_range;
const char *two_pass_input;
const char *two_pass_output;
@@ -1130,6 +1136,12 @@
} else if (arg_match(&arg, &g_av1_codec_arg_defs.partition_info_path,
argi)) {
config->partition_info_path = arg.val;
+ } else if (arg_match(&arg, &g_av1_codec_arg_defs.enable_rate_guide_deltaq,
+ argi)) {
+ config->enable_rate_guide_deltaq = arg_parse_uint(&arg);
+ } else if (arg_match(&arg, &g_av1_codec_arg_defs.rate_distribution_info,
+ argi)) {
+ config->rate_distribution_info = arg.val;
} else if (arg_match(&arg, &g_av1_codec_arg_defs.use_fixed_qp_offsets,
argi)) {
config->cfg.use_fixed_qp_offsets = arg_parse_uint(&arg);
@@ -1294,21 +1306,6 @@
}
}
-static const char *image_format_to_string(aom_img_fmt_t f) {
- switch (f) {
- case AOM_IMG_FMT_I420: return "I420";
- case AOM_IMG_FMT_I422: return "I422";
- case AOM_IMG_FMT_I444: return "I444";
- case AOM_IMG_FMT_YV12: return "YV12";
- case AOM_IMG_FMT_NV12: return "NV12";
- case AOM_IMG_FMT_YV1216: return "YV1216";
- case AOM_IMG_FMT_I42016: return "I42016";
- case AOM_IMG_FMT_I42216: return "I42216";
- case AOM_IMG_FMT_I44416: return "I44416";
- default: return "Other";
- }
-}
-
static void show_stream_config(struct stream_state *stream,
struct AvxEncoderConfig *global,
struct AvxInputContext *input) {
@@ -1540,6 +1537,16 @@
AV1E_SET_PARTITION_INFO_PATH,
stream->config.partition_info_path);
}
+ if (stream->config.enable_rate_guide_deltaq) {
+ AOM_CODEC_CONTROL_TYPECHECKED(&stream->encoder,
+ AV1E_ENABLE_RATE_GUIDE_DELTAQ,
+ stream->config.enable_rate_guide_deltaq);
+ }
+ if (stream->config.rate_distribution_info) {
+ AOM_CODEC_CONTROL_TYPECHECKED(&stream->encoder,
+ AV1E_SET_RATE_DISTRIBUTION_INFO,
+ stream->config.rate_distribution_info);
+ }
if (stream->config.film_grain_filename) {
AOM_CODEC_CONTROL_TYPECHECKED(&stream->encoder, AV1E_SET_FILM_GRAIN_TABLE,
diff --git a/av1/arg_defs.c b/av1/arg_defs.c
index abfd4b3..35a2ab4 100644
--- a/av1/arg_defs.c
+++ b/av1/arg_defs.c
@@ -47,6 +47,7 @@
{ "vmaf", AOM_TUNE_VMAF_MAX_GAIN },
{ "vmaf_neg", AOM_TUNE_VMAF_NEG_MAX_GAIN },
{ "butteraugli", AOM_TUNE_BUTTERAUGLI },
+ { "vmaf_saliency_map", AOM_TUNE_VMAF_SALIENCY_MAP },
{ NULL, 0 }
};
@@ -226,13 +227,17 @@
ARG_DEF(NULL, "use-16bit-internal", 0, "Force use of 16-bit pipeline"),
.dropframe_thresh =
ARG_DEF(NULL, "drop-frame", 1, "Temporal resampling threshold (buf %)"),
- .resize_mode = ARG_DEF(NULL, "resize-mode", 1, "Frame resize mode"),
+ .resize_mode = ARG_DEF(
+ NULL, "resize-mode", 1,
+ "Frame resize mode (0: off (default), 1: fixed, 2: random, 3: dynamic)"),
.resize_denominator =
ARG_DEF(NULL, "resize-denominator", 1, "Frame resize denominator"),
.resize_kf_denominator = ARG_DEF(NULL, "resize-kf-denominator", 1,
"Frame resize keyframe denominator"),
.superres_mode =
- ARG_DEF(NULL, "superres-mode", 1, "Frame super-resolution mode"),
+ ARG_DEF(NULL, "superres-mode", 1,
+ "Frame super-resolution mode (0: disabled (default), 1: fixed, "
+ "2: random, 3: qthresh, 4: auto)"),
.superres_denominator = ARG_DEF(NULL, "superres-denominator", 1,
"Frame super-resolution denominator"),
.superres_kf_denominator =
@@ -495,6 +500,16 @@
#endif
.partition_info_path = ARG_DEF(NULL, "partition-info-path", 1,
"Partition information read and write path"),
+ .enable_rate_guide_deltaq =
+ ARG_DEF(NULL, "enable-rate-guide-deltaq", 1,
+              "Enable rate guide deltaq (1), off by default (0). "
+ "It requires --deltaq-mode=3. "
+ "If turned on, it requires an input file specified "
+ "by --rate-distribution-info."),
+ .rate_distribution_info =
+ ARG_DEF(NULL, "rate-distribution-info", 1,
+              "Rate distribution information input. "
+ "It requires --enable-rate-guide-deltaq=1."),
.film_grain_test = ARG_DEF(
NULL, "film-grain-test", 1,
"Film grain test vectors (0: none (default), 1: test-1 2: test-2, "
diff --git a/av1/arg_defs.h b/av1/arg_defs.h
index e15a84c..b9d0cfe 100644
--- a/av1/arg_defs.h
+++ b/av1/arg_defs.h
@@ -21,6 +21,7 @@
#include "common/webmenc.h"
#endif
#include "aom/aomcx.h"
+#include "aom_dsp/flow_estimation/flow_estimation.h"
enum TestDecodeFatality {
TEST_DECODE_OFF,
@@ -185,6 +186,8 @@
arg_def_t vmaf_model_path;
#endif
arg_def_t partition_info_path;
+ arg_def_t enable_rate_guide_deltaq;
+ arg_def_t rate_distribution_info;
arg_def_t film_grain_test;
arg_def_t film_grain_table;
#if CONFIG_DENOISE
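Taken together with the aomenc.c hunks above, the two new options are
exercised along the lines of "aomenc --deltaq-mode=3
--enable-rate-guide-deltaq=1 --rate-distribution-info=rate_map.txt ..." (the
input file name here is hypothetical).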
diff --git a/av1/av1.cmake b/av1/av1.cmake
index ae1eba7..43b7665 100644
--- a/av1/av1.cmake
+++ b/av1/av1.cmake
@@ -214,6 +214,7 @@
"${AOM_ROOT}/av1/encoder/rd.h"
"${AOM_ROOT}/av1/encoder/rdopt.c"
"${AOM_ROOT}/av1/encoder/nonrd_pickmode.c"
+ "${AOM_ROOT}/av1/encoder/nonrd_opt.c"
"${AOM_ROOT}/av1/encoder/nonrd_opt.h"
"${AOM_ROOT}/av1/encoder/rdopt.h"
"${AOM_ROOT}/av1/encoder/rdopt_data_defs.h"
@@ -357,9 +358,11 @@
"${AOM_ROOT}/av1/encoder/arm/neon/av1_error_neon.c"
"${AOM_ROOT}/av1/encoder/arm/neon/encodetxb_neon.c"
"${AOM_ROOT}/av1/encoder/arm/neon/hybrid_fwd_txfm_neon.c"
+ "${AOM_ROOT}/av1/encoder/arm/neon/av1_k_means_neon.c"
"${AOM_ROOT}/av1/encoder/arm/neon/av1_fwd_txfm2d_neon.c"
"${AOM_ROOT}/av1/encoder/arm/neon/highbd_fwd_txfm_neon.c"
"${AOM_ROOT}/av1/encoder/arm/neon/wedge_utils_neon.c"
+ "${AOM_ROOT}/av1/encoder/arm/neon/reconinter_enc_neon.c"
"${AOM_ROOT}/av1/encoder/arm/neon/temporal_filter_neon.c")
list(APPEND AOM_AV1_ENCODER_INTRIN_ARM_CRC32
@@ -400,6 +403,11 @@
"${AOM_ROOT}/av1/encoder/tune_butteraugli.h")
endif()
+if(CONFIG_SALIENCY_MAP)
+ list(APPEND AOM_AV1_ENCODER_SOURCES "${AOM_ROOT}/av1/encoder/saliency_map.c"
+ "${AOM_ROOT}/av1/encoder/saliency_map.h")
+endif()
+
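Note that the saliency-map sources above are compiled only when the
corresponding config flag is on, e.g. configuring with
"-DCONFIG_SALIENCY_MAP=1" (assuming the flag is exposed like other libaom
config options) before "--tune=vmaf_saliency_map" can take effect.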
if(CONFIG_OPTICAL_FLOW_API)
list(APPEND AOM_AV1_ENCODER_SOURCES
"${AOM_ROOT}/av1/encoder/sparse_linear_solver.c"
@@ -437,6 +445,9 @@
"${AOM_ROOT}/av1/common/x86/highbd_wiener_convolve_avx2.c"
"${AOM_ROOT}/av1/common/x86/highbd_warp_affine_avx2.c")
+ list(APPEND AOM_AV1_COMMON_INTRIN_NEON
+ "${AOM_ROOT}/av1/common/arm/highbd_convolve_neon.c")
+
list(APPEND AOM_AV1_ENCODER_INTRIN_SSE2
"${AOM_ROOT}/av1/encoder/x86/highbd_block_error_intrin_sse2.c"
"${AOM_ROOT}/av1/encoder/x86/highbd_temporal_filter_sse2.c")
diff --git a/av1/av1_cx_iface.c b/av1/av1_cx_iface.c
index 178d966..403f994 100644
--- a/av1/av1_cx_iface.c
+++ b/av1/av1_cx_iface.c
@@ -20,13 +20,17 @@
#include "aom/aom_encoder.h"
#include "aom/internal/aom_codec_internal.h"
+#include "aom_dsp/flow_estimation/flow_estimation.h"
+
#include "av1/av1_iface_common.h"
#include "av1/encoder/bitstream.h"
#include "av1/encoder/encoder.h"
+#include "av1/encoder/encoder_alloc.h"
#include "av1/encoder/encoder_utils.h"
#include "av1/encoder/ethread.h"
#include "av1/encoder/external_partition.h"
#include "av1/encoder/firstpass.h"
+#include "av1/encoder/rc_utils.h"
#include "av1/arg_defs.h"
#include "common/args_helper.h"
@@ -53,6 +57,8 @@
aom_tune_metric tuning;
const char *vmaf_model_path;
const char *partition_info_path;
+ unsigned int enable_rate_guide_deltaq;
+ const char *rate_distribution_info;
aom_dist_metric dist_metric;
unsigned int cq_level; // constrained quality level
unsigned int rc_max_intra_bitrate_pct;
@@ -231,6 +237,8 @@
AOM_TUNE_PSNR, // tuning
"/usr/local/share/model/vmaf_v0.6.1.json", // VMAF model path
".", // partition info path
+ 0, // enable rate guide deltaq
+ "./rate_map.txt", // rate distribution input
AOM_DIST_METRIC_PSNR, // dist_metric
10, // cq_level
0, // rc_max_intra_bitrate_pct
@@ -380,6 +388,8 @@
AOM_TUNE_PSNR, // tuning
"/usr/local/share/model/vmaf_v0.6.1.json", // VMAF model path
".", // partition info path
+ 0, // enable rate guide deltaq
+ "./rate_map.txt", // rate distribution input
AOM_DIST_METRIC_PSNR, // dist_metric
10, // cq_level
0, // rc_max_intra_bitrate_pct
@@ -552,6 +562,7 @@
ratio->den /= denom;
}
+// Called by encoder_encode() only. Must not be called by encoder_init().
static aom_codec_err_t update_error_state(
aom_codec_alg_priv_t *ctx, const struct aom_internal_error_info *error) {
const aom_codec_err_t res = error->error_code;
@@ -703,12 +714,13 @@
RANGE_CHECK_HI(extra_cfg, enable_auto_alt_ref, 1);
RANGE_CHECK_HI(extra_cfg, enable_auto_bwd_ref, 2);
RANGE_CHECK(extra_cfg, cpu_used, 0,
- (cfg->g_usage == AOM_USAGE_REALTIME) ? 10 : 9);
+ (cfg->g_usage == AOM_USAGE_REALTIME) ? 11 : 9);
RANGE_CHECK_HI(extra_cfg, noise_sensitivity, 6);
RANGE_CHECK(extra_cfg, superblock_size, AOM_SUPERBLOCK_SIZE_64X64,
AOM_SUPERBLOCK_SIZE_DYNAMIC);
RANGE_CHECK_HI(cfg, large_scale_tile, 1);
RANGE_CHECK_HI(extra_cfg, single_tile_decoding, 1);
+ RANGE_CHECK_HI(extra_cfg, enable_rate_guide_deltaq, 1);
RANGE_CHECK_HI(extra_cfg, row_mt, 1);
RANGE_CHECK_HI(extra_cfg, fp_mt, 1);
@@ -761,6 +773,10 @@
ERROR("Current pass is larger than total number of passes.");
}
+ if (cfg->g_profile == (unsigned int)PROFILE_1 && cfg->monochrome) {
+ ERROR("Monochrome is not supported in profile 1");
+ }
+
if (cfg->g_profile <= (unsigned int)PROFILE_1 &&
cfg->g_bit_depth > AOM_BITS_10) {
ERROR("Codec bit-depth 12 not supported in profile < 2");
@@ -814,7 +830,7 @@
}
#endif
- RANGE_CHECK(extra_cfg, tuning, AOM_TUNE_PSNR, AOM_TUNE_BUTTERAUGLI);
+ RANGE_CHECK(extra_cfg, tuning, AOM_TUNE_PSNR, AOM_TUNE_VMAF_SALIENCY_MAP);
RANGE_CHECK(extra_cfg, dist_metric, AOM_DIST_METRIC_PSNR,
AOM_DIST_METRIC_QM_PSNR);
@@ -988,9 +1004,9 @@
extra_cfg->reduced_tx_type_set = cfg->reduced_tx_type_set;
}
-static aom_codec_err_t set_encoder_config(AV1EncoderConfig *oxcf,
- const aom_codec_enc_cfg_t *cfg,
- struct av1_extracfg *extra_cfg) {
+static void set_encoder_config(AV1EncoderConfig *oxcf,
+ const aom_codec_enc_cfg_t *cfg,
+ struct av1_extracfg *extra_cfg) {
if (cfg->encoder_cfg.init_by_cfg_file) {
update_default_encoder_config(&cfg->encoder_cfg, extra_cfg);
}
@@ -1082,16 +1098,6 @@
dec_model_cfg->decoder_model_info_present_flag = 0;
dec_model_cfg->display_model_info_present_flag = 1;
} else if (extra_cfg->timing_info_type == AOM_TIMING_DEC_MODEL) {
- // if( extra_cfg->arnr_strength > 0 )
- // {
- // printf("Only --arnr-strength=0 can currently be used with
- // --timing-info=model."); return AOM_CODEC_INVALID_PARAM;
- // }
- // if( extra_cfg->enable_superres)
- // {
- // printf("Only --superres-mode=0 can currently be used with
- // --timing-info=model."); return AOM_CODEC_INVALID_PARAM;
- // }
dec_model_cfg->num_units_in_decoding_tick = cfg->g_timebase.num;
dec_model_cfg->timing_info.equal_picture_interval = 0;
dec_model_cfg->decoder_model_info_present_flag = 1;
@@ -1156,7 +1162,19 @@
tool_cfg->enable_interintra_comp = extra_cfg->enable_interintra_comp;
tool_cfg->ref_frame_mvs_present =
extra_cfg->enable_ref_frame_mvs & extra_cfg->enable_order_hint;
- tool_cfg->enable_global_motion = extra_cfg->enable_global_motion;
+
+ // Explicitly disable global motion in a few cases:
+ // * For realtime mode, we never search global motion, and disabling
+ // it here prevents later code from allocating buffers we don't need
+ // * For large scale tile mode, some of the intended use cases expect
+ // all frame headers to be identical. This breaks if global motion is
+ // used, since global motion data is stored in the frame header.
+  //   e.g., see test/lightfield_test.sh, which checks that all frame headers
+ // are the same.
+ tool_cfg->enable_global_motion = extra_cfg->enable_global_motion &&
+ cfg->g_usage != AOM_USAGE_REALTIME &&
+ !cfg->large_scale_tile;
+
tool_cfg->error_resilient_mode =
cfg->g_error_resilient | extra_cfg->error_resilient_mode;
tool_cfg->frame_parallel_decoding_mode =
@@ -1452,13 +1470,14 @@
oxcf->partition_info_path = extra_cfg->partition_info_path;
+ oxcf->enable_rate_guide_deltaq = extra_cfg->enable_rate_guide_deltaq;
+ oxcf->rate_distribution_info = extra_cfg->rate_distribution_info;
+
oxcf->strict_level_conformance = extra_cfg->strict_level_conformance;
oxcf->kf_max_pyr_height = extra_cfg->kf_max_pyr_height;
oxcf->sb_qp_sweep = extra_cfg->sb_qp_sweep;
-
- return AOM_CODEC_OK;
}
AV1EncoderConfig av1_get_encoder_config(const aom_codec_enc_cfg_t *cfg) {
@@ -1557,7 +1576,7 @@
}
static aom_codec_err_t update_extra_cfg(aom_codec_alg_priv_t *ctx,
- struct av1_extracfg *extra_cfg) {
+ const struct av1_extracfg *extra_cfg) {
const aom_codec_err_t res = validate_config(ctx, &ctx->cfg, extra_cfg);
if (res == AOM_CODEC_OK) {
ctx->extra_cfg = *extra_cfg;
@@ -1620,22 +1639,28 @@
static aom_codec_err_t ctrl_set_row_mt(aom_codec_alg_priv_t *ctx,
va_list args) {
+ unsigned int row_mt = CAST(AV1E_SET_ROW_MT, args);
+ if (row_mt == ctx->extra_cfg.row_mt) return AOM_CODEC_OK;
struct av1_extracfg extra_cfg = ctx->extra_cfg;
- extra_cfg.row_mt = CAST(AV1E_SET_ROW_MT, args);
+ extra_cfg.row_mt = row_mt;
return update_extra_cfg(ctx, &extra_cfg);
}
static aom_codec_err_t ctrl_set_tile_columns(aom_codec_alg_priv_t *ctx,
va_list args) {
+ unsigned int tile_columns = CAST(AV1E_SET_TILE_COLUMNS, args);
+ if (tile_columns == ctx->extra_cfg.tile_columns) return AOM_CODEC_OK;
struct av1_extracfg extra_cfg = ctx->extra_cfg;
- extra_cfg.tile_columns = CAST(AV1E_SET_TILE_COLUMNS, args);
+ extra_cfg.tile_columns = tile_columns;
return update_extra_cfg(ctx, &extra_cfg);
}
static aom_codec_err_t ctrl_set_tile_rows(aom_codec_alg_priv_t *ctx,
va_list args) {
+ unsigned int tile_rows = CAST(AV1E_SET_TILE_ROWS, args);
+ if (tile_rows == ctx->extra_cfg.tile_rows) return AOM_CODEC_OK;
struct av1_extracfg extra_cfg = ctx->extra_cfg;
- extra_cfg.tile_rows = CAST(AV1E_SET_TILE_ROWS, args);
+ extra_cfg.tile_rows = tile_rows;
return update_extra_cfg(ctx, &extra_cfg);
}
@@ -2139,6 +2164,9 @@
va_list args) {
struct av1_extracfg extra_cfg = ctx->extra_cfg;
extra_cfg.aq_mode = CAST(AV1E_SET_AQ_MODE, args);
+
+ // Skip AQ mode if using fixed QP for current frame.
+ if (ctx->ppi->cpi->rc.use_external_qp_one_pass) extra_cfg.aq_mode = 0;
return update_extra_cfg(ctx, &extra_cfg);
}
@@ -2248,6 +2276,25 @@
return update_extra_cfg(ctx, &extra_cfg);
}
+static aom_codec_err_t ctrl_enable_rate_guide_deltaq(aom_codec_alg_priv_t *ctx,
+ va_list args) {
+ struct av1_extracfg extra_cfg = ctx->extra_cfg;
+ extra_cfg.enable_rate_guide_deltaq =
+ CAST(AV1E_ENABLE_RATE_GUIDE_DELTAQ, args);
+ return update_extra_cfg(ctx, &extra_cfg);
+}
+
+static aom_codec_err_t ctrl_set_rate_distribution_info(
+ aom_codec_alg_priv_t *ctx, va_list args) {
+ struct av1_extracfg extra_cfg = ctx->extra_cfg;
+ const char *str = CAST(AV1E_SET_RATE_DISTRIBUTION_INFO, args);
+ const aom_codec_err_t ret = allocate_and_set_string(
+ str, default_extra_cfg.rate_distribution_info,
+ &extra_cfg.rate_distribution_info, ctx->ppi->error.detail);
+ if (ret != AOM_CODEC_OK) return ret;
+ return update_extra_cfg(ctx, &extra_cfg);
+}
+
static aom_codec_err_t ctrl_set_film_grain_test_vector(
aom_codec_alg_priv_t *ctx, va_list args) {
struct av1_extracfg extra_cfg = ctx->extra_cfg;
@@ -2411,10 +2458,15 @@
const int val = CAST(AV1E_SET_TARGET_SEQ_LEVEL_IDX, args);
const int level = val % 100;
const int operating_point_idx = val / 100;
- if (operating_point_idx >= 0 &&
- operating_point_idx < MAX_NUM_OPERATING_POINTS) {
- extra_cfg.target_seq_level_idx[operating_point_idx] = (AV1_LEVEL)level;
+ if (operating_point_idx < 0 ||
+ operating_point_idx >= MAX_NUM_OPERATING_POINTS) {
+ char *const err_string = ctx->ppi->error.detail;
+ snprintf(err_string, ARG_ERR_MSG_MAX_LEN,
+ "Invalid operating point index: %d", operating_point_idx);
+ ctx->base.err_detail = err_string;
+ return AOM_CODEC_INVALID_PARAM;
}
+ extra_cfg.target_seq_level_idx[operating_point_idx] = (AV1_LEVEL)level;
return update_extra_cfg(ctx, &extra_cfg);
}
@@ -2484,6 +2536,39 @@
return AOM_CODEC_OK;
}
+static aom_codec_err_t ctrl_set_quantizer_one_pass(aom_codec_alg_priv_t *ctx,
+ va_list args) {
+ const int qp = CAST(AV1E_SET_QUANTIZER_ONE_PASS, args);
+
+ if (qp < 0 || qp > 63) return AOM_CODEC_INVALID_PARAM;
+
+ aom_codec_enc_cfg_t *cfg = &ctx->cfg;
+ struct av1_extracfg extra_cfg = ctx->extra_cfg;
+ cfg->rc_min_quantizer = cfg->rc_max_quantizer = qp;
+ extra_cfg.aq_mode = 0;
+ ctx->ppi->cpi->rc.use_external_qp_one_pass = 1;
+
+ return update_extra_cfg(ctx, &extra_cfg);
+}
+
+static aom_codec_err_t ctrl_set_bitrate_one_pass_cbr(aom_codec_alg_priv_t *ctx,
+ va_list args) {
+ AV1_PRIMARY *const ppi = ctx->ppi;
+ AV1_COMP *const cpi = ppi->cpi;
+ AV1EncoderConfig *oxcf = &cpi->oxcf;
+ if (!is_one_pass_rt_params(cpi) || oxcf->rc_cfg.mode != AOM_CBR ||
+ cpi->ppi->use_svc || ppi->num_fp_contexts != 1 || ppi->cpi_lap != NULL) {
+ return AOM_CODEC_INVALID_PARAM;
+ }
+ const int new_bitrate = CAST(AV1E_SET_BITRATE_ONE_PASS_CBR, args);
+ ctx->cfg.rc_target_bitrate = new_bitrate;
+ oxcf->rc_cfg.target_bandwidth = new_bitrate * 1000;
+ set_primary_rc_buffer_sizes(oxcf, ppi);
+ av1_new_framerate(cpi, cpi->framerate);
+ check_reset_rc_flag(cpi);
+ return AOM_CODEC_OK;
+}
+
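A hedged usage sketch for the two one-pass controls defined above, assuming
codec is an aom_codec_ctx_t already initialized for one-pass realtime CBR
encoding:

    // Pin the frame QP (0..63); per the handler above, this also forces
    // aq-mode to 0 via use_external_qp_one_pass.
    if (aom_codec_control(&codec, AV1E_SET_QUANTIZER_ONE_PASS, 40) !=
        AOM_CODEC_OK) {
      // out-of-range QP
    }
    // Retarget the CBR bitrate (in kbps) mid-stream without reinitializing;
    // rejected unless the encoder is one-pass RT CBR, non-SVC, single context.
    aom_codec_control(&codec, AV1E_SET_BITRATE_ONE_PASS_CBR, 600);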
#if !CONFIG_REALTIME_ONLY
aom_codec_err_t av1_create_stats_buffer(FIRSTPASS_STATS **frame_stats_buffer,
STATS_BUFFER_CTX *stats_buf_context,
@@ -2517,19 +2602,33 @@
COMPRESSOR_STAGE stage,
int lap_lag_in_frames) {
aom_codec_err_t res = AOM_CODEC_OK;
+ BufferPool *buffer_pool = *p_buffer_pool;
- if (*p_buffer_pool == NULL) {
- *p_buffer_pool = (BufferPool *)aom_calloc(1, sizeof(BufferPool));
- if (*p_buffer_pool == NULL) return AOM_CODEC_MEM_ERROR;
-
+ if (buffer_pool == NULL) {
+ buffer_pool = (BufferPool *)aom_calloc(1, sizeof(BufferPool));
+ if (buffer_pool == NULL) return AOM_CODEC_MEM_ERROR;
+ buffer_pool->num_frame_bufs =
+ (oxcf->mode == ALLINTRA) ? FRAME_BUFFERS_ALLINTRA : FRAME_BUFFERS;
+ buffer_pool->frame_bufs = (RefCntBuffer *)aom_calloc(
+ buffer_pool->num_frame_bufs, sizeof(*buffer_pool->frame_bufs));
+ if (buffer_pool->frame_bufs == NULL) {
+ buffer_pool->num_frame_bufs = 0;
+ aom_free(buffer_pool);
+ return AOM_CODEC_MEM_ERROR;
+ }
#if CONFIG_MULTITHREAD
- if (pthread_mutex_init(&((*p_buffer_pool)->pool_mutex), NULL)) {
+ if (pthread_mutex_init(&buffer_pool->pool_mutex, NULL)) {
+ aom_free(buffer_pool->frame_bufs);
+ buffer_pool->frame_bufs = NULL;
+ buffer_pool->num_frame_bufs = 0;
+ aom_free(buffer_pool);
return AOM_CODEC_MEM_ERROR;
}
#endif
+ *p_buffer_pool = buffer_pool;
}
- *p_cpi = av1_create_compressor(ppi, oxcf, *p_buffer_pool, stage,
- lap_lag_in_frames);
+ *p_cpi =
+ av1_create_compressor(ppi, oxcf, buffer_pool, stage, lap_lag_in_frames);
if (*p_cpi == NULL) res = AOM_CODEC_MEM_ERROR;
return res;
@@ -2705,6 +2804,8 @@
&extra_cfg->second_pass_log);
check_and_free_string(default_extra_cfg.partition_info_path,
&extra_cfg->partition_info_path);
+ check_and_free_string(default_extra_cfg.rate_distribution_info,
+ &extra_cfg->rate_distribution_info);
check_and_free_string(default_extra_cfg.film_grain_table_filename,
&extra_cfg->film_grain_table_filename);
}
@@ -2894,10 +2995,6 @@
if (res == AOM_CODEC_OK) {
AV1_COMP *cpi = ppi->cpi;
- const int num_layers =
- cpi->svc.number_spatial_layers * cpi->svc.number_temporal_layers;
- if (!av1_alloc_layer_context(cpi, num_layers)) return AOM_CODEC_MEM_ERROR;
-
// Set up internal flags
if (ctx->base.init_flags & AOM_CODEC_USE_PSNR) ppi->b_calculate_psnr = 1;
@@ -2944,7 +3041,7 @@
subsampling_x, subsampling_y, use_highbitdepth, lag_in_frames,
src_border_in_pixels, cpi->common.features.byte_alignment,
ctx->num_lap_buffers, (cpi->oxcf.kf_cfg.key_freq_max == 0),
- cpi->oxcf.tool_cfg.enable_global_motion);
+ cpi->image_pyramid_levels);
}
if (!ppi->lookahead)
aom_internal_error(&ppi->error, AOM_CODEC_MEM_ERROR,
@@ -3015,6 +3112,19 @@
}
#endif // CONFIG_MULTITHREAD
}
+
+  // Re-allocate thread data if the number of workers for the encoder
+  // multi-threading stage exceeds prev_num_enc_workers.
+ const int num_enc_workers =
+ av1_get_num_mod_workers_for_alloc(&ppi->p_mt_info, MOD_ENC);
+ if (ppi->p_mt_info.prev_num_enc_workers < num_enc_workers &&
+ num_enc_workers <= ppi->p_mt_info.num_workers) {
+ free_thread_data(ppi);
+ for (int j = 0; j < ppi->num_fp_contexts; j++)
+ aom_free(ppi->parallel_cpi[j]->td.tctx);
+ av1_init_tile_thread_data(ppi, cpi->oxcf.pass == AOM_RC_FIRST_PASS);
+ }
+
for (int i = 0; i < ppi->num_fp_contexts; i++) {
av1_init_frame_mt(ppi, ppi->parallel_cpi[i]);
}
@@ -3414,6 +3524,7 @@
AV1_PRIMARY *const ppi = ctx->ppi;
AV1_COMP *const cpi = ppi->cpi;
AV1_COMMON *const cm = &cpi->common;
+ AV1EncoderConfig *oxcf = &cpi->oxcf;
aom_svc_params_t *const params = va_arg(args, aom_svc_params_t *);
int64_t target_bandwidth = 0;
ppi->number_spatial_layers = params->number_spatial_layers;
@@ -3425,6 +3536,13 @@
ctx->ppi->use_svc = 1;
const int num_layers =
ppi->number_spatial_layers * ppi->number_temporal_layers;
+ for (int layer = 0; layer < num_layers; ++layer) {
+ if (params->max_quantizers[layer] > 63 ||
+ params->min_quantizers[layer] < 0 ||
+ params->min_quantizers[layer] > params->max_quantizers[layer]) {
+ return AOM_CODEC_INVALID_PARAM;
+ }
+ }
if (!av1_alloc_layer_context(cpi, num_layers)) return AOM_CODEC_MEM_ERROR;
for (sl = 0; sl < ppi->number_spatial_layers; ++sl) {
@@ -3450,7 +3568,10 @@
}
av1_init_layer_context(cpi);
}
+ oxcf->rc_cfg.target_bandwidth = target_bandwidth;
+ set_primary_rc_buffer_sizes(oxcf, cpi->ppi);
av1_update_layer_context_change_config(cpi, target_bandwidth);
+ check_reset_rc_flag(cpi);
}
av1_check_fpmt_config(ctx->ppi, &ctx->ppi->cpi->oxcf);
return AOM_CODEC_OK;
@@ -3660,6 +3781,17 @@
argv, err_string)) {
err = allocate_and_set_string(value, default_extra_cfg.partition_info_path,
&extra_cfg.partition_info_path, err_string);
+ } else if (arg_match_helper(&arg,
+ &g_av1_codec_arg_defs.enable_rate_guide_deltaq,
+ argv, err_string)) {
+ extra_cfg.enable_rate_guide_deltaq =
+ arg_parse_uint_helper(&arg, err_string);
+ } else if (arg_match_helper(&arg,
+ &g_av1_codec_arg_defs.rate_distribution_info,
+ argv, err_string)) {
+ err =
+ allocate_and_set_string(value, default_extra_cfg.rate_distribution_info,
+ &extra_cfg.rate_distribution_info, err_string);
} else if (arg_match_helper(&arg, &g_av1_codec_arg_defs.dist_metric, argv,
err_string)) {
extra_cfg.dist_metric = arg_parse_enum_helper(&arg, err_string);
@@ -3951,8 +4083,12 @@
const int val = arg_parse_int_helper(&arg, err_string);
const int level = val % 100;
const int operating_point_idx = val / 100;
- if (operating_point_idx >= 0 &&
- operating_point_idx < MAX_NUM_OPERATING_POINTS) {
+ if (operating_point_idx < 0 ||
+ operating_point_idx >= MAX_NUM_OPERATING_POINTS) {
+ snprintf(err_string, ARG_ERR_MSG_MAX_LEN,
+ "Invalid operating point index: %d", operating_point_idx);
+ err = AOM_CODEC_INVALID_PARAM;
+ } else {
extra_cfg.target_seq_level_idx[operating_point_idx] = (AV1_LEVEL)level;
}
} else if (arg_match_helper(&arg,
@@ -4050,6 +4186,16 @@
return AOM_CODEC_OK;
}
+static aom_codec_err_t ctrl_get_luma_cdef_strength(aom_codec_alg_priv_t *ctx,
+ va_list args) {
+ int *arg = va_arg(args, int *);
+ AV1_COMMON const *cm = &ctx->ppi->cpi->common;
+ if (arg == NULL) return AOM_CODEC_INVALID_PARAM;
+ memcpy(arg, cm->cdef_info.cdef_strengths, CDEF_MAX_STRENGTHS * sizeof(*arg));
+
+ return AOM_CODEC_OK;
+}
+
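A matching sketch for the new getter (codec as above; the array must hold
CDEF_MAX_STRENGTHS entries, 16 in AV1):

    int cdef_strengths[16];  // CDEF_MAX_STRENGTHS entries
    if (aom_codec_control(&codec, AV1E_GET_LUMA_CDEF_STRENGTH,
                          cdef_strengths) == AOM_CODEC_OK) {
      // cdef_strengths[i] holds the luma strength for CDEF index i
    }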
static aom_codec_ctrl_fn_map_t encoder_ctrl_maps[] = {
{ AV1_COPY_REFERENCE, ctrl_copy_reference },
{ AOME_USE_REFERENCE, ctrl_use_reference },
@@ -4165,6 +4311,8 @@
{ AV1E_SET_SINGLE_TILE_DECODING, ctrl_set_single_tile_decoding },
{ AV1E_SET_VMAF_MODEL_PATH, ctrl_set_vmaf_model_path },
{ AV1E_SET_PARTITION_INFO_PATH, ctrl_set_partition_info_path },
+ { AV1E_ENABLE_RATE_GUIDE_DELTAQ, ctrl_enable_rate_guide_deltaq },
+ { AV1E_SET_RATE_DISTRIBUTION_INFO, ctrl_set_rate_distribution_info },
{ AV1E_SET_FILM_GRAIN_TEST_VECTOR, ctrl_set_film_grain_test_vector },
{ AV1E_SET_FILM_GRAIN_TABLE, ctrl_set_film_grain_table },
{ AV1E_SET_DENOISE_NOISE_LEVEL, ctrl_set_denoise_noise_level },
@@ -4190,6 +4338,8 @@
{ AV1E_SET_SKIP_POSTPROC_FILTERING, ctrl_set_skip_postproc_filtering },
{ AV1E_SET_AUTO_INTRA_TOOLS_OFF, ctrl_set_auto_intra_tools_off },
{ AV1E_SET_RTC_EXTERNAL_RC, ctrl_set_rtc_external_rc },
+ { AV1E_SET_QUANTIZER_ONE_PASS, ctrl_set_quantizer_one_pass },
+ { AV1E_SET_BITRATE_ONE_PASS_CBR, ctrl_set_bitrate_one_pass_cbr },
// Getters
{ AOME_GET_LAST_QUANTIZER, ctrl_get_quantizer },
@@ -4205,6 +4355,7 @@
{ AV1E_GET_BASELINE_GF_INTERVAL, ctrl_get_baseline_gf_interval },
{ AV1E_GET_TARGET_SEQ_LEVEL_IDX, ctrl_get_target_seq_level_idx },
{ AV1E_GET_NUM_OPERATING_POINTS, ctrl_get_num_operating_points },
+ { AV1E_GET_LUMA_CDEF_STRENGTH, ctrl_get_luma_cdef_strength },
CTRL_MAP_END,
};
diff --git a/av1/av1_dx_iface.c b/av1/av1_dx_iface.c
index 809268f..a1e7558 100644
--- a/av1/av1_dx_iface.c
+++ b/av1/av1_dx_iface.c
@@ -121,16 +121,13 @@
AV1Decoder *const pbi = frame_worker_data->pbi;
aom_free(pbi->common.tpl_mvs);
pbi->common.tpl_mvs = NULL;
- av1_remove_common(&frame_worker_data->pbi->common);
+ av1_remove_common(&pbi->common);
av1_free_cdef_buffers(&pbi->common, &pbi->cdef_worker, &pbi->cdef_sync);
av1_free_cdef_sync(&pbi->cdef_sync);
av1_free_restoration_buffers(&pbi->common);
av1_decoder_remove(pbi);
}
aom_free(frame_worker_data);
-#if CONFIG_MULTITHREAD
- pthread_mutex_destroy(&ctx->buffer_pool->pool_mutex);
-#endif
}
if (ctx->buffer_pool) {
@@ -140,6 +137,9 @@
}
av1_free_ref_frame_buffers(ctx->buffer_pool);
av1_free_internal_frame_buffers(&ctx->buffer_pool->int_frame_buffers);
+#if CONFIG_MULTITHREAD
+ pthread_mutex_destroy(&ctx->buffer_pool->pool_mutex);
+#endif
}
aom_free(ctx->frame_worker);
@@ -428,9 +428,23 @@
ctx->buffer_pool = (BufferPool *)aom_calloc(1, sizeof(BufferPool));
if (ctx->buffer_pool == NULL) return AOM_CODEC_MEM_ERROR;
+ ctx->buffer_pool->num_frame_bufs = FRAME_BUFFERS;
+ ctx->buffer_pool->frame_bufs = (RefCntBuffer *)aom_calloc(
+ ctx->buffer_pool->num_frame_bufs, sizeof(*ctx->buffer_pool->frame_bufs));
+ if (ctx->buffer_pool->frame_bufs == NULL) {
+ ctx->buffer_pool->num_frame_bufs = 0;
+ aom_free(ctx->buffer_pool);
+ ctx->buffer_pool = NULL;
+ return AOM_CODEC_MEM_ERROR;
+ }
#if CONFIG_MULTITHREAD
if (pthread_mutex_init(&ctx->buffer_pool->pool_mutex, NULL)) {
+ aom_free(ctx->buffer_pool->frame_bufs);
+ ctx->buffer_pool->frame_bufs = NULL;
+ ctx->buffer_pool->num_frame_bufs = 0;
+ aom_free(ctx->buffer_pool);
+ ctx->buffer_pool = NULL;
set_error_detail(ctx, "Failed to allocate buffer pool mutex");
return AOM_CODEC_MEM_ERROR;
}
@@ -443,18 +457,24 @@
}
AVxWorker *const worker = ctx->frame_worker;
- FrameWorkerData *frame_worker_data = NULL;
winterface->init(worker);
worker->thread_name = "aom frameworker";
worker->data1 = aom_memalign(32, sizeof(FrameWorkerData));
if (worker->data1 == NULL) {
+ winterface->end(worker);
+ aom_free(worker);
+ ctx->frame_worker = NULL;
set_error_detail(ctx, "Failed to allocate frame_worker_data");
return AOM_CODEC_MEM_ERROR;
}
- frame_worker_data = (FrameWorkerData *)worker->data1;
+ FrameWorkerData *frame_worker_data = (FrameWorkerData *)worker->data1;
frame_worker_data->pbi = av1_decoder_create(ctx->buffer_pool);
if (frame_worker_data->pbi == NULL) {
- set_error_detail(ctx, "Failed to allocate frame_worker_data");
+ winterface->end(worker);
+ aom_free(frame_worker_data);
+ aom_free(worker);
+ ctx->frame_worker = NULL;
+ set_error_detail(ctx, "Failed to allocate frame_worker_data->pbi");
return AOM_CODEC_MEM_ERROR;
}
frame_worker_data->frame_context_ready = 0;
diff --git a/av1/common/alloccommon.c b/av1/common/alloccommon.c
index 677078d..6e95f70 100644
--- a/av1/common/alloccommon.c
+++ b/av1/common/alloccommon.c
@@ -36,7 +36,7 @@
void av1_free_ref_frame_buffers(BufferPool *pool) {
int i;
- for (i = 0; i < FRAME_BUFFERS; ++i) {
+ for (i = 0; i < pool->num_frame_bufs; ++i) {
if (pool->frame_bufs[i].ref_count > 0 &&
pool->frame_bufs[i].raw_frame_buffer.data != NULL) {
pool->release_fb_cb(pool->cb_priv, &pool->frame_bufs[i].raw_frame_buffer);
@@ -51,6 +51,9 @@
pool->frame_bufs[i].seg_map = NULL;
aom_free_frame_buffer(&pool->frame_bufs[i].buf);
}
+ aom_free(pool->frame_bufs);
+ pool->frame_bufs = NULL;
+ pool->num_frame_bufs = 0;
}
static INLINE void free_cdef_linebuf_conditional(
@@ -286,12 +289,12 @@
}
// Assumes cm->rst_info[p].restoration_unit_size is already initialized
-void av1_alloc_restoration_buffers(AV1_COMMON *cm) {
+void av1_alloc_restoration_buffers(AV1_COMMON *cm, bool is_sgr_enabled) {
const int num_planes = av1_num_planes(cm);
for (int p = 0; p < num_planes; ++p)
av1_alloc_restoration_struct(cm, &cm->rst_info[p], p > 0);
- if (cm->rst_tmpbuf == NULL) {
+ if (cm->rst_tmpbuf == NULL && is_sgr_enabled) {
CHECK_MEM_ERROR(cm, cm->rst_tmpbuf,
(int32_t *)aom_memalign(16, RESTORATION_TMPBUF_SIZE));
}
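Related context for the two alloccommon.c hunks above: frame_bufs is no longer
a fixed FRAME_BUFFERS-sized array embedded in BufferPool but a heap allocation
sized at pool creation (FRAME_BUFFERS_ALLINTRA for all-intra encodes; see the
av1_cx_iface.c and av1_dx_iface.c hunks), so the free path now also releases
the array and resets num_frame_bufs. Likewise, rst_tmpbuf is only allocated
when self-guided restoration is actually in use.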
diff --git a/av1/common/alloccommon.h b/av1/common/alloccommon.h
index fc4a8ba..d31b4c5 100644
--- a/av1/common/alloccommon.h
+++ b/av1/common/alloccommon.h
@@ -14,6 +14,8 @@
#define INVALID_IDX -1 // Invalid buffer index.
+#include <stdbool.h>
+
#include "config/aom_config.h"
#include "av1/common/enums.h"
@@ -48,7 +50,7 @@
void av1_free_cdef_buffers(struct AV1Common *const cm,
struct AV1CdefWorker **cdef_worker,
struct AV1CdefSyncData *cdef_sync);
-void av1_alloc_restoration_buffers(struct AV1Common *cm);
+void av1_alloc_restoration_buffers(struct AV1Common *cm, bool is_sgr_enabled);
void av1_free_restoration_buffers(struct AV1Common *cm);
int av1_alloc_state_buffers(struct AV1Common *cm, int width, int height);
diff --git a/av1/common/arm/av1_inv_txfm_neon.c b/av1/common/arm/av1_inv_txfm_neon.c
index 1628cbf..8afcd1f 100644
--- a/av1/common/arm/av1_inv_txfm_neon.c
+++ b/av1/common/arm/av1_inv_txfm_neon.c
@@ -467,12 +467,13 @@
}
static INLINE void load_buffer_32bit_to_16bit_neon(const int32_t *input,
+ int stride,
int16x8_t *const a,
int out_size) {
- for (int i = 0; i < 8; ++i) {
+ for (int i = 0; i < out_size; ++i) {
a[i] = vcombine_s16(vmovn_s32(vld1q_s32(input)),
vmovn_s32(vld1q_s32(input + 4)));
- input += out_size;
+ input += stride;
}
}
@@ -3590,28 +3591,22 @@
const int buf_size_w_div8 = txfm_size_col >> 3;
const int rect_type = get_rect_tx_log_ratio(txfm_size_col, txfm_size_row);
const int buf_size_nonzero_h_div8 = (eoby + 8) >> 3;
- const int buf_size_nonzero_w_div8 = (eobx + 8) >> 3;
- const int32_t *input_1;
+ const int buf_size_nonzero_w = (eobx + 8) >> 3 << 3;
+ const int input_stride = txfm_size_row;
int temp_b = 0;
for (int i = 0; i < buf_size_nonzero_h_div8; i++) {
- input_1 = input;
- for (int j = 0; j < buf_size_nonzero_w_div8; ++j) {
- int k = j * 8 + i * txfm_size_col;
- load_buffer_32bit_to_16bit_neon(input_1, &a[k], txfm_size_col);
- transpose_s16_8x8q(&a[k], &a[k]);
- input_1 += 8;
- }
- input += (txfm_size_col * 8);
+ int16x8_t *cur_a = &a[i * txfm_size_col];
+ load_buffer_32bit_to_16bit_neon(input, input_stride, cur_a,
+ buf_size_nonzero_w);
+ input += 8;
if (abs(rect_type) == 1) {
- int y = i * txfm_size_col;
- round_shift_for_rect(&a[y], &a[y], txfm_size_col);
+ round_shift_for_rect(cur_a, cur_a, buf_size_nonzero_w);
}
- identity_txfm_round_neon(&a[i * txfm_size_col], &a[i * txfm_size_col],
- txw_idx, txfm_size_col, -shift[0]);
+ identity_txfm_round_neon(cur_a, cur_a, txw_idx, buf_size_nonzero_w,
+ -shift[0]);
for (int j = 0; j < buf_size_w_div8; ++j) {
- int k = j * 8 + i * txfm_size_col;
- transpose_s16_8x8q(&a[k], &b[temp_b + txfm_size_row * j]);
+ transpose_s16_8x8q(&cur_a[j * 8], &b[temp_b + txfm_size_row * j]);
}
temp_b += 8;
}
@@ -3646,9 +3641,9 @@
const int rect_type = get_rect_tx_log_ratio(txfm_size_col, txfm_size_row);
const int buf_size_w_div8 = txfm_size_col >> 3;
const int buf_size_nonzero_h_div8 = (eoby + 8) >> 3;
- const int buf_size_nonzero_w_div8 = (eobx + 8) >> 3;
+ const int buf_size_nonzero_w = (eobx + 8) >> 3 << 3;
+ const int input_stride = txfm_size_row;
const int fun_idx_x = lowbd_txfm_all_1d_zeros_idx[eobx];
- const int32_t *input_1;
int temp_b = 0;
const transform_neon row_txfm =
lowbd_txfm_all_1d_zeros_w_arr[txw_idx][hitx_1d_tab[tx_type]][fun_idx_x];
@@ -3658,33 +3653,26 @@
get_flip_cfg(tx_type, &ud_flip, &lr_flip);
for (int i = 0; i < buf_size_nonzero_h_div8; i++) {
- input_1 = input;
- for (int j = 0; j < buf_size_nonzero_w_div8; ++j) {
- int k = j * 8 + i * txfm_size_col;
- load_buffer_32bit_to_16bit_neon(input_1, &a[k], txfm_size_col);
- transpose_s16_8x8q(&a[k], &a[k]);
- input_1 += 8;
- }
- input += (txfm_size_col * 8);
+ int16x8_t *cur_a = &a[i * txfm_size_col];
+ load_buffer_32bit_to_16bit_neon(input, input_stride, cur_a,
+ buf_size_nonzero_w);
+ input += 8;
if (abs(rect_type) == 1) {
- int y = i * txfm_size_col;
- round_shift_for_rect(&a[y], &a[y], txfm_size_col);
+ round_shift_for_rect(cur_a, cur_a, buf_size_nonzero_w);
}
- row_txfm(&a[i * txfm_size_col], &a[i * txfm_size_col], INV_COS_BIT);
- av1_round_shift_array_16_neon(&a[i * txfm_size_col], txfm_size_col,
- -shift[0]);
+ row_txfm(cur_a, cur_a, INV_COS_BIT);
+ av1_round_shift_array_16_neon(cur_a, txfm_size_col, -shift[0]);
if (lr_flip == 1) {
for (int j = 0; j < buf_size_w_div8; ++j) {
- int k = j * 8 + i * txfm_size_col;
- flip_buf_ud_neon(&a[k], 8);
+ flip_buf_ud_neon(&cur_a[j * 8], 8);
transpose_s16_8x8q(
- &a[k], &b[temp_b + txfm_size_row * (buf_size_w_div8 - 1 - j)]);
+ &cur_a[j * 8],
+ &b[temp_b + txfm_size_row * (buf_size_w_div8 - 1 - j)]);
}
temp_b += 8;
} else {
for (int j = 0; j < buf_size_w_div8; ++j) {
- int k = j * 8 + i * txfm_size_col;
- transpose_s16_8x8q(&a[k], &b[temp_b + txfm_size_row * j]);
+ transpose_s16_8x8q(&cur_a[j * 8], &b[temp_b + txfm_size_row * j]);
}
temp_b += 8;
}
@@ -3720,9 +3708,9 @@
const int buf_size_w_div8 = txfm_size_col >> 3;
const int rect_type = get_rect_tx_log_ratio(txfm_size_col, txfm_size_row);
const int buf_size_nonzero_h_div8 = (eoby + 8) >> 3;
- const int buf_size_nonzero_w_div8 = (eobx + 8) >> 3;
+ const int buf_size_nonzero_w = (eobx + 8) >> 3 << 3;
+ const int input_stride = txfm_size_row;
const int fun_idx_y = lowbd_txfm_all_1d_zeros_idx[eoby];
- const int32_t *input_1;
int temp_b = 0;
const transform_neon col_txfm =
lowbd_txfm_all_1d_zeros_w_arr[txh_idx][vitx_1d_tab[tx_type]][fun_idx_y];
@@ -3732,23 +3720,17 @@
get_flip_cfg(tx_type, &ud_flip, &lr_flip);
for (int i = 0; i < buf_size_nonzero_h_div8; i++) {
- input_1 = input;
- for (int j = 0; j < buf_size_nonzero_w_div8; ++j) {
- int k = j * 8 + i * txfm_size_col;
- load_buffer_32bit_to_16bit_neon(input_1, &a[k], txfm_size_col);
- transpose_s16_8x8q(&a[k], &a[k]);
- input_1 += 8;
- }
- input += (txfm_size_col * 8);
+ int16x8_t *cur_a = &a[i * txfm_size_col];
+ load_buffer_32bit_to_16bit_neon(input, input_stride, cur_a,
+ buf_size_nonzero_w);
+ input += 8;
if (abs(rect_type) == 1) {
- int y = i * txfm_size_col;
- round_shift_for_rect(&a[y], &a[y], txfm_size_col);
+ round_shift_for_rect(cur_a, cur_a, buf_size_nonzero_w);
}
- identity_txfm_round_neon(&a[i * txfm_size_col], &a[i * txfm_size_col],
- txw_idx, txfm_size_col, -shift[0]);
+ identity_txfm_round_neon(cur_a, cur_a, txw_idx, buf_size_nonzero_w,
+ -shift[0]);
for (int j = 0; j < buf_size_w_div8; ++j) {
- int k = j * 8 + i * txfm_size_col;
- transpose_s16_8x8q(&a[k], &b[temp_b + txfm_size_row * j]);
+ transpose_s16_8x8q(&cur_a[j * 8], &b[temp_b + txfm_size_row * j]);
}
temp_b += 8;
}
@@ -3796,9 +3778,11 @@
get_flip_cfg(tx_type, &ud_flip, &lr_flip);
for (int i = 0; i < txfm_size_row; i++) {
- row_txfm(input, buf_ptr, INV_COS_BIT, stage_range);
+ for (int c = 0; c < txfm_size_col; ++c)
+ temp_in[c] = input[c * txfm_size_row];
+ row_txfm(temp_in, buf_ptr, INV_COS_BIT, stage_range);
- input += txfm_size_col;
+ input++;
buf_ptr += txfm_size_col;
}
@@ -3858,11 +3842,12 @@
get_flip_cfg(tx_type, &ud_flip, &lr_flip);
for (int i = 0; i < txfm_size_row; i++) {
- for (int j = 0; j < txfm_size_col; j++)
- temp_in[j] = round_shift((int64_t)input[j] * NewInvSqrt2, NewSqrt2Bits);
+ for (int c = 0; c < txfm_size_col; c++)
+ temp_in[c] = round_shift((int64_t)input[c * txfm_size_row] * NewInvSqrt2,
+ NewSqrt2Bits);
row_txfm(temp_in, buf_ptr, INV_COS_BIT, stage_range);
- input += txfm_size_col;
+ input++;
buf_ptr += txfm_size_col;
}
@@ -3922,11 +3907,12 @@
get_flip_cfg(tx_type, &ud_flip, &lr_flip);
for (int i = 0; i < txfm_size_row; i++) {
- for (int j = 0; j < txfm_size_col; j++)
- temp_in[j] = round_shift((int64_t)input[j] * NewInvSqrt2, NewSqrt2Bits);
+ for (int c = 0; c < txfm_size_col; c++)
+ temp_in[c] = round_shift((int64_t)input[c * txfm_size_row] * NewInvSqrt2,
+ NewSqrt2Bits);
row_txfm(temp_in, buf_ptr, INV_COS_BIT, stage_range);
- input += txfm_size_col;
+ input++;
buf_ptr += txfm_size_col;
}
@@ -3986,9 +3972,11 @@
get_flip_cfg(tx_type, &ud_flip, &lr_flip);
for (int i = 0; i < txfm_size_row; i++) {
- row_txfm(input, buf_ptr, INV_COS_BIT, stage_range);
+ for (int c = 0; c < txfm_size_col; c++)
+ temp_in[c] = input[c * txfm_size_row];
+ row_txfm(temp_in, buf_ptr, INV_COS_BIT, stage_range);
av1_round_shift_array(buf_ptr, txfm_size_col, -shift[0]);
- input += txfm_size_col;
+ input++;
buf_ptr += txfm_size_col;
}
@@ -4048,9 +4036,11 @@
get_flip_cfg(tx_type, &ud_flip, &lr_flip);
for (int i = 0; i < txfm_size_row; i++) {
- row_txfm(input, buf_ptr, INV_COS_BIT, stage_range);
+ for (int c = 0; c < txfm_size_col; c++)
+ temp_in[c] = input[c * txfm_size_row];
+ row_txfm(temp_in, buf_ptr, INV_COS_BIT, stage_range);
av1_round_shift_array(buf_ptr, txfm_size_col, -shift[0]);
- input += txfm_size_col;
+ input++;
buf_ptr += txfm_size_col;
}
@@ -4097,11 +4087,10 @@
const int rect_type = get_rect_tx_log_ratio(txfm_size_col, txfm_size_row);
const int buf_size_w_div8 = txfm_size_col >> 3;
const int buf_size_nonzero_h_div8 = (eoby + 8) >> 3;
- const int buf_size_nonzero_w_div8 = (eobx + 8) >> 3;
- const int input_stride = AOMMIN(32, txfm_size_col);
+ const int buf_size_nonzero_w = (eobx + 8) >> 3 << 3;
+ const int input_stride = AOMMIN(32, txfm_size_row);
const int fun_idx_x = lowbd_txfm_all_1d_zeros_idx[eobx];
const int fun_idx_y = lowbd_txfm_all_1d_zeros_idx[eoby];
- const int32_t *input_1;
int temp_b = 0;
const transform_neon row_txfm =
@@ -4115,33 +4104,26 @@
get_flip_cfg(tx_type, &ud_flip, &lr_flip);
for (int i = 0; i < buf_size_nonzero_h_div8; i++) {
- input_1 = input;
- for (int j = 0; j < buf_size_nonzero_w_div8; ++j) {
- int k = j * 8 + i * txfm_size_col;
- load_buffer_32bit_to_16bit_neon(input_1, &a[k], input_stride);
- transpose_s16_8x8q(&a[k], &a[k]);
- input_1 += 8;
- }
- input += (input_stride * 8);
+ int16x8_t *cur_a = &a[i * txfm_size_col];
+ load_buffer_32bit_to_16bit_neon(input, input_stride, cur_a,
+ buf_size_nonzero_w);
+ input += 8;
if (abs(rect_type) == 1) {
- int y = i * txfm_size_col;
- round_shift_for_rect(&a[y], &a[y], input_stride);
+ round_shift_for_rect(cur_a, cur_a, buf_size_nonzero_w);
}
- row_txfm(&a[i * txfm_size_col], &a[i * txfm_size_col], INV_COS_BIT);
- av1_round_shift_array_16_neon(&a[i * txfm_size_col], txfm_size_col,
- -shift[0]);
+ row_txfm(cur_a, cur_a, INV_COS_BIT);
+ av1_round_shift_array_16_neon(cur_a, txfm_size_col, -shift[0]);
if (lr_flip == 1) {
for (int j = 0; j < buf_size_w_div8; ++j) {
- int k = j * 8 + i * txfm_size_col;
- flip_buf_ud_neon(&a[k], 8);
+ flip_buf_ud_neon(&cur_a[j * 8], 8);
transpose_s16_8x8q(
- &a[k], &b[temp_b + txfm_size_row * (buf_size_w_div8 - 1 - j)]);
+ &cur_a[j * 8],
+ &b[temp_b + txfm_size_row * (buf_size_w_div8 - 1 - j)]);
}
temp_b += 8;
} else {
for (int j = 0; j < buf_size_w_div8; ++j) {
- int k = j * 8 + i * txfm_size_col;
- transpose_s16_8x8q(&a[k], &b[temp_b + txfm_size_row * j]);
+ transpose_s16_8x8q(&cur_a[j * 8], &b[temp_b + txfm_size_row * j]);
}
temp_b += 8;
}
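The av1_inv_txfm_neon.c hunks above all follow one pattern: the 32-bit
coefficient buffer is now traversed with a stride of txfm_size_row (note the
matching input[c * txfm_size_row] / input++ changes), i.e. it is read in
transposed order. Each 8-row strip can therefore be fetched with a single
strided load_buffer_32bit_to_16bit_neon() call instead of the previous per-8x8
load plus transpose_s16_8x8q(), and the loader now honors out_size instead of
always reading 8 vectors.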
diff --git a/av1/common/arm/blend_a64_hmask_neon.c b/av1/common/arm/blend_a64_hmask_neon.c
index 89252ef..baad328 100644
--- a/av1/common/arm/blend_a64_hmask_neon.c
+++ b/av1/common/arm/blend_a64_hmask_neon.c
@@ -34,8 +34,6 @@
uint8x8_t tmp0, tmp1;
uint8x16_t res_q;
uint16x8_t res, res_low, res_high;
- uint32x2_t tmp0_32 = vdup_n_u32(0), tmp1_32 = vdup_n_u32(0);
- uint16x4_t tmp0_16 = vdup_n_u16(0), tmp1_16 = vdup_n_u16(0);
const uint8x8_t vdup_64 = vdup_n_u8((uint8_t)64);
if (w >= 16) {
@@ -91,10 +89,8 @@
__builtin_prefetch(src0 + 1 * src0_stride);
__builtin_prefetch(src1 + 0 * src1_stride);
__builtin_prefetch(src1 + 1 * src1_stride);
- load_unaligned_u8_4x2(src0, src0_stride, &tmp0_32);
- tmp0 = vreinterpret_u8_u32(tmp0_32);
- load_unaligned_u8_4x2(src1, src1_stride, &tmp1_32);
- tmp1 = vreinterpret_u8_u32(tmp1_32);
+ tmp0 = load_unaligned_u8_4x2(src0, src0_stride);
+ tmp1 = load_unaligned_u8_4x2(src1, src1_stride);
res = vmull_u8(m, tmp0);
res = vmlal_u8(res, max_minus_m, tmp1);
const uint8x8_t result = vrshrn_n_u16(res, AOM_BLEND_A64_ROUND_BITS);
@@ -113,10 +109,8 @@
__builtin_prefetch(src0 + 1 * src0_stride);
__builtin_prefetch(src1 + 0 * src1_stride);
__builtin_prefetch(src1 + 1 * src1_stride);
- load_unaligned_u8_2x2(src0, src0_stride, &tmp0_16);
- tmp0 = vreinterpret_u8_u16(tmp0_16);
- load_unaligned_u8_2x2(src1, src1_stride, &tmp1_16);
- tmp1 = vreinterpret_u8_u16(tmp1_16);
+ tmp0 = load_unaligned_u8_2x2(src0, src0_stride);
+ tmp1 = load_unaligned_u8_2x2(src1, src1_stride);
res = vmull_u8(m, tmp0);
res = vmlal_u8(res, max_minus_m, tmp1);
const uint8x8_t result = vrshrn_n_u16(res, AOM_BLEND_A64_ROUND_BITS);
diff --git a/av1/common/arm/blend_a64_vmask_neon.c b/av1/common/arm/blend_a64_vmask_neon.c
index 2132fbd..c316977 100644
--- a/av1/common/arm/blend_a64_vmask_neon.c
+++ b/av1/common/arm/blend_a64_vmask_neon.c
@@ -27,8 +27,6 @@
uint8x8_t tmp0, tmp1;
uint8x16_t tmp0_q, tmp1_q, res_q;
uint16x8_t res, res_low, res_high;
- uint32x2_t tmp0_32 = vdup_n_u32(0), tmp1_32 = vdup_n_u32(0);
- uint16x4_t tmp0_16 = vdup_n_u16(0), tmp1_16 = vdup_n_u16(0);
assert(IMPLIES(src0 == dst, src0_stride == dst_stride));
assert(IMPLIES(src1 == dst, src1_stride == dst_stride));
@@ -89,10 +87,8 @@
const uint16x4_t max_minus_m2 = vdup_n_u16(64 - (uint16_t)mask[i + 1]);
const uint8x8_t max_minus_m =
vmovn_u16(vcombine_u16(max_minus_m1, max_minus_m2));
- load_unaligned_u8_4x2(src0, src0_stride, &tmp0_32);
- tmp0 = vreinterpret_u8_u32(tmp0_32);
- load_unaligned_u8_4x2(src1, src1_stride, &tmp1_32);
- tmp1 = vreinterpret_u8_u32(tmp1_32);
+ tmp0 = load_unaligned_u8_4x2(src0, src0_stride);
+ tmp1 = load_unaligned_u8_4x2(src1, src1_stride);
res = vmull_u8(m, tmp0);
res = vmlal_u8(res, max_minus_m, tmp1);
const uint8x8_t result = vrshrn_n_u16(res, AOM_BLEND_A64_ROUND_BITS);
@@ -118,10 +114,8 @@
const uint16x4x2_t max_minus_m_trn = vtrn_u16(
vreinterpret_u16_u8(max_minus_m1), vreinterpret_u16_u8(max_minus_m2));
const uint8x8_t max_minus_m = vreinterpret_u8_u16(max_minus_m_trn.val[0]);
- load_unaligned_u8_2x2(src0, src0_stride, &tmp0_16);
- tmp0 = vreinterpret_u8_u16(tmp0_16);
- load_unaligned_u8_2x2(src1, src1_stride, &tmp1_16);
- tmp1 = vreinterpret_u8_u16(tmp1_16);
+ tmp0 = load_unaligned_u8_2x2(src0, src0_stride);
+ tmp1 = load_unaligned_u8_2x2(src1, src1_stride);
res = vmull_u8(m, tmp0);
res = vmlal_u8(res, max_minus_m, tmp1);
const uint8x8_t result = vrshrn_n_u16(res, AOM_BLEND_A64_ROUND_BITS);
diff --git a/av1/common/arm/cfl_neon.c b/av1/common/arm/cfl_neon.c
index 371be5f..0871b4f 100644
--- a/av1/common/arm/cfl_neon.c
+++ b/av1/common/arm/cfl_neon.c
@@ -10,6 +10,7 @@
*/
#include <arm_neon.h>
+#include "config/aom_config.h"
#include "config/av1_rtcd.h"
#include "av1/common/cfl.h"
@@ -31,12 +32,12 @@
// Store half of a vector.
static INLINE void vsth_u16(uint16_t *ptr, uint16x4_t val) {
- *((uint32_t *)ptr) = vreinterpret_u32_u16(val)[0];
+ vst1_lane_u32((uint32_t *)ptr, vreinterpret_u32_u16(val), 0);
}
// Store half of a vector.
static INLINE void vsth_u8(uint8_t *ptr, uint8x8_t val) {
- *((uint32_t *)ptr) = vreinterpret_u32_u8(val)[0];
+ vst1_lane_u32((uint32_t *)ptr, vreinterpret_u32_u8(val), 0);
}
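The replaced stores wrote through a casted uint32_t pointer and indexed the vector with the GNU subscript extension, which assumes 4-byte alignment and is aliasing-unsafe; vst1_lane_u32 expresses the same single 32-bit store through the intrinsic layer. An equivalent, fully portable scalar form for comparison (a sketch, not the library's code):

#include <arm_neon.h>
#include <string.h>

// Store the low 4 bytes of an 8-byte vector with no alignment assumption.
static inline void vsth_u8_portable(uint8_t *ptr, uint8x8_t val) {
  const uint32_t v = vget_lane_u32(vreinterpret_u32_u8(val), 0);
  memcpy(ptr, &v, 4);
}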
static void cfl_luma_subsampling_420_lbd_neon(const uint8_t *input,
@@ -132,7 +133,7 @@
}
#if CONFIG_AV1_HIGHBITDEPTH
-#ifndef __aarch64__
+#if !AOM_ARCH_AARCH64
uint16x8_t vpaddq_u16(uint16x8_t a, uint16x8_t b) {
return vcombine_u16(vpadd_u16(vget_low_u16(a), vget_high_u16(a)),
vpadd_u16(vget_low_u16(b), vget_high_u16(b)));
@@ -269,7 +270,7 @@
// unsigned integer for the sum, we can do one addition operation inside 16
// bits (8 lanes) before having to convert to 32 bits (4 lanes).
const uint16_t *sum_buf = src;
- uint32x4_t sum_32x4 = { 0, 0, 0, 0 };
+ uint32x4_t sum_32x4 = vdupq_n_u32(0);
do {
// For all widths, we load, add and combine the data so it fits in 4 lanes.
if (width == 4) {
@@ -313,7 +314,7 @@
// Permute and add in such a way that each lane contains the block sum.
// [A+C+B+D, B+D+A+C, C+A+D+B, D+B+C+A]
-#ifdef __aarch64__
+#if AOM_ARCH_AARCH64
sum_32x4 = vpaddq_u32(sum_32x4, sum_32x4);
sum_32x4 = vpaddq_u32(sum_32x4, sum_32x4);
#else
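On AArch64 the two vpaddq_u32 calls above fold the four 32-bit lanes until every lane holds the block sum; Armv7 has no vpaddq_u32, so the #else branch (outside this hunk) needs an equivalent reduction. A stand-in with the same split/pairwise-add/recombine shape as the vpaddq_u16 compatibility helper earlier in this file (hypothetical name; a sketch, not the file's actual fallback):

// Armv7 stand-in for vpaddq_u32: pairwise-add each half, then recombine.
static inline uint32x4_t vpaddq_u32_compat(uint32x4_t a, uint32x4_t b) {
  return vcombine_u32(vpadd_u32(vget_low_u32(a), vget_high_u32(a)),
                      vpadd_u32(vget_low_u32(b), vget_high_u32(b)));
}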
diff --git a/av1/common/arm/convolve_neon.c b/av1/common/arm/convolve_neon.c
index 012b3f7..713aaad 100644
--- a/av1/common/arm/convolve_neon.c
+++ b/av1/common/arm/convolve_neon.c
@@ -13,6 +13,7 @@
#include <assert.h>
#include <arm_neon.h>
+#include "config/aom_config.h"
#include "config/av1_rtcd.h"
#include "aom_dsp/aom_dsp_common.h"
@@ -44,17 +45,18 @@
return sum;
}
-#if !defined(__aarch64__)
-static INLINE uint8x8_t convolve8_horiz_4x1(
- const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
- const int16x4_t s3, const int16x4_t s4, const int16x4_t s5,
- const int16x4_t s6, const int16x4_t s7, const int16x8_t filter,
- const int16x4_t shift_round_0, const int16x4_t shift_by_bits) {
+#if !AOM_ARCH_AARCH64
+static INLINE uint8x8_t convolve8_x_4x1(const int16x4_t s0, const int16x4_t s1,
+ const int16x4_t s2, const int16x4_t s3,
+ const int16x4_t s4, const int16x4_t s5,
+ const int16x4_t s6, const int16x4_t s7,
+ const int16x8_t filter,
+ const int16x4_t horiz_const) {
const int16x4_t filter_lo = vget_low_s16(filter);
const int16x4_t filter_hi = vget_high_s16(filter);
- int16x4_t sum;
+ int16x4_t sum = horiz_const;
- sum = vmul_lane_s16(s0, filter_lo, 0);
+ sum = vmla_lane_s16(sum, s0, filter_lo, 0);
sum = vmla_lane_s16(sum, s1, filter_lo, 1);
sum = vmla_lane_s16(sum, s2, filter_lo, 2);
sum = vmla_lane_s16(sum, s3, filter_lo, 3);
@@ -63,12 +65,200 @@
sum = vmla_lane_s16(sum, s6, filter_hi, 2);
sum = vmla_lane_s16(sum, s7, filter_hi, 3);
- sum = vqrshl_s16(sum, shift_round_0);
- sum = vqrshl_s16(sum, shift_by_bits);
-
- return vqmovun_s16(vcombine_s16(sum, sum));
+ // We halved the convolution filter values so subtract 1 from the right shift.
+ return vqrshrun_n_s16(vcombine_s16(sum, vdup_n_s16(0)), FILTER_BITS - 1);
}
-#endif // !defined(__arch64__)
+#endif // !AOM_ARCH_AARCH64
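The "subtract 1 from the right shift" remark above works because every AV1 interpolation filter tap is even: halving the taps keeps the accumulator an exact integer at half scale, so one less bit of rounding shift reproduces the full-scale result. A quick check of the identity (FILTER_BITS == 7 assumed, matching libaom):

#include <assert.h>
#include <stdint.h>

// For even sums (guaranteed when every tap is even), a rounding shift by
// 6 on the halved sum equals a rounding shift by 7 on the full sum.
static void check_halved_filter_shift(int32_t sum) {
  assert((sum & 1) == 0);
  assert(((sum / 2 + (1 << 5)) >> 6) == ((sum + (1 << 6)) >> 7));
}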
+
+#if AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_MATMUL_INT8)
+
+static INLINE int32x4_t convolve12_4_usdot(uint8x16_t samples,
+ const int8x16_t filters,
+ const uint8x16x3_t permute_tbl,
+ const int32x4_t horiz_const) {
+ uint8x16_t permuted_samples[3];
+ int32x4_t sum;
+
+ /* Permute samples ready for dot product. */
+ /* { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 } */
+ permuted_samples[0] = vqtbl1q_u8(samples, permute_tbl.val[0]);
+ /* { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 } */
+ permuted_samples[1] = vqtbl1q_u8(samples, permute_tbl.val[1]);
+ /* { 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 } */
+ permuted_samples[2] = vqtbl1q_u8(samples, permute_tbl.val[2]);
+
+ /* First 4 output values. */
+ sum = vusdotq_laneq_s32(horiz_const, permuted_samples[0], filters, 0);
+ sum = vusdotq_laneq_s32(sum, permuted_samples[1], filters, 1);
+ sum = vusdotq_laneq_s32(sum, permuted_samples[2], filters, 2);
+
+ return sum;
+}
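The bracketed index comments above describe the three vqtbl1q_u8 lookups as overlapping 4-byte sample windows. The 48-byte table they assume (dot_prod_permute_tbl, defined elsewhere in this file) would look like the following sketch; the values are inferred from the comments, not copied from the library:

// Each 16-byte row gathers four overlapping 4-byte windows for one
// vqtbl1q_u8 call, matching the index comments above.
static const uint8_t dot_prod_permute_tbl_sketch[48] = {
  0, 1, 2,  3,  1, 2,  3,  4,  2,  3,  4,  5,  3,  4,  5,  6,
  4, 5, 6,  7,  5, 6,  7,  8,  6,  7,  8,  9,  7,  8,  9,  10,
  8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14
};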
+
+static INLINE int16x8_t convolve12_8_usdot(uint8x16_t samples0,
+ uint8x16_t samples1,
+ const int8x16_t filters,
+ const uint8x16x3_t permute_tbl,
+ const int32x4_t horiz_const) {
+ uint8x16_t permuted_samples[4];
+ int32x4_t sum[2];
+
+ /* Permute samples ready for dot product. */
+ /* { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 } */
+ permuted_samples[0] = vqtbl1q_u8(samples0, permute_tbl.val[0]);
+ /* { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 } */
+ permuted_samples[1] = vqtbl1q_u8(samples0, permute_tbl.val[1]);
+ /* { 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 } */
+ permuted_samples[2] = vqtbl1q_u8(samples0, permute_tbl.val[2]);
+ /* {12, 13, 14, 15, 13, 14, 15, 16, 14, 15, 16, 17, 15, 16, 17, 18 } */
+ permuted_samples[3] = vqtbl1q_u8(samples1, permute_tbl.val[2]);
+
+ /* First 4 output values. */
+ sum[0] = vusdotq_laneq_s32(horiz_const, permuted_samples[0], filters, 0);
+ sum[0] = vusdotq_laneq_s32(sum[0], permuted_samples[1], filters, 1);
+ sum[0] = vusdotq_laneq_s32(sum[0], permuted_samples[2], filters, 2);
+ /* Second 4 output values. */
+ sum[1] = vusdotq_laneq_s32(horiz_const, permuted_samples[1], filters, 0);
+ sum[1] = vusdotq_laneq_s32(sum[1], permuted_samples[2], filters, 1);
+ sum[1] = vusdotq_laneq_s32(sum[1], permuted_samples[3], filters, 2);
+
+ /* Narrow and re-pack. */
+ return vcombine_s16(vqrshrn_n_s32(sum[0], FILTER_BITS),
+ vqrshrn_n_s32(sum[1], FILTER_BITS));
+}
+
+#elif AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD)
+
+static INLINE int16x4_t convolve12_horiz_4_sdot(
+ uint8x16_t samples, const int8x16_t filters, const int32x4_t correction,
+ const uint8x16_t range_limit, const uint8x16x3_t permute_tbl) {
+ int8x16_t clamped_samples, permuted_samples[3];
+ int32x4_t sum;
+
+ /* Clamp sample range to [-128, 127] for 8-bit signed dot product. */
+ clamped_samples = vreinterpretq_s8_u8(vsubq_u8(samples, range_limit));
+
+ /* Permute samples ready for dot product. */
+ /* { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 } */
+ permuted_samples[0] = vqtbl1q_s8(clamped_samples, permute_tbl.val[0]);
+ /* { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 } */
+ permuted_samples[1] = vqtbl1q_s8(clamped_samples, permute_tbl.val[1]);
+ /* { 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 } */
+ permuted_samples[2] = vqtbl1q_s8(clamped_samples, permute_tbl.val[2]);
+
+ /* Accumulate dot product into 'correction' to account for range clamp. */
+ /* First 4 output values. */
+ sum = vdotq_laneq_s32(correction, permuted_samples[0], filters, 0);
+ sum = vdotq_laneq_s32(sum, permuted_samples[1], filters, 1);
+ sum = vdotq_laneq_s32(sum, permuted_samples[2], filters, 2);
+
+ /* Narrow and re-pack. */
+ return vshrn_n_s32(sum, ROUND0_BITS);
+}
+
+static INLINE int16x8_t convolve12_horiz_8_sdot(
+ uint8x16_t samples0, uint8x16_t samples1, const int8x16_t filters,
+ const int32x4_t correction, const uint8x16_t range_limit,
+ const uint8x16x3_t permute_tbl) {
+ int8x16_t clamped_samples[2], permuted_samples[4];
+ int32x4_t sum[2];
+
+ /* Clamp sample range to [-128, 127] for 8-bit signed dot product. */
+ clamped_samples[0] = vreinterpretq_s8_u8(vsubq_u8(samples0, range_limit));
+ clamped_samples[1] = vreinterpretq_s8_u8(vsubq_u8(samples1, range_limit));
+
+ /* Permute samples ready for dot product. */
+ /* { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 } */
+ permuted_samples[0] = vqtbl1q_s8(clamped_samples[0], permute_tbl.val[0]);
+ /* { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 } */
+ permuted_samples[1] = vqtbl1q_s8(clamped_samples[0], permute_tbl.val[1]);
+ /* { 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 } */
+ permuted_samples[2] = vqtbl1q_s8(clamped_samples[0], permute_tbl.val[2]);
+ /* {12, 13, 14, 15, 13, 14, 15, 16, 14, 15, 16, 17, 15, 16, 17, 18 } */
+ permuted_samples[3] = vqtbl1q_s8(clamped_samples[1], permute_tbl.val[2]);
+
+ /* Accumulate dot product into 'correction' to account for range clamp. */
+ /* First 4 output values. */
+ sum[0] = vdotq_laneq_s32(correction, permuted_samples[0], filters, 0);
+ sum[0] = vdotq_laneq_s32(sum[0], permuted_samples[1], filters, 1);
+ sum[0] = vdotq_laneq_s32(sum[0], permuted_samples[2], filters, 2);
+ /* Second 4 output values. */
+ sum[1] = vdotq_laneq_s32(correction, permuted_samples[1], filters, 0);
+ sum[1] = vdotq_laneq_s32(sum[1], permuted_samples[2], filters, 1);
+ sum[1] = vdotq_laneq_s32(sum[1], permuted_samples[3], filters, 2);
+
+ /* Narrow and re-pack. */
+ return vcombine_s16(vshrn_n_s32(sum[0], ROUND0_BITS),
+ vshrn_n_s32(sum[1], ROUND0_BITS));
+}
+
+static INLINE int32x4_t convolve12_4_sdot(uint8x16_t samples,
+ const int8x16_t filters,
+ const int32x4_t correction,
+ const uint8x16_t range_limit,
+ const uint8x16x3_t permute_tbl) {
+ int8x16_t clamped_samples, permuted_samples[3];
+ int32x4_t sum;
+
+ /* Clamp sample range to [-128, 127] for 8-bit signed dot product. */
+ clamped_samples = vreinterpretq_s8_u8(vsubq_u8(samples, range_limit));
+
+ /* Permute samples ready for dot product. */
+ /* { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 } */
+ permuted_samples[0] = vqtbl1q_s8(clamped_samples, permute_tbl.val[0]);
+ /* { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 } */
+ permuted_samples[1] = vqtbl1q_s8(clamped_samples, permute_tbl.val[1]);
+ /* { 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 } */
+ permuted_samples[2] = vqtbl1q_s8(clamped_samples, permute_tbl.val[2]);
+
+ /* Accumulate dot product into 'correction' to account for range clamp. */
+ /* First 4 output values. */
+ sum = vdotq_laneq_s32(correction, permuted_samples[0], filters, 0);
+ sum = vdotq_laneq_s32(sum, permuted_samples[1], filters, 1);
+ sum = vdotq_laneq_s32(sum, permuted_samples[2], filters, 2);
+
+ return sum;
+}
+
+static INLINE int16x8_t convolve12_8_sdot(uint8x16_t samples0,
+ uint8x16_t samples1,
+ const int8x16_t filters,
+ const int32x4_t correction,
+ const uint8x16_t range_limit,
+ const uint8x16x3_t permute_tbl) {
+ int8x16_t clamped_samples[2], permuted_samples[4];
+ int32x4_t sum[2];
+
+ /* Clamp sample range to [-128, 127] for 8-bit signed dot product. */
+ clamped_samples[0] = vreinterpretq_s8_u8(vsubq_u8(samples0, range_limit));
+ clamped_samples[1] = vreinterpretq_s8_u8(vsubq_u8(samples1, range_limit));
+
+ /* Permute samples ready for dot product. */
+ /* { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 } */
+ permuted_samples[0] = vqtbl1q_s8(clamped_samples[0], permute_tbl.val[0]);
+ /* { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 } */
+ permuted_samples[1] = vqtbl1q_s8(clamped_samples[0], permute_tbl.val[1]);
+ /* { 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 } */
+ permuted_samples[2] = vqtbl1q_s8(clamped_samples[0], permute_tbl.val[2]);
+ /* {12, 13, 14, 15, 13, 14, 15, 16, 14, 15, 16, 17, 15, 16, 17, 18 } */
+ permuted_samples[3] = vqtbl1q_s8(clamped_samples[1], permute_tbl.val[2]);
+
+ /* Accumulate dot product into 'correction' to account for range clamp. */
+ /* First 4 output values. */
+ sum[0] = vdotq_laneq_s32(correction, permuted_samples[0], filters, 0);
+ sum[0] = vdotq_laneq_s32(sum[0], permuted_samples[1], filters, 1);
+ sum[0] = vdotq_laneq_s32(sum[0], permuted_samples[2], filters, 2);
+ /* Second 4 output values. */
+ sum[1] = vdotq_laneq_s32(correction, permuted_samples[1], filters, 0);
+ sum[1] = vdotq_laneq_s32(sum[1], permuted_samples[2], filters, 1);
+ sum[1] = vdotq_laneq_s32(sum[1], permuted_samples[3], filters, 2);
+
+ /* Narrow and re-pack. */
+ return vcombine_s16(vqrshrn_n_s32(sum[0], FILTER_BITS),
+ vqrshrn_n_s32(sum[1], FILTER_BITS));
+}
+
+#endif // AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_MATMUL_INT8)
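The 'correction' seed used by the sdot helpers above follows from simple algebra: sdot multiplies signed 8-bit values, so each unsigned sample s is biased to s - 128, and the lost 128 * sum(taps) is added back once per output. A scalar model of the identity (a sketch; the real code precomputes the constant with vshlq_n_s16/vpaddlq_s16/vaddvq_s32):

#include <stdint.h>

// sum(f[i] * s[i]) == sum(f[i] * (s[i] - 128)) + 128 * sum(f[i]), so
// seeding the accumulator with the precomputed bias restores the exact
// unsigned result after the signed dot products.
static int32_t dot12_with_correction(const uint8_t *s, const int8_t *f) {
  int32_t correction = 0;
  for (int i = 0; i < 12; ++i) correction += 128 * f[i];
  int32_t sum = correction;
  for (int i = 0; i < 12; ++i) sum += (s[i] - 128) * f[i];
  return sum;  // equal to the plain sum(f[i] * s[i])
}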
static INLINE uint8x8_t convolve8_vert_8x4(
const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
@@ -90,12 +280,10 @@
return vqrshrun_n_s16(sum, FILTER_BITS - 1);
}
-static INLINE int16x4_t convolve8_vert_4x4_s32(
+static INLINE int16x4_t convolve8_vert_4_s32(
const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
const int16x4_t s3, const int16x4_t s4, const int16x4_t s5,
- const int16x4_t s6, const int16x4_t s7, const int16x8_t y_filter,
- const int32x4_t round_shift_vec, const int32x4_t offset_const,
- const int32x4_t sub_const_vec) {
+ const int16x4_t s6, const int16x4_t s7, const int16x8_t y_filter) {
const int16x4_t y_filter_lo = vget_low_s16(y_filter);
const int16x4_t y_filter_hi = vget_high_s16(y_filter);
int32x4_t sum;
@@ -109,19 +297,14 @@
sum = vmlal_lane_s16(sum, s6, y_filter_hi, 2);
sum = vmlal_lane_s16(sum, s7, y_filter_hi, 3);
- sum = vaddq_s32(sum, offset_const);
- sum = vqrshlq_s32(sum, round_shift_vec);
- sum = vsubq_s32(sum, sub_const_vec);
-
- return vmovn_s32(sum);
+ return vqrshrn_n_s32(sum, 2 * FILTER_BITS - ROUND0_BITS);
}
-static INLINE uint8x8_t convolve8_vert_8x4_s32(
- const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
- const int16x8_t s3, const int16x8_t s4, const int16x8_t s5,
- const int16x8_t s6, const int16x8_t s7, const int16x8_t y_filter,
- const int32x4_t round_shift_vec, const int32x4_t offset_const,
- const int32x4_t sub_const_vec, const int16x8_t vec_round_bits) {
+static INLINE uint8x8_t
+convolve8_vert_8_s32(const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
+ const int16x8_t s3, const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t s6, const int16x8_t s7,
+ const int16x8_t y_filter, const int16x8_t sub_const) {
const int16x4_t y_filter_lo = vget_low_s16(y_filter);
const int16x4_t y_filter_hi = vget_high_s16(y_filter);
int32x4_t sum0, sum1;
@@ -145,132 +328,163 @@
sum1 = vmlal_lane_s16(sum1, vget_high_s16(s6), y_filter_hi, 2);
sum1 = vmlal_lane_s16(sum1, vget_high_s16(s7), y_filter_hi, 3);
- sum0 = vaddq_s32(sum0, offset_const);
- sum1 = vaddq_s32(sum1, offset_const);
- sum0 = vqrshlq_s32(sum0, round_shift_vec);
- sum1 = vqrshlq_s32(sum1, round_shift_vec);
- sum0 = vsubq_s32(sum0, sub_const_vec);
- sum1 = vsubq_s32(sum1, sub_const_vec);
-
- res = vcombine_s16(vmovn_s32(sum0), vmovn_s32(sum1));
- res = vqrshlq_s16(res, vec_round_bits);
+ res = vcombine_s16(vqrshrn_n_s32(sum0, 2 * FILTER_BITS - ROUND0_BITS),
+ vqrshrn_n_s32(sum1, 2 * FILTER_BITS - ROUND0_BITS));
+ res = vsubq_s16(res, sub_const);
return vqmovun_s16(res);
}
-static INLINE int16x4_t convolve12_vert_4x4_s32(
- const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
- const int16x4_t s3, const int16x4_t s4, const int16x4_t s5,
- const int16x4_t s6, const int16x4_t s7, const int16x4_t s8,
- const int16x4_t s9, const int16x4_t s10, const int16x4_t s11,
- const int16x8_t y_filter_0_7, const int16x4_t y_filter_8_11,
- const int32x4_t round_shift_vec, const int32x4_t offset_const,
- const int32x4_t sub_const_vec) {
- const int16x4_t y_filter_0_3 = vget_low_s16(y_filter_0_7);
- const int16x4_t y_filter_4_7 = vget_high_s16(y_filter_0_7);
- int32x4_t sum;
+#if AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_MATMUL_INT8)
- sum = vmull_lane_s16(s0, y_filter_0_3, 0);
- sum = vmlal_lane_s16(sum, s1, y_filter_0_3, 1);
- sum = vmlal_lane_s16(sum, s2, y_filter_0_3, 2);
- sum = vmlal_lane_s16(sum, s3, y_filter_0_3, 3);
- sum = vmlal_lane_s16(sum, s4, y_filter_4_7, 0);
- sum = vmlal_lane_s16(sum, s5, y_filter_4_7, 1);
- sum = vmlal_lane_s16(sum, s6, y_filter_4_7, 2);
- sum = vmlal_lane_s16(sum, s7, y_filter_4_7, 3);
- sum = vmlal_lane_s16(sum, s8, y_filter_8_11, 0);
- sum = vmlal_lane_s16(sum, s9, y_filter_8_11, 1);
- sum = vmlal_lane_s16(sum, s10, y_filter_8_11, 2);
- sum = vmlal_lane_s16(sum, s11, y_filter_8_11, 3);
+static INLINE void convolve_x_sr_12tap_neon(const uint8_t *src, int src_stride,
+ uint8_t *dst, int dst_stride, int w, int h,
+ const int16_t *x_filter_ptr) {
+ const int16x8_t filter_0_7 = vld1q_s16(x_filter_ptr);
+ const int16x4_t filter_8_11 = vld1_s16(x_filter_ptr + 8);
+ const int16x8_t filter_8_15 = vcombine_s16(filter_8_11, vdup_n_s16(0));
+ const int8x16_t filter =
+ vcombine_s8(vmovn_s16(filter_0_7), vmovn_s16(filter_8_15));
- sum = vaddq_s32(sum, offset_const);
- sum = vqrshlq_s32(sum, round_shift_vec);
- sum = vsubq_s32(sum, sub_const_vec);
+ // Special case the following no-op filter as 128 won't fit into the
+ // 8-bit signed dot-product instruction:
+ // { 0, 0, 0, 0, 0, 128, 0, 0, 0, 0, 0, 0 }
+ if (vgetq_lane_s16(filter_0_7, 5) == 128) {
+ uint8x8_t d0;
- return vmovn_s32(sum);
+ // Undo the horizontal offset in the calling function.
+ src += 5;
+
+ for (int i = 0; i < h; i++) {
+ for (int j = 0; j < w; j += 8) {
+ d0 = vld1_u8(src + i * src_stride + j);
+ if (w == 2) {
+ store_u8_2x1(dst + i * dst_stride, d0, 0);
+ } else if (w == 4) {
+ store_u8_4x1(dst + i * dst_stride, d0, 0);
+ } else {
+ vst1_u8(dst + i * dst_stride + j, d0);
+ }
+ }
+ }
+ } else {
+ const uint8x16x3_t permute_tbl = vld1q_u8_x3(dot_prod_permute_tbl);
+ // This shim of 1 << (ROUND0_BITS - 1) enables us to use a single rounding
+ // right shift by FILTER_BITS, instead of a first rounding right shift by
+ // ROUND0_BITS followed by a second rounding right shift by
+ // (FILTER_BITS - ROUND0_BITS).
+ const int32x4_t horiz_const = vdupq_n_s32(1 << (ROUND0_BITS - 1));
+
+ if (w <= 4) {
+ uint8x16_t s0, s1, s2, s3;
+ int32x4_t d0, d1, d2, d3;
+ int16x8_t t01, t23;
+ uint8x8_t d01, d23;
+
+ do {
+ load_u8_16x4(src, src_stride, &s0, &s1, &s2, &s3);
+
+ d0 = convolve12_4_usdot(s0, filter, permute_tbl, horiz_const);
+ d1 = convolve12_4_usdot(s1, filter, permute_tbl, horiz_const);
+ d2 = convolve12_4_usdot(s2, filter, permute_tbl, horiz_const);
+ d3 = convolve12_4_usdot(s3, filter, permute_tbl, horiz_const);
+
+ t01 = vcombine_s16(vqrshrn_n_s32(d0, FILTER_BITS),
+ vqrshrn_n_s32(d1, FILTER_BITS));
+ t23 = vcombine_s16(vqrshrn_n_s32(d2, FILTER_BITS),
+ vqrshrn_n_s32(d3, FILTER_BITS));
+
+ d01 = vqmovun_s16(t01);
+ d23 = vqmovun_s16(t23);
+
+ if (w == 2) {
+ store_u8_2x1(dst + 0 * dst_stride, d01, 0);
+ store_u8_2x1(dst + 1 * dst_stride, d01, 2);
+ if (h != 2) {
+ store_u8_2x1(dst + 2 * dst_stride, d23, 0);
+ store_u8_2x1(dst + 3 * dst_stride, d23, 2);
+ }
+ } else {
+ store_u8_4x1(dst + 0 * dst_stride, d01, 0);
+ store_u8_4x1(dst + 1 * dst_stride, d01, 1);
+ if (h != 2) {
+ store_u8_4x1(dst + 2 * dst_stride, d23, 0);
+ store_u8_4x1(dst + 3 * dst_stride, d23, 1);
+ }
+ }
+
+ dst += 4 * dst_stride;
+ src += 4 * src_stride;
+ h -= 4;
+ } while (h > 0);
+ } else {
+ uint8x16_t s0, s1, s2, s3, s4, s5, s6, s7;
+ int16x8_t d0, d1, d2, d3;
+ uint8x8_t dd0, dd1, dd2, dd3;
+
+ do {
+ const uint8_t *s = src;
+ uint8_t *d = dst;
+ int width = w;
+
+ do {
+ load_u8_16x4(s, src_stride, &s0, &s1, &s2, &s3);
+ load_u8_16x4(s + 4, src_stride, &s4, &s5, &s6, &s7);
+
+ d0 = convolve12_8_usdot(s0, s4, filter, permute_tbl, horiz_const);
+ d1 = convolve12_8_usdot(s1, s5, filter, permute_tbl, horiz_const);
+ d2 = convolve12_8_usdot(s2, s6, filter, permute_tbl, horiz_const);
+ d3 = convolve12_8_usdot(s3, s7, filter, permute_tbl, horiz_const);
+
+ dd0 = vqmovun_s16(d0);
+ dd1 = vqmovun_s16(d1);
+ dd2 = vqmovun_s16(d2);
+ dd3 = vqmovun_s16(d3);
+
+ store_u8_8x2(d + 0 * dst_stride, dst_stride, dd0, dd1);
+ if (h != 2) {
+ store_u8_8x2(d + 2 * dst_stride, dst_stride, dd2, dd3);
+ }
+
+ s += 8;
+ d += 8;
+ width -= 8;
+ } while (width > 0);
+ src += 4 * src_stride;
+ dst += 4 * dst_stride;
+ h -= 4;
+ } while (h > 0);
+ }
+ }
}
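The "shim" comments in this function assert that seeding the accumulator with 1 << (ROUND0_BITS - 1) lets a single rounding shift by FILTER_BITS replace the old two-stage rounding. The identity is exact because the inter-stage rounding constant, scaled back up by 2^ROUND0_BITS, is exactly 1 << (FILTER_BITS - 1). A self-contained check, assuming libaom's FILTER_BITS == 7 and ROUND0_BITS == 3:

#include <assert.h>
#include <stdint.h>

#define FB 7  // FILTER_BITS (assumed value)
#define RB 3  // ROUND0_BITS (assumed value)

// Old scheme: rounding shift by RB, then rounding shift by FB - RB.
static int32_t two_stage(int32_t x) {
  const int32_t y = (x + (1 << (RB - 1))) >> RB;
  return (y + (1 << (FB - RB - 1))) >> (FB - RB);
}

// New scheme: fold 1 << (RB - 1) into the accumulator (the horiz_const
// shim), then let vqrshrn_n_s32 add 1 << (FB - 1) and shift by FB.
static int32_t single_stage(int32_t x) {
  return (x + (1 << (RB - 1)) + (1 << (FB - 1))) >> FB;
}

int main(void) {
  for (int32_t x = -(1 << 15); x < (1 << 15); ++x) {
    assert(two_stage(x) == single_stage(x));
  }
  return 0;
}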
-static INLINE uint8x8_t convolve12_vert_8x4_s32(
- const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
- const int16x8_t s3, const int16x8_t s4, const int16x8_t s5,
- const int16x8_t s6, const int16x8_t s7, const int16x8_t s8,
- const int16x8_t s9, const int16x8_t s10, const int16x8_t s11,
- const int16x8_t y_filter_0_7, const int16x4_t y_filter_8_11,
- const int32x4_t round_shift_vec, const int32x4_t offset_const,
- const int32x4_t sub_const_vec, const int16x8_t vec_round_bits) {
- const int16x4_t y_filter_0_3 = vget_low_s16(y_filter_0_7);
- const int16x4_t y_filter_4_7 = vget_high_s16(y_filter_0_7);
- int32x4_t sum0, sum1;
- int16x8_t res;
-
- sum0 = vmull_lane_s16(vget_low_s16(s0), y_filter_0_3, 0);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s1), y_filter_0_3, 1);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s2), y_filter_0_3, 2);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s3), y_filter_0_3, 3);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s4), y_filter_4_7, 0);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s5), y_filter_4_7, 1);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s6), y_filter_4_7, 2);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s7), y_filter_4_7, 3);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s8), y_filter_8_11, 0);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s9), y_filter_8_11, 1);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s10), y_filter_8_11, 2);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s11), y_filter_8_11, 3);
-
- sum1 = vmull_lane_s16(vget_high_s16(s0), y_filter_0_3, 0);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s1), y_filter_0_3, 1);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s2), y_filter_0_3, 2);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s3), y_filter_0_3, 3);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s4), y_filter_4_7, 0);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s5), y_filter_4_7, 1);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s6), y_filter_4_7, 2);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s7), y_filter_4_7, 3);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s8), y_filter_8_11, 0);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s9), y_filter_8_11, 1);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s10), y_filter_8_11, 2);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s11), y_filter_8_11, 3);
-
- sum0 = vaddq_s32(sum0, offset_const);
- sum1 = vaddq_s32(sum1, offset_const);
- sum0 = vqrshlq_s32(sum0, round_shift_vec);
- sum1 = vqrshlq_s32(sum1, round_shift_vec);
- sum0 = vsubq_s32(sum0, sub_const_vec);
- sum1 = vsubq_s32(sum1, sub_const_vec);
-
- res = vcombine_s16(vmovn_s32(sum0), vmovn_s32(sum1));
- res = vqrshlq_s16(res, vec_round_bits);
-
- return vqmovun_s16(res);
-}
-
-#if defined(__aarch64__) && defined(__ARM_FEATURE_MATMUL_INT8)
-
void av1_convolve_x_sr_neon(const uint8_t *src, int src_stride, uint8_t *dst,
int dst_stride, int w, int h,
const InterpFilterParams *filter_params_x,
const int subpel_x_qn,
ConvolveParams *conv_params) {
- if (filter_params_x->taps > 8) {
- av1_convolve_x_sr_c(src, src_stride, dst, dst_stride, w, h, filter_params_x,
- subpel_x_qn, conv_params);
- return;
- }
+ (void)conv_params;
const uint8_t horiz_offset = filter_params_x->taps / 2 - 1;
- const int8_t bits = FILTER_BITS - conv_params->round_0;
-
- assert(bits >= 0);
- assert((FILTER_BITS - conv_params->round_1) >= 0 ||
- ((conv_params->round_0 + conv_params->round_1) == 2 * FILTER_BITS));
const int16_t *x_filter_ptr = av1_get_interp_filter_subpel_kernel(
filter_params_x, subpel_x_qn & SUBPEL_MASK);
+ src -= horiz_offset;
+
+ if (filter_params_x->taps > 8) {
+ convolve_x_sr_12tap_neon(src, src_stride, dst, dst_stride, w, h,
+ x_filter_ptr);
+ return;
+ }
+
// Filter values are even, so downshift by 1 to reduce intermediate precision
// requirements.
const int8x8_t x_filter = vshrn_n_s16(vld1q_s16(x_filter_ptr), 1);
-
- const int16x8_t shift_round_0 = vdupq_n_s16(-conv_params->round_0 + 1);
- const int16x8_t shift_by_bits = vdupq_n_s16(-bits);
-
- src -= horiz_offset;
+ // This shim of 1 << ((ROUND0_BITS - 1) - 1) enables us to use a single
+ // rounding right shift by FILTER_BITS, instead of a first rounding right
+ // shift by ROUND0_BITS followed by a second rounding right shift by
+ // (FILTER_BITS - ROUND0_BITS).
+ // The outermost -1 is needed because we halved the filter values.
+ const int32x4_t horiz_const = vdupq_n_s32(1 << ((ROUND0_BITS - 1) - 1));
if (w <= 4) {
const uint8x16x2_t permute_tbl = vld1q_u8_x2(dot_prod_permute_tbl);
@@ -280,49 +494,33 @@
uint8x8_t d01, d23;
do {
- s0 = vld1q_u8(src + 0 * src_stride);
- s1 = vld1q_u8(src + 1 * src_stride);
- s2 = vld1q_u8(src + 2 * src_stride);
- s3 = vld1q_u8(src + 3 * src_stride);
+ load_u8_16x4(src, src_stride, &s0, &s1, &s2, &s3);
- t0 = convolve8_4_usdot(s0, x_filter, permute_tbl, vdupq_n_s32(0));
- t1 = convolve8_4_usdot(s1, x_filter, permute_tbl, vdupq_n_s32(0));
- t2 = convolve8_4_usdot(s2, x_filter, permute_tbl, vdupq_n_s32(0));
- t3 = convolve8_4_usdot(s3, x_filter, permute_tbl, vdupq_n_s32(0));
+ t0 = convolve8_4_usdot(s0, x_filter, permute_tbl, horiz_const);
+ t1 = convolve8_4_usdot(s1, x_filter, permute_tbl, horiz_const);
+ t2 = convolve8_4_usdot(s2, x_filter, permute_tbl, horiz_const);
+ t3 = convolve8_4_usdot(s3, x_filter, permute_tbl, horiz_const);
t01 = vcombine_s16(vmovn_s32(t0), vmovn_s32(t1));
t23 = vcombine_s16(vmovn_s32(t2), vmovn_s32(t3));
- t01 = vqrshlq_s16(t01, shift_round_0);
- t23 = vqrshlq_s16(t23, shift_round_0);
-
- t01 = vqrshlq_s16(t01, shift_by_bits);
- t23 = vqrshlq_s16(t23, shift_by_bits);
-
- d01 = vqmovun_s16(t01);
- d23 = vqmovun_s16(t23);
+ // We halved the convolution filter values so subtract 1 from the right shift.
+ d01 = vqrshrun_n_s16(t01, FILTER_BITS - 1);
+ d23 = vqrshrun_n_s16(t23, FILTER_BITS - 1);
if (w == 2) {
- vst1_lane_u16((uint16_t *)(dst + 0 * dst_stride),
- vreinterpret_u16_u8(d01), 0);
- vst1_lane_u16((uint16_t *)(dst + 1 * dst_stride),
- vreinterpret_u16_u8(d01), 2);
+ store_u8_2x1(dst + 0 * dst_stride, d01, 0);
+ store_u8_2x1(dst + 1 * dst_stride, d01, 2);
if (h != 2) {
- vst1_lane_u16((uint16_t *)(dst + 2 * dst_stride),
- vreinterpret_u16_u8(d23), 0);
- vst1_lane_u16((uint16_t *)(dst + 3 * dst_stride),
- vreinterpret_u16_u8(d23), 2);
+ store_u8_2x1(dst + 2 * dst_stride, d23, 0);
+ store_u8_2x1(dst + 3 * dst_stride, d23, 2);
}
} else {
- vst1_lane_u32((uint32_t *)(dst + 0 * dst_stride),
- vreinterpret_u32_u8(d01), 0);
- vst1_lane_u32((uint32_t *)(dst + 1 * dst_stride),
- vreinterpret_u32_u8(d01), 1);
+ store_u8_4x1(dst + 0 * dst_stride, d01, 0);
+ store_u8_4x1(dst + 1 * dst_stride, d01, 1);
if (h != 2) {
- vst1_lane_u32((uint32_t *)(dst + 2 * dst_stride),
- vreinterpret_u32_u8(d23), 0);
- vst1_lane_u32((uint32_t *)(dst + 3 * dst_stride),
- vreinterpret_u32_u8(d23), 1);
+ store_u8_4x1(dst + 2 * dst_stride, d23, 0);
+ store_u8_4x1(dst + 3 * dst_stride, d23, 1);
}
}
@@ -343,29 +541,18 @@
uint8_t *d = dst;
do {
- s0 = vld1q_u8(s + 0 * src_stride);
- s1 = vld1q_u8(s + 1 * src_stride);
- s2 = vld1q_u8(s + 2 * src_stride);
- s3 = vld1q_u8(s + 3 * src_stride);
+ load_u8_16x4(s, src_stride, &s0, &s1, &s2, &s3);
- t0 = convolve8_8_usdot(s0, x_filter, permute_tbl, vdupq_n_s32(0),
- shift_round_0);
- t1 = convolve8_8_usdot(s1, x_filter, permute_tbl, vdupq_n_s32(0),
- shift_round_0);
- t2 = convolve8_8_usdot(s2, x_filter, permute_tbl, vdupq_n_s32(0),
- shift_round_0);
- t3 = convolve8_8_usdot(s3, x_filter, permute_tbl, vdupq_n_s32(0),
- shift_round_0);
+ t0 = convolve8_x_8_usdot(s0, x_filter, permute_tbl, horiz_const);
+ t1 = convolve8_x_8_usdot(s1, x_filter, permute_tbl, horiz_const);
+ t2 = convolve8_x_8_usdot(s2, x_filter, permute_tbl, horiz_const);
+ t3 = convolve8_x_8_usdot(s3, x_filter, permute_tbl, horiz_const);
- t0 = vqrshlq_s16(t0, shift_by_bits);
- t1 = vqrshlq_s16(t1, shift_by_bits);
- t2 = vqrshlq_s16(t2, shift_by_bits);
- t3 = vqrshlq_s16(t3, shift_by_bits);
-
- d0 = vqmovun_s16(t0);
- d1 = vqmovun_s16(t1);
- d2 = vqmovun_s16(t2);
- d3 = vqmovun_s16(t3);
+ // We halved the convolution filter values so subtract 1 from the right shift.
+ d0 = vqrshrun_n_s16(t0, FILTER_BITS - 1);
+ d1 = vqrshrun_n_s16(t1, FILTER_BITS - 1);
+ d2 = vqrshrun_n_s16(t2, FILTER_BITS - 1);
+ d3 = vqrshrun_n_s16(t3, FILTER_BITS - 1);
vst1_u8(d + 0 * dst_stride, d0);
vst1_u8(d + 1 * dst_stride, d1);
@@ -386,40 +573,174 @@
}
}
-#elif defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
+#elif AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD)
+
+static INLINE void convolve_x_sr_12tap_neon(const uint8_t *src, int src_stride,
+ uint8_t *dst, int dst_stride, int w, int h,
+ const int16_t *x_filter_ptr) {
+ const int16x8_t filter_0_7 = vld1q_s16(x_filter_ptr);
+ const int16x4_t filter_8_11 = vld1_s16(x_filter_ptr + 8);
+ const int16x8_t filter_8_15 = vcombine_s16(filter_8_11, vdup_n_s16(0));
+ const int8x16_t filter =
+ vcombine_s8(vmovn_s16(filter_0_7), vmovn_s16(filter_8_15));
+
+ const int32x4_t correct_tmp =
+ vaddq_s32(vpaddlq_s16(vshlq_n_s16(filter_0_7, 7)),
+ vpaddlq_s16(vshlq_n_s16(filter_8_15, 7)));
+ // This shim of 1 << (ROUND0_BITS - 1) enables us to use a single rounding
+ // right shift by FILTER_BITS, instead of a first rounding right shift by
+ // ROUND0_BITS followed by a second rounding right shift by
+ // (FILTER_BITS - ROUND0_BITS).
+ int32x4_t correction =
+ vdupq_n_s32(vaddvq_s32(correct_tmp) + (1 << (ROUND0_BITS - 1)));
+ const uint8x16_t range_limit = vdupq_n_u8(128);
+ const uint8x16x3_t permute_tbl = vld1q_u8_x3(dot_prod_permute_tbl);
+
+ // Special case the following no-op filter as 128 won't fit into the
+ // 8-bit signed dot-product instruction:
+ // { 0, 0, 0, 0, 0, 128, 0, 0, 0, 0, 0, 0 }
+ if (vgetq_lane_s16(filter_0_7, 5) == 128) {
+ uint8x8_t d0;
+
+ // Undo the horizontal offset in the calling function.
+ src += 5;
+
+ for (int i = 0; i < h; i++) {
+ for (int j = 0; j < w; j += 8) {
+ d0 = vld1_u8(src + i * src_stride + j);
+ if (w == 2) {
+ store_u8_2x1(dst + i * dst_stride, d0, 0);
+ } else if (w == 4) {
+ store_u8_4x1(dst + i * dst_stride, d0, 0);
+ } else {
+ vst1_u8(dst + i * dst_stride + j, d0);
+ }
+ }
+ }
+ } else {
+ if (w <= 4) {
+ uint8x16_t s0, s1, s2, s3;
+ int32x4_t d0, d1, d2, d3;
+ int16x8_t t01, t23;
+ uint8x8_t d01, d23;
+
+ do {
+ load_u8_16x4(src, src_stride, &s0, &s1, &s2, &s3);
+
+ d0 =
+ convolve12_4_sdot(s0, filter, correction, range_limit, permute_tbl);
+ d1 =
+ convolve12_4_sdot(s1, filter, correction, range_limit, permute_tbl);
+ d2 =
+ convolve12_4_sdot(s2, filter, correction, range_limit, permute_tbl);
+ d3 =
+ convolve12_4_sdot(s3, filter, correction, range_limit, permute_tbl);
+
+ t01 = vcombine_s16(vqrshrn_n_s32(d0, FILTER_BITS),
+ vqrshrn_n_s32(d1, FILTER_BITS));
+ t23 = vcombine_s16(vqrshrn_n_s32(d2, FILTER_BITS),
+ vqrshrn_n_s32(d3, FILTER_BITS));
+
+ d01 = vqmovun_s16(t01);
+ d23 = vqmovun_s16(t23);
+
+ if (w == 2) {
+ store_u8_2x1(dst + 0 * dst_stride, d01, 0);
+ store_u8_2x1(dst + 1 * dst_stride, d01, 2);
+ if (h != 2) {
+ store_u8_2x1(dst + 2 * dst_stride, d23, 0);
+ store_u8_2x1(dst + 3 * dst_stride, d23, 2);
+ }
+ } else {
+ store_u8_4x1(dst + 0 * dst_stride, d01, 0);
+ store_u8_4x1(dst + 1 * dst_stride, d01, 1);
+ if (h != 2) {
+ store_u8_4x1(dst + 2 * dst_stride, d23, 0);
+ store_u8_4x1(dst + 3 * dst_stride, d23, 1);
+ }
+ }
+
+ dst += 4 * dst_stride;
+ src += 4 * src_stride;
+ h -= 4;
+ } while (h > 0);
+ } else {
+ uint8x16_t s0, s1, s2, s3, s4, s5, s6, s7;
+ int16x8_t d0, d1, d2, d3;
+ uint8x8_t dd0, dd1, dd2, dd3;
+
+ do {
+ const uint8_t *s = src;
+ uint8_t *d = dst;
+ int width = w;
+
+ do {
+ load_u8_16x4(s, src_stride, &s0, &s1, &s2, &s3);
+ load_u8_16x4(s + 4, src_stride, &s4, &s5, &s6, &s7);
+
+ d0 = convolve12_8_sdot(s0, s4, filter, correction, range_limit,
+ permute_tbl);
+ d1 = convolve12_8_sdot(s1, s5, filter, correction, range_limit,
+ permute_tbl);
+ d2 = convolve12_8_sdot(s2, s6, filter, correction, range_limit,
+ permute_tbl);
+ d3 = convolve12_8_sdot(s3, s7, filter, correction, range_limit,
+ permute_tbl);
+
+ dd0 = vqmovun_s16(d0);
+ dd1 = vqmovun_s16(d1);
+ dd2 = vqmovun_s16(d2);
+ dd3 = vqmovun_s16(d3);
+
+ store_u8_8x2(d + 0 * dst_stride, dst_stride, dd0, dd1);
+ if (h != 2) {
+ store_u8_8x2(d + 2 * dst_stride, dst_stride, dd2, dd3);
+ }
+
+ s += 8;
+ d += 8;
+ width -= 8;
+ } while (width > 0);
+ src += 4 * src_stride;
+ dst += 4 * dst_stride;
+ h -= 4;
+ } while (h > 0);
+ }
+ }
+}
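Both 12-tap paths special-case the kernel { 0, 0, 0, 0, 0, 128, 0, 0, 0, 0, 0, 0 } not only because 128 overflows int8: that kernel is the identity filter, so the convolution collapses to a copy of the centre sample, which the fallback implements as plain vld1/vst1 row copies after re-adding the 5-sample offset. A one-line check of the collapse, with FILTER_BITS == 7 assumed:

#include <assert.h>
#include <stdint.h>

// With f = { 0, 0, 0, 0, 0, 128, 0, 0, 0, 0, 0, 0 }, the rounded result
// (128 * s[5] + 64) >> 7 returns s[5] unchanged for every 8-bit sample.
static void check_identity_filter(uint8_t s5) {
  assert(((128 * (int32_t)s5 + (1 << 6)) >> 7) == s5);
}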
void av1_convolve_x_sr_neon(const uint8_t *src, int src_stride, uint8_t *dst,
int dst_stride, int w, int h,
const InterpFilterParams *filter_params_x,
const int subpel_x_qn,
ConvolveParams *conv_params) {
- if (filter_params_x->taps > 8) {
- av1_convolve_x_sr_c(src, src_stride, dst, dst_stride, w, h, filter_params_x,
- subpel_x_qn, conv_params);
- return;
- }
+ (void)conv_params;
const uint8_t horiz_offset = filter_params_x->taps / 2 - 1;
- const int8_t bits = FILTER_BITS - conv_params->round_0;
-
- assert(bits >= 0);
- assert((FILTER_BITS - conv_params->round_1) >= 0 ||
- ((conv_params->round_0 + conv_params->round_1) == 2 * FILTER_BITS));
const int16_t *x_filter_ptr = av1_get_interp_filter_subpel_kernel(
filter_params_x, subpel_x_qn & SUBPEL_MASK);
+ src -= horiz_offset;
+
+ if (filter_params_x->taps > 8) {
+ convolve_x_sr_12tap_neon(src, src_stride, dst, dst_stride, w, h,
+ x_filter_ptr);
+ return;
+ }
+
// Filter values are even, so downshift by 1 to reduce intermediate precision
// requirements.
const int8x8_t x_filter = vshrn_n_s16(vld1q_s16(x_filter_ptr), 1);
// Dot product constants.
const int16x8_t correct_tmp = vshll_n_s8(x_filter, 7);
- const int32x4_t correction = vdupq_n_s32(vaddlvq_s16(correct_tmp));
+ // This shim of 1 << ((ROUND0_BITS - 1) - 1) enables us to use a single
+ // rounding right shift by FILTER_BITS, instead of a first rounding right
+ // shift by ROUND0_BITS followed by a second rounding right shift by
+ // (FILTER_BITS - ROUND0_BITS).
+ // The outermost -1 is needed because we halved the filter values.
+ const int32x4_t correction =
+ vdupq_n_s32(vaddlvq_s16(correct_tmp) + (1 << ((ROUND0_BITS - 1) - 1)));
const uint8x16_t range_limit = vdupq_n_u8(128);
- const int16x8_t shift_round_0 = vdupq_n_s16(-conv_params->round_0 + 1);
- const int16x8_t shift_by_bits = vdupq_n_s16(-bits);
-
- src -= horiz_offset;
-
if (w <= 4) {
const uint8x16x2_t permute_tbl = vld1q_u8_x2(dot_prod_permute_tbl);
uint8x16_t s0, s1, s2, s3;
@@ -428,10 +749,7 @@
uint8x8_t d01, d23;
do {
- s0 = vld1q_u8(src + 0 * src_stride);
- s1 = vld1q_u8(src + 1 * src_stride);
- s2 = vld1q_u8(src + 2 * src_stride);
- s3 = vld1q_u8(src + 3 * src_stride);
+ load_u8_16x4(src, src_stride, &s0, &s1, &s2, &s3);
t0 = convolve8_4_sdot(s0, x_filter, correction, range_limit, permute_tbl);
t1 = convolve8_4_sdot(s1, x_filter, correction, range_limit, permute_tbl);
@@ -441,36 +759,23 @@
t01 = vcombine_s16(vmovn_s32(t0), vmovn_s32(t1));
t23 = vcombine_s16(vmovn_s32(t2), vmovn_s32(t3));
- t01 = vqrshlq_s16(t01, shift_round_0);
- t23 = vqrshlq_s16(t23, shift_round_0);
-
- t01 = vqrshlq_s16(t01, shift_by_bits);
- t23 = vqrshlq_s16(t23, shift_by_bits);
-
- d01 = vqmovun_s16(t01);
- d23 = vqmovun_s16(t23);
+ // We halved the convolution filter values so subtract 1 from the right shift.
+ d01 = vqrshrun_n_s16(t01, FILTER_BITS - 1);
+ d23 = vqrshrun_n_s16(t23, FILTER_BITS - 1);
if (w == 2) {
- vst1_lane_u16((uint16_t *)(dst + 0 * dst_stride),
- vreinterpret_u16_u8(d01), 0);
- vst1_lane_u16((uint16_t *)(dst + 1 * dst_stride),
- vreinterpret_u16_u8(d01), 2);
+ store_u8_2x1(dst + 0 * dst_stride, d01, 0);
+ store_u8_2x1(dst + 1 * dst_stride, d01, 2);
if (h != 2) {
- vst1_lane_u16((uint16_t *)(dst + 2 * dst_stride),
- vreinterpret_u16_u8(d23), 0);
- vst1_lane_u16((uint16_t *)(dst + 3 * dst_stride),
- vreinterpret_u16_u8(d23), 2);
+ store_u8_2x1(dst + 2 * dst_stride, d23, 0);
+ store_u8_2x1(dst + 3 * dst_stride, d23, 2);
}
} else {
- vst1_lane_u32((uint32_t *)(dst + 0 * dst_stride),
- vreinterpret_u32_u8(d01), 0);
- vst1_lane_u32((uint32_t *)(dst + 1 * dst_stride),
- vreinterpret_u32_u8(d01), 1);
+ store_u8_4x1(dst + 0 * dst_stride, d01, 0);
+ store_u8_4x1(dst + 1 * dst_stride, d01, 1);
if (h != 2) {
- vst1_lane_u32((uint32_t *)(dst + 2 * dst_stride),
- vreinterpret_u32_u8(d23), 0);
- vst1_lane_u32((uint32_t *)(dst + 3 * dst_stride),
- vreinterpret_u32_u8(d23), 1);
+ store_u8_4x1(dst + 2 * dst_stride, d23, 0);
+ store_u8_4x1(dst + 3 * dst_stride, d23, 1);
}
}
@@ -478,7 +783,6 @@
src += 4 * src_stride;
dst += 4 * dst_stride;
} while (h > 0);
-
} else {
const uint8x16x3_t permute_tbl = vld1q_u8_x3(dot_prod_permute_tbl);
uint8x16_t s0, s1, s2, s3;
@@ -491,29 +795,22 @@
uint8_t *d = dst;
do {
- s0 = vld1q_u8(s + 0 * src_stride);
- s1 = vld1q_u8(s + 1 * src_stride);
- s2 = vld1q_u8(s + 2 * src_stride);
- s3 = vld1q_u8(s + 3 * src_stride);
+ load_u8_16x4(s, src_stride, &s0, &s1, &s2, &s3);
- t0 = convolve8_8_sdot(s0, x_filter, correction, range_limit,
- permute_tbl, shift_round_0);
- t1 = convolve8_8_sdot(s1, x_filter, correction, range_limit,
- permute_tbl, shift_round_0);
- t2 = convolve8_8_sdot(s2, x_filter, correction, range_limit,
- permute_tbl, shift_round_0);
- t3 = convolve8_8_sdot(s3, x_filter, correction, range_limit,
- permute_tbl, shift_round_0);
+ t0 = convolve8_x_8_sdot(s0, x_filter, correction, range_limit,
+ permute_tbl);
+ t1 = convolve8_x_8_sdot(s1, x_filter, correction, range_limit,
+ permute_tbl);
+ t2 = convolve8_x_8_sdot(s2, x_filter, correction, range_limit,
+ permute_tbl);
+ t3 = convolve8_x_8_sdot(s3, x_filter, correction, range_limit,
+ permute_tbl);
- t0 = vqrshlq_s16(t0, shift_by_bits);
- t1 = vqrshlq_s16(t1, shift_by_bits);
- t2 = vqrshlq_s16(t2, shift_by_bits);
- t3 = vqrshlq_s16(t3, shift_by_bits);
-
- d0 = vqmovun_s16(t0);
- d1 = vqmovun_s16(t1);
- d2 = vqmovun_s16(t2);
- d3 = vqmovun_s16(t3);
+ // We halved the convolution filter values so subtract 1 from the right shift.
+ d0 = vqrshrun_n_s16(t0, FILTER_BITS - 1);
+ d1 = vqrshrun_n_s16(t1, FILTER_BITS - 1);
+ d2 = vqrshrun_n_s16(t2, FILTER_BITS - 1);
+ d3 = vqrshrun_n_s16(t3, FILTER_BITS - 1);
vst1_u8(d + 0 * dst_stride, d0);
vst1_u8(d + 1 * dst_stride, d1);
@@ -534,18 +831,18 @@
}
}
-#else // !(defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD))
+#else // !(AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD))
-static INLINE uint8x8_t convolve8_horiz_8x8(
- const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
- const int16x8_t s3, const int16x8_t s4, const int16x8_t s5,
- const int16x8_t s6, const int16x8_t s7, const int16x8_t filter,
- const int16x8_t shift_round_0, const int16x8_t shift_by_bits) {
+static INLINE uint8x8_t
+convolve8_horiz_8x8(const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
+ const int16x8_t s3, const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t s6, const int16x8_t s7,
+ const int16x8_t filter, const int16x8_t horiz_const) {
const int16x4_t filter_lo = vget_low_s16(filter);
const int16x4_t filter_hi = vget_high_s16(filter);
- int16x8_t sum;
+ int16x8_t sum = horiz_const;
- sum = vmulq_lane_s16(s0, filter_lo, 0);
+ sum = vmlaq_lane_s16(sum, s0, filter_lo, 0);
sum = vmlaq_lane_s16(sum, s1, filter_lo, 1);
sum = vmlaq_lane_s16(sum, s2, filter_lo, 2);
sum = vmlaq_lane_s16(sum, s3, filter_lo, 3);
@@ -554,10 +851,218 @@
sum = vmlaq_lane_s16(sum, s6, filter_hi, 2);
sum = vmlaq_lane_s16(sum, s7, filter_hi, 3);
- sum = vqrshlq_s16(sum, shift_round_0);
- sum = vqrshlq_s16(sum, shift_by_bits);
+ // We halved the convolution filter values so subtract 1 from the right shift.
+ return vqrshrun_n_s16(sum, FILTER_BITS - 1);
+}
- return vqmovun_s16(sum);
+static INLINE int16x4_t convolve12_x_4x4_s16(
+ const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
+ const int16x4_t s3, const int16x4_t s4, const int16x4_t s5,
+ const int16x4_t s6, const int16x4_t s7, const int16x4_t s8,
+ const int16x4_t s9, const int16x4_t s10, const int16x4_t s11,
+ const int16x8_t x_filter_0_7, const int16x4_t x_filter_8_11,
+ const int32x4_t horiz_const) {
+ const int16x4_t x_filter_0_3 = vget_low_s16(x_filter_0_7);
+ const int16x4_t x_filter_4_7 = vget_high_s16(x_filter_0_7);
+ int32x4_t sum = horiz_const;
+
+ sum = vmlal_lane_s16(sum, s0, x_filter_0_3, 0);
+ sum = vmlal_lane_s16(sum, s1, x_filter_0_3, 1);
+ sum = vmlal_lane_s16(sum, s2, x_filter_0_3, 2);
+ sum = vmlal_lane_s16(sum, s3, x_filter_0_3, 3);
+ sum = vmlal_lane_s16(sum, s4, x_filter_4_7, 0);
+ sum = vmlal_lane_s16(sum, s5, x_filter_4_7, 1);
+ sum = vmlal_lane_s16(sum, s6, x_filter_4_7, 2);
+ sum = vmlal_lane_s16(sum, s7, x_filter_4_7, 3);
+ sum = vmlal_lane_s16(sum, s8, x_filter_8_11, 0);
+ sum = vmlal_lane_s16(sum, s9, x_filter_8_11, 1);
+ sum = vmlal_lane_s16(sum, s10, x_filter_8_11, 2);
+ sum = vmlal_lane_s16(sum, s11, x_filter_8_11, 3);
+
+ return vqrshrn_n_s32(sum, FILTER_BITS);
+}
+
+// 4-column-per-iteration filtering for 12-tap convolve_x_sr.
+// Processes one row at a time.
+static INLINE void x_filter_12tap_w4_single_row(
+ const uint8_t *src_ptr, int src_stride, uint8_t *dst_ptr,
+ const int dst_stride, int w, int h, const int16x8_t x_filter_0_7,
+ const int16x4_t x_filter_8_11) {
+ // This shim of 1 << (ROUND0_BITS - 1) enables us to use a single rounding
+ // right shift by FILTER_BITS, instead of a first rounding right shift by
+ // ROUND0_BITS followed by a second rounding right shift by
+ // (FILTER_BITS - ROUND0_BITS).
+ const int32x4_t horiz_const = vdupq_n_s32(1 << (ROUND0_BITS - 1));
+
+ do {
+ const uint8_t *s = src_ptr;
+ uint8_t *d = dst_ptr;
+ int width = w;
+
+ do {
+ uint8x8_t dd0;
+ uint8x16_t t0;
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, d0;
+ int16x8_t tt0, tt1;
+
+ t0 = vld1q_u8(s);
+ tt0 = vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(t0)));
+ tt1 = vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(t0)));
+
+ s0 = vget_low_s16(tt0);
+ s4 = vget_high_s16(tt0);
+ s8 = vget_low_s16(tt1);
+ s12 = vget_high_s16(tt1);
+
+ s1 = vext_s16(s0, s4, 1); // a1 a2 a3 a4
+ s2 = vext_s16(s0, s4, 2); // a2 a3 a4 a5
+ s3 = vext_s16(s0, s4, 3); // a3 a4 a5 a6
+ s5 = vext_s16(s4, s8, 1); // a5 a6 a7 a8
+ s6 = vext_s16(s4, s8, 2); // a6 a7 a8 a9
+ s7 = vext_s16(s4, s8, 3); // a7 a8 a9 a10
+ s9 = vext_s16(s8, s12, 1); // a9 a10 a11 a12
+ s10 = vext_s16(s8, s12, 2); // a10 a11 a12 a13
+ s11 = vext_s16(s8, s12, 3); // a11 a12 a13 a14
+
+ d0 = convolve12_x_4x4_s16(s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10,
+ s11, x_filter_0_7, x_filter_8_11, horiz_const);
+
+ dd0 = vqmovun_s16(vcombine_s16(d0, vdup_n_s16(0)));
+
+ if (w == 2) {
+ store_u8_2x1(d, dd0, 0);
+ } else {
+ store_u8_4x1(d, dd0, 0);
+ }
+
+ s += 4;
+ d += 4;
+ width -= 4;
+ } while (width > 0);
+
+ src_ptr += src_stride;
+ dst_ptr += dst_stride;
+ } while (--h != 0);
+}
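For reference, a scalar model of one convolve12_x_4x4_s16 output lane, folding in the horiz_const shim and the saturating narrow performed by vqrshrn_n_s32 (a sketch, assuming libaom's FILTER_BITS == 7 and ROUND0_BITS == 3):

#include <stdint.h>

static int16_t convolve12_x_scalar(const int16_t *s, const int16_t *f) {
  int32_t sum = 1 << (3 - 1);  // horiz_const: 1 << (ROUND0_BITS - 1)
  for (int k = 0; k < 12; ++k) sum += s[k] * f[k];
  sum = (sum + (1 << 6)) >> 7;  // vqrshrn_n_s32(sum, FILTER_BITS)
  if (sum > INT16_MAX) sum = INT16_MAX;  // saturate like the 'q' intrinsic
  if (sum < INT16_MIN) sum = INT16_MIN;
  return (int16_t)sum;
}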
+
+static INLINE void convolve_x_sr_12tap_neon(const uint8_t *src_ptr,
+ int src_stride, uint8_t *dst_ptr,
+ const int dst_stride, int w, int h,
+ const int16_t *x_filter_ptr) {
+ const int16x8_t x_filter_0_7 = vld1q_s16(x_filter_ptr);
+ const int16x4_t x_filter_8_11 = vld1_s16(x_filter_ptr + 8);
+
+#if AOM_ARCH_AARCH64
+ // This shim of 1 << (ROUND0_BITS - 1) enables us to use a single rounding
+ // right shift by FILTER_BITS, instead of a first rounding right shift by
+ // ROUND0_BITS followed by a second rounding right shift by
+ // (FILTER_BITS - ROUND0_BITS).
+ const int32x4_t horiz_const = vdupq_n_s32(1 << (ROUND0_BITS - 1));
+
+ do {
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10;
+ uint8x8_t t0, t1, t2, t3;
+
+ const uint8_t *s = src_ptr;
+ uint8_t *d = dst_ptr;
+ int width = w;
+
+ load_u8_8x4(s, src_stride, &t0, &t1, &t2, &t3);
+ transpose_u8_8x4(&t0, &t1, &t2, &t3);
+
+ s0 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s1 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s2 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t2)));
+ s3 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t3)));
+ s4 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s5 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s6 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t2)));
+ s7 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t3)));
+
+ load_u8_8x4(s + 8, src_stride, &t0, &t1, &t2, &t3);
+ transpose_u8_8x4(&t0, &t1, &t2, &t3);
+
+ s8 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s9 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s10 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t2)));
+
+ s += 11;
+
+ do {
+ int16x4_t s11, s12, s13, s14, d0, d1, d2, d3;
+ int16x8_t d01, d23;
+ uint8x8_t dd01, dd23;
+
+ load_u8_8x4(s, src_stride, &t0, &t1, &t2, &t3);
+ transpose_u8_8x4(&t0, &t1, &t2, &t3);
+
+ s11 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s12 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s13 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t2)));
+ s14 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t3)));
+
+ d0 = convolve12_x_4x4_s16(s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10,
+ s11, x_filter_0_7, x_filter_8_11, horiz_const);
+ d1 = convolve12_x_4x4_s16(s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11,
+ s12, x_filter_0_7, x_filter_8_11, horiz_const);
+ d2 = convolve12_x_4x4_s16(s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12,
+ s13, x_filter_0_7, x_filter_8_11, horiz_const);
+ d3 = convolve12_x_4x4_s16(s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13,
+ s14, x_filter_0_7, x_filter_8_11, horiz_const);
+
+ transpose_s16_4x4d(&d0, &d1, &d2, &d3);
+
+ d01 = vcombine_s16(d0, d1);
+ d23 = vcombine_s16(d2, d3);
+
+ dd01 = vqmovun_s16(d01);
+ dd23 = vqmovun_s16(d23);
+
+ if (w == 2) {
+ store_u8_2x1(d + 0 * dst_stride, dd01, 0);
+ store_u8_2x1(d + 1 * dst_stride, dd01, 2);
+ if (h != 2) {
+ store_u8_2x1(d + 2 * dst_stride, dd23, 0);
+ store_u8_2x1(d + 3 * dst_stride, dd23, 2);
+ }
+ } else {
+ store_u8_4x1(d + 0 * dst_stride, dd01, 0);
+ store_u8_4x1(d + 1 * dst_stride, dd01, 1);
+ if (h != 2) {
+ store_u8_4x1(d + 2 * dst_stride, dd23, 0);
+ store_u8_4x1(d + 3 * dst_stride, dd23, 1);
+ }
+ }
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ s7 = s11;
+ s8 = s12;
+ s9 = s13;
+ s10 = s14;
+ s += 4;
+ d += 4;
+ width -= 4;
+ } while (width > 0);
+
+ src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ h -= 4;
+ } while (h >= 4);
+
+ if (h > 0) {
+ x_filter_12tap_w4_single_row(src_ptr, src_stride, dst_ptr, dst_stride, w, h,
+ x_filter_0_7, x_filter_8_11);
+ }
+#else // !AOM_ARCH_AARCH64
+ x_filter_12tap_w4_single_row(src_ptr, src_stride, dst_ptr, dst_stride, w, h,
+ x_filter_0_7, x_filter_8_11);
+#endif // AOM_ARCH_AARCH64
}
void av1_convolve_x_sr_neon(const uint8_t *src, int src_stride, uint8_t *dst,
@@ -565,33 +1070,33 @@
const InterpFilterParams *filter_params_x,
const int subpel_x_qn,
ConvolveParams *conv_params) {
- if (filter_params_x->taps > 8) {
- av1_convolve_x_sr_c(src, src_stride, dst, dst_stride, w, h, filter_params_x,
- subpel_x_qn, conv_params);
- return;
- }
+ (void)conv_params;
const uint8_t horiz_offset = filter_params_x->taps / 2 - 1;
- const int8_t bits = FILTER_BITS - conv_params->round_0;
-
- uint8x8_t t0;
-#if defined(__aarch64__)
- uint8x8_t t1, t2, t3;
-#endif
-
- assert(bits >= 0);
- assert((FILTER_BITS - conv_params->round_1) >= 0 ||
- ((conv_params->round_0 + conv_params->round_1) == 2 * FILTER_BITS));
const int16_t *x_filter_ptr = av1_get_interp_filter_subpel_kernel(
filter_params_x, subpel_x_qn & SUBPEL_MASK);
+ src -= horiz_offset;
+
+ if (filter_params_x->taps > 8) {
+ convolve_x_sr_12tap_neon(src, src_stride, dst, dst_stride, w, h,
+ x_filter_ptr);
+ return;
+ }
+
+ uint8x8_t t0;
+#if AOM_ARCH_AARCH64
+ uint8x8_t t1, t2, t3;
+ // This shim of 1 << ((ROUND0_BITS - 1) - 1) enables us to use a single
+ // rounding right shift by FILTER_BITS, instead of a first rounding right
+ // shift by ROUND0_BITS followed by a second rounding right shift by
+ // (FILTER_BITS - ROUND0_BITS).
+ // The outermost -1 is needed because we halved the filter values.
+ const int16x8_t horiz_const = vdupq_n_s16(1 << ((ROUND0_BITS - 1) - 1));
+#endif // AOM_ARCH_AARCH64
// Filter values are even, so downshift by 1 to reduce precision requirements.
const int16x8_t x_filter = vshrq_n_s16(vld1q_s16(x_filter_ptr), 1);
- const int16x8_t shift_round_0 = vdupq_n_s16(-conv_params->round_0 + 1);
- const int16x8_t shift_by_bits = vdupq_n_s16(-bits);
-
- src -= horiz_offset;
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
if (h == 4) {
uint8x8_t d01, d23;
int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, d0, d1, d2, d3;
@@ -628,42 +1133,32 @@
s10 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t3)));
d0 = convolve8_4x4(s0, s1, s2, s3, s4, s5, s6, s7, x_filter);
-
d1 = convolve8_4x4(s1, s2, s3, s4, s5, s6, s7, s8, x_filter);
-
d2 = convolve8_4x4(s2, s3, s4, s5, s6, s7, s8, s9, x_filter);
-
d3 = convolve8_4x4(s3, s4, s5, s6, s7, s8, s9, s10, x_filter);
- d01_temp = vqrshlq_s16(vcombine_s16(d0, d1), shift_round_0);
- d23_temp = vqrshlq_s16(vcombine_s16(d2, d3), shift_round_0);
+ d01_temp = vcombine_s16(d0, d1);
+ d23_temp = vcombine_s16(d2, d3);
- d01_temp = vqrshlq_s16(d01_temp, shift_by_bits);
- d23_temp = vqrshlq_s16(d23_temp, shift_by_bits);
+ d01_temp = vaddq_s16(d01_temp, horiz_const);
+ d23_temp = vaddq_s16(d23_temp, horiz_const);
- d01 = vqmovun_s16(d01_temp);
- d23 = vqmovun_s16(d23_temp);
+ // We halved the convolution filter values so subtract 1 from the right shift.
+ d01 = vqrshrun_n_s16(d01_temp, FILTER_BITS - 1);
+ d23 = vqrshrun_n_s16(d23_temp, FILTER_BITS - 1);
transpose_u8_4x4(&d01, &d23);
- if (w != 2) {
- vst1_lane_u32((uint32_t *)(dst + 0 * dst_stride), // 00 01 02 03
- vreinterpret_u32_u8(d01), 0);
- vst1_lane_u32((uint32_t *)(dst + 1 * dst_stride), // 10 11 12 13
- vreinterpret_u32_u8(d23), 0);
- vst1_lane_u32((uint32_t *)(dst + 2 * dst_stride), // 20 21 22 23
- vreinterpret_u32_u8(d01), 1);
- vst1_lane_u32((uint32_t *)(dst + 3 * dst_stride), // 30 31 32 33
- vreinterpret_u32_u8(d23), 1);
+ if (w == 2) {
+ store_u8_2x1(dst + 0 * dst_stride, d01, 0);
+ store_u8_2x1(dst + 1 * dst_stride, d23, 0);
+ store_u8_2x1(dst + 2 * dst_stride, d01, 2);
+ store_u8_2x1(dst + 3 * dst_stride, d23, 2);
} else {
- vst1_lane_u16((uint16_t *)(dst + 0 * dst_stride), // 00 01
- vreinterpret_u16_u8(d01), 0);
- vst1_lane_u16((uint16_t *)(dst + 1 * dst_stride), // 10 11
- vreinterpret_u16_u8(d23), 0);
- vst1_lane_u16((uint16_t *)(dst + 2 * dst_stride), // 20 21
- vreinterpret_u16_u8(d01), 2);
- vst1_lane_u16((uint16_t *)(dst + 3 * dst_stride), // 30 31
- vreinterpret_u16_u8(d23), 2);
+ store_u8_4x1(dst + 0 * dst_stride, d01, 0);
+ store_u8_4x1(dst + 1 * dst_stride, d23, 0);
+ store_u8_4x1(dst + 2 * dst_stride, d01, 1);
+ store_u8_4x1(dst + 3 * dst_stride, d23, 1);
}
s0 = s4;
@@ -678,18 +1173,18 @@
w -= 4;
} while (w > 0);
} else {
-#endif
+#endif // AOM_ARCH_AARCH64
int width;
const uint8_t *s;
int16x8_t s0, s1, s2, s3, s4, s5, s6, s7;
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
int16x8_t s8, s9, s10;
uint8x8_t t4, t5, t6, t7;
-#endif
+#endif // AOM_ARCH_AARCH64
if (w <= 4) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
do {
load_u8_8x8(src, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
transpose_u8_8x8(&t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
@@ -729,88 +1224,53 @@
__builtin_prefetch(src + 6 * src_stride);
__builtin_prefetch(src + 7 * src_stride);
t0 = convolve8_horiz_8x8(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
- shift_round_0, shift_by_bits);
+ horiz_const);
t1 = convolve8_horiz_8x8(s1, s2, s3, s4, s5, s6, s7, s8, x_filter,
- shift_round_0, shift_by_bits);
+ horiz_const);
t2 = convolve8_horiz_8x8(s2, s3, s4, s5, s6, s7, s8, s9, x_filter,
- shift_round_0, shift_by_bits);
+ horiz_const);
t3 = convolve8_horiz_8x8(s3, s4, s5, s6, s7, s8, s9, s10, x_filter,
- shift_round_0, shift_by_bits);
+ horiz_const);
transpose_u8_8x4(&t0, &t1, &t2, &t3);
- if ((w == 4) && (h > 4)) {
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(t0),
- 0); // 00 01 02 03
- dst += dst_stride;
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(t1),
- 0); // 10 11 12 13
- dst += dst_stride;
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(t2),
- 0); // 20 21 22 23
- dst += dst_stride;
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(t3),
- 0); // 30 31 32 33
- dst += dst_stride;
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(t0),
- 1); // 40 41 42 43
- dst += dst_stride;
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(t1),
- 1); // 50 51 52 53
- dst += dst_stride;
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(t2),
- 1); // 60 61 62 63
- dst += dst_stride;
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(t3),
- 1); // 70 71 72 73
- dst += dst_stride;
- } else if ((w == 4) && (h == 2)) {
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(t0),
- 0); // 00 01 02 03
- dst += dst_stride;
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(t1),
- 0); // 10 11 12 13
- dst += dst_stride;
- } else if ((w == 2) && (h > 4)) {
- vst1_lane_u16((uint16_t *)dst, vreinterpret_u16_u8(t0),
- 0); // 00 01
- dst += dst_stride;
- vst1_lane_u16((uint16_t *)dst, vreinterpret_u16_u8(t1),
- 0); // 10 11
- dst += dst_stride;
- vst1_lane_u16((uint16_t *)dst, vreinterpret_u16_u8(t2),
- 0); // 20 21
- dst += dst_stride;
- vst1_lane_u16((uint16_t *)dst, vreinterpret_u16_u8(t3),
- 0); // 30 31
- dst += dst_stride;
- vst1_lane_u16((uint16_t *)dst, vreinterpret_u16_u8(t0),
- 2); // 40 41
- dst += dst_stride;
- vst1_lane_u16((uint16_t *)dst, vreinterpret_u16_u8(t1),
- 2); // 50 51
- dst += dst_stride;
- vst1_lane_u16((uint16_t *)dst, vreinterpret_u16_u8(t2),
- 2); // 60 61
- dst += dst_stride;
- vst1_lane_u16((uint16_t *)dst, vreinterpret_u16_u8(t3),
- 2); // 70 71
- dst += dst_stride;
- } else if ((w == 2) && (h == 2)) {
- vst1_lane_u16((uint16_t *)dst, vreinterpret_u16_u8(t0),
- 0); // 00 01
- dst += dst_stride;
- vst1_lane_u16((uint16_t *)dst, vreinterpret_u16_u8(t1),
- 0); // 10 11
- dst += dst_stride;
+ if (w == 4) {
+ store_u8_4x1(dst + 0 * dst_stride, t0, 0);
+ store_u8_4x1(dst + 1 * dst_stride, t1, 0);
+ if (h > 4) {
+ store_u8_4x1(dst + 2 * dst_stride, t2, 0);
+ store_u8_4x1(dst + 3 * dst_stride, t3, 0);
+ store_u8_4x1(dst + 4 * dst_stride, t0, 1);
+ store_u8_4x1(dst + 5 * dst_stride, t1, 1);
+ store_u8_4x1(dst + 6 * dst_stride, t2, 1);
+ store_u8_4x1(dst + 7 * dst_stride, t3, 1);
+ }
+ } else if (w == 2) {
+ store_u8_2x1(dst + 0 * dst_stride, t0, 0);
+ store_u8_2x1(dst + 1 * dst_stride, t1, 0);
+ if (h > 4) {
+ store_u8_2x1(dst + 2 * dst_stride, t2, 0);
+ store_u8_2x1(dst + 3 * dst_stride, t3, 0);
+ store_u8_2x1(dst + 4 * dst_stride, t0, 2);
+ store_u8_2x1(dst + 5 * dst_stride, t1, 2);
+ store_u8_2x1(dst + 6 * dst_stride, t2, 2);
+ store_u8_2x1(dst + 7 * dst_stride, t3, 2);
+ }
}
+
+ dst += 8 * dst_stride;
h -= 8;
} while (h > 0);
-#else
+#else // !AOM_ARCH_AARCH64
+ // This shim of 1 << ((ROUND0_BITS - 1) - 1) enables us to use a single
+ // rounding right shift by FILTER_BITS, instead of a first rounding right
+ // shift by ROUND0_BITS followed by a second rounding right shift by
+ // (FILTER_BITS - ROUND0_BITS).
+ // The outermost -1 is needed because we halved the filter values.
+ const int16x4_t horiz_const = vdup_n_s16(1 << ((ROUND0_BITS - 1) - 1));
int16x8_t tt0;
int16x4_t x0, x1, x2, x3, x4, x5, x6, x7;
- const int16x4_t shift_round_0_low = vget_low_s16(shift_round_0);
- const int16x4_t shift_by_bits_low = vget_low_s16(shift_by_bits);
+
do {
t0 = vld1_u8(src); // a0 a1 a2 a3 a4 a5 a6 a7
tt0 = vreinterpretq_s16_u16(vmovl_u8(t0));
@@ -830,24 +1290,23 @@
src += src_stride;
- t0 = convolve8_horiz_4x1(x0, x1, x2, x3, x4, x5, x6, x7, x_filter,
- shift_round_0_low, shift_by_bits_low);
+ t0 = convolve8_x_4x1(x0, x1, x2, x3, x4, x5, x6, x7, x_filter,
+ horiz_const);
if (w == 4) {
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(t0),
- 0); // 00 01 02 03
+ store_u8_4x1(dst, t0, 0);
dst += dst_stride;
} else if (w == 2) {
- vst1_lane_u16((uint16_t *)dst, vreinterpret_u16_u8(t0), 0); // 00 01
+ store_u8_2x1(dst, t0, 0);
dst += dst_stride;
}
h -= 1;
} while (h > 0);
-#endif
+#endif // AOM_ARCH_AARCH64
} else {
uint8_t *d;
int16x8_t s11;
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
int16x8_t s12, s13, s14;
do {
__builtin_prefetch(src + 0 * src_stride);
@@ -893,35 +1352,30 @@
s14 = vreinterpretq_s16_u16(vmovl_u8(t7));
t0 = convolve8_horiz_8x8(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
- shift_round_0, shift_by_bits);
-
+ horiz_const);
t1 = convolve8_horiz_8x8(s1, s2, s3, s4, s5, s6, s7, s8, x_filter,
- shift_round_0, shift_by_bits);
-
+ horiz_const);
t2 = convolve8_horiz_8x8(s2, s3, s4, s5, s6, s7, s8, s9, x_filter,
- shift_round_0, shift_by_bits);
-
+ horiz_const);
t3 = convolve8_horiz_8x8(s3, s4, s5, s6, s7, s8, s9, s10, x_filter,
- shift_round_0, shift_by_bits);
-
+ horiz_const);
t4 = convolve8_horiz_8x8(s4, s5, s6, s7, s8, s9, s10, s11, x_filter,
- shift_round_0, shift_by_bits);
-
+ horiz_const);
t5 = convolve8_horiz_8x8(s5, s6, s7, s8, s9, s10, s11, s12, x_filter,
- shift_round_0, shift_by_bits);
-
+ horiz_const);
t6 = convolve8_horiz_8x8(s6, s7, s8, s9, s10, s11, s12, s13, x_filter,
- shift_round_0, shift_by_bits);
-
+ horiz_const);
t7 = convolve8_horiz_8x8(s7, s8, s9, s10, s11, s12, s13, s14,
- x_filter, shift_round_0, shift_by_bits);
+ x_filter, horiz_const);
transpose_u8_8x8(&t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+
if (h != 2) {
store_u8_8x8(d, dst_stride, t0, t1, t2, t3, t4, t5, t6, t7);
} else {
- store_row2_u8_8x8(d, dst_stride, t0, t1);
+ store_u8_8x2(d, dst_stride, t0, t1);
}
+
s0 = s8;
s1 = s9;
s2 = s10;
@@ -937,7 +1391,14 @@
dst += 8 * dst_stride;
h -= 8;
} while (h > 0);
-#else
+#else // !AOM_ARCH_AARCH64
+  // This shim of 1 << ((ROUND0_BITS - 1) - 1) enables us to use a single
+  // rounding right shift by FILTER_BITS - instead of a first rounding right
+  // shift by ROUND0_BITS, followed by a second rounding right shift by
+  // FILTER_BITS - ROUND0_BITS.
+  // The outermost -1 is needed because we halved the filter values.
+ const int16x8_t horiz_const = vdupq_n_s16(1 << ((ROUND0_BITS - 1) - 1));
+
do {
t0 = vld1_u8(src); // a0 a1 a2 a3 a4 a5 a6 a7
s0 = vreinterpretq_s16_u16(vmovl_u8(t0));
@@ -962,7 +1423,8 @@
s7 = vextq_s16(s11, s7, 7); // a7 a8 a9 a10 a11 a12 a13 a14
t0 = convolve8_horiz_8x8(s11, s1, s2, s3, s4, s5, s6, s7, x_filter,
- shift_round_0, shift_by_bits);
+ horiz_const);
+
vst1_u8(d, t0);
s += 8;
@@ -973,40 +1435,467 @@
dst += dst_stride;
h -= 1;
} while (h > 0);
-#endif
+#endif // AOM_ARCH_AARCH64
}
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
}
-#endif
+#endif // AOM_ARCH_AARCH64
}
-#endif // defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
+#endif // AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_MATMUL_INT8)
+
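+// Vertical 6-tap convolution. The six nonzero coefficients are assumed to
+// occupy taps 1-6 of the 8-tap kernel (taps 0 and 7 are zero), which is why
+// the first load below starts at src_ptr + src_stride: the caller applied an
+// 8-tap vertical offset, and one row of it is undone here.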
+static INLINE void convolve_y_sr_6tap_neon(const uint8_t *src_ptr,
+ int src_stride, uint8_t *dst_ptr,
+ const int dst_stride, int w, int h,
+ const int16x8_t y_filter_0_7) {
+ if (w <= 4) {
+ uint8x8_t t0, t1, t2, t3, t4, t5;
+ int16x4_t s0, s1, s2, s3, s4, s5, d0;
+ uint8x8_t d01;
+
+#if AOM_ARCH_AARCH64
+ uint8x8_t t6, t7, t8;
+ int16x4_t s6, s7, s8, d1, d2, d3;
+ uint8x8_t d23;
+#endif // AOM_ARCH_AARCH64
+
+ const uint8_t *s = src_ptr + src_stride;
+ uint8_t *d = dst_ptr;
+
+ load_u8_8x5(s, src_stride, &t0, &t1, &t2, &t3, &t4);
+ s0 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s1 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s2 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t2)));
+ s3 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t3)));
+ s4 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t4)));
+ s += 5 * src_stride;
+
+ do {
+#if AOM_ARCH_AARCH64
+ load_u8_8x4(s, src_stride, &t5, &t6, &t7, &t8);
+ s5 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t5)));
+ s6 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t6)));
+ s7 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t7)));
+ s8 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t8)));
+
+ d0 = convolve6_4x4(s0, s1, s2, s3, s4, s5, y_filter_0_7);
+ d1 = convolve6_4x4(s1, s2, s3, s4, s5, s6, y_filter_0_7);
+ d2 = convolve6_4x4(s2, s3, s4, s5, s6, s7, y_filter_0_7);
+ d3 = convolve6_4x4(s3, s4, s5, s6, s7, s8, y_filter_0_7);
+
+ d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS - 1);
+ d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS - 1);
+
+ if (w == 2) {
+ store_u8_2x1(d + 0 * dst_stride, d01, 0);
+ store_u8_2x1(d + 1 * dst_stride, d01, 2);
+ if (h != 2) {
+ store_u8_2x1(d + 2 * dst_stride, d23, 0);
+ store_u8_2x1(d + 3 * dst_stride, d23, 2);
+ }
+ } else {
+ store_u8_4x1(d + 0 * dst_stride, d01, 0);
+ store_u8_4x1(d + 1 * dst_stride, d01, 1);
+ if (h != 2) {
+ store_u8_4x1(d + 2 * dst_stride, d23, 0);
+ store_u8_4x1(d + 3 * dst_stride, d23, 1);
+ }
+ }
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ h -= 4;
+#else // !AOM_ARCH_AARCH64
+ t5 = vld1_u8(s);
+ s5 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t5)));
+
+ d0 = convolve6_4x4(s0, s1, s2, s3, s4, s5, y_filter_0_7);
+ d01 = vqrshrun_n_s16(vcombine_s16(d0, vdup_n_s16(0)), FILTER_BITS - 1);
+
+ if (w == 2) {
+ store_u8_2x1(d, d01, 0);
+ } else {
+ store_u8_4x1(d, d01, 0);
+ }
+
+ s0 = s1;
+ s1 = s2;
+ s2 = s3;
+ s3 = s4;
+ s4 = s5;
+ s += src_stride;
+ d += dst_stride;
+ h--;
+#endif // AOM_ARCH_AARCH64
+ } while (h > 0);
+ } else {
+    // Width is a multiple of 8 and height is a multiple of 4 here.
+ uint8x8_t t0, t1, t2, t3, t4, t5;
+ int16x8_t s0, s1, s2, s3, s4, s5, dd0;
+ uint8x8_t d0;
+#if AOM_ARCH_AARCH64
+ uint8x8_t t6, t7, t8;
+ int16x8_t s6, s7, s8, dd1, dd2, dd3;
+ uint8x8_t d1, d2, d3;
+#endif // AOM_ARCH_AARCH64
+
+ do {
+ int height = h;
+ const uint8_t *s = src_ptr + src_stride;
+ uint8_t *d = dst_ptr;
+
+ load_u8_8x5(s, src_stride, &t0, &t1, &t2, &t3, &t4);
+ s0 = vreinterpretq_s16_u16(vmovl_u8(t0));
+ s1 = vreinterpretq_s16_u16(vmovl_u8(t1));
+ s2 = vreinterpretq_s16_u16(vmovl_u8(t2));
+ s3 = vreinterpretq_s16_u16(vmovl_u8(t3));
+ s4 = vreinterpretq_s16_u16(vmovl_u8(t4));
+ s += 5 * src_stride;
+
+ do {
+#if AOM_ARCH_AARCH64
+ load_u8_8x4(s, src_stride, &t5, &t6, &t7, &t8);
+ s5 = vreinterpretq_s16_u16(vmovl_u8(t5));
+ s6 = vreinterpretq_s16_u16(vmovl_u8(t6));
+ s7 = vreinterpretq_s16_u16(vmovl_u8(t7));
+ s8 = vreinterpretq_s16_u16(vmovl_u8(t8));
+
+ dd0 = convolve6_8x4(s0, s1, s2, s3, s4, s5, y_filter_0_7);
+ dd1 = convolve6_8x4(s1, s2, s3, s4, s5, s6, y_filter_0_7);
+ dd2 = convolve6_8x4(s2, s3, s4, s5, s6, s7, y_filter_0_7);
+ dd3 = convolve6_8x4(s3, s4, s5, s6, s7, s8, y_filter_0_7);
+
+ d0 = vqrshrun_n_s16(dd0, FILTER_BITS - 1);
+ d1 = vqrshrun_n_s16(dd1, FILTER_BITS - 1);
+ d2 = vqrshrun_n_s16(dd2, FILTER_BITS - 1);
+ d3 = vqrshrun_n_s16(dd3, FILTER_BITS - 1);
+
+ if (h != 2) {
+ store_u8_8x4(d, dst_stride, d0, d1, d2, d3);
+ } else {
+ store_u8_8x2(d, dst_stride, d0, d1);
+ }
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ height -= 4;
+#else // !AOM_ARCH_AARCH64
+ t5 = vld1_u8(s);
+ s5 = vreinterpretq_s16_u16(vmovl_u8(t5));
+
+ dd0 = convolve6_8x4(s0, s1, s2, s3, s4, s5, y_filter_0_7);
+ d0 = vqrshrun_n_s16(dd0, FILTER_BITS - 1);
+
+ vst1_u8(d, d0);
+
+ s0 = s1;
+ s1 = s2;
+ s2 = s3;
+ s3 = s4;
+ s4 = s5;
+ s += src_stride;
+ d += dst_stride;
+ height--;
+#endif // AOM_ARCH_AARCH64
+ } while (height > 0);
+
+ src_ptr += 8;
+ dst_ptr += 8;
+ w -= 8;
+ } while (w > 0);
+ }
+}
+
+static INLINE int16x4_t convolve12_y_4x4_s32(
+ const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
+ const int16x4_t s3, const int16x4_t s4, const int16x4_t s5,
+ const int16x4_t s6, const int16x4_t s7, const int16x4_t s8,
+ const int16x4_t s9, const int16x4_t s10, const int16x4_t s11,
+ const int16x8_t y_filter_0_7, const int16x4_t y_filter_8_11) {
+ const int16x4_t y_filter_0_3 = vget_low_s16(y_filter_0_7);
+ const int16x4_t y_filter_4_7 = vget_high_s16(y_filter_0_7);
+ int16x4_t sum;
+
+ sum = vmul_lane_s16(s0, y_filter_0_3, 0);
+ sum = vmla_lane_s16(sum, s1, y_filter_0_3, 1);
+ sum = vmla_lane_s16(sum, s2, y_filter_0_3, 2);
+ sum = vmla_lane_s16(sum, s3, y_filter_0_3, 3);
+ sum = vmla_lane_s16(sum, s4, y_filter_4_7, 0);
+
+ sum = vmla_lane_s16(sum, s7, y_filter_4_7, 3);
+ sum = vmla_lane_s16(sum, s8, y_filter_8_11, 0);
+ sum = vmla_lane_s16(sum, s9, y_filter_8_11, 1);
+ sum = vmla_lane_s16(sum, s10, y_filter_8_11, 2);
+ sum = vmla_lane_s16(sum, s11, y_filter_8_11, 3);
+
+  // Separate out the two filter values in the middle of the kernel that have
+  // the largest magnitude, and use saturating addition to prevent overflow.
+  // This means we can stay at 16-bit elements rather than widening everything
+  // to a 32-bit result, which would require twice the number of instructions.
+ sum = vqadd_s16(sum, vmul_lane_s16(s5, y_filter_4_7, 1));
+ sum = vqadd_s16(sum, vmul_lane_s16(s6, y_filter_4_7, 2));
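+  // (With 8-bit input, the ten smaller products are assumed to accumulate
+  // within int16 range via the plain vmla steps above; only adding the two
+  // large middle taps can overflow, hence the saturating adds here.)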
+
+ return sum;
+}
+
+static INLINE uint8x8_t convolve12_y_8x4_s32(
+ const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
+ const int16x8_t s3, const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t s6, const int16x8_t s7, const int16x8_t s8,
+ const int16x8_t s9, const int16x8_t s10, const int16x8_t s11,
+ const int16x8_t y_filter_0_7, const int16x4_t y_filter_8_11) {
+ const int16x4_t y_filter_0_3 = vget_low_s16(y_filter_0_7);
+ const int16x4_t y_filter_4_7 = vget_high_s16(y_filter_0_7);
+ int16x8_t sum;
+
+ sum = vmulq_lane_s16(s0, y_filter_0_3, 0);
+ sum = vmlaq_lane_s16(sum, s1, y_filter_0_3, 1);
+ sum = vmlaq_lane_s16(sum, s2, y_filter_0_3, 2);
+ sum = vmlaq_lane_s16(sum, s3, y_filter_0_3, 3);
+ sum = vmlaq_lane_s16(sum, s4, y_filter_4_7, 0);
+
+ sum = vmlaq_lane_s16(sum, s7, y_filter_4_7, 3);
+ sum = vmlaq_lane_s16(sum, s8, y_filter_8_11, 0);
+ sum = vmlaq_lane_s16(sum, s9, y_filter_8_11, 1);
+ sum = vmlaq_lane_s16(sum, s10, y_filter_8_11, 2);
+ sum = vmlaq_lane_s16(sum, s11, y_filter_8_11, 3);
+
+  // Separate out the two filter values in the middle of the kernel that have
+  // the largest magnitude, and use saturating addition to prevent overflow.
+  // This means we can stay at 16-bit elements rather than widening everything
+  // to a 32-bit result, which would require twice the number of instructions.
+ sum = vqaddq_s16(sum, vmulq_lane_s16(s5, y_filter_4_7, 1));
+ sum = vqaddq_s16(sum, vmulq_lane_s16(s6, y_filter_4_7, 2));
+
+ return vqrshrun_n_s16(sum, FILTER_BITS);
+}
+
+static INLINE void convolve_y_sr_12tap_neon(const uint8_t *src_ptr,
+ int src_stride, uint8_t *dst_ptr,
+ int dst_stride, int w, int h,
+ const int16_t *y_filter_ptr) {
+  // Special case the following no-op filter, which can be applied as a
+  // simple copy:
+  // { 0, 0, 0, 0, 0, 128, 0, 0, 0, 0, 0, 0 }
+ if (y_filter_ptr[5] == 128) {
+    // Undo the vertical offset in the calling function.
+ src_ptr += 5 * src_stride;
+
+ if (w <= 4) {
+ for (int i = 0; i < h; i += 2) {
+ uint8x8_t d0 = load_unaligned_u8(src_ptr + i * src_stride, src_stride);
+ if (w == 2) {
+ store_u8_2x1(dst_ptr + i * dst_stride, d0, 0);
+          store_u8_2x1(dst_ptr + (i + 1) * dst_stride, d0, 2);
+ } else if (w == 4) {
+ store_u8_4x1(dst_ptr + i * dst_stride, d0, 0);
+ store_u8_4x1(dst_ptr + (i + 1) * dst_stride, d0, 1);
+ }
+ }
+ } else {
+ for (int i = 0; i < h; i++) {
+ for (int j = 0; j < w; j += 8) {
+ uint8x8_t d0 = vld1_u8(src_ptr + i * src_stride + j);
+ vst1_u8(dst_ptr + i * dst_stride + j, d0);
+ }
+ }
+ }
+ return;
+ }
+
+ const int16x8_t y_filter_0_7 = vld1q_s16(y_filter_ptr);
+ const int16x4_t y_filter_8_11 = vld1_s16(y_filter_ptr + 8);
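+  // Unlike the 6- and 8-tap paths, the coefficients are used at full
+  // precision here, since the 12-tap filters are not guaranteed to be even,
+  // so the final narrowing shifts by FILTER_BITS rather than FILTER_BITS - 1.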
+
+ if (w <= 4) {
+ uint8x8_t t0, t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11, t12, t13, t14;
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
+ int16x4_t d0, d1, d2, d3;
+ int16x8_t dd01, dd23;
+ uint8x8_t d01, d23;
+
+ load_u8_8x11(src_ptr, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7,
+ &t8, &t9, &t10);
+ s0 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s1 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s2 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t2)));
+ s3 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t3)));
+ s4 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t4)));
+ s5 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t5)));
+ s6 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t6)));
+ s7 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t7)));
+ s8 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t8)));
+ s9 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t9)));
+ s10 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t10)));
+
+ src_ptr += 11 * src_stride;
+
+ do {
+ load_u8_8x4(src_ptr, src_stride, &t11, &t12, &t13, &t14);
+ s11 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t11)));
+ s12 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t12)));
+ s13 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t13)));
+ s14 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t14)));
+
+ d0 = convolve12_y_4x4_s32(s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10,
+ s11, y_filter_0_7, y_filter_8_11);
+ d1 = convolve12_y_4x4_s32(s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11,
+ s12, y_filter_0_7, y_filter_8_11);
+ d2 = convolve12_y_4x4_s32(s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12,
+ s13, y_filter_0_7, y_filter_8_11);
+ d3 = convolve12_y_4x4_s32(s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13,
+ s14, y_filter_0_7, y_filter_8_11);
+
+ dd01 = vcombine_s16(d0, d1);
+ dd23 = vcombine_s16(d2, d3);
+
+ d01 = vqrshrun_n_s16(dd01, FILTER_BITS);
+ d23 = vqrshrun_n_s16(dd23, FILTER_BITS);
+
+ if (w == 2) {
+ store_u8_2x1(dst_ptr + 0 * dst_stride, d01, 0);
+ store_u8_2x1(dst_ptr + 1 * dst_stride, d01, 2);
+ if (h != 2) {
+ store_u8_2x1(dst_ptr + 2 * dst_stride, d23, 0);
+ store_u8_2x1(dst_ptr + 3 * dst_stride, d23, 2);
+ }
+ } else {
+ store_u8_4x1(dst_ptr + 0 * dst_stride, d01, 0);
+ store_u8_4x1(dst_ptr + 1 * dst_stride, d01, 1);
+ if (h != 2) {
+ store_u8_4x1(dst_ptr + 2 * dst_stride, d23, 0);
+ store_u8_4x1(dst_ptr + 3 * dst_stride, d23, 1);
+ }
+ }
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ s7 = s11;
+ s8 = s12;
+ s9 = s13;
+ s10 = s14;
+ src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ h -= 4;
+ } while (h > 0);
+ } else {
+ uint8x8_t t0, t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11, t12, t13, t14;
+ uint8x8_t d0, d1, d2, d3;
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
+
+ do {
+ const uint8_t *s = src_ptr;
+ uint8_t *d = dst_ptr;
+ int height = h;
+
+ load_u8_8x11(s, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7, &t8,
+ &t9, &t10);
+ s0 = vreinterpretq_s16_u16(vmovl_u8(t0));
+ s1 = vreinterpretq_s16_u16(vmovl_u8(t1));
+ s2 = vreinterpretq_s16_u16(vmovl_u8(t2));
+ s3 = vreinterpretq_s16_u16(vmovl_u8(t3));
+ s4 = vreinterpretq_s16_u16(vmovl_u8(t4));
+ s5 = vreinterpretq_s16_u16(vmovl_u8(t5));
+ s6 = vreinterpretq_s16_u16(vmovl_u8(t6));
+ s7 = vreinterpretq_s16_u16(vmovl_u8(t7));
+ s8 = vreinterpretq_s16_u16(vmovl_u8(t8));
+ s9 = vreinterpretq_s16_u16(vmovl_u8(t9));
+ s10 = vreinterpretq_s16_u16(vmovl_u8(t10));
+
+ s += 11 * src_stride;
+
+ do {
+ load_u8_8x4(s, src_stride, &t11, &t12, &t13, &t14);
+ s11 = vreinterpretq_s16_u16(vmovl_u8(t11));
+ s12 = vreinterpretq_s16_u16(vmovl_u8(t12));
+ s13 = vreinterpretq_s16_u16(vmovl_u8(t13));
+ s14 = vreinterpretq_s16_u16(vmovl_u8(t14));
+
+ d0 = convolve12_y_8x4_s32(s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10,
+ s11, y_filter_0_7, y_filter_8_11);
+ d1 = convolve12_y_8x4_s32(s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11,
+ s12, y_filter_0_7, y_filter_8_11);
+ d2 = convolve12_y_8x4_s32(s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12,
+ s13, y_filter_0_7, y_filter_8_11);
+ d3 = convolve12_y_8x4_s32(s3, s4, s5, s6, s7, s8, s9, s10, s11, s12,
+ s13, s14, y_filter_0_7, y_filter_8_11);
+
+ if (h != 2) {
+ store_u8_8x4(d, dst_stride, d0, d1, d2, d3);
+ } else {
+ store_u8_8x2(d, dst_stride, d0, d1);
+ }
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ s7 = s11;
+ s8 = s12;
+ s9 = s13;
+ s10 = s14;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ height -= 4;
+ } while (height > 0);
+
+ src_ptr += 8;
+ dst_ptr += 8;
+ w -= 8;
+ } while (w > 0);
+ }
+}
+
void av1_convolve_y_sr_neon(const uint8_t *src, int src_stride, uint8_t *dst,
int dst_stride, int w, int h,
const InterpFilterParams *filter_params_y,
const int subpel_y_qn) {
- if (filter_params_y->taps > 8) {
- av1_convolve_y_sr_c(src, src_stride, dst, dst_stride, w, h, filter_params_y,
- subpel_y_qn);
- return;
- }
+ const int y_filter_taps = get_filter_tap(filter_params_y, subpel_y_qn);
const int vert_offset = filter_params_y->taps / 2 - 1;
src -= vert_offset * src_stride;
const int16_t *y_filter_ptr = av1_get_interp_filter_subpel_kernel(
filter_params_y, subpel_y_qn & SUBPEL_MASK);
+
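+  // Dispatch on the actual tap count for this subpel position: 12-tap filters
+  // take the dedicated path with full-precision coefficients; shorter filters
+  // are halved below, with fewer than 8 taps routed to the 6-tap path and
+  // 8 taps falling through to the main body of this function.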
+ if (y_filter_taps > 8) {
+ convolve_y_sr_12tap_neon(src, src_stride, dst, dst_stride, w, h,
+ y_filter_ptr);
+ return;
+ }
+
+  // Filter values are even, so downshift by 1 to reduce precision requirements.
const int16x8_t y_filter = vshrq_n_s16(vld1q_s16(y_filter_ptr), 1);
+ if (y_filter_taps < 8) {
+ convolve_y_sr_6tap_neon(src, src_stride, dst, dst_stride, w, h, y_filter);
+ return;
+ }
+
if (w <= 4) {
uint8x8_t d01;
int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, d0;
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
uint8x8_t d23;
int16x4_t s8, s9, s10, d1, d2, d3;
-#endif
+#endif // AOM_ARCH_AARCH64
s0 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(vld1_u8(src))));
src += src_stride;
s1 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(vld1_u8(src))));
@@ -1025,7 +1914,7 @@
do {
s7 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(vld1_u8(src))));
src += src_stride;
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
s8 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(vld1_u8(src))));
src += src_stride;
s9 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(vld1_u8(src))));
@@ -1048,41 +1937,23 @@
d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS - 1);
d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS - 1);
- if ((w == 4) && (h != 2)) {
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(d01),
- 0); // 00 01 02 03
- dst += dst_stride;
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(d01),
- 1); // 10 11 12 13
- dst += dst_stride;
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(d23),
- 0); // 20 21 22 23
- dst += dst_stride;
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(d23),
- 1); // 30 31 32 33
- dst += dst_stride;
- } else if ((w == 4) && (h == 2)) {
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(d01),
- 0); // 00 01 02 03
- dst += dst_stride;
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(d01),
- 1); // 10 11 12 13
- dst += dst_stride;
- } else if ((w == 2) && (h != 2)) {
- vst1_lane_u16((uint16_t *)dst, vreinterpret_u16_u8(d01), 0); // 00 01
- dst += dst_stride;
- vst1_lane_u16((uint16_t *)dst, vreinterpret_u16_u8(d01), 2); // 10 11
- dst += dst_stride;
- vst1_lane_u16((uint16_t *)dst, vreinterpret_u16_u8(d23), 0); // 20 21
- dst += dst_stride;
- vst1_lane_u16((uint16_t *)dst, vreinterpret_u16_u8(d23), 2); // 30 31
- dst += dst_stride;
- } else if ((w == 2) && (h == 2)) {
- vst1_lane_u16((uint16_t *)dst, vreinterpret_u16_u8(d01), 0); // 00 01
- dst += dst_stride;
- vst1_lane_u16((uint16_t *)dst, vreinterpret_u16_u8(d01), 2); // 10 11
- dst += dst_stride;
+
+ if (w == 2) {
+ store_u8_2x1(dst + 0 * dst_stride, d01, 0);
+ store_u8_2x1(dst + 1 * dst_stride, d01, 2);
+ if (h != 2) {
+ store_u8_2x1(dst + 2 * dst_stride, d23, 0);
+ store_u8_2x1(dst + 3 * dst_stride, d23, 2);
+ }
+ } else {
+ store_u8_4x1(dst + 0 * dst_stride, d01, 0);
+ store_u8_4x1(dst + 1 * dst_stride, d01, 1);
+ if (h != 2) {
+ store_u8_4x1(dst + 2 * dst_stride, d23, 0);
+ store_u8_4x1(dst + 3 * dst_stride, d23, 1);
+ }
}
+
s0 = s4;
s1 = s5;
s2 = s6;
@@ -1090,8 +1961,9 @@
s4 = s8;
s5 = s9;
s6 = s10;
+ dst += 4 * dst_stride;
h -= 4;
-#else
+#else // !AOM_ARCH_AARCH64
__builtin_prefetch(dst + 0 * dst_stride);
__builtin_prefetch(src + 0 * src_stride);
@@ -1100,11 +1972,9 @@
d01 = vqrshrun_n_s16(vcombine_s16(d0, d0), FILTER_BITS - 1);
if (w == 4) {
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(d01), 0);
- dst += dst_stride;
+ store_u8_4x1(dst, d01, 0);
} else if (w == 2) {
- vst1_lane_u16((uint16_t *)dst, vreinterpret_u16_u8(d01), 0);
- dst += dst_stride;
+ store_u8_2x1(dst, d01, 0);
}
s0 = s1;
s1 = s2;
@@ -1113,8 +1983,9 @@
s4 = s5;
s5 = s6;
s6 = s7;
+ dst += dst_stride;
h -= 1;
-#endif
+#endif // AOM_ARCH_AARCH64
} while (h > 0);
} else {
int height;
@@ -1122,10 +1993,10 @@
uint8_t *d;
uint8x8_t t0;
int16x8_t s0, s1, s2, s3, s4, s5, s6, s7;
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
uint8x8_t t1, t2, t3;
int16x8_t s8, s9, s10;
-#endif
+#endif // AOM_ARCH_AARCH64
do {
__builtin_prefetch(src + 0 * src_stride);
__builtin_prefetch(src + 1 * src_stride);
@@ -1155,7 +2026,7 @@
do {
s7 = vreinterpretq_s16_u16(vmovl_u8(vld1_u8(s)));
s += src_stride;
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
s8 = vreinterpretq_s16_u16(vmovl_u8(vld1_u8(s)));
s += src_stride;
s9 = vreinterpretq_s16_u16(vmovl_u8(vld1_u8(s)));
@@ -1175,20 +2046,11 @@
t1 = convolve8_vert_8x4(s1, s2, s3, s4, s5, s6, s7, s8, y_filter);
t2 = convolve8_vert_8x4(s2, s3, s4, s5, s6, s7, s8, s9, y_filter);
t3 = convolve8_vert_8x4(s3, s4, s5, s6, s7, s8, s9, s10, y_filter);
+
if (h != 2) {
- vst1_u8(d, t0);
- d += dst_stride;
- vst1_u8(d, t1);
- d += dst_stride;
- vst1_u8(d, t2);
- d += dst_stride;
- vst1_u8(d, t3);
- d += dst_stride;
+ store_u8_8x4(d, dst_stride, t0, t1, t2, t3);
} else {
- vst1_u8(d, t0);
- d += dst_stride;
- vst1_u8(d, t1);
- d += dst_stride;
+ store_u8_8x2(d, dst_stride, t0, t1);
}
s0 = s4;
s1 = s5;
@@ -1197,8 +2059,9 @@
s4 = s8;
s5 = s9;
s6 = s10;
+ d += 4 * dst_stride;
height -= 4;
-#else
+#else // !AOM_ARCH_AARCH64
__builtin_prefetch(d);
__builtin_prefetch(s);
@@ -1215,7 +2078,7 @@
s5 = s6;
s6 = s7;
height -= 1;
-#endif
+#endif // AOM_ARCH_AARCH64
} while (height > 0);
src += 8;
dst += 8;
@@ -1224,13 +2087,12 @@
}
}
-#if defined(__aarch64__) && defined(__ARM_FEATURE_MATMUL_INT8)
+#if AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_MATMUL_INT8)
-static INLINE int16x4_t convolve12_4_usdot(uint8x16_t samples,
- const int8x16_t filters,
- const uint8x16x3_t permute_tbl,
- const int32x4_t horiz_const,
- const int32x4_t shift_round_0) {
+static INLINE int16x4_t convolve12_horiz_4_usdot(uint8x16_t samples,
+ const int8x16_t filters,
+ const uint8x16x3_t permute_tbl,
+ int32x4_t horiz_const) {
uint8x16_t permuted_samples[3];
int32x4_t sum;
@@ -1248,17 +2110,14 @@
sum = vusdotq_laneq_s32(sum, permuted_samples[2], filters, 2);
/* Narrow and re-pack. */
- sum = vqrshlq_s32(sum, shift_round_0);
-
- return vmovn_s32(sum);
+ return vshrn_n_s32(sum, ROUND0_BITS);
}
-static INLINE int16x8_t convolve12_8_usdot(uint8x16_t samples0,
- uint8x16_t samples1,
- const int8x16_t filters,
- const uint8x16x3_t permute_tbl,
- const int32x4_t horiz_const,
- const int32x4_t shift_round_0) {
+static INLINE int16x8_t convolve12_horiz_8_usdot(uint8x16_t samples0,
+ uint8x16_t samples1,
+ const int8x16_t filters,
+ const uint8x16x3_t permute_tbl,
+ const int32x4_t horiz_const) {
uint8x16_t permuted_samples[4];
int32x4_t sum[2];
@@ -1282,16 +2141,14 @@
sum[1] = vusdotq_laneq_s32(sum[1], permuted_samples[3], filters, 2);
/* Narrow and re-pack. */
- sum[0] = vqrshlq_s32(sum[0], shift_round_0);
- sum[1] = vqrshlq_s32(sum[1], shift_round_0);
-
- return vcombine_s16(vmovn_s32(sum[0]), vmovn_s32(sum[1]));
+ return vcombine_s16(vshrn_n_s32(sum[0], ROUND0_BITS),
+ vshrn_n_s32(sum[1], ROUND0_BITS));
}
-static INLINE void av1_convolve_2d_sr_horiz_12tap_neon(
+static INLINE void convolve_2d_sr_horiz_12tap_neon(
const uint8_t *src_ptr, int src_stride, int16_t *dst_ptr,
const int dst_stride, int w, int h, const int16x8_t x_filter_0_7,
- const int16x4_t x_filter_8_11, const int round_0) {
+ const int16x4_t x_filter_8_11) {
const int bd = 8;
// Special case the following no-op filter as 128 won't fit into the
@@ -1299,7 +2156,6 @@
// { 0, 0, 0, 0, 0, 128, 0, 0, 0, 0, 0, 0 }
if (vgetq_lane_s16(x_filter_0_7, 5) == 128) {
const int16x8_t horiz_const = vdupq_n_s16((1 << (bd - 1)));
- const int16x8_t shift_round_0 = vdupq_n_s16(FILTER_BITS - round_0);
// Undo the horizontal offset in the calling function.
src_ptr += 5;
@@ -1307,10 +2163,10 @@
for (int j = 0; j < w; j += 8) {
uint8x8_t s0 = vld1_u8(src_ptr + i * src_stride + j);
uint16x8_t t0 = vaddw_u8(vreinterpretq_u16_s16(horiz_const), s0);
- int16x8_t d0 = vqrshlq_s16(vreinterpretq_s16_u16(t0), shift_round_0);
+ int16x8_t d0 =
+ vshlq_n_s16(vreinterpretq_s16_u16(t0), FILTER_BITS - ROUND0_BITS);
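+        // For this identity filter the intermediate value is simply
+        // (s + (1 << (bd - 1))) << (FILTER_BITS - ROUND0_BITS), i.e. the same
+        // offset-and-shift format the full 12-tap pass produces.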
if (w == 2) {
- vst1q_lane_s32((int32_t *)(dst_ptr + i * dst_stride),
- vreinterpretq_s32_s16(d0), 0);
+ store_s16_2x1(dst_ptr + i * dst_stride, vget_low_s16(d0), 0);
} else if (w == 4) {
vst1_s16(dst_ptr + i * dst_stride, vget_low_s16(d0));
} else {
@@ -1325,9 +2181,10 @@
};
const int8x16_t x_filter = vcombine_s8(vmovn_s16(x_filter_s16.val[0]),
vmovn_s16(x_filter_s16.val[1]));
-
- const int32x4_t horiz_const = vdupq_n_s32((1 << (bd + FILTER_BITS - 1)));
- const int32x4_t shift_round_0 = vdupq_n_s32(-round_0);
+ // This shim of 1 << (ROUND0_BITS - 1) enables us to use non-rounding shifts
+ // - which are generally faster than rounding shifts on modern CPUs.
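+  // For example, with ROUND0_BITS == 3, folding the + 4 rounding bias into
+  // the accumulator up front lets the plain vshrn_n_s32(sum, ROUND0_BITS)
+  // used below behave exactly like a rounding shift.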
+ const int32x4_t horiz_const =
+ vdupq_n_s32((1 << (bd + FILTER_BITS - 1)) + (1 << (ROUND0_BITS - 1)));
const uint8x16x3_t permute_tbl = vld1q_u8_x3(dot_prod_permute_tbl);
if (w <= 4) {
@@ -1340,34 +2197,20 @@
uint8x16_t s0, s1, s2, s3;
int16x4_t d0, d1, d2, d3;
- s0 = vld1q_u8(s + 0 * src_stride);
- s1 = vld1q_u8(s + 1 * src_stride);
- s2 = vld1q_u8(s + 2 * src_stride);
- s3 = vld1q_u8(s + 3 * src_stride);
+ load_u8_16x4(s, src_stride, &s0, &s1, &s2, &s3);
- d0 = convolve12_4_usdot(s0, x_filter, permute_tbl, horiz_const,
- shift_round_0);
- d1 = convolve12_4_usdot(s1, x_filter, permute_tbl, horiz_const,
- shift_round_0);
- d2 = convolve12_4_usdot(s2, x_filter, permute_tbl, horiz_const,
- shift_round_0);
- d3 = convolve12_4_usdot(s3, x_filter, permute_tbl, horiz_const,
- shift_round_0);
+ d0 = convolve12_horiz_4_usdot(s0, x_filter, permute_tbl, horiz_const);
+ d1 = convolve12_horiz_4_usdot(s1, x_filter, permute_tbl, horiz_const);
+ d2 = convolve12_horiz_4_usdot(s2, x_filter, permute_tbl, horiz_const);
+ d3 = convolve12_horiz_4_usdot(s3, x_filter, permute_tbl, horiz_const);
if (w == 2) {
- vst1_lane_s32((int32_t *)(d + 0 * dst_stride),
- vreinterpret_s32_s16(d0), 0);
- vst1_lane_s32((int32_t *)(d + 1 * dst_stride),
- vreinterpret_s32_s16(d1), 0);
- vst1_lane_s32((int32_t *)(d + 2 * dst_stride),
- vreinterpret_s32_s16(d2), 0);
- vst1_lane_s32((int32_t *)(d + 3 * dst_stride),
- vreinterpret_s32_s16(d3), 0);
+ store_s16_2x1(d + 0 * dst_stride, d0, 0);
+ store_s16_2x1(d + 1 * dst_stride, d1, 0);
+ store_s16_2x1(d + 2 * dst_stride, d2, 0);
+ store_s16_2x1(d + 3 * dst_stride, d3, 0);
} else {
- vst1_s16(d + 0 * dst_stride, d0);
- vst1_s16(d + 1 * dst_stride, d1);
- vst1_s16(d + 2 * dst_stride, d2);
- vst1_s16(d + 3 * dst_stride, d3);
+ store_s16_4x4(d, dst_stride, d0, d1, d2, d3);
}
s += 4;
@@ -1391,11 +2234,10 @@
s0 = vld1q_u8(s);
- d0 = convolve12_4_usdot(s0, x_filter, permute_tbl, horiz_const,
- shift_round_0);
+ d0 = convolve12_horiz_4_usdot(s0, x_filter, permute_tbl, horiz_const);
if (w == 2) {
- vst1_lane_s32((int32_t *)d, vreinterpret_s32_s16(d0), 0);
+ store_s16_2x1(d, d0, 0);
} else {
vst1_s16(d, d0);
}
@@ -1418,28 +2260,19 @@
uint8x16_t s0[2], s1[2], s2[2], s3[2];
int16x8_t d0, d1, d2, d3;
- s0[0] = vld1q_u8(s + 0 * src_stride);
- s1[0] = vld1q_u8(s + 1 * src_stride);
- s2[0] = vld1q_u8(s + 2 * src_stride);
- s3[0] = vld1q_u8(s + 3 * src_stride);
- s0[1] = vld1q_u8(s + 0 * src_stride + 4);
- s1[1] = vld1q_u8(s + 1 * src_stride + 4);
- s2[1] = vld1q_u8(s + 2 * src_stride + 4);
- s3[1] = vld1q_u8(s + 3 * src_stride + 4);
+ load_u8_16x4(s, src_stride, &s0[0], &s1[0], &s2[0], &s3[0]);
+ load_u8_16x4(s + 4, src_stride, &s0[1], &s1[1], &s2[1], &s3[1]);
- d0 = convolve12_8_usdot(s0[0], s0[1], x_filter, permute_tbl,
- horiz_const, shift_round_0);
- d1 = convolve12_8_usdot(s1[0], s1[1], x_filter, permute_tbl,
- horiz_const, shift_round_0);
- d2 = convolve12_8_usdot(s2[0], s2[1], x_filter, permute_tbl,
- horiz_const, shift_round_0);
- d3 = convolve12_8_usdot(s3[0], s3[1], x_filter, permute_tbl,
- horiz_const, shift_round_0);
+ d0 = convolve12_horiz_8_usdot(s0[0], s0[1], x_filter, permute_tbl,
+ horiz_const);
+ d1 = convolve12_horiz_8_usdot(s1[0], s1[1], x_filter, permute_tbl,
+ horiz_const);
+ d2 = convolve12_horiz_8_usdot(s2[0], s2[1], x_filter, permute_tbl,
+ horiz_const);
+ d3 = convolve12_horiz_8_usdot(s3[0], s3[1], x_filter, permute_tbl,
+ horiz_const);
- vst1q_s16(d + 0 * dst_stride, d0);
- vst1q_s16(d + 1 * dst_stride, d1);
- vst1q_s16(d + 2 * dst_stride, d2);
- vst1q_s16(d + 3 * dst_stride, d3);
+ store_s16_8x4(d, dst_stride, d0, d1, d2, d3);
s += 8;
d += 8;
@@ -1463,8 +2296,8 @@
s0[0] = vld1q_u8(s);
s0[1] = vld1q_u8(s + 4);
- d0 = convolve12_8_usdot(s0[0], s0[1], x_filter, permute_tbl,
- horiz_const, shift_round_0);
+ d0 = convolve12_horiz_8_usdot(s0[0], s0[1], x_filter, permute_tbl,
+ horiz_const);
vst1q_s16(d, d0);
@@ -1480,82 +2313,12 @@
}
}
-#elif defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
+#elif AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD)
-static INLINE int16x4_t convolve12_4_sdot(uint8x16_t samples,
- const int8x16_t filters,
- const int32x4_t correction,
- const uint8x16_t range_limit,
- const uint8x16x3_t permute_tbl,
- const int32x4_t shift_round_0) {
- int8x16_t clamped_samples, permuted_samples[3];
- int32x4_t sum;
-
- /* Clamp sample range to [-128, 127] for 8-bit signed dot product. */
- clamped_samples = vreinterpretq_s8_u8(vsubq_u8(samples, range_limit));
-
- /* Permute samples ready for dot product. */
- /* { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 } */
- permuted_samples[0] = vqtbl1q_s8(clamped_samples, permute_tbl.val[0]);
- /* { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 } */
- permuted_samples[1] = vqtbl1q_s8(clamped_samples, permute_tbl.val[1]);
- /* { 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 } */
- permuted_samples[2] = vqtbl1q_s8(clamped_samples, permute_tbl.val[2]);
-
- /* Accumulate dot product into 'correction' to account for range clamp. */
- /* First 4 output values. */
- sum = vdotq_laneq_s32(correction, permuted_samples[0], filters, 0);
- sum = vdotq_laneq_s32(sum, permuted_samples[1], filters, 1);
- sum = vdotq_laneq_s32(sum, permuted_samples[2], filters, 2);
-
- /* Narrow and re-pack. */
- sum = vqrshlq_s32(sum, shift_round_0);
-
- return vmovn_s32(sum);
-}
-
-static INLINE int16x8_t convolve12_8_sdot(
- uint8x16_t samples0, uint8x16_t samples1, const int8x16_t filters,
- const int32x4_t correction, const uint8x16_t range_limit,
- const uint8x16x3_t permute_tbl, const int32x4_t shift_round_0) {
- int8x16_t clamped_samples[2], permuted_samples[4];
- int32x4_t sum[2];
-
- /* Clamp sample range to [-128, 127] for 8-bit signed dot product. */
- clamped_samples[0] = vreinterpretq_s8_u8(vsubq_u8(samples0, range_limit));
- clamped_samples[1] = vreinterpretq_s8_u8(vsubq_u8(samples1, range_limit));
-
- /* Permute samples ready for dot product. */
- /* { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 } */
- permuted_samples[0] = vqtbl1q_s8(clamped_samples[0], permute_tbl.val[0]);
- /* { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 } */
- permuted_samples[1] = vqtbl1q_s8(clamped_samples[0], permute_tbl.val[1]);
- /* { 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 } */
- permuted_samples[2] = vqtbl1q_s8(clamped_samples[0], permute_tbl.val[2]);
- /* {12, 13, 14, 15, 13, 14, 15, 16, 14, 15, 16, 17, 15, 16, 17, 18 } */
- permuted_samples[3] = vqtbl1q_s8(clamped_samples[1], permute_tbl.val[2]);
-
- /* Accumulate dot product into 'correction' to account for range clamp. */
- /* First 4 output values. */
- sum[0] = vdotq_laneq_s32(correction, permuted_samples[0], filters, 0);
- sum[0] = vdotq_laneq_s32(sum[0], permuted_samples[1], filters, 1);
- sum[0] = vdotq_laneq_s32(sum[0], permuted_samples[2], filters, 2);
- /* Second 4 output values. */
- sum[1] = vdotq_laneq_s32(correction, permuted_samples[1], filters, 0);
- sum[1] = vdotq_laneq_s32(sum[1], permuted_samples[2], filters, 1);
- sum[1] = vdotq_laneq_s32(sum[1], permuted_samples[3], filters, 2);
-
- /* Narrow and re-pack. */
- sum[0] = vqrshlq_s32(sum[0], shift_round_0);
- sum[1] = vqrshlq_s32(sum[1], shift_round_0);
-
- return vcombine_s16(vmovn_s32(sum[0]), vmovn_s32(sum[1]));
-}
-
-static INLINE void av1_convolve_2d_sr_horiz_12tap_neon(
+static INLINE void convolve_2d_sr_horiz_12tap_neon(
const uint8_t *src_ptr, int src_stride, int16_t *dst_ptr,
const int dst_stride, int w, int h, const int16x8_t x_filter_0_7,
- const int16x4_t x_filter_8_11, const int round_0) {
+ const int16x4_t x_filter_8_11) {
const int bd = 8;
// Special case the following no-op filter as 128 won't fit into the
@@ -1563,7 +2326,6 @@
// { 0, 0, 0, 0, 0, 128, 0, 0, 0, 0, 0, 0 }
if (vgetq_lane_s16(x_filter_0_7, 5) == 128) {
const int16x8_t horiz_const = vdupq_n_s16((1 << (bd - 1)));
- const int16x8_t shift_round_0 = vdupq_n_s16(FILTER_BITS - round_0);
// Undo the horizontal offset in the calling function.
src_ptr += 5;
@@ -1571,10 +2333,10 @@
for (int j = 0; j < w; j += 8) {
uint8x8_t s0 = vld1_u8(src_ptr + i * src_stride + j);
uint16x8_t t0 = vaddw_u8(vreinterpretq_u16_s16(horiz_const), s0);
- int16x8_t d0 = vqrshlq_s16(vreinterpretq_s16_u16(t0), shift_round_0);
+ int16x8_t d0 =
+ vshlq_n_s16(vreinterpretq_s16_u16(t0), FILTER_BITS - ROUND0_BITS);
if (w == 2) {
- vst1q_lane_s32((int32_t *)(dst_ptr + i * dst_stride),
- vreinterpretq_s32_s16(d0), 0);
+ store_s16_2x1(dst_ptr + i * dst_stride, vget_low_s16(d0), 0);
} else if (w == 4) {
vst1_s16(dst_ptr + i * dst_stride, vget_low_s16(d0));
} else {
@@ -1583,8 +2345,6 @@
}
}
} else {
- const int32x4_t shift_round_0 = vdupq_n_s32(-round_0);
-
// Narrow filter values to 8-bit.
const int16x8x2_t x_filter_s16 = {
{ x_filter_0_7, vcombine_s16(x_filter_8_11, vdup_n_s16(0)) }
@@ -1592,8 +2352,11 @@
const int8x16_t x_filter = vcombine_s8(vmovn_s16(x_filter_s16.val[0]),
vmovn_s16(x_filter_s16.val[1]));
+ // This shim of 1 << (ROUND0_BITS - 1) enables us to use non-rounding shifts
+ // - which are generally faster than rounding shifts on modern CPUs.
+ const int32_t horiz_const =
+ ((1 << (bd + FILTER_BITS - 1)) + (1 << (ROUND0_BITS - 1)));
// Dot product constants.
- const int32_t horiz_const = (1 << (bd + FILTER_BITS - 1));
const int32x4_t correct_tmp =
vaddq_s32(vpaddlq_s16(vshlq_n_s16(x_filter_s16.val[0], 7)),
vpaddlq_s16(vshlq_n_s16(x_filter_s16.val[1], 7)));
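+  // Shifting each coefficient left by 7 and summing gives 128 * sum(filter),
+  // which compensates for clamping the samples to [-128, 127] (by subtracting
+  // 128) ahead of the signed dot product.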
@@ -1612,34 +2375,24 @@
uint8x16_t s0, s1, s2, s3;
int16x4_t d0, d1, d2, d3;
- s0 = vld1q_u8(s + 0 * src_stride);
- s1 = vld1q_u8(s + 1 * src_stride);
- s2 = vld1q_u8(s + 2 * src_stride);
- s3 = vld1q_u8(s + 3 * src_stride);
+ load_u8_16x4(s, src_stride, &s0, &s1, &s2, &s3);
- d0 = convolve12_4_sdot(s0, x_filter, correction, range_limit,
- permute_tbl, shift_round_0);
- d1 = convolve12_4_sdot(s1, x_filter, correction, range_limit,
- permute_tbl, shift_round_0);
- d2 = convolve12_4_sdot(s2, x_filter, correction, range_limit,
- permute_tbl, shift_round_0);
- d3 = convolve12_4_sdot(s3, x_filter, correction, range_limit,
- permute_tbl, shift_round_0);
+ d0 = convolve12_horiz_4_sdot(s0, x_filter, correction, range_limit,
+ permute_tbl);
+ d1 = convolve12_horiz_4_sdot(s1, x_filter, correction, range_limit,
+ permute_tbl);
+ d2 = convolve12_horiz_4_sdot(s2, x_filter, correction, range_limit,
+ permute_tbl);
+ d3 = convolve12_horiz_4_sdot(s3, x_filter, correction, range_limit,
+ permute_tbl);
if (w == 2) {
- vst1_lane_s32((int32_t *)(d + 0 * dst_stride),
- vreinterpret_s32_s16(d0), 0);
- vst1_lane_s32((int32_t *)(d + 1 * dst_stride),
- vreinterpret_s32_s16(d1), 0);
- vst1_lane_s32((int32_t *)(d + 2 * dst_stride),
- vreinterpret_s32_s16(d2), 0);
- vst1_lane_s32((int32_t *)(d + 3 * dst_stride),
- vreinterpret_s32_s16(d3), 0);
+ store_s16_2x1(d + 0 * dst_stride, d0, 0);
+ store_s16_2x1(d + 1 * dst_stride, d1, 0);
+ store_s16_2x1(d + 2 * dst_stride, d2, 0);
+ store_s16_2x1(d + 3 * dst_stride, d3, 0);
} else {
- vst1_s16(d + 0 * dst_stride, d0);
- vst1_s16(d + 1 * dst_stride, d1);
- vst1_s16(d + 2 * dst_stride, d2);
- vst1_s16(d + 3 * dst_stride, d3);
+ store_s16_4x4(d, dst_stride, d0, d1, d2, d3);
}
s += 4;
@@ -1663,11 +2416,11 @@
s0 = vld1q_u8(s);
- d0 = convolve12_4_sdot(s0, x_filter, correction, range_limit,
- permute_tbl, shift_round_0);
+ d0 = convolve12_horiz_4_sdot(s0, x_filter, correction, range_limit,
+ permute_tbl);
if (w == 2) {
- vst1_lane_s32((int32_t *)d, vreinterpret_s32_s16(d0), 0);
+ store_s16_2x1(d, d0, 0);
} else {
vst1_s16(d, d0);
}
@@ -1690,28 +2443,19 @@
uint8x16_t s0[2], s1[2], s2[2], s3[2];
int16x8_t d0, d1, d2, d3;
- s0[0] = vld1q_u8(s + 0 * src_stride);
- s1[0] = vld1q_u8(s + 1 * src_stride);
- s2[0] = vld1q_u8(s + 2 * src_stride);
- s3[0] = vld1q_u8(s + 3 * src_stride);
- s0[1] = vld1q_u8(s + 0 * src_stride + 4);
- s1[1] = vld1q_u8(s + 1 * src_stride + 4);
- s2[1] = vld1q_u8(s + 2 * src_stride + 4);
- s3[1] = vld1q_u8(s + 3 * src_stride + 4);
+ load_u8_16x4(s, src_stride, &s0[0], &s1[0], &s2[0], &s3[0]);
+ load_u8_16x4(s + 4, src_stride, &s0[1], &s1[1], &s2[1], &s3[1]);
- d0 = convolve12_8_sdot(s0[0], s0[1], x_filter, correction,
- range_limit, permute_tbl, shift_round_0);
- d1 = convolve12_8_sdot(s1[0], s1[1], x_filter, correction,
- range_limit, permute_tbl, shift_round_0);
- d2 = convolve12_8_sdot(s2[0], s2[1], x_filter, correction,
- range_limit, permute_tbl, shift_round_0);
- d3 = convolve12_8_sdot(s3[0], s3[1], x_filter, correction,
- range_limit, permute_tbl, shift_round_0);
+ d0 = convolve12_horiz_8_sdot(s0[0], s0[1], x_filter, correction,
+ range_limit, permute_tbl);
+ d1 = convolve12_horiz_8_sdot(s1[0], s1[1], x_filter, correction,
+ range_limit, permute_tbl);
+ d2 = convolve12_horiz_8_sdot(s2[0], s2[1], x_filter, correction,
+ range_limit, permute_tbl);
+ d3 = convolve12_horiz_8_sdot(s3[0], s3[1], x_filter, correction,
+ range_limit, permute_tbl);
- vst1q_s16(d + 0 * dst_stride, d0);
- vst1q_s16(d + 1 * dst_stride, d1);
- vst1q_s16(d + 2 * dst_stride, d2);
- vst1q_s16(d + 3 * dst_stride, d3);
+ store_s16_8x4(d, dst_stride, d0, d1, d2, d3);
s += 8;
d += 8;
@@ -1735,8 +2479,8 @@
s0[0] = vld1q_u8(s);
s0[1] = vld1q_u8(s + 4);
- d0 = convolve12_8_sdot(s0[0], s0[1], x_filter, correction,
- range_limit, permute_tbl, shift_round_0);
+ d0 = convolve12_horiz_8_sdot(s0[0], s0[1], x_filter, correction,
+ range_limit, permute_tbl);
vst1q_s16(d, d0);
@@ -1752,7 +2496,7 @@
}
}
-#else // !(defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD))
+#else // !(AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD))
static INLINE int16x4_t convolve12_horiz_4x4_s16(
const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
@@ -1760,7 +2504,7 @@
const int16x4_t s6, const int16x4_t s7, const int16x4_t s8,
const int16x4_t s9, const int16x4_t s10, const int16x4_t s11,
const int16x8_t x_filter_0_7, const int16x4_t x_filter_8_11,
- const int32x4_t horiz_const, const int32x4_t shift_round_0) {
+ const int32x4_t horiz_const) {
const int16x4_t x_filter_0_3 = vget_low_s16(x_filter_0_7);
const int16x4_t x_filter_4_7 = vget_high_s16(x_filter_0_7);
int32x4_t sum;
@@ -1779,9 +2523,7 @@
sum = vmlal_lane_s16(sum, s10, x_filter_8_11, 2);
sum = vmlal_lane_s16(sum, s11, x_filter_8_11, 3);
- sum = vqrshlq_s32(sum, shift_round_0);
-
- return vmovn_s32(sum);
+ return vshrn_n_s32(sum, ROUND0_BITS);
}
// 4 column per iteration horizontal filtering for 12-tap convolve_2d_sr.
@@ -1789,8 +2531,7 @@
static INLINE void horiz_filter_12tap_w4_single_row(
const uint8_t *src_ptr, int src_stride, int16_t *dst_ptr,
const int dst_stride, int w, int h, const int16x8_t x_filter_0_7,
- const int16x4_t x_filter_8_11, const int32x4_t horiz_const,
- const int32x4_t shift_round_0) {
+ const int16x4_t x_filter_8_11, const int32x4_t horiz_const) {
do {
const uint8_t *s = src_ptr;
int16_t *d = dst_ptr;
@@ -1822,10 +2563,10 @@
d0 = convolve12_horiz_4x4_s16(s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10,
s11, x_filter_0_7, x_filter_8_11,
- horiz_const, shift_round_0);
+ horiz_const);
if (w == 2) {
- vst1_lane_s32((int32_t *)d, vreinterpret_s32_s16(d0), 0);
+ store_s16_2x1(d, d0, 0);
} else {
vst1_s16(d, d0);
}
@@ -1841,15 +2582,17 @@
} while (h > 0);
}
-static INLINE void av1_convolve_2d_sr_horiz_12tap_neon(
+static INLINE void convolve_2d_sr_horiz_12tap_neon(
const uint8_t *src_ptr, int src_stride, int16_t *dst_ptr,
const int dst_stride, int w, int h, const int16x8_t x_filter_0_7,
- const int16x4_t x_filter_8_11, const int round_0) {
+ const int16x4_t x_filter_8_11) {
const int bd = 8;
- const int32x4_t shift_round_0 = vdupq_n_s32(-(round_0));
- const int32x4_t horiz_const = vdupq_n_s32((1 << (bd + FILTER_BITS - 1)));
+ // This shim of 1 << (ROUND0_BITS - 1) enables us to use non-rounding shifts -
+ // which are generally faster than rounding shifts on modern CPUs.
+ const int32x4_t horiz_const =
+ vdupq_n_s32((1 << (bd + FILTER_BITS - 1)) + (1 << (ROUND0_BITS - 1)));
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
do {
int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10;
uint8x8_t t0, t1, t2, t3;
@@ -1892,33 +2635,26 @@
d0 = convolve12_horiz_4x4_s16(s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10,
s11, x_filter_0_7, x_filter_8_11,
- horiz_const, shift_round_0);
+ horiz_const);
d1 = convolve12_horiz_4x4_s16(s1, s2, s3, s4, s5, s6, s7, s8, s9, s10,
s11, s12, x_filter_0_7, x_filter_8_11,
- horiz_const, shift_round_0);
+ horiz_const);
d2 = convolve12_horiz_4x4_s16(s2, s3, s4, s5, s6, s7, s8, s9, s10, s11,
s12, s13, x_filter_0_7, x_filter_8_11,
- horiz_const, shift_round_0);
+ horiz_const);
d3 = convolve12_horiz_4x4_s16(s3, s4, s5, s6, s7, s8, s9, s10, s11, s12,
s13, s14, x_filter_0_7, x_filter_8_11,
- horiz_const, shift_round_0);
+ horiz_const);
transpose_s16_4x4d(&d0, &d1, &d2, &d3);
if (w == 2) {
- vst1_lane_s32((int32_t *)(d + 0 * dst_stride), vreinterpret_s32_s16(d0),
- 0);
- vst1_lane_s32((int32_t *)(d + 1 * dst_stride), vreinterpret_s32_s16(d1),
- 0);
- vst1_lane_s32((int32_t *)(d + 2 * dst_stride), vreinterpret_s32_s16(d2),
- 0);
- vst1_lane_s32((int32_t *)(d + 3 * dst_stride), vreinterpret_s32_s16(d3),
- 0);
+ store_s16_2x1(d + 0 * dst_stride, d0, 0);
+ store_s16_2x1(d + 1 * dst_stride, d1, 0);
+ store_s16_2x1(d + 2 * dst_stride, d2, 0);
+ store_s16_2x1(d + 3 * dst_stride, d3, 0);
} else {
- vst1_s16((d + 0 * dst_stride), d0);
- vst1_s16((d + 1 * dst_stride), d1);
- vst1_s16((d + 2 * dst_stride), d2);
- vst1_s16((d + 3 * dst_stride), d3);
+ store_s16_4x4(d, dst_stride, d0, d1, d2, d3);
}
s0 = s4;
@@ -1946,177 +2682,21 @@
if (h) {
horiz_filter_12tap_w4_single_row(src_ptr, src_stride, dst_ptr, dst_stride,
w, h, x_filter_0_7, x_filter_8_11,
- horiz_const, shift_round_0);
+ horiz_const);
}
-#else // !defined(__aarch64__)
+#else // !AOM_ARCH_AARCH64
horiz_filter_12tap_w4_single_row(src_ptr, src_stride, dst_ptr, dst_stride, w,
- h, x_filter_0_7, x_filter_8_11, horiz_const,
- shift_round_0);
-#endif // defined(__aarch64__)
+ h, x_filter_0_7, x_filter_8_11, horiz_const);
+#endif // AOM_ARCH_AARCH64
}
-#endif // defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
+#endif // AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD)
-static INLINE void av1_convolve_2d_sr_vert_12tap_neon(
- int16_t *src_ptr, int src_stride, uint8_t *dst_ptr, int dst_stride, int w,
- int h, const int16x8_t y_filter_0_7, const int16x4_t y_filter_8_11,
- ConvolveParams *conv_params) {
- const int bd = 8;
- const int16_t round_bits =
- FILTER_BITS * 2 - conv_params->round_0 - conv_params->round_1;
- const int16x8_t vec_round_bits = vdupq_n_s16(-round_bits);
- const int offset_bits = bd + 2 * FILTER_BITS - conv_params->round_0;
- const int32_t sub_const = (1 << (offset_bits - conv_params->round_1)) +
- (1 << (offset_bits - conv_params->round_1 - 1));
- const int32x4_t round_shift_vec = vdupq_n_s32(-(conv_params->round_1));
- const int32x4_t offset_const = vdupq_n_s32(1 << offset_bits);
- const int32x4_t sub_const_vec = vdupq_n_s32(sub_const);
+#if AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_MATMUL_INT8)
- if (w <= 4) {
- int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
- int16x4_t d0, d1, d2, d3;
- int16x8_t dd01, dd23;
- uint8x8_t d01, d23;
-
- load_s16_4x8(src_ptr, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6, &s7);
- src_ptr += (8 * src_stride);
- load_s16_4x4(src_ptr, src_stride, &s8, &s9, &s10, &s11);
- src_ptr += (3 * src_stride);
-
- do {
- load_s16_4x4(src_ptr, src_stride, &s11, &s12, &s13, &s14);
- src_ptr += 4 * src_stride;
-
- d0 = convolve12_vert_4x4_s32(
- s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, y_filter_0_7,
- y_filter_8_11, round_shift_vec, offset_const, sub_const_vec);
- d1 = convolve12_vert_4x4_s32(
- s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, y_filter_0_7,
- y_filter_8_11, round_shift_vec, offset_const, sub_const_vec);
- d2 = convolve12_vert_4x4_s32(
- s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, y_filter_0_7,
- y_filter_8_11, round_shift_vec, offset_const, sub_const_vec);
- d3 = convolve12_vert_4x4_s32(
- s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14, y_filter_0_7,
- y_filter_8_11, round_shift_vec, offset_const, sub_const_vec);
-
- dd01 = vqrshlq_s16(vcombine_s16(d0, d1), vec_round_bits);
- dd23 = vqrshlq_s16(vcombine_s16(d2, d3), vec_round_bits);
-
- d01 = vqmovun_s16(dd01);
- d23 = vqmovun_s16(dd23);
-
- if (w == 2) {
- vst1_lane_u16((uint16_t *)dst_ptr, vreinterpret_u16_u8(d01), 0);
- dst_ptr += dst_stride;
- vst1_lane_u16((uint16_t *)dst_ptr, vreinterpret_u16_u8(d01), 2);
- dst_ptr += dst_stride;
- if (h != 2) {
- vst1_lane_u16((uint16_t *)dst_ptr, vreinterpret_u16_u8(d23), 0);
- dst_ptr += dst_stride;
- vst1_lane_u16((uint16_t *)dst_ptr, vreinterpret_u16_u8(d23), 2);
- dst_ptr += dst_stride;
- }
- } else {
- vst1_lane_u32((uint32_t *)dst_ptr, vreinterpret_u32_u8(d01), 0);
- dst_ptr += dst_stride;
- vst1_lane_u32((uint32_t *)dst_ptr, vreinterpret_u32_u8(d01), 1);
- dst_ptr += dst_stride;
- if (h != 2) {
- vst1_lane_u32((uint32_t *)dst_ptr, vreinterpret_u32_u8(d23), 0);
- dst_ptr += dst_stride;
- vst1_lane_u32((uint32_t *)dst_ptr, vreinterpret_u32_u8(d23), 1);
- dst_ptr += dst_stride;
- }
- }
-
- s0 = s4;
- s1 = s5;
- s2 = s6;
- s3 = s7;
- s4 = s8;
- s5 = s9;
- s6 = s10;
- s7 = s11;
- s8 = s12;
- s9 = s13;
- s10 = s14;
- h -= 4;
- } while (h > 0);
-
- } else {
- do {
- int16x8_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
- uint8x8_t d0, d1, d2, d3;
-
- int16_t *s = src_ptr;
- uint8_t *d = dst_ptr;
-
- int height = h;
-
- load_s16_8x8(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6, &s7);
- s += (8 * src_stride);
- load_s16_8x4(s, src_stride, &s8, &s9, &s10, &s11);
- s += (3 * src_stride);
-
- do {
- load_s16_8x4(s, src_stride, &s11, &s12, &s13, &s14);
- s += 4 * src_stride;
-
- d0 = convolve12_vert_8x4_s32(s0, s1, s2, s3, s4, s5, s6, s7, s8, s9,
- s10, s11, y_filter_0_7, y_filter_8_11,
- round_shift_vec, offset_const,
- sub_const_vec, vec_round_bits);
- d1 = convolve12_vert_8x4_s32(s1, s2, s3, s4, s5, s6, s7, s8, s9, s10,
- s11, s12, y_filter_0_7, y_filter_8_11,
- round_shift_vec, offset_const,
- sub_const_vec, vec_round_bits);
- d2 = convolve12_vert_8x4_s32(s2, s3, s4, s5, s6, s7, s8, s9, s10, s11,
- s12, s13, y_filter_0_7, y_filter_8_11,
- round_shift_vec, offset_const,
- sub_const_vec, vec_round_bits);
- d3 = convolve12_vert_8x4_s32(s3, s4, s5, s6, s7, s8, s9, s10, s11, s12,
- s13, s14, y_filter_0_7, y_filter_8_11,
- round_shift_vec, offset_const,
- sub_const_vec, vec_round_bits);
-
- vst1_u8(d, d0);
- d += dst_stride;
- vst1_u8(d, d1);
- d += dst_stride;
- if (h != 2) {
- vst1_u8(d, d2);
- d += dst_stride;
- vst1_u8(d, d3);
- d += dst_stride;
- }
-
- s0 = s4;
- s1 = s5;
- s2 = s6;
- s3 = s7;
- s4 = s8;
- s5 = s9;
- s6 = s10;
- s7 = s11;
- s8 = s12;
- s9 = s13;
- s10 = s14;
- height -= 4;
- } while (height > 0);
-
- src_ptr += 8;
- dst_ptr += 8;
- w -= 8;
- } while (w > 0);
- }
-}
-
-#if defined(__aarch64__) && defined(__ARM_FEATURE_MATMUL_INT8)
-
-static INLINE void av1_convolve_2d_sr_horiz_neon(
+static INLINE void convolve_2d_sr_horiz_8tap_neon(
const uint8_t *src, int src_stride, int16_t *im_block, int im_stride, int w,
- int im_h, const int16x8_t x_filter_s16, const int round_0) {
+ int im_h, const int16x8_t x_filter_s16) {
const int bd = 8;
const uint8_t *src_ptr = src;
@@ -2128,13 +2708,14 @@
// Filter values are even, so downshift by 1 to reduce intermediate precision
// requirements.
const int8x8_t x_filter = vshrn_n_s16(x_filter_s16, 1);
- const int32x4_t horiz_const = vdupq_n_s32(1 << (bd + FILTER_BITS - 2));
-
- assert(round_0 > 0);
+ // This shim of 1 << ((ROUND0_BITS - 1) - 1) enables us to use non-rounding
+ // shifts - which are generally faster than rounding shifts on modern CPUs.
+ // The outermost -1 is needed because we halved the filter values.
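+  // With bd == 8 and the AV1 defaults FILTER_BITS == 7 and ROUND0_BITS == 3,
+  // this works out to (1 << 13) + 2: the offset that keeps the intermediate
+  // value non-negative plus the halved rounding bias.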
+ const int32x4_t horiz_const = vdupq_n_s32((1 << (bd + FILTER_BITS - 2)) +
+ (1 << ((ROUND0_BITS - 1) - 1)));
if (w <= 4) {
const uint8x16x2_t permute_tbl = vld1q_u8_x2(dot_prod_permute_tbl);
- const int16x4_t shift_round_0 = vdup_n_s16(-(round_0 - 1));
uint8x16_t s0, s1, s2, s3;
int32x4_t t0, t1, t2, t3;
int16x4_t d0, d1, d2, d3;
@@ -2142,32 +2723,26 @@
do {
assert(height >= 4);
- load_u8_8x16(src_ptr, src_stride, &s0, &s1, &s2, &s3);
+ load_u8_16x4(src_ptr, src_stride, &s0, &s1, &s2, &s3);
t0 = convolve8_4_usdot(s0, x_filter, permute_tbl, horiz_const);
t1 = convolve8_4_usdot(s1, x_filter, permute_tbl, horiz_const);
t2 = convolve8_4_usdot(s2, x_filter, permute_tbl, horiz_const);
t3 = convolve8_4_usdot(s3, x_filter, permute_tbl, horiz_const);
- d0 = vqrshl_s16(vmovn_s32(t0), shift_round_0);
- d1 = vqrshl_s16(vmovn_s32(t1), shift_round_0);
- d2 = vqrshl_s16(vmovn_s32(t2), shift_round_0);
- d3 = vqrshl_s16(vmovn_s32(t3), shift_round_0);
+ // We halved the convolution filter values so -1 from the right shift.
+ d0 = vshrn_n_s32(t0, ROUND0_BITS - 1);
+ d1 = vshrn_n_s32(t1, ROUND0_BITS - 1);
+ d2 = vshrn_n_s32(t2, ROUND0_BITS - 1);
+ d3 = vshrn_n_s32(t3, ROUND0_BITS - 1);
if (w == 2) {
- vst1_lane_u32((uint32_t *)(dst_ptr + 0 * dst_stride),
- vreinterpret_u32_s16(d0), 0);
- vst1_lane_u32((uint32_t *)(dst_ptr + 1 * dst_stride),
- vreinterpret_u32_s16(d1), 0);
- vst1_lane_u32((uint32_t *)(dst_ptr + 2 * dst_stride),
- vreinterpret_u32_s16(d2), 0);
- vst1_lane_u32((uint32_t *)(dst_ptr + 3 * dst_stride),
- vreinterpret_u32_s16(d3), 0);
+ store_s16_2x1(dst_ptr + 0 * dst_stride, d0, 0);
+ store_s16_2x1(dst_ptr + 1 * dst_stride, d1, 0);
+ store_s16_2x1(dst_ptr + 2 * dst_stride, d2, 0);
+ store_s16_2x1(dst_ptr + 3 * dst_stride, d3, 0);
} else {
- vst1_s16(dst_ptr + 0 * dst_stride, d0);
- vst1_s16(dst_ptr + 1 * dst_stride, d1);
- vst1_s16(dst_ptr + 2 * dst_stride, d2);
- vst1_s16(dst_ptr + 3 * dst_stride, d3);
+ store_s16_4x4(dst_ptr, dst_stride, d0, d1, d2, d3);
}
src_ptr += 4 * src_stride;
@@ -2181,10 +2756,11 @@
do {
s0 = vld1q_u8(src_ptr);
t0 = convolve8_4_usdot(s0, x_filter, permute_tbl, horiz_const);
- d0 = vqrshl_s16(vmovn_s32(t0), shift_round_0);
+ // We halved the convolution filter values so -1 from the right shift.
+ d0 = vshrn_n_s32(t0, ROUND0_BITS - 1);
if (w == 2) {
- vst1_lane_u32((uint32_t *)dst_ptr, vreinterpret_u32_s16(d0), 0);
+ store_s16_2x1(dst_ptr, d0, 0);
} else {
vst1_s16(dst_ptr, d0);
}
@@ -2196,7 +2772,6 @@
}
} else {
const uint8x16x3_t permute_tbl = vld1q_u8_x3(dot_prod_permute_tbl);
- const int16x8_t shift_round_0 = vdupq_n_s16(-(round_0 - 1));
uint8x16_t s0, s1, s2, s3;
int16x8_t d0, d1, d2, d3;
@@ -2208,24 +2783,14 @@
int width = w;
do {
- s0 = vld1q_u8(s + 0 * src_stride);
- s1 = vld1q_u8(s + 1 * src_stride);
- s2 = vld1q_u8(s + 2 * src_stride);
- s3 = vld1q_u8(s + 3 * src_stride);
+ load_u8_16x4(s, src_stride, &s0, &s1, &s2, &s3);
- d0 = convolve8_8_usdot(s0, x_filter, permute_tbl, horiz_const,
- shift_round_0);
- d1 = convolve8_8_usdot(s1, x_filter, permute_tbl, horiz_const,
- shift_round_0);
- d2 = convolve8_8_usdot(s2, x_filter, permute_tbl, horiz_const,
- shift_round_0);
- d3 = convolve8_8_usdot(s3, x_filter, permute_tbl, horiz_const,
- shift_round_0);
+ d0 = convolve8_horiz_8_usdot(s0, x_filter, permute_tbl, horiz_const);
+ d1 = convolve8_horiz_8_usdot(s1, x_filter, permute_tbl, horiz_const);
+ d2 = convolve8_horiz_8_usdot(s2, x_filter, permute_tbl, horiz_const);
+ d3 = convolve8_horiz_8_usdot(s3, x_filter, permute_tbl, horiz_const);
- vst1q_s16(d + 0 * dst_stride, d0);
- vst1q_s16(d + 1 * dst_stride, d1);
- vst1q_s16(d + 2 * dst_stride, d2);
- vst1q_s16(d + 3 * dst_stride, d3);
+ store_s16_8x4(d, dst_stride, d0, d1, d2, d3);
s += 8;
d += 8;
@@ -2247,8 +2812,7 @@
do {
s0 = vld1q_u8(s);
- d0 = convolve8_8_usdot(s0, x_filter, permute_tbl, horiz_const,
- shift_round_0);
+ d0 = convolve8_horiz_8_usdot(s0, x_filter, permute_tbl, horiz_const);
vst1q_s16(d, d0);
s += 8;
@@ -2264,11 +2828,11 @@
}
}
-#elif defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
+#elif AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD)
-static INLINE void av1_convolve_2d_sr_horiz_neon(
+static INLINE void convolve_2d_sr_horiz_8tap_neon(
const uint8_t *src, int src_stride, int16_t *im_block, int im_stride, int w,
- int im_h, const int16x8_t x_filter_s16, const int round_0) {
+ int im_h, const int16x8_t x_filter_s16) {
const int bd = 8;
const uint8_t *src_ptr = src;
@@ -2280,18 +2844,18 @@
// Filter values are even, so downshift by 1 to reduce intermediate precision
// requirements.
const int8x8_t x_filter = vshrn_n_s16(x_filter_s16, 1);
- const int32_t horiz_const = (1 << (bd + FILTER_BITS - 2));
+ // This shim of 1 << ((ROUND0_BITS - 1) - 1) enables us to use non-rounding
+ // shifts - which are generally faster than rounding shifts on modern CPUs.
+ // The outermost -1 is needed because we halved the filter values.
+ const int32_t horiz_const =
+ ((1 << (bd + FILTER_BITS - 2)) + (1 << ((ROUND0_BITS - 1) - 1)));
// Dot product constants.
const int16x8_t correct_tmp = vshlq_n_s16(x_filter_s16, 6);
- const int32x4_t correction =
- vdupq_n_s32(vaddlvq_s16(correct_tmp) + horiz_const);
+ int32x4_t correction = vdupq_n_s32(vaddlvq_s16(correct_tmp) + horiz_const);
const uint8x16_t range_limit = vdupq_n_u8(128);
- assert(round_0 > 0);
-
if (w <= 4) {
const uint8x16x2_t permute_tbl = vld1q_u8_x2(dot_prod_permute_tbl);
- const int16x4_t shift_round_0 = vdup_n_s16(-(round_0 - 1));
uint8x16_t s0, s1, s2, s3;
int32x4_t t0, t1, t2, t3;
int16x4_t d0, d1, d2, d3;
@@ -2299,32 +2863,26 @@
do {
assert(height >= 4);
- load_u8_8x16(src_ptr, src_stride, &s0, &s1, &s2, &s3);
+ load_u8_16x4(src_ptr, src_stride, &s0, &s1, &s2, &s3);
t0 = convolve8_4_sdot(s0, x_filter, correction, range_limit, permute_tbl);
t1 = convolve8_4_sdot(s1, x_filter, correction, range_limit, permute_tbl);
t2 = convolve8_4_sdot(s2, x_filter, correction, range_limit, permute_tbl);
t3 = convolve8_4_sdot(s3, x_filter, correction, range_limit, permute_tbl);
- d0 = vqrshl_s16(vmovn_s32(t0), shift_round_0);
- d1 = vqrshl_s16(vmovn_s32(t1), shift_round_0);
- d2 = vqrshl_s16(vmovn_s32(t2), shift_round_0);
- d3 = vqrshl_s16(vmovn_s32(t3), shift_round_0);
+ // We halved the convolution filter values so -1 from the right shift.
+ d0 = vshrn_n_s32(t0, ROUND0_BITS - 1);
+ d1 = vshrn_n_s32(t1, ROUND0_BITS - 1);
+ d2 = vshrn_n_s32(t2, ROUND0_BITS - 1);
+ d3 = vshrn_n_s32(t3, ROUND0_BITS - 1);
if (w == 2) {
- vst1_lane_u32((uint32_t *)(dst_ptr + 0 * dst_stride),
- vreinterpret_u32_s16(d0), 0);
- vst1_lane_u32((uint32_t *)(dst_ptr + 1 * dst_stride),
- vreinterpret_u32_s16(d1), 0);
- vst1_lane_u32((uint32_t *)(dst_ptr + 2 * dst_stride),
- vreinterpret_u32_s16(d2), 0);
- vst1_lane_u32((uint32_t *)(dst_ptr + 3 * dst_stride),
- vreinterpret_u32_s16(d3), 0);
+ store_s16_2x1(dst_ptr + 0 * dst_stride, d0, 0);
+ store_s16_2x1(dst_ptr + 1 * dst_stride, d1, 0);
+ store_s16_2x1(dst_ptr + 2 * dst_stride, d2, 0);
+ store_s16_2x1(dst_ptr + 3 * dst_stride, d3, 0);
} else {
- vst1_s16(dst_ptr + 0 * dst_stride, d0);
- vst1_s16(dst_ptr + 1 * dst_stride, d1);
- vst1_s16(dst_ptr + 2 * dst_stride, d2);
- vst1_s16(dst_ptr + 3 * dst_stride, d3);
+ store_s16_4x4(dst_ptr, dst_stride, d0, d1, d2, d3);
}
src_ptr += 4 * src_stride;
@@ -2339,10 +2897,11 @@
s0 = vld1q_u8(src_ptr);
t0 = convolve8_4_sdot(s0, x_filter, correction, range_limit,
permute_tbl);
- d0 = vqrshl_s16(vmovn_s32(t0), shift_round_0);
+ // We halved the convolution filter values so -1 from the right shift.
+ d0 = vshrn_n_s32(t0, ROUND0_BITS - 1);
if (w == 2) {
- vst1_lane_u32((uint32_t *)dst_ptr, vreinterpret_u32_s16(d0), 0);
+ store_s16_2x1(dst_ptr, d0, 0);
} else {
vst1_s16(dst_ptr, d0);
}
@@ -2354,7 +2913,6 @@
}
} else {
const uint8x16x3_t permute_tbl = vld1q_u8_x3(dot_prod_permute_tbl);
- const int16x8_t shift_round_0 = vdupq_n_s16(-(round_0 - 1));
uint8x16_t s0, s1, s2, s3;
int16x8_t d0, d1, d2, d3;
@@ -2366,24 +2924,18 @@
int width = w;
do {
- s0 = vld1q_u8(s + 0 * src_stride);
- s1 = vld1q_u8(s + 1 * src_stride);
- s2 = vld1q_u8(s + 2 * src_stride);
- s3 = vld1q_u8(s + 3 * src_stride);
+ load_u8_16x4(s, src_stride, &s0, &s1, &s2, &s3);
- d0 = convolve8_8_sdot(s0, x_filter, correction, range_limit,
- permute_tbl, shift_round_0);
- d1 = convolve8_8_sdot(s1, x_filter, correction, range_limit,
- permute_tbl, shift_round_0);
- d2 = convolve8_8_sdot(s2, x_filter, correction, range_limit,
- permute_tbl, shift_round_0);
- d3 = convolve8_8_sdot(s3, x_filter, correction, range_limit,
- permute_tbl, shift_round_0);
+ d0 = convolve8_horiz_8_sdot(s0, x_filter, correction, range_limit,
+ permute_tbl);
+ d1 = convolve8_horiz_8_sdot(s1, x_filter, correction, range_limit,
+ permute_tbl);
+ d2 = convolve8_horiz_8_sdot(s2, x_filter, correction, range_limit,
+ permute_tbl);
+ d3 = convolve8_horiz_8_sdot(s3, x_filter, correction, range_limit,
+ permute_tbl);
- vst1q_s16(d + 0 * dst_stride, d0);
- vst1q_s16(d + 1 * dst_stride, d1);
- vst1q_s16(d + 2 * dst_stride, d2);
- vst1q_s16(d + 3 * dst_stride, d3);
+ store_s16_8x4(d, dst_stride, d0, d1, d2, d3);
s += 8;
d += 8;
@@ -2406,7 +2958,9 @@
do {
s0 = vld1q_u8(s);
d0 = convolve8_8_sdot(s0, x_filter, correction, range_limit,
- permute_tbl, shift_round_0);
+ permute_tbl, vdupq_n_s16(0));
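+      // Passing a zero shift vector makes the helper's final saturating
+      // rounding shift a no-op, letting this single-row tail reuse
+      // convolve8_8_sdot and apply the cheaper immediate shift below.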
+ // We halved the convolution filter values so -1 from the right shift.
+ d0 = vshrq_n_s16(d0, ROUND0_BITS - 1);
vst1q_s16(d, d0);
s += 8;
@@ -2422,14 +2976,16 @@
}
}
-#else // !(defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD))
+#else // !(AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD))
// Horizontal filtering for convolve_2d_sr for width multiple of 8
// Processes one row at a time
-static INLINE void horiz_filter_w8_single_row(
- const uint8_t *src_ptr, int src_stride, int16_t *dst_ptr,
- const int dst_stride, int width, int height, const int16x8_t x_filter,
- const int16x8_t horiz_const, const int16x8_t shift_round_0) {
+static INLINE void horiz_filter_w8_single_row(const uint8_t *src_ptr,
+ int src_stride, int16_t *dst_ptr,
+ const int dst_stride, int width,
+ int height,
+ const int16x8_t x_filter,
+ const int16x8_t horiz_const) {
int16x8_t s0, s1, s2, s3, s4, s5, s6, s7;
do {
uint8x8_t t0 = vld1_u8(src_ptr);
@@ -2455,8 +3011,8 @@
s6 = vextq_s16(sum, s7, 6); // a6 a7 a8 a9 a10 a11 a12 a13
s7 = vextq_s16(sum, s7, 7); // a7 a8 a9 a10 a11 a12 a13 a14
- int16x8_t res0 = convolve8_8x8_s16(sum, s1, s2, s3, s4, s5, s6, s7,
- x_filter, horiz_const, shift_round_0);
+ int16x8_t res0 = convolve8_horiz_8x8_s16(sum, s1, s2, s3, s4, s5, s6, s7,
+ x_filter, horiz_const);
vst1q_s16(dst_tmp, res0);
@@ -2472,10 +3028,12 @@
// Horizontal filtering for convolve_2d_sr for width <= 4
// Processes one row at a time
-static INLINE void horiz_filter_w4_single_row(
- const uint8_t *src_ptr, int src_stride, int16_t *dst_ptr,
- const int dst_stride, int width, int height, const int16x8_t x_filter,
- const int16x4_t horiz_const, const int16x4_t shift_round_0) {
+static INLINE void horiz_filter_w4_single_row(const uint8_t *src_ptr,
+ int src_stride, int16_t *dst_ptr,
+ const int dst_stride, int width,
+ int height,
+ const int16x8_t x_filter,
+ const int16x4_t horiz_const) {
int16x4_t s0, s1, s2, s3, s4, s5, s6, s7;
do {
const uint8_t *s = src_ptr;
@@ -2500,11 +3058,11 @@
s6 = vext_s16(s4, s7, 2); // a6 a7 a8 a9
s7 = vext_s16(s4, s7, 3); // a7 a8 a9 a10
- int16x4_t d0 = convolve8_4x4_s16(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
- horiz_const, shift_round_0);
+ int16x4_t d0 = convolve8_horiz_4x4_s16(s0, s1, s2, s3, s4, s5, s6, s7,
+ x_filter, horiz_const);
if (width == 2) {
- vst1_lane_u32((uint32_t *)dst_ptr, vreinterpret_u32_s16(d0), 0);
+ store_s16_2x1(dst_ptr, d0, 0);
} else {
vst1_s16(dst_ptr, d0);
}
@@ -2515,9 +3073,9 @@
} while (height > 0);
}
-static INLINE void av1_convolve_2d_sr_horiz_neon(
+static INLINE void convolve_2d_sr_horiz_8tap_neon(
const uint8_t *src, int src_stride, int16_t *im_block, int im_stride, int w,
- int im_h, const int16x8_t x_filter_s16, const int round_0) {
+ int im_h, const int16x8_t x_filter_s16) {
const int bd = 8;
const uint8_t *src_ptr = src;
@@ -2530,13 +3088,14 @@
// requirements.
const int16x8_t x_filter = vshrq_n_s16(x_filter_s16, 1);
- assert(round_0 > 0);
-
if (w <= 4) {
- const int16x4_t horiz_const = vdup_n_s16((1 << (bd + FILTER_BITS - 2)));
- const int16x4_t shift_round_0 = vdup_n_s16(-(round_0 - 1));
+ // This shim of 1 << ((ROUND0_BITS - 1) - 1) enables us to use non-rounding
+ // shifts - which are generally faster than rounding shifts on modern CPUs.
+ // The outermost -1 is needed because we halved the filter values.
+ const int16x4_t horiz_const = vdup_n_s16((1 << (bd + FILTER_BITS - 2)) +
+ (1 << ((ROUND0_BITS - 1) - 1)));
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
do {
int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, d0, d1, d2, d3;
uint8x8_t t0, t1, t2, t3;
@@ -2565,31 +3124,24 @@
s9 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t2)));
s10 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t3)));
- d0 = convolve8_4x4_s16(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
- horiz_const, shift_round_0);
- d1 = convolve8_4x4_s16(s1, s2, s3, s4, s5, s6, s7, s8, x_filter,
- horiz_const, shift_round_0);
- d2 = convolve8_4x4_s16(s2, s3, s4, s5, s6, s7, s8, s9, x_filter,
- horiz_const, shift_round_0);
- d3 = convolve8_4x4_s16(s3, s4, s5, s6, s7, s8, s9, s10, x_filter,
- horiz_const, shift_round_0);
+ d0 = convolve8_horiz_4x4_s16(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
+ horiz_const);
+ d1 = convolve8_horiz_4x4_s16(s1, s2, s3, s4, s5, s6, s7, s8, x_filter,
+ horiz_const);
+ d2 = convolve8_horiz_4x4_s16(s2, s3, s4, s5, s6, s7, s8, s9, x_filter,
+ horiz_const);
+ d3 = convolve8_horiz_4x4_s16(s3, s4, s5, s6, s7, s8, s9, s10, x_filter,
+ horiz_const);
transpose_s16_4x4d(&d0, &d1, &d2, &d3);
if (w == 2) {
- vst1_lane_u32((uint32_t *)(dst_ptr + 0 * dst_stride),
- vreinterpret_u32_s16(d0), 0);
- vst1_lane_u32((uint32_t *)(dst_ptr + 1 * dst_stride),
- vreinterpret_u32_s16(d1), 0);
- vst1_lane_u32((uint32_t *)(dst_ptr + 2 * dst_stride),
- vreinterpret_u32_s16(d2), 0);
- vst1_lane_u32((uint32_t *)(dst_ptr + 3 * dst_stride),
- vreinterpret_u32_s16(d3), 0);
+ store_s16_2x1(dst_ptr + 0 * dst_stride, d0, 0);
+ store_s16_2x1(dst_ptr + 1 * dst_stride, d1, 0);
+ store_s16_2x1(dst_ptr + 2 * dst_stride, d2, 0);
+ store_s16_2x1(dst_ptr + 3 * dst_stride, d3, 0);
} else {
- vst1_s16((dst_ptr + 0 * dst_stride), d0);
- vst1_s16((dst_ptr + 1 * dst_stride), d1);
- vst1_s16((dst_ptr + 2 * dst_stride), d2);
- vst1_s16((dst_ptr + 3 * dst_stride), d3);
+ store_s16_4x4(dst_ptr, dst_stride, d0, d1, d2, d3);
}
src_ptr += 4 * src_stride;
@@ -2600,19 +3152,22 @@
if (height) {
assert(height < 4);
horiz_filter_w4_single_row(src_ptr, src_stride, dst_ptr, dst_stride, w,
- height, x_filter, horiz_const, shift_round_0);
+ height, x_filter, horiz_const);
}
-#else // !defined(__aarch64__)
+#else // !AOM_ARCH_AARCH64
horiz_filter_w4_single_row(src_ptr, src_stride, dst_ptr, dst_stride, w,
- height, x_filter, horiz_const, shift_round_0);
-#endif // defined(__aarch64__)
+ height, x_filter, horiz_const);
+#endif // AOM_ARCH_AARCH64
} else {
- const int16x8_t horiz_const = vdupq_n_s16((1 << (bd + FILTER_BITS - 2)));
- const int16x8_t shift_round_0 = vdupq_n_s16(-(round_0 - 1));
+ // This shim of 1 << ((ROUND0_BITS - 1) - 1) enables us to use non-rounding
+ // shifts - which are generally faster than rounding shifts on modern CPUs.
+ // The outermost -1 is needed because we halved the filter values.
+ const int16x8_t horiz_const = vdupq_n_s16((1 << (bd + FILTER_BITS - 2)) +
+ (1 << ((ROUND0_BITS - 1) - 1)));
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
for (; height >= 8; height -= 8) {
int16x8_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14,
@@ -2651,22 +3206,22 @@
s13 = vreinterpretq_s16_u16(vmovl_u8(t6));
s14 = vreinterpretq_s16_u16(vmovl_u8(t7));
- d0 = convolve8_8x8_s16(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
- horiz_const, shift_round_0);
- d1 = convolve8_8x8_s16(s1, s2, s3, s4, s5, s6, s7, s8, x_filter,
- horiz_const, shift_round_0);
- d2 = convolve8_8x8_s16(s2, s3, s4, s5, s6, s7, s8, s9, x_filter,
- horiz_const, shift_round_0);
- d3 = convolve8_8x8_s16(s3, s4, s5, s6, s7, s8, s9, s10, x_filter,
- horiz_const, shift_round_0);
- d4 = convolve8_8x8_s16(s4, s5, s6, s7, s8, s9, s10, s11, x_filter,
- horiz_const, shift_round_0);
- d5 = convolve8_8x8_s16(s5, s6, s7, s8, s9, s10, s11, s12, x_filter,
- horiz_const, shift_round_0);
- d6 = convolve8_8x8_s16(s6, s7, s8, s9, s10, s11, s12, s13, x_filter,
- horiz_const, shift_round_0);
- d7 = convolve8_8x8_s16(s7, s8, s9, s10, s11, s12, s13, s14, x_filter,
- horiz_const, shift_round_0);
+ d0 = convolve8_horiz_8x8_s16(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
+ horiz_const);
+ d1 = convolve8_horiz_8x8_s16(s1, s2, s3, s4, s5, s6, s7, s8, x_filter,
+ horiz_const);
+ d2 = convolve8_horiz_8x8_s16(s2, s3, s4, s5, s6, s7, s8, s9, x_filter,
+ horiz_const);
+ d3 = convolve8_horiz_8x8_s16(s3, s4, s5, s6, s7, s8, s9, s10, x_filter,
+ horiz_const);
+ d4 = convolve8_horiz_8x8_s16(s4, s5, s6, s7, s8, s9, s10, s11, x_filter,
+ horiz_const);
+ d5 = convolve8_horiz_8x8_s16(s5, s6, s7, s8, s9, s10, s11, s12,
+ x_filter, horiz_const);
+ d6 = convolve8_horiz_8x8_s16(s6, s7, s8, s9, s10, s11, s12, s13,
+ x_filter, horiz_const);
+ d7 = convolve8_horiz_8x8_s16(s7, s8, s9, s10, s11, s12, s13, s14,
+ x_filter, horiz_const);
transpose_s16_8x8(&d0, &d1, &d2, &d3, &d4, &d5, &d6, &d7);
@@ -2741,10 +3296,11 @@
d2 = vaddq_s16(d2, horiz_const);
d3 = vaddq_s16(d3, horiz_const);
- d0 = vqrshlq_s16(d0, shift_round_0);
- d1 = vqrshlq_s16(d1, shift_round_0);
- d2 = vqrshlq_s16(d2, shift_round_0);
- d3 = vqrshlq_s16(d3, shift_round_0);
+ // We halved the convolution filter values so -1 from the right shift.
+ d0 = vshrq_n_s16(d0, ROUND0_BITS - 1);
+ d1 = vshrq_n_s16(d1, ROUND0_BITS - 1);
+ d2 = vshrq_n_s16(d2, ROUND0_BITS - 1);
+ d3 = vshrq_n_s16(d3, ROUND0_BITS - 1);
store_s16_8x4(d, dst_stride, d0, d1, d2, d3);
@@ -2767,92 +3323,141 @@
if (height) {
assert(height < 4);
horiz_filter_w8_single_row(src_ptr, src_stride, dst_ptr, dst_stride, w,
- height, x_filter, horiz_const, shift_round_0);
+ height, x_filter, horiz_const);
}
-#else // !defined(__aarch64__)
+#else // !AOM_ARCH_AARCH64
horiz_filter_w8_single_row(src_ptr, src_stride, dst_ptr, dst_stride, w,
- height, x_filter, horiz_const, shift_round_0);
-#endif // defined(__aarch64__)
+ height, x_filter, horiz_const);
+#endif // AOM_ARCH_AARCH64
}
}
-#endif // defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
+#endif // AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD)
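
For concreteness, with the constants used in this file (bd = 8, FILTER_BITS = 7, ROUND0_BITS = 3) the combined horizontal offset horiz_const defined above evaluates to:

    // 1 << (bd + FILTER_BITS - 2)   = 1 << 13 = 8192  (half the usual
    //                                 1 << 14, since the taps are halved)
    // 1 << ((ROUND0_BITS - 1) - 1)  = 1 << 1  = 2     (rounding shim)
    const int16x8_t horiz_const = vdupq_n_s16(8192 + 2);

The large power-of-two term is the standard convolve offset that keeps the 16-bit intermediate values non-negative; it is removed again after the vertical pass (see the sub_const arithmetic below).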
-static INLINE void av1_convolve_2d_sr_vert_8tap_neon(
+static INLINE int32x4_t convolve12_vert_4_s32(
+ const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
+ const int16x4_t s3, const int16x4_t s4, const int16x4_t s5,
+ const int16x4_t s6, const int16x4_t s7, const int16x4_t s8,
+ const int16x4_t s9, const int16x4_t s10, const int16x4_t s11,
+ const int16x8_t y_filter_0_7, const int16x4_t y_filter_8_11) {
+ const int16x4_t y_filter_0_3 = vget_low_s16(y_filter_0_7);
+ const int16x4_t y_filter_4_7 = vget_high_s16(y_filter_0_7);
+ int32x4_t sum;
+
+ sum = vmull_lane_s16(s0, y_filter_0_3, 0);
+ sum = vmlal_lane_s16(sum, s1, y_filter_0_3, 1);
+ sum = vmlal_lane_s16(sum, s2, y_filter_0_3, 2);
+ sum = vmlal_lane_s16(sum, s3, y_filter_0_3, 3);
+ sum = vmlal_lane_s16(sum, s4, y_filter_4_7, 0);
+ sum = vmlal_lane_s16(sum, s5, y_filter_4_7, 1);
+ sum = vmlal_lane_s16(sum, s6, y_filter_4_7, 2);
+ sum = vmlal_lane_s16(sum, s7, y_filter_4_7, 3);
+ sum = vmlal_lane_s16(sum, s8, y_filter_8_11, 0);
+ sum = vmlal_lane_s16(sum, s9, y_filter_8_11, 1);
+ sum = vmlal_lane_s16(sum, s10, y_filter_8_11, 2);
+ sum = vmlal_lane_s16(sum, s11, y_filter_8_11, 3);
+
+ return sum;
+}
+
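convolve12_vert_4_s32 takes the 12-tap filter as an 8-lane vector plus a 4-lane tail so that every multiply can use the lane-indexed vmlal_lane_s16 form. The split mirrors how the caller loads the kernel later in this patch:

    const int16x8_t y_filter_0_7 = vld1q_s16(y_filter_ptr);      // taps 0..7
    const int16x4_t y_filter_8_11 = vld1_s16(y_filter_ptr + 8);  // taps 8..11

vmlal_lane_s16 only accepts a 64-bit (4-lane) vector as its lane operand, which is also why the 8-lane half is first carved into y_filter_0_3 and y_filter_4_7 with vget_low_s16/vget_high_s16.
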
+static INLINE uint8x8_t convolve12_vert_8_s32(
+ const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
+ const int16x8_t s3, const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t s6, const int16x8_t s7, const int16x8_t s8,
+ const int16x8_t s9, const int16x8_t s10, const int16x8_t s11,
+ const int16x8_t y_filter_0_7, const int16x4_t y_filter_8_11,
+ const int16x8_t sub_const) {
+ const int16x4_t y_filter_0_3 = vget_low_s16(y_filter_0_7);
+ const int16x4_t y_filter_4_7 = vget_high_s16(y_filter_0_7);
+ int32x4_t sum0, sum1;
+ int16x8_t res;
+
+ sum0 = vmull_lane_s16(vget_low_s16(s0), y_filter_0_3, 0);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s1), y_filter_0_3, 1);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s2), y_filter_0_3, 2);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s3), y_filter_0_3, 3);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s4), y_filter_4_7, 0);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s5), y_filter_4_7, 1);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s6), y_filter_4_7, 2);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s7), y_filter_4_7, 3);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s8), y_filter_8_11, 0);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s9), y_filter_8_11, 1);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s10), y_filter_8_11, 2);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s11), y_filter_8_11, 3);
+
+ sum1 = vmull_lane_s16(vget_high_s16(s0), y_filter_0_3, 0);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s1), y_filter_0_3, 1);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s2), y_filter_0_3, 2);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s3), y_filter_0_3, 3);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s4), y_filter_4_7, 0);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s5), y_filter_4_7, 1);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s6), y_filter_4_7, 2);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s7), y_filter_4_7, 3);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s8), y_filter_8_11, 0);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s9), y_filter_8_11, 1);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s10), y_filter_8_11, 2);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s11), y_filter_8_11, 3);
+
+ res = vcombine_s16(vqrshrn_n_s32(sum0, 2 * FILTER_BITS - ROUND0_BITS),
+ vqrshrn_n_s32(sum1, 2 * FILTER_BITS - ROUND0_BITS));
+ res = vsubq_s16(res, sub_const);
+
+ return vqmovun_s16(res);
+}
+
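The single sub_const = 1 << (bd - 1) that convolve12_vert_8_s32 above subtracts (and which the vertical kernels below pass in) replaces the offset_bits/round_1 bookkeeping of the deleted code; with the round amounts fixed at compile time the two are arithmetically identical. Tracking the offset through both passes (bd = 8, FILTER_BITS = 7, ROUND0_BITS = 3):

    offset in im_block after the horizontal pass:
        1 << (bd + FILTER_BITS - 1 - ROUND0_BITS)          = 1 << 11
    after the vertical filter (tap sum is 1 << FILTER_BITS):
        (1 << 11) * (1 << 7)                               = 1 << 18
    after vqrshrn by 2 * FILTER_BITS - ROUND0_BITS = 11:
        1 << (18 - 11)                                     = 1 << 7

so exactly 1 << (bd - 1) = 128 of offset survives into the narrowed result, and a single vsubq_s16 removes it before vqmovun_s16 saturates to 8 bits.
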
+static INLINE void convolve_2d_sr_vert_12tap_neon(
int16_t *src_ptr, int src_stride, uint8_t *dst_ptr, int dst_stride, int w,
- int h, const int16x8_t y_filter, ConvolveParams *conv_params) {
+ int h, const int16x8_t y_filter_0_7, const int16x4_t y_filter_8_11) {
const int bd = 8;
- const int16_t round_bits =
- FILTER_BITS * 2 - conv_params->round_0 - conv_params->round_1;
- const int16x8_t vec_round_bits = vdupq_n_s16(-round_bits);
- const int offset_bits = bd + 2 * FILTER_BITS - conv_params->round_0;
-
- const int32_t sub_const = (1 << (offset_bits - conv_params->round_1)) +
- (1 << (offset_bits - conv_params->round_1 - 1));
-
- const int32x4_t round_shift_vec = vdupq_n_s32(-(conv_params->round_1));
- const int32x4_t offset_const = vdupq_n_s32(1 << offset_bits);
- const int32x4_t sub_const_vec = vdupq_n_s32(sub_const);
+ const int16x8_t sub_const = vdupq_n_s16(1 << (bd - 1));
if (w <= 4) {
- int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, d0;
- int16x8_t dd0;
- uint8x8_t d01;
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
+ int32x4_t d0, d1, d2, d3;
+ int16x8_t dd01, dd23;
+ uint8x8_t d01, d23;
-#if defined(__aarch64__)
- int16x4_t s8, s9, s10, d1, d2, d3;
- int16x8_t dd1;
- uint8x8_t d23;
-#endif // defined(__aarch64__)
-
- int16_t *s = src_ptr;
- uint8_t *d = dst_ptr;
-
- load_s16_4x8(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6, &s7);
- s += (7 * src_stride);
+ load_s16_4x11(src_ptr, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6, &s7,
+ &s8, &s9, &s10);
+ src_ptr += 11 * src_stride;
do {
-#if defined(__aarch64__)
- load_s16_4x4(s, src_stride, &s7, &s8, &s9, &s10);
- s += (4 * src_stride);
+ load_s16_4x4(src_ptr, src_stride, &s11, &s12, &s13, &s14);
- d0 = convolve8_vert_4x4_s32(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
- round_shift_vec, offset_const, sub_const_vec);
- d1 = convolve8_vert_4x4_s32(s1, s2, s3, s4, s5, s6, s7, s8, y_filter,
- round_shift_vec, offset_const, sub_const_vec);
- d2 = convolve8_vert_4x4_s32(s2, s3, s4, s5, s6, s7, s8, s9, y_filter,
- round_shift_vec, offset_const, sub_const_vec);
- d3 = convolve8_vert_4x4_s32(s3, s4, s5, s6, s7, s8, s9, s10, y_filter,
- round_shift_vec, offset_const, sub_const_vec);
+ d0 = convolve12_vert_4_s32(s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10,
+ s11, y_filter_0_7, y_filter_8_11);
+ d1 = convolve12_vert_4_s32(s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11,
+ s12, y_filter_0_7, y_filter_8_11);
+ d2 = convolve12_vert_4_s32(s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12,
+ s13, y_filter_0_7, y_filter_8_11);
+ d3 = convolve12_vert_4_s32(s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13,
+ s14, y_filter_0_7, y_filter_8_11);
- dd0 = vqrshlq_s16(vcombine_s16(d0, d1), vec_round_bits);
- dd1 = vqrshlq_s16(vcombine_s16(d2, d3), vec_round_bits);
+ dd01 = vcombine_s16(vqrshrn_n_s32(d0, 2 * FILTER_BITS - ROUND0_BITS),
+ vqrshrn_n_s32(d1, 2 * FILTER_BITS - ROUND0_BITS));
+ dd23 = vcombine_s16(vqrshrn_n_s32(d2, 2 * FILTER_BITS - ROUND0_BITS),
+ vqrshrn_n_s32(d3, 2 * FILTER_BITS - ROUND0_BITS));
- d01 = vqmovun_s16(dd0);
- d23 = vqmovun_s16(dd1);
+ dd01 = vsubq_s16(dd01, sub_const);
+ dd23 = vsubq_s16(dd23, sub_const);
- if (w == 4) {
- vst1_lane_u32((uint32_t *)d, vreinterpret_u32_u8(d01), 0);
- d += dst_stride;
- vst1_lane_u32((uint32_t *)d, vreinterpret_u32_u8(d01), 1);
- d += dst_stride;
+ d01 = vqmovun_s16(dd01);
+ d23 = vqmovun_s16(dd23);
+
+ if (w == 2) {
+ store_u8_2x1(dst_ptr + 0 * dst_stride, d01, 0);
+ store_u8_2x1(dst_ptr + 1 * dst_stride, d01, 2);
if (h != 2) {
- vst1_lane_u32((uint32_t *)d, vreinterpret_u32_u8(d23), 0);
- d += dst_stride;
- vst1_lane_u32((uint32_t *)d, vreinterpret_u32_u8(d23), 1);
- d += dst_stride;
+ store_u8_2x1(dst_ptr + 2 * dst_stride, d23, 0);
+ store_u8_2x1(dst_ptr + 3 * dst_stride, d23, 2);
}
} else {
- vst1_lane_u16((uint16_t *)d, vreinterpret_u16_u8(d01), 0);
- d += dst_stride;
- vst1_lane_u16((uint16_t *)d, vreinterpret_u16_u8(d01), 2);
- d += dst_stride;
+ store_u8_4x1(dst_ptr + 0 * dst_stride, d01, 0);
+ store_u8_4x1(dst_ptr + 1 * dst_stride, d01, 1);
if (h != 2) {
- vst1_lane_u16((uint16_t *)d, vreinterpret_u16_u8(d23), 0);
- d += dst_stride;
- vst1_lane_u16((uint16_t *)d, vreinterpret_u16_u8(d23), 2);
- d += dst_stride;
+ store_u8_4x1(dst_ptr + 2 * dst_stride, d23, 0);
+ store_u8_4x1(dst_ptr + 3 * dst_stride, d23, 1);
}
}
@@ -2863,79 +3468,47 @@
s4 = s8;
s5 = s9;
s6 = s10;
+ s7 = s11;
+ s8 = s12;
+ s9 = s13;
+ s10 = s14;
+ src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
h -= 4;
-#else // !defined(__aarch64__)
- s7 = vld1_s16(s);
- s += src_stride;
-
- d0 = convolve8_vert_4x4_s32(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
- round_shift_vec, offset_const, sub_const_vec);
-
- dd0 = vqrshlq_s16(vcombine_s16(d0, d0), vec_round_bits);
- d01 = vqmovun_s16(dd0);
-
- if (w == 2) {
- vst1_lane_u16((uint16_t *)d, vreinterpret_u16_u8(d01), 0);
- d += dst_stride;
- } else {
- vst1_lane_u32((uint32_t *)d, vreinterpret_u32_u8(d01), 0);
- d += dst_stride;
- }
-
- s0 = s1;
- s1 = s2;
- s2 = s3;
- s3 = s4;
- s4 = s5;
- s5 = s6;
- s6 = s7;
- h--;
-#endif // defined(__aarch64__)
} while (h > 0);
- } else {
- // if width is a multiple of 8 & height is a multiple of 4
- int16x8_t s0, s1, s2, s3, s4, s5, s6, s7;
- uint8x8_t d0;
-#if defined(__aarch64__)
- int16x8_t s8, s9, s10;
- uint8x8_t d1, d2, d3;
-#endif // defined(__aarch64__)
+ } else {
do {
- int height = h;
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
+ uint8x8_t d0, d1, d2, d3;
+
int16_t *s = src_ptr;
uint8_t *d = dst_ptr;
- load_s16_8x8(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6, &s7);
- s += (7 * src_stride);
+ int height = h;
+
+ load_s16_8x11(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6, &s7, &s8,
+ &s9, &s10);
+ s += 11 * src_stride;
do {
-#if defined(__aarch64__)
- load_s16_8x4(s, src_stride, &s7, &s8, &s9, &s10);
- s += (4 * src_stride);
+ load_s16_8x4(s, src_stride, &s11, &s12, &s13, &s14);
- d0 = convolve8_vert_8x4_s32(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
- round_shift_vec, offset_const,
- sub_const_vec, vec_round_bits);
- d1 = convolve8_vert_8x4_s32(s1, s2, s3, s4, s5, s6, s7, s8, y_filter,
- round_shift_vec, offset_const,
- sub_const_vec, vec_round_bits);
- d2 = convolve8_vert_8x4_s32(s2, s3, s4, s5, s6, s7, s8, s9, y_filter,
- round_shift_vec, offset_const,
- sub_const_vec, vec_round_bits);
- d3 = convolve8_vert_8x4_s32(s3, s4, s5, s6, s7, s8, s9, s10, y_filter,
- round_shift_vec, offset_const,
- sub_const_vec, vec_round_bits);
+ d0 = convolve12_vert_8_s32(s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10,
+ s11, y_filter_0_7, y_filter_8_11, sub_const);
+ d1 = convolve12_vert_8_s32(s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11,
+ s12, y_filter_0_7, y_filter_8_11, sub_const);
+        d2 = convolve12_vert_8_s32(s2, s3, s4, s5, s6, s7, s8, s9, s10, s11,
+                                   s12, s13, y_filter_0_7, y_filter_8_11,
+                                   sub_const);
+ d3 = convolve12_vert_8_s32(s3, s4, s5, s6, s7, s8, s9, s10, s11, s12,
+ s13, s14, y_filter_0_7, y_filter_8_11,
+ sub_const);
- vst1_u8(d, d0);
- d += dst_stride;
- vst1_u8(d, d1);
- d += dst_stride;
if (h != 2) {
- vst1_u8(d, d2);
- d += dst_stride;
- vst1_u8(d, d3);
- d += dst_stride;
+ store_u8_8x4(d, dst_stride, d0, d1, d2, d3);
+ } else {
+ store_u8_8x2(d, dst_stride, d0, d1);
}
s0 = s4;
@@ -2945,27 +3518,13 @@
s4 = s8;
s5 = s9;
s6 = s10;
+ s7 = s11;
+ s8 = s12;
+ s9 = s13;
+ s10 = s14;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
height -= 4;
-#else // !defined(__aarch64__)
- s7 = vld1q_s16(s);
- s += src_stride;
-
- d0 = convolve8_vert_8x4_s32(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
- round_shift_vec, offset_const,
- sub_const_vec, vec_round_bits);
-
- vst1_u8(d, d0);
- d += dst_stride;
-
- s0 = s1;
- s1 = s2;
- s2 = s3;
- s3 = s4;
- s4 = s5;
- s5 = s6;
- s6 = s7;
- height--;
-#endif // defined(__aarch64__)
} while (height > 0);
src_ptr += 8;
@@ -2975,11 +3534,170 @@
}
}
-static INLINE int16x4_t convolve6_vert_4x4_s32(
- const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
- const int16x4_t s3, const int16x4_t s4, const int16x4_t s5,
- const int16x8_t y_filter, const int32x4_t round_shift_vec,
- const int32x4_t offset_const, const int32x4_t sub_const_vec) {
+static INLINE void convolve_2d_sr_vert_8tap_neon(int16_t *src_ptr,
+ int src_stride,
+ uint8_t *dst_ptr,
+ int dst_stride, int w, int h,
+ const int16x8_t y_filter) {
+ const int bd = 8;
+ const int16x8_t sub_const = vdupq_n_s16(1 << (bd - 1));
+
+ if (w <= 4) {
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, d0;
+ uint8x8_t d01;
+
+#if AOM_ARCH_AARCH64
+ int16x4_t s8, s9, s10, d1, d2, d3;
+ uint8x8_t d23;
+#endif // AOM_ARCH_AARCH64
+
+ int16_t *s = src_ptr;
+ uint8_t *d = dst_ptr;
+
+ load_s16_4x7(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6);
+ s += 7 * src_stride;
+
+ do {
+#if AOM_ARCH_AARCH64
+ load_s16_4x4(s, src_stride, &s7, &s8, &s9, &s10);
+
+ d0 = convolve8_vert_4_s32(s0, s1, s2, s3, s4, s5, s6, s7, y_filter);
+ d1 = convolve8_vert_4_s32(s1, s2, s3, s4, s5, s6, s7, s8, y_filter);
+ d2 = convolve8_vert_4_s32(s2, s3, s4, s5, s6, s7, s8, s9, y_filter);
+ d3 = convolve8_vert_4_s32(s3, s4, s5, s6, s7, s8, s9, s10, y_filter);
+
+ d01 = vqmovun_s16(vsubq_s16(vcombine_s16(d0, d1), sub_const));
+ d23 = vqmovun_s16(vsubq_s16(vcombine_s16(d2, d3), sub_const));
+
+ if (w == 2) {
+ store_u8_2x1(d + 0 * dst_stride, d01, 0);
+ store_u8_2x1(d + 1 * dst_stride, d01, 2);
+ if (h != 2) {
+ store_u8_2x1(d + 2 * dst_stride, d23, 0);
+ store_u8_2x1(d + 3 * dst_stride, d23, 2);
+ }
+ } else {
+ store_u8_4x1(d + 0 * dst_stride, d01, 0);
+ store_u8_4x1(d + 1 * dst_stride, d01, 1);
+ if (h != 2) {
+ store_u8_4x1(d + 2 * dst_stride, d23, 0);
+ store_u8_4x1(d + 3 * dst_stride, d23, 1);
+ }
+ }
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ h -= 4;
+#else // !AOM_ARCH_AARCH64
+ s7 = vld1_s16(s);
+ s += src_stride;
+
+ d0 = convolve8_vert_4_s32(s0, s1, s2, s3, s4, s5, s6, s7, y_filter);
+
+ d01 = vqmovun_s16(vsubq_s16(vcombine_s16(d0, vdup_n_s16(0)), sub_const));
+
+ if (w == 2) {
+ store_u8_2x1(d, d01, 0);
+ } else {
+ store_u8_4x1(d, d01, 0);
+ }
+
+ s0 = s1;
+ s1 = s2;
+ s2 = s3;
+ s3 = s4;
+ s4 = s5;
+ s5 = s6;
+ s6 = s7;
+ d += dst_stride;
+ h--;
+#endif // AOM_ARCH_AARCH64
+ } while (h > 0);
+ } else {
+ // if width is a multiple of 8 & height is a multiple of 4
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7;
+ uint8x8_t d0;
+#if AOM_ARCH_AARCH64
+ int16x8_t s8, s9, s10;
+ uint8x8_t d1, d2, d3;
+#endif // AOM_ARCH_AARCH64
+
+ do {
+ int height = h;
+ int16_t *s = src_ptr;
+ uint8_t *d = dst_ptr;
+
+ load_s16_8x7(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6);
+ s += 7 * src_stride;
+
+ do {
+#if AOM_ARCH_AARCH64
+ load_s16_8x4(s, src_stride, &s7, &s8, &s9, &s10);
+
+ d0 = convolve8_vert_8_s32(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ sub_const);
+ d1 = convolve8_vert_8_s32(s1, s2, s3, s4, s5, s6, s7, s8, y_filter,
+ sub_const);
+ d2 = convolve8_vert_8_s32(s2, s3, s4, s5, s6, s7, s8, s9, y_filter,
+ sub_const);
+ d3 = convolve8_vert_8_s32(s3, s4, s5, s6, s7, s8, s9, s10, y_filter,
+ sub_const);
+
+ if (h != 2) {
+ store_u8_8x4(d, dst_stride, d0, d1, d2, d3);
+ } else {
+ store_u8_8x2(d, dst_stride, d0, d1);
+ }
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ height -= 4;
+#else // !AOM_ARCH_AARCH64
+ s7 = vld1q_s16(s);
+
+ d0 = convolve8_vert_8_s32(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ sub_const);
+
+ vst1_u8(d, d0);
+
+ s0 = s1;
+ s1 = s2;
+ s2 = s3;
+ s3 = s4;
+ s4 = s5;
+ s5 = s6;
+ s6 = s7;
+ s += src_stride;
+ d += dst_stride;
+ height--;
+#endif // AOM_ARCH_AARCH64
+ } while (height > 0);
+
+ src_ptr += 8;
+ dst_ptr += 8;
+ w -= 8;
+ } while (w > 0);
+ }
+}
+
+static INLINE int16x4_t
+convolve6_vert_4_s32(const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
+ const int16x4_t s3, const int16x4_t s4, const int16x4_t s5,
+ const int16x8_t y_filter) {
const int16x4_t y_filter_lo = vget_low_s16(y_filter);
const int16x4_t y_filter_hi = vget_high_s16(y_filter);
int32x4_t sum;
@@ -2991,19 +3709,13 @@
sum = vmlal_lane_s16(sum, s4, y_filter_hi, 1);
sum = vmlal_lane_s16(sum, s5, y_filter_hi, 2);
- sum = vaddq_s32(sum, offset_const);
- sum = vqrshlq_s32(sum, round_shift_vec);
- sum = vsubq_s32(sum, sub_const_vec);
-
- return vmovn_s32(sum);
+ return vqrshrn_n_s32(sum, 2 * FILTER_BITS - ROUND0_BITS);
}
-static INLINE uint8x8_t convolve6_vert_8x4_s32(
- const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
- const int16x8_t s3, const int16x8_t s4, const int16x8_t s5,
- const int16x8_t y_filter, const int32x4_t round_shift_vec,
- const int32x4_t offset_const, const int32x4_t sub_const_vec,
- const int16x8_t vec_round_bits) {
+static INLINE uint8x8_t
+convolve6_vert_8_s32(const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
+ const int16x8_t s3, const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t y_filter, const int16x8_t sub_const) {
const int16x4_t y_filter_lo = vget_low_s16(y_filter);
const int16x4_t y_filter_hi = vget_high_s16(y_filter);
int32x4_t sum0, sum1;
@@ -3023,97 +3735,61 @@
sum1 = vmlal_lane_s16(sum1, vget_high_s16(s4), y_filter_hi, 1);
sum1 = vmlal_lane_s16(sum1, vget_high_s16(s5), y_filter_hi, 2);
- sum0 = vaddq_s32(sum0, offset_const);
- sum1 = vaddq_s32(sum1, offset_const);
- sum0 = vqrshlq_s32(sum0, round_shift_vec);
- sum1 = vqrshlq_s32(sum1, round_shift_vec);
- sum0 = vsubq_s32(sum0, sub_const_vec);
- sum1 = vsubq_s32(sum1, sub_const_vec);
-
- res = vcombine_s16(vmovn_s32(sum0), vmovn_s32(sum1));
- res = vqrshlq_s16(res, vec_round_bits);
+ res = vcombine_s16(vqrshrn_n_s32(sum0, 2 * FILTER_BITS - ROUND0_BITS),
+ vqrshrn_n_s32(sum1, 2 * FILTER_BITS - ROUND0_BITS));
+ res = vsubq_s16(res, sub_const);
return vqmovun_s16(res);
}
-static INLINE void av1_convolve_2d_sr_vert_6tap_neon(
- int16_t *src_ptr, int src_stride, uint8_t *dst_ptr, int dst_stride, int w,
- int h, const int16x8_t y_filter, ConvolveParams *conv_params) {
+static INLINE void convolve_2d_sr_vert_6tap_neon(int16_t *src_ptr,
+ int src_stride,
+ uint8_t *dst_ptr,
+ int dst_stride, int w, int h,
+ const int16x8_t y_filter) {
const int bd = 8;
- const int16_t round_bits =
- FILTER_BITS * 2 - conv_params->round_0 - conv_params->round_1;
- const int16x8_t vec_round_bits = vdupq_n_s16(-round_bits);
- const int offset_bits = bd + 2 * FILTER_BITS - conv_params->round_0;
-
- const int32_t sub_const = (1 << (offset_bits - conv_params->round_1)) +
- (1 << (offset_bits - conv_params->round_1 - 1));
-
- const int32x4_t round_shift_vec = vdupq_n_s32(-(conv_params->round_1));
- const int32x4_t offset_const = vdupq_n_s32(1 << offset_bits);
- const int32x4_t sub_const_vec = vdupq_n_s32(sub_const);
+ const int16x8_t sub_const = vdupq_n_s16(1 << (bd - 1));
if (w <= 4) {
int16x4_t s0, s1, s2, s3, s4, s5, d0;
- int16x8_t dd0;
uint8x8_t d01;
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
int16x4_t s6, s7, s8, d1, d2, d3;
- int16x8_t dd1;
uint8x8_t d23;
-#endif // defined(__aarch64__)
+#endif // AOM_ARCH_AARCH64
int16_t *s = src_ptr;
uint8_t *d = dst_ptr;
- s0 = vld1_s16(s + 0 * src_stride);
- s1 = vld1_s16(s + 1 * src_stride);
- s2 = vld1_s16(s + 2 * src_stride);
- s3 = vld1_s16(s + 3 * src_stride);
- s4 = vld1_s16(s + 4 * src_stride);
- s += (5 * src_stride);
+ load_s16_4x5(s, src_stride, &s0, &s1, &s2, &s3, &s4);
+ s += 5 * src_stride;
do {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
load_s16_4x4(s, src_stride, &s5, &s6, &s7, &s8);
- s += (4 * src_stride);
- d0 = convolve6_vert_4x4_s32(s0, s1, s2, s3, s4, s5, y_filter,
- round_shift_vec, offset_const, sub_const_vec);
- d1 = convolve6_vert_4x4_s32(s1, s2, s3, s4, s5, s6, y_filter,
- round_shift_vec, offset_const, sub_const_vec);
- d2 = convolve6_vert_4x4_s32(s2, s3, s4, s5, s6, s7, y_filter,
- round_shift_vec, offset_const, sub_const_vec);
- d3 = convolve6_vert_4x4_s32(s3, s4, s5, s6, s7, s8, y_filter,
- round_shift_vec, offset_const, sub_const_vec);
+ d0 = convolve6_vert_4_s32(s0, s1, s2, s3, s4, s5, y_filter);
+ d1 = convolve6_vert_4_s32(s1, s2, s3, s4, s5, s6, y_filter);
+ d2 = convolve6_vert_4_s32(s2, s3, s4, s5, s6, s7, y_filter);
+ d3 = convolve6_vert_4_s32(s3, s4, s5, s6, s7, s8, y_filter);
- dd0 = vqrshlq_s16(vcombine_s16(d0, d1), vec_round_bits);
- dd1 = vqrshlq_s16(vcombine_s16(d2, d3), vec_round_bits);
+ d01 = vqmovun_s16(vsubq_s16(vcombine_s16(d0, d1), sub_const));
+ d23 = vqmovun_s16(vsubq_s16(vcombine_s16(d2, d3), sub_const));
- d01 = vqmovun_s16(dd0);
- d23 = vqmovun_s16(dd1);
-
- if (w == 4) {
- vst1_lane_u32((uint32_t *)d, vreinterpret_u32_u8(d01), 0);
- d += dst_stride;
- vst1_lane_u32((uint32_t *)d, vreinterpret_u32_u8(d01), 1);
- d += dst_stride;
+ if (w == 2) {
+ store_u8_2x1(d + 0 * dst_stride, d01, 0);
+ store_u8_2x1(d + 1 * dst_stride, d01, 2);
if (h != 2) {
- vst1_lane_u32((uint32_t *)d, vreinterpret_u32_u8(d23), 0);
- d += dst_stride;
- vst1_lane_u32((uint32_t *)d, vreinterpret_u32_u8(d23), 1);
- d += dst_stride;
+ store_u8_2x1(d + 2 * dst_stride, d23, 0);
+ store_u8_2x1(d + 3 * dst_stride, d23, 2);
}
} else {
- vst1_lane_u16((uint16_t *)d, vreinterpret_u16_u8(d01), 0);
- d += dst_stride;
- vst1_lane_u16((uint16_t *)d, vreinterpret_u16_u8(d01), 2);
- d += dst_stride;
+ store_u8_4x1(d + 0 * dst_stride, d01, 0);
+ store_u8_4x1(d + 1 * dst_stride, d01, 1);
if (h != 2) {
- vst1_lane_u16((uint16_t *)d, vreinterpret_u16_u8(d23), 0);
- d += dst_stride;
- vst1_lane_u16((uint16_t *)d, vreinterpret_u16_u8(d23), 2);
- d += dst_stride;
+ store_u8_4x1(d + 2 * dst_stride, d23, 0);
+ store_u8_4x1(d + 3 * dst_stride, d23, 1);
}
}
@@ -3122,23 +3798,19 @@
s2 = s6;
s3 = s7;
s4 = s8;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
h -= 4;
-#else // !defined(__aarch64__)
+#else // !AOM_ARCH_AARCH64
s5 = vld1_s16(s);
- s += src_stride;
- d0 = convolve6_vert_4x4_s32(s0, s1, s2, s3, s4, s5, y_filter,
- round_shift_vec, offset_const, sub_const_vec);
-
- dd0 = vqrshlq_s16(vcombine_s16(d0, d0), vec_round_bits);
- d01 = vqmovun_s16(dd0);
+ d0 = convolve6_vert_4_s32(s0, s1, s2, s3, s4, s5, y_filter);
+ d01 = vqmovun_s16(vsubq_s16(vcombine_s16(d0, vdup_n_s16(0)), sub_const));
if (w == 2) {
- vst1_lane_u16((uint16_t *)d, vreinterpret_u16_u8(d01), 0);
- d += dst_stride;
+ store_u8_2x1(d, d01, 0);
} else {
- vst1_lane_u32((uint32_t *)d, vreinterpret_u32_u8(d01), 0);
- d += dst_stride;
+ store_u8_4x1(d, d01, 0);
}
s0 = s1;
@@ -3146,57 +3818,41 @@
s2 = s3;
s3 = s4;
s4 = s5;
+ s += src_stride;
+ d += dst_stride;
h--;
-#endif // defined(__aarch64__)
+#endif // AOM_ARCH_AARCH64
} while (h > 0);
} else {
// if width is a multiple of 8 & height is a multiple of 4
int16x8_t s0, s1, s2, s3, s4, s5;
uint8x8_t d0;
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
int16x8_t s6, s7, s8;
uint8x8_t d1, d2, d3;
-#endif // defined(__aarch64__)
+#endif // AOM_ARCH_AARCH64
do {
int height = h;
int16_t *s = src_ptr;
uint8_t *d = dst_ptr;
- s0 = vld1q_s16(s + 0 * src_stride);
- s1 = vld1q_s16(s + 1 * src_stride);
- s2 = vld1q_s16(s + 2 * src_stride);
- s3 = vld1q_s16(s + 3 * src_stride);
- s4 = vld1q_s16(s + 4 * src_stride);
- s += (5 * src_stride);
+ load_s16_8x5(s, src_stride, &s0, &s1, &s2, &s3, &s4);
+ s += 5 * src_stride;
do {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
load_s16_8x4(s, src_stride, &s5, &s6, &s7, &s8);
- s += (4 * src_stride);
- d0 = convolve6_vert_8x4_s32(s0, s1, s2, s3, s4, s5, y_filter,
- round_shift_vec, offset_const,
- sub_const_vec, vec_round_bits);
- d1 = convolve6_vert_8x4_s32(s1, s2, s3, s4, s5, s6, y_filter,
- round_shift_vec, offset_const,
- sub_const_vec, vec_round_bits);
- d2 = convolve6_vert_8x4_s32(s2, s3, s4, s5, s6, s7, y_filter,
- round_shift_vec, offset_const,
- sub_const_vec, vec_round_bits);
- d3 = convolve6_vert_8x4_s32(s3, s4, s5, s6, s7, s8, y_filter,
- round_shift_vec, offset_const,
- sub_const_vec, vec_round_bits);
+ d0 = convolve6_vert_8_s32(s0, s1, s2, s3, s4, s5, y_filter, sub_const);
+ d1 = convolve6_vert_8_s32(s1, s2, s3, s4, s5, s6, y_filter, sub_const);
+ d2 = convolve6_vert_8_s32(s2, s3, s4, s5, s6, s7, y_filter, sub_const);
+ d3 = convolve6_vert_8_s32(s3, s4, s5, s6, s7, s8, y_filter, sub_const);
- vst1_u8(d, d0);
- d += dst_stride;
- vst1_u8(d, d1);
- d += dst_stride;
if (h != 2) {
- vst1_u8(d, d2);
- d += dst_stride;
- vst1_u8(d, d3);
- d += dst_stride;
+ store_u8_8x4(d, dst_stride, d0, d1, d2, d3);
+ } else {
+ store_u8_8x2(d, dst_stride, d0, d1);
}
s0 = s4;
@@ -3204,25 +3860,25 @@
s2 = s6;
s3 = s7;
s4 = s8;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
height -= 4;
-#else // !defined(__aarch64__)
+#else // !AOM_ARCH_AARCH64
s5 = vld1q_s16(s);
- s += src_stride;
- d0 = convolve6_vert_8x4_s32(s0, s1, s2, s3, s4, s5, y_filter,
- round_shift_vec, offset_const,
- sub_const_vec, vec_round_bits);
+ d0 = convolve6_vert_8_s32(s0, s1, s2, s3, s4, s5, y_filter, sub_const);
vst1_u8(d, d0);
- d += dst_stride;
s0 = s1;
s1 = s2;
s2 = s3;
s3 = s4;
s4 = s5;
+ s += src_stride;
+ d += dst_stride;
height--;
-#endif // defined(__aarch64__)
+#endif // AOM_ARCH_AARCH64
} while (height > 0);
src_ptr += 8;
@@ -3238,6 +3894,7 @@
const InterpFilterParams *filter_params_y,
const int subpel_x_qn, const int subpel_y_qn,
ConvolveParams *conv_params) {
+ (void)conv_params;
const int y_filter_taps = get_filter_tap(filter_params_y, subpel_y_qn);
const int clamped_y_taps = y_filter_taps < 6 ? 6 : y_filter_taps;
const int im_h = h + clamped_y_taps - 1;
@@ -3260,13 +3917,11 @@
const int16x8_t y_filter_0_7 = vld1q_s16(y_filter_ptr);
const int16x4_t y_filter_8_11 = vld1_s16(y_filter_ptr + 8);
- av1_convolve_2d_sr_horiz_12tap_neon(src_ptr, src_stride, im_block,
- im_stride, w, im_h, x_filter_0_7,
- x_filter_8_11, conv_params->round_0);
+ convolve_2d_sr_horiz_12tap_neon(src_ptr, src_stride, im_block, im_stride, w,
+ im_h, x_filter_0_7, x_filter_8_11);
- av1_convolve_2d_sr_vert_12tap_neon(im_block, im_stride, dst, dst_stride, w,
- h, y_filter_0_7, y_filter_8_11,
- conv_params);
+ convolve_2d_sr_vert_12tap_neon(im_block, im_stride, dst, dst_stride, w, h,
+ y_filter_0_7, y_filter_8_11);
} else {
DECLARE_ALIGNED(16, int16_t,
im_block[(MAX_SB_SIZE + HORIZ_EXTRA_ROWS) * MAX_SB_SIZE]);
@@ -3274,15 +3929,15 @@
const int16x8_t x_filter = vld1q_s16(x_filter_ptr);
const int16x8_t y_filter = vld1q_s16(y_filter_ptr);
- av1_convolve_2d_sr_horiz_neon(src_ptr, src_stride, im_block, im_stride, w,
- im_h, x_filter, conv_params->round_0);
+ convolve_2d_sr_horiz_8tap_neon(src_ptr, src_stride, im_block, im_stride, w,
+ im_h, x_filter);
if (clamped_y_taps <= 6) {
- av1_convolve_2d_sr_vert_6tap_neon(im_block, im_stride, dst, dst_stride, w,
- h, y_filter, conv_params);
+ convolve_2d_sr_vert_6tap_neon(im_block, im_stride, dst, dst_stride, w, h,
+ y_filter);
} else {
- av1_convolve_2d_sr_vert_8tap_neon(im_block, im_stride, dst, dst_stride, w,
- h, y_filter, conv_params);
+ convolve_2d_sr_vert_8tap_neon(im_block, im_stride, dst, dst_stride, w, h,
+ y_filter);
}
}
}
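
Taken together, the kernels above implement the usual separable pipeline: a horizontal pass into the higher-precision im_block, then a 6-, 8- or 12-tap vertical pass back down to 8-bit pixels. A minimal scalar model of the rounding being matched, for the 8-tap case and the constants used here (bd = 8, FILTER_BITS = 7, ROUND0_BITS = 3); ROUND_POWER_OF_TWO and clamp are the libaom helpers from aom_dsp/aom_dsp_common.h, and the reference implementation reaches the same values via an extra offset that cancels out:

    static int16_t model_horiz(const uint8_t *s, const int16_t *f) {
      int32_t sum = 1 << (8 + 7 - 1);  // offset; removed after the 2nd pass
      for (int k = 0; k < 8; ++k) sum += f[k] * s[k];
      return (int16_t)ROUND_POWER_OF_TWO(sum, 3);  // ROUND0_BITS
    }

    static uint8_t model_vert(const int16_t *m, int stride, const int16_t *f) {
      int32_t sum = 0;
      for (int k = 0; k < 8; ++k) sum += f[k] * m[k * stride];
      sum = ROUND_POWER_OF_TWO(sum, 2 * 7 - 3) - (1 << 7);  // sub_const
      return (uint8_t)clamp(sum, 0, 255);
    }

This is a cross-checking model, not the production path: the NEON code gets to the same results with halved taps, non-rounding shifts and saturating narrows.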
@@ -3329,7 +3984,7 @@
tt = convolve8_4(t[0], t[1], t[2], t[3], t[4], t[5], t[6], t[7],
filters);
d = vqrshrun_n_s16(vcombine_s16(tt, tt), 7);
- vst1_lane_u32((uint32_t *)&temp[4 * z], vreinterpret_u32_u8(d), 0);
+ store_u8_4x1(&temp[4 * z], d, 0);
} else {
int i;
for (i = 0; i < 4; ++i) {
@@ -3342,14 +3997,10 @@
// transpose the 4x4 filters values back to dst
{
const uint8x8x4_t d4 = vld4_u8(temp);
- vst1_lane_u32((uint32_t *)&dst[x + 0 * dst_stride],
- vreinterpret_u32_u8(d4.val[0]), 0);
- vst1_lane_u32((uint32_t *)&dst[x + 1 * dst_stride],
- vreinterpret_u32_u8(d4.val[1]), 0);
- vst1_lane_u32((uint32_t *)&dst[x + 2 * dst_stride],
- vreinterpret_u32_u8(d4.val[2]), 0);
- vst1_lane_u32((uint32_t *)&dst[x + 3 * dst_stride],
- vreinterpret_u32_u8(d4.val[3]), 0);
+ store_u8_4x1(&dst[x + 0 * dst_stride], d4.val[0], 0);
+ store_u8_4x1(&dst[x + 1 * dst_stride], d4.val[1], 0);
+ store_u8_4x1(&dst[x + 2 * dst_stride], d4.val[2], 0);
+ store_u8_4x1(&dst[x + 3 * dst_stride], d4.val[3], 0);
}
x += 4;
} while (x < w);
@@ -3403,14 +4054,8 @@
load_u8_8x8(temp, 8, &d[0], &d[1], &d[2], &d[3], &d[4], &d[5], &d[6],
&d[7]);
transpose_u8_8x8(&d[0], &d[1], &d[2], &d[3], &d[4], &d[5], &d[6], &d[7]);
- vst1_u8(&dst[x + 0 * dst_stride], d[0]);
- vst1_u8(&dst[x + 1 * dst_stride], d[1]);
- vst1_u8(&dst[x + 2 * dst_stride], d[2]);
- vst1_u8(&dst[x + 3 * dst_stride], d[3]);
- vst1_u8(&dst[x + 4 * dst_stride], d[4]);
- vst1_u8(&dst[x + 5 * dst_stride], d[5]);
- vst1_u8(&dst[x + 6 * dst_stride], d[6]);
- vst1_u8(&dst[x + 7 * dst_stride], d[7]);
+ store_u8_8x8(dst + x, dst_stride, d[0], d[1], d[2], d[3], d[4], d[5],
+ d[6], d[7]);
x += 8;
} while (x < w);
@@ -3449,7 +4094,7 @@
tt = convolve8_4(t[0], t[1], t[2], t[3], t[4], t[5], t[6], t[7], filters);
d = vqrshrun_n_s16(vcombine_s16(tt, tt), 7);
- vst1_lane_u32((uint32_t *)dst, vreinterpret_u32_u8(d), 0);
+ store_u8_4x1(dst, d, 0);
} else {
memcpy(dst, &src_y[3 * src_stride], w);
}
diff --git a/av1/common/arm/convolve_neon.h b/av1/common/arm/convolve_neon.h
index 4e9f636..14a6ebe 100644
--- a/av1/common/arm/convolve_neon.h
+++ b/av1/common/arm/convolve_neon.h
@@ -13,6 +13,8 @@
#include <arm_neon.h>
+#include "config/aom_config.h"
+
#define HORIZ_EXTRA_ROWS ((SUBPEL_TAPS + 7) & ~0x07)
static INLINE int16x4_t convolve8_4(const int16x4_t s0, const int16x4_t s1,
@@ -230,7 +232,10 @@
return sum;
}
-#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
+// clang versions < 16 did not include the dotprod feature for Arm architecture
+// versions that should have it by default, e.g., armv8.6-a.
+#if AOM_ARCH_AARCH64 && \
+ (defined(__ARM_FEATURE_DOTPROD) || defined(__ARM_FEATURE_MATMUL_INT8))
DECLARE_ALIGNED(16, static const uint8_t, dot_prod_permute_tbl[48]) = {
0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6,
@@ -238,9 +243,62 @@
8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14
};
-#endif // defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
+#endif  // AOM_ARCH_AARCH64 && (defined(__ARM_FEATURE_DOTPROD) ||
+        // defined(__ARM_FEATURE_MATMUL_INT8))
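
The table turns one 16-byte load into the four overlapping 4-tap windows that a dot-product instruction consumes. Applying its first 16 indices to source bytes a0..a15 with vqtbl1q_u8 gives:

    // indices { 0,1,2,3, 1,2,3,4, 2,3,4,5, 3,4,5,6 } select
    // { a0 a1 a2 a3 | a1 a2 a3 a4 | a2 a3 a4 a5 | a3 a4 a5 a6 }

i.e. the sliding windows for outputs 0..3; the other two rows do the same for taps 4..7 and for outputs 4..7, as the comments inside the helpers below spell out.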
-#if defined(__aarch64__) && defined(__ARM_FEATURE_MATMUL_INT8)
+#if AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_MATMUL_INT8)
+
+static INLINE int16x8_t convolve8_x_8_usdot(uint8x16_t samples,
+ const int8x8_t filters,
+ const uint8x16x3_t permute_tbl,
+ const int32x4_t horiz_const) {
+ uint8x16_t permuted_samples[3];
+ int32x4_t sum[2];
+
+ /* Permute samples ready for dot product. */
+ /* { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 } */
+ permuted_samples[0] = vqtbl1q_u8(samples, permute_tbl.val[0]);
+ /* { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 } */
+ permuted_samples[1] = vqtbl1q_u8(samples, permute_tbl.val[1]);
+ /* { 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 } */
+ permuted_samples[2] = vqtbl1q_u8(samples, permute_tbl.val[2]);
+
+ /* First 4 output values. */
+ sum[0] = vusdotq_lane_s32(horiz_const, permuted_samples[0], filters, 0);
+ sum[0] = vusdotq_lane_s32(sum[0], permuted_samples[1], filters, 1);
+ /* Second 4 output values. */
+ sum[1] = vusdotq_lane_s32(horiz_const, permuted_samples[1], filters, 0);
+ sum[1] = vusdotq_lane_s32(sum[1], permuted_samples[2], filters, 1);
+
+ return vcombine_s16(vmovn_s32(sum[0]), vmovn_s32(sum[1]));
+}
+
+static INLINE int16x8_t convolve8_horiz_8_usdot(uint8x16_t samples,
+ const int8x8_t filters,
+ const uint8x16x3_t permute_tbl,
+ const int32x4_t horiz_const) {
+ uint8x16_t permuted_samples[3];
+ int32x4_t sum[2];
+
+ /* Permute samples ready for dot product. */
+ /* { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 } */
+ permuted_samples[0] = vqtbl1q_u8(samples, permute_tbl.val[0]);
+ /* { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 } */
+ permuted_samples[1] = vqtbl1q_u8(samples, permute_tbl.val[1]);
+ /* { 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 } */
+ permuted_samples[2] = vqtbl1q_u8(samples, permute_tbl.val[2]);
+
+ /* First 4 output values. */
+ sum[0] = vusdotq_lane_s32(horiz_const, permuted_samples[0], filters, 0);
+ sum[0] = vusdotq_lane_s32(sum[0], permuted_samples[1], filters, 1);
+ /* Second 4 output values. */
+ sum[1] = vusdotq_lane_s32(horiz_const, permuted_samples[1], filters, 0);
+ sum[1] = vusdotq_lane_s32(sum[1], permuted_samples[2], filters, 1);
+
+ /* Narrow and re-pack. */
+ // We halved the convolution filter values so -1 from the right shift.
+ return vcombine_s16(vshrn_n_s32(sum[0], ROUND0_BITS - 1),
+ vshrn_n_s32(sum[1], ROUND0_BITS - 1));
+}
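
A sketch of how a caller might drive this USDOT helper for one 8-wide row, by analogy with the SDOT call sites earlier in the patch (the names x_filter_s8 and horiz_const_s32 are illustrative: the filter must already be halved and narrowed to int8, and the constant carries the offset plus rounding shim in 32-bit form):

    const uint8x16x3_t perm = vld1q_u8_x3(dot_prod_permute_tbl);
    const uint8x16_t s = vld1q_u8(src_ptr);
    const int16x8_t d =
        convolve8_horiz_8_usdot(s, x_filter_s8, perm, horiz_const_s32);
    vst1q_s16(dst_ptr, d);

Unlike the SDOT path there is no range-clamp step: vusdotq_lane_s32 multiplies unsigned samples by signed filter values directly, so the u8 pixels are used as-is.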
static INLINE int32x4_t convolve8_4_usdot(uint8x16_t samples,
const int8x8_t filters,
@@ -263,37 +321,41 @@
return sum;
}
-static INLINE int16x8_t convolve8_8_usdot(uint8x16_t samples,
- const int8x8_t filters,
- const uint8x16x3_t permute_tbl,
- const int32x4_t horiz_const,
- const int16x8_t shift_round_0) {
- uint8x16_t permuted_samples[3];
- int32x4_t sum0, sum1;
- int16x8_t sum;
+#elif AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD)
+
+static INLINE int16x8_t convolve8_horiz_8_sdot(uint8x16_t samples,
+ const int8x8_t filters,
+ const int32x4_t correction,
+ const uint8x16_t range_limit,
+ const uint8x16x3_t permute_tbl) {
+ int8x16_t clamped_samples, permuted_samples[3];
+ int32x4_t sum[2];
+
+ /* Clamp sample range to [-128, 127] for 8-bit signed dot product. */
+ clamped_samples = vreinterpretq_s8_u8(vsubq_u8(samples, range_limit));
/* Permute samples ready for dot product. */
/* { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 } */
- permuted_samples[0] = vqtbl1q_u8(samples, permute_tbl.val[0]);
+ permuted_samples[0] = vqtbl1q_s8(clamped_samples, permute_tbl.val[0]);
/* { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 } */
- permuted_samples[1] = vqtbl1q_u8(samples, permute_tbl.val[1]);
+ permuted_samples[1] = vqtbl1q_s8(clamped_samples, permute_tbl.val[1]);
/* { 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 } */
- permuted_samples[2] = vqtbl1q_u8(samples, permute_tbl.val[2]);
+ permuted_samples[2] = vqtbl1q_s8(clamped_samples, permute_tbl.val[2]);
+ /* Accumulate dot product into 'correction' to account for range clamp. */
/* First 4 output values. */
- sum0 = vusdotq_lane_s32(horiz_const, permuted_samples[0], filters, 0);
- sum0 = vusdotq_lane_s32(sum0, permuted_samples[1], filters, 1);
+ sum[0] = vdotq_lane_s32(correction, permuted_samples[0], filters, 0);
+ sum[0] = vdotq_lane_s32(sum[0], permuted_samples[1], filters, 1);
/* Second 4 output values. */
- sum1 = vusdotq_lane_s32(horiz_const, permuted_samples[1], filters, 0);
- sum1 = vusdotq_lane_s32(sum1, permuted_samples[2], filters, 1);
+ sum[1] = vdotq_lane_s32(correction, permuted_samples[1], filters, 0);
+ sum[1] = vdotq_lane_s32(sum[1], permuted_samples[2], filters, 1);
/* Narrow and re-pack. */
- sum = vcombine_s16(vmovn_s32(sum0), vmovn_s32(sum1));
- return vqrshlq_s16(sum, shift_round_0);
+ /* We halved the convolution filter values so -1 from the right shift. */
+ return vcombine_s16(vshrn_n_s32(sum[0], ROUND0_BITS - 1),
+ vshrn_n_s32(sum[1], ROUND0_BITS - 1));
}
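
The correction term compensates for the range clamp as follows: SDOT needs signed bytes, so each unsigned sample x is mapped to x - 128 by the vsubq_u8/vreinterpretq pair above. Since

    sum_k f[k] * (x[k] - 128)  ==  sum_k f[k] * x[k]  -  128 * sum_k f[k]

pre-loading the accumulator with 128 * (sum of the filter taps), plus whatever offset and rounding-shim constants the caller needs, restores the true convolution. That constant is computed once per filter at the call sites (outside this hunk), not per pixel.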
-#elif defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
-
static INLINE int32x4_t convolve8_4_sdot(uint8x16_t samples,
const int8x8_t filters,
const int32x4_t correction,
@@ -353,7 +415,38 @@
return vqrshlq_s16(sum, shift_round_0);
}
-#endif // defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
+static INLINE int16x8_t convolve8_x_8_sdot(uint8x16_t samples,
+ const int8x8_t filters,
+ const int32x4_t correction,
+ const uint8x16_t range_limit,
+ const uint8x16x3_t permute_tbl) {
+ int8x16_t clamped_samples, permuted_samples[3];
+ int32x4_t sum[2];
+
+ /* Clamp sample range to [-128, 127] for 8-bit signed dot product. */
+ clamped_samples = vreinterpretq_s8_u8(vsubq_u8(samples, range_limit));
+
+ /* Permute samples ready for dot product. */
+ /* { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 } */
+ permuted_samples[0] = vqtbl1q_s8(clamped_samples, permute_tbl.val[0]);
+ /* { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 } */
+ permuted_samples[1] = vqtbl1q_s8(clamped_samples, permute_tbl.val[1]);
+ /* { 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 } */
+ permuted_samples[2] = vqtbl1q_s8(clamped_samples, permute_tbl.val[2]);
+
+ /* Accumulate dot product into 'correction' to account for range clamp. */
+ /* First 4 output values. */
+ sum[0] = vdotq_lane_s32(correction, permuted_samples[0], filters, 0);
+ sum[0] = vdotq_lane_s32(sum[0], permuted_samples[1], filters, 1);
+ /* Second 4 output values. */
+ sum[1] = vdotq_lane_s32(correction, permuted_samples[1], filters, 0);
+ sum[1] = vdotq_lane_s32(sum[1], permuted_samples[2], filters, 1);
+
+ /* Narrow and re-pack. */
+ return vcombine_s16(vmovn_s32(sum[0]), vmovn_s32(sum[1]));
+}
+
+#endif // AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD)
static INLINE int16x4_t convolve8_4x4_s16(
const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
@@ -379,114 +472,92 @@
return sum;
}
-static INLINE uint16x4_t convolve6_4_s32(const int16x4_t s0, const int16x4_t s1,
- const int16x4_t s2, const int16x4_t s3,
- const int16x4_t s4, const int16x4_t s5,
- const int16x8_t y_filter,
- const int32x4_t round_shift_vec,
- const int32x4_t offset_const) {
- const int16x4_t y_filter_lo = vget_low_s16(y_filter);
- const int16x4_t y_filter_hi = vget_high_s16(y_filter);
+static INLINE int16x4_t convolve6_4x4(const int16x4_t s0, const int16x4_t s1,
+ const int16x4_t s2, const int16x4_t s3,
+ const int16x4_t s4, const int16x4_t s5,
+ const int16x8_t y_filter_0_7) {
+ const int16x4_t y_filter_0_3 = vget_low_s16(y_filter_0_7);
+ const int16x4_t y_filter_4_7 = vget_high_s16(y_filter_0_7);
+ int16x4_t sum;
- int32x4_t sum = offset_const;
- sum = vmlal_lane_s16(sum, s0, y_filter_lo, 1);
- sum = vmlal_lane_s16(sum, s1, y_filter_lo, 2);
- sum = vmlal_lane_s16(sum, s2, y_filter_lo, 3);
- sum = vmlal_lane_s16(sum, s3, y_filter_hi, 0);
- sum = vmlal_lane_s16(sum, s4, y_filter_hi, 1);
- sum = vmlal_lane_s16(sum, s5, y_filter_hi, 2);
+ // Filter values at indices 0 and 7 are 0.
+ sum = vmul_lane_s16(s0, y_filter_0_3, 1);
+ sum = vmla_lane_s16(sum, s1, y_filter_0_3, 2);
+ sum = vmla_lane_s16(sum, s2, y_filter_0_3, 3);
+ sum = vmla_lane_s16(sum, s3, y_filter_4_7, 0);
+ sum = vmla_lane_s16(sum, s4, y_filter_4_7, 1);
+ sum = vmla_lane_s16(sum, s5, y_filter_4_7, 2);
- sum = vqrshlq_s32(sum, round_shift_vec);
- return vqmovun_s32(sum);
+ return sum;
}
-static INLINE uint16x8_t convolve6_8_s32(const int16x8_t s0, const int16x8_t s1,
- const int16x8_t s2, const int16x8_t s3,
- const int16x8_t s4, const int16x8_t s5,
- const int16x8_t y_filter,
- const int32x4_t round_shift_vec,
- const int32x4_t offset_const) {
- const int16x4_t y_filter_lo = vget_low_s16(y_filter);
- const int16x4_t y_filter_hi = vget_high_s16(y_filter);
+static INLINE int16x8_t convolve6_8x4(const int16x8_t s0, const int16x8_t s1,
+ const int16x8_t s2, const int16x8_t s3,
+ const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t y_filters) {
+ const int16x4_t y_filter_lo = vget_low_s16(y_filters);
+ const int16x4_t y_filter_hi = vget_high_s16(y_filters);
+ int16x8_t sum;
- int32x4_t sum0 = offset_const;
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s0), y_filter_lo, 1);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s1), y_filter_lo, 2);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s2), y_filter_lo, 3);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s3), y_filter_hi, 0);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s4), y_filter_hi, 1);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s5), y_filter_hi, 2);
+ // Filter values at indices 0 and 7 are 0.
+ sum = vmulq_lane_s16(s0, y_filter_lo, 1);
+ sum = vmlaq_lane_s16(sum, s1, y_filter_lo, 2);
+ sum = vmlaq_lane_s16(sum, s2, y_filter_lo, 3);
+ sum = vmlaq_lane_s16(sum, s3, y_filter_hi, 0);
+ sum = vmlaq_lane_s16(sum, s4, y_filter_hi, 1);
+ sum = vmlaq_lane_s16(sum, s5, y_filter_hi, 2);
- int32x4_t sum1 = offset_const;
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s0), y_filter_lo, 1);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s1), y_filter_lo, 2);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s2), y_filter_lo, 3);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s3), y_filter_hi, 0);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s4), y_filter_hi, 1);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s5), y_filter_hi, 2);
-
- sum0 = vqrshlq_s32(sum0, round_shift_vec);
- sum1 = vqrshlq_s32(sum1, round_shift_vec);
- return vcombine_u16(vqmovun_s32(sum0), vqmovun_s32(sum1));
+ return sum;
}
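
Both 6-tap helpers rely on the storage convention noted in the "indices 0 and 7 are 0" comments: a 6-tap kernel arrives in an 8-tap array with zero end taps,

    // y_filter_0_7 = { 0, f1, f2, f3, f4, f5, f6, 0 }

which is why the multiplies start at lane 1 rather than lane 0, and why 6-tap callers (such as highbd_convolve_y_sr_6tap_neon in the new file below) begin reading one row further into the source than the 8-tap variants.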
-static INLINE uint16x4_t convolve8_4_s32(const int16x4_t s0, const int16x4_t s1,
- const int16x4_t s2, const int16x4_t s3,
- const int16x4_t s4, const int16x4_t s5,
- const int16x4_t s6, const int16x4_t s7,
- const int16x8_t y_filter,
- const int32x4_t round_shift_vec,
- const int32x4_t offset_const) {
- const int16x4_t y_filter_lo = vget_low_s16(y_filter);
- const int16x4_t y_filter_hi = vget_high_s16(y_filter);
+#if !(AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD))
- int32x4_t sum = offset_const;
- sum = vmlal_lane_s16(sum, s0, y_filter_lo, 0);
- sum = vmlal_lane_s16(sum, s1, y_filter_lo, 1);
- sum = vmlal_lane_s16(sum, s2, y_filter_lo, 2);
- sum = vmlal_lane_s16(sum, s3, y_filter_lo, 3);
- sum = vmlal_lane_s16(sum, s4, y_filter_hi, 0);
- sum = vmlal_lane_s16(sum, s5, y_filter_hi, 1);
- sum = vmlal_lane_s16(sum, s6, y_filter_hi, 2);
- sum = vmlal_lane_s16(sum, s7, y_filter_hi, 3);
+static INLINE int16x4_t convolve8_horiz_4x4_s16(
+ const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
+ const int16x4_t s3, const int16x4_t s4, const int16x4_t s5,
+ const int16x4_t s6, const int16x4_t s7, const int16x8_t filter,
+ const int16x4_t horiz_const) {
+ const int16x4_t filter_lo = vget_low_s16(filter);
+ const int16x4_t filter_hi = vget_high_s16(filter);
+ int16x4_t sum;
- sum = vqrshlq_s32(sum, round_shift_vec);
- return vqmovun_s32(sum);
+ sum = horiz_const;
+ sum = vmla_lane_s16(sum, s0, filter_lo, 0);
+ sum = vmla_lane_s16(sum, s1, filter_lo, 1);
+ sum = vmla_lane_s16(sum, s2, filter_lo, 2);
+ sum = vmla_lane_s16(sum, s3, filter_lo, 3);
+ sum = vmla_lane_s16(sum, s4, filter_hi, 0);
+ sum = vmla_lane_s16(sum, s5, filter_hi, 1);
+ sum = vmla_lane_s16(sum, s6, filter_hi, 2);
+ sum = vmla_lane_s16(sum, s7, filter_hi, 3);
+
+ // We halved the convolution filter values so -1 from the right shift.
+ return vshr_n_s16(sum, ROUND0_BITS - 1);
}
-static INLINE uint16x8_t convolve8_8_s32(const int16x8_t s0, const int16x8_t s1,
- const int16x8_t s2, const int16x8_t s3,
- const int16x8_t s4, const int16x8_t s5,
- const int16x8_t s6, const int16x8_t s7,
- const int16x8_t y_filter,
- const int32x4_t round_shift_vec,
- const int32x4_t offset_const) {
- const int16x4_t y_filter_lo = vget_low_s16(y_filter);
- const int16x4_t y_filter_hi = vget_high_s16(y_filter);
+static INLINE int16x8_t convolve8_horiz_8x8_s16(
+ const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
+ const int16x8_t s3, const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t s6, const int16x8_t s7, const int16x8_t filter,
+ const int16x8_t horiz_const) {
+ const int16x4_t filter_lo = vget_low_s16(filter);
+ const int16x4_t filter_hi = vget_high_s16(filter);
+ int16x8_t sum;
- int32x4_t sum0 = offset_const;
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s0), y_filter_lo, 0);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s1), y_filter_lo, 1);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s2), y_filter_lo, 2);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s3), y_filter_lo, 3);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s4), y_filter_hi, 0);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s5), y_filter_hi, 1);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s6), y_filter_hi, 2);
- sum0 = vmlal_lane_s16(sum0, vget_low_s16(s7), y_filter_hi, 3);
+ sum = horiz_const;
+ sum = vmlaq_lane_s16(sum, s0, filter_lo, 0);
+ sum = vmlaq_lane_s16(sum, s1, filter_lo, 1);
+ sum = vmlaq_lane_s16(sum, s2, filter_lo, 2);
+ sum = vmlaq_lane_s16(sum, s3, filter_lo, 3);
+ sum = vmlaq_lane_s16(sum, s4, filter_hi, 0);
+ sum = vmlaq_lane_s16(sum, s5, filter_hi, 1);
+ sum = vmlaq_lane_s16(sum, s6, filter_hi, 2);
+ sum = vmlaq_lane_s16(sum, s7, filter_hi, 3);
- int32x4_t sum1 = offset_const;
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s0), y_filter_lo, 0);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s1), y_filter_lo, 1);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s2), y_filter_lo, 2);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s3), y_filter_lo, 3);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s4), y_filter_hi, 0);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s5), y_filter_hi, 1);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s6), y_filter_hi, 2);
- sum1 = vmlal_lane_s16(sum1, vget_high_s16(s7), y_filter_hi, 3);
-
- sum0 = vqrshlq_s32(sum0, round_shift_vec);
- sum1 = vqrshlq_s32(sum1, round_shift_vec);
- return vcombine_u16(vqmovun_s32(sum0), vqmovun_s32(sum1));
+ // We halved the convolution filter values so -1 from the right shift.
+ return vshrq_n_s16(sum, ROUND0_BITS - 1);
}
+#endif // !(AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD))
+
#endif // AOM_AV1_COMMON_ARM_CONVOLVE_NEON_H_
diff --git a/av1/common/arm/highbd_convolve_neon.c b/av1/common/arm/highbd_convolve_neon.c
new file mode 100644
index 0000000..fb18e28
--- /dev/null
+++ b/av1/common/arm/highbd_convolve_neon.c
@@ -0,0 +1,2381 @@
+/*
+ * Copyright (c) 2023, Alliance for Open Media. All rights reserved
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#include <assert.h>
+#include <arm_neon.h>
+
+#include "config/aom_config.h"
+#include "config/av1_rtcd.h"
+
+#include "aom_dsp/aom_dsp_common.h"
+#include "aom_dsp/arm/mem_neon.h"
+#include "aom_dsp/arm/transpose_neon.h"
+#include "aom_ports/mem.h"
+#include "av1/common/convolve.h"
+#include "av1/common/filter.h"
+#include "av1/common/arm/highbd_convolve_neon.h"
+
+static INLINE void highbd_convolve_y_sr_6tap_neon(
+ const uint16_t *src_ptr, int src_stride, uint16_t *dst_ptr, int dst_stride,
+ int w, int h, const int16_t *y_filter_ptr, const int bd) {
+ const uint16x8_t max = vdupq_n_u16((1 << bd) - 1);
+ const int16x8_t y_filter_0_7 = vld1q_s16(y_filter_ptr);
+ const int32x4_t zero_s32 = vdupq_n_s32(0);
+
+ if (w <= 4) {
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, s8;
+ uint16x4_t d0, d1, d2, d3;
+ uint16x8_t d01, d23;
+ const int16_t *s = (const int16_t *)(src_ptr + src_stride);
+ uint16_t *d = dst_ptr;
+
+ load_s16_4x5(s, src_stride, &s0, &s1, &s2, &s3, &s4);
+ s += 5 * src_stride;
+
+ do {
+ load_s16_4x4(s, src_stride, &s5, &s6, &s7, &s8);
+
+ d0 = highbd_convolve6_4_s32_s16(s0, s1, s2, s3, s4, s5, y_filter_0_7,
+ zero_s32);
+ d1 = highbd_convolve6_4_s32_s16(s1, s2, s3, s4, s5, s6, y_filter_0_7,
+ zero_s32);
+ d2 = highbd_convolve6_4_s32_s16(s2, s3, s4, s5, s6, s7, y_filter_0_7,
+ zero_s32);
+ d3 = highbd_convolve6_4_s32_s16(s3, s4, s5, s6, s7, s8, y_filter_0_7,
+ zero_s32);
+
+ d01 = vcombine_u16(d0, d1);
+ d23 = vcombine_u16(d2, d3);
+
+ d01 = vminq_u16(d01, max);
+ d23 = vminq_u16(d23, max);
+
+ if (w == 2) {
+ store_u16q_2x1(d + 0 * dst_stride, d01, 0);
+ store_u16q_2x1(d + 1 * dst_stride, d01, 2);
+ if (h != 2) {
+ store_u16q_2x1(d + 2 * dst_stride, d23, 0);
+ store_u16q_2x1(d + 3 * dst_stride, d23, 2);
+ }
+ } else {
+ vst1_u16(d + 0 * dst_stride, vget_low_u16(d01));
+ vst1_u16(d + 1 * dst_stride, vget_high_u16(d01));
+ if (h != 2) {
+ vst1_u16(d + 2 * dst_stride, vget_low_u16(d23));
+ vst1_u16(d + 3 * dst_stride, vget_high_u16(d23));
+ }
+ }
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ h -= 4;
+ } while (h > 0);
+ } else {
+ // if width is a multiple of 8 & height is a multiple of 4
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7, s8;
+ uint16x8_t d0, d1, d2, d3;
+
+ do {
+ int height = h;
+ const int16_t *s = (const int16_t *)(src_ptr + src_stride);
+ uint16_t *d = dst_ptr;
+
+ load_s16_8x5(s, src_stride, &s0, &s1, &s2, &s3, &s4);
+ s += 5 * src_stride;
+
+ do {
+ load_s16_8x4(s, src_stride, &s5, &s6, &s7, &s8);
+
+ d0 = highbd_convolve6_8_s32_s16(s0, s1, s2, s3, s4, s5, y_filter_0_7,
+ zero_s32);
+ d1 = highbd_convolve6_8_s32_s16(s1, s2, s3, s4, s5, s6, y_filter_0_7,
+ zero_s32);
+ d2 = highbd_convolve6_8_s32_s16(s2, s3, s4, s5, s6, s7, y_filter_0_7,
+ zero_s32);
+ d3 = highbd_convolve6_8_s32_s16(s3, s4, s5, s6, s7, s8, y_filter_0_7,
+ zero_s32);
+
+ d0 = vminq_u16(d0, max);
+ d1 = vminq_u16(d1, max);
+ d2 = vminq_u16(d2, max);
+ d3 = vminq_u16(d3, max);
+
+ if (h == 2) {
+ store_u16_8x2(d, dst_stride, d0, d1);
+ } else {
+ store_u16_8x4(d, dst_stride, d0, d1, d2, d3);
+ }
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ height -= 4;
+ } while (height > 0);
+
+ src_ptr += 8;
+ dst_ptr += 8;
+ w -= 8;
+ } while (w > 0);
+ }
+}
+
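+// Vertical 8-tap convolution; same structure as the 6-tap path, but with a
+// seven-row preload and an eight-row filter window.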
+static INLINE void highbd_convolve_y_sr_8tap_neon(
+ const uint16_t *src_ptr, int src_stride, uint16_t *dst_ptr, int dst_stride,
+ int w, int h, const int16_t *y_filter_ptr, int bd) {
+ const uint16x8_t max = vdupq_n_u16((1 << bd) - 1);
+ const int16x8_t y_filter = vld1q_s16(y_filter_ptr);
+ const int32x4_t zero_s32 = vdupq_n_s32(0);
+
+ if (w <= 4) {
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10;
+ uint16x4_t d0, d1, d2, d3;
+ uint16x8_t d01, d23;
+
+ const int16_t *s = (const int16_t *)src_ptr;
+ uint16_t *d = dst_ptr;
+
+ load_s16_4x7(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6);
+ s += 7 * src_stride;
+
+ do {
+ load_s16_4x4(s, src_stride, &s7, &s8, &s9, &s10);
+
+ d0 = highbd_convolve8_4_s32_s16(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ zero_s32);
+ d1 = highbd_convolve8_4_s32_s16(s1, s2, s3, s4, s5, s6, s7, s8, y_filter,
+ zero_s32);
+ d2 = highbd_convolve8_4_s32_s16(s2, s3, s4, s5, s6, s7, s8, s9, y_filter,
+ zero_s32);
+ d3 = highbd_convolve8_4_s32_s16(s3, s4, s5, s6, s7, s8, s9, s10, y_filter,
+ zero_s32);
+
+ d01 = vcombine_u16(d0, d1);
+ d23 = vcombine_u16(d2, d3);
+
+ d01 = vminq_u16(d01, max);
+ d23 = vminq_u16(d23, max);
+
+ if (w == 2) {
+ store_u16q_2x1(d + 0 * dst_stride, d01, 0);
+ store_u16q_2x1(d + 1 * dst_stride, d01, 2);
+ if (h != 2) {
+ store_u16q_2x1(d + 2 * dst_stride, d23, 0);
+ store_u16q_2x1(d + 3 * dst_stride, d23, 2);
+ }
+ } else {
+ vst1_u16(d + 0 * dst_stride, vget_low_u16(d01));
+ vst1_u16(d + 1 * dst_stride, vget_high_u16(d01));
+ if (h != 2) {
+ vst1_u16(d + 2 * dst_stride, vget_low_u16(d23));
+ vst1_u16(d + 3 * dst_stride, vget_high_u16(d23));
+ }
+ }
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ h -= 4;
+ } while (h > 0);
+ } else {
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10;
+ uint16x8_t d0, d1, d2, d3;
+ do {
+ int height = h;
+ const int16_t *s = (const int16_t *)src_ptr;
+ uint16_t *d = dst_ptr;
+
+ load_s16_8x7(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6);
+ s += 7 * src_stride;
+
+ do {
+ load_s16_8x4(s, src_stride, &s7, &s8, &s9, &s10);
+
+ d0 = highbd_convolve8_8_s32_s16(s0, s1, s2, s3, s4, s5, s6, s7,
+ y_filter, zero_s32);
+ d1 = highbd_convolve8_8_s32_s16(s1, s2, s3, s4, s5, s6, s7, s8,
+ y_filter, zero_s32);
+ d2 = highbd_convolve8_8_s32_s16(s2, s3, s4, s5, s6, s7, s8, s9,
+ y_filter, zero_s32);
+ d3 = highbd_convolve8_8_s32_s16(s3, s4, s5, s6, s7, s8, s9, s10,
+ y_filter, zero_s32);
+
+ d0 = vminq_u16(d0, max);
+ d1 = vminq_u16(d1, max);
+ d2 = vminq_u16(d2, max);
+ d3 = vminq_u16(d3, max);
+
+ if (h == 2) {
+ store_u16_8x2(d, dst_stride, d0, d1);
+ } else {
+ store_u16_8x4(d, dst_stride, d0, d1, d2, d3);
+ }
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ height -= 4;
+ } while (height > 0);
+ src_ptr += 8;
+ dst_ptr += 8;
+ w -= 8;
+ } while (w > 0);
+ }
+}
+
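+// Vertical 12-tap convolution; preloads eleven source rows and splits the
+// filter into an 8-tap half (y_filter_0_7) and a 4-tap half (y_filter_8_11).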
+static INLINE void highbd_convolve_y_sr_12tap_neon(
+ const uint16_t *src_ptr, int src_stride, uint16_t *dst_ptr, int dst_stride,
+ int w, int h, const int16_t *y_filter_ptr, int bd) {
+ const uint16x8_t max = vdupq_n_u16((1 << bd) - 1);
+ const int16x8_t y_filter_0_7 = vld1q_s16(y_filter_ptr);
+ const int16x4_t y_filter_8_11 = vld1_s16(y_filter_ptr + 8);
+ const int32x4_t zero_s32 = vdupq_n_s32(0);
+
+ if (w <= 4) {
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
+ uint16x4_t d0, d1, d2, d3;
+ uint16x8_t d01, d23;
+
+ const int16_t *s = (const int16_t *)src_ptr;
+ uint16_t *d = dst_ptr;
+
+ load_s16_4x11(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6, &s7, &s8,
+ &s9, &s10);
+ s += 11 * src_stride;
+
+ do {
+ load_s16_4x4(s, src_stride, &s11, &s12, &s13, &s14);
+
+ d0 = highbd_convolve12_y_4_s32_s16(s0, s1, s2, s3, s4, s5, s6, s7, s8, s9,
+ s10, s11, y_filter_0_7, y_filter_8_11,
+ zero_s32);
+ d1 = highbd_convolve12_y_4_s32_s16(s1, s2, s3, s4, s5, s6, s7, s8, s9,
+ s10, s11, s12, y_filter_0_7,
+ y_filter_8_11, zero_s32);
+ d2 = highbd_convolve12_y_4_s32_s16(s2, s3, s4, s5, s6, s7, s8, s9, s10,
+ s11, s12, s13, y_filter_0_7,
+ y_filter_8_11, zero_s32);
+ d3 = highbd_convolve12_y_4_s32_s16(s3, s4, s5, s6, s7, s8, s9, s10, s11,
+ s12, s13, s14, y_filter_0_7,
+ y_filter_8_11, zero_s32);
+
+ d01 = vcombine_u16(d0, d1);
+ d23 = vcombine_u16(d2, d3);
+
+ d01 = vminq_u16(d01, max);
+ d23 = vminq_u16(d23, max);
+
+ if (w == 2) {
+ store_u16q_2x1(d + 0 * dst_stride, d01, 0);
+ store_u16q_2x1(d + 1 * dst_stride, d01, 2);
+ if (h != 2) {
+ store_u16q_2x1(d + 2 * dst_stride, d23, 0);
+ store_u16q_2x1(d + 3 * dst_stride, d23, 2);
+ }
+ } else {
+ vst1_u16(d + 0 * dst_stride, vget_low_u16(d01));
+ vst1_u16(d + 1 * dst_stride, vget_high_u16(d01));
+ if (h != 2) {
+ vst1_u16(d + 2 * dst_stride, vget_low_u16(d23));
+ vst1_u16(d + 3 * dst_stride, vget_high_u16(d23));
+ }
+ }
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ s7 = s11;
+ s8 = s12;
+ s9 = s13;
+ s10 = s14;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ h -= 4;
+ } while (h > 0);
+ } else {
+ uint16x8_t d0, d1, d2, d3;
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
+
+ do {
+ int height = h;
+ const int16_t *s = (const int16_t *)src_ptr;
+ uint16_t *d = dst_ptr;
+
+ load_s16_8x11(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6, &s7, &s8,
+ &s9, &s10);
+ s += 11 * src_stride;
+
+ do {
+ load_s16_8x4(s, src_stride, &s11, &s12, &s13, &s14);
+
+ d0 = highbd_convolve12_y_8_s32_s16(s0, s1, s2, s3, s4, s5, s6, s7, s8,
+ s9, s10, s11, y_filter_0_7,
+ y_filter_8_11, zero_s32);
+ d1 = highbd_convolve12_y_8_s32_s16(s1, s2, s3, s4, s5, s6, s7, s8, s9,
+ s10, s11, s12, y_filter_0_7,
+ y_filter_8_11, zero_s32);
+ d2 = highbd_convolve12_y_8_s32_s16(s2, s3, s4, s5, s6, s7, s8, s9, s10,
+ s11, s12, s13, y_filter_0_7,
+ y_filter_8_11, zero_s32);
+ d3 = highbd_convolve12_y_8_s32_s16(s3, s4, s5, s6, s7, s8, s9, s10, s11,
+ s12, s13, s14, y_filter_0_7,
+ y_filter_8_11, zero_s32);
+
+ d0 = vminq_u16(d0, max);
+ d1 = vminq_u16(d1, max);
+ d2 = vminq_u16(d2, max);
+ d3 = vminq_u16(d3, max);
+
+ if (h == 2) {
+ store_u16_8x2(d, dst_stride, d0, d1);
+ } else {
+ store_u16_8x4(d, dst_stride, d0, d1, d2, d3);
+ }
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ s7 = s11;
+ s8 = s12;
+ s9 = s13;
+ s10 = s14;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ height -= 4;
+ } while (height > 0);
+
+ src_ptr += 8;
+ dst_ptr += 8;
+ w -= 8;
+ } while (w > 0);
+ }
+}
+
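+// Dispatch on the effective filter length: more than 8 taps selects the
+// 12-tap path, fewer than 8 the 6-tap path, and exactly 8 the 8-tap path.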
+void av1_highbd_convolve_y_sr_neon(const uint16_t *src, int src_stride,
+ uint16_t *dst, int dst_stride, int w, int h,
+ const InterpFilterParams *filter_params_y,
+ const int subpel_y_qn, int bd) {
+ const int y_filter_taps = get_filter_tap(filter_params_y, subpel_y_qn);
+ const int vert_offset = filter_params_y->taps / 2 - 1;
+ const int16_t *y_filter_ptr = av1_get_interp_filter_subpel_kernel(
+ filter_params_y, subpel_y_qn & SUBPEL_MASK);
+
+ src -= vert_offset * src_stride;
+
+ if (y_filter_taps > 8) {
+ highbd_convolve_y_sr_12tap_neon(src, src_stride, dst, dst_stride, w, h,
+ y_filter_ptr, bd);
+ return;
+ }
+ if (y_filter_taps < 8) {
+ highbd_convolve_y_sr_6tap_neon(src, src_stride, dst, dst_stride, w, h,
+ y_filter_ptr, bd);
+ return;
+ }
+
+ highbd_convolve_y_sr_8tap_neon(src, src_stride, dst, dst_stride, w, h,
+ y_filter_ptr, bd);
+}
+
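+// Horizontal 8-tap convolution. The kernel applies the round_0 shift
+// (shift_s32); the remaining FILTER_BITS - round_0 shift is applied here
+// before clamping to the pixel range.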
+static INLINE void highbd_convolve_x_sr_8tap_neon(
+ const uint16_t *src_ptr, int src_stride, uint16_t *dst_ptr, int dst_stride,
+ int w, int h, const int16_t *x_filter_ptr, ConvolveParams *conv_params,
+ int bd) {
+ const int16x8_t x_filter = vld1q_s16(x_filter_ptr);
+ const uint16x8_t max = vdupq_n_u16((1 << bd) - 1);
+ const int32x4_t shift_s32 = vdupq_n_s32(-conv_params->round_0);
+ const int bits = FILTER_BITS - conv_params->round_0;
+ const int16x8_t bits_s16 = vdupq_n_s16(-bits);
+ const int32x4_t zero_s32 = vdupq_n_s32(0);
+
+ if (w <= 4) {
+ int16x8_t s0, s1, s2, s3;
+ uint16x4_t d0, d1;
+ uint16x8_t d01;
+
+ const int16_t *s = (const int16_t *)src_ptr;
+ uint16_t *d = dst_ptr;
+
+ do {
+ load_s16_8x2(s, src_stride, &s0, &s2);
+ load_s16_8x2(s + 8, src_stride, &s1, &s3);
+
+ d0 = highbd_convolve8_horiz4_s32_s16(s0, s1, x_filter, shift_s32,
+ zero_s32);
+ d1 = highbd_convolve8_horiz4_s32_s16(s2, s3, x_filter, shift_s32,
+ zero_s32);
+
+ d01 = vcombine_u16(d0, d1);
+ d01 = vqrshlq_u16(d01, bits_s16);
+ d01 = vminq_u16(d01, max);
+
+ if (w == 2) {
+ store_u16q_2x1(d + 0 * dst_stride, d01, 0);
+ store_u16q_2x1(d + 1 * dst_stride, d01, 2);
+ } else {
+ vst1_u16(d + 0 * dst_stride, vget_low_u16(d01));
+ vst1_u16(d + 1 * dst_stride, vget_high_u16(d01));
+ }
+
+ s += 2 * src_stride;
+ d += 2 * dst_stride;
+ h -= 2;
+ } while (h > 0);
+ } else {
+ int height = h;
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7;
+ uint16x8_t d0, d1, d2, d3;
+ do {
+ int width = w;
+ const int16_t *s = (const int16_t *)src_ptr;
+ uint16_t *d = dst_ptr;
+
+ load_s16_8x4(s, src_stride, &s0, &s2, &s4, &s6);
+ s += 8;
+
+ do {
+ load_s16_8x4(s, src_stride, &s1, &s3, &s5, &s7);
+
+ d0 = highbd_convolve8_horiz8_s32_s16(s0, s1, x_filter, shift_s32,
+ zero_s32);
+ d1 = highbd_convolve8_horiz8_s32_s16(s2, s3, x_filter, shift_s32,
+ zero_s32);
+ d2 = highbd_convolve8_horiz8_s32_s16(s4, s5, x_filter, shift_s32,
+ zero_s32);
+ d3 = highbd_convolve8_horiz8_s32_s16(s6, s7, x_filter, shift_s32,
+ zero_s32);
+
+ d0 = vqrshlq_u16(d0, bits_s16);
+ d1 = vqrshlq_u16(d1, bits_s16);
+ d2 = vqrshlq_u16(d2, bits_s16);
+ d3 = vqrshlq_u16(d3, bits_s16);
+
+ d0 = vminq_u16(d0, max);
+ d1 = vminq_u16(d1, max);
+ d2 = vminq_u16(d2, max);
+ d3 = vminq_u16(d3, max);
+
+ if (h == 2) {
+ store_u16_8x2(d, dst_stride, d0, d1);
+ } else {
+ store_u16_8x4(d, dst_stride, d0, d1, d2, d3);
+ }
+
+ s0 = s1;
+ s2 = s3;
+ s4 = s5;
+ s6 = s7;
+ s += 8;
+ d += 8;
+ width -= 8;
+ } while (width > 0);
+ src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ height -= 4;
+ } while (height > 0);
+ }
+}
+
+static INLINE void highbd_convolve_x_sr_12tap_neon(
+ const uint16_t *src_ptr, int src_stride, uint16_t *dst_ptr, int dst_stride,
+ int w, int h, const int16_t *x_filter_ptr, ConvolveParams *conv_params,
+ int bd) {
+ const uint16x8_t max = vdupq_n_u16((1 << bd) - 1);
+ const int32x4_t shift_s32 = vdupq_n_s32(-conv_params->round_0);
+ const int bits = FILTER_BITS - conv_params->round_0;
+ const int16x8_t bits_s16 = vdupq_n_s16(-bits);
+ const int16x8_t x_filter_0_7 = vld1q_s16(x_filter_ptr);
+ const int16x4_t x_filter_8_11 = vld1_s16(x_filter_ptr + 8);
+ const int32x4_t zero_s32 = vdupq_n_s32(0);
+
+ if (w <= 4) {
+ int16x8_t s0, s1, s2, s3;
+ uint16x4_t d0, d1;
+ uint16x8_t d01;
+
+ const int16_t *s = (const int16_t *)src_ptr;
+ uint16_t *d = dst_ptr;
+
+ do {
+ load_s16_8x2(s, src_stride, &s0, &s2);
+ load_s16_8x2(s + 8, src_stride, &s1, &s3);
+
+ d0 = highbd_convolve12_horiz4_s32_s16(s0, s1, x_filter_0_7, x_filter_8_11,
+ shift_s32, zero_s32);
+ d1 = highbd_convolve12_horiz4_s32_s16(s2, s3, x_filter_0_7, x_filter_8_11,
+ shift_s32, zero_s32);
+
+ d01 = vcombine_u16(d0, d1);
+ d01 = vqrshlq_u16(d01, bits_s16);
+ d01 = vminq_u16(d01, max);
+
+ if (w == 2) {
+ store_u16q_2x1(d + 0 * dst_stride, d01, 0);
+ store_u16q_2x1(d + 1 * dst_stride, d01, 2);
+ } else {
+ vst1_u16(d + 0 * dst_stride, vget_low_u16(d01));
+ vst1_u16(d + 1 * dst_stride, vget_high_u16(d01));
+ }
+
+ s += 2 * src_stride;
+ d += 2 * dst_stride;
+ h -= 2;
+ } while (h > 0);
+ } else {
+ int height = h;
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11;
+ uint16x8_t d0, d1, d2, d3;
+ do {
+ int width = w;
+ const int16_t *s = (const int16_t *)src_ptr;
+ uint16_t *d = dst_ptr;
+
+ load_s16_8x4(s, src_stride, &s0, &s3, &s6, &s9);
+ s += 8;
+
+ do {
+ load_s16_8x4(s, src_stride, &s1, &s4, &s7, &s10);
+ load_s16_8x4(s + 8, src_stride, &s2, &s5, &s8, &s11);
+
+ d0 = highbd_convolve12_horiz8_s32_s16(
+ s0, s1, s2, x_filter_0_7, x_filter_8_11, shift_s32, zero_s32);
+ d1 = highbd_convolve12_horiz8_s32_s16(
+ s3, s4, s5, x_filter_0_7, x_filter_8_11, shift_s32, zero_s32);
+ d2 = highbd_convolve12_horiz8_s32_s16(
+ s6, s7, s8, x_filter_0_7, x_filter_8_11, shift_s32, zero_s32);
+ d3 = highbd_convolve12_horiz8_s32_s16(
+ s9, s10, s11, x_filter_0_7, x_filter_8_11, shift_s32, zero_s32);
+
+ d0 = vqrshlq_u16(d0, bits_s16);
+ d1 = vqrshlq_u16(d1, bits_s16);
+ d2 = vqrshlq_u16(d2, bits_s16);
+ d3 = vqrshlq_u16(d3, bits_s16);
+
+ d0 = vminq_u16(d0, max);
+ d1 = vminq_u16(d1, max);
+ d2 = vminq_u16(d2, max);
+ d3 = vminq_u16(d3, max);
+
+ if (h == 2) {
+ store_u16_8x2(d, dst_stride, d0, d1);
+ } else {
+ store_u16_8x4(d, dst_stride, d0, d1, d2, d3);
+ }
+
+ s0 = s1;
+ s1 = s2;
+ s3 = s4;
+ s4 = s5;
+ s6 = s7;
+ s7 = s8;
+ s9 = s10;
+ s10 = s11;
+ s += 8;
+ d += 8;
+ width -= 8;
+ } while (width > 0);
+ src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ height -= 4;
+ } while (height > 0);
+ }
+}
+
+void av1_highbd_convolve_x_sr_neon(const uint16_t *src, int src_stride,
+ uint16_t *dst, int dst_stride, int w, int h,
+ const InterpFilterParams *filter_params_x,
+ const int subpel_x_qn,
+ ConvolveParams *conv_params, int bd) {
+ const int x_filter_taps = get_filter_tap(filter_params_x, subpel_x_qn);
+ const int horiz_offset = filter_params_x->taps / 2 - 1;
+ const int16_t *x_filter_ptr = av1_get_interp_filter_subpel_kernel(
+ filter_params_x, subpel_x_qn & SUBPEL_MASK);
+
+ src -= horiz_offset;
+
+ if (x_filter_taps > 8) {
+ highbd_convolve_x_sr_12tap_neon(src, src_stride, dst, dst_stride, w, h,
+ x_filter_ptr, conv_params, bd);
+ return;
+ }
+
+ highbd_convolve_x_sr_8tap_neon(src, src_stride, dst, dst_stride, w, h,
+ x_filter_ptr, conv_params, bd);
+}
+
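+// Second (vertical) pass of the 2D convolution: applies the round_1 shift
+// together with the offset and correction terms, then clamps to the pixel
+// range.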
+static INLINE void highbd_convolve_2d_y_sr_8tap_neon(
+ const uint16_t *src_ptr, int src_stride, uint16_t *dst_ptr, int dst_stride,
+ int w, int h, const int16_t *y_filter_ptr, ConvolveParams *conv_params,
+ int bd, const int offset, const int correction) {
+ const uint16x8_t max = vdupq_n_u16((1 << bd) - 1);
+ const int16x8_t y_filter = vld1q_s16(y_filter_ptr);
+ const int32x4_t offset_s32 = vdupq_n_s32(offset);
+ const int round1_shift = conv_params->round_1;
+ const int32x4_t round1_shift_s32 = vdupq_n_s32(-round1_shift);
+ const int32x4_t correction_s32 = vdupq_n_s32(correction);
+
+ if (w <= 4) {
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10;
+ uint16x4_t d0, d1, d2, d3;
+ uint16x8_t d01, d23;
+
+ const int16_t *s = (const int16_t *)src_ptr;
+ uint16_t *d = dst_ptr;
+
+ load_s16_4x7(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6);
+ s += 7 * src_stride;
+
+ do {
+ load_s16_4x4(s, src_stride, &s7, &s8, &s9, &s10);
+
+ d0 = highbd_convolve8_4_sr_s32_s16(s0, s1, s2, s3, s4, s5, s6, s7,
+ y_filter, round1_shift_s32, offset_s32,
+ correction_s32);
+ d1 = highbd_convolve8_4_sr_s32_s16(s1, s2, s3, s4, s5, s6, s7, s8,
+ y_filter, round1_shift_s32, offset_s32,
+ correction_s32);
+ d2 = highbd_convolve8_4_sr_s32_s16(s2, s3, s4, s5, s6, s7, s8, s9,
+ y_filter, round1_shift_s32, offset_s32,
+ correction_s32);
+ d3 = highbd_convolve8_4_sr_s32_s16(s3, s4, s5, s6, s7, s8, s9, s10,
+ y_filter, round1_shift_s32, offset_s32,
+ correction_s32);
+
+ d01 = vcombine_u16(d0, d1);
+ d23 = vcombine_u16(d2, d3);
+
+ d01 = vminq_u16(d01, max);
+ d23 = vminq_u16(d23, max);
+
+ if (w == 2) {
+ store_u16q_2x1(d + 0 * dst_stride, d01, 0);
+ store_u16q_2x1(d + 1 * dst_stride, d01, 2);
+ if (h != 2) {
+ store_u16q_2x1(d + 2 * dst_stride, d23, 0);
+ store_u16q_2x1(d + 3 * dst_stride, d23, 2);
+ }
+ } else {
+ vst1_u16(d + 0 * dst_stride, vget_low_u16(d01));
+ vst1_u16(d + 1 * dst_stride, vget_high_u16(d01));
+ if (h != 2) {
+ vst1_u16(d + 2 * dst_stride, vget_low_u16(d23));
+ vst1_u16(d + 3 * dst_stride, vget_high_u16(d23));
+ }
+ }
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ h -= 4;
+ } while (h > 0);
+ } else {
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10;
+ uint16x8_t d0, d1, d2, d3;
+ do {
+ int height = h;
+ const int16_t *s = (const int16_t *)src_ptr;
+ uint16_t *d = dst_ptr;
+
+ load_s16_8x7(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6);
+ s += 7 * src_stride;
+
+ do {
+ load_s16_8x4(s, src_stride, &s7, &s8, &s9, &s10);
+
+ d0 = highbd_convolve8_8_sr_s32_s16(s0, s1, s2, s3, s4, s5, s6, s7,
+ y_filter, round1_shift_s32,
+ offset_s32, correction_s32);
+ d1 = highbd_convolve8_8_sr_s32_s16(s1, s2, s3, s4, s5, s6, s7, s8,
+ y_filter, round1_shift_s32,
+ offset_s32, correction_s32);
+ d2 = highbd_convolve8_8_sr_s32_s16(s2, s3, s4, s5, s6, s7, s8, s9,
+ y_filter, round1_shift_s32,
+ offset_s32, correction_s32);
+ d3 = highbd_convolve8_8_sr_s32_s16(s3, s4, s5, s6, s7, s8, s9, s10,
+ y_filter, round1_shift_s32,
+ offset_s32, correction_s32);
+
+ d0 = vminq_u16(d0, max);
+ d1 = vminq_u16(d1, max);
+ d2 = vminq_u16(d2, max);
+ d3 = vminq_u16(d3, max);
+
+ if (h == 2) {
+ store_u16_8x2(d, dst_stride, d0, d1);
+ } else {
+ store_u16_8x4(d, dst_stride, d0, d1, d2, d3);
+ }
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ height -= 4;
+ } while (height > 0);
+ src_ptr += 8;
+ dst_ptr += 8;
+ w -= 8;
+ } while (w > 0);
+ }
+}
+
+static INLINE void highbd_convolve_2d_y_sr_12tap_neon(
+ const uint16_t *src_ptr, int src_stride, uint16_t *dst_ptr, int dst_stride,
+ int w, int h, const int16_t *y_filter_ptr, ConvolveParams *conv_params,
+ const int bd, const int offset, const int correction) {
+ const uint16x8_t max = vdupq_n_u16((1 << bd) - 1);
+ const int16x8_t y_filter_0_7 = vld1q_s16(y_filter_ptr);
+ const int16x4_t y_filter_8_11 = vld1_s16(y_filter_ptr + 8);
+ const int32x4_t offset_s32 = vdupq_n_s32(offset);
+ const int round1_shift = conv_params->round_1;
+ const int32x4_t round1_shift_s32 = vdupq_n_s32(-round1_shift);
+ const int32x4_t correction_s32 = vdupq_n_s32(correction);
+
+ if (w <= 4) {
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
+ uint16x4_t d0, d1, d2, d3;
+ uint16x8_t d01, d23;
+
+ const int16_t *s = (const int16_t *)src_ptr;
+ uint16_t *d = dst_ptr;
+
+ load_s16_4x11(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6, &s7, &s8,
+ &s9, &s10);
+ s += 11 * src_stride;
+
+ do {
+ load_s16_4x4(s, src_stride, &s11, &s12, &s13, &s14);
+
+ d0 = highbd_convolve12_y_4_sr_s32_s16(
+ s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, y_filter_0_7,
+ y_filter_8_11, round1_shift_s32, offset_s32, correction_s32);
+ d1 = highbd_convolve12_y_4_sr_s32_s16(
+ s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, y_filter_0_7,
+ y_filter_8_11, round1_shift_s32, offset_s32, correction_s32);
+ d2 = highbd_convolve12_y_4_sr_s32_s16(
+ s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, y_filter_0_7,
+ y_filter_8_11, round1_shift_s32, offset_s32, correction_s32);
+ d3 = highbd_convolve12_y_4_sr_s32_s16(
+ s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14, y_filter_0_7,
+ y_filter_8_11, round1_shift_s32, offset_s32, correction_s32);
+
+ d01 = vcombine_u16(d0, d1);
+ d23 = vcombine_u16(d2, d3);
+
+ d01 = vminq_u16(d01, max);
+ d23 = vminq_u16(d23, max);
+
+ if (w == 2) {
+ store_u16q_2x1(d + 0 * dst_stride, d01, 0);
+ store_u16q_2x1(d + 1 * dst_stride, d01, 2);
+ if (h != 2) {
+ store_u16q_2x1(d + 2 * dst_stride, d23, 0);
+ store_u16q_2x1(d + 3 * dst_stride, d23, 2);
+ }
+ } else {
+ vst1_u16(d + 0 * dst_stride, vget_low_u16(d01));
+ vst1_u16(d + 1 * dst_stride, vget_high_u16(d01));
+ if (h != 2) {
+ vst1_u16(d + 2 * dst_stride, vget_low_u16(d23));
+ vst1_u16(d + 3 * dst_stride, vget_high_u16(d23));
+ }
+ }
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ s7 = s11;
+ s8 = s12;
+ s9 = s13;
+ s10 = s14;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ h -= 4;
+ } while (h > 0);
+ } else {
+ uint16x8_t d0, d1, d2, d3;
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
+
+ do {
+ int height = h;
+ const int16_t *s = (const int16_t *)src_ptr;
+ uint16_t *d = dst_ptr;
+
+ load_s16_8x11(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6, &s7, &s8,
+ &s9, &s10);
+ s += 11 * src_stride;
+
+ do {
+ load_s16_8x4(s, src_stride, &s11, &s12, &s13, &s14);
+
+ d0 = highbd_convolve12_y_8_sr_s32_s16(
+ s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, y_filter_0_7,
+ y_filter_8_11, round1_shift_s32, offset_s32, correction_s32);
+ d1 = highbd_convolve12_y_8_sr_s32_s16(
+ s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, y_filter_0_7,
+ y_filter_8_11, round1_shift_s32, offset_s32, correction_s32);
+ d2 = highbd_convolve12_y_8_sr_s32_s16(
+ s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, y_filter_0_7,
+ y_filter_8_11, round1_shift_s32, offset_s32, correction_s32);
+ d3 = highbd_convolve12_y_8_sr_s32_s16(
+ s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14, y_filter_0_7,
+ y_filter_8_11, round1_shift_s32, offset_s32, correction_s32);
+
+ d0 = vminq_u16(d0, max);
+ d1 = vminq_u16(d1, max);
+ d2 = vminq_u16(d2, max);
+ d3 = vminq_u16(d3, max);
+
+ if (h == 2) {
+ store_u16_8x2(d, dst_stride, d0, d1);
+ } else {
+ store_u16_8x4(d, dst_stride, d0, d1, d2, d3);
+ }
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ s7 = s11;
+ s8 = s12;
+ s9 = s13;
+ s10 = s14;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ height -= 4;
+ } while (height > 0);
+
+ src_ptr += 8;
+ dst_ptr += 8;
+ w -= 8;
+ } while (w > 0);
+ }
+}
+
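+// First (horizontal) pass of the 2D convolution. Results stay at
+// intermediate precision with the offset added, so no clamping is done here.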
+static INLINE void highbd_convolve_x_8tap_neon(
+ const uint16_t *src_ptr, int src_stride, uint16_t *dst_ptr, int dst_stride,
+ int w, int h, const int16_t *x_filter_ptr, ConvolveParams *conv_params,
+ const int offset) {
+ const int16x8_t x_filter = vld1q_s16(x_filter_ptr);
+ const int32x4_t shift_s32 = vdupq_n_s32(-conv_params->round_0);
+ const int32x4_t offset_s32 = vdupq_n_s32(offset);
+
+ if (w <= 4) {
+ int16x8_t s0, s1, s2, s3;
+ uint16x4_t d0, d1;
+ uint16x8_t d01;
+
+ const int16_t *s = (const int16_t *)src_ptr;
+ uint16_t *d = dst_ptr;
+
+ do {
+ load_s16_8x2(s, src_stride, &s0, &s2);
+ load_s16_8x2(s + 8, src_stride, &s1, &s3);
+
+ d0 = highbd_convolve8_horiz4_s32_s16(s0, s1, x_filter, shift_s32,
+ offset_s32);
+ d1 = highbd_convolve8_horiz4_s32_s16(s2, s3, x_filter, shift_s32,
+ offset_s32);
+
+ d01 = vcombine_u16(d0, d1);
+
+ if (w == 2) {
+ store_u16q_2x1(d + 0 * dst_stride, d01, 0);
+ store_u16q_2x1(d + 1 * dst_stride, d01, 2);
+ } else {
+ vst1_u16(d + 0 * dst_stride, vget_low_u16(d01));
+ vst1_u16(d + 1 * dst_stride, vget_high_u16(d01));
+ }
+
+ s += 2 * src_stride;
+ d += 2 * dst_stride;
+ h -= 2;
+ } while (h > 0);
+ } else {
+ int height = h;
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7;
+ uint16x8_t d0, d1, d2, d3;
+ do {
+ int width = w;
+ const int16_t *s = (const int16_t *)src_ptr;
+ uint16_t *d = dst_ptr;
+
+ load_s16_8x4(s, src_stride, &s0, &s2, &s4, &s6);
+ s += 8;
+
+ do {
+ load_s16_8x4(s, src_stride, &s1, &s3, &s5, &s7);
+
+ d0 = highbd_convolve8_horiz8_s32_s16(s0, s1, x_filter, shift_s32,
+ offset_s32);
+ d1 = highbd_convolve8_horiz8_s32_s16(s2, s3, x_filter, shift_s32,
+ offset_s32);
+ d2 = highbd_convolve8_horiz8_s32_s16(s4, s5, x_filter, shift_s32,
+ offset_s32);
+ d3 = highbd_convolve8_horiz8_s32_s16(s6, s7, x_filter, shift_s32,
+ offset_s32);
+
+ if (h == 2) {
+ store_u16_8x2(d, dst_stride, d0, d1);
+ } else {
+ store_u16_8x4(d, dst_stride, d0, d1, d2, d3);
+ }
+
+ s0 = s1;
+ s2 = s3;
+ s4 = s5;
+ s6 = s7;
+ s += 8;
+ d += 8;
+ width -= 8;
+ } while (width > 0);
+ src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ height -= 4;
+ } while (height > 0);
+ }
+}
+
+static INLINE void highbd_convolve_2d_x_sr_12tap_neon(
+ const uint16_t *src_ptr, int src_stride, uint16_t *dst_ptr, int dst_stride,
+ int w, int h, const int16_t *x_filter_ptr, ConvolveParams *conv_params,
+ const int offset) {
+ const int32x4_t shift_s32 = vdupq_n_s32(-conv_params->round_0);
+ const int16x8_t x_filter_0_7 = vld1q_s16(x_filter_ptr);
+ const int16x4_t x_filter_8_11 = vld1_s16(x_filter_ptr + 8);
+ const int32x4_t offset_s32 = vdupq_n_s32(offset);
+
+ if (w <= 4) {
+ int16x8_t s0, s1, s2, s3;
+ uint16x4_t d0, d1;
+ uint16x8_t d01;
+
+ const int16_t *s = (const int16_t *)src_ptr;
+ uint16_t *d = dst_ptr;
+
+ do {
+ load_s16_8x2(s, src_stride, &s0, &s2);
+ load_s16_8x2(s + 8, src_stride, &s1, &s3);
+
+ d0 = highbd_convolve12_horiz4_s32_s16(s0, s1, x_filter_0_7, x_filter_8_11,
+ shift_s32, offset_s32);
+ d1 = highbd_convolve12_horiz4_s32_s16(s2, s3, x_filter_0_7, x_filter_8_11,
+ shift_s32, offset_s32);
+
+ d01 = vcombine_u16(d0, d1);
+
+ if (w == 2) {
+ store_u16q_2x1(d + 0 * dst_stride, d01, 0);
+ store_u16q_2x1(d + 1 * dst_stride, d01, 2);
+ } else {
+ vst1_u16(d + 0 * dst_stride, vget_low_u16(d01));
+ vst1_u16(d + 1 * dst_stride, vget_high_u16(d01));
+ }
+
+ s += 2 * src_stride;
+ d += 2 * dst_stride;
+ h -= 2;
+ } while (h > 0);
+ } else {
+ int height = h;
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11;
+ uint16x8_t d0, d1, d2, d3;
+ do {
+ int width = w;
+ const int16_t *s = (const int16_t *)src_ptr;
+ uint16_t *d = dst_ptr;
+
+ load_s16_8x4(s, src_stride, &s0, &s3, &s6, &s9);
+ s += 8;
+
+ do {
+ load_s16_8x4(s, src_stride, &s1, &s4, &s7, &s10);
+ load_s16_8x4(s + 8, src_stride, &s2, &s5, &s8, &s11);
+
+ d0 = highbd_convolve12_horiz8_s32_s16(
+ s0, s1, s2, x_filter_0_7, x_filter_8_11, shift_s32, offset_s32);
+ d1 = highbd_convolve12_horiz8_s32_s16(
+ s3, s4, s5, x_filter_0_7, x_filter_8_11, shift_s32, offset_s32);
+ d2 = highbd_convolve12_horiz8_s32_s16(
+ s6, s7, s8, x_filter_0_7, x_filter_8_11, shift_s32, offset_s32);
+ d3 = highbd_convolve12_horiz8_s32_s16(
+ s9, s10, s11, x_filter_0_7, x_filter_8_11, shift_s32, offset_s32);
+
+ if (h == 2) {
+ store_u16_8x2(d, dst_stride, d0, d1);
+ } else {
+ store_u16_8x4(d, dst_stride, d0, d1, d2, d3);
+ }
+
+ s0 = s1;
+ s1 = s2;
+ s3 = s4;
+ s4 = s5;
+ s6 = s7;
+ s7 = s8;
+ s9 = s10;
+ s10 = s11;
+ s += 8;
+ d += 8;
+ width -= 8;
+ } while (width > 0);
+ src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ height -= 4;
+ } while (height > 0);
+ }
+}
+
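+// 2D single-reference convolution: a horizontal pass into the intermediate
+// im_block buffer, followed by a vertical pass into the destination. Filters
+// longer than 8 taps take the 12-tap kernels.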
+void av1_highbd_convolve_2d_sr_neon(const uint16_t *src, int src_stride,
+ uint16_t *dst, int dst_stride, int w, int h,
+ const InterpFilterParams *filter_params_x,
+ const InterpFilterParams *filter_params_y,
+ const int subpel_x_qn,
+ const int subpel_y_qn,
+ ConvolveParams *conv_params, int bd) {
+ DECLARE_ALIGNED(16, uint16_t,
+ im_block[(MAX_SB_SIZE + MAX_FILTER_TAP) * MAX_SB_SIZE]);
+ const int im_h = h + filter_params_y->taps - 1;
+ const int im_stride = MAX_SB_SIZE;
+ const int vert_offset = filter_params_y->taps / 2 - 1;
+ const int horiz_offset = filter_params_x->taps / 2 - 1;
+ const int x_offset_initial = (1 << (bd + FILTER_BITS - 1));
+ const int y_offset_bits = bd + 2 * FILTER_BITS - conv_params->round_0;
+ const int y_offset_initial = (1 << y_offset_bits);
+ const int y_offset_correction =
+ ((1 << (y_offset_bits - conv_params->round_1)) +
+ (1 << (y_offset_bits - conv_params->round_1 - 1)));
+
+ const uint16_t *src_ptr = src - vert_offset * src_stride - horiz_offset;
+
+ const int16_t *x_filter_ptr = av1_get_interp_filter_subpel_kernel(
+ filter_params_x, subpel_x_qn & SUBPEL_MASK);
+ const int16_t *y_filter_ptr = av1_get_interp_filter_subpel_kernel(
+ filter_params_y, subpel_y_qn & SUBPEL_MASK);
+
+ if (filter_params_x->taps > 8) {
+ highbd_convolve_2d_x_sr_12tap_neon(src_ptr, src_stride, im_block, im_stride,
+ w, im_h, x_filter_ptr, conv_params,
+ x_offset_initial);
+
+ highbd_convolve_2d_y_sr_12tap_neon(im_block, im_stride, dst, dst_stride, w,
+ h, y_filter_ptr, conv_params, bd,
+ y_offset_initial, y_offset_correction);
+ } else {
+ highbd_convolve_x_8tap_neon(src_ptr, src_stride, im_block, im_stride, w,
+ im_h, x_filter_ptr, conv_params,
+ x_offset_initial);
+
+ highbd_convolve_2d_y_sr_8tap_neon(im_block, im_stride, dst, dst_stride, w,
+ h, y_filter_ptr, conv_params, bd,
+ y_offset_initial, y_offset_correction);
+ }
+}
+
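+// Horizontal pass of the scaled 2D convolution. Each group of four outputs
+// may use a different source position and subpel filter, so both are derived
+// from x_qn in SIMD and then gathered through scalar pointers.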
+static INLINE void highbd_convolve_2d_x_scale_8tap_neon(
+ const uint16_t *src_ptr, int src_stride, uint16_t *dst_ptr, int dst_stride,
+ int w, int h, const int subpel_x_qn, const int x_step_qn,
+ const InterpFilterParams *filter_params, ConvolveParams *conv_params,
+ const int offset) {
+ const uint32x4_t idx = { 0, 1, 2, 3 };
+ const uint32x4_t subpel_mask = vdupq_n_u32(SCALE_SUBPEL_MASK);
+ const int32x4_t shift_s32 = vdupq_n_s32(-conv_params->round_0);
+ const int32x4_t offset_s32 = vdupq_n_s32(offset);
+
+ if (w <= 4) {
+ int height = h;
+ int16x8_t s0, s1, s2, s3;
+ uint16x4_t d0;
+
+ uint16_t *d = dst_ptr;
+
+ do {
+ int x_qn = subpel_x_qn;
+
+ // Load 4 src vectors at a time; they might be the same, but we have to
+ // calculate the indices anyway. Doing it in SIMD and then storing the
+ // indices is faster than evaluating the expression
+ // &src_ptr[((x_qn + 0 * x_step_qn) >> SCALE_SUBPEL_BITS)] four times.
+ // Ideally this would be a gather using the indices, but NEON does not
+ // have one, so we have to emulate it.
+ const uint32x4_t xqn_idx = vmlaq_n_u32(vdupq_n_u32(x_qn), idx, x_step_qn);
+ // We have to multiply by 2 to get the byte offset, since
+ // sizeof(uint16_t) == 2.
+ const uint32x4_t src_idx_u32 =
+ vshlq_n_u32(vshrq_n_u32(xqn_idx, SCALE_SUBPEL_BITS), 1);
+#if AOM_ARCH_AARCH64
+ uint64x2_t src4[2];
+ src4[0] = vaddw_u32(vdupq_n_u64((const uint64_t)src_ptr),
+ vget_low_u32(src_idx_u32));
+ src4[1] = vaddw_u32(vdupq_n_u64((const uint64_t)src_ptr),
+ vget_high_u32(src_idx_u32));
+ int16_t *src4_ptr[4];
+ uint64_t *tmp_ptr = (uint64_t *)&src4_ptr;
+ vst1q_u64(tmp_ptr, src4[0]);
+ vst1q_u64(tmp_ptr + 2, src4[1]);
+#else
+ uint32x4_t src4;
+ src4 = vaddq_u32(vdupq_n_u32((const uint32_t)src_ptr), src_idx_u32);
+ int16_t *src4_ptr[4];
+ uint32_t *tmp_ptr = (uint32_t *)&src4_ptr;
+ vst1q_u32(tmp_ptr, src4);
+#endif // AOM_ARCH_AARCH64
+ // Same for the filter vectors
+ const int32x4_t filter_idx_s32 = vreinterpretq_s32_u32(
+ vshrq_n_u32(vandq_u32(xqn_idx, subpel_mask), SCALE_EXTRA_BITS));
+ int32_t x_filter4_idx[4];
+ vst1q_s32(x_filter4_idx, filter_idx_s32);
+ const int16_t *x_filter4_ptr[4];
+
+ // Load source
+ s0 = vld1q_s16(src4_ptr[0]);
+ s1 = vld1q_s16(src4_ptr[1]);
+ s2 = vld1q_s16(src4_ptr[2]);
+ s3 = vld1q_s16(src4_ptr[3]);
+
+ // We could easily do this using SIMD as well instead of calling the
+ // inline function 4 times.
+ x_filter4_ptr[0] =
+ av1_get_interp_filter_subpel_kernel(filter_params, x_filter4_idx[0]);
+ x_filter4_ptr[1] =
+ av1_get_interp_filter_subpel_kernel(filter_params, x_filter4_idx[1]);
+ x_filter4_ptr[2] =
+ av1_get_interp_filter_subpel_kernel(filter_params, x_filter4_idx[2]);
+ x_filter4_ptr[3] =
+ av1_get_interp_filter_subpel_kernel(filter_params, x_filter4_idx[3]);
+
+ // Actually load the filters
+ const int16x8_t x_filter0 = vld1q_s16(x_filter4_ptr[0]);
+ const int16x8_t x_filter1 = vld1q_s16(x_filter4_ptr[1]);
+ const int16x8_t x_filter2 = vld1q_s16(x_filter4_ptr[2]);
+ const int16x8_t x_filter3 = vld1q_s16(x_filter4_ptr[3]);
+
+ // Group low and high parts and transpose
+ int16x4_t filters_lo[] = { vget_low_s16(x_filter0),
+ vget_low_s16(x_filter1),
+ vget_low_s16(x_filter2),
+ vget_low_s16(x_filter3) };
+ int16x4_t filters_hi[] = { vget_high_s16(x_filter0),
+ vget_high_s16(x_filter1),
+ vget_high_s16(x_filter2),
+ vget_high_s16(x_filter3) };
+ transpose_u16_4x4((uint16x4_t *)filters_lo);
+ transpose_u16_4x4((uint16x4_t *)filters_hi);
+
+ // Run the 2D Scale convolution
+ d0 = highbd_convolve8_2d_scale_horiz4x8_s32_s16(
+ s0, s1, s2, s3, filters_lo, filters_hi, shift_s32, offset_s32);
+
+ if (w == 2) {
+ store_u16_2x1(d + 0 * dst_stride, d0, 0);
+ } else {
+ vst1_u16(d + 0 * dst_stride, d0);
+ }
+
+ src_ptr += src_stride;
+ d += dst_stride;
+ height--;
+ } while (height > 0);
+ } else {
+ int height = h;
+ int16x8_t s0, s1, s2, s3;
+ uint16x4_t d0;
+
+ do {
+ int width = w;
+ int x_qn = subpel_x_qn;
+ uint16_t *d = dst_ptr;
+ const uint16_t *s = src_ptr;
+
+ do {
+ // Load 4 src vectors at a time; they might be the same, but we have to
+ // calculate the indices anyway. Doing it in SIMD and then storing the
+ // indices is faster than evaluating the expression
+ // &src_ptr[((x_qn + 0 * x_step_qn) >> SCALE_SUBPEL_BITS)] four times.
+ // Ideally this would be a gather using the indices, but NEON does not
+ // have one, so we have to emulate it.
+ const uint32x4_t xqn_idx =
+ vmlaq_n_u32(vdupq_n_u32(x_qn), idx, x_step_qn);
+ // We have to multiply by 2 to get the byte offset, since
+ // sizeof(uint16_t) == 2.
+ const uint32x4_t src_idx_u32 =
+ vshlq_n_u32(vshrq_n_u32(xqn_idx, SCALE_SUBPEL_BITS), 1);
+#if AOM_ARCH_AARCH64
+ uint64x2_t src4[2];
+ src4[0] = vaddw_u32(vdupq_n_u64((const uint64_t)s),
+ vget_low_u32(src_idx_u32));
+ src4[1] = vaddw_u32(vdupq_n_u64((const uint64_t)s),
+ vget_high_u32(src_idx_u32));
+ int16_t *src4_ptr[4];
+ uint64_t *tmp_ptr = (uint64_t *)&src4_ptr;
+ vst1q_u64(tmp_ptr, src4[0]);
+ vst1q_u64(tmp_ptr + 2, src4[1]);
+#else
+ uint32x4_t src4;
+ src4 = vaddq_u32(vdupq_n_u32((const uint32_t)s), src_idx_u32);
+ int16_t *src4_ptr[4];
+ uint32_t *tmp_ptr = (uint32_t *)&src4_ptr;
+ vst1q_u32(tmp_ptr, src4);
+#endif // AOM_ARCH_AARCH64
+ // Same for the filter vectors
+ const int32x4_t filter_idx_s32 = vreinterpretq_s32_u32(
+ vshrq_n_u32(vandq_u32(xqn_idx, subpel_mask), SCALE_EXTRA_BITS));
+ int32_t x_filter4_idx[4];
+ vst1q_s32(x_filter4_idx, filter_idx_s32);
+ const int16_t *x_filter4_ptr[4];
+
+ // Load source
+ s0 = vld1q_s16(src4_ptr[0]);
+ s1 = vld1q_s16(src4_ptr[1]);
+ s2 = vld1q_s16(src4_ptr[2]);
+ s3 = vld1q_s16(src4_ptr[3]);
+
+ // We could easily do this using SIMD as well instead of calling the
+ // inline function 4 times.
+ x_filter4_ptr[0] = av1_get_interp_filter_subpel_kernel(
+ filter_params, x_filter4_idx[0]);
+ x_filter4_ptr[1] = av1_get_interp_filter_subpel_kernel(
+ filter_params, x_filter4_idx[1]);
+ x_filter4_ptr[2] = av1_get_interp_filter_subpel_kernel(
+ filter_params, x_filter4_idx[2]);
+ x_filter4_ptr[3] = av1_get_interp_filter_subpel_kernel(
+ filter_params, x_filter4_idx[3]);
+
+ // Actually load the filters
+ const int16x8_t x_filter0 = vld1q_s16(x_filter4_ptr[0]);
+ const int16x8_t x_filter1 = vld1q_s16(x_filter4_ptr[1]);
+ const int16x8_t x_filter2 = vld1q_s16(x_filter4_ptr[2]);
+ const int16x8_t x_filter3 = vld1q_s16(x_filter4_ptr[3]);
+
+ // Group low and high parts and transpose
+ int16x4_t filters_lo[] = { vget_low_s16(x_filter0),
+ vget_low_s16(x_filter1),
+ vget_low_s16(x_filter2),
+ vget_low_s16(x_filter3) };
+ int16x4_t filters_hi[] = { vget_high_s16(x_filter0),
+ vget_high_s16(x_filter1),
+ vget_high_s16(x_filter2),
+ vget_high_s16(x_filter3) };
+ transpose_u16_4x4((uint16x4_t *)filters_lo);
+ transpose_u16_4x4((uint16x4_t *)filters_hi);
+
+ // Run the 2D Scale X convolution
+ d0 = highbd_convolve8_2d_scale_horiz4x8_s32_s16(
+ s0, s1, s2, s3, filters_lo, filters_hi, shift_s32, offset_s32);
+
+ vst1_u16(d, d0);
+
+ x_qn += 4 * x_step_qn;
+ d += 4;
+ width -= 4;
+ } while (width > 0);
+
+ src_ptr += src_stride;
+ dst_ptr += dst_stride;
+ height--;
+ } while (height > 0);
+ }
+}
+
+static INLINE void highbd_convolve_2d_y_scale_8tap_neon(
+ const uint16_t *src_ptr, int src_stride, uint16_t *dst_ptr, int dst_stride,
+ int w, int h, const int subpel_y_qn, const int y_step_qn,
+ const InterpFilterParams *filter_params, const int round1_bits,
+ const int offset) {
+ const int32x4_t offset_s32 = vdupq_n_s32(1 << offset);
+
+ const int32x4_t round1_shift_s32 = vdupq_n_s32(-round1_bits);
+ if (w <= 4) {
+ int height = h;
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7;
+ uint16x4_t d0;
+
+ uint16_t *d = dst_ptr;
+
+ int y_qn = subpel_y_qn;
+ do {
+ const int16_t *s =
+ (const int16_t *)&src_ptr[(y_qn >> SCALE_SUBPEL_BITS) * src_stride];
+
+ load_s16_4x8(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6, &s7);
+
+ const int y_filter_idx = (y_qn & SCALE_SUBPEL_MASK) >> SCALE_EXTRA_BITS;
+ const int16_t *y_filter_ptr =
+ av1_get_interp_filter_subpel_kernel(filter_params, y_filter_idx);
+ const int16x8_t y_filter = vld1q_s16(y_filter_ptr);
+
+ d0 = highbd_convolve8_4_sr_s32_s16(s0, s1, s2, s3, s4, s5, s6, s7,
+ y_filter, round1_shift_s32, offset_s32,
+ vdupq_n_s32(0));
+
+ if (w == 2) {
+ store_u16_2x1(d, d0, 0);
+ } else {
+ vst1_u16(d, d0);
+ }
+
+ y_qn += y_step_qn;
+ d += dst_stride;
+ height--;
+ } while (height > 0);
+ } else {
+ int width = w;
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7;
+ uint16x8_t d0;
+
+ do {
+ int height = h;
+ int y_qn = subpel_y_qn;
+
+ uint16_t *d = dst_ptr;
+
+ do {
+ const int16_t *s =
+ (const int16_t *)&src_ptr[(y_qn >> SCALE_SUBPEL_BITS) * src_stride];
+ load_s16_8x8(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6, &s7);
+
+ const int y_filter_idx = (y_qn & SCALE_SUBPEL_MASK) >> SCALE_EXTRA_BITS;
+ const int16_t *y_filter_ptr =
+ av1_get_interp_filter_subpel_kernel(filter_params, y_filter_idx);
+ const int16x8_t y_filter = vld1q_s16(y_filter_ptr);
+
+ d0 = highbd_convolve8_8_sr_s32_s16(s0, s1, s2, s3, s4, s5, s6, s7,
+ y_filter, round1_shift_s32,
+ offset_s32, vdupq_n_s32(0));
+ vst1q_u16(d, d0);
+
+ y_qn += y_step_qn;
+ d += dst_stride;
+ height--;
+ } while (height > 0);
+ src_ptr += 8;
+ dst_ptr += 8;
+ width -= 8;
+ } while (width > 0);
+ }
+}
+
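+// Distance-weighted compound average: blends the convolve buffer and the new
+// prediction as (d16 * fwd_offset + s * bck_offset) >> DIST_PRECISION_BITS
+// before removing the round offset and clamping.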
+static INLINE void highbd_dist_wtd_comp_avg_neon(
+ const uint16_t *src_ptr, int src_stride, uint16_t *dst_ptr, int dst_stride,
+ int w, int h, ConvolveParams *conv_params, const int round_bits,
+ const int offset, const int bd) {
+ CONV_BUF_TYPE *dst16 = conv_params->dst;
+ const int dst16_stride = conv_params->dst_stride;
+ const int32x4_t round_shift_s32 = vdupq_n_s32(-round_bits);
+ const int16x4_t offset_s16 = vdup_n_s16(offset);
+ const uint16x8_t max = vdupq_n_u16((1 << bd) - 1);
+ uint16x4_t fwd_offset_u16 = vdup_n_u16(conv_params->fwd_offset);
+ uint16x4_t bck_offset_u16 = vdup_n_u16(conv_params->bck_offset);
+
+ // Weighted averaging
+ if (w <= 4) {
+ for (int y = 0; y < h; ++y) {
+ const uint16x4_t s = vld1_u16(src_ptr + y * src_stride);
+ const uint16x4_t d16 = vld1_u16(dst16 + y * dst16_stride);
+ // We use vmull_u16/vmlal_u16 instead of vmull_s16/vmlal_s16 because the
+ // latter sign-extend and the values are non-negative. However, d0/d1 are
+ // signed integers, so we use vqmovun to do saturated narrowing to
+ // unsigned.
+ int32x4_t d0 = vreinterpretq_s32_u32(vmull_u16(d16, fwd_offset_u16));
+ d0 = vreinterpretq_s32_u32(
+ vmlal_u16(vreinterpretq_u32_s32(d0), s, bck_offset_u16));
+ d0 = vshrq_n_s32(d0, DIST_PRECISION_BITS);
+ // Subtract the rounding offset and apply the final rounding shift.
+ d0 = vqrshlq_s32(vsubw_s16(d0, offset_s16), round_shift_s32);
+ uint16x4_t d = vqmovun_s32(d0);
+ d = vmin_u16(d, vget_low_u16(max));
+ if (w == 2) {
+ store_u16_2x1(dst_ptr + y * dst_stride, d, 0);
+ } else {
+ vst1_u16(dst_ptr + y * dst_stride, d);
+ }
+ }
+ } else {
+ for (int y = 0; y < h; ++y) {
+ for (int x = 0; x < w; x += 8) {
+ const uint16x8_t s = vld1q_u16(src_ptr + y * src_stride + x);
+ const uint16x8_t d16 = vld1q_u16(dst16 + y * dst16_stride + x);
+ // We use vmull_u16/vmlal_u16 instead of vmull_s16/vmlal_s16 because the
+ // latter sign-extend and the values are non-negative. However, d0/d1 are
+ // signed integers, so we use vqmovun to do saturated narrowing to
+ // unsigned.
+ int32x4_t d0 =
+ vreinterpretq_s32_u32(vmull_u16(vget_low_u16(d16), fwd_offset_u16));
+ int32x4_t d1 = vreinterpretq_s32_u32(
+ vmull_u16(vget_high_u16(d16), fwd_offset_u16));
+ d0 = vreinterpretq_s32_u32(vmlal_u16(vreinterpretq_u32_s32(d0),
+ vget_low_u16(s), bck_offset_u16));
+ d1 = vreinterpretq_s32_u32(vmlal_u16(vreinterpretq_u32_s32(d1),
+ vget_high_u16(s), bck_offset_u16));
+ d0 = vshrq_n_s32(d0, DIST_PRECISION_BITS);
+ d1 = vshrq_n_s32(d1, DIST_PRECISION_BITS);
+ d0 = vqrshlq_s32(vsubw_s16(d0, offset_s16), round_shift_s32);
+ d1 = vqrshlq_s32(vsubw_s16(d1, offset_s16), round_shift_s32);
+ uint16x8_t d01 = vcombine_u16(vqmovun_s32(d0), vqmovun_s32(d1));
+ d01 = vminq_u16(d01, max);
+ vst1q_u16(dst_ptr + y * dst_stride + x, d01);
+ }
+ }
+ }
+}
+
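+// Plain compound average: vhaddq_s32 halves the sum of the two predictions
+// before the round offset is removed and the result is clamped.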
+static INLINE void highbd_comp_avg_neon(const uint16_t *src_ptr, int src_stride,
+ uint16_t *dst_ptr, int dst_stride,
+ int w, int h,
+ ConvolveParams *conv_params,
+ const int round_bits, const int offset,
+ const int bd) {
+ CONV_BUF_TYPE *dst16 = conv_params->dst;
+ const int dst16_stride = conv_params->dst_stride;
+ const int32x4_t round_shift_s32 = vdupq_n_s32(-round_bits);
+ const int16x4_t offset_s16 = vdup_n_s16(offset);
+ const uint16x8_t max = vdupq_n_u16((1 << bd) - 1);
+
+ if (w <= 4) {
+ for (int y = 0; y < h; ++y) {
+ const uint16x4_t s = vld1_u16(src_ptr + y * src_stride);
+ const uint16x4_t d16 = vld1_u16(dst16 + y * dst16_stride);
+ int32x4_t s_s32 = vreinterpretq_s32_u32(vmovl_u16(s));
+ int32x4_t d16_s32 = vreinterpretq_s32_u32(vmovl_u16(d16));
+ int32x4_t d0 = vhaddq_s32(s_s32, d16_s32);
+ d0 = vsubw_s16(d0, offset_s16);
+ d0 = vqrshlq_s32(d0, round_shift_s32);
+ uint16x4_t d = vqmovun_s32(d0);
+ d = vmin_u16(d, vget_low_u16(max));
+ if (w == 2) {
+ store_u16_2x1(dst_ptr + y * dst_stride, d, 0);
+ } else {
+ vst1_u16(dst_ptr + y * dst_stride, d);
+ }
+ }
+ } else {
+ for (int y = 0; y < h; ++y) {
+ for (int x = 0; x < w; x += 8) {
+ const uint16x8_t s = vld1q_u16(src_ptr + y * src_stride + x);
+ const uint16x8_t d16 = vld1q_u16(dst16 + y * dst16_stride + x);
+ int32x4_t s_lo = vreinterpretq_s32_u32(vmovl_u16(vget_low_u16(s)));
+ int32x4_t s_hi = vreinterpretq_s32_u32(vmovl_u16(vget_high_u16(s)));
+ int32x4_t d16_lo = vreinterpretq_s32_u32(vmovl_u16(vget_low_u16(d16)));
+ int32x4_t d16_hi = vreinterpretq_s32_u32(vmovl_u16(vget_high_u16(d16)));
+ int32x4_t d0 = vhaddq_s32(s_lo, d16_lo);
+ int32x4_t d1 = vhaddq_s32(s_hi, d16_hi);
+ d0 = vsubw_s16(d0, offset_s16);
+ d1 = vsubw_s16(d1, offset_s16);
+ d0 = vqrshlq_s32(d0, round_shift_s32);
+ d1 = vqrshlq_s32(d1, round_shift_s32);
+ uint16x8_t d01 = vcombine_u16(vqmovun_s32(d0), vqmovun_s32(d1));
+ d01 = vminq_u16(d01, max);
+ vst1q_u16(dst_ptr + y * dst_stride + x, d01);
+ }
+ }
+ }
+}
+
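+// Converts intermediate convolve output back to pixels: subtracts the round
+// offset, applies the final rounding shift and clamps to [0, (1 << bd) - 1].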
+static INLINE void highbd_convolve_correct_offset_neon(
+ const uint16_t *src_ptr, int src_stride, uint16_t *dst_ptr, int dst_stride,
+ int w, int h, const int round_bits, const int offset, const int bd) {
+ const int32x4_t round_shift_s32 = vdupq_n_s32(-round_bits);
+ const int16x4_t offset_s16 = vdup_n_s16(offset);
+ const uint16x8_t max = vdupq_n_u16((1 << bd) - 1);
+
+ if (w <= 4) {
+ for (int y = 0; y < h; ++y) {
+ const int16x4_t s = vld1_s16((const int16_t *)src_ptr + y * src_stride);
+ const int32x4_t d0 =
+ vqrshlq_s32(vsubl_s16(s, offset_s16), round_shift_s32);
+ uint16x4_t d = vqmovun_s32(d0);
+ d = vmin_u16(d, vget_low_u16(max));
+ if (w == 2) {
+ store_u16_2x1(dst_ptr + y * dst_stride, d, 0);
+ } else {
+ vst1_u16(dst_ptr + y * dst_stride, d);
+ }
+ }
+ } else {
+ for (int y = 0; y < h; ++y) {
+ for (int x = 0; x < w; x += 8) {
+ // Subtract the rounding offset and apply the final rounding shift.
+ const int16x8_t s =
+ vld1q_s16((const int16_t *)src_ptr + y * src_stride + x);
+ const int32x4_t d0 = vqrshlq_s32(vsubl_s16(vget_low_s16(s), offset_s16),
+ round_shift_s32);
+ const int32x4_t d1 = vqrshlq_s32(
+ vsubl_s16(vget_high_s16(s), offset_s16), round_shift_s32);
+ uint16x8_t d01 = vcombine_u16(vqmovun_s32(d0), vqmovun_s32(d1));
+ d01 = vminq_u16(d01, max);
+ vst1q_u16(dst_ptr + y * dst_stride + x, d01);
+ }
+ }
+ }
+}
+
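+// The two full-size scratch buffers are heap-allocated rather than placed on
+// the stack; on allocation failure the function returns without filtering.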
+void av1_highbd_convolve_2d_scale_neon(
+ const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w,
+ int h, const InterpFilterParams *filter_params_x,
+ const InterpFilterParams *filter_params_y, const int subpel_x_qn,
+ const int x_step_qn, const int subpel_y_qn, const int y_step_qn,
+ ConvolveParams *conv_params, int bd) {
+ uint16_t *im_block = (uint16_t *)aom_memalign(
+ 16, 2 * sizeof(uint16_t) * MAX_SB_SIZE * (MAX_SB_SIZE + MAX_FILTER_TAP));
+ if (!im_block) return;
+ uint16_t *im_block2 = (uint16_t *)aom_memalign(
+ 16, 2 * sizeof(uint16_t) * MAX_SB_SIZE * (MAX_SB_SIZE + MAX_FILTER_TAP));
+ if (!im_block2) {
+ aom_free(im_block); // free the first block and return.
+ return;
+ }
+
+ int im_h = (((h - 1) * y_step_qn + subpel_y_qn) >> SCALE_SUBPEL_BITS) +
+ filter_params_y->taps;
+ const int im_stride = MAX_SB_SIZE;
+ const int bits =
+ FILTER_BITS * 2 - conv_params->round_0 - conv_params->round_1;
+ assert(bits >= 0);
+
+ const int vert_offset = filter_params_y->taps / 2 - 1;
+ const int horiz_offset = filter_params_x->taps / 2 - 1;
+ const int x_offset_bits = (1 << (bd + FILTER_BITS - 1));
+ const int y_offset_bits = bd + 2 * FILTER_BITS - conv_params->round_0;
+ const int y_offset_correction =
+ ((1 << (y_offset_bits - conv_params->round_1)) +
+ (1 << (y_offset_bits - conv_params->round_1 - 1)));
+
+ CONV_BUF_TYPE *dst16 = conv_params->dst;
+ const int dst16_stride = conv_params->dst_stride;
+
+ const uint16_t *src_ptr = src - vert_offset * src_stride - horiz_offset;
+
+ highbd_convolve_2d_x_scale_8tap_neon(
+ src_ptr, src_stride, im_block, im_stride, w, im_h, subpel_x_qn, x_step_qn,
+ filter_params_x, conv_params, x_offset_bits);
+ if (conv_params->is_compound && !conv_params->do_average) {
+ highbd_convolve_2d_y_scale_8tap_neon(
+ im_block, im_stride, dst16, dst16_stride, w, h, subpel_y_qn, y_step_qn,
+ filter_params_y, conv_params->round_1, y_offset_bits);
+ } else {
+ highbd_convolve_2d_y_scale_8tap_neon(
+ im_block, im_stride, im_block2, im_stride, w, h, subpel_y_qn, y_step_qn,
+ filter_params_y, conv_params->round_1, y_offset_bits);
+ }
+
+ // Do the compound averaging outside the loop; this avoids branching within
+ // the main loop.
+ if (conv_params->is_compound) {
+ if (conv_params->do_average) {
+ if (conv_params->use_dist_wtd_comp_avg) {
+ highbd_dist_wtd_comp_avg_neon(im_block2, im_stride, dst, dst_stride, w,
+ h, conv_params, bits, y_offset_correction,
+ bd);
+ } else {
+ highbd_comp_avg_neon(im_block2, im_stride, dst, dst_stride, w, h,
+ conv_params, bits, y_offset_correction, bd);
+ }
+ }
+ } else {
+ highbd_convolve_correct_offset_neon(im_block2, im_stride, dst, dst_stride,
+ w, h, bits, y_offset_correction, bd);
+ }
+ aom_free(im_block);
+ aom_free(im_block2);
+}
+
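+// Horizontal 8-tap filter for the distance-weighted path: the wtd kernels
+// take a weight of 1 << (FILTER_BITS - round_1) alongside the round offset,
+// and no clamping is done at this stage.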
+static INLINE void highbd_convolve_dist_wtd_x_8tap_neon(
+ const uint16_t *src_ptr, int src_stride, uint16_t *dst_ptr, int dst_stride,
+ int w, int h, const int16_t *x_filter_ptr, ConvolveParams *conv_params,
+ const int offset) {
+ const int16x8_t x_filter = vld1q_s16(x_filter_ptr);
+ const int32x4_t shift_s32 = vdupq_n_s32(-conv_params->round_0);
+ const int weight_bits = FILTER_BITS - conv_params->round_1;
+ const int32x4_t zero_s32 = vdupq_n_s32(0);
+ const int32x4_t weight_s32 = vdupq_n_s32(1 << weight_bits);
+ const int32x4_t offset_s32 = vdupq_n_s32(offset);
+
+ if (w <= 4) {
+ int16x8_t s0, s1, s2, s3;
+ uint16x4_t d0, d1;
+ uint16x8_t d01;
+
+ const int16_t *s = (const int16_t *)src_ptr;
+ uint16_t *d = dst_ptr;
+
+ do {
+ load_s16_8x2(s, src_stride, &s0, &s2);
+ load_s16_8x2(s + 8, src_stride, &s1, &s3);
+
+ d0 = highbd_convolve8_wtd_horiz4_s32_s16(
+ s0, s1, x_filter, shift_s32, zero_s32, weight_s32, offset_s32);
+ d1 = highbd_convolve8_wtd_horiz4_s32_s16(
+ s2, s3, x_filter, shift_s32, zero_s32, weight_s32, offset_s32);
+ d01 = vcombine_u16(d0, d1);
+
+ if (w == 2) {
+ store_u16q_2x1(d + 0 * dst_stride, d01, 0);
+ store_u16q_2x1(d + 1 * dst_stride, d01, 2);
+ } else {
+ vst1_u16(d + 0 * dst_stride, vget_low_u16(d01));
+ vst1_u16(d + 1 * dst_stride, vget_high_u16(d01));
+ }
+
+ s += 2 * src_stride;
+ d += 2 * dst_stride;
+ h -= 2;
+ } while (h > 0);
+ } else {
+ int height = h;
+ int16x8_t s0, s1, s2, s3;
+ uint16x8_t d0, d1;
+
+ do {
+ int width = w;
+ const int16_t *s = (const int16_t *)src_ptr;
+ uint16_t *d = dst_ptr;
+
+ load_s16_8x2(s, src_stride, &s0, &s2);
+ s += 8;
+
+ do {
+ load_s16_8x2(s, src_stride, &s1, &s3);
+
+ d0 = highbd_convolve8_wtd_horiz8_s32_s16(
+ s0, s1, x_filter, shift_s32, zero_s32, weight_s32, offset_s32);
+ d1 = highbd_convolve8_wtd_horiz8_s32_s16(
+ s2, s3, x_filter, shift_s32, zero_s32, weight_s32, offset_s32);
+
+ store_u16_8x2(d, dst_stride, d0, d1);
+
+ s0 = s1;
+ s2 = s3;
+ s += 8;
+ d += 8;
+ width -= 8;
+ } while (width > 0);
+ src_ptr += 2 * src_stride;
+ dst_ptr += 2 * dst_stride;
+ height -= 2;
+ } while (height > 0);
+ }
+}
+
+static INLINE void highbd_convolve_dist_wtd_y_8tap_neon(
+ const uint16_t *src_ptr, int src_stride, uint16_t *dst_ptr, int dst_stride,
+ int w, int h, const int16_t *y_filter_ptr, ConvolveParams *conv_params,
+ const int offset) {
+ const int16x8_t y_filter = vld1q_s16(y_filter_ptr);
+ const int32x4_t shift_s32 = vdupq_n_s32(-conv_params->round_0);
+ const int weight_bits = FILTER_BITS - conv_params->round_1;
+ const int32x4_t zero_s32 = vdupq_n_s32(0);
+ const int32x4_t weight_s32 = vdupq_n_s32(1 << weight_bits);
+ const int32x4_t offset_s32 = vdupq_n_s32(offset);
+
+ if (w <= 4) {
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10;
+ uint16x4_t d0, d1;
+ uint16x8_t d01;
+
+ const int16_t *s = (const int16_t *)src_ptr;
+ uint16_t *d = dst_ptr;
+
+ load_s16_4x7(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6);
+ s += 7 * src_stride;
+
+ do {
+ load_s16_4x4(s, src_stride, &s7, &s8, &s9, &s10);
+
+ d0 = highbd_convolve8_wtd_4_s32_s16(s0, s1, s2, s3, s4, s5, s6, s7,
+ y_filter, shift_s32, zero_s32,
+ weight_s32, offset_s32);
+ d1 = highbd_convolve8_wtd_4_s32_s16(s1, s2, s3, s4, s5, s6, s7, s8,
+ y_filter, shift_s32, zero_s32,
+ weight_s32, offset_s32);
+ d01 = vcombine_u16(d0, d1);
+
+ if (w == 2) {
+ store_u16q_2x1(d + 0 * dst_stride, d01, 0);
+ store_u16q_2x1(d + 1 * dst_stride, d01, 2);
+ } else {
+ vst1_u16(d + 0 * dst_stride, vget_low_u16(d01));
+ vst1_u16(d + 1 * dst_stride, vget_high_u16(d01));
+ }
+
+ s0 = s2;
+ s1 = s3;
+ s2 = s4;
+ s3 = s5;
+ s4 = s6;
+ s5 = s7;
+ s6 = s8;
+ s += 2 * src_stride;
+ d += 2 * dst_stride;
+ h -= 2;
+ } while (h > 0);
+ } else {
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7, s8;
+ uint16x8_t d0, d1;
+
+ do {
+ int height = h;
+ const int16_t *s = (const int16_t *)src_ptr;
+ uint16_t *d = dst_ptr;
+
+ load_s16_8x7(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6);
+ s += 7 * src_stride;
+
+ do {
+ load_s16_8x2(s, src_stride, &s7, &s8);
+
+ d0 = highbd_convolve8_wtd_8_s32_s16(s0, s1, s2, s3, s4, s5, s6, s7,
+ y_filter, shift_s32, zero_s32,
+ weight_s32, offset_s32);
+ d1 = highbd_convolve8_wtd_8_s32_s16(s1, s2, s3, s4, s5, s6, s7, s8,
+ y_filter, shift_s32, zero_s32,
+ weight_s32, offset_s32);
+
+ store_u16_8x2(d, dst_stride, d0, d1);
+
+ s0 = s2;
+ s1 = s3;
+ s2 = s4;
+ s3 = s5;
+ s4 = s6;
+ s5 = s7;
+ s6 = s8;
+ s += 2 * src_stride;
+ d += 2 * dst_stride;
+ height -= 2;
+ } while (height > 0);
+ src_ptr += 8;
+ dst_ptr += 8;
+ w -= 8;
+ } while (w > 0);
+ }
+}
+
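+// When averaging, the filtered result goes to the im_block scratch buffer
+// and is then blended into dst; otherwise it is written directly to the
+// convolve buffer (dst16).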
+void av1_highbd_dist_wtd_convolve_x_neon(
+ const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w,
+ int h, const InterpFilterParams *filter_params_x, const int subpel_x_qn,
+ ConvolveParams *conv_params, int bd) {
+ DECLARE_ALIGNED(16, uint16_t,
+ im_block[(MAX_SB_SIZE + MAX_FILTER_TAP) * MAX_SB_SIZE]);
+ CONV_BUF_TYPE *dst16 = conv_params->dst;
+ int dst16_stride = conv_params->dst_stride;
+ const int im_stride = MAX_SB_SIZE;
+ const int horiz_offset = filter_params_x->taps / 2 - 1;
+ const int offset_bits = bd + 2 * FILTER_BITS - conv_params->round_0;
+ const int round_offset = (1 << (offset_bits - conv_params->round_1)) +
+ (1 << (offset_bits - conv_params->round_1 - 1));
+ const int round_bits =
+ 2 * FILTER_BITS - conv_params->round_0 - conv_params->round_1;
+ assert(round_bits >= 0);
+
+ const int16_t *x_filter_ptr = av1_get_interp_filter_subpel_kernel(
+ filter_params_x, subpel_x_qn & SUBPEL_MASK);
+
+ src -= horiz_offset;
+
+ // horizontal filter
+ if (conv_params->do_average) {
+ highbd_convolve_dist_wtd_x_8tap_neon(src, src_stride, im_block, im_stride,
+ w, h, x_filter_ptr, conv_params,
+ round_offset);
+ } else {
+ highbd_convolve_dist_wtd_x_8tap_neon(src, src_stride, dst16, dst16_stride,
+ w, h, x_filter_ptr, conv_params,
+ round_offset);
+ }
+
+ if (conv_params->do_average) {
+ if (conv_params->use_dist_wtd_comp_avg) {
+ highbd_dist_wtd_comp_avg_neon(im_block, im_stride, dst, dst_stride, w, h,
+ conv_params, round_bits, round_offset, bd);
+ } else {
+ highbd_comp_avg_neon(im_block, im_stride, dst, dst_stride, w, h,
+ conv_params, round_bits, round_offset, bd);
+ }
+ }
+}
+
+void av1_highbd_dist_wtd_convolve_y_neon(
+ const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w,
+ int h, const InterpFilterParams *filter_params_y, const int subpel_y_qn,
+ ConvolveParams *conv_params, int bd) {
+ DECLARE_ALIGNED(16, uint16_t,
+ im_block[(MAX_SB_SIZE + MAX_FILTER_TAP) * MAX_SB_SIZE]);
+ CONV_BUF_TYPE *dst16 = conv_params->dst;
+ int dst16_stride = conv_params->dst_stride;
+ const int im_stride = MAX_SB_SIZE;
+ const int vert_offset = filter_params_y->taps / 2 - 1;
+ const int offset_bits = bd + 2 * FILTER_BITS - conv_params->round_0;
+ const int round_offset = (1 << (offset_bits - conv_params->round_1)) +
+ (1 << (offset_bits - conv_params->round_1 - 1));
+ const int round_bits =
+ 2 * FILTER_BITS - conv_params->round_0 - conv_params->round_1;
+ assert(round_bits >= 0);
+
+ const int16_t *y_filter_ptr = av1_get_interp_filter_subpel_kernel(
+ filter_params_y, subpel_y_qn & SUBPEL_MASK);
+
+ src -= vert_offset * src_stride;
+
+ // vertical filter
+ if (conv_params->do_average) {
+ highbd_convolve_dist_wtd_y_8tap_neon(src, src_stride, im_block, im_stride,
+ w, h, y_filter_ptr, conv_params,
+ round_offset);
+ } else {
+ highbd_convolve_dist_wtd_y_8tap_neon(src, src_stride, dst16, dst16_stride,
+ w, h, y_filter_ptr, conv_params,
+ round_offset);
+ }
+
+ if (conv_params->do_average) {
+ if (conv_params->use_dist_wtd_comp_avg) {
+ highbd_dist_wtd_comp_avg_neon(im_block, im_stride, dst, dst_stride, w, h,
+ conv_params, round_bits, round_offset, bd);
+ } else {
+ highbd_comp_avg_neon(im_block, im_stride, dst, dst_stride, w, h,
+ conv_params, round_bits, round_offset, bd);
+ }
+ }
+}
+
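+// 2D copy for the compound path: shifts samples left by round_bits and adds
+// the round offset, matching the precision a full convolution would produce.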
+static INLINE void highbd_2d_copy_neon(const uint16_t *src_ptr, int src_stride,
+ uint16_t *dst_ptr, int dst_stride, int w,
+ int h, const int round_bits,
+ const int offset) {
+ if (w <= 4) {
+ const int16x4_t round_shift_s16 = vdup_n_s16(round_bits);
+ const uint16x4_t offset_u16 = vdup_n_u16(offset);
+
+ for (int y = 0; y < h; ++y) {
+ const uint16x4_t s = vld1_u16(src_ptr + y * src_stride);
+ uint16x4_t d = vshl_u16(s, round_shift_s16);
+ d = vadd_u16(d, offset_u16);
+ if (w == 2) {
+ store_u16_2x1(dst_ptr + y * dst_stride, d, 0);
+ } else {
+ vst1_u16(dst_ptr + y * dst_stride, d);
+ }
+ }
+ } else {
+ const int16x8_t round_shift_s16 = vdupq_n_s16(round_bits);
+ const uint16x8_t offset_u16 = vdupq_n_u16(offset);
+
+ for (int y = 0; y < h; ++y) {
+ for (int x = 0; x < w; x += 8) {
+ const uint16x8_t s = vld1q_u16(src_ptr + y * src_stride + x);
+ uint16x8_t d = vshlq_u16(s, round_shift_s16);
+ d = vaddq_u16(d, offset_u16);
+ vst1q_u16(dst_ptr + y * dst_stride + x, d);
+ }
+ }
+ }
+}
+
+void av1_highbd_dist_wtd_convolve_2d_copy_neon(const uint16_t *src,
+ int src_stride, uint16_t *dst,
+ int dst_stride, int w, int h,
+ ConvolveParams *conv_params,
+ int bd) {
+ DECLARE_ALIGNED(16, uint16_t,
+ im_block[(MAX_SB_SIZE + MAX_FILTER_TAP) * MAX_SB_SIZE]);
+
+ const int im_stride = MAX_SB_SIZE;
+ CONV_BUF_TYPE *dst16 = conv_params->dst;
+ int dst16_stride = conv_params->dst_stride;
+ const int offset_bits = bd + 2 * FILTER_BITS - conv_params->round_0;
+ const int round_offset = (1 << (offset_bits - conv_params->round_1)) +
+ (1 << (offset_bits - conv_params->round_1 - 1));
+ const int round_bits =
+ 2 * FILTER_BITS - conv_params->round_0 - conv_params->round_1;
+ assert(round_bits >= 0);
+
+ if (conv_params->do_average) {
+ highbd_2d_copy_neon(src, src_stride, im_block, im_stride, w, h, round_bits,
+ round_offset);
+ } else {
+ highbd_2d_copy_neon(src, src_stride, dst16, dst16_stride, w, h, round_bits,
+ round_offset);
+ }
+
+ if (conv_params->do_average) {
+ if (conv_params->use_dist_wtd_comp_avg) {
+ highbd_dist_wtd_comp_avg_neon(im_block, im_stride, dst, dst_stride, w, h,
+ conv_params, round_bits, round_offset, bd);
+ } else {
+ highbd_comp_avg_neon(im_block, im_stride, dst, dst_stride, w, h,
+ conv_params, round_bits, round_offset, bd);
+ }
+ }
+}
+
+static INLINE void highbd_convolve_y_8tap_neon(
+ const uint16_t *src_ptr, int src_stride, uint16_t *dst_ptr, int dst_stride,
+ int w, int h, const int16_t *y_filter_ptr, ConvolveParams *conv_params,
+ int offset) {
+ const int16x8_t y_filter = vld1q_s16(y_filter_ptr);
+ const int32x4_t offset_s32 = vdupq_n_s32(offset);
+ const int32x4_t shift_s32 = vdupq_n_s32(-conv_params->round_1);
+
+ if (w <= 4) {
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10;
+ uint16x4_t d0, d1, d2, d3;
+ uint16x8_t d01, d23;
+
+ const int16_t *s = (const int16_t *)src_ptr;
+ uint16_t *d = dst_ptr;
+
+ load_s16_4x7(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6);
+ s += 7 * src_stride;
+
+ do {
+ load_s16_4x4(s, src_stride, &s7, &s8, &s9, &s10);
+
+ d0 = highbd_convolve8_sr_4_s32_s16(s0, s1, s2, s3, s4, s5, s6, s7,
+ y_filter, shift_s32, offset_s32);
+ d1 = highbd_convolve8_sr_4_s32_s16(s1, s2, s3, s4, s5, s6, s7, s8,
+ y_filter, shift_s32, offset_s32);
+ d2 = highbd_convolve8_sr_4_s32_s16(s2, s3, s4, s5, s6, s7, s8, s9,
+ y_filter, shift_s32, offset_s32);
+ d3 = highbd_convolve8_sr_4_s32_s16(s3, s4, s5, s6, s7, s8, s9, s10,
+ y_filter, shift_s32, offset_s32);
+
+ d01 = vcombine_u16(d0, d1);
+ d23 = vcombine_u16(d2, d3);
+
+ if (w == 2) {
+ store_u16q_2x1(d + 0 * dst_stride, d01, 0);
+ store_u16q_2x1(d + 1 * dst_stride, d01, 2);
+ if (h != 2) {
+ store_u16q_2x1(d + 2 * dst_stride, d23, 0);
+ store_u16q_2x1(d + 3 * dst_stride, d23, 2);
+ }
+ } else {
+ vst1_u16(d + 0 * dst_stride, vget_low_u16(d01));
+ vst1_u16(d + 1 * dst_stride, vget_high_u16(d01));
+ if (h != 2) {
+ vst1_u16(d + 2 * dst_stride, vget_low_u16(d23));
+ vst1_u16(d + 3 * dst_stride, vget_high_u16(d23));
+ }
+ }
+
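+      // Slide the source window down by four rows for the next iteration.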
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ h -= 4;
+ } while (h > 0);
+ } else {
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10;
+ uint16x8_t d0, d1, d2, d3;
+ do {
+ int height = h;
+ const int16_t *s = (const int16_t *)src_ptr;
+ uint16_t *d = dst_ptr;
+
+ load_s16_8x7(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6);
+ s += 7 * src_stride;
+
+ do {
+ load_s16_8x4(s, src_stride, &s7, &s8, &s9, &s10);
+
+ d0 = highbd_convolve8_8_s32_s16(s0, s1, s2, s3, s4, s5, s6, s7,
+ y_filter, offset_s32);
+ d1 = highbd_convolve8_8_s32_s16(s1, s2, s3, s4, s5, s6, s7, s8,
+ y_filter, offset_s32);
+ d2 = highbd_convolve8_8_s32_s16(s2, s3, s4, s5, s6, s7, s8, s9,
+ y_filter, offset_s32);
+ d3 = highbd_convolve8_8_s32_s16(s3, s4, s5, s6, s7, s8, s9, s10,
+ y_filter, offset_s32);
+
+ if (h == 2) {
+ store_u16_8x2(d, dst_stride, d0, d1);
+ } else {
+ store_u16_8x4(d, dst_stride, d0, d1, d2, d3);
+ }
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ height -= 4;
+ } while (height > 0);
+ src_ptr += 8;
+ dst_ptr += 8;
+ w -= 8;
+ } while (w > 0);
+ }
+}
+
+void av1_highbd_dist_wtd_convolve_2d_neon(
+ const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w,
+ int h, const InterpFilterParams *filter_params_x,
+ const InterpFilterParams *filter_params_y, const int subpel_x_qn,
+ const int subpel_y_qn, ConvolveParams *conv_params, int bd) {
+ DECLARE_ALIGNED(16, uint16_t,
+ im_block[(MAX_SB_SIZE + MAX_FILTER_TAP) * MAX_SB_SIZE]);
+ DECLARE_ALIGNED(16, uint16_t,
+ im_block2[(MAX_SB_SIZE + MAX_FILTER_TAP) * MAX_SB_SIZE]);
+
+ CONV_BUF_TYPE *dst16 = conv_params->dst;
+ int dst16_stride = conv_params->dst_stride;
+
+ const int im_h = h + filter_params_y->taps - 1;
+ const int im_stride = MAX_SB_SIZE;
+ const int vert_offset = filter_params_y->taps / 2 - 1;
+ const int horiz_offset = filter_params_x->taps / 2 - 1;
+ const int round_bits =
+ 2 * FILTER_BITS - conv_params->round_0 - conv_params->round_1;
+ const int x_offset_initial = (1 << (bd + FILTER_BITS - 1));
+ const int y_offset_bits = bd + 2 * FILTER_BITS - conv_params->round_0;
+ const int y_offset_initial = (1 << y_offset_bits);
+ const int y_offset_correction =
+ ((1 << (y_offset_bits - conv_params->round_1)) +
+ (1 << (y_offset_bits - conv_params->round_1 - 1)));
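+  // x_offset_initial keeps the horizontal intermediate values non-negative.
+  // y_offset_initial and y_offset_correction implement the compound rounding
+  // offset, which is subtracted again once the two predictions are averaged.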
+
+ const uint16_t *src_ptr = src - vert_offset * src_stride - horiz_offset;
+
+ const int16_t *x_filter_ptr = av1_get_interp_filter_subpel_kernel(
+ filter_params_x, subpel_x_qn & SUBPEL_MASK);
+ const int16_t *y_filter_ptr = av1_get_interp_filter_subpel_kernel(
+ filter_params_y, subpel_y_qn & SUBPEL_MASK);
+
+ // horizontal filter
+ highbd_convolve_x_8tap_neon(src_ptr, src_stride, im_block, im_stride, w, im_h,
+ x_filter_ptr, conv_params, x_offset_initial);
+ // vertical filter
+ if (conv_params->do_average) {
+ highbd_convolve_y_8tap_neon(im_block, im_stride, im_block2, im_stride, w, h,
+ y_filter_ptr, conv_params, y_offset_initial);
+ } else {
+ highbd_convolve_y_8tap_neon(im_block, im_stride, dst16, dst16_stride, w, h,
+ y_filter_ptr, conv_params, y_offset_initial);
+ }
+
+  // Do the compound averaging outside the loop; this avoids branching within
+  // the main loop.
+ if (conv_params->do_average) {
+ if (conv_params->use_dist_wtd_comp_avg) {
+ highbd_dist_wtd_comp_avg_neon(im_block2, im_stride, dst, dst_stride, w, h,
+ conv_params, round_bits,
+ y_offset_correction, bd);
+ } else {
+ highbd_comp_avg_neon(im_block2, im_stride, dst, dst_stride, w, h,
+ conv_params, round_bits, y_offset_correction, bd);
+ }
+ }
+}
+
+#define UPSCALE_NORMATIVE_TAPS 8
+
+void av1_highbd_convolve_horiz_rs_neon(const uint16_t *src, int src_stride,
+ uint16_t *dst, int dst_stride, int w,
+ int h, const int16_t *x_filters,
+ int x0_qn, int x_step_qn, int bd) {
+ const int horiz_offset = UPSCALE_NORMATIVE_TAPS / 2 - 1;
+
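+  // Lane offsets 0..3, used to compute four source and filter indices per
+  // iteration.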
+ const int32x4_t idx = { 0, 1, 2, 3 };
+ const int32x4_t subpel_mask = vdupq_n_s32(RS_SCALE_SUBPEL_MASK);
+ const int32x4_t shift_s32 = vdupq_n_s32(-FILTER_BITS);
+ const int32x4_t offset_s32 = vdupq_n_s32(0);
+ const uint16x4_t max = vdup_n_u16((1 << bd) - 1);
+
+ const uint16_t *src_ptr = src - horiz_offset;
+ uint16_t *dst_ptr = dst;
+
+ if (w <= 4) {
+ int height = h;
+ int16x8_t s0, s1, s2, s3;
+ uint16x4_t d0;
+
+ uint16_t *d = dst_ptr;
+ do {
+ int x_qn = x0_qn;
+
+      // Load 4 src vectors at a time; they might be the same, but we have to
+      // calculate the indices anyway. Doing it in SIMD and then storing the
+      // indices is faster than calculating the expression
+      // &src_ptr[((x_qn + 0*x_step_qn) >> RS_SCALE_SUBPEL_BITS)] 4 times.
+      // Ideally this would be a gather using the indices, but NEON does not
+      // have one, so we have to emulate it.
+ const int32x4_t xqn_idx = vmlaq_n_s32(vdupq_n_s32(x_qn), idx, x_step_qn);
+      // We have to multiply by 2 to get the actual byte offsets, as
+      // sizeof(uint16_t) == 2.
+ const int32x4_t src_idx =
+ vshlq_n_s32(vshrq_n_s32(xqn_idx, RS_SCALE_SUBPEL_BITS), 1);
+      // Similarly for the filter vector indices, we calculate the filter
+      // indices for 4 columns. First we calculate the indices:
+      // (x_qn & RS_SCALE_SUBPEL_MASK) >> RS_SCALE_EXTRA_BITS
+      // Then we calculate the actual pointers, multiplying by
+      // UPSCALE_NORMATIVE_TAPS, and again shift left by 1 to convert to byte
+      // offsets.
+ const int32x4_t x_filter4_idx = vshlq_n_s32(
+ vshrq_n_s32(vandq_s32(xqn_idx, subpel_mask), RS_SCALE_EXTRA_BITS), 1);
+      // Even though pointers are unsigned 32/64-bit ints, we do signed
+      // addition. The reason for this is that x_qn can be negative, leading
+      // to negative offsets. Argon test
+      // profile0_core/streams/test10573_11003.obu was failing because of
+      // this.
+#if AOM_ARCH_AARCH64
+ uint64x2_t tmp4[2];
+ tmp4[0] = vreinterpretq_u64_s64(vaddw_s32(
+ vdupq_n_s64((const int64_t)src_ptr), vget_low_s32(src_idx)));
+ tmp4[1] = vreinterpretq_u64_s64(vaddw_s32(
+ vdupq_n_s64((const int64_t)src_ptr), vget_high_s32(src_idx)));
+ int16_t *src4_ptr[4];
+ uint64_t *tmp_ptr = (uint64_t *)&src4_ptr;
+ vst1q_u64(tmp_ptr, tmp4[0]);
+ vst1q_u64(tmp_ptr + 2, tmp4[1]);
+
+ // filter vectors
+ tmp4[0] = vreinterpretq_u64_s64(vmlal_s32(
+ vdupq_n_s64((const int64_t)x_filters), vget_low_s32(x_filter4_idx),
+ vdup_n_s32(UPSCALE_NORMATIVE_TAPS)));
+ tmp4[1] = vreinterpretq_u64_s64(vmlal_s32(
+ vdupq_n_s64((const int64_t)x_filters), vget_high_s32(x_filter4_idx),
+ vdup_n_s32(UPSCALE_NORMATIVE_TAPS)));
+
+ const int16_t *x_filter4_ptr[4];
+ tmp_ptr = (uint64_t *)&x_filter4_ptr;
+ vst1q_u64(tmp_ptr, tmp4[0]);
+ vst1q_u64(tmp_ptr + 2, tmp4[1]);
+#else
+ uint32x4_t tmp4;
+ tmp4 = vreinterpretq_u32_s32(
+ vaddq_s32(vdupq_n_s32((const int32_t)src_ptr), src_idx));
+ int16_t *src4_ptr[4];
+ uint32_t *tmp_ptr = (uint32_t *)&src4_ptr;
+ vst1q_u32(tmp_ptr, tmp4);
+ // filter vectors
+ tmp4 = vreinterpretq_u32_s32(
+ vmlaq_s32(vdupq_n_s32((const int32_t)x_filters), x_filter4_idx,
+ vdupq_n_s32(UPSCALE_NORMATIVE_TAPS)));
+
+ const int16_t *x_filter4_ptr[4];
+ tmp_ptr = (uint32_t *)&x_filter4_ptr;
+ vst1q_u32(tmp_ptr, tmp4);
+#endif // AOM_ARCH_AARCH64
+ // Load source
+ s0 = vld1q_s16(src4_ptr[0]);
+ s1 = vld1q_s16(src4_ptr[1]);
+ s2 = vld1q_s16(src4_ptr[2]);
+ s3 = vld1q_s16(src4_ptr[3]);
+
+ // Actually load the filters
+ const int16x8_t x_filter0 = vld1q_s16(x_filter4_ptr[0]);
+ const int16x8_t x_filter1 = vld1q_s16(x_filter4_ptr[1]);
+ const int16x8_t x_filter2 = vld1q_s16(x_filter4_ptr[2]);
+ const int16x8_t x_filter3 = vld1q_s16(x_filter4_ptr[3]);
+
+ // Group low and high parts and transpose
+ int16x4_t filters_lo[] = { vget_low_s16(x_filter0),
+ vget_low_s16(x_filter1),
+ vget_low_s16(x_filter2),
+ vget_low_s16(x_filter3) };
+ int16x4_t filters_hi[] = { vget_high_s16(x_filter0),
+ vget_high_s16(x_filter1),
+ vget_high_s16(x_filter2),
+ vget_high_s16(x_filter3) };
+ transpose_u16_4x4((uint16x4_t *)filters_lo);
+ transpose_u16_4x4((uint16x4_t *)filters_hi);
+
+      // Run the 2D Scale X convolution
+ d0 = highbd_convolve8_2d_scale_horiz4x8_s32_s16(
+ s0, s1, s2, s3, filters_lo, filters_hi, shift_s32, offset_s32);
+
+ d0 = vmin_u16(d0, max);
+
+ if (w == 2) {
+ store_u16_2x1(d + 0 * dst_stride, d0, 0);
+ } else {
+ vst1_u16(d + 0 * dst_stride, d0);
+ }
+
+ src_ptr += src_stride;
+ d += dst_stride;
+ height--;
+ } while (height > 0);
+ } else {
+ int height = h;
+ int16x8_t s0, s1, s2, s3;
+ uint16x4_t d0;
+
+ do {
+ int width = w;
+ int x_qn = x0_qn;
+ uint16_t *d = dst_ptr;
+ const uint16_t *s = src_ptr;
+
+ do {
+        // Load 4 src vectors at a time; they might be the same, but we have
+        // to calculate the indices anyway. Doing it in SIMD and then storing
+        // the indices is faster than calculating the expression
+        // &src_ptr[((x_qn + 0*x_step_qn) >> RS_SCALE_SUBPEL_BITS)] 4 times.
+        // Ideally this would be a gather using the indices, but NEON does
+        // not have one, so we have to emulate it.
+ const int32x4_t xqn_idx =
+ vmlaq_n_s32(vdupq_n_s32(x_qn), idx, x_step_qn);
+        // We have to multiply by 2 to get the actual byte offsets, as
+        // sizeof(uint16_t) == 2.
+ const int32x4_t src_idx =
+ vshlq_n_s32(vshrq_n_s32(xqn_idx, RS_SCALE_SUBPEL_BITS), 1);
+
+        // Similarly for the filter vector indices, we calculate the filter
+        // indices for 4 columns. First we calculate the indices:
+        // (x_qn & RS_SCALE_SUBPEL_MASK) >> RS_SCALE_EXTRA_BITS
+        // Then we calculate the actual pointers, multiplying by
+        // UPSCALE_NORMATIVE_TAPS, and again shift left by 1 to convert to
+        // byte offsets.
+ const int32x4_t x_filter4_idx = vshlq_n_s32(
+ vshrq_n_s32(vandq_s32(xqn_idx, subpel_mask), RS_SCALE_EXTRA_BITS),
+ 1);
+        // Even though pointers are unsigned 32/64-bit ints, we do signed
+        // addition. The reason for this is that x_qn can be negative,
+        // leading to negative offsets. Argon test
+        // profile0_core/streams/test10573_11003.obu was failing because of
+        // this.
+#if AOM_ARCH_AARCH64
+ uint64x2_t tmp4[2];
+ tmp4[0] = vreinterpretq_u64_s64(
+ vaddw_s32(vdupq_n_s64((const int64_t)s), vget_low_s32(src_idx)));
+ tmp4[1] = vreinterpretq_u64_s64(
+ vaddw_s32(vdupq_n_s64((const int64_t)s), vget_high_s32(src_idx)));
+ int16_t *src4_ptr[4];
+ uint64_t *tmp_ptr = (uint64_t *)&src4_ptr;
+ vst1q_u64(tmp_ptr, tmp4[0]);
+ vst1q_u64(tmp_ptr + 2, tmp4[1]);
+
+ // filter vectors
+ tmp4[0] = vreinterpretq_u64_s64(vmlal_s32(
+ vdupq_n_s64((const int64_t)x_filters), vget_low_s32(x_filter4_idx),
+ vdup_n_s32(UPSCALE_NORMATIVE_TAPS)));
+ tmp4[1] = vreinterpretq_u64_s64(vmlal_s32(
+ vdupq_n_s64((const int64_t)x_filters), vget_high_s32(x_filter4_idx),
+ vdup_n_s32(UPSCALE_NORMATIVE_TAPS)));
+
+ const int16_t *x_filter4_ptr[4];
+ tmp_ptr = (uint64_t *)&x_filter4_ptr;
+ vst1q_u64(tmp_ptr, tmp4[0]);
+ vst1q_u64(tmp_ptr + 2, tmp4[1]);
+#else
+ uint32x4_t tmp4;
+ tmp4 = vreinterpretq_u32_s32(
+ vaddq_s32(vdupq_n_s32((const int32_t)s), src_idx));
+ int16_t *src4_ptr[4];
+ uint32_t *tmp_ptr = (uint32_t *)&src4_ptr;
+ vst1q_u32(tmp_ptr, tmp4);
+ // filter vectors
+ tmp4 = vreinterpretq_u32_s32(
+ vmlaq_s32(vdupq_n_s32((const int32_t)x_filters), x_filter4_idx,
+ vdupq_n_s32(UPSCALE_NORMATIVE_TAPS)));
+
+ const int16_t *x_filter4_ptr[4];
+ tmp_ptr = (uint32_t *)&x_filter4_ptr;
+ vst1q_u32(tmp_ptr, tmp4);
+#endif // AOM_ARCH_AARCH64
+
+ // Load source
+ s0 = vld1q_s16(src4_ptr[0]);
+ s1 = vld1q_s16(src4_ptr[1]);
+ s2 = vld1q_s16(src4_ptr[2]);
+ s3 = vld1q_s16(src4_ptr[3]);
+
+ // Actually load the filters
+ const int16x8_t x_filter0 = vld1q_s16(x_filter4_ptr[0]);
+ const int16x8_t x_filter1 = vld1q_s16(x_filter4_ptr[1]);
+ const int16x8_t x_filter2 = vld1q_s16(x_filter4_ptr[2]);
+ const int16x8_t x_filter3 = vld1q_s16(x_filter4_ptr[3]);
+
+ // Group low and high parts and transpose
+ int16x4_t filters_lo[] = { vget_low_s16(x_filter0),
+ vget_low_s16(x_filter1),
+ vget_low_s16(x_filter2),
+ vget_low_s16(x_filter3) };
+ int16x4_t filters_hi[] = { vget_high_s16(x_filter0),
+ vget_high_s16(x_filter1),
+ vget_high_s16(x_filter2),
+ vget_high_s16(x_filter3) };
+ transpose_u16_4x4((uint16x4_t *)filters_lo);
+ transpose_u16_4x4((uint16x4_t *)filters_hi);
+
+ // Run the 2D Scale X convolution
+ d0 = highbd_convolve8_2d_scale_horiz4x8_s32_s16(
+ s0, s1, s2, s3, filters_lo, filters_hi, shift_s32, offset_s32);
+
+ d0 = vmin_u16(d0, max);
+ vst1_u16(d, d0);
+
+ x_qn += 4 * x_step_qn;
+ d += 4;
+ width -= 4;
+ } while (width > 0);
+
+ src_ptr += src_stride;
+ dst_ptr += dst_stride;
+ height--;
+ } while (height > 0);
+ }
+}
diff --git a/av1/common/arm/highbd_convolve_neon.h b/av1/common/arm/highbd_convolve_neon.h
new file mode 100644
index 0000000..f9d028f
--- /dev/null
+++ b/av1/common/arm/highbd_convolve_neon.h
@@ -0,0 +1,550 @@
+/*
+ * Copyright (c) 2023, Alliance for Open Media. All rights reserved
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#ifndef AOM_AV1_COMMON_ARM_HIGHBD_CONVOLVE_NEON_H_
+#define AOM_AV1_COMMON_ARM_HIGHBD_CONVOLVE_NEON_H_
+
+#include <arm_neon.h>
+
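+// The 6-tap filters are stored padded to 8 taps with zero end taps, so only
+// lanes 1-6 of the filter vector are accumulated below.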
+static INLINE int32x4_t highbd_convolve6_4_s32(
+ const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
+ const int16x4_t s3, const int16x4_t s4, const int16x4_t s5,
+ const int16x8_t y_filter, const int32x4_t offset) {
+ const int16x4_t y_filter_lo = vget_low_s16(y_filter);
+ const int16x4_t y_filter_hi = vget_high_s16(y_filter);
+
+ int32x4_t sum = vmlal_lane_s16(offset, s0, y_filter_lo, 1);
+ sum = vmlal_lane_s16(sum, s1, y_filter_lo, 2);
+ sum = vmlal_lane_s16(sum, s2, y_filter_lo, 3);
+ sum = vmlal_lane_s16(sum, s3, y_filter_hi, 0);
+ sum = vmlal_lane_s16(sum, s4, y_filter_hi, 1);
+ sum = vmlal_lane_s16(sum, s5, y_filter_hi, 2);
+
+ return sum;
+}
+
+static INLINE uint16x4_t highbd_convolve6_4_s32_s16(
+ const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
+ const int16x4_t s3, const int16x4_t s4, const int16x4_t s5,
+ const int16x8_t y_filter, const int32x4_t offset) {
+ int32x4_t sum =
+ highbd_convolve6_4_s32(s0, s1, s2, s3, s4, s5, y_filter, offset);
+
+ return vqrshrun_n_s32(sum, COMPOUND_ROUND1_BITS);
+}
+
+static INLINE void highbd_convolve6_8_s32(
+ const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
+ const int16x8_t s3, const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t y_filter, const int32x4_t offset, int32x4_t *sum0,
+ int32x4_t *sum1) {
+ const int16x4_t y_filter_lo = vget_low_s16(y_filter);
+ const int16x4_t y_filter_hi = vget_high_s16(y_filter);
+
+ *sum0 = vmlal_lane_s16(offset, vget_low_s16(s0), y_filter_lo, 1);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s1), y_filter_lo, 2);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s2), y_filter_lo, 3);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s3), y_filter_hi, 0);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s4), y_filter_hi, 1);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s5), y_filter_hi, 2);
+
+ *sum1 = vmlal_lane_s16(offset, vget_high_s16(s0), y_filter_lo, 1);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s1), y_filter_lo, 2);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s2), y_filter_lo, 3);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s3), y_filter_hi, 0);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s4), y_filter_hi, 1);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s5), y_filter_hi, 2);
+}
+
+static INLINE uint16x8_t highbd_convolve6_8_s32_s16(
+ const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
+ const int16x8_t s3, const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t y_filter, const int32x4_t offset) {
+ int32x4_t sum0;
+ int32x4_t sum1;
+ highbd_convolve6_8_s32(s0, s1, s2, s3, s4, s5, y_filter, offset, &sum0,
+ &sum1);
+
+ return vcombine_u16(vqrshrun_n_s32(sum0, COMPOUND_ROUND1_BITS),
+ vqrshrun_n_s32(sum1, COMPOUND_ROUND1_BITS));
+}
+
+static INLINE int32x4_t highbd_convolve8_4_s32(
+ const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
+ const int16x4_t s3, const int16x4_t s4, const int16x4_t s5,
+ const int16x4_t s6, const int16x4_t s7, const int16x8_t y_filter,
+ const int32x4_t offset) {
+ const int16x4_t y_filter_lo = vget_low_s16(y_filter);
+ const int16x4_t y_filter_hi = vget_high_s16(y_filter);
+
+ int32x4_t sum = vmlal_lane_s16(offset, s0, y_filter_lo, 0);
+ sum = vmlal_lane_s16(sum, s1, y_filter_lo, 1);
+ sum = vmlal_lane_s16(sum, s2, y_filter_lo, 2);
+ sum = vmlal_lane_s16(sum, s3, y_filter_lo, 3);
+ sum = vmlal_lane_s16(sum, s4, y_filter_hi, 0);
+ sum = vmlal_lane_s16(sum, s5, y_filter_hi, 1);
+ sum = vmlal_lane_s16(sum, s6, y_filter_hi, 2);
+ sum = vmlal_lane_s16(sum, s7, y_filter_hi, 3);
+
+ return sum;
+}
+
+static INLINE uint16x4_t highbd_convolve8_4_s32_s16(
+ const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
+ const int16x4_t s3, const int16x4_t s4, const int16x4_t s5,
+ const int16x4_t s6, const int16x4_t s7, const int16x8_t y_filter,
+ const int32x4_t offset) {
+ int32x4_t sum =
+ highbd_convolve8_4_s32(s0, s1, s2, s3, s4, s5, s6, s7, y_filter, offset);
+
+ return vqrshrun_n_s32(sum, COMPOUND_ROUND1_BITS);
+}
+
+static INLINE uint16x4_t highbd_convolve8_sr_4_s32_s16(
+ const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
+ const int16x4_t s3, const int16x4_t s4, const int16x4_t s5,
+ const int16x4_t s6, const int16x4_t s7, const int16x8_t y_filter,
+ const int32x4_t shift_s32, const int32x4_t offset) {
+ int32x4_t sum =
+ highbd_convolve8_4_s32(s0, s1, s2, s3, s4, s5, s6, s7, y_filter, offset);
+
+ sum = vqrshlq_s32(sum, shift_s32);
+ return vqmovun_s32(sum);
+}
+
+static INLINE uint16x4_t highbd_convolve8_wtd_4_s32_s16(
+ const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
+ const int16x4_t s3, const int16x4_t s4, const int16x4_t s5,
+ const int16x4_t s6, const int16x4_t s7, const int16x8_t y_filter,
+ const int32x4_t shift_s32, const int32x4_t offset, const int32x4_t weight,
+ const int32x4_t offset2) {
+ int32x4_t sum =
+ highbd_convolve8_4_s32(s0, s1, s2, s3, s4, s5, s6, s7, y_filter, offset);
+
+ sum = vqrshlq_s32(sum, shift_s32);
+ sum = vmlaq_s32(offset2, sum, weight);
+
+ return vqmovun_s32(sum);
+}
+
+// Like above, but also performs round shifting and subtracts the correction
+// term.
+static INLINE uint16x4_t highbd_convolve8_4_sr_s32_s16(
+ const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
+ const int16x4_t s3, const int16x4_t s4, const int16x4_t s5,
+ const int16x4_t s6, const int16x4_t s7, const int16x8_t y_filter,
+ const int32x4_t round_shift, const int32x4_t offset,
+ const int32x4_t correction) {
+ int32x4_t sum =
+ highbd_convolve8_4_s32(s0, s1, s2, s3, s4, s5, s6, s7, y_filter, offset);
+
+ sum = vsubq_s32(vqrshlq_s32(sum, round_shift), correction);
+ return vqmovun_s32(sum);
+}
+
+static INLINE void highbd_convolve8_8_s32(
+ const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
+ const int16x8_t s3, const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t s6, const int16x8_t s7, const int16x8_t y_filter,
+ const int32x4_t offset, int32x4_t *sum0, int32x4_t *sum1) {
+ const int16x4_t y_filter_lo = vget_low_s16(y_filter);
+ const int16x4_t y_filter_hi = vget_high_s16(y_filter);
+
+ *sum0 = vmlal_lane_s16(offset, vget_low_s16(s0), y_filter_lo, 0);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s1), y_filter_lo, 1);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s2), y_filter_lo, 2);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s3), y_filter_lo, 3);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s4), y_filter_hi, 0);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s5), y_filter_hi, 1);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s6), y_filter_hi, 2);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s7), y_filter_hi, 3);
+
+ *sum1 = vmlal_lane_s16(offset, vget_high_s16(s0), y_filter_lo, 0);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s1), y_filter_lo, 1);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s2), y_filter_lo, 2);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s3), y_filter_lo, 3);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s4), y_filter_hi, 0);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s5), y_filter_hi, 1);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s6), y_filter_hi, 2);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s7), y_filter_hi, 3);
+}
+
+static INLINE uint16x8_t highbd_convolve8_8_s32_s16(
+ const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
+ const int16x8_t s3, const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t s6, const int16x8_t s7, const int16x8_t y_filter,
+ const int32x4_t offset) {
+ int32x4_t sum0;
+ int32x4_t sum1;
+ highbd_convolve8_8_s32(s0, s1, s2, s3, s4, s5, s6, s7, y_filter, offset,
+ &sum0, &sum1);
+
+ return vcombine_u16(vqrshrun_n_s32(sum0, COMPOUND_ROUND1_BITS),
+ vqrshrun_n_s32(sum1, COMPOUND_ROUND1_BITS));
+}
+
+static INLINE uint16x8_t highbd_convolve8_wtd_8_s32_s16(
+ const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
+ const int16x8_t s3, const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t s6, const int16x8_t s7, const int16x8_t y_filter,
+ const int32x4_t shift_s32, const int32x4_t offset, const int32x4_t weight,
+ const int32x4_t offset2) {
+ int32x4_t sum0;
+ int32x4_t sum1;
+ highbd_convolve8_8_s32(s0, s1, s2, s3, s4, s5, s6, s7, y_filter, offset,
+ &sum0, &sum1);
+
+ sum0 = vqrshlq_s32(sum0, shift_s32);
+ sum1 = vqrshlq_s32(sum1, shift_s32);
+ sum0 = vmlaq_s32(offset2, sum0, weight);
+ sum1 = vmlaq_s32(offset2, sum1, weight);
+
+ return vcombine_u16(vqmovun_s32(sum0), vqmovun_s32(sum1));
+}
+
+// Like above, but also performs round shifting and subtracts the correction
+// term.
+static INLINE uint16x8_t highbd_convolve8_8_sr_s32_s16(
+ const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
+ const int16x8_t s3, const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t s6, const int16x8_t s7, const int16x8_t y_filter,
+ const int32x4_t round_shift, const int32x4_t offset,
+ const int32x4_t correction) {
+ int32x4_t sum0;
+ int32x4_t sum1;
+ highbd_convolve8_8_s32(s0, s1, s2, s3, s4, s5, s6, s7, y_filter, offset,
+ &sum0, &sum1);
+
+ sum0 = vsubq_s32(vqrshlq_s32(sum0, round_shift), correction);
+ sum1 = vsubq_s32(vqrshlq_s32(sum1, round_shift), correction);
+
+ return vcombine_u16(vqmovun_s32(sum0), vqmovun_s32(sum1));
+}
+
+static INLINE int32x4_t highbd_convolve12_y_4_s32(
+ const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
+ const int16x4_t s3, const int16x4_t s4, const int16x4_t s5,
+ const int16x4_t s6, const int16x4_t s7, const int16x4_t s8,
+ const int16x4_t s9, const int16x4_t s10, const int16x4_t s11,
+ const int16x8_t y_filter_0_7, const int16x4_t y_filter_8_11,
+ const int32x4_t offset) {
+ const int16x4_t y_filter_0_3 = vget_low_s16(y_filter_0_7);
+ const int16x4_t y_filter_4_7 = vget_high_s16(y_filter_0_7);
+
+ int32x4_t sum = vmlal_lane_s16(offset, s0, y_filter_0_3, 0);
+ sum = vmlal_lane_s16(sum, s1, y_filter_0_3, 1);
+ sum = vmlal_lane_s16(sum, s2, y_filter_0_3, 2);
+ sum = vmlal_lane_s16(sum, s3, y_filter_0_3, 3);
+ sum = vmlal_lane_s16(sum, s4, y_filter_4_7, 0);
+ sum = vmlal_lane_s16(sum, s5, y_filter_4_7, 1);
+ sum = vmlal_lane_s16(sum, s6, y_filter_4_7, 2);
+ sum = vmlal_lane_s16(sum, s7, y_filter_4_7, 3);
+ sum = vmlal_lane_s16(sum, s8, y_filter_8_11, 0);
+ sum = vmlal_lane_s16(sum, s9, y_filter_8_11, 1);
+ sum = vmlal_lane_s16(sum, s10, y_filter_8_11, 2);
+ sum = vmlal_lane_s16(sum, s11, y_filter_8_11, 3);
+
+ return sum;
+}
+
+static INLINE uint16x4_t highbd_convolve12_y_4_s32_s16(
+ const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
+ const int16x4_t s3, const int16x4_t s4, const int16x4_t s5,
+ const int16x4_t s6, const int16x4_t s7, const int16x4_t s8,
+ const int16x4_t s9, const int16x4_t s10, const int16x4_t s11,
+ const int16x8_t y_filter_0_7, const int16x4_t y_filter_8_11,
+ const int32x4_t offset) {
+ int32x4_t sum =
+ highbd_convolve12_y_4_s32(s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10,
+ s11, y_filter_0_7, y_filter_8_11, offset);
+
+ return vqrshrun_n_s32(sum, COMPOUND_ROUND1_BITS);
+}
+
+// Like above, but also performs round shifting and subtracts the correction
+// term.
+static INLINE uint16x4_t highbd_convolve12_y_4_sr_s32_s16(
+ const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
+ const int16x4_t s3, const int16x4_t s4, const int16x4_t s5,
+ const int16x4_t s6, const int16x4_t s7, const int16x4_t s8,
+ const int16x4_t s9, const int16x4_t s10, const int16x4_t s11,
+ const int16x8_t y_filter_0_7, const int16x4_t y_filter_8_11,
+ const int32x4_t round_shift, const int32x4_t offset,
+ const int32x4_t correction) {
+ int32x4_t sum =
+ highbd_convolve12_y_4_s32(s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10,
+ s11, y_filter_0_7, y_filter_8_11, offset);
+
+ sum = vsubq_s32(vqrshlq_s32(sum, round_shift), correction);
+ return vqmovun_s32(sum);
+}
+
+static INLINE void highbd_convolve12_y_8_s32(
+ const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
+ const int16x8_t s3, const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t s6, const int16x8_t s7, const int16x8_t s8,
+ const int16x8_t s9, const int16x8_t s10, const int16x8_t s11,
+ const int16x8_t y_filter_0_7, const int16x4_t y_filter_8_11,
+ const int32x4_t offset, int32x4_t *sum0, int32x4_t *sum1) {
+ const int16x4_t y_filter_0_3 = vget_low_s16(y_filter_0_7);
+ const int16x4_t y_filter_4_7 = vget_high_s16(y_filter_0_7);
+
+ *sum0 = vmlal_lane_s16(offset, vget_low_s16(s0), y_filter_0_3, 0);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s1), y_filter_0_3, 1);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s2), y_filter_0_3, 2);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s3), y_filter_0_3, 3);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s4), y_filter_4_7, 0);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s5), y_filter_4_7, 1);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s6), y_filter_4_7, 2);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s7), y_filter_4_7, 3);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s8), y_filter_8_11, 0);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s9), y_filter_8_11, 1);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s10), y_filter_8_11, 2);
+ *sum0 = vmlal_lane_s16(*sum0, vget_low_s16(s11), y_filter_8_11, 3);
+
+ *sum1 = vmlal_lane_s16(offset, vget_high_s16(s0), y_filter_0_3, 0);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s1), y_filter_0_3, 1);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s2), y_filter_0_3, 2);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s3), y_filter_0_3, 3);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s4), y_filter_4_7, 0);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s5), y_filter_4_7, 1);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s6), y_filter_4_7, 2);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s7), y_filter_4_7, 3);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s8), y_filter_8_11, 0);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s9), y_filter_8_11, 1);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s10), y_filter_8_11, 2);
+ *sum1 = vmlal_lane_s16(*sum1, vget_high_s16(s11), y_filter_8_11, 3);
+}
+
+static INLINE uint16x8_t highbd_convolve12_y_8_s32_s16(
+ const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
+ const int16x8_t s3, const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t s6, const int16x8_t s7, const int16x8_t s8,
+ const int16x8_t s9, const int16x8_t s10, const int16x8_t s11,
+ const int16x8_t y_filter_0_7, const int16x4_t y_filter_8_11,
+ const int32x4_t offset) {
+ int32x4_t sum0;
+ int32x4_t sum1;
+ highbd_convolve12_y_8_s32(s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11,
+ y_filter_0_7, y_filter_8_11, offset, &sum0, &sum1);
+
+ return vcombine_u16(vqrshrun_n_s32(sum0, COMPOUND_ROUND1_BITS),
+ vqrshrun_n_s32(sum1, COMPOUND_ROUND1_BITS));
+}
+
+// Like above, but also performs round shifting and subtracts the correction
+// term.
+static INLINE uint16x8_t highbd_convolve12_y_8_sr_s32_s16(
+ const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
+ const int16x8_t s3, const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t s6, const int16x8_t s7, const int16x8_t s8,
+ const int16x8_t s9, const int16x8_t s10, const int16x8_t s11,
+ const int16x8_t y_filter_0_7, const int16x4_t y_filter_8_11,
+ const int32x4_t round_shift, const int32x4_t offset,
+ const int32x4_t correction) {
+ int32x4_t sum0;
+ int32x4_t sum1;
+ highbd_convolve12_y_8_s32(s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11,
+ y_filter_0_7, y_filter_8_11, offset, &sum0, &sum1);
+
+ sum0 = vsubq_s32(vqrshlq_s32(sum0, round_shift), correction);
+ sum1 = vsubq_s32(vqrshlq_s32(sum1, round_shift), correction);
+
+ return vcombine_u16(vqmovun_s32(sum0), vqmovun_s32(sum1));
+}
+
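+// Horizontal convolution of 4 adjacent output pixels: vextq_s16 builds the
+// shifted source windows so that each sN_lo vector holds, for all 4 outputs,
+// the source sample at tap position N.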
+static INLINE int32x4_t highbd_convolve8_horiz4_s32(
+ const int16x8_t s0, const int16x8_t s1, const int16x8_t x_filter_0_7,
+ const int32x4_t offset) {
+ const int16x8_t s2 = vextq_s16(s0, s1, 1);
+ const int16x8_t s3 = vextq_s16(s0, s1, 2);
+ const int16x8_t s4 = vextq_s16(s0, s1, 3);
+ const int16x4_t s0_lo = vget_low_s16(s0);
+ const int16x4_t s1_lo = vget_low_s16(s2);
+ const int16x4_t s2_lo = vget_low_s16(s3);
+ const int16x4_t s3_lo = vget_low_s16(s4);
+ const int16x4_t s4_lo = vget_high_s16(s0);
+ const int16x4_t s5_lo = vget_high_s16(s2);
+ const int16x4_t s6_lo = vget_high_s16(s3);
+ const int16x4_t s7_lo = vget_high_s16(s4);
+
+ return highbd_convolve8_4_s32(s0_lo, s1_lo, s2_lo, s3_lo, s4_lo, s5_lo, s6_lo,
+ s7_lo, x_filter_0_7, offset);
+}
+
+static INLINE uint16x4_t highbd_convolve8_horiz4_s32_s16(
+ const int16x8_t s0, const int16x8_t s1, const int16x8_t x_filter_0_7,
+ const int32x4_t shift_s32, const int32x4_t offset) {
+ int32x4_t sum = highbd_convolve8_horiz4_s32(s0, s1, x_filter_0_7, offset);
+
+ sum = vqrshlq_s32(sum, shift_s32);
+ return vqmovun_s32(sum);
+}
+
+static INLINE uint16x4_t highbd_convolve8_wtd_horiz4_s32_s16(
+ const int16x8_t s0, const int16x8_t s1, const int16x8_t x_filter_0_7,
+ const int32x4_t shift_s32, const int32x4_t offset, const int32x4_t weight,
+ const int32x4_t offset2) {
+ int32x4_t sum = highbd_convolve8_horiz4_s32(s0, s1, x_filter_0_7, offset);
+
+ sum = vqrshlq_s32(sum, shift_s32);
+ sum = vmlaq_s32(offset2, sum, weight);
+ return vqmovun_s32(sum);
+}
+
+static INLINE void highbd_convolve8_horiz8_s32(
+ const int16x8_t s0, const int16x8_t s0_hi, const int16x8_t x_filter_0_7,
+ const int32x4_t offset, int32x4_t *sum0, int32x4_t *sum1) {
+ const int16x8_t s1 = vextq_s16(s0, s0_hi, 1);
+ const int16x8_t s2 = vextq_s16(s0, s0_hi, 2);
+ const int16x8_t s3 = vextq_s16(s0, s0_hi, 3);
+ const int16x8_t s4 = vextq_s16(s0, s0_hi, 4);
+ const int16x8_t s5 = vextq_s16(s0, s0_hi, 5);
+ const int16x8_t s6 = vextq_s16(s0, s0_hi, 6);
+ const int16x8_t s7 = vextq_s16(s0, s0_hi, 7);
+
+ highbd_convolve8_8_s32(s0, s1, s2, s3, s4, s5, s6, s7, x_filter_0_7, offset,
+ sum0, sum1);
+}
+
+static INLINE uint16x8_t highbd_convolve8_horiz8_s32_s16(
+ const int16x8_t s0, const int16x8_t s1, const int16x8_t x_filter_0_7,
+ const int32x4_t shift_s32, const int32x4_t offset) {
+ int32x4_t sum0, sum1;
+ highbd_convolve8_horiz8_s32(s0, s1, x_filter_0_7, offset, &sum0, &sum1);
+
+ sum0 = vqrshlq_s32(sum0, shift_s32);
+ sum1 = vqrshlq_s32(sum1, shift_s32);
+
+ return vcombine_u16(vqmovun_s32(sum0), vqmovun_s32(sum1));
+}
+
+static INLINE uint16x8_t highbd_convolve8_wtd_horiz8_s32_s16(
+ const int16x8_t s0, const int16x8_t s1, const int16x8_t x_filter_0_7,
+ const int32x4_t shift_s32, const int32x4_t offset, const int32x4_t weight,
+ const int32x4_t offset2) {
+ int32x4_t sum0, sum1;
+ highbd_convolve8_horiz8_s32(s0, s1, x_filter_0_7, offset, &sum0, &sum1);
+
+ sum0 = vqrshlq_s32(sum0, shift_s32);
+ sum1 = vqrshlq_s32(sum1, shift_s32);
+ sum0 = vmlaq_s32(offset2, sum0, weight);
+ sum1 = vmlaq_s32(offset2, sum1, weight);
+
+ return vcombine_u16(vqmovun_s32(sum0), vqmovun_s32(sum1));
+}
+
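+// 12-tap analogue of the 8-tap horizontal helpers above: the shifted windows
+// for taps 0-11 are built from two overlapping 8-lane source vectors.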
+static INLINE int32x4_t highbd_convolve12_horiz4_s32(
+ const int16x8_t s0, const int16x8_t s1, const int16x8_t x_filter_0_7,
+ const int16x4_t x_filter_8_11, const int32x4_t offset) {
+ const int16x8_t s2 = vextq_s16(s0, s1, 1);
+ const int16x8_t s3 = vextq_s16(s0, s1, 2);
+ const int16x8_t s4 = vextq_s16(s0, s1, 3);
+ const int16x8_t s5 = vextq_s16(s0, s1, 4);
+ const int16x8_t s6 = vextq_s16(s0, s1, 5);
+ const int16x8_t s7 = vextq_s16(s0, s1, 6);
+ const int16x8_t s8 = vextq_s16(s0, s1, 7);
+ const int16x4_t s0_lo = vget_low_s16(s0);
+ const int16x4_t s1_lo = vget_low_s16(s2);
+ const int16x4_t s2_lo = vget_low_s16(s3);
+ const int16x4_t s3_lo = vget_low_s16(s4);
+ const int16x4_t s4_lo = vget_high_s16(s0);
+ const int16x4_t s5_lo = vget_high_s16(s2);
+ const int16x4_t s6_lo = vget_high_s16(s3);
+ const int16x4_t s7_lo = vget_high_s16(s4);
+ const int16x4_t s8_lo = vget_high_s16(s5);
+ const int16x4_t s9_lo = vget_high_s16(s6);
+ const int16x4_t s10_lo = vget_high_s16(s7);
+ const int16x4_t s11_lo = vget_high_s16(s8);
+
+ return highbd_convolve12_y_4_s32(s0_lo, s1_lo, s2_lo, s3_lo, s4_lo, s5_lo,
+ s6_lo, s7_lo, s8_lo, s9_lo, s10_lo, s11_lo,
+ x_filter_0_7, x_filter_8_11, offset);
+}
+
+static INLINE uint16x4_t highbd_convolve12_horiz4_s32_s16(
+ const int16x8_t s0, const int16x8_t s1, const int16x8_t x_filter_0_7,
+ const int16x4_t x_filter_8_11, const int32x4_t shift_s32,
+ const int32x4_t offset) {
+ int32x4_t sum =
+ highbd_convolve12_horiz4_s32(s0, s1, x_filter_0_7, x_filter_8_11, offset);
+
+ sum = vqrshlq_s32(sum, shift_s32);
+ return vqmovun_s32(sum);
+}
+
+static INLINE void highbd_convolve12_horiz8_s32(
+ const int16x8_t s0_0, const int16x8_t s0_1, const int16x8_t s0_2,
+ const int16x8_t x_filter_0_7, const int16x4_t x_filter_8_11,
+ const int32x4_t offset, int32x4_t *sum0, int32x4_t *sum1) {
+ const int16x8_t s1 = vextq_s16(s0_0, s0_1, 1);
+ const int16x8_t s2 = vextq_s16(s0_0, s0_1, 2);
+ const int16x8_t s3 = vextq_s16(s0_0, s0_1, 3);
+ const int16x8_t s4 = vextq_s16(s0_0, s0_1, 4);
+ const int16x8_t s5 = vextq_s16(s0_0, s0_1, 5);
+ const int16x8_t s6 = vextq_s16(s0_0, s0_1, 6);
+ const int16x8_t s7 = vextq_s16(s0_0, s0_1, 7);
+ const int16x8_t s8 = s0_1;
+ const int16x8_t s9 = vextq_s16(s0_1, s0_2, 1);
+ const int16x8_t s10 = vextq_s16(s0_1, s0_2, 2);
+ const int16x8_t s11 = vextq_s16(s0_1, s0_2, 3);
+
+ highbd_convolve12_y_8_s32(s0_0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11,
+ x_filter_0_7, x_filter_8_11, offset, sum0, sum1);
+}
+
+static INLINE uint16x8_t highbd_convolve12_horiz8_s32_s16(
+ const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
+ const int16x8_t x_filter_0_7, const int16x4_t x_filter_8_11,
+ const int32x4_t shift_s32, const int32x4_t offset) {
+ int32x4_t sum0, sum1;
+ highbd_convolve12_horiz8_s32(s0, s1, s2, x_filter_0_7, x_filter_8_11, offset,
+ &sum0, &sum1);
+
+ sum0 = vqrshlq_s32(sum0, shift_s32);
+ sum1 = vqrshlq_s32(sum1, shift_s32);
+
+ return vcombine_u16(vqmovun_s32(sum0), vqmovun_s32(sum1));
+}
+
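+// Each of the 4 output pixels uses its own filter phase, so both the source
+// windows and the filters are transposed; each vmlal_s16 then accumulates one
+// tap position across all 4 phases at once.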
+static INLINE int32x4_t highbd_convolve8_2d_scale_horiz4x8_s32(
+ const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
+ const int16x8_t s3, const int16x4_t *filters_lo,
+ const int16x4_t *filters_hi, const int32x4_t offset) {
+ int16x4_t s_lo[] = { vget_low_s16(s0), vget_low_s16(s1), vget_low_s16(s2),
+ vget_low_s16(s3) };
+ int16x4_t s_hi[] = { vget_high_s16(s0), vget_high_s16(s1), vget_high_s16(s2),
+ vget_high_s16(s3) };
+
+ transpose_u16_4x4((uint16x4_t *)s_lo);
+ transpose_u16_4x4((uint16x4_t *)s_hi);
+
+ int32x4_t sum = vmlal_s16(offset, s_lo[0], filters_lo[0]);
+ sum = vmlal_s16(sum, s_lo[1], filters_lo[1]);
+ sum = vmlal_s16(sum, s_lo[2], filters_lo[2]);
+ sum = vmlal_s16(sum, s_lo[3], filters_lo[3]);
+ sum = vmlal_s16(sum, s_hi[0], filters_hi[0]);
+ sum = vmlal_s16(sum, s_hi[1], filters_hi[1]);
+ sum = vmlal_s16(sum, s_hi[2], filters_hi[2]);
+ sum = vmlal_s16(sum, s_hi[3], filters_hi[3]);
+
+ return sum;
+}
+
+static INLINE uint16x4_t highbd_convolve8_2d_scale_horiz4x8_s32_s16(
+ const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
+ const int16x8_t s3, const int16x4_t *filters_lo,
+ const int16x4_t *filters_hi, const int32x4_t shift_s32,
+ const int32x4_t offset) {
+ int32x4_t sum = highbd_convolve8_2d_scale_horiz4x8_s32(
+ s0, s1, s2, s3, filters_lo, filters_hi, offset);
+
+ sum = vqrshlq_s32(sum, shift_s32);
+ return vqmovun_s32(sum);
+}
+
+#endif // AOM_AV1_COMMON_ARM_HIGHBD_CONVOLVE_NEON_H_
diff --git a/av1/common/arm/highbd_inv_txfm_neon.c b/av1/common/arm/highbd_inv_txfm_neon.c
index 90306b8..d197fca 100644
--- a/av1/common/arm/highbd_inv_txfm_neon.c
+++ b/av1/common/arm/highbd_inv_txfm_neon.c
@@ -17,7 +17,7 @@
#include "config/aom_config.h"
#include "config/av1_rtcd.h"
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
#define TRANSPOSE_4X4(x0, x1, x2, x3, y0, y1, y2, y3) \
do { \
int32x4x2_t swap_low = vtrnq_s32(x0, x1); \
@@ -49,7 +49,11 @@
y3 = vextq_s32(swap_low.val[1], \
vextq_s32(swap_high.val[1], swap_high.val[1], 2), 2); \
} while (0)
-#endif // (__aarch64__)
+#endif // AOM_ARCH_AARCH64
+
+static INLINE void transpose_4x4(const int32x4_t *in, int32x4_t *out) {
+ TRANSPOSE_4X4(in[0], in[1], in[2], in[3], out[0], out[1], out[2], out[3]);
+}
static INLINE void transpose_8x8(const int32x4_t *in, int32x4_t *out) {
TRANSPOSE_4X4(in[0], in[2], in[4], in[6], out[0], out[2], out[4], out[6]);
@@ -59,16 +63,12 @@
out[15]);
}
-static INLINE void av1_round_shift_array_32_neon(int32x4_t *input,
- int32x4_t *output,
- const int size,
- const int bit) {
+static INLINE void round_shift_array_32_neon(int32x4_t *input,
+ int32x4_t *output, const int size,
+ const int bit) {
const int32x4_t v_bit = vdupq_n_s32(-bit);
- const int32x4_t rnding = vdupq_n_s32(1 << (bit - 1));
- int i;
- for (i = 0; i < size; i++) {
- int32x4_t vradd = vaddq_s32(input[i], rnding);
- output[i] = vshlq_s32(vradd, v_bit);
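+  // vrshlq_s32 with a negative shift amount performs a rounding right shift,
+  // so no separate rounding constant is needed.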
+ for (int i = 0; i < size; i++) {
+ output[i] = vrshlq_s32(input[i], v_bit);
}
}
@@ -173,42 +173,25 @@
}
}
-static void round_shift_8x8(int32x4_t *in, int shift, const int32x4_t *rnding) {
- if (shift != 0) {
- const int32x4_t v_shift = vdupq_n_s32(-shift);
- int32x4_t vradd = vaddq_s32(in[0], *rnding);
- in[0] = vshlq_s32(vradd, v_shift);
- vradd = vaddq_s32(in[1], *rnding);
- in[1] = vshlq_s32(vradd, v_shift);
- vradd = vaddq_s32(in[2], *rnding);
- in[2] = vshlq_s32(vradd, v_shift);
- vradd = vaddq_s32(in[3], *rnding);
- in[3] = vshlq_s32(vradd, v_shift);
- vradd = vaddq_s32(in[4], *rnding);
- in[4] = vshlq_s32(vradd, v_shift);
- vradd = vaddq_s32(in[5], *rnding);
- in[5] = vshlq_s32(vradd, v_shift);
- vradd = vaddq_s32(in[6], *rnding);
- in[6] = vshlq_s32(vradd, v_shift);
- vradd = vaddq_s32(in[7], *rnding);
- in[7] = vshlq_s32(vradd, v_shift);
- vradd = vaddq_s32(in[8], *rnding);
- in[8] = vshlq_s32(vradd, v_shift);
- vradd = vaddq_s32(in[9], *rnding);
- in[9] = vshlq_s32(vradd, v_shift);
- vradd = vaddq_s32(in[10], *rnding);
- in[10] = vshlq_s32(vradd, v_shift);
- vradd = vaddq_s32(in[11], *rnding);
- in[11] = vshlq_s32(vradd, v_shift);
- vradd = vaddq_s32(in[12], *rnding);
- in[12] = vshlq_s32(vradd, v_shift);
- vradd = vaddq_s32(in[13], *rnding);
- in[13] = vshlq_s32(vradd, v_shift);
- vradd = vaddq_s32(in[14], *rnding);
- in[14] = vshlq_s32(vradd, v_shift);
- vradd = vaddq_s32(in[15], *rnding);
- in[15] = vshlq_s32(vradd, v_shift);
- }
+static void round_shift_8x8(int32x4_t *in, int shift) {
+ assert(shift != 0);
+ const int32x4_t v_shift = vdupq_n_s32(-shift);
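+  // vrshlq_s32 rounds to nearest while shifting right by a negative shift
+  // amount, so no rounding constant is required.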
+ in[0] = vrshlq_s32(in[0], v_shift);
+ in[1] = vrshlq_s32(in[1], v_shift);
+ in[2] = vrshlq_s32(in[2], v_shift);
+ in[3] = vrshlq_s32(in[3], v_shift);
+ in[4] = vrshlq_s32(in[4], v_shift);
+ in[5] = vrshlq_s32(in[5], v_shift);
+ in[6] = vrshlq_s32(in[6], v_shift);
+ in[7] = vrshlq_s32(in[7], v_shift);
+ in[8] = vrshlq_s32(in[8], v_shift);
+ in[9] = vrshlq_s32(in[9], v_shift);
+ in[10] = vrshlq_s32(in[10], v_shift);
+ in[11] = vrshlq_s32(in[11], v_shift);
+ in[12] = vrshlq_s32(in[12], v_shift);
+ in[13] = vrshlq_s32(in[13], v_shift);
+ in[14] = vrshlq_s32(in[14], v_shift);
+ in[15] = vrshlq_s32(in[15], v_shift);
}
static void highbd_clamp_s32_neon(int32x4_t *in, int32x4_t *out,
@@ -567,7 +550,10 @@
// Stage 0-1-2
- TRANSPOSE_4X4(in[0], in[1], in[2], in[3], u0, u1, u2, u3);
+ u0 = in[0];
+ u1 = in[1];
+ u2 = in[2];
+ u3 = in[3];
const int32x4_t v_bit = vdupq_n_s32(-bit);
@@ -611,7 +597,10 @@
int32x4_t x0, x1, x2, x3;
int32x4_t u0, u1, u2, u3;
- TRANSPOSE_4X4(in[0], in[1], in[2], in[3], x0, x1, x2, x3);
+ x0 = in[0];
+ x1 = in[1];
+ x2 = in[2];
+ x3 = in[3];
s0 = vmulq_n_s32(x0, sinpi[1]);
s1 = vmulq_n_s32(x0, sinpi[2]);
@@ -655,12 +644,12 @@
vreinterpretq_s16_s32(u0x.val[1]), vreinterpretq_s16_s32(zero), 1));
u0x = vzipq_s32(u0x.val[0], u0x.val[1]);
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
u0 = vreinterpretq_s32_s64(vzip1q_s64(vreinterpretq_s64_s32(u0x.val[0]),
vreinterpretq_s64_s32(u0x.val[1])));
#else
u0 = vcombine_s32(vget_low_s32(u0x.val[0]), vget_low_s32(u0x.val[1]));
-#endif // (__aarch64__)
+#endif // AOM_ARCH_AARCH64
// u1
int32x4x2_t u1x;
u1x.val[0] = vreinterpretq_s32_s64(
@@ -680,12 +669,12 @@
vreinterpretq_s16_s32(u1x.val[1]), vreinterpretq_s16_s32(zero), 1));
u1x = vzipq_s32(u1x.val[0], u1x.val[1]);
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
u1 = vreinterpretq_s32_s64(vzip1q_s64(vreinterpretq_s64_s32(u1x.val[0]),
vreinterpretq_s64_s32(u1x.val[1])));
#else
u1 = vcombine_s32(vget_low_s32(u1x.val[0]), vget_low_s32(u1x.val[1]));
-#endif // (__aarch64__)
+#endif // AOM_ARCH_AARCH64
// u2
int32x4x2_t u2x;
@@ -706,12 +695,12 @@
vreinterpretq_s16_s32(u2x.val[1]), vreinterpretq_s16_s32(zero), 1));
u2x = vzipq_s32(u2x.val[0], u2x.val[1]);
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
u2 = vreinterpretq_s32_s64(vzip1q_s64(vreinterpretq_s64_s32(u2x.val[0]),
vreinterpretq_s64_s32(u2x.val[1])));
#else
u2 = vcombine_s32(vget_low_s32(u2x.val[0]), vget_low_s32(u2x.val[1]));
-#endif // (__aarch64__)
+#endif // AOM_ARCH_AARCH64
// u3
int32x4x2_t u3x;
@@ -732,12 +721,12 @@
vreinterpretq_s16_s32(u3x.val[1]), vreinterpretq_s16_s32(zero), 1));
u3x = vzipq_s32(u3x.val[0], u3x.val[1]);
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
u3 = vreinterpretq_s32_s64(vzip1q_s64(vreinterpretq_s64_s32(u3x.val[0]),
vreinterpretq_s64_s32(u3x.val[1])));
#else
u3 = vcombine_s32(vget_low_s32(u3x.val[0]), vget_low_s32(u3x.val[1]));
-#endif // (__aarch64__)
+#endif // AOM_ARCH_AARCH64
out[0] = u0;
out[1] = u1;
@@ -803,7 +792,6 @@
static void iidentity4_neon(int32x4_t *in, int32x4_t *out, int bit, int do_cols,
int bd, int out_shift) {
(void)bit;
- int32x4_t v[4];
int32x4_t zero = vdupq_n_s32(0);
int32x2_t fact = vdup_n_s32(NewSqrt2);
int32x4x2_t a0;
@@ -821,7 +809,7 @@
vshrq_n_s64(vreinterpretq_s64_s32(a0.val[1]), NewSqrt2Bits));
a0 = vzipq_s32(a0.val[0], a0.val[1]);
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
out[i] = vreinterpretq_s32_s64(vzip1q_s64(
vreinterpretq_s64_s32(a0.val[0]), vreinterpretq_s64_s32(a0.val[1])));
#else
@@ -835,13 +823,6 @@
round_shift_4x4(out, out_shift);
highbd_clamp_s32_neon(out, out, &clamp_lo, &clamp_hi, 4);
}
- v[0] = out[0];
- v[1] = out[1];
- v[2] = out[2];
- v[3] = out[3];
-
- // Transpose for 4x4
- TRANSPOSE_4X4(v[0], v[1], v[2], v[3], out[0], out[1], out[2], out[3]);
}
void av1_inv_txfm2d_add_4x4_neon(const int32_t *input, uint16_t *output,
@@ -854,96 +835,112 @@
case DCT_DCT:
load_buffer_4x4(input, in);
idct4x4_neon(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_4x4(in, in);
idct4x4_neon(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 0, -shift[1], bd);
break;
case ADST_DCT:
load_buffer_4x4(input, in);
idct4x4_neon(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_4x4(in, in);
iadst4x4_neon(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 0, -shift[1], bd);
break;
case DCT_ADST:
load_buffer_4x4(input, in);
iadst4x4_neon(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_4x4(in, in);
idct4x4_neon(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 0, -shift[1], bd);
break;
case ADST_ADST:
load_buffer_4x4(input, in);
iadst4x4_neon(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_4x4(in, in);
iadst4x4_neon(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 0, -shift[1], bd);
break;
case FLIPADST_DCT:
load_buffer_4x4(input, in);
idct4x4_neon(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_4x4(in, in);
iadst4x4_neon(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 1, -shift[1], bd);
break;
case DCT_FLIPADST:
load_buffer_4x4(input, in);
iadst4x4_neon(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_4x4(in, in);
idct4x4_neon(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 1, 0, -shift[1], bd);
break;
case FLIPADST_FLIPADST:
load_buffer_4x4(input, in);
iadst4x4_neon(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_4x4(in, in);
iadst4x4_neon(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 1, 1, -shift[1], bd);
break;
case ADST_FLIPADST:
load_buffer_4x4(input, in);
iadst4x4_neon(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_4x4(in, in);
iadst4x4_neon(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 1, 0, -shift[1], bd);
break;
case FLIPADST_ADST:
load_buffer_4x4(input, in);
iadst4x4_neon(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_4x4(in, in);
iadst4x4_neon(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 1, -shift[1], bd);
break;
case IDTX:
load_buffer_4x4(input, in);
iidentity4_neon(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_4x4(in, in);
iidentity4_neon(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 0, -shift[1], bd);
break;
case V_DCT:
load_buffer_4x4(input, in);
iidentity4_neon(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_4x4(in, in);
idct4x4_neon(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 0, -shift[1], bd);
break;
case H_DCT:
load_buffer_4x4(input, in);
idct4x4_neon(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_4x4(in, in);
iidentity4_neon(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 0, -shift[1], bd);
break;
case V_ADST:
load_buffer_4x4(input, in);
iidentity4_neon(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_4x4(in, in);
iadst4x4_neon(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 0, -shift[1], bd);
break;
case H_ADST:
load_buffer_4x4(input, in);
iadst4x4_neon(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_4x4(in, in);
iidentity4_neon(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 0, -shift[1], bd);
break;
case V_FLIPADST:
load_buffer_4x4(input, in);
iidentity4_neon(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_4x4(in, in);
iadst4x4_neon(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 1, -shift[1], bd);
break;
case H_FLIPADST:
load_buffer_4x4(input, in);
iadst4x4_neon(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_4x4(in, in);
iidentity4_neon(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 1, 0, -shift[1], bd);
break;
@@ -1069,11 +1066,10 @@
}
if (!do_cols) {
- const int32x4_t rnding_shift = vdupq_n_s32(1 << (out_shift - 1));
const int log_range_out = AOMMAX(16, bd + 6);
const int32x4_t clamp_lo_out = vdupq_n_s32(-(1 << (log_range_out - 1)));
const int32x4_t clamp_hi_out = vdupq_n_s32((1 << (log_range_out - 1)) - 1);
- round_shift_8x8(out, out_shift, &rnding_shift);
+ round_shift_8x8(out, out_shift);
highbd_clamp_s32_neon(out, out, &clamp_lo_out, &clamp_hi_out, 16);
}
}
@@ -1384,8 +1380,7 @@
int fliplr, int flipud, int shift, int bd) {
uint16x8_t u0, u1, u2, u3, u4, u5, u6, u7;
uint16x8_t v0, v1, v2, v3, v4, v5, v6, v7;
- const int32x4_t rnding = vdupq_n_s32(1 << (shift - 1));
- round_shift_8x8(in, shift, &rnding);
+ round_shift_8x8(in, shift);
v0 = vld1q_u16(output + 0 * stride);
v1 = vld1q_u16(output + 1 * stride);
@@ -1434,75 +1429,66 @@
switch (tx_type) {
case DCT_DCT:
load_buffer_8x8(input, in);
- transpose_8x8(in, out);
- idct8x8_neon(out, in, INV_COS_BIT, 0, bd, -shift[0]);
- transpose_8x8(in, out);
- idct8x8_neon(out, in, INV_COS_BIT, 1, bd, 0);
- write_buffer_8x8(in, output, stride, 0, 0, -shift[1], bd);
+ idct8x8_neon(in, out, INV_COS_BIT, 0, bd, -shift[0]);
+ transpose_8x8(out, in);
+ idct8x8_neon(in, out, INV_COS_BIT, 1, bd, 0);
+ write_buffer_8x8(out, output, stride, 0, 0, -shift[1], bd);
break;
case DCT_ADST:
load_buffer_8x8(input, in);
- transpose_8x8(in, out);
- iadst8x8_neon(out, in, INV_COS_BIT, 0, bd, -shift[0]);
- transpose_8x8(in, out);
- idct8x8_neon(out, in, INV_COS_BIT, 1, bd, 0);
- write_buffer_8x8(in, output, stride, 0, 0, -shift[1], bd);
+ iadst8x8_neon(in, out, INV_COS_BIT, 0, bd, -shift[0]);
+ transpose_8x8(out, in);
+ idct8x8_neon(in, out, INV_COS_BIT, 1, bd, 0);
+ write_buffer_8x8(out, output, stride, 0, 0, -shift[1], bd);
break;
case ADST_DCT:
load_buffer_8x8(input, in);
- transpose_8x8(in, out);
- idct8x8_neon(out, in, INV_COS_BIT, 0, bd, -shift[0]);
- transpose_8x8(in, out);
- iadst8x8_neon(out, in, INV_COS_BIT, 1, bd, 0);
- write_buffer_8x8(in, output, stride, 0, 0, -shift[1], bd);
+ idct8x8_neon(in, out, INV_COS_BIT, 0, bd, -shift[0]);
+ transpose_8x8(out, in);
+ iadst8x8_neon(in, out, INV_COS_BIT, 1, bd, 0);
+ write_buffer_8x8(out, output, stride, 0, 0, -shift[1], bd);
break;
case ADST_ADST:
load_buffer_8x8(input, in);
- transpose_8x8(in, out);
- iadst8x8_neon(out, in, INV_COS_BIT, 0, bd, -shift[0]);
- transpose_8x8(in, out);
- iadst8x8_neon(out, in, INV_COS_BIT, 1, bd, 0);
- write_buffer_8x8(in, output, stride, 0, 0, -shift[1], bd);
+ iadst8x8_neon(in, out, INV_COS_BIT, 0, bd, -shift[0]);
+ transpose_8x8(out, in);
+ iadst8x8_neon(in, out, INV_COS_BIT, 1, bd, 0);
+ write_buffer_8x8(out, output, stride, 0, 0, -shift[1], bd);
break;
case FLIPADST_DCT:
load_buffer_8x8(input, in);
- transpose_8x8(in, out);
- idct8x8_neon(out, in, INV_COS_BIT, 0, bd, -shift[0]);
- transpose_8x8(in, out);
- iadst8x8_neon(out, in, INV_COS_BIT, 1, bd, 0);
- write_buffer_8x8(in, output, stride, 0, 1, -shift[1], bd);
+ idct8x8_neon(in, out, INV_COS_BIT, 0, bd, -shift[0]);
+ transpose_8x8(out, in);
+ iadst8x8_neon(in, out, INV_COS_BIT, 1, bd, 0);
+ write_buffer_8x8(out, output, stride, 0, 1, -shift[1], bd);
break;
case DCT_FLIPADST:
load_buffer_8x8(input, in);
- transpose_8x8(in, out);
- iadst8x8_neon(out, in, INV_COS_BIT, 0, bd, -shift[0]);
- transpose_8x8(in, out);
- idct8x8_neon(out, in, INV_COS_BIT, 1, bd, 0);
- write_buffer_8x8(in, output, stride, 1, 0, -shift[1], bd);
+ iadst8x8_neon(in, out, INV_COS_BIT, 0, bd, -shift[0]);
+ transpose_8x8(out, in);
+ idct8x8_neon(in, out, INV_COS_BIT, 1, bd, 0);
+ write_buffer_8x8(out, output, stride, 1, 0, -shift[1], bd);
break;
case ADST_FLIPADST:
load_buffer_8x8(input, in);
- transpose_8x8(in, out);
- iadst8x8_neon(out, in, INV_COS_BIT, 0, bd, -shift[0]);
- transpose_8x8(in, out);
- iadst8x8_neon(out, in, INV_COS_BIT, 1, bd, 0);
- write_buffer_8x8(in, output, stride, 1, 0, -shift[1], bd);
+ iadst8x8_neon(in, out, INV_COS_BIT, 0, bd, -shift[0]);
+ transpose_8x8(out, in);
+ iadst8x8_neon(in, out, INV_COS_BIT, 1, bd, 0);
+ write_buffer_8x8(out, output, stride, 1, 0, -shift[1], bd);
break;
case FLIPADST_FLIPADST:
load_buffer_8x8(input, in);
- transpose_8x8(in, out);
- iadst8x8_neon(out, in, INV_COS_BIT, 0, bd, -shift[0]);
- transpose_8x8(in, out);
- iadst8x8_neon(out, in, INV_COS_BIT, 1, bd, 0);
- write_buffer_8x8(in, output, stride, 1, 1, -shift[1], bd);
+ iadst8x8_neon(in, out, INV_COS_BIT, 0, bd, -shift[0]);
+ transpose_8x8(out, in);
+ iadst8x8_neon(in, out, INV_COS_BIT, 1, bd, 0);
+ write_buffer_8x8(out, output, stride, 1, 1, -shift[1], bd);
break;
case FLIPADST_ADST:
load_buffer_8x8(input, in);
- transpose_8x8(in, out);
- iadst8x8_neon(out, in, INV_COS_BIT, 0, bd, -shift[0]);
- transpose_8x8(in, out);
- iadst8x8_neon(out, in, INV_COS_BIT, 1, bd, 0);
- write_buffer_8x8(in, output, stride, 0, 1, -shift[1], bd);
+ iadst8x8_neon(in, out, INV_COS_BIT, 0, bd, -shift[0]);
+ transpose_8x8(out, in);
+ iadst8x8_neon(in, out, INV_COS_BIT, 1, bd, 0);
+ write_buffer_8x8(out, output, stride, 0, 1, -shift[1], bd);
break;
default: assert(0);
}
@@ -1989,11 +1975,10 @@
addsub_neon(u[7], u[8], out + 7, out + 8, &clamp_lo, &clamp_hi);
if (!do_cols) {
- const int32x4_t rnding_shift = vdupq_n_s32(1 << (out_shift - 1));
const int log_range_out = AOMMAX(16, bd + 6);
const int32x4_t clamp_lo_out = vdupq_n_s32(-(1 << (log_range_out - 1)));
const int32x4_t clamp_hi_out = vdupq_n_s32((1 << (log_range_out - 1)) - 1);
- round_shift_8x8(out, out_shift, &rnding_shift);
+ round_shift_8x8(out, out_shift);
highbd_clamp_s32_neon(out, out, &clamp_lo_out, &clamp_hi_out, 16);
}
}
@@ -2530,12 +2515,11 @@
addsub_neon(v[7], v[8], out + 7, out + 8, &clamp_lo, &clamp_hi);
if (!do_cols) {
- const int32x4_t rnding_shift = vdupq_n_s32(1 << (out_shift - 1));
const int log_range_out = AOMMAX(16, bd + 6);
const int32x4_t clamp_lo_out = vdupq_n_s32(-(1 << (log_range_out - 1)));
const int32x4_t clamp_hi_out =
vdupq_n_s32((1 << (log_range_out - 1)) - 1);
- round_shift_8x8(out, out_shift, &rnding_shift);
+ round_shift_8x8(out, out_shift);
highbd_clamp_s32_neon(out, out, &clamp_lo_out, &clamp_hi_out, 16);
}
}
@@ -2821,6 +2805,7 @@
&clamp_hi_out, &v_shift, &offset);
}
}
+
static void iidentity16_neon(int32x4_t *in, int32x4_t *out, int bit,
int do_cols, int bd, int out_shift) {
(void)bit;
@@ -2839,7 +2824,7 @@
a0.val[1] = vreinterpretq_s32_s64(
vshrq_n_s64(vreinterpretq_s64_s32(a0.val[1]), NewSqrt2Bits));
a0 = vzipq_s32(a0.val[0], a0.val[1]);
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
out[i] = vreinterpretq_s32_s64(vzip1q_s64(
vreinterpretq_s64_s32(a0.val[0]), vreinterpretq_s64_s32(a0.val[1])));
#else
@@ -2848,14 +2833,14 @@
}
if (!do_cols) {
- const int32x4_t rnding_shift = vdupq_n_s32(1 << (out_shift - 1));
const int log_range = AOMMAX(16, bd + 6);
const int32x4_t clamp_lo = vdupq_n_s32(-(1 << (log_range - 1)));
const int32x4_t clamp_hi = vdupq_n_s32((1 << (log_range - 1)) - 1);
- round_shift_8x8(out, out_shift, &rnding_shift);
+ round_shift_8x8(out, out_shift);
highbd_clamp_s32_neon(out, out, &clamp_lo, &clamp_hi, 16);
}
}
+
static INLINE void idct64_stage8_neon(int32x4_t *u, const int32_t *cospi,
const int32x4_t *clamp_lo,
const int32x4_t *clamp_hi,
@@ -4687,12 +4672,11 @@
addsub_neon(bf0[15], bf0[16], out + 15, out + 16, &clamp_lo, &clamp_hi);
if (!do_cols) {
- const int32x4_t rnding_shift = vdupq_n_s32(1 << (out_shift - 1));
const int log_range_out = AOMMAX(16, bd + 6);
const int32x4_t clamp_lo_out = vdupq_n_s32(-(1 << (log_range_out - 1)));
const int32x4_t clamp_hi_out = vdupq_n_s32((1 << (log_range_out - 1)) - 1);
- round_shift_8x8(out, out_shift, &rnding_shift);
- round_shift_8x8(out + 16, out_shift, &rnding_shift);
+ round_shift_8x8(out, out_shift);
+ round_shift_8x8(out + 16, out_shift);
highbd_clamp_s32_neon(out, out, &clamp_lo_out, &clamp_hi_out, 32);
}
}
@@ -4720,12 +4704,11 @@
}
if (!do_cols) {
- const int32x4_t rnding_shift = vdupq_n_s32(1 << (out_shift - 1));
const int log_range_out = AOMMAX(16, bd + 6);
const int32x4_t clamp_lo_out = vdupq_n_s32(-(1 << (log_range_out - 1)));
const int32x4_t clamp_hi_out = vdupq_n_s32((1 << (log_range_out - 1)) - 1);
- round_shift_8x8(out, out_shift, &rnding_shift);
- round_shift_8x8(out + 16, out_shift, &rnding_shift);
+ round_shift_8x8(out, out_shift);
+ round_shift_8x8(out + 16, out_shift);
highbd_clamp_s32_neon(out, out, &clamp_lo_out, &clamp_hi_out, 32);
}
}
@@ -4791,7 +4774,7 @@
highbd_txfm_all_1d_zeros_w8_arr[txw_idx][hitx_1d_tab[tx_type]][0];
const transform_1d_neon col_txfm =
highbd_txfm_all_1d_zeros_w8_arr[txh_idx][vitx_1d_tab[tx_type]][1];
- const int input_stride = AOMMIN(32, txfm_size_col);
+ const int input_stride = AOMMIN(32, txfm_size_row);
assert(col_txfm != NULL);
assert(row_txfm != NULL);
@@ -4800,9 +4783,8 @@
// 1st stage: column transform
int32x4_t buf0[8];
- const int32_t *input_row = input;
- int32x4_t *buf0_cur = buf0;
- load_buffer_32bit_input(input_row, input_stride, buf0_cur, txfm_size_row);
+ load_buffer_32bit_input(input, input_stride, buf0, txfm_size_col);
+ load_buffer_32bit_input(input + 4, input_stride, buf0 + 4, txfm_size_col);
round_shift_rect_array_32_neon(buf0, buf0, txfm_size_row);
row_txfm(buf0, buf0, INV_COS_BIT, 0, bd, -shift[0]);
row_txfm(buf0 + 4, buf0 + 4, INV_COS_BIT, 0, bd, -shift[0]);
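
For reference, load_buffer_32bit_input() as used above is assumed to load `size` vectors of four contiguous coefficients, stepping `stride` elements between loads (a sketch of the helper's semantics, not necessarily its exact body):

#include <arm_neon.h>

static INLINE void load_buffer_32bit_input(const int32_t *in, int stride,
                                           int32x4_t *out, int size) {
  for (int i = 0; i < size; ++i) {
    out[i] = vld1q_s32(in + i * stride);  // four contiguous 32-bit values
  }
}

This lines up with the corrected stride bound in the hunk above, txfm_size_row rather than txfm_size_col, now that the coefficients are read in transposed order.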
@@ -4824,7 +4806,7 @@
// 2nd stage: column transform
col_txfm(buf1, buf1, INV_COS_BIT, 1, bd, 0);
- av1_round_shift_array_32_neon(buf1, buf1, txfm_size_row, -shift[1]);
+ round_shift_array_32_neon(buf1, buf1, txfm_size_row, -shift[1]);
// write to buffer
highbd_write_buffer_4xn_neon(buf1, output, stride, ud_flip, txfm_size_row,
@@ -4855,12 +4837,7 @@
const int32_t *input_row = input;
load_buffer_32bit_input(input_row, 4, buf0, txfm_size_col);
- TRANSPOSE_4X4(buf0[0], buf0[2], buf0[4], buf0[6], buf1[0], buf1[1], buf1[2],
- buf1[3]);
- TRANSPOSE_4X4(buf0[1], buf0[3], buf0[5], buf0[7], buf1[4], buf1[5], buf1[6],
- buf1[7]);
-
- round_shift_rect_array_32_neon(buf1, buf0, txfm_size_col);
+ round_shift_rect_array_32_neon(buf0, buf0, txfm_size_col);
row_txfm(buf0, buf0, INV_COS_BIT, 0, bd, -shift[0]);
int32x4_t *buf1_ptr;
@@ -4873,10 +4850,11 @@
// 2nd stage: column transform
for (int i = 0; i < 2; i++) {
- col_txfm(buf1_ptr + i * txfm_size_row, buf1_ptr + i * txfm_size_row,
- INV_COS_BIT, 1, bd, 0);
+ int32x4_t *buf1_cur = buf1_ptr + i * txfm_size_row;
+ transpose_4x4(buf1_cur, buf1_cur);
+ col_txfm(buf1_cur, buf1_cur, INV_COS_BIT, 1, bd, 0);
}
- av1_round_shift_array_32_neon(buf1_ptr, buf1_ptr, txfm_size_col, -shift[1]);
+ round_shift_array_32_neon(buf1_ptr, buf1_ptr, txfm_size_col, -shift[1]);
// write to buffer
highbd_write_buffer_8xn_neon(buf1_ptr, output, stride, ud_flip, txfm_size_row,
bd);
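
The new transpose_4x4() helper called above is assumed to transpose four int32x4_t rows, with src == dst safe because the intermediates are held in locals; a sketch using standard NEON transpose steps:

#include <arm_neon.h>

static INLINE void transpose_4x4(const int32x4_t *in, int32x4_t *out) {
  // Pairwise-transpose rows 0/1 and 2/3, then recombine half-vectors.
  const int32x4x2_t p01 = vtrnq_s32(in[0], in[1]);
  const int32x4x2_t p23 = vtrnq_s32(in[2], in[3]);
  out[0] = vcombine_s32(vget_low_s32(p01.val[0]), vget_low_s32(p23.val[0]));
  out[1] = vcombine_s32(vget_low_s32(p01.val[1]), vget_low_s32(p23.val[1]));
  out[2] = vcombine_s32(vget_high_s32(p01.val[0]), vget_high_s32(p23.val[0]));
  out[3] = vcombine_s32(vget_high_s32(p01.val[1]), vget_high_s32(p23.val[1]));
}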
@@ -4896,7 +4874,7 @@
highbd_txfm_all_1d_zeros_w8_arr[txw_idx][hitx_1d_tab[tx_type]][0];
const transform_1d_neon col_txfm =
highbd_txfm_all_1d_zeros_w8_arr[txh_idx][vitx_1d_tab[tx_type]][2];
- const int input_stride = AOMMIN(32, txfm_size_col);
+ const int input_stride = AOMMIN(32, txfm_size_row);
assert(col_txfm != NULL);
assert(row_txfm != NULL);
@@ -4905,10 +4883,10 @@
// 1st stage: column transform
int32x4_t buf0[16];
- const int32_t *input_row = input;
- int32x4_t *buf0_cur = buf0;
- load_buffer_32bit_input(input_row, input_stride, buf0_cur, txfm_size_row);
for (int i = 0; i < (txfm_size_row >> 2); i++) {
+ const int32_t *input_row = input + i * 4;
+ int32x4_t *buf0_cur = buf0 + i * 4;
+ load_buffer_32bit_input(input_row, input_stride, buf0_cur, txfm_size_col);
row_txfm(buf0 + (i << 2), buf0 + (i << 2), INV_COS_BIT, 0, bd, -shift[0]);
}
@@ -4929,7 +4907,7 @@
// 2nd stage: column transform
col_txfm(buf1, buf1, INV_COS_BIT, 1, bd, 0);
- av1_round_shift_array_32_neon(buf1, buf1, txfm_size_row, -shift[1]);
+ round_shift_array_32_neon(buf1, buf1, txfm_size_row, -shift[1]);
// write to buffer
highbd_write_buffer_4xn_neon(buf1, output, stride, ud_flip, txfm_size_row,
@@ -4961,11 +4939,7 @@
const int32_t *input_row = input;
load_buffer_32bit_input(input_row, 4, buf0, txfm_size_col);
- for (int j = 0; j < buf_size_w_div8; j++) {
- TRANSPOSE_4X4(buf0[j], buf0[j + 4], buf0[j + 8], buf0[j + 12], buf1[4 * j],
- buf1[4 * j + 1], buf1[4 * j + 2], buf1[4 * j + 3]);
- }
- row_txfm(buf1, buf0, INV_COS_BIT, 0, bd, -shift[0]);
+ row_txfm(buf0, buf0, INV_COS_BIT, 0, bd, -shift[0]);
int32x4_t *buf1_ptr;
if (lr_flip) {
@@ -4977,10 +4951,11 @@
// 2nd stage: column transform
for (int i = 0; i < buf_size_w_div8; i++) {
- col_txfm(buf1_ptr + i * txfm_size_row, buf1_ptr + i * txfm_size_row,
- INV_COS_BIT, 1, bd, 0);
+ int32x4_t *buf1_cur = buf1_ptr + i * txfm_size_row;
+ transpose_4x4(buf1_cur, buf1_cur);
+ col_txfm(buf1_cur, buf1_cur, INV_COS_BIT, 1, bd, 0);
}
- av1_round_shift_array_32_neon(buf1_ptr, buf1_ptr, txfm_size_col, -shift[1]);
+ round_shift_array_32_neon(buf1_ptr, buf1_ptr, txfm_size_col, -shift[1]);
// write to buffer
for (int i = 0; i < (txfm_size_col >> 3); i++) {
@@ -4990,9 +4965,10 @@
}
}
-void highbd_inv_txfm2d_add_4x16_neon(const int32_t *input, uint16_t *output,
- int stride, TX_TYPE tx_type, int eob,
- const int bd) {
+static void highbd_inv_txfm2d_add_4x16_neon(const int32_t *input,
+ uint16_t *output, int stride,
+ TX_TYPE tx_type, int eob,
+ const int bd) {
(void)eob;
TX_SIZE tx_size = TX_4X16;
int32x4_t buf1[16];
@@ -5039,16 +5015,17 @@
// 2nd stage: column transform
col_txfm(buf1, buf1, INV_COS_BIT, 1, bd, 0);
- av1_round_shift_array_32_neon(buf1, buf1, txfm_size_row, -shift[1]);
+ round_shift_array_32_neon(buf1, buf1, txfm_size_row, -shift[1]);
// write to buffer
highbd_write_buffer_4xn_neon(buf1, output, stride, ud_flip, txfm_size_row,
bd);
}
-void highbd_inv_txfm2d_add_16x4_neon(const int32_t *input, uint16_t *output,
- int stride, TX_TYPE tx_type, int eob,
- const int bd) {
+static void highbd_inv_txfm2d_add_16x4_neon(const int32_t *input,
+ uint16_t *output, int stride,
+ TX_TYPE tx_type, int eob,
+ const int bd) {
(void)eob;
TX_SIZE tx_size = TX_16X4;
int32x4_t buf1[16];
@@ -5092,7 +5069,7 @@
col_txfm(buf1_ptr + i * txfm_size_row, buf1_ptr + i * txfm_size_row,
INV_COS_BIT, 1, bd, 0);
}
- av1_round_shift_array_32_neon(buf1_ptr, buf1_ptr, txfm_size_col, -shift[1]);
+ round_shift_array_32_neon(buf1_ptr, buf1_ptr, txfm_size_col, -shift[1]);
// write to buffer
for (int i = 0; i < (txfm_size_col >> 3); i++) {
@@ -5261,46 +5238,49 @@
const int txh_idx = get_txh_idx(tx_size);
const int txfm_size_col = tx_size_wide[tx_size];
const int txfm_size_row = tx_size_high[tx_size];
- const int input_stride = AOMMIN(32, txfm_size_col);
- const int buf_size_w_div4 = input_stride >> 2;
+ const int buf_size_w = AOMMIN(32, txfm_size_col);
+ const int buf_size_w_div4 = buf_size_w >> 2;
const int buf_size_h_div8 = (eoby + 8) >> 3;
+ const int row_max = AOMMIN(32, txfm_size_row);
+ const int input_stride = row_max;
const int rect_type = get_rect_tx_log_ratio(txfm_size_col, txfm_size_row);
const int fun_idx = lowbd_txfm_all_1d_zeros_idx[eoby];
const transform_1d_neon row_txfm =
highbd_txfm_all_1d_zeros_w8_arr[txw_idx][hitx_1d_tab[tx_type]][0];
+ assert(row_txfm != NULL);
const transform_1d_neon col_txfm =
highbd_txfm_all_1d_zeros_w8_arr[txh_idx][vitx_1d_tab[tx_type]][fun_idx];
+ assert(col_txfm != NULL);
int ud_flip, lr_flip;
get_flip_cfg(tx_type, &ud_flip, &lr_flip);
for (int i = 0; i < (buf_size_h_div8 << 1); ++i) {
int32x4_t buf0[16];
- const int32_t *input_row = input + i * input_stride * 4;
- for (int j = 0; j < buf_size_w_div4; ++j) {
- int32x4_t *buf0_cur = buf0 + j * 4;
- load_buffer_32bit_input(input_row + j * 4, input_stride, buf0_cur, 4);
- }
+ load_buffer_32bit_input(input + i * 4, input_stride, buf0, buf_size_w);
if (rect_type == 1 || rect_type == -1) {
- round_shift_rect_array_32_neon(buf0, buf0, input_stride);
+ round_shift_rect_array_32_neon(buf0, buf0, buf_size_w);
}
row_txfm(buf0, buf0, INV_COS_BIT, 0, bd, -shift[0]);
int32x4_t *_buf1 = buf1 + i * 4;
for (int j = 0; j < buf_size_w_div4; ++j) {
- _buf1[j * txfm_size_row + 0] = buf0[j * 4 + 0];
- _buf1[j * txfm_size_row + 1] = buf0[j * 4 + 1];
- _buf1[j * txfm_size_row + 2] = buf0[j * 4 + 2];
- _buf1[j * txfm_size_row + 3] = buf0[j * 4 + 3];
+ int32x4_t *buf0_cur = buf0 + j * 4;
+ TRANSPOSE_4X4(buf0_cur[0], buf0_cur[1], buf0_cur[2], buf0_cur[3],
+ buf0_cur[0], buf0_cur[1], buf0_cur[2], buf0_cur[3]);
+ _buf1[j * txfm_size_row + 0] = buf0_cur[0];
+ _buf1[j * txfm_size_row + 1] = buf0_cur[1];
+ _buf1[j * txfm_size_row + 2] = buf0_cur[2];
+ _buf1[j * txfm_size_row + 3] = buf0_cur[3];
}
}
for (int i = 0; i < buf_size_w_div4; i++) {
col_txfm(buf1 + i * txfm_size_row, buf1 + i * txfm_size_row, INV_COS_BIT, 1,
bd, 0);
- av1_round_shift_array_32_neon(buf1 + i * txfm_size_row,
- buf1 + i * txfm_size_row, txfm_size_row,
- -shift[1]);
+ round_shift_array_32_neon(buf1 + i * txfm_size_row,
+ buf1 + i * txfm_size_row, txfm_size_row,
+ -shift[1]);
}
// write to buffer
@@ -5322,46 +5302,43 @@
const int txh_idx = get_txh_idx(tx_size);
const int txfm_size_col = tx_size_wide[tx_size];
const int txfm_size_row = tx_size_high[tx_size];
- const int input_stride = AOMMIN(32, txfm_size_col);
- const int buf_size_w_div8 = input_stride >> 2;
+ const int buf_size_w_div4 = AOMMIN(32, txfm_size_col) >> 2;
const int row_max = AOMMIN(32, txfm_size_row);
+ const int input_stride = row_max;
const int buf_size_nonzero_w_div8 = (eobx + 8) >> 3;
+ const int buf_size_nonzero_w = buf_size_nonzero_w_div8 << 3;
const int rect_type = get_rect_tx_log_ratio(txfm_size_col, txfm_size_row);
const int fun_idx = lowbd_txfm_all_1d_zeros_idx[eobx];
const transform_1d_neon row_txfm =
highbd_txfm_all_1d_zeros_w8_arr[txw_idx][hitx_1d_tab[tx_type]][fun_idx];
+ assert(row_txfm != NULL);
const transform_1d_neon col_txfm =
highbd_txfm_all_1d_zeros_w8_arr[txh_idx][vitx_1d_tab[tx_type]][0];
+ assert(col_txfm != NULL);
int ud_flip, lr_flip;
get_flip_cfg(tx_type, &ud_flip, &lr_flip);
for (int i = 0; i < (row_max >> 2); ++i) {
int32x4_t buf0[16];
- const int32_t *input_row = input + i * input_stride * 4;
- for (int j = 0; j < (buf_size_nonzero_w_div8 << 1); ++j) {
- int32x4_t *buf0_cur = buf0 + j * 4;
- load_buffer_32bit_input(input_row + j * 4, input_stride, buf0_cur, 4);
-
- TRANSPOSE_4X4(buf0_cur[0], buf0_cur[1], buf0_cur[2], buf0_cur[3],
- buf0_cur[0], buf0_cur[1], buf0_cur[2], buf0_cur[3]);
- }
+ load_buffer_32bit_input(input + i * 4, input_stride, buf0,
+ buf_size_nonzero_w);
if (rect_type == 1 || rect_type == -1) {
- round_shift_rect_array_32_neon(buf0, buf0, buf_size_nonzero_w_div8 << 3);
+ round_shift_rect_array_32_neon(buf0, buf0, buf_size_nonzero_w);
}
row_txfm(buf0, buf0, INV_COS_BIT, 0, bd, -shift[0]);
int32x4_t *_buf1 = buf1 + i * 4;
if (lr_flip) {
- for (int j = 0; j < buf_size_w_div8; ++j) {
+ for (int j = 0; j < buf_size_w_div4; ++j) {
TRANSPOSE_4X4(buf0[4 * j + 3], buf0[4 * j + 2], buf0[4 * j + 1],
buf0[4 * j],
- _buf1[txfm_size_row * (buf_size_w_div8 - 1 - j) + 0],
- _buf1[txfm_size_row * (buf_size_w_div8 - 1 - j) + 1],
- _buf1[txfm_size_row * (buf_size_w_div8 - 1 - j) + 2],
- _buf1[txfm_size_row * (buf_size_w_div8 - 1 - j) + 3]);
+ _buf1[txfm_size_row * (buf_size_w_div4 - 1 - j) + 0],
+ _buf1[txfm_size_row * (buf_size_w_div4 - 1 - j) + 1],
+ _buf1[txfm_size_row * (buf_size_w_div4 - 1 - j) + 2],
+ _buf1[txfm_size_row * (buf_size_w_div4 - 1 - j) + 3]);
}
} else {
- for (int j = 0; j < buf_size_w_div8; ++j) {
+ for (int j = 0; j < buf_size_w_div4; ++j) {
TRANSPOSE_4X4(
buf0[j * 4 + 0], buf0[j * 4 + 1], buf0[j * 4 + 2], buf0[j * 4 + 3],
_buf1[j * txfm_size_row + 0], _buf1[j * txfm_size_row + 1],
@@ -5369,13 +5346,13 @@
}
}
}
- for (int i = 0; i < buf_size_w_div8; i++) {
+ for (int i = 0; i < buf_size_w_div4; i++) {
col_txfm(buf1 + i * txfm_size_row, buf1 + i * txfm_size_row, INV_COS_BIT, 1,
bd, 0);
- av1_round_shift_array_32_neon(buf1 + i * txfm_size_row,
- buf1 + i * txfm_size_row, txfm_size_row,
- -shift[1]);
+ round_shift_array_32_neon(buf1 + i * txfm_size_row,
+ buf1 + i * txfm_size_row, txfm_size_row,
+ -shift[1]);
}
// write to buffer
@@ -5386,6 +5363,7 @@
}
}
}
+
static void inv_txfm2d_add_idtx_neon(const int32_t *input, uint16_t *output,
int stride, TX_TYPE tx_type,
TX_SIZE tx_size, const int bd) {
@@ -5395,40 +5373,43 @@
const int txh_idx = get_txh_idx(tx_size);
const int txfm_size_col = tx_size_wide[tx_size];
const int txfm_size_row = tx_size_high[tx_size];
- const int input_stride = AOMMIN(32, txfm_size_col);
const int row_max = AOMMIN(32, txfm_size_row);
+ const int input_stride = row_max;
+ const int buf_size_w = AOMMIN(32, txfm_size_col);
+ const int buf_size_w_div4 = buf_size_w >> 2;
const int rect_type = get_rect_tx_log_ratio(txfm_size_col, txfm_size_row);
const transform_1d_neon row_txfm =
highbd_txfm_all_1d_zeros_w8_arr[txw_idx][hitx_1d_tab[tx_type]][0];
+ assert(row_txfm != NULL);
const transform_1d_neon col_txfm =
highbd_txfm_all_1d_zeros_w8_arr[txh_idx][vitx_1d_tab[tx_type]][0];
+ assert(col_txfm != NULL);
for (int i = 0; i < (row_max >> 2); ++i) {
int32x4_t buf0[32];
- const int32_t *input_row = input + i * input_stride * 4;
- for (int j = 0; j < (input_stride >> 2); ++j) {
- int32x4_t *buf0_cur = buf0 + j * 4;
- load_buffer_32bit_input(input_row + j * 4, input_stride, buf0_cur, 4);
- }
+ load_buffer_32bit_input(input + i * 4, input_stride, buf0, buf_size_w);
if (rect_type == 1 || rect_type == -1) {
- round_shift_rect_array_32_neon(buf0, buf0, input_stride);
+ round_shift_rect_array_32_neon(buf0, buf0, buf_size_w);
}
row_txfm(buf0, buf0, INV_COS_BIT, 0, bd, -shift[0]);
int32x4_t *_buf1 = buf1 + i * 4;
- for (int j = 0; j < (input_stride >> 2); ++j) {
- _buf1[j * txfm_size_row + 0] = buf0[j * 4 + 0];
- _buf1[j * txfm_size_row + 1] = buf0[j * 4 + 1];
- _buf1[j * txfm_size_row + 2] = buf0[j * 4 + 2];
- _buf1[j * txfm_size_row + 3] = buf0[j * 4 + 3];
+ for (int j = 0; j < buf_size_w_div4; ++j) {
+ int32x4_t *buf0_cur = buf0 + j * 4;
+ TRANSPOSE_4X4(buf0_cur[0], buf0_cur[1], buf0_cur[2], buf0_cur[3],
+ buf0_cur[0], buf0_cur[1], buf0_cur[2], buf0_cur[3]);
+ _buf1[j * txfm_size_row + 0] = buf0_cur[0];
+ _buf1[j * txfm_size_row + 1] = buf0_cur[1];
+ _buf1[j * txfm_size_row + 2] = buf0_cur[2];
+ _buf1[j * txfm_size_row + 3] = buf0_cur[3];
}
}
- for (int i = 0; i < (input_stride >> 2); i++) {
+ for (int i = 0; i < buf_size_w_div4; i++) {
col_txfm(buf1 + i * txfm_size_row, buf1 + i * txfm_size_row, INV_COS_BIT, 1,
bd, 0);
- av1_round_shift_array_32_neon(buf1 + i * txfm_size_row,
- buf1 + i * txfm_size_row, txfm_size_row,
- -shift[1]);
+ round_shift_array_32_neon(buf1 + i * txfm_size_row,
+ buf1 + i * txfm_size_row, txfm_size_row,
+ -shift[1]);
}
// write to buffer
@@ -5439,9 +5420,11 @@
}
}
}
-void inv_txfm2d_add_no_identity_neon(const int32_t *input, uint16_t *output,
- int stride, TX_TYPE tx_type,
- TX_SIZE tx_size, const int bd) {
+
+static void inv_txfm2d_add_no_identity_neon(const int32_t *input,
+ uint16_t *output, int stride,
+ TX_TYPE tx_type, TX_SIZE tx_size,
+ const int bd) {
int32x4_t buf1[64 * 16];
int eobx, eoby;
get_eobx_eoby_scan_default(&eobx, &eoby, tx_size);
@@ -5450,10 +5433,10 @@
const int txh_idx = get_txh_idx(tx_size);
const int txfm_size_col = tx_size_wide[tx_size];
const int txfm_size_row = tx_size_high[tx_size];
- const int buf_size_w_div8 = txfm_size_col >> 2;
- const int buf_size_nonzero_w_div8 = (eobx + 8) >> 3;
+ const int buf_size_w_div4 = txfm_size_col >> 2;
+ const int buf_size_nonzero_w = (eobx + 8) >> 3 << 3;
const int buf_size_nonzero_h_div8 = (eoby + 8) >> 3;
- const int input_stride = AOMMIN(32, txfm_size_col);
+ const int input_stride = AOMMIN(32, txfm_size_row);
const int rect_type = get_rect_tx_log_ratio(txfm_size_col, txfm_size_row);
const int fun_idx_x = lowbd_txfm_all_1d_zeros_idx[eobx];
@@ -5470,32 +5453,26 @@
// 1st stage: column transform
for (int i = 0; i < buf_size_nonzero_h_div8 << 1; i++) {
int32x4_t buf0[64];
- const int32_t *input_row = input + i * input_stride * 4;
- for (int j = 0; j < buf_size_nonzero_w_div8 << 1; ++j) {
- int32x4_t *buf0_cur = &buf0[j * 4];
- load_buffer_32bit_input(input_row + j * 4, input_stride, buf0_cur, 4);
-
- TRANSPOSE_4X4(buf0_cur[0], buf0_cur[1], buf0_cur[2], buf0_cur[3],
- buf0_cur[0], buf0_cur[1], buf0_cur[2], buf0_cur[3]);
- }
+ load_buffer_32bit_input(input + i * 4, input_stride, buf0,
+ buf_size_nonzero_w);
if (rect_type == 1 || rect_type == -1) {
- round_shift_rect_array_32_neon(buf0, buf0, buf_size_nonzero_w_div8 << 3);
+ round_shift_rect_array_32_neon(buf0, buf0, buf_size_nonzero_w);
}
row_txfm(buf0, buf0, INV_COS_BIT, 0, bd, -shift[0]);
int32x4_t *_buf1 = &buf1[i * 4];
if (lr_flip) {
- for (int j = 0; j < buf_size_w_div8; ++j) {
+ for (int j = 0; j < buf_size_w_div4; ++j) {
TRANSPOSE_4X4(buf0[4 * j + 3], buf0[4 * j + 2], buf0[4 * j + 1],
buf0[4 * j],
- _buf1[txfm_size_row * (buf_size_w_div8 - 1 - j) + 0],
- _buf1[txfm_size_row * (buf_size_w_div8 - 1 - j) + 1],
- _buf1[txfm_size_row * (buf_size_w_div8 - 1 - j) + 2],
- _buf1[txfm_size_row * (buf_size_w_div8 - 1 - j) + 3]);
+ _buf1[txfm_size_row * (buf_size_w_div4 - 1 - j) + 0],
+ _buf1[txfm_size_row * (buf_size_w_div4 - 1 - j) + 1],
+ _buf1[txfm_size_row * (buf_size_w_div4 - 1 - j) + 2],
+ _buf1[txfm_size_row * (buf_size_w_div4 - 1 - j) + 3]);
}
} else {
- for (int j = 0; j < buf_size_w_div8; ++j) {
+ for (int j = 0; j < buf_size_w_div4; ++j) {
TRANSPOSE_4X4(
buf0[j * 4 + 0], buf0[j * 4 + 1], buf0[j * 4 + 2], buf0[j * 4 + 3],
_buf1[j * txfm_size_row + 0], _buf1[j * txfm_size_row + 1],
@@ -5504,13 +5481,13 @@
}
}
// 2nd stage: column transform
- for (int i = 0; i < buf_size_w_div8; i++) {
+ for (int i = 0; i < buf_size_w_div4; i++) {
col_txfm(buf1 + i * txfm_size_row, buf1 + i * txfm_size_row, INV_COS_BIT, 1,
bd, 0);
- av1_round_shift_array_32_neon(buf1 + i * txfm_size_row,
- buf1 + i * txfm_size_row, txfm_size_row,
- -shift[1]);
+ round_shift_array_32_neon(buf1 + i * txfm_size_row,
+ buf1 + i * txfm_size_row, txfm_size_row,
+ -shift[1]);
}
// write to buffer
@@ -5522,10 +5499,11 @@
}
}
-void highbd_inv_txfm2d_add_no_identity_neon(const int32_t *input,
- uint16_t *output, int stride,
- TX_TYPE tx_type, TX_SIZE tx_size,
- int eob, const int bd) {
+static void highbd_inv_txfm2d_add_no_identity_neon(const int32_t *input,
+ uint16_t *output, int stride,
+ TX_TYPE tx_type,
+ TX_SIZE tx_size, int eob,
+ const int bd) {
int32x4_t buf1[64 * 16];
int eobx, eoby;
highbd_get_eobx_eoby_scan_default(&eobx, &eoby, tx_size, eob);
@@ -5592,9 +5570,9 @@
col_txfm(buf1 + i * txfm_size_row, buf1 + i * txfm_size_row, INV_COS_BIT, 1,
bd, 0);
- av1_round_shift_array_32_neon(buf1 + i * txfm_size_row,
- buf1 + i * txfm_size_row, txfm_size_row,
- -shift[1]);
+ round_shift_array_32_neon(buf1 + i * txfm_size_row,
+ buf1 + i * txfm_size_row, txfm_size_row,
+ -shift[1]);
}
// write to buffer
@@ -5606,10 +5584,11 @@
}
}
-void av1_highbd_inv_txfm2d_add_universe_neon(const int32_t *input,
- uint8_t *output, int stride,
- TX_TYPE tx_type, TX_SIZE tx_size,
- int eob, const int bd) {
+static void highbd_inv_txfm2d_add_universe_neon(const int32_t *input,
+ uint8_t *output, int stride,
+ TX_TYPE tx_type,
+ TX_SIZE tx_size, int eob,
+ const int bd) {
switch (tx_type) {
case DCT_DCT:
case ADST_DCT:
@@ -5643,9 +5622,9 @@
}
}
-void av1_inv_txfm2d_add_universe_neon(const int32_t *input, uint8_t *output,
- int stride, TX_TYPE tx_type,
- TX_SIZE tx_size, const int bd) {
+static void inv_txfm2d_add_universe_neon(const int32_t *input, uint8_t *output,
+ int stride, TX_TYPE tx_type,
+ TX_SIZE tx_size, const int bd) {
switch (tx_type) {
case DCT_DCT:
case ADST_DCT:
@@ -5692,9 +5671,9 @@
case V_DCT:
case V_ADST:
case V_FLIPADST:
- av1_highbd_inv_txfm2d_add_universe_neon(input, dest, stride, tx_type,
- txfm_param->tx_size,
- txfm_param->eob, bd);
+ highbd_inv_txfm2d_add_universe_neon(input, dest, stride, tx_type,
+ txfm_param->tx_size, txfm_param->eob,
+ bd);
break;
default:
av1_inv_txfm2d_add_8x8_neon(src, CONVERT_TO_SHORTPTR(dest), stride,
@@ -5702,6 +5681,7 @@
break;
}
}
+
void av1_highbd_inv_txfm_add_4x4_neon(const tran_low_t *input, uint8_t *dest,
int stride, const TxfmParam *txfm_param) {
assert(av1_ext_tx_used[txfm_param->tx_set_type][txfm_param->tx_type]);
@@ -5733,8 +5713,8 @@
void av1_inv_txfm2d_add_8x16_neon(const tran_low_t *input, uint16_t *dest,
int stride, TX_TYPE tx_type, const int bd) {
- av1_inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type,
- TX_8X16, bd);
+ inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type, TX_8X16,
+ bd);
}
void av1_highbd_inv_txfm_add_4x16_neon(const tran_low_t *input, uint8_t *dest,
@@ -5760,176 +5740,173 @@
void av1_highbd_inv_txfm_add_8x16_neon(const tran_low_t *input, uint8_t *dest,
int stride,
const TxfmParam *txfm_param) {
- av1_highbd_inv_txfm2d_add_universe_neon(input, dest, stride,
- txfm_param->tx_type, TX_8X16,
- txfm_param->eob, txfm_param->bd);
+ highbd_inv_txfm2d_add_universe_neon(input, dest, stride, txfm_param->tx_type,
+ TX_8X16, txfm_param->eob, txfm_param->bd);
}
void av1_highbd_inv_txfm_add_16x8_neon(const tran_low_t *input, uint8_t *dest,
int stride,
const TxfmParam *txfm_param) {
- av1_highbd_inv_txfm2d_add_universe_neon(input, dest, stride,
- txfm_param->tx_type, TX_16X8,
- txfm_param->eob, txfm_param->bd);
+ highbd_inv_txfm2d_add_universe_neon(input, dest, stride, txfm_param->tx_type,
+ TX_16X8, txfm_param->eob, txfm_param->bd);
}
void av1_inv_txfm2d_add_16x8_neon(const tran_low_t *input, uint16_t *dest,
int stride, TX_TYPE tx_type, const int bd) {
- av1_inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type,
- TX_16X8, bd);
+ inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type, TX_16X8,
+ bd);
}
void av1_highbd_inv_txfm_add_16x32_neon(const tran_low_t *input, uint8_t *dest,
int stride,
const TxfmParam *txfm_param) {
- av1_highbd_inv_txfm2d_add_universe_neon(input, dest, stride,
- txfm_param->tx_type, TX_16X32,
- txfm_param->eob, txfm_param->bd);
+ highbd_inv_txfm2d_add_universe_neon(input, dest, stride, txfm_param->tx_type,
+ TX_16X32, txfm_param->eob,
+ txfm_param->bd);
}
void av1_inv_txfm2d_add_16x32_neon(const tran_low_t *input, uint16_t *dest,
int stride, TX_TYPE tx_type, const int bd) {
- av1_inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type,
- TX_16X32, bd);
+ inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type,
+ TX_16X32, bd);
}
void av1_highbd_inv_txfm_add_32x16_neon(const tran_low_t *input, uint8_t *dest,
int stride,
const TxfmParam *txfm_param) {
- av1_highbd_inv_txfm2d_add_universe_neon(input, dest, stride,
- txfm_param->tx_type, TX_32X16,
- txfm_param->eob, txfm_param->bd);
+ highbd_inv_txfm2d_add_universe_neon(input, dest, stride, txfm_param->tx_type,
+ TX_32X16, txfm_param->eob,
+ txfm_param->bd);
}
void av1_inv_txfm2d_add_32x16_neon(const tran_low_t *input, uint16_t *dest,
int stride, TX_TYPE tx_type, const int bd) {
- av1_inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type,
- TX_32X16, bd);
+ inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type,
+ TX_32X16, bd);
}
void av1_highbd_inv_txfm_add_32x32_neon(const tran_low_t *input, uint8_t *dest,
int stride,
const TxfmParam *txfm_param) {
- av1_highbd_inv_txfm2d_add_universe_neon(input, dest, stride,
- txfm_param->tx_type, TX_32X32,
- txfm_param->eob, txfm_param->bd);
+ highbd_inv_txfm2d_add_universe_neon(input, dest, stride, txfm_param->tx_type,
+ TX_32X32, txfm_param->eob,
+ txfm_param->bd);
}
void av1_inv_txfm2d_add_32x32_neon(const tran_low_t *input, uint16_t *dest,
int stride, TX_TYPE tx_type, const int bd) {
- av1_inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type,
- TX_32X32, bd);
+ inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type,
+ TX_32X32, bd);
}
void av1_highbd_inv_txfm_add_64x64_neon(const tran_low_t *input, uint8_t *dest,
int stride,
const TxfmParam *txfm_param) {
- av1_highbd_inv_txfm2d_add_universe_neon(input, dest, stride,
- txfm_param->tx_type, TX_64X64,
- txfm_param->eob, txfm_param->bd);
+ highbd_inv_txfm2d_add_universe_neon(input, dest, stride, txfm_param->tx_type,
+ TX_64X64, txfm_param->eob,
+ txfm_param->bd);
}
void av1_inv_txfm2d_add_64x64_neon(const tran_low_t *input, uint16_t *dest,
int stride, TX_TYPE tx_type, const int bd) {
- av1_inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type,
- TX_64X64, bd);
+ inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type,
+ TX_64X64, bd);
}
void av1_highbd_inv_txfm_add_32x64_neon(const tran_low_t *input, uint8_t *dest,
int stride,
const TxfmParam *txfm_param) {
- av1_highbd_inv_txfm2d_add_universe_neon(input, dest, stride,
- txfm_param->tx_type, TX_32X64,
- txfm_param->eob, txfm_param->bd);
+ highbd_inv_txfm2d_add_universe_neon(input, dest, stride, txfm_param->tx_type,
+ TX_32X64, txfm_param->eob,
+ txfm_param->bd);
}
void av1_inv_txfm2d_add_32x64_neon(const tran_low_t *input, uint16_t *dest,
int stride, TX_TYPE tx_type, const int bd) {
- av1_inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type,
- TX_32X64, bd);
+ inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type,
+ TX_32X64, bd);
}
void av1_highbd_inv_txfm_add_64x32_neon(const tran_low_t *input, uint8_t *dest,
int stride,
const TxfmParam *txfm_param) {
- av1_highbd_inv_txfm2d_add_universe_neon(input, dest, stride,
- txfm_param->tx_type, TX_64X32,
- txfm_param->eob, txfm_param->bd);
+ highbd_inv_txfm2d_add_universe_neon(input, dest, stride, txfm_param->tx_type,
+ TX_64X32, txfm_param->eob,
+ txfm_param->bd);
}
void av1_inv_txfm2d_add_64x32_neon(const tran_low_t *input, uint16_t *dest,
int stride, TX_TYPE tx_type, const int bd) {
- av1_inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type,
- TX_64X32, bd);
+ inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type,
+ TX_64X32, bd);
}
void av1_highbd_inv_txfm_add_64x16_neon(const tran_low_t *input, uint8_t *dest,
int stride,
const TxfmParam *txfm_param) {
- av1_highbd_inv_txfm2d_add_universe_neon(input, dest, stride,
- txfm_param->tx_type, TX_64X16,
- txfm_param->eob, txfm_param->bd);
+ highbd_inv_txfm2d_add_universe_neon(input, dest, stride, txfm_param->tx_type,
+ TX_64X16, txfm_param->eob,
+ txfm_param->bd);
}
+
void av1_inv_txfm2d_add_64x16_neon(const tran_low_t *input, uint16_t *dest,
int stride, TX_TYPE tx_type, const int bd) {
- av1_inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type,
- TX_64X16, bd);
+ inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type,
+ TX_64X16, bd);
}
void av1_highbd_inv_txfm_add_16x64_neon(const tran_low_t *input, uint8_t *dest,
int stride,
const TxfmParam *txfm_param) {
- av1_highbd_inv_txfm2d_add_universe_neon(input, dest, stride,
- txfm_param->tx_type, TX_16X64,
- txfm_param->eob, txfm_param->bd);
+ highbd_inv_txfm2d_add_universe_neon(input, dest, stride, txfm_param->tx_type,
+ TX_16X64, txfm_param->eob,
+ txfm_param->bd);
}
void av1_inv_txfm2d_add_16x64_neon(const tran_low_t *input, uint16_t *dest,
int stride, TX_TYPE tx_type, const int bd) {
- av1_inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type,
- TX_16X64, bd);
+ inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type,
+ TX_16X64, bd);
}
void av1_highbd_inv_txfm_add_16x16_neon(const tran_low_t *input, uint8_t *dest,
int stride,
const TxfmParam *txfm_param) {
- av1_highbd_inv_txfm2d_add_universe_neon(input, dest, stride,
- txfm_param->tx_type, TX_16X16,
- txfm_param->eob, txfm_param->bd);
+ highbd_inv_txfm2d_add_universe_neon(input, dest, stride, txfm_param->tx_type,
+ TX_16X16, txfm_param->eob,
+ txfm_param->bd);
}
void av1_inv_txfm2d_add_16x16_neon(const tran_low_t *input, uint16_t *dest,
int stride, TX_TYPE tx_type, const int bd) {
- av1_inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type,
- TX_16X16, bd);
+ inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type,
+ TX_16X16, bd);
}
void av1_highbd_inv_txfm_add_32x8_neon(const tran_low_t *input, uint8_t *dest,
int stride,
const TxfmParam *txfm_param) {
- av1_highbd_inv_txfm2d_add_universe_neon(input, dest, stride,
- txfm_param->tx_type, TX_32X8,
- txfm_param->eob, txfm_param->bd);
+ highbd_inv_txfm2d_add_universe_neon(input, dest, stride, txfm_param->tx_type,
+ TX_32X8, txfm_param->eob, txfm_param->bd);
}
void av1_inv_txfm2d_add_32x8_neon(const tran_low_t *input, uint16_t *dest,
int stride, TX_TYPE tx_type, const int bd) {
- av1_inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type,
- TX_32X8, bd);
+ inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type, TX_32X8,
+ bd);
}
void av1_highbd_inv_txfm_add_8x32_neon(const tran_low_t *input, uint8_t *dest,
int stride,
const TxfmParam *txfm_param) {
- av1_highbd_inv_txfm2d_add_universe_neon(input, dest, stride,
- txfm_param->tx_type, TX_8X32,
- txfm_param->eob, txfm_param->bd);
+ highbd_inv_txfm2d_add_universe_neon(input, dest, stride, txfm_param->tx_type,
+ TX_8X32, txfm_param->eob, txfm_param->bd);
}
void av1_inv_txfm2d_add_8x32_neon(const tran_low_t *input, uint16_t *dest,
int stride, TX_TYPE tx_type, const int bd) {
- av1_inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type,
- TX_8X32, bd);
+ inv_txfm2d_add_universe_neon(input, (uint8_t *)dest, stride, tx_type, TX_8X32,
+ bd);
}
void av1_highbd_inv_txfm_add_neon(const tran_low_t *input, uint8_t *dest,
diff --git a/av1/common/arm/jnt_convolve_neon.c b/av1/common/arm/jnt_convolve_neon.c
index 36c8f9c..564f7c2 100644
--- a/av1/common/arm/jnt_convolve_neon.c
+++ b/av1/common/arm/jnt_convolve_neon.c
@@ -22,548 +22,566 @@
#include "av1/common/common.h"
#include "av1/common/arm/convolve_neon.h"
-#if !defined(__aarch64__)
-static INLINE void compute_avg_4x1(
- uint16x4_t res0, uint16x4_t d0, const uint16_t fwd_offset,
- const uint16_t bck_offset, const int16x4_t sub_const_vec,
- const int16_t round_bits, const int use_dist_wtd_comp_avg, uint8x8_t *t0) {
- int16x4_t tmp0;
- uint16x4_t tmp_u0;
- uint32x4_t sum0;
- int32x4_t dst0;
- int16x8_t tmp4;
+#if !AOM_ARCH_AARCH64
+static INLINE void compute_dist_wtd_avg_4x1(uint16x4_t dd0, uint16x4_t d0,
+ const uint16_t fwd_offset,
+ const uint16_t bck_offset,
+ const int16x4_t round_offset,
+ uint8x8_t *d0_u8) {
+ uint32x4_t blend0 = vmull_n_u16(dd0, fwd_offset);
+ blend0 = vmlal_n_u16(blend0, d0, bck_offset);
- if (use_dist_wtd_comp_avg) {
- const int32x4_t round_bits_vec = vdupq_n_s32((int32_t)(-round_bits));
+ uint16x4_t avg0 = vshrn_n_u32(blend0, DIST_PRECISION_BITS);
- sum0 = vmull_n_u16(res0, fwd_offset);
- sum0 = vmlal_n_u16(sum0, d0, bck_offset);
+ int16x4_t dst0 = vsub_s16(vreinterpret_s16_u16(avg0), round_offset);
- sum0 = vshrq_n_u32(sum0, DIST_PRECISION_BITS);
+ int16x8_t dst0q = vcombine_s16(dst0, vdup_n_s16(0));
- dst0 = vsubq_s32(vreinterpretq_s32_u32(sum0), vmovl_s16(sub_const_vec));
-
- dst0 = vqrshlq_s32(dst0, round_bits_vec);
-
- tmp0 = vmovn_s32(dst0);
- tmp4 = vcombine_s16(tmp0, tmp0);
-
- *t0 = vqmovun_s16(tmp4);
- } else {
- const int16x4_t round_bits_vec = vdup_n_s16(-round_bits);
- tmp_u0 = vhadd_u16(res0, d0);
-
- tmp0 = vsub_s16(vreinterpret_s16_u16(tmp_u0), sub_const_vec);
-
- tmp0 = vqrshl_s16(tmp0, round_bits_vec);
-
- tmp4 = vcombine_s16(tmp0, vdup_n_s16(0));
-
- *t0 = vqmovun_s16(tmp4);
- }
+ *d0_u8 = vqrshrun_n_s16(dst0q, FILTER_BITS - ROUND0_BITS);
}
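
For intuition, a scalar model of the distance-weighted compound average the rewritten helper above vectorizes. The macro values are restated locally only to keep the snippet self-contained; they are libaom's usual values, stated here as an assumption rather than taken from this patch:

#include <stdint.h>

#define DIST_PRECISION_BITS 4  // assumed libaom value
#define FILTER_BITS 7          // assumed libaom value
#define ROUND0_BITS 3          // assumed libaom value

static inline uint8_t dist_wtd_avg_scalar(uint16_t dd, uint16_t d,
                                          uint16_t fwd_offset,
                                          uint16_t bck_offset,
                                          int16_t round_offset) {
  // Weighted blend of the two predictions (vmull_n_u16/vmlal_n_u16).
  uint32_t blend = (uint32_t)dd * fwd_offset + (uint32_t)d * bck_offset;
  // Narrowing shift (vshrn_n_u32) and offset removal (vsub_s16).
  int32_t avg = (int32_t)(blend >> DIST_PRECISION_BITS) - round_offset;
  // Final rounding, saturating shift back to pixel range (vqrshrun_n_s16);
  // an arithmetic right shift is assumed for negative values, as on the
  // targets here.
  int32_t res = (avg + (1 << (FILTER_BITS - ROUND0_BITS - 1))) >>
                (FILTER_BITS - ROUND0_BITS);
  return (uint8_t)(res < 0 ? 0 : res > 255 ? 255 : res);
}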
-static INLINE void compute_avg_8x1(
- uint16x8_t res0, uint16x8_t d0, const uint16_t fwd_offset,
- const uint16_t bck_offset, const int16x4_t sub_const,
- const int16_t round_bits, const int use_dist_wtd_comp_avg, uint8x8_t *t0) {
- int16x8_t f0;
- uint32x4_t sum0, sum2;
- int32x4_t dst0, dst2;
+static INLINE void compute_basic_avg_4x1(uint16x4_t dd0, uint16x4_t d0,
+ const int16x4_t round_offset,
+ uint8x8_t *d0_u8) {
+ uint16x4_t avg0 = vhadd_u16(dd0, d0);
- uint16x8_t tmp_u0;
+ int16x4_t dst0 = vsub_s16(vreinterpret_s16_u16(avg0), round_offset);
- if (use_dist_wtd_comp_avg) {
- const int32x4_t sub_const_vec = vmovl_s16(sub_const);
- const int32x4_t round_bits_vec = vdupq_n_s32(-(int32_t)round_bits);
+ int16x8_t dst0q = vcombine_s16(dst0, vdup_n_s16(0));
- sum0 = vmull_n_u16(vget_low_u16(res0), fwd_offset);
- sum0 = vmlal_n_u16(sum0, vget_low_u16(d0), bck_offset);
- sum0 = vshrq_n_u32(sum0, DIST_PRECISION_BITS);
-
- sum2 = vmull_n_u16(vget_high_u16(res0), fwd_offset);
- sum2 = vmlal_n_u16(sum2, vget_high_u16(d0), bck_offset);
- sum2 = vshrq_n_u32(sum2, DIST_PRECISION_BITS);
-
- dst0 = vsubq_s32(vreinterpretq_s32_u32(sum0), sub_const_vec);
- dst2 = vsubq_s32(vreinterpretq_s32_u32(sum2), sub_const_vec);
-
- dst0 = vqrshlq_s32(dst0, round_bits_vec);
- dst2 = vqrshlq_s32(dst2, round_bits_vec);
-
- f0 = vcombine_s16(vmovn_s32(dst0), vmovn_s32(dst2));
-
- *t0 = vqmovun_s16(f0);
-
- } else {
- const int16x8_t sub_const_vec = vcombine_s16(sub_const, sub_const);
- const int16x8_t round_bits_vec = vdupq_n_s16(-round_bits);
-
- tmp_u0 = vhaddq_u16(res0, d0);
-
- f0 = vsubq_s16(vreinterpretq_s16_u16(tmp_u0), sub_const_vec);
-
- f0 = vqrshlq_s16(f0, round_bits_vec);
-
- *t0 = vqmovun_s16(f0);
- }
+ *d0_u8 = vqrshrun_n_s16(dst0q, FILTER_BITS - ROUND0_BITS);
}
-#endif // !defined(__arch64__)
-static INLINE void compute_avg_4x4(
- uint16x4_t res0, uint16x4_t res1, uint16x4_t res2, uint16x4_t res3,
+static INLINE void compute_dist_wtd_avg_8x1(uint16x8_t dd0, uint16x8_t d0,
+ const uint16_t fwd_offset,
+ const uint16_t bck_offset,
+ const int16x8_t round_offset,
+ uint8x8_t *d0_u8) {
+ uint32x4_t blend0_lo = vmull_n_u16(vget_low_u16(dd0), fwd_offset);
+ blend0_lo = vmlal_n_u16(blend0_lo, vget_low_u16(d0), bck_offset);
+ uint32x4_t blend0_hi = vmull_n_u16(vget_high_u16(dd0), fwd_offset);
+ blend0_hi = vmlal_n_u16(blend0_hi, vget_high_u16(d0), bck_offset);
+
+ uint16x8_t avg0 = vcombine_u16(vshrn_n_u32(blend0_lo, DIST_PRECISION_BITS),
+ vshrn_n_u32(blend0_hi, DIST_PRECISION_BITS));
+
+ int16x8_t dst0 = vsubq_s16(vreinterpretq_s16_u16(avg0), round_offset);
+
+ *d0_u8 = vqrshrun_n_s16(dst0, FILTER_BITS - ROUND0_BITS);
+}
+
+static INLINE void compute_basic_avg_8x1(uint16x8_t dd0, uint16x8_t d0,
+ const int16x8_t round_offset,
+ uint8x8_t *d0_u8) {
+ uint16x8_t avg0 = vhaddq_u16(dd0, d0);
+
+ int16x8_t dst0 = vsubq_s16(vreinterpretq_s16_u16(avg0), round_offset);
+
+ *d0_u8 = vqrshrun_n_s16(dst0, FILTER_BITS - ROUND0_BITS);
+}
+
+#endif // !AOM_ARCH_AARCH64
+
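
compute_basic_avg_*() above is the unweighted path: vhadd_u16 is a truncating halving add, (dd + d) >> 1, computed without widening. A matching scalar sketch, reusing the assumed macro values from the previous snippet:

static inline uint8_t basic_avg_scalar(uint16_t dd, uint16_t d,
                                       int16_t round_offset) {
  // Truncating halving add (vhadd_u16), then offset removal.
  int32_t avg = (int32_t)(((uint32_t)dd + d) >> 1) - round_offset;
  // Rounding, saturating shift to pixel range (vqrshrun_n_s16).
  int32_t res = (avg + (1 << (FILTER_BITS - ROUND0_BITS - 1))) >>
                (FILTER_BITS - ROUND0_BITS);
  return (uint8_t)(res < 0 ? 0 : res > 255 ? 255 : res);
}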
+static INLINE void compute_dist_wtd_avg_4x4(
+ uint16x4_t dd0, uint16x4_t dd1, uint16x4_t dd2, uint16x4_t dd3,
uint16x4_t d0, uint16x4_t d1, uint16x4_t d2, uint16x4_t d3,
const uint16_t fwd_offset, const uint16_t bck_offset,
- const int16x4_t sub_const_vec, const int16_t round_bits,
- const int use_dist_wtd_comp_avg, uint8x8_t *t0, uint8x8_t *t1) {
- int16x4_t tmp0, tmp1, tmp2, tmp3;
- uint16x4_t tmp_u0, tmp_u1, tmp_u2, tmp_u3;
- uint32x4_t sum0, sum1, sum2, sum3;
+ const int16x8_t round_offset, uint8x8_t *d01_u8, uint8x8_t *d23_u8) {
+ uint32x4_t blend0 = vmull_n_u16(dd0, fwd_offset);
+ blend0 = vmlal_n_u16(blend0, d0, bck_offset);
+ uint32x4_t blend1 = vmull_n_u16(dd1, fwd_offset);
+ blend1 = vmlal_n_u16(blend1, d1, bck_offset);
+ uint32x4_t blend2 = vmull_n_u16(dd2, fwd_offset);
+ blend2 = vmlal_n_u16(blend2, d2, bck_offset);
+ uint32x4_t blend3 = vmull_n_u16(dd3, fwd_offset);
+ blend3 = vmlal_n_u16(blend3, d3, bck_offset);
- int32x4_t dst0, dst1, dst2, dst3;
- int16x8_t tmp4, tmp5;
+ uint16x4_t avg0 = vshrn_n_u32(blend0, DIST_PRECISION_BITS);
+ uint16x4_t avg1 = vshrn_n_u32(blend1, DIST_PRECISION_BITS);
+ uint16x4_t avg2 = vshrn_n_u32(blend2, DIST_PRECISION_BITS);
+ uint16x4_t avg3 = vshrn_n_u32(blend3, DIST_PRECISION_BITS);
- if (use_dist_wtd_comp_avg) {
- const int32x4_t round_bits_vec = vdupq_n_s32((int32_t)(-round_bits));
- const int32x4_t const_vec = vmovl_s16(sub_const_vec);
+ int16x8_t dst_01 = vreinterpretq_s16_u16(vcombine_u16(avg0, avg1));
+ int16x8_t dst_23 = vreinterpretq_s16_u16(vcombine_u16(avg2, avg3));
- sum0 = vmull_n_u16(res0, fwd_offset);
- sum0 = vmlal_n_u16(sum0, d0, bck_offset);
- sum1 = vmull_n_u16(res1, fwd_offset);
- sum1 = vmlal_n_u16(sum1, d1, bck_offset);
- sum2 = vmull_n_u16(res2, fwd_offset);
- sum2 = vmlal_n_u16(sum2, d2, bck_offset);
- sum3 = vmull_n_u16(res3, fwd_offset);
- sum3 = vmlal_n_u16(sum3, d3, bck_offset);
+ dst_01 = vsubq_s16(dst_01, round_offset);
+ dst_23 = vsubq_s16(dst_23, round_offset);
- sum0 = vshrq_n_u32(sum0, DIST_PRECISION_BITS);
- sum1 = vshrq_n_u32(sum1, DIST_PRECISION_BITS);
- sum2 = vshrq_n_u32(sum2, DIST_PRECISION_BITS);
- sum3 = vshrq_n_u32(sum3, DIST_PRECISION_BITS);
-
- dst0 = vsubq_s32(vreinterpretq_s32_u32(sum0), const_vec);
- dst1 = vsubq_s32(vreinterpretq_s32_u32(sum1), const_vec);
- dst2 = vsubq_s32(vreinterpretq_s32_u32(sum2), const_vec);
- dst3 = vsubq_s32(vreinterpretq_s32_u32(sum3), const_vec);
-
- dst0 = vqrshlq_s32(dst0, round_bits_vec);
- dst1 = vqrshlq_s32(dst1, round_bits_vec);
- dst2 = vqrshlq_s32(dst2, round_bits_vec);
- dst3 = vqrshlq_s32(dst3, round_bits_vec);
-
- tmp4 = vcombine_s16(vmovn_s32(dst0), vmovn_s32(dst1));
- tmp5 = vcombine_s16(vmovn_s32(dst2), vmovn_s32(dst3));
-
- *t0 = vqmovun_s16(tmp4);
- *t1 = vqmovun_s16(tmp5);
- } else {
- const int16x4_t round_bits_vec = vdup_n_s16(-round_bits);
- tmp_u0 = vhadd_u16(res0, d0);
- tmp_u1 = vhadd_u16(res1, d1);
- tmp_u2 = vhadd_u16(res2, d2);
- tmp_u3 = vhadd_u16(res3, d3);
-
- tmp0 = vsub_s16(vreinterpret_s16_u16(tmp_u0), sub_const_vec);
- tmp1 = vsub_s16(vreinterpret_s16_u16(tmp_u1), sub_const_vec);
- tmp2 = vsub_s16(vreinterpret_s16_u16(tmp_u2), sub_const_vec);
- tmp3 = vsub_s16(vreinterpret_s16_u16(tmp_u3), sub_const_vec);
-
- tmp0 = vqrshl_s16(tmp0, round_bits_vec);
- tmp1 = vqrshl_s16(tmp1, round_bits_vec);
- tmp2 = vqrshl_s16(tmp2, round_bits_vec);
- tmp3 = vqrshl_s16(tmp3, round_bits_vec);
-
- tmp4 = vcombine_s16(tmp0, tmp1);
- tmp5 = vcombine_s16(tmp2, tmp3);
-
- *t0 = vqmovun_s16(tmp4);
- *t1 = vqmovun_s16(tmp5);
- }
+ *d01_u8 = vqrshrun_n_s16(dst_01, FILTER_BITS - ROUND0_BITS);
+ *d23_u8 = vqrshrun_n_s16(dst_23, FILTER_BITS - ROUND0_BITS);
}
-static INLINE void compute_avg_8x4(
- uint16x8_t res0, uint16x8_t res1, uint16x8_t res2, uint16x8_t res3,
+static INLINE void compute_basic_avg_4x4(uint16x4_t dd0, uint16x4_t dd1,
+ uint16x4_t dd2, uint16x4_t dd3,
+ uint16x4_t d0, uint16x4_t d1,
+ uint16x4_t d2, uint16x4_t d3,
+ const int16x8_t round_offset,
+ uint8x8_t *d01_u8, uint8x8_t *d23_u8) {
+ uint16x4_t avg0 = vhadd_u16(dd0, d0);
+ uint16x4_t avg1 = vhadd_u16(dd1, d1);
+ uint16x4_t avg2 = vhadd_u16(dd2, d2);
+ uint16x4_t avg3 = vhadd_u16(dd3, d3);
+
+ int16x8_t dst_01 = vreinterpretq_s16_u16(vcombine_u16(avg0, avg1));
+ int16x8_t dst_23 = vreinterpretq_s16_u16(vcombine_u16(avg2, avg3));
+
+ dst_01 = vsubq_s16(dst_01, round_offset);
+ dst_23 = vsubq_s16(dst_23, round_offset);
+
+ *d01_u8 = vqrshrun_n_s16(dst_01, FILTER_BITS - ROUND0_BITS);
+ *d23_u8 = vqrshrun_n_s16(dst_23, FILTER_BITS - ROUND0_BITS);
+}
+
+static INLINE void compute_dist_wtd_avg_8x4(
+ uint16x8_t dd0, uint16x8_t dd1, uint16x8_t dd2, uint16x8_t dd3,
uint16x8_t d0, uint16x8_t d1, uint16x8_t d2, uint16x8_t d3,
const uint16_t fwd_offset, const uint16_t bck_offset,
- const int16x4_t sub_const, const int16_t round_bits,
- const int use_dist_wtd_comp_avg, uint8x8_t *t0, uint8x8_t *t1,
- uint8x8_t *t2, uint8x8_t *t3) {
- int16x8_t f0, f1, f2, f3;
- uint32x4_t sum0, sum1, sum2, sum3;
- uint32x4_t sum4, sum5, sum6, sum7;
- int32x4_t dst0, dst1, dst2, dst3;
- int32x4_t dst4, dst5, dst6, dst7;
- uint16x8_t tmp_u0, tmp_u1, tmp_u2, tmp_u3;
+ const int16x8_t round_offset, uint8x8_t *d0_u8, uint8x8_t *d1_u8,
+ uint8x8_t *d2_u8, uint8x8_t *d3_u8) {
+ uint32x4_t blend0_lo = vmull_n_u16(vget_low_u16(dd0), fwd_offset);
+ blend0_lo = vmlal_n_u16(blend0_lo, vget_low_u16(d0), bck_offset);
+ uint32x4_t blend0_hi = vmull_n_u16(vget_high_u16(dd0), fwd_offset);
+ blend0_hi = vmlal_n_u16(blend0_hi, vget_high_u16(d0), bck_offset);
- if (use_dist_wtd_comp_avg) {
- const int32x4_t sub_const_vec = vmovl_s16(sub_const);
- const int32x4_t round_bits_vec = vdupq_n_s32(-(int32_t)round_bits);
+ uint32x4_t blend1_lo = vmull_n_u16(vget_low_u16(dd1), fwd_offset);
+ blend1_lo = vmlal_n_u16(blend1_lo, vget_low_u16(d1), bck_offset);
+ uint32x4_t blend1_hi = vmull_n_u16(vget_high_u16(dd1), fwd_offset);
+ blend1_hi = vmlal_n_u16(blend1_hi, vget_high_u16(d1), bck_offset);
- sum0 = vmull_n_u16(vget_low_u16(res0), fwd_offset);
- sum0 = vmlal_n_u16(sum0, vget_low_u16(d0), bck_offset);
- sum1 = vmull_n_u16(vget_low_u16(res1), fwd_offset);
- sum1 = vmlal_n_u16(sum1, vget_low_u16(d1), bck_offset);
- sum0 = vshrq_n_u32(sum0, DIST_PRECISION_BITS);
- sum1 = vshrq_n_u32(sum1, DIST_PRECISION_BITS);
+ uint32x4_t blend2_lo = vmull_n_u16(vget_low_u16(dd2), fwd_offset);
+ blend2_lo = vmlal_n_u16(blend2_lo, vget_low_u16(d2), bck_offset);
+ uint32x4_t blend2_hi = vmull_n_u16(vget_high_u16(dd2), fwd_offset);
+ blend2_hi = vmlal_n_u16(blend2_hi, vget_high_u16(d2), bck_offset);
- sum2 = vmull_n_u16(vget_high_u16(res0), fwd_offset);
- sum2 = vmlal_n_u16(sum2, vget_high_u16(d0), bck_offset);
- sum3 = vmull_n_u16(vget_high_u16(res1), fwd_offset);
- sum3 = vmlal_n_u16(sum3, vget_high_u16(d1), bck_offset);
- sum2 = vshrq_n_u32(sum2, DIST_PRECISION_BITS);
- sum3 = vshrq_n_u32(sum3, DIST_PRECISION_BITS);
+ uint32x4_t blend3_lo = vmull_n_u16(vget_low_u16(dd3), fwd_offset);
+ blend3_lo = vmlal_n_u16(blend3_lo, vget_low_u16(d3), bck_offset);
+ uint32x4_t blend3_hi = vmull_n_u16(vget_high_u16(dd3), fwd_offset);
+ blend3_hi = vmlal_n_u16(blend3_hi, vget_high_u16(d3), bck_offset);
- sum4 = vmull_n_u16(vget_low_u16(res2), fwd_offset);
- sum4 = vmlal_n_u16(sum4, vget_low_u16(d2), bck_offset);
- sum5 = vmull_n_u16(vget_low_u16(res3), fwd_offset);
- sum5 = vmlal_n_u16(sum5, vget_low_u16(d3), bck_offset);
- sum4 = vshrq_n_u32(sum4, DIST_PRECISION_BITS);
- sum5 = vshrq_n_u32(sum5, DIST_PRECISION_BITS);
+ uint16x8_t avg0 = vcombine_u16(vshrn_n_u32(blend0_lo, DIST_PRECISION_BITS),
+ vshrn_n_u32(blend0_hi, DIST_PRECISION_BITS));
+ uint16x8_t avg1 = vcombine_u16(vshrn_n_u32(blend1_lo, DIST_PRECISION_BITS),
+ vshrn_n_u32(blend1_hi, DIST_PRECISION_BITS));
+ uint16x8_t avg2 = vcombine_u16(vshrn_n_u32(blend2_lo, DIST_PRECISION_BITS),
+ vshrn_n_u32(blend2_hi, DIST_PRECISION_BITS));
+ uint16x8_t avg3 = vcombine_u16(vshrn_n_u32(blend3_lo, DIST_PRECISION_BITS),
+ vshrn_n_u32(blend3_hi, DIST_PRECISION_BITS));
- sum6 = vmull_n_u16(vget_high_u16(res2), fwd_offset);
- sum6 = vmlal_n_u16(sum6, vget_high_u16(d2), bck_offset);
- sum7 = vmull_n_u16(vget_high_u16(res3), fwd_offset);
- sum7 = vmlal_n_u16(sum7, vget_high_u16(d3), bck_offset);
- sum6 = vshrq_n_u32(sum6, DIST_PRECISION_BITS);
- sum7 = vshrq_n_u32(sum7, DIST_PRECISION_BITS);
+ int16x8_t dst0 = vsubq_s16(vreinterpretq_s16_u16(avg0), round_offset);
+ int16x8_t dst1 = vsubq_s16(vreinterpretq_s16_u16(avg1), round_offset);
+ int16x8_t dst2 = vsubq_s16(vreinterpretq_s16_u16(avg2), round_offset);
+ int16x8_t dst3 = vsubq_s16(vreinterpretq_s16_u16(avg3), round_offset);
- dst0 = vsubq_s32(vreinterpretq_s32_u32(sum0), sub_const_vec);
- dst1 = vsubq_s32(vreinterpretq_s32_u32(sum1), sub_const_vec);
- dst2 = vsubq_s32(vreinterpretq_s32_u32(sum2), sub_const_vec);
- dst3 = vsubq_s32(vreinterpretq_s32_u32(sum3), sub_const_vec);
- dst4 = vsubq_s32(vreinterpretq_s32_u32(sum4), sub_const_vec);
- dst5 = vsubq_s32(vreinterpretq_s32_u32(sum5), sub_const_vec);
- dst6 = vsubq_s32(vreinterpretq_s32_u32(sum6), sub_const_vec);
- dst7 = vsubq_s32(vreinterpretq_s32_u32(sum7), sub_const_vec);
-
- dst0 = vqrshlq_s32(dst0, round_bits_vec);
- dst1 = vqrshlq_s32(dst1, round_bits_vec);
- dst2 = vqrshlq_s32(dst2, round_bits_vec);
- dst3 = vqrshlq_s32(dst3, round_bits_vec);
- dst4 = vqrshlq_s32(dst4, round_bits_vec);
- dst5 = vqrshlq_s32(dst5, round_bits_vec);
- dst6 = vqrshlq_s32(dst6, round_bits_vec);
- dst7 = vqrshlq_s32(dst7, round_bits_vec);
-
- f0 = vcombine_s16(vmovn_s32(dst0), vmovn_s32(dst2));
- f1 = vcombine_s16(vmovn_s32(dst1), vmovn_s32(dst3));
- f2 = vcombine_s16(vmovn_s32(dst4), vmovn_s32(dst6));
- f3 = vcombine_s16(vmovn_s32(dst5), vmovn_s32(dst7));
-
- *t0 = vqmovun_s16(f0);
- *t1 = vqmovun_s16(f1);
- *t2 = vqmovun_s16(f2);
- *t3 = vqmovun_s16(f3);
-
- } else {
- const int16x8_t sub_const_vec = vcombine_s16(sub_const, sub_const);
- const int16x8_t round_bits_vec = vdupq_n_s16(-round_bits);
-
- tmp_u0 = vhaddq_u16(res0, d0);
- tmp_u1 = vhaddq_u16(res1, d1);
- tmp_u2 = vhaddq_u16(res2, d2);
- tmp_u3 = vhaddq_u16(res3, d3);
-
- f0 = vsubq_s16(vreinterpretq_s16_u16(tmp_u0), sub_const_vec);
- f1 = vsubq_s16(vreinterpretq_s16_u16(tmp_u1), sub_const_vec);
- f2 = vsubq_s16(vreinterpretq_s16_u16(tmp_u2), sub_const_vec);
- f3 = vsubq_s16(vreinterpretq_s16_u16(tmp_u3), sub_const_vec);
-
- f0 = vqrshlq_s16(f0, round_bits_vec);
- f1 = vqrshlq_s16(f1, round_bits_vec);
- f2 = vqrshlq_s16(f2, round_bits_vec);
- f3 = vqrshlq_s16(f3, round_bits_vec);
-
- *t0 = vqmovun_s16(f0);
- *t1 = vqmovun_s16(f1);
- *t2 = vqmovun_s16(f2);
- *t3 = vqmovun_s16(f3);
- }
+ *d0_u8 = vqrshrun_n_s16(dst0, FILTER_BITS - ROUND0_BITS);
+ *d1_u8 = vqrshrun_n_s16(dst1, FILTER_BITS - ROUND0_BITS);
+ *d2_u8 = vqrshrun_n_s16(dst2, FILTER_BITS - ROUND0_BITS);
+ *d3_u8 = vqrshrun_n_s16(dst3, FILTER_BITS - ROUND0_BITS);
}
-#if defined(__aarch64__) && defined(__ARM_FEATURE_MATMUL_INT8)
+static INLINE void compute_basic_avg_8x4(uint16x8_t dd0, uint16x8_t dd1,
+ uint16x8_t dd2, uint16x8_t dd3,
+ uint16x8_t d0, uint16x8_t d1,
+ uint16x8_t d2, uint16x8_t d3,
+ const int16x8_t round_offset,
+ uint8x8_t *d0_u8, uint8x8_t *d1_u8,
+ uint8x8_t *d2_u8, uint8x8_t *d3_u8) {
+ uint16x8_t avg0, avg1, avg2, avg3;
-static INLINE void dist_wtd_convolve_2d_horiz_neon(
+ avg0 = vhaddq_u16(dd0, d0);
+ avg1 = vhaddq_u16(dd1, d1);
+ avg2 = vhaddq_u16(dd2, d2);
+ avg3 = vhaddq_u16(dd3, d3);
+
+ int16x8_t dst0 = vsubq_s16(vreinterpretq_s16_u16(avg0), round_offset);
+ int16x8_t dst1 = vsubq_s16(vreinterpretq_s16_u16(avg1), round_offset);
+ int16x8_t dst2 = vsubq_s16(vreinterpretq_s16_u16(avg2), round_offset);
+ int16x8_t dst3 = vsubq_s16(vreinterpretq_s16_u16(avg3), round_offset);
+
+ *d0_u8 = vqrshrun_n_s16(dst0, FILTER_BITS - ROUND0_BITS);
+ *d1_u8 = vqrshrun_n_s16(dst1, FILTER_BITS - ROUND0_BITS);
+ *d2_u8 = vqrshrun_n_s16(dst2, FILTER_BITS - ROUND0_BITS);
+ *d3_u8 = vqrshrun_n_s16(dst3, FILTER_BITS - ROUND0_BITS);
+}
+
+#if AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_MATMUL_INT8)
+
+static INLINE int16x4_t convolve8_4_2d_h(uint8x16_t samples,
+ const int8x8_t x_filter,
+ const uint8x16x2_t permute_tbl,
+ const int32x4_t horiz_const) {
+ uint8x16_t permuted_samples[2];
+ int32x4_t sum;
+
+ // Permute samples ready for dot product.
+ // { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 }
+ permuted_samples[0] = vqtbl1q_u8(samples, permute_tbl.val[0]);
+ // { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 }
+ permuted_samples[1] = vqtbl1q_u8(samples, permute_tbl.val[1]);
+
+ // First 4 output values.
+ sum = vusdotq_lane_s32(horiz_const, permuted_samples[0], x_filter, 0);
+ sum = vusdotq_lane_s32(sum, permuted_samples[1], x_filter, 1);
+
+ // We halved the convolution filter values so -1 from the right shift.
+ return vshrn_n_s32(sum, ROUND0_BITS - 1);
+}
+
+static INLINE int16x8_t convolve8_8_2d_h(uint8x16_t samples,
+ const int8x8_t x_filter,
+ const uint8x16x3_t permute_tbl,
+ const int32x4_t horiz_const) {
+ uint8x16_t permuted_samples[3];
+ int32x4_t sum[2];
+
+ // Permute samples ready for dot product.
+ // { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 }
+ permuted_samples[0] = vqtbl1q_u8(samples, permute_tbl.val[0]);
+ // { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 }
+ permuted_samples[1] = vqtbl1q_u8(samples, permute_tbl.val[1]);
+ // { 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 }
+ permuted_samples[2] = vqtbl1q_u8(samples, permute_tbl.val[2]);
+
+ // First 4 output values.
+ sum[0] = vusdotq_lane_s32(horiz_const, permuted_samples[0], x_filter, 0);
+ sum[0] = vusdotq_lane_s32(sum[0], permuted_samples[1], x_filter, 1);
+ // Second 4 output values.
+ sum[1] = vusdotq_lane_s32(horiz_const, permuted_samples[1], x_filter, 0);
+ sum[1] = vusdotq_lane_s32(sum[1], permuted_samples[2], x_filter, 1);
+
+ // Narrow and re-pack.
+ // We halved the convolution filter values so -1 from the right shift.
+ return vcombine_s16(vshrn_n_s32(sum[0], ROUND0_BITS - 1),
+ vshrn_n_s32(sum[1], ROUND0_BITS - 1));
+}
+
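
Each permuted register above holds four overlapping 4-sample windows, so the two vusdotq_lane_s32 calls together form a full 8-tap dot product per output pixel: filter lane 0 covers taps 0-3 against samples k..k+3, lane 1 covers taps 4-7 against samples k+4..k+7. A scalar sketch of a single output value (hypothetical helper; ROUND0_BITS as assumed earlier):

static inline int16_t convolve8_2d_h_scalar(const uint8_t *s /* &src[k] */,
                                            const int8_t *f /* halved taps */,
                                            int32_t horiz_const) {
  int32_t sum = horiz_const;  // rounding shim already folded in
  for (int t = 0; t < 8; ++t) sum += (int32_t)f[t] * s[t];
  // The taps were halved, hence ROUND0_BITS - 1 rather than ROUND0_BITS.
  return (int16_t)(sum >> (ROUND0_BITS - 1));
}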
+static INLINE void dist_wtd_convolve_2d_horiz_8tap_neon(
const uint8_t *src, int src_stride, int16_t *im_block, const int im_stride,
- const int16x8_t x_filter_s16, const int im_h, int w, const int round_0) {
+ const int16x8_t x_filter_s16, const int im_h, int w) {
const int bd = 8;
+ // A shim of 1 << ((ROUND0_BITS - 1) - 1) enables us to use non-rounding
+ // shifts - which are generally faster than rounding shifts on modern CPUs.
+ // (The extra -1 is needed because we halved the filter values.)
+ const int32x4_t horiz_const = vdupq_n_s32((1 << (bd + FILTER_BITS - 2)) +
+ (1 << ((ROUND0_BITS - 1) - 1)));
+ // Horizontal filter.
+ const int8x8_t x_filter = vmovn_s16(x_filter_s16);
+
+ const uint8_t *src_ptr = src;
int16_t *dst_ptr = im_block;
int dst_stride = im_stride;
- int width = w;
int height = im_h;
- const int8x8_t x_filter = vmovn_s16(x_filter_s16);
- const int32x4_t horiz_const = vdupq_n_s32(1 << (bd + FILTER_BITS - 2));
-
if (w == 4) {
const uint8x16x2_t permute_tbl = vld1q_u8_x2(dot_prod_permute_tbl);
- const int16x4_t shift_round_0 = vdup_n_s16(-(round_0));
uint8x16_t s0, s1, s2, s3;
- int32x4_t t0, t1, t2, t3;
int16x4_t d0, d1, d2, d3;
do {
- s0 = vld1q_u8(src + 0 * src_stride);
- s1 = vld1q_u8(src + 1 * src_stride);
- s2 = vld1q_u8(src + 2 * src_stride);
- s3 = vld1q_u8(src + 3 * src_stride);
+ load_u8_16x4(src_ptr, src_stride, &s0, &s1, &s2, &s3);
- t0 = convolve8_4_usdot(s0, x_filter, permute_tbl, horiz_const);
- t1 = convolve8_4_usdot(s1, x_filter, permute_tbl, horiz_const);
- t2 = convolve8_4_usdot(s2, x_filter, permute_tbl, horiz_const);
- t3 = convolve8_4_usdot(s3, x_filter, permute_tbl, horiz_const);
+ d0 = convolve8_4_2d_h(s0, x_filter, permute_tbl, horiz_const);
+ d1 = convolve8_4_2d_h(s1, x_filter, permute_tbl, horiz_const);
+ d2 = convolve8_4_2d_h(s2, x_filter, permute_tbl, horiz_const);
+ d3 = convolve8_4_2d_h(s3, x_filter, permute_tbl, horiz_const);
- d0 = vqrshl_s16(vmovn_s32(t0), shift_round_0);
- d1 = vqrshl_s16(vmovn_s32(t1), shift_round_0);
- d2 = vqrshl_s16(vmovn_s32(t2), shift_round_0);
- d3 = vqrshl_s16(vmovn_s32(t3), shift_round_0);
+ store_s16_4x4(dst_ptr, dst_stride, d0, d1, d2, d3);
- vst1_s16((dst_ptr + 0 * dst_stride), d0);
- vst1_s16((dst_ptr + 1 * dst_stride), d1);
- vst1_s16((dst_ptr + 2 * dst_stride), d2);
- vst1_s16((dst_ptr + 3 * dst_stride), d3);
-
- src += 4 * src_stride;
+ src_ptr += 4 * src_stride;
dst_ptr += 4 * dst_stride;
height -= 4;
} while (height > 0);
} else {
const uint8x16x3_t permute_tbl = vld1q_u8_x3(dot_prod_permute_tbl);
- const int16x8_t shift_round_0 = vdupq_n_s16(-(round_0));
- const uint8_t *s;
- int16_t *d;
uint8x16_t s0, s1, s2, s3;
int16x8_t d0, d1, d2, d3;
do {
- width = w;
- s = src;
- d = dst_ptr;
+ const uint8_t *s = src_ptr;
+ int16_t *d = dst_ptr;
+ int width = w;
do {
- s0 = vld1q_u8(s + 0 * src_stride);
- s1 = vld1q_u8(s + 1 * src_stride);
- s2 = vld1q_u8(s + 2 * src_stride);
- s3 = vld1q_u8(s + 3 * src_stride);
+ load_u8_16x4(s, src_stride, &s0, &s1, &s2, &s3);
- d0 = convolve8_8_usdot(s0, x_filter, permute_tbl, horiz_const,
- shift_round_0);
- d1 = convolve8_8_usdot(s1, x_filter, permute_tbl, horiz_const,
- shift_round_0);
- d2 = convolve8_8_usdot(s2, x_filter, permute_tbl, horiz_const,
- shift_round_0);
- d3 = convolve8_8_usdot(s3, x_filter, permute_tbl, horiz_const,
- shift_round_0);
+ d0 = convolve8_8_2d_h(s0, x_filter, permute_tbl, horiz_const);
+ d1 = convolve8_8_2d_h(s1, x_filter, permute_tbl, horiz_const);
+ d2 = convolve8_8_2d_h(s2, x_filter, permute_tbl, horiz_const);
+ d3 = convolve8_8_2d_h(s3, x_filter, permute_tbl, horiz_const);
- vst1q_s16(d + 0 * dst_stride, d0);
- vst1q_s16(d + 1 * dst_stride, d1);
- vst1q_s16(d + 2 * dst_stride, d2);
- vst1q_s16(d + 3 * dst_stride, d3);
+ store_s16_8x4(d, dst_stride, d0, d1, d2, d3);
s += 8;
d += 8;
width -= 8;
} while (width > 0);
-
- src += 4 * src_stride;
+ src_ptr += 4 * src_stride;
dst_ptr += 4 * dst_stride;
height -= 4;
} while (height > 0);
}
}
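
The shim comments above rest on a simple identity: a rounding right shift by n equals a plain shift applied after pre-adding 1 << (n - 1). Folding that constant into horiz_const is what lets the non-rounding vshrn_n_s32 replace a rounding shift. A tiny self-check (sketch; n = ROUND0_BITS - 1, assuming ROUND0_BITS == 3):

#include <assert.h>

static inline void shim_demo(void) {
  const int n = 2;   // ROUND0_BITS - 1
  const int x = 14;  // 14 / 4 = 3.5
  assert((x >> n) == 3);                     // plain shift truncates
  assert(((x + (1 << (n - 1))) >> n) == 4);  // pre-added shim rounds
}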
-#elif defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
+#elif AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD)
-static INLINE void dist_wtd_convolve_2d_horiz_neon(
+static INLINE int16x4_t convolve8_4_2d_h(uint8x16_t samples,
+ const int8x8_t x_filter,
+ const int32x4_t correction,
+ const uint8x16_t range_limit,
+ const uint8x16x2_t permute_tbl) {
+ int8x16_t clamped_samples, permuted_samples[2];
+ int32x4_t sum;
+
+ // Clamp sample range to [-128, 127] for 8-bit signed dot product.
+ clamped_samples = vreinterpretq_s8_u8(vsubq_u8(samples, range_limit));
+
+ // Permute samples ready for dot product.
+ // { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 }
+ permuted_samples[0] = vqtbl1q_s8(clamped_samples, permute_tbl.val[0]);
+ // { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 }
+ permuted_samples[1] = vqtbl1q_s8(clamped_samples, permute_tbl.val[1]);
+
+ // Accumulate dot product into 'correction' to account for range clamp.
+ sum = vdotq_lane_s32(correction, permuted_samples[0], x_filter, 0);
+ sum = vdotq_lane_s32(sum, permuted_samples[1], x_filter, 1);
+
+ // We halved the convolution filter values so -1 from the right shift.
+ return vshrn_n_s32(sum, ROUND0_BITS - 1);
+}
+
+static INLINE int16x8_t convolve8_8_2d_h(uint8x16_t samples,
+ const int8x8_t x_filter,
+ const int32x4_t correction,
+ const uint8x16_t range_limit,
+ const uint8x16x3_t permute_tbl) {
+ int8x16_t clamped_samples, permuted_samples[3];
+ int32x4_t sum[2];
+
+ // Clamp sample range to [-128, 127] for 8-bit signed dot product.
+ clamped_samples = vreinterpretq_s8_u8(vsubq_u8(samples, range_limit));
+
+  // Permute samples ready for dot product.
+ // { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 }
+ permuted_samples[0] = vqtbl1q_s8(clamped_samples, permute_tbl.val[0]);
+ // { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 }
+ permuted_samples[1] = vqtbl1q_s8(clamped_samples, permute_tbl.val[1]);
+ // { 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 }
+ permuted_samples[2] = vqtbl1q_s8(clamped_samples, permute_tbl.val[2]);
+
+ // Accumulate dot product into 'correction' to account for range clamp.
+ // First 4 output values.
+ sum[0] = vdotq_lane_s32(correction, permuted_samples[0], x_filter, 0);
+ sum[0] = vdotq_lane_s32(sum[0], permuted_samples[1], x_filter, 1);
+ // Second 4 output values.
+ sum[1] = vdotq_lane_s32(correction, permuted_samples[1], x_filter, 0);
+ sum[1] = vdotq_lane_s32(sum[1], permuted_samples[2], x_filter, 1);
+
+ // Narrow and re-pack.
+ // We halved the convolution filter values so -1 from the right shift.
+ return vcombine_s16(vshrn_n_s32(sum[0], ROUND0_BITS - 1),
+ vshrn_n_s32(sum[1], ROUND0_BITS - 1));
+}
+
+static INLINE void dist_wtd_convolve_2d_horiz_8tap_neon(
const uint8_t *src, int src_stride, int16_t *im_block, const int im_stride,
- const int16x8_t x_filter_s16, const int im_h, int w, const int round_0) {
+ const int16x8_t x_filter_s16, const int im_h, int w) {
const int bd = 8;
- int16_t *dst_ptr = im_block;
- int dst_stride = im_stride;
- int width = w;
- int height = im_h;
-
- const int8x8_t x_filter = vmovn_s16(x_filter_s16);
const int32_t horiz_const = (1 << (bd + FILTER_BITS - 2));
- // Dot product constants.
- const int16x8_t correct_tmp = vshlq_n_s16(x_filter_s16, 7);
- const int32x4_t correction =
- vdupq_n_s32(vaddlvq_s16(correct_tmp) + horiz_const);
+ // Dot product constants and other shims.
+ const int32_t correction_s32 = vaddlvq_s16(vshlq_n_s16(x_filter_s16, 7));
+ // Fold horiz_const into the dot-product filter correction constant. The
+ // additional shim of 1 << ((ROUND0_BITS - 1) - 1) enables us to use non-
+ // rounding shifts - which are generally faster than rounding shifts on
+ // modern CPUs. (The extra -1 is needed because we halved the filter values.)
+ const int32x4_t correction = vdupq_n_s32(correction_s32 + horiz_const +
+ (1 << ((ROUND0_BITS - 1) - 1)));
const uint8x16_t range_limit = vdupq_n_u8(128);
+ // Horizontal filter.
+ const int8x8_t x_filter = vmovn_s16(x_filter_s16);
+
+ const uint8_t *src_ptr = src;
+ int16_t *dst_ptr = im_block;
+ int dst_stride = im_stride;
+ int height = im_h;
if (w == 4) {
const uint8x16x2_t permute_tbl = vld1q_u8_x2(dot_prod_permute_tbl);
- const int16x4_t shift_round_0 = vdup_n_s16(-(round_0));
uint8x16_t s0, s1, s2, s3;
- int32x4_t t0, t1, t2, t3;
int16x4_t d0, d1, d2, d3;
do {
- s0 = vld1q_u8(src + 0 * src_stride);
- s1 = vld1q_u8(src + 1 * src_stride);
- s2 = vld1q_u8(src + 2 * src_stride);
- s3 = vld1q_u8(src + 3 * src_stride);
+ load_u8_16x4(src_ptr, src_stride, &s0, &s1, &s2, &s3);
- t0 = convolve8_4_sdot(s0, x_filter, correction, range_limit, permute_tbl);
- t1 = convolve8_4_sdot(s1, x_filter, correction, range_limit, permute_tbl);
- t2 = convolve8_4_sdot(s2, x_filter, correction, range_limit, permute_tbl);
- t3 = convolve8_4_sdot(s3, x_filter, correction, range_limit, permute_tbl);
+ d0 = convolve8_4_2d_h(s0, x_filter, correction, range_limit, permute_tbl);
+ d1 = convolve8_4_2d_h(s1, x_filter, correction, range_limit, permute_tbl);
+ d2 = convolve8_4_2d_h(s2, x_filter, correction, range_limit, permute_tbl);
+ d3 = convolve8_4_2d_h(s3, x_filter, correction, range_limit, permute_tbl);
- d0 = vqrshl_s16(vmovn_s32(t0), shift_round_0);
- d1 = vqrshl_s16(vmovn_s32(t1), shift_round_0);
- d2 = vqrshl_s16(vmovn_s32(t2), shift_round_0);
- d3 = vqrshl_s16(vmovn_s32(t3), shift_round_0);
+ store_s16_4x4(dst_ptr, dst_stride, d0, d1, d2, d3);
- vst1_s16((dst_ptr + 0 * dst_stride), d0);
- vst1_s16((dst_ptr + 1 * dst_stride), d1);
- vst1_s16((dst_ptr + 2 * dst_stride), d2);
- vst1_s16((dst_ptr + 3 * dst_stride), d3);
-
- src += 4 * src_stride;
+ src_ptr += 4 * src_stride;
dst_ptr += 4 * dst_stride;
height -= 4;
} while (height > 0);
} else {
const uint8x16x3_t permute_tbl = vld1q_u8_x3(dot_prod_permute_tbl);
- const int16x8_t shift_round_0 = vdupq_n_s16(-(round_0));
- const uint8_t *s;
- int16_t *d;
uint8x16_t s0, s1, s2, s3;
int16x8_t d0, d1, d2, d3;
do {
- width = w;
- s = src;
- d = dst_ptr;
+ const uint8_t *s = src_ptr;
+ int16_t *d = dst_ptr;
+ int width = w;
do {
- s0 = vld1q_u8(s + 0 * src_stride);
- s1 = vld1q_u8(s + 1 * src_stride);
- s2 = vld1q_u8(s + 2 * src_stride);
- s3 = vld1q_u8(s + 3 * src_stride);
+ load_u8_16x4(s, src_stride, &s0, &s1, &s2, &s3);
- d0 = convolve8_8_sdot(s0, x_filter, correction, range_limit,
- permute_tbl, shift_round_0);
- d1 = convolve8_8_sdot(s1, x_filter, correction, range_limit,
- permute_tbl, shift_round_0);
- d2 = convolve8_8_sdot(s2, x_filter, correction, range_limit,
- permute_tbl, shift_round_0);
- d3 = convolve8_8_sdot(s3, x_filter, correction, range_limit,
- permute_tbl, shift_round_0);
+ d0 = convolve8_8_2d_h(s0, x_filter, correction, range_limit,
+ permute_tbl);
+ d1 = convolve8_8_2d_h(s1, x_filter, correction, range_limit,
+ permute_tbl);
+ d2 = convolve8_8_2d_h(s2, x_filter, correction, range_limit,
+ permute_tbl);
+ d3 = convolve8_8_2d_h(s3, x_filter, correction, range_limit,
+ permute_tbl);
- vst1q_s16(d + 0 * dst_stride, d0);
- vst1q_s16(d + 1 * dst_stride, d1);
- vst1q_s16(d + 2 * dst_stride, d2);
- vst1q_s16(d + 3 * dst_stride, d3);
+ store_s16_8x4(d, dst_stride, d0, d1, d2, d3);
s += 8;
d += 8;
width -= 8;
} while (width > 0);
-
- src += 4 * src_stride;
+ src_ptr += 4 * src_stride;
dst_ptr += 4 * dst_stride;
height -= 4;
} while (height > 0);
}
}
-#else // !(defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD))
+#else // !(AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD))
-static INLINE void dist_wtd_convolve_2d_horiz_neon(
+static INLINE int16x4_t convolve8_4_2d_h(const int16x4_t s0, const int16x4_t s1,
+ const int16x4_t s2, const int16x4_t s3,
+ const int16x4_t s4, const int16x4_t s5,
+ const int16x4_t s6, const int16x4_t s7,
+ const int16x8_t x_filter,
+ const int16x4_t horiz_const) {
+ const int16x4_t x_filter_0_3 = vget_low_s16(x_filter);
+ const int16x4_t x_filter_4_7 = vget_high_s16(x_filter);
+
+ int16x4_t sum = horiz_const;
+ sum = vmla_lane_s16(sum, s0, x_filter_0_3, 0);
+ sum = vmla_lane_s16(sum, s1, x_filter_0_3, 1);
+ sum = vmla_lane_s16(sum, s2, x_filter_0_3, 2);
+ sum = vmla_lane_s16(sum, s3, x_filter_0_3, 3);
+ sum = vmla_lane_s16(sum, s4, x_filter_4_7, 0);
+ sum = vmla_lane_s16(sum, s5, x_filter_4_7, 1);
+ sum = vmla_lane_s16(sum, s6, x_filter_4_7, 2);
+ sum = vmla_lane_s16(sum, s7, x_filter_4_7, 3);
+
+ // We halved the convolution filter values, so shift by one bit less.
+ return vshr_n_s16(sum, ROUND0_BITS - 1);
+}
+
+static INLINE int16x8_t convolve8_8_2d_h(const int16x8_t s0, const int16x8_t s1,
+ const int16x8_t s2, const int16x8_t s3,
+ const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t s6, const int16x8_t s7,
+ const int16x8_t x_filter,
+ const int16x8_t horiz_const) {
+ const int16x4_t x_filter_0_3 = vget_low_s16(x_filter);
+ const int16x4_t x_filter_4_7 = vget_high_s16(x_filter);
+
+ int16x8_t sum = horiz_const;
+ sum = vmlaq_lane_s16(sum, s0, x_filter_0_3, 0);
+ sum = vmlaq_lane_s16(sum, s1, x_filter_0_3, 1);
+ sum = vmlaq_lane_s16(sum, s2, x_filter_0_3, 2);
+ sum = vmlaq_lane_s16(sum, s3, x_filter_0_3, 3);
+ sum = vmlaq_lane_s16(sum, s4, x_filter_4_7, 0);
+ sum = vmlaq_lane_s16(sum, s5, x_filter_4_7, 1);
+ sum = vmlaq_lane_s16(sum, s6, x_filter_4_7, 2);
+ sum = vmlaq_lane_s16(sum, s7, x_filter_4_7, 3);
+
+ // We halved the convolution filter values, so shift by one bit less.
+ return vshrq_n_s16(sum, ROUND0_BITS - 1);
+}
+
+static INLINE void dist_wtd_convolve_2d_horiz_8tap_neon(
const uint8_t *src, int src_stride, int16_t *im_block, const int im_stride,
- const int16x8_t x_filter, const int im_h, int w, const int round_0) {
+ const int16x8_t x_filter, const int im_h, int w) {
const int bd = 8;
- const uint8_t *s;
- int16_t *dst_ptr;
- int dst_stride;
- int width, height;
- dst_ptr = im_block;
- dst_stride = im_stride;
- height = im_h;
- width = w;
+ const uint8_t *src_ptr = src;
+ int16_t *dst_ptr = im_block;
+ int dst_stride = im_stride;
+ int height = im_h;
if (w == 4) {
int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, d0;
- int16x8_t tt0;
uint8x8_t t0;
-
- const int16x4_t horiz_const = vdup_n_s16((1 << (bd + FILTER_BITS - 2)));
- const int16x4_t shift_round_0 = vdup_n_s16(-(round_0));
-
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
int16x4_t s8, s9, s10, d1, d2, d3;
- int16x8_t tt1, tt2, tt3;
uint8x8_t t1, t2, t3;
-#endif
- do {
- s = src;
- __builtin_prefetch(s + 0 * src_stride);
-#if defined(__aarch64__)
- __builtin_prefetch(s + 1 * src_stride);
- __builtin_prefetch(s + 2 * src_stride);
- __builtin_prefetch(s + 3 * src_stride);
+#endif // AOM_ARCH_AARCH64
- load_u8_8x4(s, src_stride, &t0, &t1, &t2, &t3);
+ // A shim of 1 << ((ROUND0_BITS - 1) - 1) enables us to use non-rounding
+ // shifts - which are generally faster than rounding shifts on modern CPUs.
+ // (The extra -1 is needed because we halved the filter values.)
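+ // (1 << (bd + FILTER_BITS - 2)) is half the usual intermediate offset, again
+ // because the filter values were halved; it biases the intermediate sums so
+ // they stay non-negative for the vertical pass.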
+ const int16x4_t horiz_const = vdup_n_s16((1 << (bd + FILTER_BITS - 2)) +
+ (1 << ((ROUND0_BITS - 1) - 1)));
+ do {
+ __builtin_prefetch(src_ptr + 0 * src_stride);
+#if AOM_ARCH_AARCH64
+ __builtin_prefetch(src_ptr + 1 * src_stride);
+ __builtin_prefetch(src_ptr + 2 * src_stride);
+ __builtin_prefetch(src_ptr + 3 * src_stride);
+
+ load_u8_8x4(src_ptr, src_stride, &t0, &t1, &t2, &t3);
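+ // Transpose the rows so that each vector holds one sample position across
+ // all four rows, letting a single multiply-accumulate chain compute four
+ // outputs at once.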
transpose_u8_8x4(&t0, &t1, &t2, &t3);
- tt0 = vreinterpretq_s16_u16(vmovl_u8(t0));
- tt1 = vreinterpretq_s16_u16(vmovl_u8(t1));
- tt2 = vreinterpretq_s16_u16(vmovl_u8(t2));
- tt3 = vreinterpretq_s16_u16(vmovl_u8(t3));
- s0 = vget_low_s16(tt0);
- s1 = vget_low_s16(tt1);
- s2 = vget_low_s16(tt2);
- s3 = vget_low_s16(tt3);
- s4 = vget_high_s16(tt0);
- s5 = vget_high_s16(tt1);
- s6 = vget_high_s16(tt2);
+
+ s0 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s1 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s2 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t2)));
+ s3 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t3)));
+ s4 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s5 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s6 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t2)));
+
__builtin_prefetch(dst_ptr + 0 * dst_stride);
__builtin_prefetch(dst_ptr + 1 * dst_stride);
__builtin_prefetch(dst_ptr + 2 * dst_stride);
__builtin_prefetch(dst_ptr + 3 * dst_stride);
- s += 7;
- load_u8_8x4(s, src_stride, &t0, &t1, &t2, &t3);
+ load_u8_8x4(src_ptr + 7, src_stride, &t0, &t1, &t2, &t3);
transpose_u8_8x4(&t0, &t1, &t2, &t3);
- tt0 = vreinterpretq_s16_u16(vmovl_u8(t0));
- tt1 = vreinterpretq_s16_u16(vmovl_u8(t1));
- tt2 = vreinterpretq_s16_u16(vmovl_u8(t2));
- tt3 = vreinterpretq_s16_u16(vmovl_u8(t3));
- s7 = vget_low_s16(tt0);
- s8 = vget_low_s16(tt1);
- s9 = vget_low_s16(tt2);
- s10 = vget_low_s16(tt3);
- d0 = convolve8_4x4_s16(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
- horiz_const, shift_round_0);
- d1 = convolve8_4x4_s16(s1, s2, s3, s4, s5, s6, s7, s8, x_filter,
- horiz_const, shift_round_0);
- d2 = convolve8_4x4_s16(s2, s3, s4, s5, s6, s7, s8, s9, x_filter,
- horiz_const, shift_round_0);
- d3 = convolve8_4x4_s16(s3, s4, s5, s6, s7, s8, s9, s10, x_filter,
- horiz_const, shift_round_0);
+ s7 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s8 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s9 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t2)));
+ s10 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t3)));
+
+ d0 = convolve8_4_2d_h(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
+ horiz_const);
+ d1 = convolve8_4_2d_h(s1, s2, s3, s4, s5, s6, s7, s8, x_filter,
+ horiz_const);
+ d2 = convolve8_4_2d_h(s2, s3, s4, s5, s6, s7, s8, s9, x_filter,
+ horiz_const);
+ d3 = convolve8_4_2d_h(s3, s4, s5, s6, s7, s8, s9, s10, x_filter,
+ horiz_const);
transpose_s16_4x4d(&d0, &d1, &d2, &d3);
+ store_s16_4x4(dst_ptr, dst_stride, d0, d1, d2, d3);
- vst1_s16((dst_ptr + 0 * dst_stride), d0);
- vst1_s16((dst_ptr + 1 * dst_stride), d1);
- vst1_s16((dst_ptr + 2 * dst_stride), d2);
- vst1_s16((dst_ptr + 3 * dst_stride), d3);
-
- src += 4 * src_stride;
+ src_ptr += 4 * src_stride;
dst_ptr += 4 * dst_stride;
height -= 4;
-#else
- t0 = vld1_u8(s); // a0 a1 a2 a3 a4 a5 a6 a7
- tt0 = vreinterpretq_s16_u16(vmovl_u8(t0)); // a0 a1 a2 a3 a4 a5 a6 a7
- s0 = vget_low_s16(tt0); // a0 a1 a2 a3
- s4 = vget_high_s16(tt0); // a4 a5 a6 a7
- __builtin_prefetch(dst_ptr);
- s += 8;
- t0 = vld1_u8(s); // a8 a9 a10 a11
+#else // !AOM_ARCH_AARCH64
+ t0 = vld1_u8(src_ptr); // a0 a1 a2 a3 a4 a5 a6 a7
+ s0 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0))); // a0 a1 a2 a3
+ s4 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t0))); // a4 a5 a6 a7
+ __builtin_prefetch(dst_ptr);
+
+ t0 = vld1_u8(src_ptr + 8); // a8 a9 a10 a11
s7 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
s1 = vext_s16(s0, s4, 1); // a1 a2 a3 a4
@@ -573,39 +591,47 @@
s6 = vext_s16(s4, s7, 2); // a6 a7 a8 a9
s7 = vext_s16(s4, s7, 3); // a7 a8 a9 a10
- d0 = convolve8_4x4_s16(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
- horiz_const, shift_round_0);
-
+ d0 = convolve8_4_2d_h(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
+ horiz_const);
vst1_s16(dst_ptr, d0);
- src += src_stride;
+ src_ptr += src_stride;
dst_ptr += dst_stride;
- height -= 1;
-#endif
+ height--;
+#endif // AOM_ARCH_AARCH64
} while (height > 0);
} else {
- int16_t *d_tmp;
- int16x8_t s0, s1, s2, s3, s4, s5, s6, s7;
- int16x8_t res0;
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7, s8, d0;
uint8x8_t t0;
+#if AOM_ARCH_AARCH64
+ int16x8_t s9, s10, s11, s12, s13, s14;
+ int16x8_t d1, d2, d3, d4, d5, d6, d7;
+ uint8x8_t t1, t2, t3, t4, t5, t6, t7;
+#endif // AOM_ARCH_AARCH64
- const int16x8_t horiz_const = vdupq_n_s16((1 << (bd + FILTER_BITS - 2)));
- const int16x8_t shift_round_0 = vdupq_n_s16(-(round_0));
+ // A shim of 1 << ((ROUND0_BITS - 1) - 1) enables us to use non-rounding
+ // shifts - which are generally faster than rounding shifts on modern CPUs.
+ // (The extra -1 is needed because we halved the filter values.)
+ const int16x8_t horiz_const = vdupq_n_s16((1 << (bd + FILTER_BITS - 2)) +
+ (1 << ((ROUND0_BITS - 1) - 1)));
do {
-#if defined(__aarch64__)
- uint8x8_t t1, t2, t3, t4, t5, t6, t7;
- int16x8_t s8, s9, s10, s11, s12, s13, s14;
- int16x8_t res1, res2, res3, res4, res5, res6, res7;
- __builtin_prefetch(src + 0 * src_stride);
- __builtin_prefetch(src + 1 * src_stride);
- __builtin_prefetch(src + 2 * src_stride);
- __builtin_prefetch(src + 3 * src_stride);
- __builtin_prefetch(src + 4 * src_stride);
- __builtin_prefetch(src + 5 * src_stride);
- __builtin_prefetch(src + 6 * src_stride);
- __builtin_prefetch(src + 7 * src_stride);
- load_u8_8x8(src, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+ const uint8_t *s;
+ int16_t *d = dst_ptr;
+ int width = w;
+
+#if AOM_ARCH_AARCH64
+ __builtin_prefetch(src_ptr + 0 * src_stride);
+ __builtin_prefetch(src_ptr + 1 * src_stride);
+ __builtin_prefetch(src_ptr + 2 * src_stride);
+ __builtin_prefetch(src_ptr + 3 * src_stride);
+ __builtin_prefetch(src_ptr + 4 * src_stride);
+ __builtin_prefetch(src_ptr + 5 * src_stride);
+ __builtin_prefetch(src_ptr + 6 * src_stride);
+ __builtin_prefetch(src_ptr + 7 * src_stride);
+
+ load_u8_8x8(src_ptr, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
transpose_u8_8x8(&t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+
s0 = vreinterpretq_s16_u16(vmovl_u8(t0));
s1 = vreinterpretq_s16_u16(vmovl_u8(t1));
s2 = vreinterpretq_s16_u16(vmovl_u8(t2));
@@ -614,9 +640,8 @@
s5 = vreinterpretq_s16_u16(vmovl_u8(t5));
s6 = vreinterpretq_s16_u16(vmovl_u8(t6));
- width = w;
- s = src + 7;
- d_tmp = dst_ptr;
+ s = src_ptr + 7;
+
__builtin_prefetch(dst_ptr + 0 * dst_stride);
__builtin_prefetch(dst_ptr + 1 * dst_stride);
__builtin_prefetch(dst_ptr + 2 * dst_stride);
@@ -629,6 +654,7 @@
do {
load_u8_8x8(s, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
transpose_u8_8x8(&t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+
s7 = vreinterpretq_s16_u16(vmovl_u8(t0));
s8 = vreinterpretq_s16_u16(vmovl_u8(t1));
s9 = vreinterpretq_s16_u16(vmovl_u8(t2));
@@ -638,28 +664,26 @@
s13 = vreinterpretq_s16_u16(vmovl_u8(t6));
s14 = vreinterpretq_s16_u16(vmovl_u8(t7));
- res0 = convolve8_8x8_s16(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
- horiz_const, shift_round_0);
- res1 = convolve8_8x8_s16(s1, s2, s3, s4, s5, s6, s7, s8, x_filter,
- horiz_const, shift_round_0);
- res2 = convolve8_8x8_s16(s2, s3, s4, s5, s6, s7, s8, s9, x_filter,
- horiz_const, shift_round_0);
- res3 = convolve8_8x8_s16(s3, s4, s5, s6, s7, s8, s9, s10, x_filter,
- horiz_const, shift_round_0);
- res4 = convolve8_8x8_s16(s4, s5, s6, s7, s8, s9, s10, s11, x_filter,
- horiz_const, shift_round_0);
- res5 = convolve8_8x8_s16(s5, s6, s7, s8, s9, s10, s11, s12, x_filter,
- horiz_const, shift_round_0);
- res6 = convolve8_8x8_s16(s6, s7, s8, s9, s10, s11, s12, s13, x_filter,
- horiz_const, shift_round_0);
- res7 = convolve8_8x8_s16(s7, s8, s9, s10, s11, s12, s13, s14, x_filter,
- horiz_const, shift_round_0);
+ d0 = convolve8_8_2d_h(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
+ horiz_const);
+ d1 = convolve8_8_2d_h(s1, s2, s3, s4, s5, s6, s7, s8, x_filter,
+ horiz_const);
+ d2 = convolve8_8_2d_h(s2, s3, s4, s5, s6, s7, s8, s9, x_filter,
+ horiz_const);
+ d3 = convolve8_8_2d_h(s3, s4, s5, s6, s7, s8, s9, s10, x_filter,
+ horiz_const);
+ d4 = convolve8_8_2d_h(s4, s5, s6, s7, s8, s9, s10, s11, x_filter,
+ horiz_const);
+ d5 = convolve8_8_2d_h(s5, s6, s7, s8, s9, s10, s11, s12, x_filter,
+ horiz_const);
+ d6 = convolve8_8_2d_h(s6, s7, s8, s9, s10, s11, s12, s13, x_filter,
+ horiz_const);
+ d7 = convolve8_8_2d_h(s7, s8, s9, s10, s11, s12, s13, s14, x_filter,
+ horiz_const);
- transpose_s16_8x8(&res0, &res1, &res2, &res3, &res4, &res5, &res6,
- &res7);
+ transpose_s16_8x8(&d0, &d1, &d2, &d3, &d4, &d5, &d6, &d7);
+ store_s16_8x8(d, dst_stride, d0, d1, d2, d3, d4, d5, d6, d7);
- store_s16_8x8(d_tmp, dst_stride, res0, res1, res2, res3, res4, res5,
- res6, res7);
s0 = s8;
s1 = s9;
s2 = s10;
@@ -668,337 +692,624 @@
s5 = s13;
s6 = s14;
s += 8;
- d_tmp += 8;
+ d += 8;
width -= 8;
} while (width > 0);
- src += 8 * src_stride;
+ src_ptr += 8 * src_stride;
dst_ptr += 8 * dst_stride;
height -= 8;
-#else
- int16x8_t temp_0;
- t0 = vld1_u8(src);
+#else // !AOM_ARCH_AARCH64
+ t0 = vld1_u8(src_ptr);
s0 = vreinterpretq_s16_u16(vmovl_u8(t0)); // a0 a1 a2 a3 a4 a5 a6 a7
- width = w;
- s = src + 8;
- d_tmp = dst_ptr;
+ s = src_ptr + 8;
__builtin_prefetch(dst_ptr);
do {
t0 = vld1_u8(s); // a8 a9 a10 a11 a12 a13 a14 a15
- s7 = vreinterpretq_s16_u16(vmovl_u8(t0));
- temp_0 = s0;
- s0 = s7;
+ s8 = vreinterpretq_s16_u16(vmovl_u8(t0));
- s1 = vextq_s16(temp_0, s7, 1); // a1 a2 a3 a4 a5 a6 a7 a8
- s2 = vextq_s16(temp_0, s7, 2); // a2 a3 a4 a5 a6 a7 a8 a9
- s3 = vextq_s16(temp_0, s7, 3); // a3 a4 a5 a6 a7 a8 a9 a10
- s4 = vextq_s16(temp_0, s7, 4); // a4 a5 a6 a7 a8 a9 a10 a11
- s5 = vextq_s16(temp_0, s7, 5); // a5 a6 a7 a8 a9 a10 a11 a12
- s6 = vextq_s16(temp_0, s7, 6); // a6 a7 a8 a9 a10 a11 a12 a13
- s7 = vextq_s16(temp_0, s7, 7); // a7 a8 a9 a10 a11 a12 a13 a14
+ s1 = vextq_s16(s0, s8, 1); // a1 a2 a3 a4 a5 a6 a7 a8
+ s2 = vextq_s16(s0, s8, 2); // a2 a3 a4 a5 a6 a7 a8 a9
+ s3 = vextq_s16(s0, s8, 3); // a3 a4 a5 a6 a7 a8 a9 a10
+ s4 = vextq_s16(s0, s8, 4); // a4 a5 a6 a7 a8 a9 a10 a11
+ s5 = vextq_s16(s0, s8, 5); // a5 a6 a7 a8 a9 a10 a11 a12
+ s6 = vextq_s16(s0, s8, 6); // a6 a7 a8 a9 a10 a11 a12 a13
+ s7 = vextq_s16(s0, s8, 7); // a7 a8 a9 a10 a11 a12 a13 a14
- res0 = convolve8_8x8_s16(temp_0, s1, s2, s3, s4, s5, s6, s7, x_filter,
- horiz_const, shift_round_0);
- vst1q_s16(d_tmp, res0);
+ d0 = convolve8_8_2d_h(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
+ horiz_const);
+ vst1q_s16(d, d0);
+ s0 = s8;
s += 8;
- d_tmp += 8;
+ d += 8;
width -= 8;
} while (width > 0);
- src += src_stride;
+ src_ptr += src_stride;
dst_ptr += dst_stride;
- height -= 1;
-#endif
+ height--;
+#endif // AOM_ARCH_AARCH64
} while (height > 0);
}
}
-#endif // defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
+#endif // AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD)
-static INLINE void dist_wtd_convolve_2d_vert_6tap_neon(
+static INLINE uint16x4_t
+convolve6_4_2d_v(const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
+ const int16x4_t s3, const int16x4_t s4, const int16x4_t s5,
+ const int16x8_t y_filter, const int32x4_t offset_const) {
+ const int16x4_t y_filter_0_3 = vget_low_s16(y_filter);
+ const int16x4_t y_filter_4_7 = vget_high_s16(y_filter);
+
+ int32x4_t sum = offset_const;
+ // Filter values at indices 0 and 7 are 0.
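+ // (6-tap filters are stored with a zero in the first and last position of
+ // the 8-tap array, so those two multiplies can be skipped.)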
+ sum = vmlal_lane_s16(sum, s0, y_filter_0_3, 1);
+ sum = vmlal_lane_s16(sum, s1, y_filter_0_3, 2);
+ sum = vmlal_lane_s16(sum, s2, y_filter_0_3, 3);
+ sum = vmlal_lane_s16(sum, s3, y_filter_4_7, 0);
+ sum = vmlal_lane_s16(sum, s4, y_filter_4_7, 1);
+ sum = vmlal_lane_s16(sum, s5, y_filter_4_7, 2);
+
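+ // Saturating-rounding-narrow into the unsigned 16-bit compound domain.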
+ return vqrshrun_n_s32(sum, COMPOUND_ROUND1_BITS);
+}
+
+static INLINE uint16x8_t
+convolve6_8_2d_v(const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
+ const int16x8_t s3, const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t y_filter, const int32x4_t offset_const) {
+ const int16x4_t y_filter_0_3 = vget_low_s16(y_filter);
+ const int16x4_t y_filter_4_7 = vget_high_s16(y_filter);
+
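+ // vmlal widens to 32 bits, so process the low and high halves of the 8-wide
+ // input separately.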
+ int32x4_t sum0 = offset_const;
+ // Filter values at indices 0 and 7 are 0.
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s0), y_filter_0_3, 1);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s1), y_filter_0_3, 2);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s2), y_filter_0_3, 3);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s3), y_filter_4_7, 0);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s4), y_filter_4_7, 1);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s5), y_filter_4_7, 2);
+
+ int32x4_t sum1 = offset_const;
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s0), y_filter_0_3, 1);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s1), y_filter_0_3, 2);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s2), y_filter_0_3, 3);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s3), y_filter_4_7, 0);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s4), y_filter_4_7, 1);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s5), y_filter_4_7, 2);
+
+ return vcombine_u16(vqrshrun_n_s32(sum0, COMPOUND_ROUND1_BITS),
+ vqrshrun_n_s32(sum1, COMPOUND_ROUND1_BITS));
+}
+
+static INLINE void dist_wtd_convolve_2d_vert_6tap_dist_wtd_avg_neon(
int16_t *src_ptr, const int src_stride, uint8_t *dst8_ptr, int dst8_stride,
ConvolveParams *conv_params, const int16x8_t y_filter, int h, int w) {
- CONV_BUF_TYPE *dst_ptr = conv_params->dst;
- const int dst_stride = conv_params->dst_stride;
-
const int bd = 8;
- const int offset_bits = bd + 2 * FILTER_BITS - conv_params->round_0;
- const int16_t sub_const = (1 << (offset_bits - conv_params->round_1)) +
- (1 << (offset_bits - conv_params->round_1 - 1));
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const int32x4_t offset_const = vdupq_n_s32(1 << offset_bits);
+ const int16_t round_offset = (1 << (offset_bits - COMPOUND_ROUND1_BITS)) +
+ (1 << (offset_bits - COMPOUND_ROUND1_BITS - 1));
+ const int16x8_t round_offset_vec = vdupq_n_s16(round_offset);
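+ // In the usual 8-bit configuration (ROUND0_BITS == 3, COMPOUND_ROUND1_BITS
+ // == 7), offset_bits is 19 and round_offset is (1 << 12) + (1 << 11) = 6144.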
- const int16_t round_bits =
- 2 * FILTER_BITS - conv_params->round_0 - conv_params->round_1;
- const int offset = bd + 2 * FILTER_BITS - conv_params->round_0;
- const int32x4_t round_shift_vec = vdupq_n_s32(-(conv_params->round_1));
- const int32x4_t offset_const = vdupq_n_s32(1 << offset);
- const int16x4_t sub_const_vec = vdup_n_s16(sub_const);
const uint16_t fwd_offset = conv_params->fwd_offset;
const uint16_t bck_offset = conv_params->bck_offset;
- const int do_average = conv_params->do_average;
- const int use_dist_wtd_comp_avg = conv_params->use_dist_wtd_comp_avg;
+
+ CONV_BUF_TYPE *dst_ptr = conv_params->dst;
+ const int dst_stride = conv_params->dst_stride;
if (w == 4) {
int16x4_t s0, s1, s2, s3, s4, s5;
uint16x4_t dd0, d0;
uint8x8_t d01_u8;
-
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
int16x4_t s6, s7, s8;
uint16x4_t dd1, dd2, dd3, d1, d2, d3;
uint8x8_t d23_u8;
-#endif
+#endif // AOM_ARCH_AARCH64
- s0 = vld1_s16(src_ptr + 0 * src_stride);
- s1 = vld1_s16(src_ptr + 1 * src_stride);
- s2 = vld1_s16(src_ptr + 2 * src_stride);
- s3 = vld1_s16(src_ptr + 3 * src_stride);
- s4 = vld1_s16(src_ptr + 4 * src_stride);
+ load_s16_4x5(src_ptr, src_stride, &s0, &s1, &s2, &s3, &s4);
src_ptr += 5 * src_stride;
do {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
load_s16_4x4(src_ptr, src_stride, &s5, &s6, &s7, &s8);
- d0 = convolve6_4_s32(s0, s1, s2, s3, s4, s5, y_filter, round_shift_vec,
- offset_const);
- d1 = convolve6_4_s32(s1, s2, s3, s4, s5, s6, y_filter, round_shift_vec,
- offset_const);
- d2 = convolve6_4_s32(s2, s3, s4, s5, s6, s7, y_filter, round_shift_vec,
- offset_const);
- d3 = convolve6_4_s32(s3, s4, s5, s6, s7, s8, y_filter, round_shift_vec,
- offset_const);
+ d0 = convolve6_4_2d_v(s0, s1, s2, s3, s4, s5, y_filter, offset_const);
+ d1 = convolve6_4_2d_v(s1, s2, s3, s4, s5, s6, y_filter, offset_const);
+ d2 = convolve6_4_2d_v(s2, s3, s4, s5, s6, s7, y_filter, offset_const);
+ d3 = convolve6_4_2d_v(s3, s4, s5, s6, s7, s8, y_filter, offset_const);
- if (do_average) {
- load_u16_4x4(dst_ptr, dst_stride, &dd0, &dd1, &dd2, &dd3);
+ load_u16_4x4(dst_ptr, dst_stride, &dd0, &dd1, &dd2, &dd3);
- compute_avg_4x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3, fwd_offset,
- bck_offset, sub_const_vec, round_bits,
- use_dist_wtd_comp_avg, &d01_u8, &d23_u8);
+ compute_dist_wtd_avg_4x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3, fwd_offset,
+ bck_offset, round_offset_vec, &d01_u8, &d23_u8);
- vst1_lane_u32((uint32_t *)dst8_ptr, vreinterpret_u32_u8(d01_u8), 0);
- dst8_ptr += dst8_stride;
- vst1_lane_u32((uint32_t *)dst8_ptr, vreinterpret_u32_u8(d01_u8), 1);
- dst8_ptr += dst8_stride;
- vst1_lane_u32((uint32_t *)dst8_ptr, vreinterpret_u32_u8(d23_u8), 0);
- dst8_ptr += dst8_stride;
- vst1_lane_u32((uint32_t *)dst8_ptr, vreinterpret_u32_u8(d23_u8), 1);
- dst8_ptr += dst8_stride;
- } else {
- store_u16_4x4(dst_ptr, dst_stride, d0, d1, d2, d3);
- }
+ store_u8_4x1(dst8_ptr + 0 * dst8_stride, d01_u8, 0);
+ store_u8_4x1(dst8_ptr + 1 * dst8_stride, d01_u8, 1);
+ store_u8_4x1(dst8_ptr + 2 * dst8_stride, d23_u8, 0);
+ store_u8_4x1(dst8_ptr + 3 * dst8_stride, d23_u8, 1);
+ dst8_ptr += 4 * dst8_stride;
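+ // Slide the source window down by four rows for the next iteration.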
s0 = s4;
s1 = s5;
s2 = s6;
s3 = s7;
s4 = s8;
-
src_ptr += 4 * src_stride;
dst_ptr += 4 * dst_stride;
h -= 4;
-#else
+#else // !AOM_ARCH_AARCH64
s5 = vld1_s16(src_ptr);
- d0 = convolve6_4_s32(s0, s1, s2, s3, s4, s5, y_filter, round_shift_vec,
- offset_const);
+ d0 = convolve6_4_2d_v(s0, s1, s2, s3, s4, s5, y_filter, offset_const);
- if (do_average) {
- dd0 = vld1_u16(dst_ptr);
+ dd0 = vld1_u16(dst_ptr);
- compute_avg_4x1(dd0, d0, fwd_offset, bck_offset, sub_const_vec,
- round_bits, use_dist_wtd_comp_avg, &d01_u8);
+ compute_dist_wtd_avg_4x1(dd0, d0, fwd_offset, bck_offset,
+ vget_low_s16(round_offset_vec), &d01_u8);
- vst1_lane_u32((uint32_t *)dst8_ptr, vreinterpret_u32_u8(d01_u8), 0);
- dst8_ptr += dst8_stride;
+ store_u8_4x1(dst8_ptr, d01_u8, 0);
+ dst8_ptr += dst8_stride;
- } else {
- vst1_u16(dst_ptr, d0);
- }
s0 = s1;
s1 = s2;
s2 = s3;
s3 = s4;
s4 = s5;
-
src_ptr += src_stride;
dst_ptr += dst_stride;
h--;
-#endif
- } while (h > 0);
-
+#endif // AOM_ARCH_AARCH64
+ } while (h != 0);
} else {
int16x8_t s0, s1, s2, s3, s4, s5;
uint16x8_t dd0, d0;
uint8x8_t d0_u8;
-
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
int16x8_t s6, s7, s8;
uint16x8_t dd1, dd2, dd3, d1, d2, d3;
uint8x8_t d1_u8, d2_u8, d3_u8;
-#endif
+#endif // AOM_ARCH_AARCH64
do {
int16_t *s = src_ptr;
- uint16_t *d = dst_ptr;
+ CONV_BUF_TYPE *d = dst_ptr;
uint8_t *d_u8 = dst8_ptr;
int height = h;
- s0 = vld1q_s16(s + 0 * src_stride);
- s1 = vld1q_s16(s + 1 * src_stride);
- s2 = vld1q_s16(s + 2 * src_stride);
- s3 = vld1q_s16(s + 3 * src_stride);
- s4 = vld1q_s16(s + 4 * src_stride);
+ load_s16_8x5(s, src_stride, &s0, &s1, &s2, &s3, &s4);
s += 5 * src_stride;
do {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
load_s16_8x4(s, src_stride, &s5, &s6, &s7, &s8);
- d0 = convolve6_8_s32(s0, s1, s2, s3, s4, s5, y_filter, round_shift_vec,
- offset_const);
- d1 = convolve6_8_s32(s1, s2, s3, s4, s5, s6, y_filter, round_shift_vec,
- offset_const);
- d2 = convolve6_8_s32(s2, s3, s4, s5, s6, s7, y_filter, round_shift_vec,
- offset_const);
- d3 = convolve6_8_s32(s3, s4, s5, s6, s7, s8, y_filter, round_shift_vec,
- offset_const);
+ d0 = convolve6_8_2d_v(s0, s1, s2, s3, s4, s5, y_filter, offset_const);
+ d1 = convolve6_8_2d_v(s1, s2, s3, s4, s5, s6, y_filter, offset_const);
+ d2 = convolve6_8_2d_v(s2, s3, s4, s5, s6, s7, y_filter, offset_const);
+ d3 = convolve6_8_2d_v(s3, s4, s5, s6, s7, s8, y_filter, offset_const);
- if (do_average) {
- load_u16_8x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
+ load_u16_8x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
- compute_avg_8x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3, fwd_offset,
- bck_offset, sub_const_vec, round_bits,
- use_dist_wtd_comp_avg, &d0_u8, &d1_u8, &d2_u8,
- &d3_u8);
+ compute_dist_wtd_avg_8x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3, fwd_offset,
+ bck_offset, round_offset_vec, &d0_u8, &d1_u8,
+ &d2_u8, &d3_u8);
- vst1_u8(d_u8, d0_u8);
- d_u8 += dst8_stride;
- vst1_u8(d_u8, d1_u8);
- d_u8 += dst8_stride;
- vst1_u8(d_u8, d2_u8);
- d_u8 += dst8_stride;
- vst1_u8(d_u8, d3_u8);
- d_u8 += dst8_stride;
- } else {
- store_u16_8x4(d, dst_stride, d0, d1, d2, d3);
- }
+ store_u8_8x4(d_u8, dst8_stride, d0_u8, d1_u8, d2_u8, d3_u8);
+ d_u8 += 4 * dst8_stride;
s0 = s4;
s1 = s5;
s2 = s6;
s3 = s7;
s4 = s8;
-
s += 4 * src_stride;
d += 4 * dst_stride;
height -= 4;
-#else
+#else // !AOM_ARCH_AARCH64
s5 = vld1q_s16(s);
- d0 = convolve6_8_s32(s0, s1, s2, s3, s4, s5, y_filter, round_shift_vec,
- offset_const);
+ d0 = convolve6_8_2d_v(s0, s1, s2, s3, s4, s5, y_filter, offset_const);
- if (do_average) {
- dd0 = vld1q_u16(d);
+ dd0 = vld1q_u16(d);
- compute_avg_8x1(dd0, d0, fwd_offset, bck_offset, sub_const_vec,
- round_bits, use_dist_wtd_comp_avg, &d0_u8);
+ compute_dist_wtd_avg_8x1(dd0, d0, fwd_offset, bck_offset,
+ round_offset_vec, &d0_u8);
- vst1_u8(d_u8, d0_u8);
- d_u8 += dst8_stride;
-
- } else {
- vst1q_u16(d, d0);
- }
+ vst1_u8(d_u8, d0_u8);
+ d_u8 += dst8_stride;
s0 = s1;
s1 = s2;
s2 = s3;
s3 = s4;
s4 = s5;
-
s += src_stride;
d += dst_stride;
height--;
-#endif
- } while (height > 0);
-
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
src_ptr += 8;
dst_ptr += 8;
dst8_ptr += 8;
w -= 8;
- } while (w > 0);
+ } while (w != 0);
}
}
-static INLINE void dist_wtd_convolve_2d_vert_8tap_neon(
+static INLINE void dist_wtd_convolve_2d_vert_6tap_avg_neon(
int16_t *src_ptr, const int src_stride, uint8_t *dst8_ptr, int dst8_stride,
ConvolveParams *conv_params, const int16x8_t y_filter, int h, int w) {
+ const int bd = 8;
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const int32x4_t offset_const = vdupq_n_s32(1 << offset_bits);
+ const int16_t round_offset = (1 << (offset_bits - COMPOUND_ROUND1_BITS)) +
+ (1 << (offset_bits - COMPOUND_ROUND1_BITS - 1));
+ const int16x8_t round_offset_vec = vdupq_n_s16(round_offset);
+
CONV_BUF_TYPE *dst_ptr = conv_params->dst;
const int dst_stride = conv_params->dst_stride;
- const int bd = 8;
- const int offset_bits = bd + 2 * FILTER_BITS - conv_params->round_0;
- const int16_t sub_const = (1 << (offset_bits - conv_params->round_1)) +
- (1 << (offset_bits - conv_params->round_1 - 1));
+ if (w == 4) {
+ int16x4_t s0, s1, s2, s3, s4, s5;
+ uint16x4_t dd0, d0;
+ uint8x8_t d01_u8;
+#if AOM_ARCH_AARCH64
+ int16x4_t s6, s7, s8;
+ uint16x4_t dd1, dd2, dd3, d1, d2, d3;
+ uint8x8_t d23_u8;
+#endif // AOM_ARCH_AARCH64
- const int16_t round_bits =
- 2 * FILTER_BITS - conv_params->round_0 - conv_params->round_1;
- const int offset = bd + 2 * FILTER_BITS - conv_params->round_0;
- const int32x4_t round_shift_vec = vdupq_n_s32(-(conv_params->round_1));
- const int32x4_t offset_const = vdupq_n_s32(1 << offset);
- const int16x4_t sub_const_vec = vdup_n_s16(sub_const);
+ load_s16_4x5(src_ptr, src_stride, &s0, &s1, &s2, &s3, &s4);
+ src_ptr += 5 * src_stride;
+
+ do {
+#if AOM_ARCH_AARCH64
+ load_s16_4x4(src_ptr, src_stride, &s5, &s6, &s7, &s8);
+
+ d0 = convolve6_4_2d_v(s0, s1, s2, s3, s4, s5, y_filter, offset_const);
+ d1 = convolve6_4_2d_v(s1, s2, s3, s4, s5, s6, y_filter, offset_const);
+ d2 = convolve6_4_2d_v(s2, s3, s4, s5, s6, s7, y_filter, offset_const);
+ d3 = convolve6_4_2d_v(s3, s4, s5, s6, s7, s8, y_filter, offset_const);
+
+ load_u16_4x4(dst_ptr, dst_stride, &dd0, &dd1, &dd2, &dd3);
+
+ compute_basic_avg_4x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3,
+ round_offset_vec, &d01_u8, &d23_u8);
+
+ store_u8_4x1(dst8_ptr + 0 * dst8_stride, d01_u8, 0);
+ store_u8_4x1(dst8_ptr + 1 * dst8_stride, d01_u8, 1);
+ store_u8_4x1(dst8_ptr + 2 * dst8_stride, d23_u8, 0);
+ store_u8_4x1(dst8_ptr + 3 * dst8_stride, d23_u8, 1);
+ dst8_ptr += 4 * dst8_stride;
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ h -= 4;
+#else // !AOM_ARCH_AARCH64
+ s5 = vld1_s16(src_ptr);
+
+ d0 = convolve6_4_2d_v(s0, s1, s2, s3, s4, s5, y_filter, offset_const);
+
+ dd0 = vld1_u16(dst_ptr);
+
+ compute_basic_avg_4x1(dd0, d0, vget_low_s16(round_offset_vec), &d01_u8);
+
+ store_u8_4x1(dst8_ptr, d01_u8, 0);
+ dst8_ptr += dst8_stride;
+
+ s0 = s1;
+ s1 = s2;
+ s2 = s3;
+ s3 = s4;
+ s4 = s5;
+ src_ptr += src_stride;
+ dst_ptr += dst_stride;
+ h--;
+#endif // AOM_ARCH_AARCH64
+ } while (h != 0);
+ } else {
+ int16x8_t s0, s1, s2, s3, s4, s5;
+ uint16x8_t dd0, d0;
+ uint8x8_t d0_u8;
+#if AOM_ARCH_AARCH64
+ int16x8_t s6, s7, s8;
+ uint16x8_t dd1, dd2, dd3, d1, d2, d3;
+ uint8x8_t d1_u8, d2_u8, d3_u8;
+#endif // AOM_ARCH_AARCH64
+
+ do {
+ int16_t *s = src_ptr;
+ CONV_BUF_TYPE *d = dst_ptr;
+ uint8_t *d_u8 = dst8_ptr;
+ int height = h;
+
+ load_s16_8x5(s, src_stride, &s0, &s1, &s2, &s3, &s4);
+ s += 5 * src_stride;
+
+ do {
+#if AOM_ARCH_AARCH64
+ load_s16_8x4(s, src_stride, &s5, &s6, &s7, &s8);
+
+ d0 = convolve6_8_2d_v(s0, s1, s2, s3, s4, s5, y_filter, offset_const);
+ d1 = convolve6_8_2d_v(s1, s2, s3, s4, s5, s6, y_filter, offset_const);
+ d2 = convolve6_8_2d_v(s2, s3, s4, s5, s6, s7, y_filter, offset_const);
+ d3 = convolve6_8_2d_v(s3, s4, s5, s6, s7, s8, y_filter, offset_const);
+
+ load_u16_8x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
+
+ compute_basic_avg_8x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3,
+ round_offset_vec, &d0_u8, &d1_u8, &d2_u8, &d3_u8);
+
+ store_u8_8x4(d_u8, dst8_stride, d0_u8, d1_u8, d2_u8, d3_u8);
+ d_u8 += 4 * dst8_stride;
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ height -= 4;
+#else // !AOM_ARCH_AARCH64
+ s5 = vld1q_s16(s);
+
+ d0 = convolve6_8_2d_v(s0, s1, s2, s3, s4, s5, y_filter, offset_const);
+
+ dd0 = vld1q_u16(d);
+
+ compute_basic_avg_8x1(dd0, d0, round_offset_vec, &d0_u8);
+
+ vst1_u8(d_u8, d0_u8);
+ d_u8 += dst8_stride;
+
+ s0 = s1;
+ s1 = s2;
+ s2 = s3;
+ s3 = s4;
+ s4 = s5;
+ s += src_stride;
+ d += dst_stride;
+ height--;
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
+ src_ptr += 8;
+ dst_ptr += 8;
+ dst8_ptr += 8;
+ w -= 8;
+ } while (w != 0);
+ }
+}
+
+static INLINE void dist_wtd_convolve_2d_vert_6tap_neon(
+ int16_t *src_ptr, const int src_stride, ConvolveParams *conv_params,
+ const int16x8_t y_filter, int h, int w) {
+ const int bd = 8;
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const int32x4_t offset_const = vdupq_n_s32(1 << offset_bits);
+
+ CONV_BUF_TYPE *dst_ptr = conv_params->dst;
+ const int dst_stride = conv_params->dst_stride;
+
+ if (w == 4) {
+ int16x4_t s0, s1, s2, s3, s4, s5;
+ uint16x4_t d0;
+#if AOM_ARCH_AARCH64
+ int16x4_t s6, s7, s8;
+ uint16x4_t d1, d2, d3;
+#endif // AOM_ARCH_AARCH64
+
+ load_s16_4x5(src_ptr, src_stride, &s0, &s1, &s2, &s3, &s4);
+ src_ptr += 5 * src_stride;
+
+ do {
+#if AOM_ARCH_AARCH64
+ load_s16_4x4(src_ptr, src_stride, &s5, &s6, &s7, &s8);
+
+ d0 = convolve6_4_2d_v(s0, s1, s2, s3, s4, s5, y_filter, offset_const);
+ d1 = convolve6_4_2d_v(s1, s2, s3, s4, s5, s6, y_filter, offset_const);
+ d2 = convolve6_4_2d_v(s2, s3, s4, s5, s6, s7, y_filter, offset_const);
+ d3 = convolve6_4_2d_v(s3, s4, s5, s6, s7, s8, y_filter, offset_const);
+
+ store_u16_4x4(dst_ptr, dst_stride, d0, d1, d2, d3);
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ h -= 4;
+#else // !AOM_ARCH_AARCH64
+ s5 = vld1_s16(src_ptr);
+
+ d0 = convolve6_4_2d_v(s0, s1, s2, s3, s4, s5, y_filter, offset_const);
+
+ vst1_u16(dst_ptr, d0);
+
+ s0 = s1;
+ s1 = s2;
+ s2 = s3;
+ s3 = s4;
+ s4 = s5;
+ src_ptr += src_stride;
+ dst_ptr += dst_stride;
+ h--;
+#endif // AOM_ARCH_AARCH64
+ } while (h != 0);
+ } else {
+ int16x8_t s0, s1, s2, s3, s4, s5;
+ uint16x8_t d0;
+#if AOM_ARCH_AARCH64
+ int16x8_t s6, s7, s8;
+ uint16x8_t d1, d2, d3;
+#endif // AOM_ARCH_AARCH64
+
+ do {
+ int16_t *s = src_ptr;
+ CONV_BUF_TYPE *d = dst_ptr;
+ int height = h;
+
+ load_s16_8x5(s, src_stride, &s0, &s1, &s2, &s3, &s4);
+ s += 5 * src_stride;
+
+ do {
+#if AOM_ARCH_AARCH64
+ load_s16_8x4(s, src_stride, &s5, &s6, &s7, &s8);
+
+ d0 = convolve6_8_2d_v(s0, s1, s2, s3, s4, s5, y_filter, offset_const);
+ d1 = convolve6_8_2d_v(s1, s2, s3, s4, s5, s6, y_filter, offset_const);
+ d2 = convolve6_8_2d_v(s2, s3, s4, s5, s6, s7, y_filter, offset_const);
+ d3 = convolve6_8_2d_v(s3, s4, s5, s6, s7, s8, y_filter, offset_const);
+
+ store_u16_8x4(d, dst_stride, d0, d1, d2, d3);
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ height -= 4;
+#else // !AOM_ARCH_AARCH64
+ s5 = vld1q_s16(s);
+
+ d0 = convolve6_8_2d_v(s0, s1, s2, s3, s4, s5, y_filter, offset_const);
+
+ vst1q_u16(d, d0);
+
+ s0 = s1;
+ s1 = s2;
+ s2 = s3;
+ s3 = s4;
+ s4 = s5;
+ s += src_stride;
+ d += dst_stride;
+ height--;
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
+ src_ptr += 8;
+ dst_ptr += 8;
+ w -= 8;
+ } while (w != 0);
+ }
+}
+
+static INLINE uint16x4_t
+convolve8_4_2d_v(const int16x4_t s0, const int16x4_t s1, const int16x4_t s2,
+ const int16x4_t s3, const int16x4_t s4, const int16x4_t s5,
+ const int16x4_t s6, const int16x4_t s7,
+ const int16x8_t y_filter, const int32x4_t offset_const) {
+ const int16x4_t y_filter_0_3 = vget_low_s16(y_filter);
+ const int16x4_t y_filter_4_7 = vget_high_s16(y_filter);
+
+ int32x4_t sum = offset_const;
+ sum = vmlal_lane_s16(sum, s0, y_filter_0_3, 0);
+ sum = vmlal_lane_s16(sum, s1, y_filter_0_3, 1);
+ sum = vmlal_lane_s16(sum, s2, y_filter_0_3, 2);
+ sum = vmlal_lane_s16(sum, s3, y_filter_0_3, 3);
+ sum = vmlal_lane_s16(sum, s4, y_filter_4_7, 0);
+ sum = vmlal_lane_s16(sum, s5, y_filter_4_7, 1);
+ sum = vmlal_lane_s16(sum, s6, y_filter_4_7, 2);
+ sum = vmlal_lane_s16(sum, s7, y_filter_4_7, 3);
+
+ return vqrshrun_n_s32(sum, COMPOUND_ROUND1_BITS);
+}
+
+static INLINE uint16x8_t
+convolve8_8_2d_v(const int16x8_t s0, const int16x8_t s1, const int16x8_t s2,
+ const int16x8_t s3, const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t s6, const int16x8_t s7,
+ const int16x8_t y_filter, const int32x4_t offset_const) {
+ const int16x4_t y_filter_0_3 = vget_low_s16(y_filter);
+ const int16x4_t y_filter_4_7 = vget_high_s16(y_filter);
+
+ int32x4_t sum0 = offset_const;
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s0), y_filter_0_3, 0);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s1), y_filter_0_3, 1);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s2), y_filter_0_3, 2);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s3), y_filter_0_3, 3);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s4), y_filter_4_7, 0);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s5), y_filter_4_7, 1);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s6), y_filter_4_7, 2);
+ sum0 = vmlal_lane_s16(sum0, vget_low_s16(s7), y_filter_4_7, 3);
+
+ int32x4_t sum1 = offset_const;
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s0), y_filter_0_3, 0);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s1), y_filter_0_3, 1);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s2), y_filter_0_3, 2);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s3), y_filter_0_3, 3);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s4), y_filter_4_7, 0);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s5), y_filter_4_7, 1);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s6), y_filter_4_7, 2);
+ sum1 = vmlal_lane_s16(sum1, vget_high_s16(s7), y_filter_4_7, 3);
+
+ return vcombine_u16(vqrshrun_n_s32(sum0, COMPOUND_ROUND1_BITS),
+ vqrshrun_n_s32(sum1, COMPOUND_ROUND1_BITS));
+}
+
+static INLINE void dist_wtd_convolve_2d_vert_8tap_dist_wtd_avg_neon(
+ int16_t *src_ptr, const int src_stride, uint8_t *dst8_ptr, int dst8_stride,
+ ConvolveParams *conv_params, const int16x8_t y_filter, int h, int w) {
+ const int bd = 8;
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const int32x4_t offset_const = vdupq_n_s32(1 << offset_bits);
+ const int16_t round_offset = (1 << (offset_bits - COMPOUND_ROUND1_BITS)) +
+ (1 << (offset_bits - COMPOUND_ROUND1_BITS - 1));
+ const int16x8_t round_offset_vec = vdupq_n_s16(round_offset);
+
const uint16_t fwd_offset = conv_params->fwd_offset;
const uint16_t bck_offset = conv_params->bck_offset;
- const int do_average = conv_params->do_average;
- const int use_dist_wtd_comp_avg = conv_params->use_dist_wtd_comp_avg;
+
+ CONV_BUF_TYPE *dst_ptr = conv_params->dst;
+ const int dst_stride = conv_params->dst_stride;
if (w == 4) {
int16x4_t s0, s1, s2, s3, s4, s5, s6, s7;
uint16x4_t dd0, d0;
uint8x8_t d01_u8;
-
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
int16x4_t s8, s9, s10;
uint16x4_t dd1, dd2, dd3, d1, d2, d3;
uint8x8_t d23_u8;
-#endif
+#endif // AOM_ARCH_AARCH64
- load_s16_4x8(src_ptr, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6, &s7);
+ load_s16_4x7(src_ptr, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6);
src_ptr += 7 * src_stride;
do {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
load_s16_4x4(src_ptr, src_stride, &s7, &s8, &s9, &s10);
- d0 = convolve8_4_s32(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
- round_shift_vec, offset_const);
- d1 = convolve8_4_s32(s1, s2, s3, s4, s5, s6, s7, s8, y_filter,
- round_shift_vec, offset_const);
- d2 = convolve8_4_s32(s2, s3, s4, s5, s6, s7, s8, s9, y_filter,
- round_shift_vec, offset_const);
- d3 = convolve8_4_s32(s3, s4, s5, s6, s7, s8, s9, s10, y_filter,
- round_shift_vec, offset_const);
+ d0 = convolve8_4_2d_v(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ offset_const);
+ d1 = convolve8_4_2d_v(s1, s2, s3, s4, s5, s6, s7, s8, y_filter,
+ offset_const);
+ d2 = convolve8_4_2d_v(s2, s3, s4, s5, s6, s7, s8, s9, y_filter,
+ offset_const);
+ d3 = convolve8_4_2d_v(s3, s4, s5, s6, s7, s8, s9, s10, y_filter,
+ offset_const);
- if (do_average) {
- load_u16_4x4(dst_ptr, dst_stride, &dd0, &dd1, &dd2, &dd3);
+ load_u16_4x4(dst_ptr, dst_stride, &dd0, &dd1, &dd2, &dd3);
- compute_avg_4x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3, fwd_offset,
- bck_offset, sub_const_vec, round_bits,
- use_dist_wtd_comp_avg, &d01_u8, &d23_u8);
+ compute_dist_wtd_avg_4x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3, fwd_offset,
+ bck_offset, round_offset_vec, &d01_u8, &d23_u8);
- vst1_lane_u32((uint32_t *)dst8_ptr, vreinterpret_u32_u8(d01_u8), 0);
- dst8_ptr += dst8_stride;
- vst1_lane_u32((uint32_t *)dst8_ptr, vreinterpret_u32_u8(d01_u8), 1);
- dst8_ptr += dst8_stride;
- vst1_lane_u32((uint32_t *)dst8_ptr, vreinterpret_u32_u8(d23_u8), 0);
- dst8_ptr += dst8_stride;
- vst1_lane_u32((uint32_t *)dst8_ptr, vreinterpret_u32_u8(d23_u8), 1);
- dst8_ptr += dst8_stride;
- } else {
- store_u16_4x4(dst_ptr, dst_stride, d0, d1, d2, d3);
- }
+ store_u8_4x1(dst8_ptr + 0 * dst8_stride, d01_u8, 0);
+ store_u8_4x1(dst8_ptr + 1 * dst8_stride, d01_u8, 1);
+ store_u8_4x1(dst8_ptr + 2 * dst8_stride, d23_u8, 0);
+ store_u8_4x1(dst8_ptr + 3 * dst8_stride, d23_u8, 1);
+ dst8_ptr += 4 * dst8_stride;
s0 = s4;
s1 = s5;
@@ -1007,28 +1318,23 @@
s4 = s8;
s5 = s9;
s6 = s10;
-
src_ptr += 4 * src_stride;
dst_ptr += 4 * dst_stride;
h -= 4;
-#else
+#else // !AOM_ARCH_AARCH64
s7 = vld1_s16(src_ptr);
- d0 = convolve8_4_s32(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
- round_shift_vec, offset_const);
+ d0 = convolve8_4_2d_v(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ offset_const);
- if (do_average) {
- dd0 = vld1_u16(dst_ptr);
+ dd0 = vld1_u16(dst_ptr);
- compute_avg_4x1(dd0, d0, fwd_offset, bck_offset, sub_const_vec,
- round_bits, use_dist_wtd_comp_avg, &d01_u8);
+ compute_dist_wtd_avg_4x1(dd0, d0, fwd_offset, bck_offset,
+ vget_low_s16(round_offset_vec), &d01_u8);
- vst1_lane_u32((uint32_t *)dst8_ptr, vreinterpret_u32_u8(d01_u8), 0);
- dst8_ptr += dst8_stride;
+ store_u8_4x1(dst8_ptr, d01_u8, 0);
+ dst8_ptr += dst8_stride;
- } else {
- vst1_u16(dst_ptr, d0);
- }
s0 = s1;
s1 = s2;
s2 = s3;
@@ -1036,65 +1342,51 @@
s4 = s5;
s5 = s6;
s6 = s7;
-
src_ptr += src_stride;
dst_ptr += dst_stride;
h--;
-#endif
- } while (h > 0);
-
+#endif // AOM_ARCH_AARCH64
+ } while (h != 0);
} else {
int16x8_t s0, s1, s2, s3, s4, s5, s6, s7;
uint16x8_t dd0, d0;
uint8x8_t d0_u8;
-
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
int16x8_t s8, s9, s10;
uint16x8_t dd1, dd2, dd3, d1, d2, d3;
uint8x8_t d1_u8, d2_u8, d3_u8;
-#endif
+#endif // AOM_ARCH_AARCH64
do {
int16_t *s = src_ptr;
- uint16_t *d = dst_ptr;
+ CONV_BUF_TYPE *d = dst_ptr;
uint8_t *d_u8 = dst8_ptr;
int height = h;
- load_s16_8x8(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6, &s7);
+ load_s16_8x7(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6);
s += 7 * src_stride;
do {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
load_s16_8x4(s, src_stride, &s7, &s8, &s9, &s10);
- d0 = convolve8_8_s32(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
- round_shift_vec, offset_const);
- d1 = convolve8_8_s32(s1, s2, s3, s4, s5, s6, s7, s8, y_filter,
- round_shift_vec, offset_const);
- d2 = convolve8_8_s32(s2, s3, s4, s5, s6, s7, s8, s9, y_filter,
- round_shift_vec, offset_const);
- d3 = convolve8_8_s32(s3, s4, s5, s6, s7, s8, s9, s10, y_filter,
- round_shift_vec, offset_const);
+ d0 = convolve8_8_2d_v(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ offset_const);
+ d1 = convolve8_8_2d_v(s1, s2, s3, s4, s5, s6, s7, s8, y_filter,
+ offset_const);
+ d2 = convolve8_8_2d_v(s2, s3, s4, s5, s6, s7, s8, s9, y_filter,
+ offset_const);
+ d3 = convolve8_8_2d_v(s3, s4, s5, s6, s7, s8, s9, s10, y_filter,
+ offset_const);
- if (do_average) {
- load_u16_8x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
+ load_u16_8x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
- compute_avg_8x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3, fwd_offset,
- bck_offset, sub_const_vec, round_bits,
- use_dist_wtd_comp_avg, &d0_u8, &d1_u8, &d2_u8,
- &d3_u8);
+ compute_dist_wtd_avg_8x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3, fwd_offset,
+ bck_offset, round_offset_vec, &d0_u8, &d1_u8,
+ &d2_u8, &d3_u8);
- vst1_u8(d_u8, d0_u8);
- d_u8 += dst8_stride;
- vst1_u8(d_u8, d1_u8);
- d_u8 += dst8_stride;
- vst1_u8(d_u8, d2_u8);
- d_u8 += dst8_stride;
- vst1_u8(d_u8, d3_u8);
- d_u8 += dst8_stride;
- } else {
- store_u16_8x4(d, dst_stride, d0, d1, d2, d3);
- }
+ store_u8_8x4(d_u8, dst8_stride, d0_u8, d1_u8, d2_u8, d3_u8);
+ d_u8 += 4 * dst8_stride;
s0 = s4;
s1 = s5;
@@ -1103,28 +1395,22 @@
s4 = s8;
s5 = s9;
s6 = s10;
-
s += 4 * src_stride;
d += 4 * dst_stride;
height -= 4;
-#else
+#else // !AOM_ARCH_AARCH64
s7 = vld1q_s16(s);
- d0 = convolve8_8_s32(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
- round_shift_vec, offset_const);
+ d0 = convolve8_8_2d_v(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ offset_const);
- if (do_average) {
- dd0 = vld1q_u16(d);
+ dd0 = vld1q_u16(d);
- compute_avg_8x1(dd0, d0, fwd_offset, bck_offset, sub_const_vec,
- round_bits, use_dist_wtd_comp_avg, &d0_u8);
+ compute_dist_wtd_avg_8x1(dd0, d0, fwd_offset, bck_offset,
+ round_offset_vec, &d0_u8);
- vst1_u8(d_u8, d0_u8);
- d_u8 += dst8_stride;
-
- } else {
- vst1q_u16(d, d0);
- }
+ vst1_u8(d_u8, d0_u8);
+ d_u8 += dst8_stride;
s0 = s1;
s1 = s2;
@@ -1133,18 +1419,318 @@
s4 = s5;
s5 = s6;
s6 = s7;
-
s += src_stride;
d += dst_stride;
height--;
-#endif
- } while (height > 0);
-
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
src_ptr += 8;
dst_ptr += 8;
dst8_ptr += 8;
w -= 8;
- } while (w > 0);
+ } while (w != 0);
+ }
+}
+
+static INLINE void dist_wtd_convolve_2d_vert_8tap_avg_neon(
+ int16_t *src_ptr, const int src_stride, uint8_t *dst8_ptr, int dst8_stride,
+ ConvolveParams *conv_params, const int16x8_t y_filter, int h, int w) {
+ const int bd = 8;
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const int32x4_t offset_const = vdupq_n_s32(1 << offset_bits);
+ const int16_t round_offset = (1 << (offset_bits - COMPOUND_ROUND1_BITS)) +
+ (1 << (offset_bits - COMPOUND_ROUND1_BITS - 1));
+ const int16x8_t round_offset_vec = vdupq_n_s16(round_offset);
+
+ CONV_BUF_TYPE *dst_ptr = conv_params->dst;
+ const int dst_stride = conv_params->dst_stride;
+
+ if (w == 4) {
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7;
+ uint16x4_t dd0, d0;
+ uint8x8_t d01_u8;
+#if AOM_ARCH_AARCH64
+ int16x4_t s8, s9, s10;
+ uint16x4_t dd1, dd2, dd3, d1, d2, d3;
+ uint8x8_t d23_u8;
+#endif // AOM_ARCH_AARCH64
+
+ load_s16_4x7(src_ptr, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6);
+ src_ptr += 7 * src_stride;
+
+ do {
+#if AOM_ARCH_AARCH64
+ load_s16_4x4(src_ptr, src_stride, &s7, &s8, &s9, &s10);
+
+ d0 = convolve8_4_2d_v(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ offset_const);
+ d1 = convolve8_4_2d_v(s1, s2, s3, s4, s5, s6, s7, s8, y_filter,
+ offset_const);
+ d2 = convolve8_4_2d_v(s2, s3, s4, s5, s6, s7, s8, s9, y_filter,
+ offset_const);
+ d3 = convolve8_4_2d_v(s3, s4, s5, s6, s7, s8, s9, s10, y_filter,
+ offset_const);
+
+ load_u16_4x4(dst_ptr, dst_stride, &dd0, &dd1, &dd2, &dd3);
+
+ compute_basic_avg_4x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3,
+ round_offset_vec, &d01_u8, &d23_u8);
+
+ store_u8_4x1(dst8_ptr + 0 * dst8_stride, d01_u8, 0);
+ store_u8_4x1(dst8_ptr + 1 * dst8_stride, d01_u8, 1);
+ store_u8_4x1(dst8_ptr + 2 * dst8_stride, d23_u8, 0);
+ store_u8_4x1(dst8_ptr + 3 * dst8_stride, d23_u8, 1);
+ dst8_ptr += 4 * dst8_stride;
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ h -= 4;
+#else // !AOM_ARCH_AARCH64
+ s7 = vld1_s16(src_ptr);
+
+ d0 = convolve8_4_2d_v(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ offset_const);
+
+ dd0 = vld1_u16(dst_ptr);
+
+ compute_basic_avg_4x1(dd0, d0, vget_low_s16(round_offset_vec), &d01_u8);
+
+ store_u8_4x1(dst8_ptr, d01_u8, 0);
+ dst8_ptr += dst8_stride;
+
+ s0 = s1;
+ s1 = s2;
+ s2 = s3;
+ s3 = s4;
+ s4 = s5;
+ s5 = s6;
+ s6 = s7;
+ src_ptr += src_stride;
+ dst_ptr += dst_stride;
+ h--;
+#endif // AOM_ARCH_AARCH64
+ } while (h != 0);
+ } else {
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7;
+ uint16x8_t dd0, d0;
+ uint8x8_t d0_u8;
+#if AOM_ARCH_AARCH64
+ int16x8_t s8, s9, s10;
+ uint16x8_t dd1, dd2, dd3, d1, d2, d3;
+ uint8x8_t d1_u8, d2_u8, d3_u8;
+#endif // AOM_ARCH_AARCH64
+
+ do {
+ int16_t *s = src_ptr;
+ CONV_BUF_TYPE *d = dst_ptr;
+ uint8_t *d_u8 = dst8_ptr;
+ int height = h;
+
+ load_s16_8x7(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6);
+ s += 7 * src_stride;
+
+ do {
+#if AOM_ARCH_AARCH64
+ load_s16_8x4(s, src_stride, &s7, &s8, &s9, &s10);
+
+ d0 = convolve8_8_2d_v(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ offset_const);
+ d1 = convolve8_8_2d_v(s1, s2, s3, s4, s5, s6, s7, s8, y_filter,
+ offset_const);
+ d2 = convolve8_8_2d_v(s2, s3, s4, s5, s6, s7, s8, s9, y_filter,
+ offset_const);
+ d3 = convolve8_8_2d_v(s3, s4, s5, s6, s7, s8, s9, s10, y_filter,
+ offset_const);
+
+ load_u16_8x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
+
+ compute_basic_avg_8x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3,
+ round_offset_vec, &d0_u8, &d1_u8, &d2_u8, &d3_u8);
+
+ store_u8_8x4(d_u8, dst8_stride, d0_u8, d1_u8, d2_u8, d3_u8);
+ d_u8 += 4 * dst8_stride;
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ height -= 4;
+#else // !AOM_ARCH_AARCH64
+ s7 = vld1q_s16(s);
+
+ d0 = convolve8_8_2d_v(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ offset_const);
+
+ dd0 = vld1q_u16(d);
+
+ compute_basic_avg_8x1(dd0, d0, round_offset_vec, &d0_u8);
+
+ vst1_u8(d_u8, d0_u8);
+ d_u8 += dst8_stride;
+
+ s0 = s1;
+ s1 = s2;
+ s2 = s3;
+ s3 = s4;
+ s4 = s5;
+ s5 = s6;
+ s6 = s7;
+ s += src_stride;
+ d += dst_stride;
+ height--;
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
+ src_ptr += 8;
+ dst_ptr += 8;
+ dst8_ptr += 8;
+ w -= 8;
+ } while (w != 0);
+ }
+}
+
+static INLINE void dist_wtd_convolve_2d_vert_8tap_neon(
+ int16_t *src_ptr, const int src_stride, ConvolveParams *conv_params,
+ const int16x8_t y_filter, int h, int w) {
+ const int bd = 8;
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const int32x4_t offset_const = vdupq_n_s32(1 << offset_bits);
+
+ CONV_BUF_TYPE *dst_ptr = conv_params->dst;
+ const int dst_stride = conv_params->dst_stride;
+
+ if (w == 4) {
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7;
+ uint16x4_t d0;
+#if AOM_ARCH_AARCH64
+ int16x4_t s8, s9, s10;
+ uint16x4_t d1, d2, d3;
+#endif // AOM_ARCH_AARCH64
+
+ load_s16_4x7(src_ptr, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6);
+ src_ptr += 7 * src_stride;
+
+ do {
+#if AOM_ARCH_AARCH64
+ load_s16_4x4(src_ptr, src_stride, &s7, &s8, &s9, &s10);
+
+ d0 = convolve8_4_2d_v(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ offset_const);
+ d1 = convolve8_4_2d_v(s1, s2, s3, s4, s5, s6, s7, s8, y_filter,
+ offset_const);
+ d2 = convolve8_4_2d_v(s2, s3, s4, s5, s6, s7, s8, s9, y_filter,
+ offset_const);
+ d3 = convolve8_4_2d_v(s3, s4, s5, s6, s7, s8, s9, s10, y_filter,
+ offset_const);
+
+ store_u16_4x4(dst_ptr, dst_stride, d0, d1, d2, d3);
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ h -= 4;
+#else // !AOM_ARCH_AARCH64
+ s7 = vld1_s16(src_ptr);
+
+ d0 = convolve8_4_2d_v(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ offset_const);
+
+ vst1_u16(dst_ptr, d0);
+
+ s0 = s1;
+ s1 = s2;
+ s2 = s3;
+ s3 = s4;
+ s4 = s5;
+ s5 = s6;
+ s6 = s7;
+ src_ptr += src_stride;
+ dst_ptr += dst_stride;
+ h--;
+#endif // AOM_ARCH_AARCH64
+ } while (h != 0);
+ } else {
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7;
+ uint16x8_t d0;
+#if AOM_ARCH_AARCH64
+ int16x8_t s8, s9, s10;
+ uint16x8_t d1, d2, d3;
+#endif // AOM_ARCH_AARCH64
+
+ do {
+ int16_t *s = src_ptr;
+ CONV_BUF_TYPE *d = dst_ptr;
+ int height = h;
+
+ load_s16_8x7(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6);
+ s += 7 * src_stride;
+
+ do {
+#if AOM_ARCH_AARCH64
+ load_s16_8x4(s, src_stride, &s7, &s8, &s9, &s10);
+
+ d0 = convolve8_8_2d_v(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ offset_const);
+ d1 = convolve8_8_2d_v(s1, s2, s3, s4, s5, s6, s7, s8, y_filter,
+ offset_const);
+ d2 = convolve8_8_2d_v(s2, s3, s4, s5, s6, s7, s8, s9, y_filter,
+ offset_const);
+ d3 = convolve8_8_2d_v(s3, s4, s5, s6, s7, s8, s9, s10, y_filter,
+ offset_const);
+
+ store_u16_8x4(d, dst_stride, d0, d1, d2, d3);
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ height -= 4;
+#else // !AOM_ARCH_AARCH64
+ s7 = vld1q_s16(s);
+
+ d0 = convolve8_8_2d_v(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ offset_const);
+
+ vst1q_u16(d, d0);
+
+ s0 = s1;
+ s1 = s2;
+ s2 = s3;
+ s3 = s4;
+ s4 = s5;
+ s5 = s6;
+ s6 = s7;
+ s += src_stride;
+ d += dst_stride;
+ height--;
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
+ src_ptr += 8;
+ dst_ptr += 8;
+ w -= 8;
+ } while (w != 0);
}
}
@@ -1154,8 +1740,8 @@
const InterpFilterParams *filter_params_y,
const int subpel_x_qn, const int subpel_y_qn,
ConvolveParams *conv_params) {
- assert(!(w % 4));
- assert(!(h % 4));
+ assert(w % 4 == 0);
+ assert(h % 4 == 0);
DECLARE_ALIGNED(16, int16_t,
im_block[(MAX_SB_SIZE + HORIZ_EXTRA_ROWS) * MAX_SB_SIZE]);
@@ -1167,7 +1753,6 @@
const int im_stride = MAX_SB_SIZE;
const int vert_offset = filter_params_y->taps / 2 - 1;
const int horiz_offset = filter_params_x->taps / 2 - 1;
- const int round_0 = conv_params->round_0 - 1;
const uint8_t *src_ptr = src - vert_offset * src_stride - horiz_offset;
const int16_t *x_filter_ptr = av1_get_interp_filter_subpel_kernel(
filter_params_x, subpel_x_qn & SUBPEL_MASK);
@@ -1179,162 +1764,367 @@
const int16x8_t x_filter = vshrq_n_s16(vld1q_s16(x_filter_ptr), 1);
const int16x8_t y_filter = vld1q_s16(y_filter_ptr);
- dist_wtd_convolve_2d_horiz_neon(src_ptr, src_stride, im_block, im_stride,
- x_filter, im_h, w, round_0);
+ dist_wtd_convolve_2d_horiz_8tap_neon(src_ptr, src_stride, im_block, im_stride,
+ x_filter, im_h, w);
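+ // For each tap count, dispatch to one of three specialized vertical paths
+ // so that the do_average / use_dist_wtd_comp_avg checks are hoisted out of
+ // the inner pixel loops.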
if (clamped_y_taps == 6) {
- dist_wtd_convolve_2d_vert_6tap_neon(im_block + im_stride, im_stride, dst8,
- dst8_stride, conv_params, y_filter, h,
- w);
+ if (conv_params->do_average) {
+ if (UNLIKELY(conv_params->use_dist_wtd_comp_avg)) {
+ dist_wtd_convolve_2d_vert_6tap_dist_wtd_avg_neon(
+ im_block + im_stride, im_stride, dst8, dst8_stride, conv_params,
+ y_filter, h, w);
+ } else {
+ dist_wtd_convolve_2d_vert_6tap_avg_neon(im_block + im_stride, im_stride,
+ dst8, dst8_stride, conv_params,
+ y_filter, h, w);
+ }
+ } else {
+ dist_wtd_convolve_2d_vert_6tap_neon(im_block + im_stride, im_stride,
+ conv_params, y_filter, h, w);
+ }
} else {
- dist_wtd_convolve_2d_vert_8tap_neon(im_block, im_stride, dst8, dst8_stride,
- conv_params, y_filter, h, w);
+ if (conv_params->do_average) {
+ if (UNLIKELY(conv_params->use_dist_wtd_comp_avg)) {
+ dist_wtd_convolve_2d_vert_8tap_dist_wtd_avg_neon(
+ im_block, im_stride, dst8, dst8_stride, conv_params, y_filter, h,
+ w);
+ } else {
+ dist_wtd_convolve_2d_vert_8tap_avg_neon(im_block, im_stride, dst8,
+ dst8_stride, conv_params,
+ y_filter, h, w);
+ }
+ } else {
+ dist_wtd_convolve_2d_vert_8tap_neon(im_block, im_stride, conv_params,
+ y_filter, h, w);
+ }
+ }
+}
+
+static INLINE void dist_wtd_convolve_2d_copy_dist_wtd_avg_neon(
+ const uint8_t *src, int src_stride, uint8_t *dst8, int dst8_stride, int w,
+ int h, ConvolveParams *conv_params) {
+ assert(w % 4 == 0);
+ assert(h % 4 == 0);
+
+ const int bd = 8;
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const uint16_t round_offset = (1 << (offset_bits - COMPOUND_ROUND1_BITS)) +
+ (1 << (offset_bits - COMPOUND_ROUND1_BITS - 1));
+ const uint16x8_t round_offset_vec = vdupq_n_u16(round_offset);
+ const uint8x8_t shift_by_bits = vdup_n_u8(1 << (FILTER_BITS - ROUND0_BITS));
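+ // vmlal_u8 below forms round_offset + (src << (FILTER_BITS - ROUND0_BITS))
+ // in a single widening multiply-accumulate.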
+
+ const uint16_t fwd_offset = conv_params->fwd_offset;
+ const uint16_t bck_offset = conv_params->bck_offset;
+
+ CONV_BUF_TYPE *dst = conv_params->dst;
+ const int dst_stride = conv_params->dst_stride;
+ int height = h;
+
+ if (w == 4) {
+ uint8x8_t s0, s1, s2, s3, d01, d23;
+ uint16x4_t d0, d1, d2, d3, dd0, dd1, dd2, dd3;
+
+ do {
+ load_u8_8x4(src, src_stride, &s0, &s1, &s2, &s3);
+
+ d0 = vget_low_u16(vmlal_u8(round_offset_vec, s0, shift_by_bits));
+ d1 = vget_low_u16(vmlal_u8(round_offset_vec, s1, shift_by_bits));
+ d2 = vget_low_u16(vmlal_u8(round_offset_vec, s2, shift_by_bits));
+ d3 = vget_low_u16(vmlal_u8(round_offset_vec, s3, shift_by_bits));
+
+ load_u16_4x4(dst, dst_stride, &dd0, &dd1, &dd2, &dd3);
+
+ compute_dist_wtd_avg_4x4(
+ dd0, dd1, dd2, dd3, d0, d1, d2, d3, fwd_offset, bck_offset,
+ vreinterpretq_s16_u16(round_offset_vec), &d01, &d23);
+
+ store_u8_4x1(dst8 + 0 * dst8_stride, d01, 0);
+ store_u8_4x1(dst8 + 1 * dst8_stride, d01, 1);
+ store_u8_4x1(dst8 + 2 * dst8_stride, d23, 0);
+ store_u8_4x1(dst8 + 3 * dst8_stride, d23, 1);
+
+ src += 4 * src_stride;
+ dst += 4 * dst_stride;
+ dst8 += 4 * dst8_stride;
+ height -= 4;
+ } while (height != 0);
+ } else {
+ uint8x8_t s0, s1, s2, s3, d0_u8, d1_u8, d2_u8, d3_u8;
+ uint16x8_t d0, d1, d2, d3, dd0, dd1, dd2, dd3;
+
+ do {
+ const uint8_t *s = src;
+ CONV_BUF_TYPE *d = dst;
+ uint8_t *d_u8 = dst8;
+ int width = w;
+
+ do {
+ load_u8_8x4(s, src_stride, &s0, &s1, &s2, &s3);
+
+ d0 = vmlal_u8(round_offset_vec, s0, shift_by_bits);
+ d1 = vmlal_u8(round_offset_vec, s1, shift_by_bits);
+ d2 = vmlal_u8(round_offset_vec, s2, shift_by_bits);
+ d3 = vmlal_u8(round_offset_vec, s3, shift_by_bits);
+
+ load_u16_8x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
+
+ compute_dist_wtd_avg_8x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3, fwd_offset,
+ bck_offset,
+ vreinterpretq_s16_u16(round_offset_vec),
+ &d0_u8, &d1_u8, &d2_u8, &d3_u8);
+
+ store_u8_8x4(d_u8, dst8_stride, d0_u8, d1_u8, d2_u8, d3_u8);
+
+ s += 8;
+ d += 8;
+ d_u8 += 8;
+ width -= 8;
+ } while (width != 0);
+ src += 4 * src_stride;
+ dst += 4 * dst_stride;
+ dst8 += 4 * dst8_stride;
+ height -= 4;
+ } while (height != 0);
+ }
+}
+
+static INLINE void dist_wtd_convolve_2d_copy_avg_neon(
+ const uint8_t *src, int src_stride, uint8_t *dst8, int dst8_stride, int w,
+ int h, ConvolveParams *conv_params) {
+ assert(w % 4 == 0);
+ assert(h % 4 == 0);
+
+ const int bd = 8;
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const uint16_t round_offset = (1 << (offset_bits - COMPOUND_ROUND1_BITS)) +
+ (1 << (offset_bits - COMPOUND_ROUND1_BITS - 1));
+ const uint16x8_t round_offset_vec = vdupq_n_u16(round_offset);
+ const uint8x8_t shift_by_bits = vdup_n_u8(1 << (FILTER_BITS - ROUND0_BITS));
+
+ CONV_BUF_TYPE *dst = conv_params->dst;
+ const int dst_stride = conv_params->dst_stride;
+ int height = h;
+
+ if (w == 4) {
+ uint8x8_t s0, s1, s2, s3, d01, d23;
+ uint16x4_t d0, d1, d2, d3, dd0, dd1, dd2, dd3;
+
+ do {
+ load_u8_8x4(src, src_stride, &s0, &s1, &s2, &s3);
+
+ d0 = vget_low_u16(vmlal_u8(round_offset_vec, s0, shift_by_bits));
+ d1 = vget_low_u16(vmlal_u8(round_offset_vec, s1, shift_by_bits));
+ d2 = vget_low_u16(vmlal_u8(round_offset_vec, s2, shift_by_bits));
+ d3 = vget_low_u16(vmlal_u8(round_offset_vec, s3, shift_by_bits));
+
+ load_u16_4x4(dst, dst_stride, &dd0, &dd1, &dd2, &dd3);
+
+ compute_basic_avg_4x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3,
+ vreinterpretq_s16_u16(round_offset_vec), &d01,
+ &d23);
+
+ store_u8_4x1(dst8 + 0 * dst8_stride, d01, 0);
+ store_u8_4x1(dst8 + 1 * dst8_stride, d01, 1);
+ store_u8_4x1(dst8 + 2 * dst8_stride, d23, 0);
+ store_u8_4x1(dst8 + 3 * dst8_stride, d23, 1);
+
+ src += 4 * src_stride;
+ dst += 4 * dst_stride;
+ dst8 += 4 * dst8_stride;
+ height -= 4;
+ } while (height != 0);
+ } else {
+ uint8x8_t s0, s1, s2, s3, d0_u8, d1_u8, d2_u8, d3_u8;
+ uint16x8_t d0, d1, d2, d3, dd0, dd1, dd2, dd3;
+
+ do {
+ const uint8_t *s = src;
+ CONV_BUF_TYPE *d = dst;
+ uint8_t *d_u8 = dst8;
+ int width = w;
+
+ do {
+ load_u8_8x4(s, src_stride, &s0, &s1, &s2, &s3);
+
+ d0 = vmlal_u8(round_offset_vec, s0, shift_by_bits);
+ d1 = vmlal_u8(round_offset_vec, s1, shift_by_bits);
+ d2 = vmlal_u8(round_offset_vec, s2, shift_by_bits);
+ d3 = vmlal_u8(round_offset_vec, s3, shift_by_bits);
+
+ load_u16_8x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
+
+ compute_basic_avg_8x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3,
+ vreinterpretq_s16_u16(round_offset_vec), &d0_u8,
+ &d1_u8, &d2_u8, &d3_u8);
+
+ store_u8_8x4(d_u8, dst8_stride, d0_u8, d1_u8, d2_u8, d3_u8);
+
+ s += 8;
+ d += 8;
+ d_u8 += 8;
+ width -= 8;
+ } while (width != 0);
+ src += 4 * src_stride;
+ dst += 4 * dst_stride;
+ dst8 += 4 * dst8_stride;
+ height -= 4;
+ } while (height != 0);
+ }
+}
+
+static INLINE void dist_wtd_convolve_2d_copy_neon(const uint8_t *src,
+ int src_stride, int w, int h,
+ ConvolveParams *conv_params) {
+ assert(w % 4 == 0);
+ assert(h % 4 == 0);
+
+ const int bd = 8;
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const uint16_t round_offset = (1 << (offset_bits - COMPOUND_ROUND1_BITS)) +
+ (1 << (offset_bits - COMPOUND_ROUND1_BITS - 1));
+ const uint16x8_t round_offset_vec = vdupq_n_u16(round_offset);
+ const uint8x8_t shift_by_bits = vdup_n_u8(1 << (FILTER_BITS - ROUND0_BITS));
+
+ CONV_BUF_TYPE *dst = conv_params->dst;
+ const int dst_stride = conv_params->dst_stride;
+ int height = h;
+
+ if (w == 4) {
+ uint8x8_t s0, s1, s2, s3;
+ uint16x4_t d0, d1, d2, d3;
+
+ do {
+ load_u8_8x4(src, src_stride, &s0, &s1, &s2, &s3);
+
+ d0 = vget_low_u16(vmlal_u8(round_offset_vec, s0, shift_by_bits));
+ d1 = vget_low_u16(vmlal_u8(round_offset_vec, s1, shift_by_bits));
+ d2 = vget_low_u16(vmlal_u8(round_offset_vec, s2, shift_by_bits));
+ d3 = vget_low_u16(vmlal_u8(round_offset_vec, s3, shift_by_bits));
+
+ store_u16_4x4(dst, dst_stride, d0, d1, d2, d3);
+
+ src += 4 * src_stride;
+ dst += 4 * dst_stride;
+ height -= 4;
+ } while (height != 0);
+ } else {
+ uint8x8_t s0, s1, s2, s3;
+ uint16x8_t d0, d1, d2, d3;
+
+ do {
+ const uint8_t *s = src;
+ CONV_BUF_TYPE *d = dst;
+ int width = w;
+
+ do {
+ load_u8_8x4(s, src_stride, &s0, &s1, &s2, &s3);
+
+ d0 = vmlal_u8(round_offset_vec, s0, shift_by_bits);
+ d1 = vmlal_u8(round_offset_vec, s1, shift_by_bits);
+ d2 = vmlal_u8(round_offset_vec, s2, shift_by_bits);
+ d3 = vmlal_u8(round_offset_vec, s3, shift_by_bits);
+
+ store_u16_8x4(d, dst_stride, d0, d1, d2, d3);
+
+ s += 8;
+ d += 8;
+ width -= 8;
+ } while (width != 0);
+ src += 4 * src_stride;
+ dst += 4 * dst_stride;
+ height -= 4;
+ } while (height != 0);
}
}
void av1_dist_wtd_convolve_2d_copy_neon(const uint8_t *src, int src_stride,
uint8_t *dst8, int dst8_stride, int w,
int h, ConvolveParams *conv_params) {
- uint8x8_t res0_8, res1_8, res2_8, res3_8, tmp_shift0, tmp_shift1, tmp_shift2,
- tmp_shift3;
- uint16x8_t res_q0, res_q1, res_q2, res_q3, tmp_q0, tmp_q1, tmp_q2, tmp_q3;
- uint16x4_t tmp4, tmp5, tmp6, tmp7, res4, res5, res6, res7;
- const uint8_t *src1, *src2;
- uint8_t *dst8_1;
- CONV_BUF_TYPE *dst = conv_params->dst, *dst_1, *dst_2;
- const int dst_stride = conv_params->dst_stride;
- int x, y;
- const int16_t bits =
- FILTER_BITS * 2 - conv_params->round_1 - conv_params->round_0;
- const int bd = 8;
- const int offset_bits = bd + 2 * FILTER_BITS - conv_params->round_0;
- const int round_offset = (1 << (offset_bits - conv_params->round_1)) +
- (1 << (offset_bits - conv_params->round_1 - 1));
- const int16x4_t sub_const_vec = vdup_n_s16((int16_t)round_offset);
- const uint16x8_t dup_round_offset16x8 = vdupq_n_u16((uint16_t)round_offset);
- const int16x4_t dup_bits16x4 = vdup_n_s16(bits);
- const int16x8_t dup_bits16x8 = vdupq_n_s16(bits);
-
- if (!(w & 0x07)) {
- for (y = 0; y < (h >> 2); ++y) {
- src1 = src;
- dst8_1 = dst8;
- dst_1 = dst;
- for (x = 0; x < (w >> 3); ++x) {
- src2 = src1;
- load_u8_8x4(src2, src_stride, &res0_8, &res1_8, &res2_8, &res3_8);
-
- res_q0 = vaddq_u16(vshlq_u16(vmovl_u8(res0_8), dup_bits16x8),
- dup_round_offset16x8);
- res_q1 = vaddq_u16(vshlq_u16(vmovl_u8(res1_8), dup_bits16x8),
- dup_round_offset16x8);
- res_q2 = vaddq_u16(vshlq_u16(vmovl_u8(res2_8), dup_bits16x8),
- dup_round_offset16x8);
- res_q3 = vaddq_u16(vshlq_u16(vmovl_u8(res3_8), dup_bits16x8),
- dup_round_offset16x8);
-
- if (conv_params->do_average) {
- dst_2 = dst_1;
- load_u16_8x4(dst_2, dst_stride, &tmp_q0, &tmp_q1, &tmp_q2, &tmp_q3);
-
- compute_avg_8x4(tmp_q0, tmp_q1, tmp_q2, tmp_q3, res_q0, res_q1,
- res_q2, res_q3, conv_params->fwd_offset,
- conv_params->bck_offset, sub_const_vec, bits,
- conv_params->use_dist_wtd_comp_avg, &tmp_shift0,
- &tmp_shift1, &tmp_shift2, &tmp_shift3);
-
- vst1_u8(dst8_1 + (0 * dst8_stride), tmp_shift0);
- vst1_u8(dst8_1 + (1 * dst8_stride), tmp_shift1);
- vst1_u8(dst8_1 + (2 * dst8_stride), tmp_shift2);
- vst1_u8(dst8_1 + (3 * dst8_stride), tmp_shift3);
-
- } else {
- vst1q_u16(dst_1 + (0 * dst_stride), res_q0);
- vst1q_u16(dst_1 + (1 * dst_stride), res_q1);
- vst1q_u16(dst_1 + (2 * dst_stride), res_q2);
- vst1q_u16(dst_1 + (3 * dst_stride), res_q3);
- }
- src1 = src1 + 8;
- dst_1 = dst_1 + 8;
- dst8_1 = dst8_1 + 8;
- }
- src += src_stride * 4;
- dst8 += dst8_stride * 4;
- dst += dst_stride * 4;
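+  // Dispatch once per call on do_average / use_dist_wtd_comp_avg instead of
+  // testing these flags inside the pixel loops as the deleted code did.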
+ if (conv_params->do_average) {
+ if (UNLIKELY(conv_params->use_dist_wtd_comp_avg)) {
+ dist_wtd_convolve_2d_copy_dist_wtd_avg_neon(
+ src, src_stride, dst8, dst8_stride, w, h, conv_params);
+ } else {
+ dist_wtd_convolve_2d_copy_avg_neon(src, src_stride, dst8, dst8_stride, w,
+ h, conv_params);
}
- } else if (!(w & 0x03)) {
- for (y = 0; y < (h >> 2); ++y) {
- src1 = src;
- dst8_1 = dst8;
- dst_1 = dst;
-
- load_u8_8x4(src1, src_stride, &res0_8, &res1_8, &res2_8, &res3_8);
-
- res4 = vadd_u16(vshl_u16(vget_low_u16(vmovl_u8(res0_8)), dup_bits16x4),
- vreinterpret_u16_s16(sub_const_vec));
- res5 = vadd_u16(vshl_u16(vget_low_u16(vmovl_u8(res1_8)), dup_bits16x4),
- vreinterpret_u16_s16(sub_const_vec));
- res6 = vadd_u16(vshl_u16(vget_low_u16(vmovl_u8(res2_8)), dup_bits16x4),
- vreinterpret_u16_s16(sub_const_vec));
- res7 = vadd_u16(vshl_u16(vget_low_u16(vmovl_u8(res3_8)), dup_bits16x4),
- vreinterpret_u16_s16(sub_const_vec));
- if (conv_params->do_average) {
- load_u16_4x4(dst_1, dst_stride, &tmp4, &tmp5, &tmp6, &tmp7);
-
- compute_avg_4x4(tmp4, tmp5, tmp6, tmp7, res4, res5, res6, res7,
- conv_params->fwd_offset, conv_params->bck_offset,
- sub_const_vec, bits, conv_params->use_dist_wtd_comp_avg,
- &tmp_shift0, &tmp_shift1);
-
- vst1_lane_u32((uint32_t *)(dst8_1), vreinterpret_u32_u8(tmp_shift0), 0);
- dst8_1 += dst8_stride;
- vst1_lane_u32((uint32_t *)(dst8_1), vreinterpret_u32_u8(tmp_shift0), 1);
- dst8_1 += dst8_stride;
- vst1_lane_u32((uint32_t *)(dst8_1), vreinterpret_u32_u8(tmp_shift1), 0);
- dst8_1 += dst8_stride;
- vst1_lane_u32((uint32_t *)(dst8_1), vreinterpret_u32_u8(tmp_shift1), 1);
-
- } else {
- vst1_u16(dst_1, res4);
- dst_1 += dst_stride;
- vst1_u16(dst_1, res5);
- dst_1 += dst_stride;
- vst1_u16(dst_1, res6);
- dst_1 += dst_stride;
- vst1_u16(dst_1, res7);
- }
- src += src_stride * 4;
- dst += dst_stride * 4;
- dst8 += dst8_stride * 4;
- }
+ } else {
+ dist_wtd_convolve_2d_copy_neon(src, src_stride, w, h, conv_params);
}
}
-#if defined(__aarch64__) && defined(__ARM_FEATURE_MATMUL_INT8)
+#if AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_MATMUL_INT8)
-void av1_dist_wtd_convolve_x_neon(const uint8_t *src, int src_stride,
- uint8_t *dst8, int dst8_stride, int w, int h,
- const InterpFilterParams *filter_params_x,
- const int subpel_x_qn,
- ConvolveParams *conv_params) {
- assert(!(w % 4));
- assert(!(h % 4));
+static INLINE uint16x4_t convolve8_4_x(uint8x16_t samples,
+ const int8x8_t x_filter,
+ const uint8x16x2_t permute_tbl,
+ const int32x4_t round_offset) {
+ uint8x16_t permuted_samples[2];
+ int32x4_t sum;
- const int horiz_offset = filter_params_x->taps / 2 - 1;
- const int bits = FILTER_BITS - conv_params->round_1;
+ // Permute samples ready for dot product.
+ // { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 }
+ permuted_samples[0] = vqtbl1q_u8(samples, permute_tbl.val[0]);
+ // { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 }
+ permuted_samples[1] = vqtbl1q_u8(samples, permute_tbl.val[1]);
+
+ // First 4 output values.
+ sum = vusdotq_lane_s32(round_offset, permuted_samples[0], x_filter, 0);
+ sum = vusdotq_lane_s32(sum, permuted_samples[1], x_filter, 1);
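+  // Each vusdotq_lane_s32 accumulates four filter taps per 32-bit output
+  // lane, so the two calls above apply the full 8-tap filter to all four
+  // output pixels.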
+
+ // We halved the convolution filter values so -1 from the right shift.
+ return vreinterpret_u16_s16(vshrn_n_s32(sum, ROUND0_BITS - 1));
+}
+
+static INLINE uint16x8_t convolve8_8_x(uint8x16_t samples,
+ const int8x8_t x_filter,
+ const uint8x16x3_t permute_tbl,
+ const int32x4_t round_offset) {
+ uint8x16_t permuted_samples[3];
+ int32x4_t sum[2];
+
+ // Permute samples ready for dot product.
+ // { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 }
+ permuted_samples[0] = vqtbl1q_u8(samples, permute_tbl.val[0]);
+ // { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 }
+ permuted_samples[1] = vqtbl1q_u8(samples, permute_tbl.val[1]);
+ // { 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 }
+ permuted_samples[2] = vqtbl1q_u8(samples, permute_tbl.val[2]);
+
+ // First 4 output values.
+ sum[0] = vusdotq_lane_s32(round_offset, permuted_samples[0], x_filter, 0);
+ sum[0] = vusdotq_lane_s32(sum[0], permuted_samples[1], x_filter, 1);
+ // Second 4 output values.
+ sum[1] = vusdotq_lane_s32(round_offset, permuted_samples[1], x_filter, 0);
+ sum[1] = vusdotq_lane_s32(sum[1], permuted_samples[2], x_filter, 1);
+
+ // Narrow and re-pack.
+ // We halved the convolution filter values so -1 from the right shift.
+ int16x8_t res = vcombine_s16(vshrn_n_s32(sum[0], ROUND0_BITS - 1),
+ vshrn_n_s32(sum[1], ROUND0_BITS - 1));
+ return vreinterpretq_u16_s16(res);
+}
+
+static INLINE void dist_wtd_convolve_x_dist_wtd_avg_neon(
+ const uint8_t *src, int src_stride, uint8_t *dst8, int dst8_stride, int w,
+ int h, const InterpFilterParams *filter_params_x, const int subpel_x_qn,
+ ConvolveParams *conv_params) {
+ assert(w % 4 == 0);
+ assert(h % 4 == 0);
+
const int bd = 8;
- const int offset_bits = bd + 2 * FILTER_BITS - conv_params->round_0;
- const int round_offset = (1 << (offset_bits - conv_params->round_1)) +
- (1 << (offset_bits - conv_params->round_1 - 1));
- const int round_bits =
- 2 * FILTER_BITS - conv_params->round_0 - conv_params->round_1;
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const int16_t round_offset = (1 << (offset_bits - COMPOUND_ROUND1_BITS)) +
+ (1 << (offset_bits - COMPOUND_ROUND1_BITS - 1));
+ const int16x8_t round_offset_vec = vdupq_n_s16(round_offset);
+ // A shim of 1 << ((ROUND0_BITS - 1) - 1) enables us to use non-rounding
+ // shifts - which are generally faster than rounding shifts on modern CPUs.
+ // (The extra -1 is needed because we halved the filter values.)
+ const int32x4_t round_offset_shim = vdupq_n_s32(
+ (round_offset << (ROUND0_BITS - 1)) + (1 << ((ROUND0_BITS - 1) - 1)));
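+  // Hedged worked example, assuming ROUND0_BITS = 3 and the round_offset of
+  // 6144 derived above: the shim is (6144 << 2) + (1 << 1) = 24578. Adding
+  // it before the truncating vshrn_n_s32(sum, 2) in convolve8_4_x matches a
+  // rounding shift followed by a separate round_offset addition.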
+
const uint16_t fwd_offset = conv_params->fwd_offset;
const uint16_t bck_offset = conv_params->bck_offset;
- const int use_dist_wtd_comp_avg = conv_params->use_dist_wtd_comp_avg;
- const int16x4_t round_offset64 = vdup_n_s16(round_offset);
- const int16x8_t round_offset128 = vdupq_n_s16(round_offset);
- const int16x8_t shift_round_0 = vdupq_n_s16(-conv_params->round_0 + 1);
- const int16x8_t horiz_const = vdupq_n_s16(bits);
// Horizontal filter.
const int16_t *x_filter_ptr = av1_get_interp_filter_subpel_kernel(
@@ -1343,12 +2133,11 @@
// requirements.
const int8x8_t x_filter = vshrn_n_s16(vld1q_s16(x_filter_ptr), 1);
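+  // Because the taps are even, halving them keeps the kernel exact while
+  // freeing one bit of intermediate headroom; the final right shifts are
+  // reduced by one to compensate (the "-1" comments in convolve8_4_x and
+  // convolve8_8_x above).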
+ const int horiz_offset = filter_params_x->taps / 2 - 1;
const uint8_t *src_ptr = src - horiz_offset;
- CONV_BUF_TYPE *dst = conv_params->dst;
- CONV_BUF_TYPE *dst_ptr = dst;
- uint8_t *dst_u8_ptr = dst8;
+ CONV_BUF_TYPE *dst_ptr = conv_params->dst;
+ uint8_t *dst8_ptr = dst8;
int dst_stride = conv_params->dst_stride;
- int width = w;
int height = h;
if (w == 4) {
@@ -1356,167 +2145,90 @@
do {
uint8x16_t s0, s1, s2, s3;
- int32x4_t d0, d1, d2, d3;
- int16x8_t d01, d23;
- uint16x4_t dd0, dd1, dd2, dd3;
+ uint16x4_t d0, d1, d2, d3, dd0, dd1, dd2, dd3;
uint8x8_t d01_u8, d23_u8;
- s0 = vld1q_u8(src_ptr + 0 * src_stride);
- s1 = vld1q_u8(src_ptr + 1 * src_stride);
- s2 = vld1q_u8(src_ptr + 2 * src_stride);
- s3 = vld1q_u8(src_ptr + 3 * src_stride);
+ load_u8_16x4(src_ptr, src_stride, &s0, &s1, &s2, &s3);
- d0 = convolve8_4_usdot(s0, x_filter, permute_tbl, vdupq_n_s32(0));
- d1 = convolve8_4_usdot(s1, x_filter, permute_tbl, vdupq_n_s32(0));
- d2 = convolve8_4_usdot(s2, x_filter, permute_tbl, vdupq_n_s32(0));
- d3 = convolve8_4_usdot(s3, x_filter, permute_tbl, vdupq_n_s32(0));
+ d0 = convolve8_4_x(s0, x_filter, permute_tbl, round_offset_shim);
+ d1 = convolve8_4_x(s1, x_filter, permute_tbl, round_offset_shim);
+ d2 = convolve8_4_x(s2, x_filter, permute_tbl, round_offset_shim);
+ d3 = convolve8_4_x(s3, x_filter, permute_tbl, round_offset_shim);
- d01 = vcombine_s16(vmovn_s32(d0), vmovn_s32(d1));
- d23 = vcombine_s16(vmovn_s32(d2), vmovn_s32(d3));
+ load_u16_4x4(dst_ptr, dst_stride, &dd0, &dd1, &dd2, &dd3);
- d01 = vqrshlq_s16(d01, shift_round_0);
- d23 = vqrshlq_s16(d23, shift_round_0);
+ compute_dist_wtd_avg_4x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3, fwd_offset,
+ bck_offset, round_offset_vec, &d01_u8, &d23_u8);
- d01 = vrshlq_s16(d01, horiz_const);
- d23 = vrshlq_s16(d23, horiz_const);
-
- d01 = vaddq_s16(d01, round_offset128);
- d23 = vaddq_s16(d23, round_offset128);
-
- if (conv_params->do_average) {
- dd0 = vld1_u16(dst_ptr);
- dst_ptr += dst_stride;
- dd1 = vld1_u16(dst_ptr);
- dst_ptr += dst_stride;
- dd2 = vld1_u16(dst_ptr);
- dst_ptr += dst_stride;
- dd3 = vld1_u16(dst_ptr);
- dst_ptr += dst_stride;
-
- compute_avg_4x4(dd0, dd1, dd2, dd3,
- vreinterpret_u16_s16(vget_low_s16(d01)),
- vreinterpret_u16_s16(vget_high_s16(d01)),
- vreinterpret_u16_s16(vget_low_s16(d23)),
- vreinterpret_u16_s16(vget_high_s16(d23)), fwd_offset,
- bck_offset, round_offset64, round_bits,
- use_dist_wtd_comp_avg, &d01_u8, &d23_u8);
-
- vst1_lane_u32((uint32_t *)dst_u8_ptr, vreinterpret_u32_u8(d01_u8), 0);
- dst_u8_ptr += dst8_stride;
- vst1_lane_u32((uint32_t *)dst_u8_ptr, vreinterpret_u32_u8(d01_u8), 1);
- dst_u8_ptr += dst8_stride;
- vst1_lane_u32((uint32_t *)dst_u8_ptr, vreinterpret_u32_u8(d23_u8), 0);
- dst_u8_ptr += dst8_stride;
- vst1_lane_u32((uint32_t *)dst_u8_ptr, vreinterpret_u32_u8(d23_u8), 1);
- dst_u8_ptr += dst8_stride;
- } else {
- vst1q_lane_u64((uint64_t *)dst_ptr, vreinterpretq_u64_s16(d01), 0);
- dst_ptr += dst_stride;
- vst1q_lane_u64((uint64_t *)dst_ptr, vreinterpretq_u64_s16(d01), 1);
- dst_ptr += dst_stride;
- vst1q_lane_u64((uint64_t *)dst_ptr, vreinterpretq_u64_s16(d23), 0);
- dst_ptr += dst_stride;
- vst1q_lane_u64((uint64_t *)dst_ptr, vreinterpretq_u64_s16(d23), 1);
- dst_ptr += dst_stride;
- }
+ store_u8_4x1(dst8_ptr + 0 * dst8_stride, d01_u8, 0);
+ store_u8_4x1(dst8_ptr + 1 * dst8_stride, d01_u8, 1);
+ store_u8_4x1(dst8_ptr + 2 * dst8_stride, d23_u8, 0);
+ store_u8_4x1(dst8_ptr + 3 * dst8_stride, d23_u8, 1);
src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ dst8_ptr += 4 * dst8_stride;
height -= 4;
- } while (height > 0);
+ } while (height != 0);
} else {
const uint8x16x3_t permute_tbl = vld1q_u8_x3(dot_prod_permute_tbl);
do {
const uint8_t *s = src_ptr;
CONV_BUF_TYPE *d = dst_ptr;
- uint8_t *d_u8 = dst_u8_ptr;
- width = w;
+ uint8_t *d_u8 = dst8_ptr;
+ int width = w;
do {
uint8x16_t s0, s1, s2, s3;
- int16x8_t d0, d1, d2, d3;
- uint16x8_t dd0, dd1, dd2, dd3;
+ uint16x8_t d0, d1, d2, d3, dd0, dd1, dd2, dd3;
uint8x8_t d0_u8, d1_u8, d2_u8, d3_u8;
- s0 = vld1q_u8(s + 0 * src_stride);
- s1 = vld1q_u8(s + 1 * src_stride);
- s2 = vld1q_u8(s + 2 * src_stride);
- s3 = vld1q_u8(s + 3 * src_stride);
+ load_u8_16x4(s, src_stride, &s0, &s1, &s2, &s3);
- d0 = convolve8_8_usdot(s0, x_filter, permute_tbl, vdupq_n_s32(0),
- shift_round_0);
- d1 = convolve8_8_usdot(s1, x_filter, permute_tbl, vdupq_n_s32(0),
- shift_round_0);
- d2 = convolve8_8_usdot(s2, x_filter, permute_tbl, vdupq_n_s32(0),
- shift_round_0);
- d3 = convolve8_8_usdot(s3, x_filter, permute_tbl, vdupq_n_s32(0),
- shift_round_0);
+ d0 = convolve8_8_x(s0, x_filter, permute_tbl, round_offset_shim);
+ d1 = convolve8_8_x(s1, x_filter, permute_tbl, round_offset_shim);
+ d2 = convolve8_8_x(s2, x_filter, permute_tbl, round_offset_shim);
+ d3 = convolve8_8_x(s3, x_filter, permute_tbl, round_offset_shim);
- d0 = vrshlq_s16(d0, horiz_const);
- d1 = vrshlq_s16(d1, horiz_const);
- d2 = vrshlq_s16(d2, horiz_const);
- d3 = vrshlq_s16(d3, horiz_const);
+ load_u16_8x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
- d0 = vaddq_s16(d0, round_offset128);
- d1 = vaddq_s16(d1, round_offset128);
- d2 = vaddq_s16(d2, round_offset128);
- d3 = vaddq_s16(d3, round_offset128);
+ compute_dist_wtd_avg_8x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3, fwd_offset,
+ bck_offset, round_offset_vec, &d0_u8, &d1_u8,
+ &d2_u8, &d3_u8);
- if (conv_params->do_average) {
- load_u16_8x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
-
- compute_avg_8x4(dd0, dd1, dd2, dd3, vreinterpretq_u16_s16(d0),
- vreinterpretq_u16_s16(d1), vreinterpretq_u16_s16(d2),
- vreinterpretq_u16_s16(d3), fwd_offset, bck_offset,
- round_offset64, round_bits, use_dist_wtd_comp_avg,
- &d0_u8, &d1_u8, &d2_u8, &d3_u8);
-
- store_u8_8x4(d_u8, dst8_stride, d0_u8, d1_u8, d2_u8, d3_u8);
- } else {
- store_u16_8x4(d, dst_stride, vreinterpretq_u16_s16(d0),
- vreinterpretq_u16_s16(d1), vreinterpretq_u16_s16(d2),
- vreinterpretq_u16_s16(d3));
- }
+ store_u8_8x4(d_u8, dst8_stride, d0_u8, d1_u8, d2_u8, d3_u8);
s += 8;
d += 8;
d_u8 += 8;
width -= 8;
- } while (width > 0);
-
+ } while (width != 0);
src_ptr += 4 * src_stride;
dst_ptr += 4 * dst_stride;
- dst_u8_ptr += 4 * dst8_stride;
+ dst8_ptr += 4 * dst8_stride;
height -= 4;
- } while (height > 0);
+ } while (height != 0);
}
}
-#elif defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
+static INLINE void dist_wtd_convolve_x_avg_neon(
+ const uint8_t *src, int src_stride, uint8_t *dst8, int dst8_stride, int w,
+ int h, const InterpFilterParams *filter_params_x, const int subpel_x_qn,
+ ConvolveParams *conv_params) {
+ assert(w % 4 == 0);
+ assert(h % 4 == 0);
-void av1_dist_wtd_convolve_x_neon(const uint8_t *src, int src_stride,
- uint8_t *dst8, int dst8_stride, int w, int h,
- const InterpFilterParams *filter_params_x,
- const int subpel_x_qn,
- ConvolveParams *conv_params) {
- assert(!(w % 4));
- assert(!(h % 4));
-
- const int horiz_offset = filter_params_x->taps / 2 - 1;
- const int bits = FILTER_BITS - conv_params->round_1;
const int bd = 8;
- const int offset_bits = bd + 2 * FILTER_BITS - conv_params->round_0;
- const int round_offset = (1 << (offset_bits - conv_params->round_1)) +
- (1 << (offset_bits - conv_params->round_1 - 1));
- const int round_bits =
- 2 * FILTER_BITS - conv_params->round_0 - conv_params->round_1;
- const uint16_t fwd_offset = conv_params->fwd_offset;
- const uint16_t bck_offset = conv_params->bck_offset;
- const int use_dist_wtd_comp_avg = conv_params->use_dist_wtd_comp_avg;
- const int16x4_t round_offset64 = vdup_n_s16(round_offset);
- const int16x8_t round_offset128 = vdupq_n_s16(round_offset);
- const int16x8_t shift_round_0 = vdupq_n_s16(-conv_params->round_0 + 1);
- const int16x8_t horiz_const = vdupq_n_s16(bits);
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const int16_t round_offset = (1 << (offset_bits - COMPOUND_ROUND1_BITS)) +
+ (1 << (offset_bits - COMPOUND_ROUND1_BITS - 1));
+ const int16x8_t round_offset_vec = vdupq_n_s16(round_offset);
+ // A shim of 1 << ((ROUND0_BITS - 1) - 1) enables us to use non-rounding
+ // shifts - which are generally faster than rounding shifts on modern CPUs.
+ // (The extra -1 is needed because we halved the filter values.)
+ const int32x4_t round_offset_shim = vdupq_n_s32(
+ (round_offset << (ROUND0_BITS - 1)) + (1 << ((ROUND0_BITS - 1) - 1)));
// Horizontal filter.
const int16_t *x_filter_ptr = av1_get_interp_filter_subpel_kernel(
@@ -1524,17 +2236,267 @@
// Filter values are even, so downshift by 1 to reduce intermediate precision
// requirements.
const int8x8_t x_filter = vshrn_n_s16(vld1q_s16(x_filter_ptr), 1);
- // Dot-product constants.
+
+ const int horiz_offset = filter_params_x->taps / 2 - 1;
+ const uint8_t *src_ptr = src - horiz_offset;
+ CONV_BUF_TYPE *dst_ptr = conv_params->dst;
+ uint8_t *dst8_ptr = dst8;
+ int dst_stride = conv_params->dst_stride;
+ int height = h;
+
+ if (w == 4) {
+ const uint8x16x2_t permute_tbl = vld1q_u8_x2(dot_prod_permute_tbl);
+
+ do {
+ uint8x16_t s0, s1, s2, s3;
+ uint16x4_t d0, d1, d2, d3, dd0, dd1, dd2, dd3;
+ uint8x8_t d01_u8, d23_u8;
+
+ load_u8_16x4(src_ptr, src_stride, &s0, &s1, &s2, &s3);
+
+ d0 = convolve8_4_x(s0, x_filter, permute_tbl, round_offset_shim);
+ d1 = convolve8_4_x(s1, x_filter, permute_tbl, round_offset_shim);
+ d2 = convolve8_4_x(s2, x_filter, permute_tbl, round_offset_shim);
+ d3 = convolve8_4_x(s3, x_filter, permute_tbl, round_offset_shim);
+
+ load_u16_4x4(dst_ptr, dst_stride, &dd0, &dd1, &dd2, &dd3);
+
+ compute_basic_avg_4x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3,
+ round_offset_vec, &d01_u8, &d23_u8);
+
+ store_u8_4x1(dst8_ptr + 0 * dst8_stride, d01_u8, 0);
+ store_u8_4x1(dst8_ptr + 1 * dst8_stride, d01_u8, 1);
+ store_u8_4x1(dst8_ptr + 2 * dst8_stride, d23_u8, 0);
+ store_u8_4x1(dst8_ptr + 3 * dst8_stride, d23_u8, 1);
+
+ src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ dst8_ptr += 4 * dst8_stride;
+ height -= 4;
+ } while (height != 0);
+ } else {
+ const uint8x16x3_t permute_tbl = vld1q_u8_x3(dot_prod_permute_tbl);
+
+ do {
+ const uint8_t *s = src_ptr;
+ CONV_BUF_TYPE *d = dst_ptr;
+ uint8_t *d_u8 = dst8_ptr;
+ int width = w;
+
+ do {
+ uint8x16_t s0, s1, s2, s3;
+ uint16x8_t d0, d1, d2, d3, dd0, dd1, dd2, dd3;
+ uint8x8_t d0_u8, d1_u8, d2_u8, d3_u8;
+
+ load_u8_16x4(s, src_stride, &s0, &s1, &s2, &s3);
+
+ d0 = convolve8_8_x(s0, x_filter, permute_tbl, round_offset_shim);
+ d1 = convolve8_8_x(s1, x_filter, permute_tbl, round_offset_shim);
+ d2 = convolve8_8_x(s2, x_filter, permute_tbl, round_offset_shim);
+ d3 = convolve8_8_x(s3, x_filter, permute_tbl, round_offset_shim);
+
+ load_u16_8x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
+
+ compute_basic_avg_8x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3,
+ round_offset_vec, &d0_u8, &d1_u8, &d2_u8, &d3_u8);
+
+ store_u8_8x4(d_u8, dst8_stride, d0_u8, d1_u8, d2_u8, d3_u8);
+
+ s += 8;
+ d += 8;
+ d_u8 += 8;
+ width -= 8;
+ } while (width != 0);
+ src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ dst8_ptr += 4 * dst8_stride;
+ height -= 4;
+ } while (height != 0);
+ }
+}
+
+static INLINE void dist_wtd_convolve_x_neon(
+ const uint8_t *src, int src_stride, int w, int h,
+ const InterpFilterParams *filter_params_x, const int subpel_x_qn,
+ ConvolveParams *conv_params) {
+ assert(w % 4 == 0);
+ assert(h % 4 == 0);
+
+ const int bd = 8;
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const int16_t round_offset = (1 << (offset_bits - COMPOUND_ROUND1_BITS)) +
+ (1 << (offset_bits - COMPOUND_ROUND1_BITS - 1));
+ // A shim of 1 << ((ROUND0_BITS - 1) - 1) enables us to use non-rounding
+ // shifts - which are generally faster than rounding shifts on modern CPUs.
+ // (The extra -1 is needed because we halved the filter values.)
+ const int32x4_t round_offset_shim = vdupq_n_s32(
+ (round_offset << (ROUND0_BITS - 1)) + (1 << ((ROUND0_BITS - 1) - 1)));
+
+ // Horizontal filter.
+ const int16_t *x_filter_ptr = av1_get_interp_filter_subpel_kernel(
+ filter_params_x, subpel_x_qn & SUBPEL_MASK);
+ // Filter values are even, so downshift by 1 to reduce intermediate precision
+ // requirements.
+ const int8x8_t x_filter = vshrn_n_s16(vld1q_s16(x_filter_ptr), 1);
+
+ const int horiz_offset = filter_params_x->taps / 2 - 1;
+ const uint8_t *src_ptr = src - horiz_offset;
+ CONV_BUF_TYPE *dst_ptr = conv_params->dst;
+ int dst_stride = conv_params->dst_stride;
+ int height = h;
+
+ if (w == 4) {
+ const uint8x16x2_t permute_tbl = vld1q_u8_x2(dot_prod_permute_tbl);
+
+ do {
+ uint8x16_t s0, s1, s2, s3;
+ uint16x4_t d0, d1, d2, d3;
+
+ load_u8_16x4(src_ptr, src_stride, &s0, &s1, &s2, &s3);
+
+ d0 = convolve8_4_x(s0, x_filter, permute_tbl, round_offset_shim);
+ d1 = convolve8_4_x(s1, x_filter, permute_tbl, round_offset_shim);
+ d2 = convolve8_4_x(s2, x_filter, permute_tbl, round_offset_shim);
+ d3 = convolve8_4_x(s3, x_filter, permute_tbl, round_offset_shim);
+
+ store_u16_4x4(dst_ptr, dst_stride, d0, d1, d2, d3);
+
+ src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ height -= 4;
+ } while (height != 0);
+ } else {
+ const uint8x16x3_t permute_tbl = vld1q_u8_x3(dot_prod_permute_tbl);
+
+ do {
+ const uint8_t *s = src_ptr;
+ CONV_BUF_TYPE *d = dst_ptr;
+ int width = w;
+
+ do {
+ uint8x16_t s0, s1, s2, s3;
+ uint16x8_t d0, d1, d2, d3;
+
+ load_u8_16x4(s, src_stride, &s0, &s1, &s2, &s3);
+
+ d0 = convolve8_8_x(s0, x_filter, permute_tbl, round_offset_shim);
+ d1 = convolve8_8_x(s1, x_filter, permute_tbl, round_offset_shim);
+ d2 = convolve8_8_x(s2, x_filter, permute_tbl, round_offset_shim);
+ d3 = convolve8_8_x(s3, x_filter, permute_tbl, round_offset_shim);
+
+ store_u16_8x4(d, dst_stride, d0, d1, d2, d3);
+
+ s += 8;
+ d += 8;
+ width -= 8;
+ } while (width != 0);
+ src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ height -= 4;
+ } while (height != 0);
+ }
+}
+
+#elif AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD)
+
+static INLINE uint16x4_t convolve8_4_x(uint8x16_t samples,
+ const int8x8_t x_filter,
+ const int32x4_t correction,
+ const uint8x16_t range_limit,
+ const uint8x16x2_t permute_tbl) {
+ int8x16_t clamped_samples, permuted_samples[2];
+ int32x4_t sum;
+
+ // Clamp sample range to [-128, 127] for 8-bit signed dot product.
+ clamped_samples = vreinterpretq_s8_u8(vsubq_u8(samples, range_limit));
+
+ // Permute samples ready for dot product.
+ // { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 }
+ permuted_samples[0] = vqtbl1q_s8(clamped_samples, permute_tbl.val[0]);
+ // { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 }
+ permuted_samples[1] = vqtbl1q_s8(clamped_samples, permute_tbl.val[1]);
+
+ // Accumulate dot product into 'correction' to account for range clamp.
+ sum = vdotq_lane_s32(correction, permuted_samples[0], x_filter, 0);
+ sum = vdotq_lane_s32(sum, permuted_samples[1], x_filter, 1);
+
+ // We halved the convolution filter values so -1 from the right shift.
+ return vreinterpret_u16_s16(vshrn_n_s32(sum, ROUND0_BITS - 1));
+}
+
+static INLINE uint16x8_t convolve8_8_x(uint8x16_t samples,
+ const int8x8_t x_filter,
+ const int32x4_t correction,
+ const uint8x16_t range_limit,
+ const uint8x16x3_t permute_tbl) {
+ int8x16_t clamped_samples, permuted_samples[3];
+ int32x4_t sum[2];
+
+ // Clamp sample range to [-128, 127] for 8-bit signed dot product.
+ clamped_samples = vreinterpretq_s8_u8(vsubq_u8(samples, range_limit));
+
+  // Permute samples ready for dot product.
+ // { 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 }
+ permuted_samples[0] = vqtbl1q_s8(clamped_samples, permute_tbl.val[0]);
+ // { 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 }
+ permuted_samples[1] = vqtbl1q_s8(clamped_samples, permute_tbl.val[1]);
+ // { 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 }
+ permuted_samples[2] = vqtbl1q_s8(clamped_samples, permute_tbl.val[2]);
+
+ // Accumulate dot product into 'correction' to account for range clamp.
+ // First 4 output values.
+ sum[0] = vdotq_lane_s32(correction, permuted_samples[0], x_filter, 0);
+ sum[0] = vdotq_lane_s32(sum[0], permuted_samples[1], x_filter, 1);
+ // Second 4 output values.
+ sum[1] = vdotq_lane_s32(correction, permuted_samples[1], x_filter, 0);
+ sum[1] = vdotq_lane_s32(sum[1], permuted_samples[2], x_filter, 1);
+
+ // Narrow and re-pack.
+ // We halved the convolution filter values so -1 from the right shift.
+ int16x8_t res = vcombine_s16(vshrn_n_s32(sum[0], ROUND0_BITS - 1),
+ vshrn_n_s32(sum[1], ROUND0_BITS - 1));
+ return vreinterpretq_u16_s16(res);
+}
+
+static INLINE void dist_wtd_convolve_x_dist_wtd_avg_neon(
+ const uint8_t *src, int src_stride, uint8_t *dst8, int dst8_stride, int w,
+ int h, const InterpFilterParams *filter_params_x, const int subpel_x_qn,
+ ConvolveParams *conv_params) {
+ assert(w % 4 == 0);
+ assert(h % 4 == 0);
+
+ const int bd = 8;
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const int16_t round_offset = (1 << (offset_bits - COMPOUND_ROUND1_BITS)) +
+ (1 << (offset_bits - COMPOUND_ROUND1_BITS - 1));
+ const int16x8_t round_offset_vec = vdupq_n_s16(round_offset);
+
+ const uint16_t fwd_offset = conv_params->fwd_offset;
+ const uint16_t bck_offset = conv_params->bck_offset;
+
+ // Horizontal filter.
+ const int16_t *x_filter_ptr = av1_get_interp_filter_subpel_kernel(
+ filter_params_x, subpel_x_qn & SUBPEL_MASK);
+ // Filter values are even, so downshift by 1 to reduce intermediate precision
+ // requirements.
+ const int8x8_t x_filter = vshrn_n_s16(vld1q_s16(x_filter_ptr), 1);
+
+ // Dot-product constants and other shims.
const uint8x16_t range_limit = vdupq_n_u8(128);
const int32_t correction_s32 = vaddlvq_s16(vshll_n_s8(x_filter, 7));
- const int32x4_t correction = vdupq_n_s32(correction_s32);
+ // Fold round_offset into the dot-product filter correction constant. The
+ // additional shim of 1 << ((ROUND0_BITS - 1) - 1) enables us to use non-
+ // rounding shifts - which are generally faster than rounding shifts on
+ // modern CPUs. (The extra -1 is needed because we halved the filter values.)
+ int32x4_t correction =
+ vdupq_n_s32(correction_s32 + (round_offset << (ROUND0_BITS - 1)) +
+ (1 << ((ROUND0_BITS - 1) - 1)));
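+  // Why the clamp-and-correct scheme works (a sketch): for each output,
+  // dot(s - 128, f) + 128 * sum(f) == dot(s, f), and correction_s32 above
+  // is exactly 128 * sum(f). Folding round_offset and the rounding shim
+  // into the same accumulator makes the whole adjustment free.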
+ const int horiz_offset = filter_params_x->taps / 2 - 1;
const uint8_t *src_ptr = src - horiz_offset;
- CONV_BUF_TYPE *dst = conv_params->dst;
- CONV_BUF_TYPE *dst_ptr = dst;
- uint8_t *dst_u8_ptr = dst8;
+ CONV_BUF_TYPE *dst_ptr = conv_params->dst;
+ uint8_t *dst8_ptr = dst8;
int dst_stride = conv_params->dst_stride;
- int width = w;
int height = h;
if (w == 4) {
@@ -1542,305 +2504,435 @@
do {
uint8x16_t s0, s1, s2, s3;
- int32x4_t d0, d1, d2, d3;
- int16x8_t d01, d23;
- uint16x4_t dd0, dd1, dd2, dd3;
+ uint16x4_t d0, d1, d2, d3, dd0, dd1, dd2, dd3;
uint8x8_t d01_u8, d23_u8;
- s0 = vld1q_u8(src_ptr + 0 * src_stride);
- s1 = vld1q_u8(src_ptr + 1 * src_stride);
- s2 = vld1q_u8(src_ptr + 2 * src_stride);
- s3 = vld1q_u8(src_ptr + 3 * src_stride);
+ load_u8_16x4(src_ptr, src_stride, &s0, &s1, &s2, &s3);
- d0 = convolve8_4_sdot(s0, x_filter, correction, range_limit, permute_tbl);
- d1 = convolve8_4_sdot(s1, x_filter, correction, range_limit, permute_tbl);
- d2 = convolve8_4_sdot(s2, x_filter, correction, range_limit, permute_tbl);
- d3 = convolve8_4_sdot(s3, x_filter, correction, range_limit, permute_tbl);
+ d0 = convolve8_4_x(s0, x_filter, correction, range_limit, permute_tbl);
+ d1 = convolve8_4_x(s1, x_filter, correction, range_limit, permute_tbl);
+ d2 = convolve8_4_x(s2, x_filter, correction, range_limit, permute_tbl);
+ d3 = convolve8_4_x(s3, x_filter, correction, range_limit, permute_tbl);
- d01 = vcombine_s16(vmovn_s32(d0), vmovn_s32(d1));
- d23 = vcombine_s16(vmovn_s32(d2), vmovn_s32(d3));
+ load_u16_4x4(dst_ptr, dst_stride, &dd0, &dd1, &dd2, &dd3);
- d01 = vqrshlq_s16(d01, shift_round_0);
- d23 = vqrshlq_s16(d23, shift_round_0);
+ compute_dist_wtd_avg_4x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3, fwd_offset,
+ bck_offset, round_offset_vec, &d01_u8, &d23_u8);
- d01 = vrshlq_s16(d01, horiz_const);
- d23 = vrshlq_s16(d23, horiz_const);
-
- d01 = vaddq_s16(d01, round_offset128);
- d23 = vaddq_s16(d23, round_offset128);
-
- if (conv_params->do_average) {
- dd0 = vld1_u16(dst_ptr);
- dst_ptr += dst_stride;
- dd1 = vld1_u16(dst_ptr);
- dst_ptr += dst_stride;
- dd2 = vld1_u16(dst_ptr);
- dst_ptr += dst_stride;
- dd3 = vld1_u16(dst_ptr);
- dst_ptr += dst_stride;
-
- compute_avg_4x4(dd0, dd1, dd2, dd3,
- vreinterpret_u16_s16(vget_low_s16(d01)),
- vreinterpret_u16_s16(vget_high_s16(d01)),
- vreinterpret_u16_s16(vget_low_s16(d23)),
- vreinterpret_u16_s16(vget_high_s16(d23)), fwd_offset,
- bck_offset, round_offset64, round_bits,
- use_dist_wtd_comp_avg, &d01_u8, &d23_u8);
-
- vst1_lane_u32((uint32_t *)dst_u8_ptr, vreinterpret_u32_u8(d01_u8), 0);
- dst_u8_ptr += dst8_stride;
- vst1_lane_u32((uint32_t *)dst_u8_ptr, vreinterpret_u32_u8(d01_u8), 1);
- dst_u8_ptr += dst8_stride;
- vst1_lane_u32((uint32_t *)dst_u8_ptr, vreinterpret_u32_u8(d23_u8), 0);
- dst_u8_ptr += dst8_stride;
- vst1_lane_u32((uint32_t *)dst_u8_ptr, vreinterpret_u32_u8(d23_u8), 1);
- dst_u8_ptr += dst8_stride;
- } else {
- vst1q_lane_u64((uint64_t *)dst_ptr, vreinterpretq_u64_s16(d01), 0);
- dst_ptr += dst_stride;
- vst1q_lane_u64((uint64_t *)dst_ptr, vreinterpretq_u64_s16(d01), 1);
- dst_ptr += dst_stride;
- vst1q_lane_u64((uint64_t *)dst_ptr, vreinterpretq_u64_s16(d23), 0);
- dst_ptr += dst_stride;
- vst1q_lane_u64((uint64_t *)dst_ptr, vreinterpretq_u64_s16(d23), 1);
- dst_ptr += dst_stride;
- }
+ store_u8_4x1(dst8_ptr + 0 * dst8_stride, d01_u8, 0);
+ store_u8_4x1(dst8_ptr + 1 * dst8_stride, d01_u8, 1);
+ store_u8_4x1(dst8_ptr + 2 * dst8_stride, d23_u8, 0);
+ store_u8_4x1(dst8_ptr + 3 * dst8_stride, d23_u8, 1);
src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ dst8_ptr += 4 * dst8_stride;
height -= 4;
- } while (height > 0);
+ } while (height != 0);
} else {
const uint8x16x3_t permute_tbl = vld1q_u8_x3(dot_prod_permute_tbl);
do {
const uint8_t *s = src_ptr;
CONV_BUF_TYPE *d = dst_ptr;
- uint8_t *d_u8 = dst_u8_ptr;
- width = w;
+ uint8_t *d_u8 = dst8_ptr;
+ int width = w;
do {
uint8x16_t s0, s1, s2, s3;
- int16x8_t d0, d1, d2, d3;
- uint16x8_t dd0, dd1, dd2, dd3;
+ uint16x8_t d0, d1, d2, d3, dd0, dd1, dd2, dd3;
uint8x8_t d0_u8, d1_u8, d2_u8, d3_u8;
- s0 = vld1q_u8(s + 0 * src_stride);
- s1 = vld1q_u8(s + 1 * src_stride);
- s2 = vld1q_u8(s + 2 * src_stride);
- s3 = vld1q_u8(s + 3 * src_stride);
+ load_u8_16x4(s, src_stride, &s0, &s1, &s2, &s3);
- d0 = convolve8_8_sdot(s0, x_filter, correction, range_limit,
- permute_tbl, shift_round_0);
- d1 = convolve8_8_sdot(s1, x_filter, correction, range_limit,
- permute_tbl, shift_round_0);
- d2 = convolve8_8_sdot(s2, x_filter, correction, range_limit,
- permute_tbl, shift_round_0);
- d3 = convolve8_8_sdot(s3, x_filter, correction, range_limit,
- permute_tbl, shift_round_0);
+ d0 = convolve8_8_x(s0, x_filter, correction, range_limit, permute_tbl);
+ d1 = convolve8_8_x(s1, x_filter, correction, range_limit, permute_tbl);
+ d2 = convolve8_8_x(s2, x_filter, correction, range_limit, permute_tbl);
+ d3 = convolve8_8_x(s3, x_filter, correction, range_limit, permute_tbl);
- d0 = vrshlq_s16(d0, horiz_const);
- d1 = vrshlq_s16(d1, horiz_const);
- d2 = vrshlq_s16(d2, horiz_const);
- d3 = vrshlq_s16(d3, horiz_const);
+ load_u16_8x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
- d0 = vaddq_s16(d0, round_offset128);
- d1 = vaddq_s16(d1, round_offset128);
- d2 = vaddq_s16(d2, round_offset128);
- d3 = vaddq_s16(d3, round_offset128);
+ compute_dist_wtd_avg_8x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3, fwd_offset,
+ bck_offset, round_offset_vec, &d0_u8, &d1_u8,
+ &d2_u8, &d3_u8);
- if (conv_params->do_average) {
- load_u16_8x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
-
- compute_avg_8x4(dd0, dd1, dd2, dd3, vreinterpretq_u16_s16(d0),
- vreinterpretq_u16_s16(d1), vreinterpretq_u16_s16(d2),
- vreinterpretq_u16_s16(d3), fwd_offset, bck_offset,
- round_offset64, round_bits, use_dist_wtd_comp_avg,
- &d0_u8, &d1_u8, &d2_u8, &d3_u8);
-
- store_u8_8x4(d_u8, dst8_stride, d0_u8, d1_u8, d2_u8, d3_u8);
- } else {
- store_u16_8x4(d, dst_stride, vreinterpretq_u16_s16(d0),
- vreinterpretq_u16_s16(d1), vreinterpretq_u16_s16(d2),
- vreinterpretq_u16_s16(d3));
- }
+ store_u8_8x4(d_u8, dst8_stride, d0_u8, d1_u8, d2_u8, d3_u8);
s += 8;
d += 8;
d_u8 += 8;
width -= 8;
- } while (width > 0);
-
+ } while (width != 0);
src_ptr += 4 * src_stride;
dst_ptr += 4 * dst_stride;
- dst_u8_ptr += 4 * dst8_stride;
+ dst8_ptr += 4 * dst8_stride;
height -= 4;
- } while (height > 0);
+ } while (height != 0);
}
}
-#else // !(defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD))
+static INLINE void dist_wtd_convolve_x_avg_neon(
+ const uint8_t *src, int src_stride, uint8_t *dst8, int dst8_stride, int w,
+ int h, const InterpFilterParams *filter_params_x, const int subpel_x_qn,
+ ConvolveParams *conv_params) {
+ assert(w % 4 == 0);
+ assert(h % 4 == 0);
-void av1_dist_wtd_convolve_x_neon(const uint8_t *src, int src_stride,
- uint8_t *dst8, int dst8_stride, int w, int h,
- const InterpFilterParams *filter_params_x,
- const int subpel_x_qn,
- ConvolveParams *conv_params) {
- assert(!(w % 4));
- assert(!(h % 4));
-
- CONV_BUF_TYPE *dst = conv_params->dst;
- int dst_stride = conv_params->dst_stride;
- const int horiz_offset = filter_params_x->taps / 2 - 1;
- const int bits = FILTER_BITS - conv_params->round_1;
const int bd = 8;
- const int offset_bits = bd + 2 * FILTER_BITS - conv_params->round_0;
- const int round_offset = (1 << (offset_bits - conv_params->round_1)) +
- (1 << (offset_bits - conv_params->round_1 - 1));
- const int round_bits =
- 2 * FILTER_BITS - conv_params->round_0 - conv_params->round_1;
- const uint16_t fwd_offset = conv_params->fwd_offset;
- const uint16_t bck_offset = conv_params->bck_offset;
- const int use_dist_wtd_comp_avg = conv_params->use_dist_wtd_comp_avg;
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const int16_t round_offset = (1 << (offset_bits - COMPOUND_ROUND1_BITS)) +
+ (1 << (offset_bits - COMPOUND_ROUND1_BITS - 1));
+ const int16x8_t round_offset_vec = vdupq_n_s16(round_offset);
- // horizontal filter
+ // Horizontal filter.
const int16_t *x_filter_ptr = av1_get_interp_filter_subpel_kernel(
filter_params_x, subpel_x_qn & SUBPEL_MASK);
+ // Filter values are even, so downshift by 1 to reduce intermediate precision
+ // requirements.
+ const int8x8_t x_filter = vshrn_n_s16(vld1q_s16(x_filter_ptr), 1);
+ // Dot-product constants and other shims.
+ const uint8x16_t range_limit = vdupq_n_u8(128);
+ const int32_t correction_s32 = vaddlvq_s16(vshll_n_s8(x_filter, 7));
+ // Fold round_offset into the dot-product filter correction constant. The
+ // additional shim of 1 << ((ROUND0_BITS - 1) - 1) enables us to use non-
+ // rounding shifts - which are generally faster than rounding shifts on
+ // modern CPUs. (The extra -1 is needed because we halved the filter values.)
+ int32x4_t correction =
+ vdupq_n_s32(correction_s32 + (round_offset << (ROUND0_BITS - 1)) +
+ (1 << ((ROUND0_BITS - 1) - 1)));
+
+ const int horiz_offset = filter_params_x->taps / 2 - 1;
const uint8_t *src_ptr = src - horiz_offset;
+ CONV_BUF_TYPE *dst_ptr = conv_params->dst;
+ uint8_t *dst8_ptr = dst8;
+ int dst_stride = conv_params->dst_stride;
+ int height = h;
+ if (w == 4) {
+ const uint8x16x2_t permute_tbl = vld1q_u8_x2(dot_prod_permute_tbl);
+
+ do {
+ uint8x16_t s0, s1, s2, s3;
+ uint16x4_t d0, d1, d2, d3, dd0, dd1, dd2, dd3;
+ uint8x8_t d01_u8, d23_u8;
+
+ load_u8_16x4(src_ptr, src_stride, &s0, &s1, &s2, &s3);
+
+ d0 = convolve8_4_x(s0, x_filter, correction, range_limit, permute_tbl);
+ d1 = convolve8_4_x(s1, x_filter, correction, range_limit, permute_tbl);
+ d2 = convolve8_4_x(s2, x_filter, correction, range_limit, permute_tbl);
+ d3 = convolve8_4_x(s3, x_filter, correction, range_limit, permute_tbl);
+
+ load_u16_4x4(dst_ptr, dst_stride, &dd0, &dd1, &dd2, &dd3);
+
+ compute_basic_avg_4x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3,
+ round_offset_vec, &d01_u8, &d23_u8);
+
+ store_u8_4x1(dst8_ptr + 0 * dst8_stride, d01_u8, 0);
+ store_u8_4x1(dst8_ptr + 1 * dst8_stride, d01_u8, 1);
+ store_u8_4x1(dst8_ptr + 2 * dst8_stride, d23_u8, 0);
+ store_u8_4x1(dst8_ptr + 3 * dst8_stride, d23_u8, 1);
+
+ src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ dst8_ptr += 4 * dst8_stride;
+ height -= 4;
+ } while (height != 0);
+ } else {
+ const uint8x16x3_t permute_tbl = vld1q_u8_x3(dot_prod_permute_tbl);
+
+ do {
+ const uint8_t *s = src_ptr;
+ CONV_BUF_TYPE *d = dst_ptr;
+ uint8_t *d_u8 = dst8_ptr;
+ int width = w;
+
+ do {
+ uint8x16_t s0, s1, s2, s3;
+ uint16x8_t d0, d1, d2, d3, dd0, dd1, dd2, dd3;
+ uint8x8_t d0_u8, d1_u8, d2_u8, d3_u8;
+
+ load_u8_16x4(s, src_stride, &s0, &s1, &s2, &s3);
+
+ d0 = convolve8_8_x(s0, x_filter, correction, range_limit, permute_tbl);
+ d1 = convolve8_8_x(s1, x_filter, correction, range_limit, permute_tbl);
+ d2 = convolve8_8_x(s2, x_filter, correction, range_limit, permute_tbl);
+ d3 = convolve8_8_x(s3, x_filter, correction, range_limit, permute_tbl);
+
+ load_u16_8x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
+
+ compute_basic_avg_8x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3,
+ round_offset_vec, &d0_u8, &d1_u8, &d2_u8, &d3_u8);
+
+ store_u8_8x4(d_u8, dst8_stride, d0_u8, d1_u8, d2_u8, d3_u8);
+
+ s += 8;
+ d += 8;
+ d_u8 += 8;
+ width -= 8;
+ } while (width != 0);
+ src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ dst8_ptr += 4 * dst8_stride;
+ height -= 4;
+ } while (height != 0);
+ }
+}
+
+static INLINE void dist_wtd_convolve_x_neon(
+ const uint8_t *src, int src_stride, int w, int h,
+ const InterpFilterParams *filter_params_x, const int subpel_x_qn,
+ ConvolveParams *conv_params) {
+ assert(w % 4 == 0);
+ assert(h % 4 == 0);
+
+ const int bd = 8;
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const int16_t round_offset = (1 << (offset_bits - COMPOUND_ROUND1_BITS)) +
+ (1 << (offset_bits - COMPOUND_ROUND1_BITS - 1));
+
+ // Horizontal filter.
+ const int16_t *x_filter_ptr = av1_get_interp_filter_subpel_kernel(
+ filter_params_x, subpel_x_qn & SUBPEL_MASK);
+ // Filter values are even, so downshift by 1 to reduce intermediate precision
+ // requirements.
+ const int8x8_t x_filter = vshrn_n_s16(vld1q_s16(x_filter_ptr), 1);
+
+ // Dot-product constants and other shims.
+ const uint8x16_t range_limit = vdupq_n_u8(128);
+ const int32_t correction_s32 = vaddlvq_s16(vshll_n_s8(x_filter, 7));
+ // Fold round_offset into the dot-product filter correction constant. The
+ // additional shim of 1 << ((ROUND0_BITS - 1) - 1) enables us to use non-
+ // rounding shifts - which are generally faster than rounding shifts on
+ // modern CPUs. (The extra -1 is needed because we halved the filter values.)
+ int32x4_t correction =
+ vdupq_n_s32(correction_s32 + (round_offset << (ROUND0_BITS - 1)) +
+ (1 << ((ROUND0_BITS - 1) - 1)));
+
+ const int horiz_offset = filter_params_x->taps / 2 - 1;
+ const uint8_t *src_ptr = src - horiz_offset;
+ CONV_BUF_TYPE *dst_ptr = conv_params->dst;
+ int dst_stride = conv_params->dst_stride;
+ int height = h;
+
+ if (w == 4) {
+ const uint8x16x2_t permute_tbl = vld1q_u8_x2(dot_prod_permute_tbl);
+
+ do {
+ uint8x16_t s0, s1, s2, s3;
+ uint16x4_t d0, d1, d2, d3;
+
+ load_u8_16x4(src_ptr, src_stride, &s0, &s1, &s2, &s3);
+
+ d0 = convolve8_4_x(s0, x_filter, correction, range_limit, permute_tbl);
+ d1 = convolve8_4_x(s1, x_filter, correction, range_limit, permute_tbl);
+ d2 = convolve8_4_x(s2, x_filter, correction, range_limit, permute_tbl);
+ d3 = convolve8_4_x(s3, x_filter, correction, range_limit, permute_tbl);
+
+ store_u16_4x4(dst_ptr, dst_stride, d0, d1, d2, d3);
+
+ src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ height -= 4;
+ } while (height != 0);
+ } else {
+ const uint8x16x3_t permute_tbl = vld1q_u8_x3(dot_prod_permute_tbl);
+
+ do {
+ const uint8_t *s = src_ptr;
+ CONV_BUF_TYPE *d = dst_ptr;
+ int width = w;
+
+ do {
+ uint8x16_t s0, s1, s2, s3;
+ uint16x8_t d0, d1, d2, d3;
+
+ load_u8_16x4(s, src_stride, &s0, &s1, &s2, &s3);
+
+ d0 = convolve8_8_x(s0, x_filter, correction, range_limit, permute_tbl);
+ d1 = convolve8_8_x(s1, x_filter, correction, range_limit, permute_tbl);
+ d2 = convolve8_8_x(s2, x_filter, correction, range_limit, permute_tbl);
+ d3 = convolve8_8_x(s3, x_filter, correction, range_limit, permute_tbl);
+
+ store_u16_8x4(d, dst_stride, d0, d1, d2, d3);
+
+ s += 8;
+ d += 8;
+ width -= 8;
+ } while (width != 0);
+ src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ height -= 4;
+ } while (height != 0);
+ }
+}
+
+#else // !(AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD))
+
+static INLINE uint16x4_t convolve8_4_x(const int16x4_t s0, const int16x4_t s1,
+ const int16x4_t s2, const int16x4_t s3,
+ const int16x4_t s4, const int16x4_t s5,
+ const int16x4_t s6, const int16x4_t s7,
+ const int16x8_t x_filter,
+ const int16x4_t round_offset) {
+ const int16x4_t x_filter_0_3 = vget_low_s16(x_filter);
+ const int16x4_t x_filter_4_7 = vget_high_s16(x_filter);
+
+ int16x4_t sum = vmul_lane_s16(s0, x_filter_0_3, 0);
+ sum = vmla_lane_s16(sum, s1, x_filter_0_3, 1);
+ sum = vmla_lane_s16(sum, s2, x_filter_0_3, 2);
+ sum = vmla_lane_s16(sum, s3, x_filter_0_3, 3);
+ sum = vmla_lane_s16(sum, s4, x_filter_4_7, 0);
+ sum = vmla_lane_s16(sum, s5, x_filter_4_7, 1);
+ sum = vmla_lane_s16(sum, s6, x_filter_4_7, 2);
+ sum = vmla_lane_s16(sum, s7, x_filter_4_7, 3);
+
+ // We halved the convolution filter values so -1 from the right shift.
+ int16x4_t res = vrsra_n_s16(round_offset, sum, ROUND0_BITS - 1);
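+  // vrsra_n_s16 is a rounding shift-right-and-accumulate, fusing the
+  // rounding shift and the round_offset addition into one instruction.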
+ return vreinterpret_u16_s16(res);
+}
+
+static INLINE uint16x8_t convolve8_8_x(const int16x8_t s0, const int16x8_t s1,
+ const int16x8_t s2, const int16x8_t s3,
+ const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t s6, const int16x8_t s7,
+ const int16x8_t x_filter,
+ const int16x8_t round_offset) {
+ const int16x4_t x_filter_0_3 = vget_low_s16(x_filter);
+ const int16x4_t x_filter_4_7 = vget_high_s16(x_filter);
+
+ int16x8_t sum = vmulq_lane_s16(s0, x_filter_0_3, 0);
+ sum = vmlaq_lane_s16(sum, s1, x_filter_0_3, 1);
+ sum = vmlaq_lane_s16(sum, s2, x_filter_0_3, 2);
+ sum = vmlaq_lane_s16(sum, s3, x_filter_0_3, 3);
+ sum = vmlaq_lane_s16(sum, s4, x_filter_4_7, 0);
+ sum = vmlaq_lane_s16(sum, s5, x_filter_4_7, 1);
+ sum = vmlaq_lane_s16(sum, s6, x_filter_4_7, 2);
+ sum = vmlaq_lane_s16(sum, s7, x_filter_4_7, 3);
+
+ // We halved the convolution filter values so -1 from the right shift.
+ int16x8_t res = vrsraq_n_s16(round_offset, sum, ROUND0_BITS - 1);
+ return vreinterpretq_u16_s16(res);
+}
+
+static INLINE void dist_wtd_convolve_x_dist_wtd_avg_neon(
+ const uint8_t *src, int src_stride, uint8_t *dst8, int dst8_stride, int w,
+ int h, const InterpFilterParams *filter_params_x, const int subpel_x_qn,
+ ConvolveParams *conv_params) {
+ assert(w % 4 == 0);
+ assert(h % 4 == 0);
+
+ const int bd = 8;
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const int16_t round_offset = (1 << (offset_bits - COMPOUND_ROUND1_BITS)) +
+ (1 << (offset_bits - COMPOUND_ROUND1_BITS - 1));
+ const int16x8_t round_offset_vec = vdupq_n_s16(round_offset);
+
+ const uint16_t fwd_offset = conv_params->fwd_offset;
+ const uint16_t bck_offset = conv_params->bck_offset;
+
+ // Horizontal filter.
+ const int16_t *x_filter_ptr = av1_get_interp_filter_subpel_kernel(
+ filter_params_x, subpel_x_qn & SUBPEL_MASK);
// Filter values are even, so downshift by 1 to reduce intermediate precision
// requirements.
const int16x8_t x_filter = vshrq_n_s16(vld1q_s16(x_filter_ptr), 1);
+ const int horiz_offset = filter_params_x->taps / 2 - 1;
+ const uint8_t *src_ptr = src - horiz_offset;
+ CONV_BUF_TYPE *dst_ptr = conv_params->dst;
+ uint8_t *dst8_ptr = dst8;
+ int dst_stride = conv_params->dst_stride;
const uint8_t *s;
uint8_t *d_u8;
- uint8_t *dst_u8_ptr;
- CONV_BUF_TYPE *d, *dst_ptr;
- int width, height;
+ CONV_BUF_TYPE *d;
+ int width;
+ int height = h;
+
uint8x8_t t0;
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
uint8x8_t t1, t2, t3, t4, t5, t6, t7;
-#endif
- s = src_ptr;
- dst_ptr = dst;
- dst_u8_ptr = dst8;
- width = w;
- height = h;
+#endif // AOM_ARCH_AARCH64
- if ((w == 4) || (h == 4)) {
- int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, d0;
- int16x8_t tt0;
- uint16x4_t res4;
-#if defined(__aarch64__)
- int16x4_t s8, s9, s10, d1, d2, d3;
- int16x8_t tt1, tt2, tt3;
- uint16x4_t res5, res6, res7;
- uint32x2_t tu0 = vdup_n_u32(0), tu1 = vdup_n_u32(0);
- int16x8_t u0, u1;
-#else
- int16x4_t temp_0;
-#endif
- const int16x4_t zero = vdup_n_s16(0);
- const int16x4_t round_offset_vec = vdup_n_s16(round_offset);
- const int16x4_t shift_round_0 = vdup_n_s16(-conv_params->round_0 + 1);
- const int16x4_t horiz_const = vdup_n_s16(bits);
+ if (w == 4 || h == 4) {
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, s8;
+ uint16x4_t d0, dd0;
+ uint8x8_t d01;
+#if AOM_ARCH_AARCH64
+ int16x4_t s9, s10;
+ uint16x4_t d1, d2, d3, dd1, dd2, dd3;
+ uint8x8_t d23;
+#endif // AOM_ARCH_AARCH64
+
do {
- s = src_ptr;
d = dst_ptr;
- d_u8 = dst_u8_ptr;
+ d_u8 = dst8_ptr;
width = w;
- __builtin_prefetch(s + 0 * src_stride);
-#if defined(__aarch64__)
- __builtin_prefetch(s + 1 * src_stride);
- __builtin_prefetch(s + 2 * src_stride);
- __builtin_prefetch(s + 3 * src_stride);
- load_u8_8x4(s, src_stride, &t0, &t1, &t2, &t3);
+ __builtin_prefetch(src_ptr + 0 * src_stride);
+#if AOM_ARCH_AARCH64
+ __builtin_prefetch(src_ptr + 1 * src_stride);
+ __builtin_prefetch(src_ptr + 2 * src_stride);
+ __builtin_prefetch(src_ptr + 3 * src_stride);
+
+ load_u8_8x4(src_ptr, src_stride, &t0, &t1, &t2, &t3);
transpose_u8_8x4(&t0, &t1, &t2, &t3);
- tt0 = vreinterpretq_s16_u16(vmovl_u8(t0));
- tt1 = vreinterpretq_s16_u16(vmovl_u8(t1));
- tt2 = vreinterpretq_s16_u16(vmovl_u8(t2));
- tt3 = vreinterpretq_s16_u16(vmovl_u8(t3));
- s0 = vget_low_s16(tt0);
- s1 = vget_low_s16(tt1);
- s2 = vget_low_s16(tt2);
- s3 = vget_low_s16(tt3);
- s4 = vget_high_s16(tt0);
- s5 = vget_high_s16(tt1);
- s6 = vget_high_s16(tt2);
+
+ s0 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s1 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s2 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t2)));
+ s3 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t3)));
+ s4 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s5 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s6 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t2)));
+
__builtin_prefetch(d + 0 * dst_stride);
__builtin_prefetch(d + 1 * dst_stride);
__builtin_prefetch(d + 2 * dst_stride);
__builtin_prefetch(d + 3 * dst_stride);
- s += 7;
+
+ s = src_ptr + 7;
+
do {
- load_unaligned_u8_4x4(s, src_stride, &tu0, &tu1);
- t0 = vreinterpret_u8_u32(tu0);
- t1 = vreinterpret_u8_u32(tu1);
-
+ load_unaligned_u8_4x4(s, src_stride, &t0, &t1);
transpose_u8_4x4(&t0, &t1);
- u0 = vreinterpretq_s16_u16(vmovl_u8(t0));
- u1 = vreinterpretq_s16_u16(vmovl_u8(t1));
- s7 = vget_low_s16(u0);
- s8 = vget_low_s16(u1);
- s9 = vget_high_s16(u0);
- s10 = vget_high_s16(u1);
+ s7 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s8 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s9 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s10 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
- d0 = convolve8_4x4_s16(s0, s1, s2, s3, s4, s5, s6, s7, x_filter, zero,
- shift_round_0);
- d0 = vrshl_s16(d0, horiz_const);
- d0 = vadd_s16(d0, round_offset_vec);
- d1 = convolve8_4x4_s16(s1, s2, s3, s4, s5, s6, s7, s8, x_filter, zero,
- shift_round_0);
- d1 = vrshl_s16(d1, horiz_const);
- d1 = vadd_s16(d1, round_offset_vec);
- d2 = convolve8_4x4_s16(s2, s3, s4, s5, s6, s7, s8, s9, x_filter, zero,
- shift_round_0);
- d2 = vrshl_s16(d2, horiz_const);
- d2 = vadd_s16(d2, round_offset_vec);
- d3 = convolve8_4x4_s16(s3, s4, s5, s6, s7, s8, s9, s10, x_filter, zero,
- shift_round_0);
- d3 = vrshl_s16(d3, horiz_const);
- d3 = vadd_s16(d3, round_offset_vec);
+ d0 = convolve8_4_x(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
+ vget_low_s16(round_offset_vec));
+ d1 = convolve8_4_x(s1, s2, s3, s4, s5, s6, s7, s8, x_filter,
+ vget_low_s16(round_offset_vec));
+ d2 = convolve8_4_x(s2, s3, s4, s5, s6, s7, s8, s9, x_filter,
+ vget_low_s16(round_offset_vec));
+ d3 = convolve8_4_x(s3, s4, s5, s6, s7, s8, s9, s10, x_filter,
+ vget_low_s16(round_offset_vec));
- transpose_s16_4x4d(&d0, &d1, &d2, &d3);
+ transpose_u16_4x4d(&d0, &d1, &d2, &d3);
- if (conv_params->do_average) {
- __builtin_prefetch(d + 0 * dst_stride);
- __builtin_prefetch(d + 1 * dst_stride);
- __builtin_prefetch(d + 2 * dst_stride);
- __builtin_prefetch(d + 3 * dst_stride);
+ __builtin_prefetch(d + 0 * dst_stride);
+ __builtin_prefetch(d + 1 * dst_stride);
+ __builtin_prefetch(d + 2 * dst_stride);
+ __builtin_prefetch(d + 3 * dst_stride);
- __builtin_prefetch(d_u8 + 0 * dst8_stride);
- __builtin_prefetch(d_u8 + 1 * dst8_stride);
- __builtin_prefetch(d_u8 + 2 * dst8_stride);
- __builtin_prefetch(d_u8 + 3 * dst8_stride);
+ __builtin_prefetch(d_u8 + 0 * dst8_stride);
+ __builtin_prefetch(d_u8 + 1 * dst8_stride);
+ __builtin_prefetch(d_u8 + 2 * dst8_stride);
+ __builtin_prefetch(d_u8 + 3 * dst8_stride);
- load_u16_4x4(d, dst_stride, &res4, &res5, &res6, &res7);
+ load_u16_4x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
- compute_avg_4x4(res4, res5, res6, res7, vreinterpret_u16_s16(d0),
- vreinterpret_u16_s16(d1), vreinterpret_u16_s16(d2),
- vreinterpret_u16_s16(d3), fwd_offset, bck_offset,
- round_offset_vec, round_bits, use_dist_wtd_comp_avg,
- &t0, &t1);
+ compute_dist_wtd_avg_4x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3, fwd_offset,
+ bck_offset, round_offset_vec, &d01, &d23);
- vst1_lane_u32((uint32_t *)d_u8, vreinterpret_u32_u8(t0),
- 0); // 00 01 02 03
- vst1_lane_u32((uint32_t *)(d_u8 + dst8_stride),
- vreinterpret_u32_u8(t0),
- 1); // 10 11 12 13
- vst1_lane_u32((uint32_t *)(d_u8 + 2 * dst8_stride),
- vreinterpret_u32_u8(t1),
- 0); // 20 21 22 23
- vst1_lane_u32((uint32_t *)(d_u8 + 3 * dst8_stride),
- vreinterpret_u32_u8(t1),
- 1); // 30 31 32 33
- } else {
- store_u16_4x4(d, dst_stride, vreinterpret_u16_s16(d0),
- vreinterpret_u16_s16(d1), vreinterpret_u16_s16(d2),
- vreinterpret_u16_s16(d3));
- }
+ store_u8_4x1(d_u8 + 0 * dst8_stride, d01, 0);
+ store_u8_4x1(d_u8 + 1 * dst8_stride, d01, 1);
+ store_u8_4x1(d_u8 + 2 * dst8_stride, d23, 0);
+ store_u8_4x1(d_u8 + 3 * dst8_stride, d23, 1);
s0 = s4;
s1 = s5;
@@ -1849,90 +2941,76 @@
s4 = s8;
s5 = s9;
s6 = s10;
-
s += 4;
- width -= 4;
d += 4;
d_u8 += 4;
- } while (width > 0);
- src_ptr += (src_stride << 2);
- dst_ptr += (dst_stride << 2);
- dst_u8_ptr += (dst8_stride << 2);
+ width -= 4;
+ } while (width != 0);
+ src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ dst8_ptr += 4 * dst8_stride;
height -= 4;
-#else
- t0 = vld1_u8(s); // a0 a1 a2 a3 a4 a5 a6 a7
- tt0 = vreinterpretq_s16_u16(vmovl_u8(t0)); // a0 a1 a2 a3 a4 a5 a6 a7
- s0 = vget_low_s16(tt0); // a0 a1 a2 a3
- s4 = vget_high_s16(tt0); // a4 a5 a6 a7
+#else // !AOM_ARCH_AARCH64
+ t0 = vld1_u8(src_ptr); // a0 a1 a2 a3 a4 a5 a6 a7
+ s0 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s4 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+
__builtin_prefetch(d);
- s += 8;
+ s = src_ptr + 8;
+
do {
t0 = vld1_u8(s); // a8 a9 a10 a11
+ s8 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
- // a8 a9 a10 a11
- s7 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
- temp_0 = s7;
s1 = vext_s16(s0, s4, 1); // a1 a2 a3 a4
s2 = vext_s16(s0, s4, 2); // a2 a3 a4 a5
s3 = vext_s16(s0, s4, 3); // a3 a4 a5 a6
- s5 = vext_s16(s4, s7, 1); // a5 a6 a7 a8
- s6 = vext_s16(s4, s7, 2); // a6 a7 a8 a9
- s7 = vext_s16(s4, s7, 3); // a7 a8 a9 a10
+ s5 = vext_s16(s4, s8, 1); // a5 a6 a7 a8
+ s6 = vext_s16(s4, s8, 2); // a6 a7 a8 a9
+ s7 = vext_s16(s4, s8, 3); // a7 a8 a9 a10
- d0 = convolve8_4x4_s16(s0, s1, s2, s3, s4, s5, s6, s7, x_filter, zero,
- shift_round_0);
- d0 = vrshl_s16(d0, horiz_const);
- d0 = vadd_s16(d0, round_offset_vec);
+ d0 = convolve8_4_x(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
+ vget_low_s16(round_offset_vec));
+
+ __builtin_prefetch(d);
+ __builtin_prefetch(d_u8);
+
+ dd0 = vld1_u16(d);
+
+ compute_dist_wtd_avg_4x1(dd0, d0, fwd_offset, bck_offset,
+ vget_low_s16(round_offset_vec), &d01);
+
+ store_u8_4x1(d_u8, d01, 0);
+
s0 = s4;
- s4 = temp_0;
- if (conv_params->do_average) {
- __builtin_prefetch(d);
- __builtin_prefetch(d_u8);
-
- res4 = vld1_u16(d);
-
- compute_avg_4x1(res4, vreinterpret_u16_s16(d0), fwd_offset,
- bck_offset, round_offset_vec, round_bits,
- use_dist_wtd_comp_avg, &t0);
-
- vst1_lane_u32((uint32_t *)d_u8, vreinterpret_u32_u8(t0),
- 0); // 00 01 02 03
- } else {
- vst1_u16(d, vreinterpret_u16_s16(d0));
- }
-
+ s4 = s8;
s += 4;
- width -= 4;
d += 4;
d_u8 += 4;
- } while (width > 0);
- src_ptr += (src_stride);
- dst_ptr += (dst_stride);
- dst_u8_ptr += (dst8_stride);
+ width -= 4;
+ } while (width != 0);
+ src_ptr += src_stride;
+ dst_ptr += dst_stride;
+ dst8_ptr += dst8_stride;
height--;
-#endif
- } while (height > 0);
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
} else {
- CONV_BUF_TYPE *d_tmp;
- uint8_t *d_u8_tmp;
- int16x8_t s0, s1, s2, s3, s4, s5, s6, s7;
- int16x8_t res0;
- uint16x8_t res8;
- const int16x8_t round_offset128 = vdupq_n_s16(round_offset);
- const int16x4_t round_offset64 = vdup_n_s16(round_offset);
- const int16x8_t shift_round_0 = vdupq_n_s16(-conv_params->round_0 + 1);
- const int16x8_t horiz_const = vdupq_n_s16(bits);
- const int16x8_t zero = vdupq_n_s16(0);
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7, s8;
+ uint16x8_t d0, dd0;
+ uint8x8_t d0_u8;
- d = dst_ptr = dst;
- d_u8 = dst_u8_ptr = dst8;
do {
-#if defined(__aarch64__)
- int16x8_t s11, s12, s13, s14;
- int16x8_t s8, s9, s10;
- int16x8_t res1, res2, res3, res4, res5, res6, res7;
- uint16x8_t res9, res10, res11;
+ d = dst_ptr;
+ d_u8 = dst8_ptr;
+ width = w;
+
+#if AOM_ARCH_AARCH64
+ int16x8_t s9, s10, s11, s12, s13, s14;
+ uint16x8_t d1, d2, d3, d4, d5, d6, d7, dd1, dd2, dd3, dd4, dd5, dd6, dd7;
+ uint8x8_t d1_u8, d2_u8, d3_u8, d4_u8, d5_u8, d6_u8, d7_u8;
+
__builtin_prefetch(src_ptr + 0 * src_stride);
__builtin_prefetch(src_ptr + 1 * src_stride);
__builtin_prefetch(src_ptr + 2 * src_stride);
@@ -1941,8 +3019,10 @@
__builtin_prefetch(src_ptr + 5 * src_stride);
__builtin_prefetch(src_ptr + 6 * src_stride);
__builtin_prefetch(src_ptr + 7 * src_stride);
+
load_u8_8x8(src_ptr, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
transpose_u8_8x8(&t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+
s0 = vreinterpretq_s16_u16(vmovl_u8(t0));
s1 = vreinterpretq_s16_u16(vmovl_u8(t1));
s2 = vreinterpretq_s16_u16(vmovl_u8(t2));
@@ -1951,11 +3031,6 @@
s5 = vreinterpretq_s16_u16(vmovl_u8(t5));
s6 = vreinterpretq_s16_u16(vmovl_u8(t6));
- width = w;
- s = src_ptr + 7;
- d = dst_ptr;
- d_u8_tmp = dst_u8_ptr;
-
__builtin_prefetch(dst_ptr + 0 * dst_stride);
__builtin_prefetch(dst_ptr + 1 * dst_stride);
__builtin_prefetch(dst_ptr + 2 * dst_stride);
@@ -1965,12 +3040,12 @@
__builtin_prefetch(dst_ptr + 6 * dst_stride);
__builtin_prefetch(dst_ptr + 7 * dst_stride);
- do {
- d_u8 = d_u8_tmp;
- d_tmp = d;
+ s = src_ptr + 7;
+ do {
load_u8_8x8(s, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
transpose_u8_8x8(&t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+
s7 = vreinterpretq_s16_u16(vmovl_u8(t0));
s8 = vreinterpretq_s16_u16(vmovl_u8(t1));
s9 = vreinterpretq_s16_u16(vmovl_u8(t2));
@@ -1980,79 +3055,654 @@
s13 = vreinterpretq_s16_u16(vmovl_u8(t6));
s14 = vreinterpretq_s16_u16(vmovl_u8(t7));
- res0 = convolve8_8x8_s16(s0, s1, s2, s3, s4, s5, s6, s7, x_filter, zero,
- shift_round_0);
+ d0 = convolve8_8_x(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
+ round_offset_vec);
+ d1 = convolve8_8_x(s1, s2, s3, s4, s5, s6, s7, s8, x_filter,
+ round_offset_vec);
+ d2 = convolve8_8_x(s2, s3, s4, s5, s6, s7, s8, s9, x_filter,
+ round_offset_vec);
+ d3 = convolve8_8_x(s3, s4, s5, s6, s7, s8, s9, s10, x_filter,
+ round_offset_vec);
+ d4 = convolve8_8_x(s4, s5, s6, s7, s8, s9, s10, s11, x_filter,
+ round_offset_vec);
+ d5 = convolve8_8_x(s5, s6, s7, s8, s9, s10, s11, s12, x_filter,
+ round_offset_vec);
+ d6 = convolve8_8_x(s6, s7, s8, s9, s10, s11, s12, s13, x_filter,
+ round_offset_vec);
+ d7 = convolve8_8_x(s7, s8, s9, s10, s11, s12, s13, s14, x_filter,
+ round_offset_vec);
- res0 = vrshlq_s16(res0, horiz_const);
- res0 = vaddq_s16(res0, round_offset128);
+ transpose_u16_8x8(&d0, &d1, &d2, &d3, &d4, &d5, &d6, &d7);
- res1 = convolve8_8x8_s16(s1, s2, s3, s4, s5, s6, s7, s8, x_filter, zero,
- shift_round_0);
- res1 = vrshlq_s16(res1, horiz_const);
- res1 = vaddq_s16(res1, round_offset128);
- res2 = convolve8_8x8_s16(s2, s3, s4, s5, s6, s7, s8, s9, x_filter, zero,
- shift_round_0);
- res2 = vrshlq_s16(res2, horiz_const);
- res2 = vaddq_s16(res2, round_offset128);
- res3 = convolve8_8x8_s16(s3, s4, s5, s6, s7, s8, s9, s10, x_filter,
- zero, shift_round_0);
- res3 = vrshlq_s16(res3, horiz_const);
- res3 = vaddq_s16(res3, round_offset128);
- res4 = convolve8_8x8_s16(s4, s5, s6, s7, s8, s9, s10, s11, x_filter,
- zero, shift_round_0);
- res4 = vrshlq_s16(res4, horiz_const);
- res4 = vaddq_s16(res4, round_offset128);
- res5 = convolve8_8x8_s16(s5, s6, s7, s8, s9, s10, s11, s12, x_filter,
- zero, shift_round_0);
- res5 = vrshlq_s16(res5, horiz_const);
- res5 = vaddq_s16(res5, round_offset128);
- res6 = convolve8_8x8_s16(s6, s7, s8, s9, s10, s11, s12, s13, x_filter,
- zero, shift_round_0);
- res6 = vrshlq_s16(res6, horiz_const);
- res6 = vaddq_s16(res6, round_offset128);
- res7 = convolve8_8x8_s16(s7, s8, s9, s10, s11, s12, s13, s14, x_filter,
- zero, shift_round_0);
- res7 = vrshlq_s16(res7, horiz_const);
- res7 = vaddq_s16(res7, round_offset128);
+ load_u16_8x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
- transpose_s16_8x8(&res0, &res1, &res2, &res3, &res4, &res5, &res6,
- &res7);
+ compute_dist_wtd_avg_8x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3, fwd_offset,
+ bck_offset, round_offset_vec, &d0_u8, &d1_u8,
+ &d2_u8, &d3_u8);
- if (conv_params->do_average) {
- load_u16_8x4(d_tmp, dst_stride, &res8, &res9, &res10, &res11);
- d_tmp += (dst_stride << 2);
+ store_u8_8x4(d_u8, dst8_stride, d0_u8, d1_u8, d2_u8, d3_u8);
- compute_avg_8x4(res8, res9, res10, res11, vreinterpretq_u16_s16(res0),
- vreinterpretq_u16_s16(res1),
- vreinterpretq_u16_s16(res2),
- vreinterpretq_u16_s16(res3), fwd_offset, bck_offset,
- round_offset64, round_bits, use_dist_wtd_comp_avg,
- &t0, &t1, &t2, &t3);
+ load_u16_8x4(d + 4 * dst_stride, dst_stride, &dd4, &dd5, &dd6, &dd7);
- store_u8_8x4(d_u8, dst8_stride, t0, t1, t2, t3);
- d_u8 += (dst8_stride << 2);
+ compute_dist_wtd_avg_8x4(dd4, dd5, dd6, dd7, d4, d5, d6, d7, fwd_offset,
+ bck_offset, round_offset_vec, &d4_u8, &d5_u8,
+ &d6_u8, &d7_u8);
- load_u16_8x4(d_tmp, dst_stride, &res8, &res9, &res10, &res11);
- d_tmp += (dst_stride << 2);
+ store_u8_8x4(d_u8 + 4 * dst8_stride, dst8_stride, d4_u8, d5_u8, d6_u8,
+ d7_u8);
- compute_avg_8x4(res8, res9, res10, res11, vreinterpretq_u16_s16(res4),
- vreinterpretq_u16_s16(res5),
- vreinterpretq_u16_s16(res6),
- vreinterpretq_u16_s16(res7), fwd_offset, bck_offset,
- round_offset64, round_bits, use_dist_wtd_comp_avg,
- &t0, &t1, &t2, &t3);
+ s0 = s8;
+ s1 = s9;
+ s2 = s10;
+ s3 = s11;
+ s4 = s12;
+ s5 = s13;
+ s6 = s14;
+ s += 8;
+ d += 8;
+ d_u8 += 8;
+ width -= 8;
+ } while (width != 0);
+ src_ptr += 8 * src_stride;
+ dst_ptr += 8 * dst_stride;
+ dst8_ptr += 8 * dst8_stride;
+ height -= 8;
+#else // !AOM_ARCH_AARCH64
+ __builtin_prefetch(src_ptr);
- store_u8_8x4(d_u8, dst8_stride, t0, t1, t2, t3);
- d_u8 += (dst8_stride << 2);
- } else {
- store_u16_8x8(
- d_tmp, dst_stride, vreinterpretq_u16_s16(res0),
- vreinterpretq_u16_s16(res1), vreinterpretq_u16_s16(res2),
- vreinterpretq_u16_s16(res3), vreinterpretq_u16_s16(res4),
- vreinterpretq_u16_s16(res5), vreinterpretq_u16_s16(res6),
- vreinterpretq_u16_s16(res7));
- d_tmp += (dst_stride << 3);
- }
+ t0 = vld1_u8(src_ptr);
+ s0 = vreinterpretq_s16_u16(vmovl_u8(t0)); // a0 a1 a2 a3 a4 a5 a6 a7
+
+ __builtin_prefetch(dst_ptr);
+
+ s = src_ptr + 8;
+
+ do {
+ t0 = vld1_u8(s); // a8 a9 a10 a11 a12 a13 a14 a15
+ s8 = vreinterpretq_s16_u16(vmovl_u8(t0));
+
+ s1 = vextq_s16(s0, s8, 1); // a1 a2 a3 a4 a5 a6 a7 a8
+ s2 = vextq_s16(s0, s8, 2); // a2 a3 a4 a5 a6 a7 a8 a9
+ s3 = vextq_s16(s0, s8, 3); // a3 a4 a5 a6 a7 a8 a9 a10
+ s4 = vextq_s16(s0, s8, 4); // a4 a5 a6 a7 a8 a9 a10 a11
+ s5 = vextq_s16(s0, s8, 5); // a5 a6 a7 a8 a9 a10 a11 a12
+ s6 = vextq_s16(s0, s8, 6); // a6 a7 a8 a9 a10 a11 a12 a13
+ s7 = vextq_s16(s0, s8, 7); // a7 a8 a9 a10 a11 a12 a13 a14
+
+ d0 = convolve8_8_x(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
+ round_offset_vec);
+
+ dd0 = vld1q_u16(d);
+
+ compute_dist_wtd_avg_8x1(dd0, d0, fwd_offset, bck_offset,
+ round_offset_vec, &d0_u8);
+
+ vst1_u8(d_u8, d0_u8);
+
+ s0 = s8;
+ s += 8;
+ d += 8;
+ d_u8 += 8;
+ width -= 8;
+ } while (width != 0);
+ src_ptr += src_stride;
+ dst_ptr += dst_stride;
+ dst8_ptr += dst8_stride;
+ height--;
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
+ }
+}
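
The compute_dist_wtd_avg_{4,8}x{1,4} helpers used above fold the distance-weighted compound average into a few NEON ops. A scalar model of the per-pixel arithmetic they are believed to implement, matching the libaom C reference (assumed constants: DIST_PRECISION_BITS = 4, round_bits = 2 * FILTER_BITS - ROUND0_BITS - COMPOUND_ROUND1_BITS = 4, round_offset = 6144 for 8-bit):

#include <stdint.h>

/* Scalar model of one distance-weighted compound pixel. dd is the stored
 * 16-bit prediction from the conv buffer, d the new convolution result;
 * fwd_offset + bck_offset == 16 in AV1's weight table. */
static uint8_t dist_wtd_avg_pixel(uint16_t dd, uint16_t d,
                                  uint16_t fwd_offset, uint16_t bck_offset,
                                  int32_t round_offset) {
  /* Weighted blend of the two predictions, >> DIST_PRECISION_BITS. */
  int32_t blend = ((int32_t)dd * fwd_offset + (int32_t)d * bck_offset) >> 4;
  /* Remove the compound bias, round-shift by round_bits, clip to 8 bits
   * (the NEON helpers get the clip for free from vqrshrun). */
  int32_t res = (blend - round_offset + 8) >> 4;
  return (uint8_t)(res < 0 ? 0 : (res > 255 ? 255 : res));
}
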
+
+static INLINE void dist_wtd_convolve_x_avg_neon(
+ const uint8_t *src, int src_stride, uint8_t *dst8, int dst8_stride, int w,
+ int h, const InterpFilterParams *filter_params_x, const int subpel_x_qn,
+ ConvolveParams *conv_params) {
+ assert(w % 4 == 0);
+ assert(h % 4 == 0);
+
+ const int bd = 8;
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const int16_t round_offset = (1 << (offset_bits - COMPOUND_ROUND1_BITS)) +
+ (1 << (offset_bits - COMPOUND_ROUND1_BITS - 1));
+ const int16x8_t round_offset_vec = vdupq_n_s16(round_offset);
+
+ // Horizontal filter.
+ const int16_t *x_filter_ptr = av1_get_interp_filter_subpel_kernel(
+ filter_params_x, subpel_x_qn & SUBPEL_MASK);
+ // Filter values are even, so downshift by 1 to reduce intermediate precision
+ // requirements.
+ const int16x8_t x_filter = vshrq_n_s16(vld1q_s16(x_filter_ptr), 1);
+
+ const int horiz_offset = filter_params_x->taps / 2 - 1;
+ const uint8_t *src_ptr = src - horiz_offset;
+ CONV_BUF_TYPE *dst_ptr = conv_params->dst;
+ uint8_t *dst8_ptr = dst8;
+ int dst_stride = conv_params->dst_stride;
+ const uint8_t *s;
+ uint8_t *d_u8;
+ CONV_BUF_TYPE *d;
+ int width;
+ int height = h;
+
+ uint8x8_t t0;
+#if AOM_ARCH_AARCH64
+ uint8x8_t t1, t2, t3, t4, t5, t6, t7;
+#endif // AOM_ARCH_AARCH64
+
+ if (w == 4 || h == 4) {
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, s8;
+ uint16x4_t d0, dd0;
+ uint8x8_t d01;
+#if AOM_ARCH_AARCH64
+ int16x4_t s9, s10;
+ uint16x4_t d1, d2, d3, dd1, dd2, dd3;
+ uint8x8_t d23;
+#endif // AOM_ARCH_AARCH64
+
+ do {
+ d = dst_ptr;
+ d_u8 = dst8_ptr;
+ width = w;
+
+ __builtin_prefetch(src_ptr + 0 * src_stride);
+#if AOM_ARCH_AARCH64
+ __builtin_prefetch(src_ptr + 1 * src_stride);
+ __builtin_prefetch(src_ptr + 2 * src_stride);
+ __builtin_prefetch(src_ptr + 3 * src_stride);
+
+ load_u8_8x4(src_ptr, src_stride, &t0, &t1, &t2, &t3);
+ transpose_u8_8x4(&t0, &t1, &t2, &t3);
+
+ s0 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s1 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s2 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t2)));
+ s3 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t3)));
+ s4 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s5 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s6 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t2)));
+
+ __builtin_prefetch(d + 0 * dst_stride);
+ __builtin_prefetch(d + 1 * dst_stride);
+ __builtin_prefetch(d + 2 * dst_stride);
+ __builtin_prefetch(d + 3 * dst_stride);
+
+ s = src_ptr + 7;
+
+ do {
+ load_unaligned_u8_4x4(s, src_stride, &t0, &t1);
+ transpose_u8_4x4(&t0, &t1);
+
+ s7 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s8 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s9 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s10 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+
+ d0 = convolve8_4_x(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
+ vget_low_s16(round_offset_vec));
+ d1 = convolve8_4_x(s1, s2, s3, s4, s5, s6, s7, s8, x_filter,
+ vget_low_s16(round_offset_vec));
+ d2 = convolve8_4_x(s2, s3, s4, s5, s6, s7, s8, s9, x_filter,
+ vget_low_s16(round_offset_vec));
+ d3 = convolve8_4_x(s3, s4, s5, s6, s7, s8, s9, s10, x_filter,
+ vget_low_s16(round_offset_vec));
+
+ transpose_u16_4x4d(&d0, &d1, &d2, &d3);
+
+ __builtin_prefetch(d + 0 * dst_stride);
+ __builtin_prefetch(d + 1 * dst_stride);
+ __builtin_prefetch(d + 2 * dst_stride);
+ __builtin_prefetch(d + 3 * dst_stride);
+
+ __builtin_prefetch(d_u8 + 0 * dst8_stride);
+ __builtin_prefetch(d_u8 + 1 * dst8_stride);
+ __builtin_prefetch(d_u8 + 2 * dst8_stride);
+ __builtin_prefetch(d_u8 + 3 * dst8_stride);
+
+ load_u16_4x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
+
+ compute_basic_avg_4x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3,
+ round_offset_vec, &d01, &d23);
+
+ store_u8_4x1(d_u8 + 0 * dst8_stride, d01, 0);
+ store_u8_4x1(d_u8 + 1 * dst8_stride, d01, 1);
+ store_u8_4x1(d_u8 + 2 * dst8_stride, d23, 0);
+ store_u8_4x1(d_u8 + 3 * dst8_stride, d23, 1);
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ s += 4;
+ d += 4;
+ d_u8 += 4;
+ width -= 4;
+ } while (width != 0);
+ src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ dst8_ptr += 4 * dst8_stride;
+ height -= 4;
+#else // !AOM_ARCH_AARCH64
+ t0 = vld1_u8(src_ptr); // a0 a1 a2 a3 a4 a5 a6 a7
+ s0 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s4 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+
+ __builtin_prefetch(d);
+
+ s = src_ptr + 8;
+
+ do {
+ t0 = vld1_u8(s); // a8 a9 a10 a11
+ s8 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+
+ s1 = vext_s16(s0, s4, 1); // a1 a2 a3 a4
+ s2 = vext_s16(s0, s4, 2); // a2 a3 a4 a5
+ s3 = vext_s16(s0, s4, 3); // a3 a4 a5 a6
+ s5 = vext_s16(s4, s8, 1); // a5 a6 a7 a8
+ s6 = vext_s16(s4, s8, 2); // a6 a7 a8 a9
+ s7 = vext_s16(s4, s8, 3); // a7 a8 a9 a10
+
+ d0 = convolve8_4_x(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
+ vget_low_s16(round_offset_vec));
+
+ __builtin_prefetch(d);
+ __builtin_prefetch(d_u8);
+
+ dd0 = vld1_u16(d);
+
+ compute_basic_avg_4x1(dd0, d0, vget_low_s16(round_offset_vec), &d01);
+
+ store_u8_4x1(d_u8, d01, 0);
+
+ s0 = s4;
+ s4 = s8;
+ s += 4;
+ d += 4;
+ d_u8 += 4;
+ width -= 4;
+ } while (width != 0);
+ src_ptr += src_stride;
+ dst_ptr += dst_stride;
+ dst8_ptr += dst8_stride;
+ height--;
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
+ } else {
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7, s8;
+ uint16x8_t d0, dd0;
+ uint8x8_t d0_u8;
+
+ do {
+ d = dst_ptr;
+ d_u8 = dst8_ptr;
+ width = w;
+
+#if AOM_ARCH_AARCH64
+ int16x8_t s9, s10, s11, s12, s13, s14;
+ uint16x8_t d1, d2, d3, d4, d5, d6, d7, dd1, dd2, dd3, dd4, dd5, dd6, dd7;
+ uint8x8_t d1_u8, d2_u8, d3_u8, d4_u8, d5_u8, d6_u8, d7_u8;
+
+ __builtin_prefetch(src_ptr + 0 * src_stride);
+ __builtin_prefetch(src_ptr + 1 * src_stride);
+ __builtin_prefetch(src_ptr + 2 * src_stride);
+ __builtin_prefetch(src_ptr + 3 * src_stride);
+ __builtin_prefetch(src_ptr + 4 * src_stride);
+ __builtin_prefetch(src_ptr + 5 * src_stride);
+ __builtin_prefetch(src_ptr + 6 * src_stride);
+ __builtin_prefetch(src_ptr + 7 * src_stride);
+
+ load_u8_8x8(src_ptr, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+ transpose_u8_8x8(&t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+
+ s0 = vreinterpretq_s16_u16(vmovl_u8(t0));
+ s1 = vreinterpretq_s16_u16(vmovl_u8(t1));
+ s2 = vreinterpretq_s16_u16(vmovl_u8(t2));
+ s3 = vreinterpretq_s16_u16(vmovl_u8(t3));
+ s4 = vreinterpretq_s16_u16(vmovl_u8(t4));
+ s5 = vreinterpretq_s16_u16(vmovl_u8(t5));
+ s6 = vreinterpretq_s16_u16(vmovl_u8(t6));
+
+ __builtin_prefetch(dst_ptr + 0 * dst_stride);
+ __builtin_prefetch(dst_ptr + 1 * dst_stride);
+ __builtin_prefetch(dst_ptr + 2 * dst_stride);
+ __builtin_prefetch(dst_ptr + 3 * dst_stride);
+ __builtin_prefetch(dst_ptr + 4 * dst_stride);
+ __builtin_prefetch(dst_ptr + 5 * dst_stride);
+ __builtin_prefetch(dst_ptr + 6 * dst_stride);
+ __builtin_prefetch(dst_ptr + 7 * dst_stride);
+
+ s = src_ptr + 7;
+
+ do {
+ load_u8_8x8(s, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+ transpose_u8_8x8(&t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+
+ s7 = vreinterpretq_s16_u16(vmovl_u8(t0));
+ s8 = vreinterpretq_s16_u16(vmovl_u8(t1));
+ s9 = vreinterpretq_s16_u16(vmovl_u8(t2));
+ s10 = vreinterpretq_s16_u16(vmovl_u8(t3));
+ s11 = vreinterpretq_s16_u16(vmovl_u8(t4));
+ s12 = vreinterpretq_s16_u16(vmovl_u8(t5));
+ s13 = vreinterpretq_s16_u16(vmovl_u8(t6));
+ s14 = vreinterpretq_s16_u16(vmovl_u8(t7));
+
+ d0 = convolve8_8_x(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
+ round_offset_vec);
+ d1 = convolve8_8_x(s1, s2, s3, s4, s5, s6, s7, s8, x_filter,
+ round_offset_vec);
+ d2 = convolve8_8_x(s2, s3, s4, s5, s6, s7, s8, s9, x_filter,
+ round_offset_vec);
+ d3 = convolve8_8_x(s3, s4, s5, s6, s7, s8, s9, s10, x_filter,
+ round_offset_vec);
+ d4 = convolve8_8_x(s4, s5, s6, s7, s8, s9, s10, s11, x_filter,
+ round_offset_vec);
+ d5 = convolve8_8_x(s5, s6, s7, s8, s9, s10, s11, s12, x_filter,
+ round_offset_vec);
+ d6 = convolve8_8_x(s6, s7, s8, s9, s10, s11, s12, s13, x_filter,
+ round_offset_vec);
+ d7 = convolve8_8_x(s7, s8, s9, s10, s11, s12, s13, s14, x_filter,
+ round_offset_vec);
+
+ transpose_u16_8x8(&d0, &d1, &d2, &d3, &d4, &d5, &d6, &d7);
+
+ load_u16_8x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
+
+ compute_basic_avg_8x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3,
+ round_offset_vec, &d0_u8, &d1_u8, &d2_u8, &d3_u8);
+
+ store_u8_8x4(d_u8, dst8_stride, d0_u8, d1_u8, d2_u8, d3_u8);
+
+ load_u16_8x4(d + 4 * dst_stride, dst_stride, &dd4, &dd5, &dd6, &dd7);
+
+ compute_basic_avg_8x4(dd4, dd5, dd6, dd7, d4, d5, d6, d7,
+ round_offset_vec, &d4_u8, &d5_u8, &d6_u8, &d7_u8);
+
+ store_u8_8x4(d_u8 + 4 * dst8_stride, dst8_stride, d4_u8, d5_u8, d6_u8,
+ d7_u8);
+
+ s0 = s8;
+ s1 = s9;
+ s2 = s10;
+ s3 = s11;
+ s4 = s12;
+ s5 = s13;
+ s6 = s14;
+ s += 8;
+ d += 8;
+ d_u8 += 8;
+ width -= 8;
+ } while (width != 0);
+ src_ptr += 8 * src_stride;
+ dst_ptr += 8 * dst_stride;
+ dst8_ptr += 8 * dst8_stride;
+ height -= 8;
+#else // !AOM_ARCH_AARCH64
+ __builtin_prefetch(src_ptr);
+
+ t0 = vld1_u8(src_ptr);
+ s0 = vreinterpretq_s16_u16(vmovl_u8(t0)); // a0 a1 a2 a3 a4 a5 a6 a7
+
+ __builtin_prefetch(dst_ptr);
+
+ s = src_ptr + 8;
+
+ do {
+ t0 = vld1_u8(s); // a8 a9 a10 a11 a12 a13 a14 a15
+ s8 = vreinterpretq_s16_u16(vmovl_u8(t0));
+
+ s1 = vextq_s16(s0, s8, 1); // a1 a2 a3 a4 a5 a6 a7 a8
+ s2 = vextq_s16(s0, s8, 2); // a2 a3 a4 a5 a6 a7 a8 a9
+ s3 = vextq_s16(s0, s8, 3); // a3 a4 a5 a6 a7 a8 a9 a10
+ s4 = vextq_s16(s0, s8, 4); // a4 a5 a6 a7 a8 a9 a10 a11
+ s5 = vextq_s16(s0, s8, 5); // a5 a6 a7 a8 a9 a10 a11 a12
+ s6 = vextq_s16(s0, s8, 6); // a6 a7 a8 a9 a10 a11 a12 a13
+ s7 = vextq_s16(s0, s8, 7); // a7 a8 a9 a10 a11 a12 a13 a14
+
+ d0 = convolve8_8_x(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
+ round_offset_vec);
+
+ dd0 = vld1q_u16(d);
+
+ compute_basic_avg_8x1(dd0, d0, round_offset_vec, &d0_u8);
+
+ vst1_u8(d_u8, d0_u8);
+
+ s0 = s8;
+ s += 8;
+ d += 8;
+ d_u8 += 8;
+ width -= 8;
+ } while (width != 0);
+ src_ptr += src_stride;
+ dst_ptr += dst_stride;
+ dst8_ptr += dst8_stride;
+ height--;
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
+ }
+}
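
This variant differs from the one above only in the averaging step: when use_dist_wtd_comp_avg is unset, the two predictions are blended with equal weight. A minimal scalar sketch of that step, under the same assumed constants:

#include <stdint.h>

/* Scalar model of the equal-weight compound average: (dd + d) >> 1
 * replaces the fwd/bck weighted blend; the bias removal, rounding shift
 * and clip are identical to the distance-weighted variant. */
static uint8_t basic_avg_pixel(uint16_t dd, uint16_t d, int32_t round_offset) {
  int32_t avg = ((int32_t)dd + (int32_t)d) >> 1;
  int32_t res = (avg - round_offset + 8) >> 4; /* round_bits == 4 */
  return (uint8_t)(res < 0 ? 0 : (res > 255 ? 255 : res));
}
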
+
+static INLINE void dist_wtd_convolve_x_neon(
+ const uint8_t *src, int src_stride, int w, int h,
+ const InterpFilterParams *filter_params_x, const int subpel_x_qn,
+ ConvolveParams *conv_params) {
+ assert(w % 4 == 0);
+ assert(h % 4 == 0);
+
+ const int bd = 8;
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const int16_t round_offset = (1 << (offset_bits - COMPOUND_ROUND1_BITS)) +
+ (1 << (offset_bits - COMPOUND_ROUND1_BITS - 1));
+ const int16x8_t round_offset_vec = vdupq_n_s16(round_offset);
+
+ // Horizontal filter.
+ const int16_t *x_filter_ptr = av1_get_interp_filter_subpel_kernel(
+ filter_params_x, subpel_x_qn & SUBPEL_MASK);
+ // Filter values are even, so downshift by 1 to reduce intermediate precision
+ // requirements.
+ const int16x8_t x_filter = vshrq_n_s16(vld1q_s16(x_filter_ptr), 1);
+
+ const int horiz_offset = filter_params_x->taps / 2 - 1;
+ const uint8_t *src_ptr = src - horiz_offset;
+ CONV_BUF_TYPE *dst_ptr = conv_params->dst;
+ int dst_stride = conv_params->dst_stride;
+ const uint8_t *s;
+ CONV_BUF_TYPE *d;
+ int width;
+ int height = h;
+
+ uint8x8_t t0;
+#if AOM_ARCH_AARCH64
+ uint8x8_t t1, t2, t3, t4, t5, t6, t7;
+#endif // AOM_ARCH_AARCH64
+
+ if (w == 4 || h == 4) {
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, s8;
+ uint16x4_t d0;
+#if AOM_ARCH_AARCH64
+ int16x4_t s9, s10;
+ uint16x4_t d1, d2, d3;
+#endif // AOM_ARCH_AARCH64
+
+ do {
+ d = dst_ptr;
+ width = w;
+
+ __builtin_prefetch(src_ptr + 0 * src_stride);
+#if AOM_ARCH_AARCH64
+ __builtin_prefetch(src_ptr + 1 * src_stride);
+ __builtin_prefetch(src_ptr + 2 * src_stride);
+ __builtin_prefetch(src_ptr + 3 * src_stride);
+
+ load_u8_8x4(src_ptr, src_stride, &t0, &t1, &t2, &t3);
+ transpose_u8_8x4(&t0, &t1, &t2, &t3);
+
+ s0 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s1 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s2 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t2)));
+ s3 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t3)));
+ s4 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s5 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s6 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t2)));
+
+ __builtin_prefetch(d + 0 * dst_stride);
+ __builtin_prefetch(d + 1 * dst_stride);
+ __builtin_prefetch(d + 2 * dst_stride);
+ __builtin_prefetch(d + 3 * dst_stride);
+
+ s = src_ptr + 7;
+
+ do {
+ load_unaligned_u8_4x4(s, src_stride, &t0, &t1);
+ transpose_u8_4x4(&t0, &t1);
+
+ s7 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s8 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+ s9 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s10 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t1)));
+
+ d0 = convolve8_4_x(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
+ vget_low_s16(round_offset_vec));
+ d1 = convolve8_4_x(s1, s2, s3, s4, s5, s6, s7, s8, x_filter,
+ vget_low_s16(round_offset_vec));
+ d2 = convolve8_4_x(s2, s3, s4, s5, s6, s7, s8, s9, x_filter,
+ vget_low_s16(round_offset_vec));
+ d3 = convolve8_4_x(s3, s4, s5, s6, s7, s8, s9, s10, x_filter,
+ vget_low_s16(round_offset_vec));
+
+ transpose_u16_4x4d(&d0, &d1, &d2, &d3);
+
+ store_u16_4x4(d, dst_stride, d0, d1, d2, d3);
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ s += 4;
+ d += 4;
+ width -= 4;
+ } while (width != 0);
+ src_ptr += 4 * src_stride;
+ dst_ptr += 4 * dst_stride;
+ height -= 4;
+#else // !AOM_ARCH_AARCH64
+ t0 = vld1_u8(src_ptr); // a0 a1 a2 a3 a4 a5 a6 a7
+ s0 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+ s4 = vget_high_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+
+ __builtin_prefetch(d);
+
+ s = src_ptr + 8;
+
+ do {
+ t0 = vld1_u8(s); // a8 a9 a10 a11
+ s8 = vget_low_s16(vreinterpretq_s16_u16(vmovl_u8(t0)));
+
+ s1 = vext_s16(s0, s4, 1); // a1 a2 a3 a4
+ s2 = vext_s16(s0, s4, 2); // a2 a3 a4 a5
+ s3 = vext_s16(s0, s4, 3); // a3 a4 a5 a6
+ s5 = vext_s16(s4, s8, 1); // a5 a6 a7 a8
+ s6 = vext_s16(s4, s8, 2); // a6 a7 a8 a9
+ s7 = vext_s16(s4, s8, 3); // a7 a8 a9 a10
+
+ d0 = convolve8_4_x(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
+ vget_low_s16(round_offset_vec));
+
+ vst1_u16(d, d0);
+
+ s0 = s4;
+ s4 = s8;
+ s += 4;
+ d += 4;
+ width -= 4;
+ } while (width != 0);
+ src_ptr += src_stride;
+ dst_ptr += dst_stride;
+ height--;
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
+ } else {
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7, s8;
+ uint16x8_t d0;
+
+ do {
+ d = dst_ptr;
+ width = w;
+
+#if AOM_ARCH_AARCH64
+ int16x8_t s9, s10, s11, s12, s13, s14;
+ uint16x8_t d1, d2, d3, d4, d5, d6, d7;
+
+ __builtin_prefetch(src_ptr + 0 * src_stride);
+ __builtin_prefetch(src_ptr + 1 * src_stride);
+ __builtin_prefetch(src_ptr + 2 * src_stride);
+ __builtin_prefetch(src_ptr + 3 * src_stride);
+ __builtin_prefetch(src_ptr + 4 * src_stride);
+ __builtin_prefetch(src_ptr + 5 * src_stride);
+ __builtin_prefetch(src_ptr + 6 * src_stride);
+ __builtin_prefetch(src_ptr + 7 * src_stride);
+
+ load_u8_8x8(src_ptr, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+ transpose_u8_8x8(&t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+
+ s0 = vreinterpretq_s16_u16(vmovl_u8(t0));
+ s1 = vreinterpretq_s16_u16(vmovl_u8(t1));
+ s2 = vreinterpretq_s16_u16(vmovl_u8(t2));
+ s3 = vreinterpretq_s16_u16(vmovl_u8(t3));
+ s4 = vreinterpretq_s16_u16(vmovl_u8(t4));
+ s5 = vreinterpretq_s16_u16(vmovl_u8(t5));
+ s6 = vreinterpretq_s16_u16(vmovl_u8(t6));
+
+ __builtin_prefetch(dst_ptr + 0 * dst_stride);
+ __builtin_prefetch(dst_ptr + 1 * dst_stride);
+ __builtin_prefetch(dst_ptr + 2 * dst_stride);
+ __builtin_prefetch(dst_ptr + 3 * dst_stride);
+ __builtin_prefetch(dst_ptr + 4 * dst_stride);
+ __builtin_prefetch(dst_ptr + 5 * dst_stride);
+ __builtin_prefetch(dst_ptr + 6 * dst_stride);
+ __builtin_prefetch(dst_ptr + 7 * dst_stride);
+
+ s = src_ptr + 7;
+
+ do {
+ load_u8_8x8(s, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+ transpose_u8_8x8(&t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+
+ s7 = vreinterpretq_s16_u16(vmovl_u8(t0));
+ s8 = vreinterpretq_s16_u16(vmovl_u8(t1));
+ s9 = vreinterpretq_s16_u16(vmovl_u8(t2));
+ s10 = vreinterpretq_s16_u16(vmovl_u8(t3));
+ s11 = vreinterpretq_s16_u16(vmovl_u8(t4));
+ s12 = vreinterpretq_s16_u16(vmovl_u8(t5));
+ s13 = vreinterpretq_s16_u16(vmovl_u8(t6));
+ s14 = vreinterpretq_s16_u16(vmovl_u8(t7));
+
+ d0 = convolve8_8_x(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
+ round_offset_vec);
+ d1 = convolve8_8_x(s1, s2, s3, s4, s5, s6, s7, s8, x_filter,
+ round_offset_vec);
+ d2 = convolve8_8_x(s2, s3, s4, s5, s6, s7, s8, s9, x_filter,
+ round_offset_vec);
+ d3 = convolve8_8_x(s3, s4, s5, s6, s7, s8, s9, s10, x_filter,
+ round_offset_vec);
+ d4 = convolve8_8_x(s4, s5, s6, s7, s8, s9, s10, s11, x_filter,
+ round_offset_vec);
+ d5 = convolve8_8_x(s5, s6, s7, s8, s9, s10, s11, s12, x_filter,
+ round_offset_vec);
+ d6 = convolve8_8_x(s6, s7, s8, s9, s10, s11, s12, s13, x_filter,
+ round_offset_vec);
+ d7 = convolve8_8_x(s7, s8, s9, s10, s11, s12, s13, s14, x_filter,
+ round_offset_vec);
+
+ transpose_u16_8x8(&d0, &d1, &d2, &d3, &d4, &d5, &d6, &d7);
+
+ store_u16_8x8(d, dst_stride, d0, d1, d2, d3, d4, d5, d6, d7);
s0 = s8;
s1 = s9;
@@ -2064,235 +3714,878 @@
s += 8;
d += 8;
width -= 8;
- d_u8_tmp += 8;
- } while (width > 0);
+ } while (width != 0);
src_ptr += 8 * src_stride;
dst_ptr += 8 * dst_stride;
- dst_u8_ptr += 8 * dst8_stride;
height -= 8;
-#else
- int16x8_t temp_0;
+#else // !AOM_ARCH_AARCH64
__builtin_prefetch(src_ptr);
+
t0 = vld1_u8(src_ptr);
s0 = vreinterpretq_s16_u16(vmovl_u8(t0)); // a0 a1 a2 a3 a4 a5 a6 a7
- width = w;
- s = src_ptr + 8;
- d = dst_ptr;
- d_u8_tmp = dst_u8_ptr;
-
__builtin_prefetch(dst_ptr);
+ s = src_ptr + 8;
+
do {
- d_u8 = d_u8_tmp;
- d_tmp = d;
-
t0 = vld1_u8(s); // a8 a9 a10 a11 a12 a13 a14 a15
- s7 = vreinterpretq_s16_u16(vmovl_u8(t0));
- temp_0 = s0;
- s0 = s7;
+ s8 = vreinterpretq_s16_u16(vmovl_u8(t0));
- s1 = vextq_s16(temp_0, s7, 1); // a1 a2 a3 a4 a5 a6 a7 a8
- s2 = vextq_s16(temp_0, s7, 2); // a2 a3 a4 a5 a6 a7 a8 a9
- s3 = vextq_s16(temp_0, s7, 3); // a3 a4 a5 a6 a7 a8 a9 a10
- s4 = vextq_s16(temp_0, s7, 4); // a4 a5 a6 a7 a8 a9 a10 a11
- s5 = vextq_s16(temp_0, s7, 5); // a5 a6 a7 a8 a9 a10 a11 a12
- s6 = vextq_s16(temp_0, s7, 6); // a6 a7 a8 a9 a10 a11 a12 a13
- s7 = vextq_s16(temp_0, s7, 7); // a7 a8 a9 a10 a11 a12 a13 a14
+ s1 = vextq_s16(s0, s8, 1); // a1 a2 a3 a4 a5 a6 a7 a8
+ s2 = vextq_s16(s0, s8, 2); // a2 a3 a4 a5 a6 a7 a8 a9
+ s3 = vextq_s16(s0, s8, 3); // a3 a4 a5 a6 a7 a8 a9 a10
+ s4 = vextq_s16(s0, s8, 4); // a4 a5 a6 a7 a8 a9 a10 a11
+ s5 = vextq_s16(s0, s8, 5); // a5 a6 a7 a8 a9 a10 a11 a12
+ s6 = vextq_s16(s0, s8, 6); // a6 a7 a8 a9 a10 a11 a12 a13
+ s7 = vextq_s16(s0, s8, 7); // a7 a8 a9 a10 a11 a12 a13 a14
- res0 = convolve8_8x8_s16(temp_0, s1, s2, s3, s4, s5, s6, s7, x_filter,
- zero, shift_round_0);
+ d0 = convolve8_8_x(s0, s1, s2, s3, s4, s5, s6, s7, x_filter,
+ round_offset_vec);
- res0 = vrshlq_s16(res0, horiz_const);
- res0 = vaddq_s16(res0, round_offset128);
+ vst1q_u16(d, d0);
- if (conv_params->do_average) {
- res8 = vld1q_u16(d_tmp);
- d_tmp += (dst_stride);
-
- compute_avg_8x1(res8, vreinterpretq_u16_s16(res0), fwd_offset,
- bck_offset, round_offset64, round_bits,
- use_dist_wtd_comp_avg, &t0);
-
- vst1_u8(d_u8, t0);
- d_u8 += (dst8_stride);
- } else {
- vst1q_u16(d_tmp, vreinterpretq_u16_s16(res0));
- d_tmp += (dst_stride);
- }
-
+ s0 = s8;
s += 8;
d += 8;
width -= 8;
- d_u8_tmp += 8;
- } while (width > 0);
+ } while (width != 0);
src_ptr += src_stride;
dst_ptr += dst_stride;
- dst_u8_ptr += dst8_stride;
height--;
-#endif
- } while (height > 0);
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
}
}
-#endif // defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
+#endif // AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD)
-void av1_dist_wtd_convolve_y_neon(const uint8_t *src, int src_stride,
+void av1_dist_wtd_convolve_x_neon(const uint8_t *src, int src_stride,
uint8_t *dst8, int dst8_stride, int w, int h,
- const InterpFilterParams *filter_params_y,
- const int subpel_y_qn,
+ const InterpFilterParams *filter_params_x,
+ const int subpel_x_qn,
ConvolveParams *conv_params) {
- assert(!(w % 4));
- assert(!(h % 4));
+ if (conv_params->do_average) {
+ if (UNLIKELY(conv_params->use_dist_wtd_comp_avg)) {
+ dist_wtd_convolve_x_dist_wtd_avg_neon(src, src_stride, dst8, dst8_stride,
+ w, h, filter_params_x, subpel_x_qn,
+ conv_params);
+ } else {
+ dist_wtd_convolve_x_avg_neon(src, src_stride, dst8, dst8_stride, w, h,
+ filter_params_x, subpel_x_qn, conv_params);
+ }
+ } else {
+ dist_wtd_convolve_x_neon(src, src_stride, w, h, filter_params_x,
+ subpel_x_qn, conv_params);
+ }
+}
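
Splitting the kernel into _dist_wtd_avg, _avg and plain variants hoists the do_average / use_dist_wtd_comp_avg branches out of the pixel loops, so each specialized path runs branch-free inside. For reference, the compound constants repeated in each variant work out as below (a compile-time check, assuming libaom's 8-bit values FILTER_BITS = 7, ROUND0_BITS = 3, COMPOUND_ROUND1_BITS = 7):

#include <assert.h>

/* Worked values of the compound rounding constants for bd == 8. */
enum {
  kOffsetBits = 8 + 2 * 7 - 3, /* bd + 2 * FILTER_BITS - ROUND0_BITS = 19 */
  kRoundOffset = (1 << (kOffsetBits - 7)) + (1 << (kOffsetBits - 7 - 1)),
  kRoundBits = 2 * 7 - 3 - 7 /* 4 */
};
static_assert(kOffsetBits == 19, "offset_bits");
static_assert(kRoundOffset == 6144, "round_offset: 4096 + 2048");
static_assert(kRoundBits == 4, "round_bits");

So the conv buffer holds biased 15-bit values, and the averaging step removes the 6144 bias before narrowing back to 8 bits.
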
- CONV_BUF_TYPE *dst = conv_params->dst;
- const int dst_stride = conv_params->dst_stride;
- const int vert_offset = filter_params_y->taps / 2 - 1;
- const int bits = FILTER_BITS - conv_params->round_0;
+static INLINE uint16x4_t convolve6_4_y(const int16x4_t s0, const int16x4_t s1,
+ const int16x4_t s2, const int16x4_t s3,
+ const int16x4_t s4, const int16x4_t s5,
+ const int16x8_t y_filter,
+ const int16x4_t round_offset) {
+ const int16x4_t y_filter_0_3 = vget_low_s16(y_filter);
+ const int16x4_t y_filter_4_7 = vget_high_s16(y_filter);
+
+ // Filter values at indices 0 and 7 are 0.
+ int16x4_t sum = vmul_lane_s16(s0, y_filter_0_3, 1);
+ sum = vmla_lane_s16(sum, s1, y_filter_0_3, 2);
+ sum = vmla_lane_s16(sum, s2, y_filter_0_3, 3);
+ sum = vmla_lane_s16(sum, s3, y_filter_4_7, 0);
+ sum = vmla_lane_s16(sum, s4, y_filter_4_7, 1);
+ sum = vmla_lane_s16(sum, s5, y_filter_4_7, 2);
+
+ // We halved the convolution filter values so -1 from the right shift.
+ int16x4_t res = vrsra_n_s16(round_offset, sum, ROUND0_BITS - 1);
+ return vreinterpret_u16_s16(res);
+}
+
+static INLINE uint16x8_t convolve6_8_y(const int16x8_t s0, const int16x8_t s1,
+ const int16x8_t s2, const int16x8_t s3,
+ const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t y_filter,
+ const int16x8_t round_offset) {
+ const int16x4_t y_filter_0_3 = vget_low_s16(y_filter);
+ const int16x4_t y_filter_4_7 = vget_high_s16(y_filter);
+
+ // Filter values at indices 0 and 7 are 0.
+ int16x8_t sum = vmulq_lane_s16(s0, y_filter_0_3, 1);
+ sum = vmlaq_lane_s16(sum, s1, y_filter_0_3, 2);
+ sum = vmlaq_lane_s16(sum, s2, y_filter_0_3, 3);
+ sum = vmlaq_lane_s16(sum, s3, y_filter_4_7, 0);
+ sum = vmlaq_lane_s16(sum, s4, y_filter_4_7, 1);
+ sum = vmlaq_lane_s16(sum, s5, y_filter_4_7, 2);
+
+ // We halved the convolution filter values so -1 from the right shift.
+ int16x8_t res = vrsraq_n_s16(round_offset, sum, ROUND0_BITS - 1);
+ return vreinterpretq_u16_s16(res);
+}
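
convolve6_{4,8}_y exploit the kernel property noted in the comment: taps 0 and 7 of the 8-tap filter are zero, saving two multiply-accumulates and two rows of loads per output. A scalar sketch of the same computation (taps are the halved 8-tap kernel, as loaded by the callers):

#include <stdint.h>

/* Scalar sketch of the 6-tap special case: only taps 1..6 and six source
 * samples contribute. */
static int16_t convolve6_pixel(const int16_t s[6], const int16_t taps[8],
                               int16_t round_offset) {
  int32_t sum = 0;
  for (int k = 0; k < 6; k++) sum += s[k] * taps[k + 1];
  /* vrsra_n_s16(round_offset, sum, ROUND0_BITS - 1) with ROUND0_BITS == 3:
   * rounding right shift by 2, accumulated onto the offset. */
  return (int16_t)(round_offset + ((sum + 2) >> 2));
}
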
+
+static INLINE void dist_wtd_convolve_y_6tap_dist_wtd_avg_neon(
+ const uint8_t *src_ptr, int src_stride, uint8_t *dst8_ptr,
+ const int dst8_stride, int w, int h, const int16x8_t y_filter,
+ ConvolveParams *conv_params) {
const int bd = 8;
- const int offset_bits = bd + 2 * FILTER_BITS - conv_params->round_0;
- const int round_offset = (1 << (offset_bits - conv_params->round_1)) +
- (1 << (offset_bits - conv_params->round_1 - 1));
- const int round_bits =
- 2 * FILTER_BITS - conv_params->round_0 - conv_params->round_1;
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const int16_t round_offset = (1 << (offset_bits - COMPOUND_ROUND1_BITS)) +
+ (1 << (offset_bits - COMPOUND_ROUND1_BITS - 1));
+ const int16x8_t round_offset_vec = vdupq_n_s16(round_offset);
+
const uint16_t fwd_offset = conv_params->fwd_offset;
const uint16_t bck_offset = conv_params->bck_offset;
- const int use_dist_wtd_comp_avg = conv_params->use_dist_wtd_comp_avg;
- const int shift_value = (conv_params->round_1 - 1 - bits);
- // vertical filter
- const int16_t *y_filter_ptr = av1_get_interp_filter_subpel_kernel(
- filter_params_y, subpel_y_qn & SUBPEL_MASK);
+ CONV_BUF_TYPE *dst_ptr = conv_params->dst;
+ const int dst_stride = conv_params->dst_stride;
+ int width = w;
- const uint8_t *src_ptr = src - (vert_offset * src_stride);
-
- // Filter values are even, so downshift by 1 to reduce intermediate precision
- // requirements.
- const int16x8_t y_filter = vshrq_n_s16(vld1q_s16(y_filter_ptr), 1);
-
- const uint8_t *s;
- uint8_t *d_u8;
- uint8_t *dst_u8_ptr;
- CONV_BUF_TYPE *d, *dst_ptr;
- int width, height;
-
- s = src_ptr;
- dst_ptr = dst;
- dst_u8_ptr = dst8;
- width = w;
- height = h;
-
- // used to get rid of multiplication = (vertical filter output sum) *
- // (1<<bits).
- assert((conv_params->round_1 - 2) >= bits);
-
- if ((w == 4) || (h == 4)) {
- int16x4_t s0, s1, s2, s3, s4, s5, s6, s7, d0;
- uint16x4_t res4;
- uint32x2_t tu0 = vdup_n_u32(0), tu1 = vdup_n_u32(0), tu2 = vdup_n_u32(0),
- tu3 = vdup_n_u32(0);
- int16x8_t u0, u1, u2, u3;
- uint8x8_t t0;
-
-#if defined(__aarch64__)
- int16x4_t s8, s9, s10, d1, d2, d3;
- uint16x4_t res5, res6, res7;
- uint8x8_t t1;
-#endif
- const int16x4_t round_offset64 = vdup_n_s16(round_offset);
- const int16x4_t shift_vec = vdup_n_s16(-shift_value);
- const int16x4_t zero = vdup_n_s16(0);
+ if (w == 4 || h == 4) {
+ int16x4_t s0, s1, s2, s3, s4, s5;
+ uint16x4_t d0, dd0;
+ uint8x8_t t0, t1, t2, t3, t4, d01;
+#if AOM_ARCH_AARCH64
+ int16x4_t s6, s7, s8;
+ uint16x4_t d1, d2, d3, dd1, dd2, dd3;
+ uint8x8_t d23;
+#endif // AOM_ARCH_AARCH64
do {
- s = src_ptr;
- d = dst_ptr;
- d_u8 = dst_u8_ptr;
- height = h;
+ const uint8_t *s = src_ptr;
+ CONV_BUF_TYPE *d = dst_ptr;
+ uint8_t *d_u8 = dst8_ptr;
+ int height = h;
+
+ t0 = load_unaligned_u8_4x1(s + 0 * src_stride);
+ t1 = load_unaligned_u8_4x1(s + 1 * src_stride);
+ t2 = load_unaligned_u8_4x1(s + 2 * src_stride);
+ t3 = load_unaligned_u8_4x1(s + 3 * src_stride);
+ t4 = load_unaligned_u8_4x1(s + 4 * src_stride);
+
+ s0 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t0)));
+ s1 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t1)));
+ s2 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t2)));
+ s3 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t3)));
+ s4 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t4)));
+
+ s += 5 * src_stride;
+
+ do {
+#if AOM_ARCH_AARCH64
+ t0 = load_unaligned_u8_4x1(s + 0 * src_stride);
+ t1 = load_unaligned_u8_4x1(s + 1 * src_stride);
+ t2 = load_unaligned_u8_4x1(s + 2 * src_stride);
+ t3 = load_unaligned_u8_4x1(s + 3 * src_stride);
+
+ s5 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t0)));
+ s6 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t1)));
+ s7 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t2)));
+ s8 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t3)));
+
+ d0 = convolve6_4_y(s0, s1, s2, s3, s4, s5, y_filter,
+ vget_low_s16(round_offset_vec));
+ d1 = convolve6_4_y(s1, s2, s3, s4, s5, s6, y_filter,
+ vget_low_s16(round_offset_vec));
+ d2 = convolve6_4_y(s2, s3, s4, s5, s6, s7, y_filter,
+ vget_low_s16(round_offset_vec));
+ d3 = convolve6_4_y(s3, s4, s5, s6, s7, s8, y_filter,
+ vget_low_s16(round_offset_vec));
+
+ load_u16_4x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
+
+ compute_dist_wtd_avg_4x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3, fwd_offset,
+ bck_offset, round_offset_vec, &d01, &d23);
+
+ store_u8_4x1(d_u8 + 0 * dst8_stride, d01, 0);
+ store_u8_4x1(d_u8 + 1 * dst8_stride, d01, 1);
+ store_u8_4x1(d_u8 + 2 * dst8_stride, d23, 0);
+ store_u8_4x1(d_u8 + 3 * dst8_stride, d23, 1);
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ d_u8 += 4 * dst8_stride;
+ height -= 4;
+#else // !AOM_ARCH_AARCH64
+ t0 = load_unaligned_u8_4x1(s);
+ s5 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t0)));
+
+ d0 = convolve6_4_y(s0, s1, s2, s3, s4, s5, y_filter,
+ vget_low_s16(round_offset_vec));
+
+ dd0 = vld1_u16(d);
+
+ compute_dist_wtd_avg_4x1(dd0, d0, fwd_offset, bck_offset,
+ vget_low_s16(round_offset_vec), &d01);
+
+ store_u8_4x1(d_u8, d01, 0);
+
+ s0 = s1;
+ s1 = s2;
+ s2 = s3;
+ s3 = s4;
+ s4 = s5;
+ s += src_stride;
+ d += dst_stride;
+ d_u8 += dst8_stride;
+ height--;
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
+ src_ptr += 4;
+ dst_ptr += 4;
+ dst8_ptr += 4;
+ width -= 4;
+ } while (width != 0);
+ } else {
+ int16x8_t s0, s1, s2, s3, s4, s5;
+ uint16x8_t d0, dd0;
+ uint8x8_t d0_u8, t0, t1, t2, t3, t4;
+#if AOM_ARCH_AARCH64
+ int16x8_t s6, s7, s8, s9, s10, s11, s12;
+ uint16x8_t d1, d2, d3, d4, d5, d6, d7, dd1, dd2, dd3, dd4, dd5, dd6, dd7;
+ uint8x8_t d1_u8, d2_u8, d3_u8, d4_u8, d5_u8, d6_u8, d7_u8, t5, t6, t7;
+#endif // AOM_ARCH_AARCH64
+
+ do {
+ const uint8_t *s = src_ptr + (5 * src_stride);
+ CONV_BUF_TYPE *d = dst_ptr;
+ uint8_t *d_u8 = dst8_ptr;
+ int height = h;
+
+ load_u8_8x5(src_ptr, src_stride, &t0, &t1, &t2, &t3, &t4);
+
+ s0 = vreinterpretq_s16_u16(vmovl_u8(t0));
+ s1 = vreinterpretq_s16_u16(vmovl_u8(t1));
+ s2 = vreinterpretq_s16_u16(vmovl_u8(t2));
+ s3 = vreinterpretq_s16_u16(vmovl_u8(t3));
+ s4 = vreinterpretq_s16_u16(vmovl_u8(t4));
+
+ do {
+#if AOM_ARCH_AARCH64
+ load_u8_8x8(s, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+
+ s5 = vreinterpretq_s16_u16(vmovl_u8(t0));
+ s6 = vreinterpretq_s16_u16(vmovl_u8(t1));
+ s7 = vreinterpretq_s16_u16(vmovl_u8(t2));
+ s8 = vreinterpretq_s16_u16(vmovl_u8(t3));
+ s9 = vreinterpretq_s16_u16(vmovl_u8(t4));
+ s10 = vreinterpretq_s16_u16(vmovl_u8(t5));
+ s11 = vreinterpretq_s16_u16(vmovl_u8(t6));
+ s12 = vreinterpretq_s16_u16(vmovl_u8(t7));
+
+ d0 = convolve6_8_y(s0, s1, s2, s3, s4, s5, y_filter, round_offset_vec);
+ d1 = convolve6_8_y(s1, s2, s3, s4, s5, s6, y_filter, round_offset_vec);
+ d2 = convolve6_8_y(s2, s3, s4, s5, s6, s7, y_filter, round_offset_vec);
+ d3 = convolve6_8_y(s3, s4, s5, s6, s7, s8, y_filter, round_offset_vec);
+ d4 = convolve6_8_y(s4, s5, s6, s7, s8, s9, y_filter, round_offset_vec);
+ d5 = convolve6_8_y(s5, s6, s7, s8, s9, s10, y_filter, round_offset_vec);
+ d6 =
+ convolve6_8_y(s6, s7, s8, s9, s10, s11, y_filter, round_offset_vec);
+ d7 = convolve6_8_y(s7, s8, s9, s10, s11, s12, y_filter,
+ round_offset_vec);
+
+ load_u16_8x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
+
+ compute_dist_wtd_avg_8x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3, fwd_offset,
+ bck_offset, round_offset_vec, &d0_u8, &d1_u8,
+ &d2_u8, &d3_u8);
+
+ store_u8_8x4(d_u8, dst8_stride, d0_u8, d1_u8, d2_u8, d3_u8);
+ d_u8 += 4 * dst8_stride;
+
+ load_u16_8x4(d + 4 * dst_stride, dst_stride, &dd4, &dd5, &dd6, &dd7);
+
+ compute_dist_wtd_avg_8x4(dd4, dd5, dd6, dd7, d4, d5, d6, d7, fwd_offset,
+ bck_offset, round_offset_vec, &d4_u8, &d5_u8,
+ &d6_u8, &d7_u8);
+
+ store_u8_8x4(d_u8, dst8_stride, d4_u8, d5_u8, d6_u8, d7_u8);
+ d_u8 += 4 * dst8_stride;
+
+ s0 = s8;
+ s1 = s9;
+ s2 = s10;
+ s3 = s11;
+ s4 = s12;
+ s += 8 * src_stride;
+ d += 8 * dst_stride;
+ height -= 8;
+#else // !AOM_ARCH_AARCH64
+ s5 = vreinterpretq_s16_u16(vmovl_u8(vld1_u8(s)));
+
+ d0 = convolve6_8_y(s0, s1, s2, s3, s4, s5, y_filter, round_offset_vec);
+
+ s0 = s1;
+ s1 = s2;
+ s2 = s3;
+ s3 = s4;
+ s4 = s5;
+
+ dd0 = vld1q_u16(d);
+
+ compute_dist_wtd_avg_8x1(dd0, d0, fwd_offset, bck_offset,
+ round_offset_vec, &d0_u8);
+
+ vst1_u8(d_u8, d0_u8);
+ d_u8 += dst8_stride;
+
+ s += src_stride;
+ d += dst_stride;
+ height--;
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
+ src_ptr += 8;
+ dst_ptr += 8;
+ dst8_ptr += 8;
+ width -= 8;
+ } while (width != 0);
+ }
+}
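
Unlike the horizontal paths, these vertical loops need no transposes: a five-row window of source values stays live in registers, four (or eight) output rows are computed per iteration, and the s0 = s4 ... s4 = s8 moves slide the window so each input row is loaded exactly once. A scalar sketch of the rotation pattern (the plain sum stands in for the real 6-tap dot product):

#include <stdint.h>

/* Scalar analogue of the register rotation in the 6-tap vertical loops. */
static void slide_window_example(const int16_t *col, int n, int16_t *out) {
  int16_t s0 = col[0], s1 = col[1], s2 = col[2], s3 = col[3], s4 = col[4];
  for (int i = 5; i + 3 < n; i += 4) {
    const int16_t s5 = col[i], s6 = col[i + 1], s7 = col[i + 2],
                  s8 = col[i + 3];
    out[i - 5] = (int16_t)(s0 + s1 + s2 + s3 + s4 + s5);
    out[i - 4] = (int16_t)(s1 + s2 + s3 + s4 + s5 + s6);
    out[i - 3] = (int16_t)(s2 + s3 + s4 + s5 + s6 + s7);
    out[i - 2] = (int16_t)(s3 + s4 + s5 + s6 + s7 + s8);
    s0 = s4; /* slide the five-row window down by four rows */
    s1 = s5;
    s2 = s6;
    s3 = s7;
    s4 = s8;
  }
}
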
+
+static INLINE void dist_wtd_convolve_y_6tap_avg_neon(
+ const uint8_t *src_ptr, int src_stride, uint8_t *dst8_ptr,
+ const int dst8_stride, int w, int h, const int16x8_t y_filter,
+ ConvolveParams *conv_params) {
+ const int bd = 8;
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const int16_t round_offset = (1 << (offset_bits - COMPOUND_ROUND1_BITS)) +
+ (1 << (offset_bits - COMPOUND_ROUND1_BITS - 1));
+ const int16x8_t round_offset_vec = vdupq_n_s16(round_offset);
+
+ CONV_BUF_TYPE *dst_ptr = conv_params->dst;
+ const int dst_stride = conv_params->dst_stride;
+ int width = w;
+
+ if (w == 4 || h == 4) {
+ int16x4_t s0, s1, s2, s3, s4, s5;
+ uint16x4_t d0, dd0;
+ uint8x8_t t0, t1, t2, t3, t4, d01;
+#if AOM_ARCH_AARCH64
+ int16x4_t s6, s7, s8;
+ uint16x4_t d1, d2, d3, dd1, dd2, dd3;
+ uint8x8_t d23;
+#endif // AOM_ARCH_AARCH64
+
+ do {
+ const uint8_t *s = src_ptr;
+ CONV_BUF_TYPE *d = dst_ptr;
+ uint8_t *d_u8 = dst8_ptr;
+ int height = h;
+
+ t0 = load_unaligned_u8_4x1(s + 0 * src_stride);
+ t1 = load_unaligned_u8_4x1(s + 1 * src_stride);
+ t2 = load_unaligned_u8_4x1(s + 2 * src_stride);
+ t3 = load_unaligned_u8_4x1(s + 3 * src_stride);
+ t4 = load_unaligned_u8_4x1(s + 4 * src_stride);
+
+ s0 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t0)));
+ s1 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t1)));
+ s2 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t2)));
+ s3 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t3)));
+ s4 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t4)));
+
+ s += 5 * src_stride;
+
+ do {
+#if AOM_ARCH_AARCH64
+ t0 = load_unaligned_u8_4x1(s + 0 * src_stride);
+ t1 = load_unaligned_u8_4x1(s + 1 * src_stride);
+ t2 = load_unaligned_u8_4x1(s + 2 * src_stride);
+ t3 = load_unaligned_u8_4x1(s + 3 * src_stride);
+
+ s5 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t0)));
+ s6 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t1)));
+ s7 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t2)));
+ s8 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t3)));
+
+ d0 = convolve6_4_y(s0, s1, s2, s3, s4, s5, y_filter,
+ vget_low_s16(round_offset_vec));
+ d1 = convolve6_4_y(s1, s2, s3, s4, s5, s6, y_filter,
+ vget_low_s16(round_offset_vec));
+ d2 = convolve6_4_y(s2, s3, s4, s5, s6, s7, y_filter,
+ vget_low_s16(round_offset_vec));
+ d3 = convolve6_4_y(s3, s4, s5, s6, s7, s8, y_filter,
+ vget_low_s16(round_offset_vec));
+
+ load_u16_4x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
+
+ compute_basic_avg_4x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3,
+ round_offset_vec, &d01, &d23);
+
+ store_u8_4x1(d_u8 + 0 * dst8_stride, d01, 0);
+ store_u8_4x1(d_u8 + 1 * dst8_stride, d01, 1);
+ store_u8_4x1(d_u8 + 2 * dst8_stride, d23, 0);
+ store_u8_4x1(d_u8 + 3 * dst8_stride, d23, 1);
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ d_u8 += 4 * dst8_stride;
+ height -= 4;
+#else // !AOM_ARCH_AARCH64
+ t0 = load_unaligned_u8_4x1(s);
+ s5 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t0)));
+
+ d0 = convolve6_4_y(s0, s1, s2, s3, s4, s5, y_filter,
+ vget_low_s16(round_offset_vec));
+
+ dd0 = vld1_u16(d);
+
+ compute_basic_avg_4x1(dd0, d0, vget_low_s16(round_offset_vec), &d01);
+
+ store_u8_4x1(d_u8, d01, 0);
+
+ s0 = s1;
+ s1 = s2;
+ s2 = s3;
+ s3 = s4;
+ s4 = s5;
+ s += src_stride;
+ d += dst_stride;
+ d_u8 += dst8_stride;
+ height--;
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
+ src_ptr += 4;
+ dst_ptr += 4;
+ dst8_ptr += 4;
+ width -= 4;
+ } while (width != 0);
+ } else {
+ int16x8_t s0, s1, s2, s3, s4, s5;
+ uint16x8_t d0, dd0;
+ uint8x8_t d0_u8, t0, t1, t2, t3, t4;
+#if AOM_ARCH_AARCH64
+ int16x8_t s6, s7, s8, s9, s10, s11, s12;
+ uint16x8_t d1, d2, d3, d4, d5, d6, d7, dd1, dd2, dd3, dd4, dd5, dd6, dd7;
+ uint8x8_t d1_u8, d2_u8, d3_u8, d4_u8, d5_u8, d6_u8, d7_u8, t5, t6, t7;
+#endif // AOM_ARCH_AARCH64
+
+ do {
+ const uint8_t *s = src_ptr + (5 * src_stride);
+ CONV_BUF_TYPE *d = dst_ptr;
+ uint8_t *d_u8 = dst8_ptr;
+ int height = h;
+
+ load_u8_8x5(src_ptr, src_stride, &t0, &t1, &t2, &t3, &t4);
+
+ s0 = vreinterpretq_s16_u16(vmovl_u8(t0));
+ s1 = vreinterpretq_s16_u16(vmovl_u8(t1));
+ s2 = vreinterpretq_s16_u16(vmovl_u8(t2));
+ s3 = vreinterpretq_s16_u16(vmovl_u8(t3));
+ s4 = vreinterpretq_s16_u16(vmovl_u8(t4));
+
+ do {
+#if AOM_ARCH_AARCH64
+ load_u8_8x8(s, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+
+ s5 = vreinterpretq_s16_u16(vmovl_u8(t0));
+ s6 = vreinterpretq_s16_u16(vmovl_u8(t1));
+ s7 = vreinterpretq_s16_u16(vmovl_u8(t2));
+ s8 = vreinterpretq_s16_u16(vmovl_u8(t3));
+ s9 = vreinterpretq_s16_u16(vmovl_u8(t4));
+ s10 = vreinterpretq_s16_u16(vmovl_u8(t5));
+ s11 = vreinterpretq_s16_u16(vmovl_u8(t6));
+ s12 = vreinterpretq_s16_u16(vmovl_u8(t7));
+
+ d0 = convolve6_8_y(s0, s1, s2, s3, s4, s5, y_filter, round_offset_vec);
+ d1 = convolve6_8_y(s1, s2, s3, s4, s5, s6, y_filter, round_offset_vec);
+ d2 = convolve6_8_y(s2, s3, s4, s5, s6, s7, y_filter, round_offset_vec);
+ d3 = convolve6_8_y(s3, s4, s5, s6, s7, s8, y_filter, round_offset_vec);
+ d4 = convolve6_8_y(s4, s5, s6, s7, s8, s9, y_filter, round_offset_vec);
+ d5 = convolve6_8_y(s5, s6, s7, s8, s9, s10, y_filter, round_offset_vec);
+ d6 =
+ convolve6_8_y(s6, s7, s8, s9, s10, s11, y_filter, round_offset_vec);
+ d7 = convolve6_8_y(s7, s8, s9, s10, s11, s12, y_filter,
+ round_offset_vec);
+
+ load_u16_8x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
+
+ compute_basic_avg_8x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3,
+ round_offset_vec, &d0_u8, &d1_u8, &d2_u8, &d3_u8);
+
+ store_u8_8x4(d_u8, dst8_stride, d0_u8, d1_u8, d2_u8, d3_u8);
+ d_u8 += 4 * dst8_stride;
+
+ load_u16_8x4(d + 4 * dst_stride, dst_stride, &dd4, &dd5, &dd6, &dd7);
+
+ compute_basic_avg_8x4(dd4, dd5, dd6, dd7, d4, d5, d6, d7,
+ round_offset_vec, &d4_u8, &d5_u8, &d6_u8, &d7_u8);
+
+ store_u8_8x4(d_u8, dst8_stride, d4_u8, d5_u8, d6_u8, d7_u8);
+ d_u8 += 4 * dst8_stride;
+
+ s0 = s8;
+ s1 = s9;
+ s2 = s10;
+ s3 = s11;
+ s4 = s12;
+ s += 8 * src_stride;
+ d += 8 * dst_stride;
+ height -= 8;
+#else // !AOM_ARCH_AARCH64
+ s5 = vreinterpretq_s16_u16(vmovl_u8(vld1_u8(s)));
+
+ d0 = convolve6_8_y(s0, s1, s2, s3, s4, s5, y_filter, round_offset_vec);
+
+ s0 = s1;
+ s1 = s2;
+ s2 = s3;
+ s3 = s4;
+ s4 = s5;
+
+ dd0 = vld1q_u16(d);
+
+ compute_basic_avg_8x1(dd0, d0, round_offset_vec, &d0_u8);
+
+ vst1_u8(d_u8, d0_u8);
+ d_u8 += dst8_stride;
+
+ s += src_stride;
+ d += dst_stride;
+ height--;
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
+ src_ptr += 8;
+ dst_ptr += 8;
+ dst8_ptr += 8;
+ width -= 8;
+ } while (width != 0);
+ }
+}
+
+static INLINE void dist_wtd_convolve_y_6tap_neon(const uint8_t *src_ptr,
+ int src_stride, int w, int h,
+ const int16x8_t y_filter,
+ ConvolveParams *conv_params) {
+ const int bd = 8;
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const int16_t round_offset = (1 << (offset_bits - COMPOUND_ROUND1_BITS)) +
+ (1 << (offset_bits - COMPOUND_ROUND1_BITS - 1));
+ const int16x8_t round_offset_vec = vdupq_n_s16(round_offset);
+
+ CONV_BUF_TYPE *dst_ptr = conv_params->dst;
+ const int dst_stride = conv_params->dst_stride;
+ int width = w;
+
+ if (w == 4 || h == 4) {
+ int16x4_t s0, s1, s2, s3, s4, s5;
+ uint16x4_t d0;
+ uint8x8_t t0, t1, t2, t3, t4;
+#if AOM_ARCH_AARCH64
+ int16x4_t s6, s7, s8;
+ uint16x4_t d1, d2, d3;
+#endif // AOM_ARCH_AARCH64
+
+ do {
+ const uint8_t *s = src_ptr;
+ CONV_BUF_TYPE *d = dst_ptr;
+ int height = h;
+
+ t0 = load_unaligned_u8_4x1(s + 0 * src_stride);
+ t1 = load_unaligned_u8_4x1(s + 1 * src_stride);
+ t2 = load_unaligned_u8_4x1(s + 2 * src_stride);
+ t3 = load_unaligned_u8_4x1(s + 3 * src_stride);
+ t4 = load_unaligned_u8_4x1(s + 4 * src_stride);
+
+ s0 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t0)));
+ s1 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t1)));
+ s2 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t2)));
+ s3 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t3)));
+ s4 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t4)));
+
+ s += 5 * src_stride;
+
+ do {
+#if AOM_ARCH_AARCH64
+ t0 = load_unaligned_u8_4x1(s + 0 * src_stride);
+ t1 = load_unaligned_u8_4x1(s + 1 * src_stride);
+ t2 = load_unaligned_u8_4x1(s + 2 * src_stride);
+ t3 = load_unaligned_u8_4x1(s + 3 * src_stride);
+
+ s5 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t0)));
+ s6 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t1)));
+ s7 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t2)));
+ s8 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t3)));
+
+ d0 = convolve6_4_y(s0, s1, s2, s3, s4, s5, y_filter,
+ vget_low_s16(round_offset_vec));
+ d1 = convolve6_4_y(s1, s2, s3, s4, s5, s6, y_filter,
+ vget_low_s16(round_offset_vec));
+ d2 = convolve6_4_y(s2, s3, s4, s5, s6, s7, y_filter,
+ vget_low_s16(round_offset_vec));
+ d3 = convolve6_4_y(s3, s4, s5, s6, s7, s8, y_filter,
+ vget_low_s16(round_offset_vec));
+
+ store_u16_4x4(d, dst_stride, d0, d1, d2, d3);
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ height -= 4;
+#else // !AOM_ARCH_AARCH64
+ t0 = load_unaligned_u8_4x1(s);
+ s5 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t0)));
+
+ d0 = convolve6_4_y(s0, s1, s2, s3, s4, s5, y_filter,
+ vget_low_s16(round_offset_vec));
+
+ vst1_u16(d, d0);
+
+ s0 = s1;
+ s1 = s2;
+ s2 = s3;
+ s3 = s4;
+ s4 = s5;
+ s += src_stride;
+ d += dst_stride;
+ height--;
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
+ src_ptr += 4;
+ dst_ptr += 4;
+ width -= 4;
+ } while (width != 0);
+ } else {
+ int16x8_t s0, s1, s2, s3, s4, s5;
+ uint16x8_t d0;
+ uint8x8_t t0, t1, t2, t3, t4;
+#if AOM_ARCH_AARCH64
+ int16x8_t s6, s7, s8, s9, s10, s11, s12;
+ uint16x8_t d1, d2, d3, d4, d5, d6, d7;
+ uint8x8_t t5, t6, t7;
+#endif // AOM_ARCH_AARCH64
+
+ do {
+ const uint8_t *s = src_ptr + (5 * src_stride);
+ CONV_BUF_TYPE *d = dst_ptr;
+ int height = h;
+
+ load_u8_8x5(src_ptr, src_stride, &t0, &t1, &t2, &t3, &t4);
+
+ s0 = vreinterpretq_s16_u16(vmovl_u8(t0));
+ s1 = vreinterpretq_s16_u16(vmovl_u8(t1));
+ s2 = vreinterpretq_s16_u16(vmovl_u8(t2));
+ s3 = vreinterpretq_s16_u16(vmovl_u8(t3));
+ s4 = vreinterpretq_s16_u16(vmovl_u8(t4));
+
+ do {
+#if AOM_ARCH_AARCH64
+ load_u8_8x8(s, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+
+ s5 = vreinterpretq_s16_u16(vmovl_u8(t0));
+ s6 = vreinterpretq_s16_u16(vmovl_u8(t1));
+ s7 = vreinterpretq_s16_u16(vmovl_u8(t2));
+ s8 = vreinterpretq_s16_u16(vmovl_u8(t3));
+ s9 = vreinterpretq_s16_u16(vmovl_u8(t4));
+ s10 = vreinterpretq_s16_u16(vmovl_u8(t5));
+ s11 = vreinterpretq_s16_u16(vmovl_u8(t6));
+ s12 = vreinterpretq_s16_u16(vmovl_u8(t7));
+
+ d0 = convolve6_8_y(s0, s1, s2, s3, s4, s5, y_filter, round_offset_vec);
+ d1 = convolve6_8_y(s1, s2, s3, s4, s5, s6, y_filter, round_offset_vec);
+ d2 = convolve6_8_y(s2, s3, s4, s5, s6, s7, y_filter, round_offset_vec);
+ d3 = convolve6_8_y(s3, s4, s5, s6, s7, s8, y_filter, round_offset_vec);
+ d4 = convolve6_8_y(s4, s5, s6, s7, s8, s9, y_filter, round_offset_vec);
+ d5 = convolve6_8_y(s5, s6, s7, s8, s9, s10, y_filter, round_offset_vec);
+ d6 =
+ convolve6_8_y(s6, s7, s8, s9, s10, s11, y_filter, round_offset_vec);
+ d7 = convolve6_8_y(s7, s8, s9, s10, s11, s12, y_filter,
+ round_offset_vec);
+
+ store_u16_8x8(d, dst_stride, d0, d1, d2, d3, d4, d5, d6, d7);
+
+ s0 = s8;
+ s1 = s9;
+ s2 = s10;
+ s3 = s11;
+ s4 = s12;
+ s += 8 * src_stride;
+ d += 8 * dst_stride;
+ height -= 8;
+#else // !AOM_ARCH_AARCH64
+ s5 = vreinterpretq_s16_u16(vmovl_u8(vld1_u8(s)));
+
+ d0 = convolve6_8_y(s0, s1, s2, s3, s4, s5, y_filter, round_offset_vec);
+
+ s0 = s1;
+ s1 = s2;
+ s2 = s3;
+ s3 = s4;
+ s4 = s5;
+
+ vst1q_u16(d, d0);
+
+ s += src_stride;
+ d += dst_stride;
+ height--;
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
+ src_ptr += 8;
+ dst_ptr += 8;
+ width -= 8;
+ } while (width != 0);
+ }
+}
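
The plain (no-average) variant writes biased 16-bit results into conv_params->dst rather than pixels; a later pass averages against that buffer. A sketch of the presumed two-pass calling pattern (the driver function here is illustrative, not libaom's actual caller, which sets up ConvolveParams via get_conv_params_no_round):

#include "av1/common/convolve.h"
#include "av1/common/filter.h"

/* Hypothetical two-pass compound flow for one block. */
static void compound_predict_y_sketch(
    const uint8_t *ref0, int ref0_stride, const uint8_t *ref1,
    int ref1_stride, uint8_t *dst8, int dst8_stride, int w, int h,
    const InterpFilterParams *filter_params_y, int subpel_y_qn,
    ConvolveParams *conv_params) {
  /* Pass 1: no averaging; biased 16-bit results land in conv_params->dst. */
  conv_params->do_average = 0;
  av1_dist_wtd_convolve_y_neon(ref0, ref0_stride, dst8, dst8_stride, w, h,
                               filter_params_y, subpel_y_qn, conv_params);
  /* Pass 2: convolve ref1, average against the buffer, write final pixels. */
  conv_params->do_average = 1;
  av1_dist_wtd_convolve_y_neon(ref1, ref1_stride, dst8, dst8_stride, w, h,
                               filter_params_y, subpel_y_qn, conv_params);
}
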
+
+static INLINE uint16x4_t convolve8_4_y(const int16x4_t s0, const int16x4_t s1,
+ const int16x4_t s2, const int16x4_t s3,
+ const int16x4_t s4, const int16x4_t s5,
+ const int16x4_t s6, const int16x4_t s7,
+ const int16x8_t y_filter,
+ const int16x4_t round_offset) {
+ const int16x4_t y_filter_0_3 = vget_low_s16(y_filter);
+ const int16x4_t y_filter_4_7 = vget_high_s16(y_filter);
+
+ int16x4_t sum = vmul_lane_s16(s0, y_filter_0_3, 0);
+ sum = vmla_lane_s16(sum, s1, y_filter_0_3, 1);
+ sum = vmla_lane_s16(sum, s2, y_filter_0_3, 2);
+ sum = vmla_lane_s16(sum, s3, y_filter_0_3, 3);
+ sum = vmla_lane_s16(sum, s4, y_filter_4_7, 0);
+ sum = vmla_lane_s16(sum, s5, y_filter_4_7, 1);
+ sum = vmla_lane_s16(sum, s6, y_filter_4_7, 2);
+ sum = vmla_lane_s16(sum, s7, y_filter_4_7, 3);
+
+ // We halved the convolution filter values so -1 from the right shift.
+ int16x4_t res = vrsra_n_s16(round_offset, sum, ROUND0_BITS - 1);
+ return vreinterpret_u16_s16(res);
+}
+
+static INLINE uint16x8_t convolve8_8_y(const int16x8_t s0, const int16x8_t s1,
+ const int16x8_t s2, const int16x8_t s3,
+ const int16x8_t s4, const int16x8_t s5,
+ const int16x8_t s6, const int16x8_t s7,
+ const int16x8_t y_filter,
+ const int16x8_t round_offset) {
+ const int16x4_t y_filter_0_3 = vget_low_s16(y_filter);
+ const int16x4_t y_filter_4_7 = vget_high_s16(y_filter);
+
+ int16x8_t sum = vmulq_lane_s16(s0, y_filter_0_3, 0);
+ sum = vmlaq_lane_s16(sum, s1, y_filter_0_3, 1);
+ sum = vmlaq_lane_s16(sum, s2, y_filter_0_3, 2);
+ sum = vmlaq_lane_s16(sum, s3, y_filter_0_3, 3);
+ sum = vmlaq_lane_s16(sum, s4, y_filter_4_7, 0);
+ sum = vmlaq_lane_s16(sum, s5, y_filter_4_7, 1);
+ sum = vmlaq_lane_s16(sum, s6, y_filter_4_7, 2);
+ sum = vmlaq_lane_s16(sum, s7, y_filter_4_7, 3);
+
+ // We halved the convolution filter values so -1 from the right shift.
+ int16x8_t res = vrsraq_n_s16(round_offset, sum, ROUND0_BITS - 1);
+ return vreinterpretq_u16_s16(res);
+}
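
convolve8_{4,8}_y are the full 8-tap equivalents. Because every AV1 filter tap is even, halving the taps and shortening the rounding shift by one is exact: with R = ROUND0_BITS,
(dot(s, taps) + (1 << (R - 1))) >> R == (dot(s, taps / 2) + (1 << (R - 2))) >> (R - 1).
A scalar sketch with R = 3, mirroring the vrsra_n_s16(..., ROUND0_BITS - 1) above:

#include <stdint.h>

/* Scalar sketch of the 8-tap vertical kernel operating on halved taps. */
static uint16_t convolve8_pixel_y(const int16_t s[8],
                                  const int16_t halved_taps[8],
                                  int16_t round_offset) {
  int32_t sum = 0;
  for (int k = 0; k < 8; k++) sum += s[k] * halved_taps[k];
  /* Rounding right shift by ROUND0_BITS - 1 == 2, accumulated onto the
   * compound offset; reinterpreted as unsigned, as in the NEON helper. */
  return (uint16_t)(int16_t)(round_offset + ((sum + 2) >> 2));
}
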
+
+static INLINE void dist_wtd_convolve_y_8tap_dist_wtd_avg_neon(
+ const uint8_t *src_ptr, int src_stride, uint8_t *dst8_ptr,
+ const int dst8_stride, int w, int h, const int16x8_t y_filter,
+ ConvolveParams *conv_params) {
+ const int bd = 8;
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const int16_t round_offset = (1 << (offset_bits - COMPOUND_ROUND1_BITS)) +
+ (1 << (offset_bits - COMPOUND_ROUND1_BITS - 1));
+ const int16x8_t round_offset_vec = vdupq_n_s16(round_offset);
+
+ const uint16_t fwd_offset = conv_params->fwd_offset;
+ const uint16_t bck_offset = conv_params->bck_offset;
+
+ CONV_BUF_TYPE *dst_ptr = conv_params->dst;
+ const int dst_stride = conv_params->dst_stride;
+ int width = w;
+
+ if (w == 4 || h == 4) {
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7;
+ uint16x4_t d0, dd0;
+ uint8x8_t t0, t1, t2, t3, t4, t5, t6, d01;
+#if AOM_ARCH_AARCH64
+ int16x4_t s8, s9, s10;
+ uint16x4_t d1, d2, d3, dd1, dd2, dd3;
+ uint8x8_t d23;
+#endif // AOM_ARCH_AARCH64
+
+ do {
+ const uint8_t *s = src_ptr;
+ CONV_BUF_TYPE *d = dst_ptr;
+ uint8_t *d_u8 = dst8_ptr;
+ int height = h;
+
__builtin_prefetch(s + 0 * src_stride);
__builtin_prefetch(s + 1 * src_stride);
__builtin_prefetch(s + 2 * src_stride);
__builtin_prefetch(s + 3 * src_stride);
- load_unaligned_u8_4x8(s, src_stride, &tu0, &tu1, &tu2, &tu3);
+ t0 = load_unaligned_u8_4x1(s + 0 * src_stride);
+ t1 = load_unaligned_u8_4x1(s + 1 * src_stride);
+ t2 = load_unaligned_u8_4x1(s + 2 * src_stride);
+ t3 = load_unaligned_u8_4x1(s + 3 * src_stride);
+ t4 = load_unaligned_u8_4x1(s + 4 * src_stride);
+ t5 = load_unaligned_u8_4x1(s + 5 * src_stride);
+ t6 = load_unaligned_u8_4x1(s + 6 * src_stride);
- u0 = vreinterpretq_s16_u16(vmovl_u8(vreinterpret_u8_u32(tu0)));
- u1 = vreinterpretq_s16_u16(vmovl_u8(vreinterpret_u8_u32(tu1)));
- u2 = vreinterpretq_s16_u16(vmovl_u8(vreinterpret_u8_u32(tu2)));
- u3 = vreinterpretq_s16_u16(vmovl_u8(vreinterpret_u8_u32(tu3)));
-
- s0 = vget_low_s16(u0);
- s1 = vget_high_s16(u0);
- s2 = vget_low_s16(u1);
- s3 = vget_high_s16(u1);
- s4 = vget_low_s16(u2);
- s5 = vget_high_s16(u2);
- s6 = vget_low_s16(u3);
+ s0 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t0)));
+ s1 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t1)));
+ s2 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t2)));
+ s3 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t3)));
+ s4 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t4)));
+ s5 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t5)));
+ s6 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t6)));
__builtin_prefetch(d + 0 * dst_stride);
__builtin_prefetch(d + 1 * dst_stride);
__builtin_prefetch(d + 2 * dst_stride);
__builtin_prefetch(d + 3 * dst_stride);
- s += (7 * src_stride);
+ s += 7 * src_stride;
+
do {
-#if defined(__aarch64__)
- load_unaligned_u8_4x4(s, src_stride, &tu0, &tu1);
+#if AOM_ARCH_AARCH64
+ t0 = load_unaligned_u8_4x1(s + 0 * src_stride);
+ t1 = load_unaligned_u8_4x1(s + 1 * src_stride);
+ t2 = load_unaligned_u8_4x1(s + 2 * src_stride);
+ t3 = load_unaligned_u8_4x1(s + 3 * src_stride);
- u0 = vreinterpretq_s16_u16(vmovl_u8(vreinterpret_u8_u32(tu0)));
- u1 = vreinterpretq_s16_u16(vmovl_u8(vreinterpret_u8_u32(tu1)));
+ s7 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t0)));
+ s8 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t1)));
+ s9 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t2)));
+ s10 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t3)));
- s7 = vget_low_s16(u0);
- s8 = vget_high_s16(u0);
- s9 = vget_low_s16(u1);
- s10 = vget_high_s16(u1);
+ d0 = convolve8_4_y(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ vget_low_s16(round_offset_vec));
+ d1 = convolve8_4_y(s1, s2, s3, s4, s5, s6, s7, s8, y_filter,
+ vget_low_s16(round_offset_vec));
+ d2 = convolve8_4_y(s2, s3, s4, s5, s6, s7, s8, s9, y_filter,
+ vget_low_s16(round_offset_vec));
+ d3 = convolve8_4_y(s3, s4, s5, s6, s7, s8, s9, s10, y_filter,
+ vget_low_s16(round_offset_vec));
- d0 = convolve8_4x4_s16(s0, s1, s2, s3, s4, s5, s6, s7, y_filter, zero,
- shift_vec);
- d0 = vadd_s16(d0, round_offset64);
- d1 = convolve8_4x4_s16(s1, s2, s3, s4, s5, s6, s7, s8, y_filter, zero,
- shift_vec);
- d1 = vadd_s16(d1, round_offset64);
- d2 = convolve8_4x4_s16(s2, s3, s4, s5, s6, s7, s8, s9, y_filter, zero,
- shift_vec);
- d2 = vadd_s16(d2, round_offset64);
- d3 = convolve8_4x4_s16(s3, s4, s5, s6, s7, s8, s9, s10, y_filter, zero,
- shift_vec);
- d3 = vadd_s16(d3, round_offset64);
+ __builtin_prefetch(d + 0 * dst_stride);
+ __builtin_prefetch(d + 1 * dst_stride);
+ __builtin_prefetch(d + 2 * dst_stride);
+ __builtin_prefetch(d + 3 * dst_stride);
- if (conv_params->do_average) {
- __builtin_prefetch(d + 0 * dst_stride);
- __builtin_prefetch(d + 1 * dst_stride);
- __builtin_prefetch(d + 2 * dst_stride);
- __builtin_prefetch(d + 3 * dst_stride);
+ __builtin_prefetch(d_u8 + 0 * dst8_stride);
+ __builtin_prefetch(d_u8 + 1 * dst8_stride);
+ __builtin_prefetch(d_u8 + 2 * dst8_stride);
+ __builtin_prefetch(d_u8 + 3 * dst8_stride);
- __builtin_prefetch(d_u8 + 0 * dst8_stride);
- __builtin_prefetch(d_u8 + 1 * dst8_stride);
- __builtin_prefetch(d_u8 + 2 * dst8_stride);
- __builtin_prefetch(d_u8 + 3 * dst8_stride);
+ load_u16_4x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
- load_u16_4x4(d, dst_stride, &res4, &res5, &res6, &res7);
- d += (dst_stride << 2);
+ compute_dist_wtd_avg_4x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3, fwd_offset,
+ bck_offset, round_offset_vec, &d01, &d23);
- compute_avg_4x4(res4, res5, res6, res7, vreinterpret_u16_s16(d0),
- vreinterpret_u16_s16(d1), vreinterpret_u16_s16(d2),
- vreinterpret_u16_s16(d3), fwd_offset, bck_offset,
- round_offset64, round_bits, use_dist_wtd_comp_avg,
- &t0, &t1);
-
- vst1_lane_u32((uint32_t *)d_u8, vreinterpret_u32_u8(t0), 0);
- d_u8 += dst8_stride;
- vst1_lane_u32((uint32_t *)d_u8, vreinterpret_u32_u8(t0), 1);
- d_u8 += dst8_stride;
- vst1_lane_u32((uint32_t *)d_u8, vreinterpret_u32_u8(t1), 0);
- d_u8 += dst8_stride;
- vst1_lane_u32((uint32_t *)d_u8, vreinterpret_u32_u8(t1), 1);
- d_u8 += dst8_stride;
- } else {
- store_u16_4x4(d, dst_stride, vreinterpret_u16_s16(d0),
- vreinterpret_u16_s16(d1), vreinterpret_u16_s16(d2),
- vreinterpret_u16_s16(d3));
- d += (dst_stride << 2);
- }
+ store_u8_4x1(d_u8 + 0 * dst8_stride, d01, 0);
+ store_u8_4x1(d_u8 + 1 * dst8_stride, d01, 1);
+ store_u8_4x1(d_u8 + 2 * dst8_stride, d23, 0);
+ store_u8_4x1(d_u8 + 3 * dst8_stride, d23, 1);
s0 = s4;
s1 = s5;
@@ -2301,35 +4594,25 @@
s4 = s8;
s5 = s9;
s6 = s10;
-
- s += (src_stride << 2);
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ d_u8 += 4 * dst8_stride;
height -= 4;
-#else
- load_unaligned_u8_4x1(s, src_stride, &tu0);
- u0 = vreinterpretq_s16_u16(vmovl_u8(vreinterpret_u8_u32(tu0)));
- s7 = vget_low_s16(u0);
+#else // !AOM_ARCH_AARCH64
+ t0 = load_unaligned_u8_4x1(s);
+ s7 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t0)));
- d0 = convolve8_4x4_s16(s0, s1, s2, s3, s4, s5, s6, s7, y_filter, zero,
- shift_vec);
+ d0 = convolve8_4_y(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ vget_low_s16(round_offset_vec));
- d0 = vadd_s16(d0, round_offset64);
+ __builtin_prefetch(d);
- if (conv_params->do_average) {
- __builtin_prefetch(d);
+ dd0 = vld1_u16(d);
- res4 = vld1_u16(d);
- d += (dst_stride);
+ compute_dist_wtd_avg_4x1(dd0, d0, fwd_offset, bck_offset,
+ vget_low_s16(round_offset_vec), &d01);
- compute_avg_4x1(res4, vreinterpret_u16_s16(d0), fwd_offset,
- bck_offset, round_offset64, round_bits,
- use_dist_wtd_comp_avg, &t0);
-
- vst1_lane_u32((uint32_t *)d_u8, vreinterpret_u32_u8(t0), 0);
- d_u8 += dst8_stride;
- } else {
- vst1_u16(d, vreinterpret_u16_s16(d0));
- d += (dst_stride);
- }
+ store_u8_4x1(d_u8, d01, 0);
s0 = s1;
s1 = s2;
@@ -2338,43 +4621,42 @@
s4 = s5;
s5 = s6;
s6 = s7;
-
- s += (src_stride);
+ s += src_stride;
+ d += dst_stride;
+ d_u8 += dst8_stride;
height--;
-#endif
- } while (height > 0);
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
src_ptr += 4;
dst_ptr += 4;
- dst_u8_ptr += 4;
+ dst8_ptr += 4;
width -= 4;
- } while (width > 0);
+ } while (width != 0);
} else {
- CONV_BUF_TYPE *d_tmp;
int16x8_t s0, s1, s2, s3, s4, s5, s6, s7;
- int16x8_t res0;
- uint16x8_t res8;
- uint8x8_t t0, t1, t2, t3, t4, t5, t6, t7;
- const int16x8_t round_offset128 = vdupq_n_s16(round_offset);
- const int16x8_t shift_vec = vdupq_n_s16(-shift_value);
- const int16x4_t round_offset64 = vdup_n_s16(round_offset);
- const int16x8_t zero = vdupq_n_s16(0);
-#if defined(__aarch64__)
+ uint16x8_t d0, dd0;
+ uint8x8_t d0_u8, t0, t1, t2, t3, t4, t5, t6;
+#if AOM_ARCH_AARCH64
int16x8_t s8, s9, s10, s11, s12, s13, s14;
- int16x8_t res1, res2, res3, res4, res5, res6, res7;
- uint16x8_t res10, res11, res9;
-#endif
- dst_ptr = dst;
- dst_u8_ptr = dst8;
+ uint16x8_t d1, d2, d3, d4, d5, d6, d7, dd1, dd2, dd3, dd4, dd5, dd6, dd7;
+ uint8x8_t d1_u8, d2_u8, d3_u8, d4_u8, d5_u8, d6_u8, d7_u8, t7;
+#endif // AOM_ARCH_AARCH64
+
do {
- __builtin_prefetch(src_ptr + 0 * src_stride);
- __builtin_prefetch(src_ptr + 1 * src_stride);
- __builtin_prefetch(src_ptr + 2 * src_stride);
- __builtin_prefetch(src_ptr + 3 * src_stride);
- __builtin_prefetch(src_ptr + 4 * src_stride);
- __builtin_prefetch(src_ptr + 5 * src_stride);
- __builtin_prefetch(src_ptr + 6 * src_stride);
- __builtin_prefetch(src_ptr + 7 * src_stride);
- load_u8_8x8(src_ptr, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+ const uint8_t *s = src_ptr;
+ CONV_BUF_TYPE *d = dst_ptr;
+ uint8_t *d_u8 = dst8_ptr;
+ int height = h;
+
+ __builtin_prefetch(s + 0 * src_stride);
+ __builtin_prefetch(s + 1 * src_stride);
+ __builtin_prefetch(s + 2 * src_stride);
+ __builtin_prefetch(s + 3 * src_stride);
+ __builtin_prefetch(s + 4 * src_stride);
+ __builtin_prefetch(s + 5 * src_stride);
+ __builtin_prefetch(s + 6 * src_stride);
+ __builtin_prefetch(s + 7 * src_stride);
+ load_u8_8x7(s, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6);
s0 = vreinterpretq_s16_u16(vmovl_u8(t0));
s1 = vreinterpretq_s16_u16(vmovl_u8(t1));
@@ -2384,13 +4666,10 @@
s5 = vreinterpretq_s16_u16(vmovl_u8(t5));
s6 = vreinterpretq_s16_u16(vmovl_u8(t6));
- height = h;
- s = src_ptr + (7 * src_stride);
- d_tmp = dst_ptr;
- d_u8 = dst_u8_ptr;
+ s += 7 * src_stride;
do {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
load_u8_8x8(s, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
s7 = vreinterpretq_s16_u16(vmovl_u8(t0));
@@ -2407,71 +4686,45 @@
__builtin_prefetch(dst_ptr + 2 * dst_stride);
__builtin_prefetch(dst_ptr + 3 * dst_stride);
- res0 = convolve8_8x8_s16(s0, s1, s2, s3, s4, s5, s6, s7, y_filter, zero,
- shift_vec);
- res0 = vaddq_s16(res0, round_offset128);
- res1 = convolve8_8x8_s16(s1, s2, s3, s4, s5, s6, s7, s8, y_filter, zero,
- shift_vec);
- res1 = vaddq_s16(res1, round_offset128);
- res2 = convolve8_8x8_s16(s2, s3, s4, s5, s6, s7, s8, s9, y_filter, zero,
- shift_vec);
- res2 = vaddq_s16(res2, round_offset128);
- res3 = convolve8_8x8_s16(s3, s4, s5, s6, s7, s8, s9, s10, y_filter,
- zero, shift_vec);
- res3 = vaddq_s16(res3, round_offset128);
- res4 = convolve8_8x8_s16(s4, s5, s6, s7, s8, s9, s10, s11, y_filter,
- zero, shift_vec);
- res4 = vaddq_s16(res4, round_offset128);
- res5 = convolve8_8x8_s16(s5, s6, s7, s8, s9, s10, s11, s12, y_filter,
- zero, shift_vec);
- res5 = vaddq_s16(res5, round_offset128);
- res6 = convolve8_8x8_s16(s6, s7, s8, s9, s10, s11, s12, s13, y_filter,
- zero, shift_vec);
- res6 = vaddq_s16(res6, round_offset128);
- res7 = convolve8_8x8_s16(s7, s8, s9, s10, s11, s12, s13, s14, y_filter,
- zero, shift_vec);
- res7 = vaddq_s16(res7, round_offset128);
+ d0 = convolve8_8_y(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ round_offset_vec);
+ d1 = convolve8_8_y(s1, s2, s3, s4, s5, s6, s7, s8, y_filter,
+ round_offset_vec);
+ d2 = convolve8_8_y(s2, s3, s4, s5, s6, s7, s8, s9, y_filter,
+ round_offset_vec);
+ d3 = convolve8_8_y(s3, s4, s5, s6, s7, s8, s9, s10, y_filter,
+ round_offset_vec);
+ d4 = convolve8_8_y(s4, s5, s6, s7, s8, s9, s10, s11, y_filter,
+ round_offset_vec);
+ d5 = convolve8_8_y(s5, s6, s7, s8, s9, s10, s11, s12, y_filter,
+ round_offset_vec);
+ d6 = convolve8_8_y(s6, s7, s8, s9, s10, s11, s12, s13, y_filter,
+ round_offset_vec);
+ d7 = convolve8_8_y(s7, s8, s9, s10, s11, s12, s13, s14, y_filter,
+ round_offset_vec);
- if (conv_params->do_average) {
- __builtin_prefetch(d_tmp + 0 * dst8_stride);
- __builtin_prefetch(d_tmp + 1 * dst8_stride);
- __builtin_prefetch(d_tmp + 2 * dst8_stride);
- __builtin_prefetch(d_tmp + 3 * dst8_stride);
+ __builtin_prefetch(d + 0 * dst8_stride);
+ __builtin_prefetch(d + 1 * dst8_stride);
+ __builtin_prefetch(d + 2 * dst8_stride);
+ __builtin_prefetch(d + 3 * dst8_stride);
- load_u16_8x4(d_tmp, dst_stride, &res8, &res9, &res10, &res11);
- d_tmp += (dst_stride << 2);
+ load_u16_8x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
- compute_avg_8x4(res8, res9, res10, res11, vreinterpretq_u16_s16(res0),
- vreinterpretq_u16_s16(res1),
- vreinterpretq_u16_s16(res2),
- vreinterpretq_u16_s16(res3), fwd_offset, bck_offset,
- round_offset64, round_bits, use_dist_wtd_comp_avg,
- &t0, &t1, &t2, &t3);
+ compute_dist_wtd_avg_8x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3, fwd_offset,
+ bck_offset, round_offset_vec, &d0_u8, &d1_u8,
+ &d2_u8, &d3_u8);
- store_u8_8x4(d_u8, dst8_stride, t0, t1, t2, t3);
- d_u8 += (dst8_stride << 2);
+ store_u8_8x4(d_u8, dst8_stride, d0_u8, d1_u8, d2_u8, d3_u8);
+ d_u8 += 4 * dst8_stride;
- load_u16_8x4(d_tmp, dst_stride, &res8, &res9, &res10, &res11);
- d_tmp += (dst_stride << 2);
+ load_u16_8x4(d + 4 * dst_stride, dst_stride, &dd4, &dd5, &dd6, &dd7);
- compute_avg_8x4(res8, res9, res10, res11, vreinterpretq_u16_s16(res4),
- vreinterpretq_u16_s16(res5),
- vreinterpretq_u16_s16(res6),
- vreinterpretq_u16_s16(res7), fwd_offset, bck_offset,
- round_offset64, round_bits, use_dist_wtd_comp_avg,
- &t0, &t1, &t2, &t3);
+ compute_dist_wtd_avg_8x4(dd4, dd5, dd6, dd7, d4, d5, d6, d7, fwd_offset,
+ bck_offset, round_offset_vec, &d4_u8, &d5_u8,
+ &d6_u8, &d7_u8);
- store_u8_8x4(d_u8, dst8_stride, t0, t1, t2, t3);
- d_u8 += (dst8_stride << 2);
- } else {
- store_u16_8x8(
- d_tmp, dst_stride, vreinterpretq_u16_s16(res0),
- vreinterpretq_u16_s16(res1), vreinterpretq_u16_s16(res2),
- vreinterpretq_u16_s16(res3), vreinterpretq_u16_s16(res4),
- vreinterpretq_u16_s16(res5), vreinterpretq_u16_s16(res6),
- vreinterpretq_u16_s16(res7));
- d_tmp += (dst_stride << 3);
- }
+ store_u8_8x4(d_u8, dst8_stride, d4_u8, d5_u8, d6_u8, d7_u8);
+ d_u8 += 4 * dst8_stride;
s0 = s8;
s1 = s9;
@@ -2480,16 +4733,16 @@
s4 = s12;
s5 = s13;
s6 = s14;
- s += (8 * src_stride);
+ s += 8 * src_stride;
+ d += 8 * dst_stride;
height -= 8;
-#else
+#else // !AOM_ARCH_AARCH64
s7 = vreinterpretq_s16_u16(vmovl_u8(vld1_u8(s)));
__builtin_prefetch(dst_ptr);
- res0 = convolve8_8x8_s16(s0, s1, s2, s3, s4, s5, s6, s7, y_filter, zero,
- shift_vec);
- res0 = vaddq_s16(res0, round_offset128);
+ d0 = convolve8_8_y(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ round_offset_vec);
s0 = s1;
s1 = s2;
@@ -2499,31 +4752,585 @@
s5 = s6;
s6 = s7;
- if (conv_params->do_average) {
- __builtin_prefetch(d_tmp);
+ __builtin_prefetch(d);
- res8 = vld1q_u16(d_tmp);
- d_tmp += (dst_stride);
+ dd0 = vld1q_u16(d);
- compute_avg_8x1(res8, vreinterpretq_u16_s16(res0), fwd_offset,
- bck_offset, round_offset64, round_bits,
- use_dist_wtd_comp_avg, &t0);
+ compute_dist_wtd_avg_8x1(dd0, d0, fwd_offset, bck_offset,
+ round_offset_vec, &d0_u8);
- vst1_u8(d_u8, t0);
- d_u8 += (dst8_stride);
- } else {
- vst1q_u16(d_tmp, vreinterpretq_u16_s16(res0));
- d_tmp += dst_stride;
- }
+ vst1_u8(d_u8, d0_u8);
+ d_u8 += dst8_stride;
- s += (src_stride);
+ s += src_stride;
+ d += dst_stride;
height--;
-#endif
- } while (height > 0);
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
src_ptr += 8;
dst_ptr += 8;
- dst_u8_ptr += 8;
+ dst8_ptr += 8;
width -= 8;
- } while (width > 0);
+ } while (width != 0);
+ }
+}
+
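The compute_dist_wtd_avg_* helpers used above blend the stored first prediction with the newly convolved one using the distance-derived weights. A scalar sketch, assuming AV1's DIST_PRECISION_BITS == 4 (so fwd_offset + bck_offset == 16) and this function's constants (round_bits = 2 * FILTER_BITS - ROUND0_BITS - COMPOUND_ROUND1_BITS = 4):

#include <stdint.h>

// Scalar model of the distance-weighted compound average; dd is the stored
// prediction, d the fresh one, both still carrying round_offset.
static uint8_t dist_wtd_avg_sketch(uint16_t dd, uint16_t d,
                                   uint16_t fwd_offset, uint16_t bck_offset,
                                   int16_t round_offset) {
  int32_t tmp = ((int32_t)dd * fwd_offset + (int32_t)d * bck_offset) >> 4;
  tmp -= round_offset;          // remove the intermediate-precision offset
  tmp = (tmp + (1 << 3)) >> 4;  // ROUND_POWER_OF_TWO(tmp, round_bits = 4)
  return (uint8_t)(tmp < 0 ? 0 : (tmp > 255 ? 255 : tmp));  // clip_pixel()
}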
+static INLINE void dist_wtd_convolve_y_8tap_avg_neon(
+ const uint8_t *src_ptr, int src_stride, uint8_t *dst8_ptr,
+ const int dst8_stride, int w, int h, const int16x8_t y_filter,
+ ConvolveParams *conv_params) {
+ const int bd = 8;
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const int16_t round_offset = (1 << (offset_bits - COMPOUND_ROUND1_BITS)) +
+ (1 << (offset_bits - COMPOUND_ROUND1_BITS - 1));
+ const int16x8_t round_offset_vec = vdupq_n_s16(round_offset);
+
+ CONV_BUF_TYPE *dst_ptr = conv_params->dst;
+ const int dst_stride = conv_params->dst_stride;
+ int width = w;
+
+ if (w == 4 || h == 4) {
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7;
+ uint16x4_t d0, dd0;
+ uint8x8_t t0, t1, t2, t3, t4, t5, t6, d01;
+#if AOM_ARCH_AARCH64
+ int16x4_t s8, s9, s10;
+ uint16x4_t d1, d2, d3, dd1, dd2, dd3;
+ uint8x8_t d23;
+#endif // AOM_ARCH_AARCH64
+
+ do {
+ const uint8_t *s = src_ptr;
+ CONV_BUF_TYPE *d = dst_ptr;
+ uint8_t *d_u8 = dst8_ptr;
+ int height = h;
+
+ __builtin_prefetch(s + 0 * src_stride);
+ __builtin_prefetch(s + 1 * src_stride);
+ __builtin_prefetch(s + 2 * src_stride);
+ __builtin_prefetch(s + 3 * src_stride);
+
+ t0 = load_unaligned_u8_4x1(s + 0 * src_stride);
+ t1 = load_unaligned_u8_4x1(s + 1 * src_stride);
+ t2 = load_unaligned_u8_4x1(s + 2 * src_stride);
+ t3 = load_unaligned_u8_4x1(s + 3 * src_stride);
+ t4 = load_unaligned_u8_4x1(s + 4 * src_stride);
+ t5 = load_unaligned_u8_4x1(s + 5 * src_stride);
+ t6 = load_unaligned_u8_4x1(s + 6 * src_stride);
+
+ s0 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t0)));
+ s1 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t1)));
+ s2 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t2)));
+ s3 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t3)));
+ s4 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t4)));
+ s5 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t5)));
+ s6 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t6)));
+
+ __builtin_prefetch(d + 0 * dst_stride);
+ __builtin_prefetch(d + 1 * dst_stride);
+ __builtin_prefetch(d + 2 * dst_stride);
+ __builtin_prefetch(d + 3 * dst_stride);
+
+ s += 7 * src_stride;
+
+ do {
+#if AOM_ARCH_AARCH64
+ t0 = load_unaligned_u8_4x1(s + 0 * src_stride);
+ t1 = load_unaligned_u8_4x1(s + 1 * src_stride);
+ t2 = load_unaligned_u8_4x1(s + 2 * src_stride);
+ t3 = load_unaligned_u8_4x1(s + 3 * src_stride);
+
+ s7 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t0)));
+ s8 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t1)));
+ s9 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t2)));
+ s10 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t3)));
+
+ d0 = convolve8_4_y(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ vget_low_s16(round_offset_vec));
+ d1 = convolve8_4_y(s1, s2, s3, s4, s5, s6, s7, s8, y_filter,
+ vget_low_s16(round_offset_vec));
+ d2 = convolve8_4_y(s2, s3, s4, s5, s6, s7, s8, s9, y_filter,
+ vget_low_s16(round_offset_vec));
+ d3 = convolve8_4_y(s3, s4, s5, s6, s7, s8, s9, s10, y_filter,
+ vget_low_s16(round_offset_vec));
+
+ __builtin_prefetch(d + 0 * dst_stride);
+ __builtin_prefetch(d + 1 * dst_stride);
+ __builtin_prefetch(d + 2 * dst_stride);
+ __builtin_prefetch(d + 3 * dst_stride);
+
+ __builtin_prefetch(d_u8 + 0 * dst8_stride);
+ __builtin_prefetch(d_u8 + 1 * dst8_stride);
+ __builtin_prefetch(d_u8 + 2 * dst8_stride);
+ __builtin_prefetch(d_u8 + 3 * dst8_stride);
+
+ load_u16_4x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
+
+ compute_basic_avg_4x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3,
+ round_offset_vec, &d01, &d23);
+
+ store_u8_4x1(d_u8 + 0 * dst8_stride, d01, 0);
+ store_u8_4x1(d_u8 + 1 * dst8_stride, d01, 1);
+ store_u8_4x1(d_u8 + 2 * dst8_stride, d23, 0);
+ store_u8_4x1(d_u8 + 3 * dst8_stride, d23, 1);
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ d_u8 += 4 * dst8_stride;
+ height -= 4;
+#else // !AOM_ARCH_AARCH64
+ t0 = load_unaligned_u8_4x1(s);
+ s7 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t0)));
+
+ d0 = convolve8_4_y(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ vget_low_s16(round_offset_vec));
+
+ __builtin_prefetch(d);
+
+ dd0 = vld1_u16(d);
+
+ compute_basic_avg_4x1(dd0, d0, vget_low_s16(round_offset_vec), &d01);
+
+ store_u8_4x1(d_u8, d01, 0);
+
+ s0 = s1;
+ s1 = s2;
+ s2 = s3;
+ s3 = s4;
+ s4 = s5;
+ s5 = s6;
+ s6 = s7;
+ s += src_stride;
+ d += dst_stride;
+ d_u8 += dst8_stride;
+ height--;
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
+ src_ptr += 4;
+ dst_ptr += 4;
+ dst8_ptr += 4;
+ width -= 4;
+ } while (width != 0);
+ } else {
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7;
+ uint16x8_t d0, dd0;
+ uint8x8_t d0_u8, t0, t1, t2, t3, t4, t5, t6;
+#if AOM_ARCH_AARCH64
+ int16x8_t s8, s9, s10, s11, s12, s13, s14;
+ uint16x8_t d1, d2, d3, d4, d5, d6, d7, dd1, dd2, dd3, dd4, dd5, dd6, dd7;
+ uint8x8_t d1_u8, d2_u8, d3_u8, d4_u8, d5_u8, d6_u8, d7_u8, t7;
+#endif // AOM_ARCH_AARCH64
+
+ do {
+ const uint8_t *s = src_ptr;
+ CONV_BUF_TYPE *d = dst_ptr;
+ uint8_t *d_u8 = dst8_ptr;
+ int height = h;
+
+ __builtin_prefetch(s + 0 * src_stride);
+ __builtin_prefetch(s + 1 * src_stride);
+ __builtin_prefetch(s + 2 * src_stride);
+ __builtin_prefetch(s + 3 * src_stride);
+ __builtin_prefetch(s + 4 * src_stride);
+ __builtin_prefetch(s + 5 * src_stride);
+ __builtin_prefetch(s + 6 * src_stride);
+ __builtin_prefetch(s + 7 * src_stride);
+ load_u8_8x7(s, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6);
+
+ s0 = vreinterpretq_s16_u16(vmovl_u8(t0));
+ s1 = vreinterpretq_s16_u16(vmovl_u8(t1));
+ s2 = vreinterpretq_s16_u16(vmovl_u8(t2));
+ s3 = vreinterpretq_s16_u16(vmovl_u8(t3));
+ s4 = vreinterpretq_s16_u16(vmovl_u8(t4));
+ s5 = vreinterpretq_s16_u16(vmovl_u8(t5));
+ s6 = vreinterpretq_s16_u16(vmovl_u8(t6));
+
+ s += 7 * src_stride;
+
+ do {
+#if AOM_ARCH_AARCH64
+ load_u8_8x8(s, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+
+ s7 = vreinterpretq_s16_u16(vmovl_u8(t0));
+ s8 = vreinterpretq_s16_u16(vmovl_u8(t1));
+ s9 = vreinterpretq_s16_u16(vmovl_u8(t2));
+ s10 = vreinterpretq_s16_u16(vmovl_u8(t3));
+ s11 = vreinterpretq_s16_u16(vmovl_u8(t4));
+ s12 = vreinterpretq_s16_u16(vmovl_u8(t5));
+ s13 = vreinterpretq_s16_u16(vmovl_u8(t6));
+ s14 = vreinterpretq_s16_u16(vmovl_u8(t7));
+
+ __builtin_prefetch(dst_ptr + 0 * dst_stride);
+ __builtin_prefetch(dst_ptr + 1 * dst_stride);
+ __builtin_prefetch(dst_ptr + 2 * dst_stride);
+ __builtin_prefetch(dst_ptr + 3 * dst_stride);
+
+ d0 = convolve8_8_y(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ round_offset_vec);
+ d1 = convolve8_8_y(s1, s2, s3, s4, s5, s6, s7, s8, y_filter,
+ round_offset_vec);
+ d2 = convolve8_8_y(s2, s3, s4, s5, s6, s7, s8, s9, y_filter,
+ round_offset_vec);
+ d3 = convolve8_8_y(s3, s4, s5, s6, s7, s8, s9, s10, y_filter,
+ round_offset_vec);
+ d4 = convolve8_8_y(s4, s5, s6, s7, s8, s9, s10, s11, y_filter,
+ round_offset_vec);
+ d5 = convolve8_8_y(s5, s6, s7, s8, s9, s10, s11, s12, y_filter,
+ round_offset_vec);
+ d6 = convolve8_8_y(s6, s7, s8, s9, s10, s11, s12, s13, y_filter,
+ round_offset_vec);
+ d7 = convolve8_8_y(s7, s8, s9, s10, s11, s12, s13, s14, y_filter,
+ round_offset_vec);
+
+ __builtin_prefetch(d + 0 * dst8_stride);
+ __builtin_prefetch(d + 1 * dst8_stride);
+ __builtin_prefetch(d + 2 * dst8_stride);
+ __builtin_prefetch(d + 3 * dst8_stride);
+
+ load_u16_8x4(d, dst_stride, &dd0, &dd1, &dd2, &dd3);
+
+ compute_basic_avg_8x4(dd0, dd1, dd2, dd3, d0, d1, d2, d3,
+ round_offset_vec, &d0_u8, &d1_u8, &d2_u8, &d3_u8);
+
+ store_u8_8x4(d_u8, dst8_stride, d0_u8, d1_u8, d2_u8, d3_u8);
+ d_u8 += 4 * dst8_stride;
+
+ load_u16_8x4(d + 4 * dst_stride, dst_stride, &dd4, &dd5, &dd6, &dd7);
+
+ compute_basic_avg_8x4(dd4, dd5, dd6, dd7, d4, d5, d6, d7,
+ round_offset_vec, &d4_u8, &d5_u8, &d6_u8, &d7_u8);
+
+ store_u8_8x4(d_u8, dst8_stride, d4_u8, d5_u8, d6_u8, d7_u8);
+ d_u8 += 4 * dst8_stride;
+
+ s0 = s8;
+ s1 = s9;
+ s2 = s10;
+ s3 = s11;
+ s4 = s12;
+ s5 = s13;
+ s6 = s14;
+ s += 8 * src_stride;
+ d += 8 * dst_stride;
+ height -= 8;
+#else // !AOM_ARCH_AARCH64
+ s7 = vreinterpretq_s16_u16(vmovl_u8(vld1_u8(s)));
+
+ __builtin_prefetch(dst_ptr);
+
+ d0 = convolve8_8_y(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ round_offset_vec);
+
+ s0 = s1;
+ s1 = s2;
+ s2 = s3;
+ s3 = s4;
+ s4 = s5;
+ s5 = s6;
+ s6 = s7;
+
+ __builtin_prefetch(d);
+
+ dd0 = vld1q_u16(d);
+
+ compute_basic_avg_8x1(dd0, d0, round_offset_vec, &d0_u8);
+
+ vst1_u8(d_u8, d0_u8);
+ d_u8 += dst8_stride;
+
+ s += src_stride;
+ d += dst_stride;
+ height--;
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
+ src_ptr += 8;
+ dst_ptr += 8;
+ dst8_ptr += 8;
+ width -= 8;
+ } while (width != 0);
+ }
+}
+
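compute_basic_avg_* is presumed to be the unweighted counterpart of the dist-weighted helper sketched earlier: a plain mean replaces the weighted sum, with the same offset removal and final rounding.

#include <stdint.h>

// Scalar model of the basic compound average (same assumed constants as the
// dist-weighted sketch above).
static uint8_t basic_avg_sketch(uint16_t dd, uint16_t d, int16_t round_offset) {
  int32_t tmp = ((int32_t)dd + d) >> 1;  // plain mean of the two predictions
  tmp -= round_offset;
  tmp = (tmp + (1 << 3)) >> 4;           // ROUND_POWER_OF_TWO(tmp, 4)
  return (uint8_t)(tmp < 0 ? 0 : (tmp > 255 ? 255 : tmp));
}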
+static INLINE void dist_wtd_convolve_y_8tap_neon(const uint8_t *src_ptr,
+ int src_stride, int w, int h,
+ const int16x8_t y_filter,
+ ConvolveParams *conv_params) {
+ const int bd = 8;
+ const int offset_bits = bd + 2 * FILTER_BITS - ROUND0_BITS;
+ const int16_t round_offset = (1 << (offset_bits - COMPOUND_ROUND1_BITS)) +
+ (1 << (offset_bits - COMPOUND_ROUND1_BITS - 1));
+ const int16x8_t round_offset_vec = vdupq_n_s16(round_offset);
+
+ CONV_BUF_TYPE *dst_ptr = conv_params->dst;
+ const int dst_stride = conv_params->dst_stride;
+ int width = w;
+
+ if (w == 4 || h == 4) {
+ int16x4_t s0, s1, s2, s3, s4, s5, s6, s7;
+ uint16x4_t d0;
+ uint8x8_t t0, t1, t2, t3, t4, t5, t6;
+#if AOM_ARCH_AARCH64
+ int16x4_t s8, s9, s10;
+ uint16x4_t d1, d2, d3;
+#endif // AOM_ARCH_AARCH64
+
+ do {
+ const uint8_t *s = src_ptr;
+ CONV_BUF_TYPE *d = dst_ptr;
+ int height = h;
+
+ __builtin_prefetch(s + 0 * src_stride);
+ __builtin_prefetch(s + 1 * src_stride);
+ __builtin_prefetch(s + 2 * src_stride);
+ __builtin_prefetch(s + 3 * src_stride);
+
+ t0 = load_unaligned_u8_4x1(s + 0 * src_stride);
+ t1 = load_unaligned_u8_4x1(s + 1 * src_stride);
+ t2 = load_unaligned_u8_4x1(s + 2 * src_stride);
+ t3 = load_unaligned_u8_4x1(s + 3 * src_stride);
+ t4 = load_unaligned_u8_4x1(s + 4 * src_stride);
+ t5 = load_unaligned_u8_4x1(s + 5 * src_stride);
+ t6 = load_unaligned_u8_4x1(s + 6 * src_stride);
+
+ s0 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t0)));
+ s1 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t1)));
+ s2 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t2)));
+ s3 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t3)));
+ s4 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t4)));
+ s5 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t5)));
+ s6 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t6)));
+
+ __builtin_prefetch(d + 0 * dst_stride);
+ __builtin_prefetch(d + 1 * dst_stride);
+ __builtin_prefetch(d + 2 * dst_stride);
+ __builtin_prefetch(d + 3 * dst_stride);
+
+ s += 7 * src_stride;
+
+ do {
+#if AOM_ARCH_AARCH64
+ t0 = load_unaligned_u8_4x1(s + 0 * src_stride);
+ t1 = load_unaligned_u8_4x1(s + 1 * src_stride);
+ t2 = load_unaligned_u8_4x1(s + 2 * src_stride);
+ t3 = load_unaligned_u8_4x1(s + 3 * src_stride);
+
+ s7 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t0)));
+ s8 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t1)));
+ s9 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t2)));
+ s10 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t3)));
+
+ d0 = convolve8_4_y(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ vget_low_s16(round_offset_vec));
+ d1 = convolve8_4_y(s1, s2, s3, s4, s5, s6, s7, s8, y_filter,
+ vget_low_s16(round_offset_vec));
+ d2 = convolve8_4_y(s2, s3, s4, s5, s6, s7, s8, s9, y_filter,
+ vget_low_s16(round_offset_vec));
+ d3 = convolve8_4_y(s3, s4, s5, s6, s7, s8, s9, s10, y_filter,
+ vget_low_s16(round_offset_vec));
+
+ store_u16_4x4(d, dst_stride, d0, d1, d2, d3);
+
+ s0 = s4;
+ s1 = s5;
+ s2 = s6;
+ s3 = s7;
+ s4 = s8;
+ s5 = s9;
+ s6 = s10;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
+ height -= 4;
+#else // !AOM_ARCH_AARCH64
+ t0 = load_unaligned_u8_4x1(s);
+ s7 = vreinterpret_s16_u16(vget_low_u16(vmovl_u8(t0)));
+
+ d0 = convolve8_4_y(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ vget_low_s16(round_offset_vec));
+
+ vst1_u16(d, d0);
+
+ s0 = s1;
+ s1 = s2;
+ s2 = s3;
+ s3 = s4;
+ s4 = s5;
+ s5 = s6;
+ s6 = s7;
+ s += src_stride;
+ d += dst_stride;
+ height--;
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
+ src_ptr += 4;
+ dst_ptr += 4;
+ width -= 4;
+ } while (width != 0);
+ } else {
+ int16x8_t s0, s1, s2, s3, s4, s5, s6, s7;
+ uint16x8_t d0;
+ uint8x8_t t0, t1, t2, t3, t4, t5, t6;
+#if AOM_ARCH_AARCH64
+ int16x8_t s8, s9, s10, s11, s12, s13, s14;
+ uint16x8_t d1, d2, d3, d4, d5, d6, d7;
+ uint8x8_t t7;
+#endif // AOM_ARCH_AARCH64
+
+ do {
+ const uint8_t *s = src_ptr;
+ CONV_BUF_TYPE *d = dst_ptr;
+ int height = h;
+
+ __builtin_prefetch(s + 0 * src_stride);
+ __builtin_prefetch(s + 1 * src_stride);
+ __builtin_prefetch(s + 2 * src_stride);
+ __builtin_prefetch(s + 3 * src_stride);
+ __builtin_prefetch(s + 4 * src_stride);
+ __builtin_prefetch(s + 5 * src_stride);
+ __builtin_prefetch(s + 6 * src_stride);
+ __builtin_prefetch(s + 7 * src_stride);
+ load_u8_8x7(s, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6);
+
+ s0 = vreinterpretq_s16_u16(vmovl_u8(t0));
+ s1 = vreinterpretq_s16_u16(vmovl_u8(t1));
+ s2 = vreinterpretq_s16_u16(vmovl_u8(t2));
+ s3 = vreinterpretq_s16_u16(vmovl_u8(t3));
+ s4 = vreinterpretq_s16_u16(vmovl_u8(t4));
+ s5 = vreinterpretq_s16_u16(vmovl_u8(t5));
+ s6 = vreinterpretq_s16_u16(vmovl_u8(t6));
+
+ s += 7 * src_stride;
+
+ do {
+#if AOM_ARCH_AARCH64
+ load_u8_8x8(s, src_stride, &t0, &t1, &t2, &t3, &t4, &t5, &t6, &t7);
+
+ s7 = vreinterpretq_s16_u16(vmovl_u8(t0));
+ s8 = vreinterpretq_s16_u16(vmovl_u8(t1));
+ s9 = vreinterpretq_s16_u16(vmovl_u8(t2));
+ s10 = vreinterpretq_s16_u16(vmovl_u8(t3));
+ s11 = vreinterpretq_s16_u16(vmovl_u8(t4));
+ s12 = vreinterpretq_s16_u16(vmovl_u8(t5));
+ s13 = vreinterpretq_s16_u16(vmovl_u8(t6));
+ s14 = vreinterpretq_s16_u16(vmovl_u8(t7));
+
+ __builtin_prefetch(dst_ptr + 0 * dst_stride);
+ __builtin_prefetch(dst_ptr + 1 * dst_stride);
+ __builtin_prefetch(dst_ptr + 2 * dst_stride);
+ __builtin_prefetch(dst_ptr + 3 * dst_stride);
+
+ d0 = convolve8_8_y(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ round_offset_vec);
+ d1 = convolve8_8_y(s1, s2, s3, s4, s5, s6, s7, s8, y_filter,
+ round_offset_vec);
+ d2 = convolve8_8_y(s2, s3, s4, s5, s6, s7, s8, s9, y_filter,
+ round_offset_vec);
+ d3 = convolve8_8_y(s3, s4, s5, s6, s7, s8, s9, s10, y_filter,
+ round_offset_vec);
+ d4 = convolve8_8_y(s4, s5, s6, s7, s8, s9, s10, s11, y_filter,
+ round_offset_vec);
+ d5 = convolve8_8_y(s5, s6, s7, s8, s9, s10, s11, s12, y_filter,
+ round_offset_vec);
+ d6 = convolve8_8_y(s6, s7, s8, s9, s10, s11, s12, s13, y_filter,
+ round_offset_vec);
+ d7 = convolve8_8_y(s7, s8, s9, s10, s11, s12, s13, s14, y_filter,
+ round_offset_vec);
+
+ store_u16_8x8(d, dst_stride, d0, d1, d2, d3, d4, d5, d6, d7);
+
+ s0 = s8;
+ s1 = s9;
+ s2 = s10;
+ s3 = s11;
+ s4 = s12;
+ s5 = s13;
+ s6 = s14;
+ s += 8 * src_stride;
+ d += 8 * dst_stride;
+ height -= 8;
+#else // !AOM_ARCH_AARCH64
+ s7 = vreinterpretq_s16_u16(vmovl_u8(vld1_u8(s)));
+
+ __builtin_prefetch(dst_ptr);
+
+ d0 = convolve8_8_y(s0, s1, s2, s3, s4, s5, s6, s7, y_filter,
+ round_offset_vec);
+
+ s0 = s1;
+ s1 = s2;
+ s2 = s3;
+ s3 = s4;
+ s4 = s5;
+ s5 = s6;
+ s6 = s7;
+
+ vst1q_u16(d, d0);
+
+ s += src_stride;
+ d += dst_stride;
+ height--;
+#endif // AOM_ARCH_AARCH64
+ } while (height != 0);
+ src_ptr += 8;
+ dst_ptr += 8;
+ width -= 8;
+ } while (width != 0);
+ }
+}
+
+void av1_dist_wtd_convolve_y_neon(const uint8_t *src, int src_stride,
+ uint8_t *dst8, int dst8_stride, int w, int h,
+ const InterpFilterParams *filter_params_y,
+ const int subpel_y_qn,
+ ConvolveParams *conv_params) {
+ assert(w % 4 == 0);
+ assert(h % 4 == 0);
+
+ // Vertical filter.
+ const int16_t *y_filter_ptr = av1_get_interp_filter_subpel_kernel(
+ filter_params_y, subpel_y_qn & SUBPEL_MASK);
+ // Filter values are even, so downshift by 1 to reduce intermediate
+ // precision requirements.
+ const int16x8_t y_filter = vshrq_n_s16(vld1q_s16(y_filter_ptr), 1);
+
+ const int vert_offset = filter_params_y->taps / 2 - 1;
+ const uint8_t *src_ptr = src - (vert_offset * src_stride);
+
+ if (get_filter_tap(filter_params_y, subpel_y_qn) <= 6) {
+ if (conv_params->do_average) {
+ if (UNLIKELY(conv_params->use_dist_wtd_comp_avg)) {
+ dist_wtd_convolve_y_6tap_dist_wtd_avg_neon(
+ src_ptr + src_stride, src_stride, dst8, dst8_stride, w, h, y_filter,
+ conv_params);
+ } else {
+ dist_wtd_convolve_y_6tap_avg_neon(src_ptr + src_stride, src_stride,
+ dst8, dst8_stride, w, h, y_filter,
+ conv_params);
+ }
+ } else {
+ dist_wtd_convolve_y_6tap_neon(src_ptr + src_stride, src_stride, w, h,
+ y_filter, conv_params);
+ }
+ } else {
+ if (conv_params->do_average) {
+ if (UNLIKELY(conv_params->use_dist_wtd_comp_avg)) {
+ dist_wtd_convolve_y_8tap_dist_wtd_avg_neon(src_ptr, src_stride, dst8,
+ dst8_stride, w, h, y_filter,
+ conv_params);
+ } else {
+ dist_wtd_convolve_y_8tap_avg_neon(src_ptr, src_stride, dst8,
+ dst8_stride, w, h, y_filter,
+ conv_params);
+ }
+ } else {
+ dist_wtd_convolve_y_8tap_neon(src_ptr, src_stride, w, h, y_filter,
+ conv_params);
+ }
}
}
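Dispatch note on the entry point above: the filter taps are halved once up front (safe because AV1 subpel filter coefficients are all even), and kernels with at most 6 taps start one row later (src_ptr + src_stride) since the outermost taps of the 8-tap array are zero. With bd = 8 the shared constants work out as follows (a worked sketch assuming FILTER_BITS = 7, ROUND0_BITS = 3, COMPOUND_ROUND1_BITS = 7):

// offset_bits  = 8 + 2 * 7 - 3         = 19
// round_offset = (1 << 12) + (1 << 11) = 6144
// round_bits   = 2 * 7 - 3 - 7         = 4  (final shift back to 8 bits)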
diff --git a/av1/common/arm/reconintra_neon.c b/av1/common/arm/reconintra_neon.c
index 43c470f..8d190fb 100644
--- a/av1/common/arm/reconintra_neon.c
+++ b/av1/common/arm/reconintra_neon.c
@@ -12,6 +12,8 @@
#include <arm_neon.h>
#include <assert.h>
+#include "config/aom_config.h"
+
#include "aom/aom_integer.h"
#include "aom_dsp/arm/sum_neon.h"
@@ -126,7 +128,7 @@
out_45 = vmlaq_s16(out_45, vreinterpretq_s16_u16(p_b_hi), f5f4_hi);
int16x8_t out_67 = vmulq_s16(vreinterpretq_s16_u16(p_b_lo), f7f6_lo);
out_67 = vmlaq_s16(out_67, vreinterpretq_s16_u16(p_b_hi), f7f6_hi);
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
const int16x8_t out_0123 = vpaddq_s16(out_01, out_23);
const int16x8_t out_4567 = vpaddq_s16(out_45, out_67);
const int16x8_t out_01234567 = vpaddq_s16(out_0123, out_4567);
@@ -137,7 +139,7 @@
vqmovn_s32(vpaddlq_s16(out_67)));
const int16x8_t out_01234567 = vcombine_s16(
vqmovn_s32(vpaddlq_s16(out_0123)), vqmovn_s32(vpaddlq_s16(out_4567)));
-#endif // (__aarch64__)
+#endif // AOM_ARCH_AARCH64
const uint32x2_t out_r =
vreinterpret_u32_u8(vqmovun_s16(vrshrq_n_s16(out_01234567, 4)));
// Storing
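vpaddq_s16() is AArch64-only; the #else branch reproduces the reduction on 32-bit ARM by widening pairwise (vpaddlq_s16) and saturating-narrowing back (vqmovn_s32). A scalar model of the 128-bit pairwise add:

#include <stdint.h>

// Scalar model of vpaddq_s16: out[0..3] are adjacent pairs of a, out[4..7]
// adjacent pairs of b (wrapping add, as in the intrinsic).
static void vpaddq_s16_model(const int16_t a[8], const int16_t b[8],
                             int16_t out[8]) {
  for (int i = 0; i < 4; i++) out[i] = (int16_t)(a[2 * i] + a[2 * i + 1]);
  for (int i = 0; i < 4; i++) out[i + 4] = (int16_t)(b[2 * i] + b[2 * i + 1]);
}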
diff --git a/av1/common/arm/resize_neon.c b/av1/common/arm/resize_neon.c
index df4f7d6..5f6d214 100644
--- a/av1/common/arm/resize_neon.c
+++ b/av1/common/arm/resize_neon.c
@@ -783,7 +783,7 @@
const int buffer_height = (4 * dst_y_h + SUBPEL_TAPS - 2 + 7) & ~7;
uint8_t *const temp_buffer =
(uint8_t *)malloc(buffer_stride * buffer_height);
- if (temp_buffer) {
+ if (!temp_buffer) {
malloc_failed = 1;
break;
}
diff --git a/av1/common/arm/selfguided_neon.c b/av1/common/arm/selfguided_neon.c
index f5eb36c..d14088e 100644
--- a/av1/common/arm/selfguided_neon.c
+++ b/av1/common/arm/selfguided_neon.c
@@ -1503,7 +1503,10 @@
{
int16_t *src_ptr;
uint8_t *dst_ptr;
+#if CONFIG_AV1_HIGHBITDEPTH
+ uint16_t *dst16 = CONVERT_TO_SHORTPTR(dst8);
uint16_t *dst16_ptr;
+#endif
int16x4_t d0, d4;
int16x8_t r0, s0;
uint16x8_t r4;
@@ -1515,14 +1518,14 @@
const int32x4_t xq1_vec = vdupq_n_s32(xq[1]);
const int16x8_t zero = vdupq_n_s16(0);
const uint16x8_t max = vdupq_n_u16((1 << bit_depth) - 1);
- uint16_t *dst16 = CONVERT_TO_SHORTPTR(dst8);
- dst_ptr = dst8;
src_ptr = (int16_t *)dgd16;
do {
w = width;
count = 0;
dst_ptr = dst8 + rc * dst_stride;
+#if CONFIG_AV1_HIGHBITDEPTH
dst16_ptr = dst16 + rc * dst_stride;
+#endif
do {
s0 = vld1q_s16(src_ptr + count);
@@ -1565,19 +1568,20 @@
if (highbd) {
r4 = vminq_u16(r4, max);
vst1q_u16(dst16_ptr, r4);
+ dst16_ptr += 8;
} else {
t0 = vqmovn_u16(r4);
vst1_u8(dst_ptr, t0);
+ dst_ptr += 8;
}
#else
(void)max;
t0 = vqmovn_u16(r4);
vst1_u8(dst_ptr, t0);
+ dst_ptr += 8;
#endif
w -= 8;
count += 8;
- dst_ptr += 8;
- dst16_ptr += 8;
} while (w > 0);
src_ptr += dgd16_stride;
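Context for the new guard: in high-bitdepth builds dst8 is a disguised 16-bit pointer, recovered via CONVERT_TO_SHORTPTR, so the uint16_t locals are only meaningful when CONFIG_AV1_HIGHBITDEPTH is set; the per-branch pointer advances keep each pointer stepped only along its own store path. The assumed aom_dsp convention, shown for orientation only:

// Presumed aom_dsp pointer-punning macros; not part of this change.
// #define CONVERT_TO_SHORTPTR(x) ((uint16_t *)(((uintptr_t)(x)) << 1))
// #define CONVERT_TO_BYTEPTR(x) ((uint8_t *)(((uintptr_t)(x)) >> 1))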
diff --git a/av1/common/arm/warp_plane_neon.c b/av1/common/arm/warp_plane_neon.c
index 03b6db8..b4d3148 100644
--- a/av1/common/arm/warp_plane_neon.c
+++ b/av1/common/arm/warp_plane_neon.c
@@ -222,8 +222,7 @@
int16x8_t *tmp_dst, int sx, int alpha,
int k, const int offset_bits_horiz,
const int reduce_bits_horiz) {
- const uint8x16_t mask = { 255, 0, 255, 0, 255, 0, 255, 0,
- 255, 0, 255, 0, 255, 0, 255, 0 };
+ const uint8x16_t mask = vreinterpretq_u8_u16(vdupq_n_u16(0x00ff));
const int32x4_t add_const = vdupq_n_s32((int32_t)(1 << offset_bits_horiz));
const int16x8_t shift = vdupq_n_s16(-(int16_t)reduce_bits_horiz);
@@ -488,9 +487,9 @@
int32x4_t res_lo, res_hi;
int16x8_t result_final;
uint8x16_t src_1, src_2, src_3, src_4;
- uint8x16_t indx_vec = {
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
- };
+ static const uint8_t k0To15[16] = { 0, 1, 2, 3, 4, 5, 6, 7,
+ 8, 9, 10, 11, 12, 13, 14, 15 };
+ uint8x16_t indx_vec = vld1q_u8(k0To15);
uint8x16_t cmp_vec;
const int reduce_bits_horiz = conv_params->round_0;
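Both warp_plane hunks replace brace-initialized NEON vector literals with intrinsic-built values (vdupq_n_u16, vld1q_u8 of a static table); brace initialization of NEON vector types is a GCC/Clang extension, so this is presumably a portability cleanup. A little-endian equivalence sketch for the first constant:

#include <arm_neon.h>
#include <stdint.h>

// Each u16 lane 0x00ff maps to the byte pair { 0xff, 0x00 }, i.e. the old
// literal { 255, 0, 255, 0, ... } on little-endian targets.
static void check_mask(uint8_t lanes[16]) {
  vst1q_u8(lanes, vreinterpretq_u8_u16(vdupq_n_u16(0x00ff)));
}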
diff --git a/av1/common/arm/wiener_convolve_neon.c b/av1/common/arm/wiener_convolve_neon.c
index 0a12c88..d7f511d 100644
--- a/av1/common/arm/wiener_convolve_neon.c
+++ b/av1/common/arm/wiener_convolve_neon.c
@@ -153,7 +153,7 @@
height = intermediate_height;
  // For aarch64.
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
int processed_height = 0;
uint16_t *d_tmp;
int width, remaining_height;
@@ -248,21 +248,11 @@
int16x8_t s0, s1, s2, s3, s4, s5, s6, s7;
uint8x8_t t0;
s = src_tmp_ptr;
- s0 = vld1q_s16(s);
- s += src_stride;
- s1 = vld1q_s16(s);
- s += src_stride;
- s2 = vld1q_s16(s);
- s += src_stride;
- s3 = vld1q_s16(s);
- s += src_stride;
- s4 = vld1q_s16(s);
- s += src_stride;
- s5 = vld1q_s16(s);
- s += src_stride;
- s6 = vld1q_s16(s);
- s += src_stride;
d = dst_tmp_ptr;
+
+ load_s16_8x7(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6);
+ s += 7 * src_stride;
+
height = h;
do {
@@ -273,14 +263,7 @@
__builtin_prefetch(dst_tmp_ptr + 2 * dst_stride);
__builtin_prefetch(dst_tmp_ptr + 3 * dst_stride);
- s7 = vld1q_s16(s);
- s += src_stride;
- s8 = vld1q_s16(s);
- s += src_stride;
- s9 = vld1q_s16(s);
- s += src_stride;
- s10 = vld1q_s16(s);
- s += src_stride;
+ load_s16_8x4(s, src_stride, &s7, &s8, &s9, &s10);
t0 = wiener_convolve8_vert_4x8(s0, s1, s2, s3, s4, s5, s6, filter_y_tmp,
bd, conv_params->round_1);
@@ -291,14 +274,7 @@
t3 = wiener_convolve8_vert_4x8(s3, s4, s5, s6, s7, s8, s9, filter_y_tmp,
bd, conv_params->round_1);
- vst1_u8(d, t0);
- d += dst_stride;
- vst1_u8(d, t1);
- d += dst_stride;
- vst1_u8(d, t2);
- d += dst_stride;
- vst1_u8(d, t3);
- d += dst_stride;
+ store_u8_8x4(d, dst_stride, t0, t1, t2, t3);
s0 = s4;
s1 = s5;
@@ -307,6 +283,8 @@
s4 = s8;
s5 = s9;
s6 = s10;
+ s += 4 * src_stride;
+ d += 4 * dst_stride;
height -= 4;
} while (height > 3);
@@ -336,21 +314,11 @@
uint8x8_t t0;
int16x8_t s0, s1, s2, s3, s4, s5, s6, s7;
s = src_tmp_ptr;
- s0 = vld1q_s16(s);
- s += src_stride;
- s1 = vld1q_s16(s);
- s += src_stride;
- s2 = vld1q_s16(s);
- s += src_stride;
- s3 = vld1q_s16(s);
- s += src_stride;
- s4 = vld1q_s16(s);
- s += src_stride;
- s5 = vld1q_s16(s);
- s += src_stride;
- s6 = vld1q_s16(s);
- s += src_stride;
d = dst_tmp_ptr;
+
+ load_s16_8x7(s, src_stride, &s0, &s1, &s2, &s3, &s4, &s5, &s6);
+ s += 7 * src_stride;
+
height = h;
PROCESS_ROW_FOR_VERTICAL_FILTER
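load_s16_8x7()/load_s16_8x4() are presumed mem_neon.h helpers folding the unrolled vld1q_s16 ladders removed here; an assumed-shape sketch:

#include <arm_neon.h>
#include <stddef.h>

// Assumed shape of load_s16_8x4(); load_s16_8x7() is presumed to extend the
// same pattern to seven rows.
static inline void load_s16_8x4_sketch(const int16_t *s, ptrdiff_t p,
                                       int16x8_t *s0, int16x8_t *s1,
                                       int16x8_t *s2, int16x8_t *s3) {
  *s0 = vld1q_s16(s + 0 * p);
  *s1 = vld1q_s16(s + 1 * p);
  *s2 = vld1q_s16(s + 2 * p);
  *s3 = vld1q_s16(s + 3 * p);
}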
diff --git a/av1/common/av1_common_int.h b/av1/common/av1_common_int.h
index 304a551..4c0cb99 100644
--- a/av1/common/av1_common_int.h
+++ b/av1/common/av1_common_int.h
@@ -16,6 +16,7 @@
#include "config/av1_rtcd.h"
#include "aom/internal/aom_codec_internal.h"
+#include "aom_dsp/flow_estimation/corner_detect.h"
#include "aom_util/aom_thread.h"
#include "av1/common/alloccommon.h"
#include "av1/common/av1_loopfilter.h"
@@ -184,7 +185,8 @@
aom_get_frame_buffer_cb_fn_t get_fb_cb;
aom_release_frame_buffer_cb_fn_t release_fb_cb;
- RefCntBuffer frame_bufs[FRAME_BUFFERS];
+ RefCntBuffer *frame_bufs;
+ uint8_t num_frame_bufs;
// Frame buffers allocated internally by the codec.
InternalFrameBufferList int_frame_buffers;
@@ -1092,10 +1094,11 @@
int i;
lock_buffer_pool(cm->buffer_pool);
- for (i = 0; i < FRAME_BUFFERS; ++i)
+ const int num_frame_bufs = cm->buffer_pool->num_frame_bufs;
+ for (i = 0; i < num_frame_bufs; ++i)
if (frame_bufs[i].ref_count == 0) break;
- if (i != FRAME_BUFFERS) {
+ if (i != num_frame_bufs) {
if (frame_bufs[i].buf.use_external_reference_buffers) {
// If this frame buffer's y_buffer, u_buffer, and v_buffer point to the
// external reference buffers. Restore the buffer pointers to point to the
@@ -1132,7 +1135,10 @@
if (new_fb_idx == INVALID_IDX) return NULL;
cm->cur_frame = &cm->buffer_pool->frame_bufs[new_fb_idx];
- cm->cur_frame->buf.buf_8bit_valid = 0;
+#if CONFIG_AV1_ENCODER && !CONFIG_REALTIME_ONLY
+ aom_invalidate_pyramid(cm->cur_frame->buf.y_pyramid);
+ av1_invalidate_corner_list(cm->cur_frame->buf.corners);
+#endif // CONFIG_AV1_ENCODER && !CONFIG_REALTIME_ONLY
av1_zero(cm->cur_frame->interp_filter_selected);
return cm->cur_frame;
}
@@ -1237,10 +1243,8 @@
const int mem_size =
((mi_params->mi_rows + MAX_MIB_SIZE) >> 1) * (mi_params->mi_stride >> 1);
- int realloc = cm->tpl_mvs == NULL;
- if (cm->tpl_mvs) realloc |= cm->tpl_mvs_mem_size < mem_size;
- if (realloc) {
+ if (cm->tpl_mvs == NULL || cm->tpl_mvs_mem_size < mem_size) {
aom_free(cm->tpl_mvs);
CHECK_MEM_ERROR(cm, cm->tpl_mvs,
(TPL_MV_REF *)aom_calloc(mem_size, sizeof(*cm->tpl_mvs)));
@@ -1613,18 +1617,6 @@
sizeof(xd->left_txfm_context_buffer));
}
-// Disable array-bounds checks as the TX_SIZE enum contains values larger than
-// TX_SIZES_ALL (TX_INVALID) which make extending the array as a workaround
-// infeasible. The assert is enough for static analysis and this or other tools
-// asan, valgrind would catch oob access at runtime.
-#if defined(__GNUC__) && __GNUC__ >= 4
-#pragma GCC diagnostic ignored "-Warray-bounds"
-#endif
-
-#if defined(__GNUC__) && __GNUC__ >= 4
-#pragma GCC diagnostic warning "-Warray-bounds"
-#endif
-
static INLINE void set_txfm_ctx(TXFM_CONTEXT *txfm_ctx, uint8_t txs, int len) {
int i;
for (i = 0; i < len; ++i) txfm_ctx[i] = txs;
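The BufferPool change turns frame_bufs into a heap allocation with a runtime bound, so consumers iterate num_frame_bufs instead of the FRAME_BUFFERS compile-time constant. The allocation site is outside these hunks; a hypothetical sketch of the pattern the new fields imply:

// Hypothetical init-time allocation matching the new fields; the real call
// site lives elsewhere in this change.
pool->num_frame_bufs = num_frame_bufs;
pool->frame_bufs = (RefCntBuffer *)aom_calloc(pool->num_frame_bufs,
                                              sizeof(*pool->frame_bufs));
if (!pool->frame_bufs) return AOM_CODEC_MEM_ERROR;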
diff --git a/av1/common/av1_inv_txfm2d.c b/av1/common/av1_inv_txfm2d.c
index 154c9d2..ee67dff 100644
--- a/av1/common/av1_inv_txfm2d.c
+++ b/av1/common/av1_inv_txfm2d.c
@@ -29,10 +29,10 @@
uint16_t *dest = CONVERT_TO_SHORTPTR(dest8);
for (i = 0; i < 4; i++) {
- a1 = ip[0] >> UNIT_QUANT_SHIFT;
- c1 = ip[1] >> UNIT_QUANT_SHIFT;
- d1 = ip[2] >> UNIT_QUANT_SHIFT;
- b1 = ip[3] >> UNIT_QUANT_SHIFT;
+ a1 = ip[4 * 0] >> UNIT_QUANT_SHIFT;
+ c1 = ip[4 * 1] >> UNIT_QUANT_SHIFT;
+ d1 = ip[4 * 2] >> UNIT_QUANT_SHIFT;
+ b1 = ip[4 * 3] >> UNIT_QUANT_SHIFT;
a1 += c1;
d1 -= b1;
e1 = (a1 - d1) >> 1;
@@ -41,20 +41,20 @@
a1 -= b1;
d1 += c1;
- op[0] = a1;
- op[1] = b1;
- op[2] = c1;
- op[3] = d1;
- ip += 4;
- op += 4;
+ op[4 * 0] = a1;
+ op[4 * 1] = b1;
+ op[4 * 2] = c1;
+ op[4 * 3] = d1;
+ ip++;
+ op++;
}
ip = output;
for (i = 0; i < 4; i++) {
- a1 = ip[4 * 0];
- c1 = ip[4 * 1];
- d1 = ip[4 * 2];
- b1 = ip[4 * 3];
+ a1 = ip[0];
+ c1 = ip[1];
+ d1 = ip[2];
+ b1 = ip[3];
a1 += c1;
d1 -= b1;
e1 = (a1 - d1) >> 1;
@@ -73,7 +73,7 @@
dest[stride * 2] = highbd_clip_pixel_add(dest[stride * 2], c1, bd);
dest[stride * 3] = highbd_clip_pixel_add(dest[stride * 3], d1, bd);
- ip++;
+ ip += 4;
dest++;
}
}
@@ -88,7 +88,7 @@
uint16_t *dest = CONVERT_TO_SHORTPTR(dest8);
(void)bd;
- a1 = ip[0] >> UNIT_QUANT_SHIFT;
+ a1 = ip[0 * 4] >> UNIT_QUANT_SHIFT;
e1 = a1 >> 1;
a1 -= e1;
op[0] = a1;
@@ -271,19 +271,19 @@
for (r = 0; r < txfm_size_row; ++r) {
if (abs(rect_type) == 1) {
for (c = 0; c < txfm_size_col; ++c) {
- temp_in[c] = round_shift((int64_t)input[c] * NewInvSqrt2, NewSqrt2Bits);
+ temp_in[c] = round_shift(
+ (int64_t)input[c * txfm_size_row + r] * NewInvSqrt2, NewSqrt2Bits);
}
clamp_buf(temp_in, txfm_size_col, bd + 8);
txfm_func_row(temp_in, buf_ptr, cos_bit_row, stage_range_row);
} else {
for (c = 0; c < txfm_size_col; ++c) {
- temp_in[c] = input[c];
+ temp_in[c] = input[c * txfm_size_row + r];
}
clamp_buf(temp_in, txfm_size_col, bd + 8);
txfm_func_row(temp_in, buf_ptr, cos_bit_row, stage_range_row);
}
av1_round_shift_array(buf_ptr, txfm_size_col, -shift[0]);
- input += txfm_size_col;
buf_ptr += txfm_size_col;
}
@@ -393,9 +393,9 @@
// - Copying over these values in top-left 32x32 locations.
// - Setting the rest of the locations to 0.
int32_t mod_input[64 * 64];
- for (int row = 0; row < 32; ++row) {
- memcpy(mod_input + row * 64, input + row * 32, 32 * sizeof(*mod_input));
- memset(mod_input + row * 64 + 32, 0, 32 * sizeof(*mod_input));
+ for (int col = 0; col < 32; ++col) {
+ memcpy(mod_input + col * 64, input + col * 32, 32 * sizeof(*mod_input));
+ memset(mod_input + col * 64 + 32, 0, 32 * sizeof(*mod_input));
}
memset(mod_input + 32 * 64, 0, 32 * 64 * sizeof(*mod_input));
DECLARE_ALIGNED(32, int, txfm_buf[64 * 64 + 64 + 64]);
@@ -408,11 +408,9 @@
// Remap 32x32 input into a modified 64x32 by:
// - Copying over these values in top-left 32x32 locations.
// - Setting the rest of the locations to 0.
- int32_t mod_input[64 * 32];
- for (int row = 0; row < 32; ++row) {
- memcpy(mod_input + row * 64, input + row * 32, 32 * sizeof(*mod_input));
- memset(mod_input + row * 64 + 32, 0, 32 * sizeof(*mod_input));
- }
+ int32_t mod_input[32 * 64];
+ memcpy(mod_input, input, 32 * 32 * sizeof(*mod_input));
+ memset(mod_input + 32 * 32, 0, 32 * 32 * sizeof(*mod_input));
DECLARE_ALIGNED(32, int, txfm_buf[64 * 32 + 64 + 64]);
inv_txfm2d_add_facade(mod_input, output, stride, txfm_buf, tx_type, TX_64X32,
bd);
@@ -423,9 +421,11 @@
// Remap 32x32 input into a modified 32x64 input by:
// - Copying over these values in top-left 32x32 locations.
// - Setting the rest of the locations to 0.
- int32_t mod_input[32 * 64];
- memcpy(mod_input, input, 32 * 32 * sizeof(*mod_input));
- memset(mod_input + 32 * 32, 0, 32 * 32 * sizeof(*mod_input));
+ int32_t mod_input[64 * 32];
+ for (int col = 0; col < 32; ++col) {
+ memcpy(mod_input + col * 64, input + col * 32, 32 * sizeof(*mod_input));
+ memset(mod_input + col * 64 + 32, 0, 32 * sizeof(*mod_input));
+ }
DECLARE_ALIGNED(32, int, txfm_buf[64 * 32 + 64 + 64]);
inv_txfm2d_add_facade(mod_input, output, stride, txfm_buf, tx_type, TX_32X64,
bd);
@@ -436,9 +436,11 @@
// Remap 16x32 input into a modified 16x64 input by:
// - Copying over these values in top-left 16x32 locations.
// - Setting the rest of the locations to 0.
- int32_t mod_input[16 * 64];
- memcpy(mod_input, input, 16 * 32 * sizeof(*mod_input));
- memset(mod_input + 16 * 32, 0, 16 * 32 * sizeof(*mod_input));
+ int32_t mod_input[64 * 16];
+ for (int col = 0; col < 16; ++col) {
+ memcpy(mod_input + col * 64, input + col * 32, 32 * sizeof(*mod_input));
+ memset(mod_input + col * 64 + 32, 0, 32 * sizeof(*mod_input));
+ }
DECLARE_ALIGNED(32, int, txfm_buf[16 * 64 + 64 + 64]);
inv_txfm2d_add_facade(mod_input, output, stride, txfm_buf, tx_type, TX_16X64,
bd);
@@ -449,11 +451,9 @@
// Remap 32x16 input into a modified 64x16 by:
// - Copying over these values in top-left 32x16 locations.
// - Setting the rest of the locations to 0.
- int32_t mod_input[64 * 16];
- for (int row = 0; row < 16; ++row) {
- memcpy(mod_input + row * 64, input + row * 32, 32 * sizeof(*mod_input));
- memset(mod_input + row * 64 + 32, 0, 32 * sizeof(*mod_input));
- }
+ int32_t mod_input[16 * 64];
+ memcpy(mod_input, input, 16 * 32 * sizeof(*mod_input));
+ memset(mod_input + 16 * 32, 0, 16 * 32 * sizeof(*mod_input));
DECLARE_ALIGNED(32, int, txfm_buf[16 * 64 + 64 + 64]);
inv_txfm2d_add_facade(mod_input, output, stride, txfm_buf, tx_type, TX_64X16,
bd);
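These av1_inv_txfm2d.c hunks track the facade's switch to column-major input (temp_in[c] = input[c * txfm_size_row + r] above): the 64-point remap buffers are now padded per column when the row count grows (TX_32X64, TX_16X64) and with one contiguous block when only columns are added (TX_64X32, TX_64X16). Worked index check for TX_32X64 (64 rows, 32 columns):

// Source, 32 rows per column: element (r, c) sits at input[c * 32 + r].
// Destination, 64 rows per column: the same element must land at
// mod_input[c * 64 + r], with mod_input[c * 64 + 32 .. c * 64 + 63] zeroed;
// hence the strided memcpy/memset loop.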
diff --git a/av1/common/av1_rtcd_defs.pl b/av1/common/av1_rtcd_defs.pl
index ba1dcbb..17dcc49 100644
--- a/av1/common/av1_rtcd_defs.pl
+++ b/av1/common/av1_rtcd_defs.pl
@@ -17,12 +17,12 @@
#include "aom/aom_integer.h"
#include "aom_dsp/odintrin.h"
#include "aom_dsp/txfm_common.h"
-#include "av1/common/common.h"
-#include "av1/common/enums.h"
-#include "av1/common/quant_common.h"
-#include "av1/common/filter.h"
-#include "av1/common/convolve.h"
#include "av1/common/av1_txfm.h"
+#include "av1/common/common.h"
+#include "av1/common/convolve.h"
+#include "av1/common/enums.h"
+#include "av1/common/filter.h"
+#include "av1/common/quant_common.h"
#include "av1/common/restoration.h"
struct macroblockd;
@@ -92,7 +92,7 @@
if(aom_config("CONFIG_AV1_HIGHBITDEPTH") eq "yes") {
add_proto qw/void av1_highbd_convolve_horiz_rs/, "const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const int16_t *x_filters, int x0_qn, int x_step_qn, int bd";
- specialize qw/av1_highbd_convolve_horiz_rs sse4_1/;
+ specialize qw/av1_highbd_convolve_horiz_rs sse4_1 neon/;
add_proto qw/void av1_highbd_wiener_convolve_add_src/, "const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst, ptrdiff_t dst_stride, const int16_t *filter_x, int x_step_q4, const int16_t *filter_y, int y_step_q4, int w, int h, const ConvolveParams *conv_params, int bd";
specialize qw/av1_highbd_wiener_convolve_add_src ssse3 avx2/;
@@ -282,7 +282,7 @@
add_proto qw/void aom_upsampled_pred/, "MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
const MV *const mv, uint8_t *comp_pred, int width, int height, int subpel_x_q3,
int subpel_y_q3, const uint8_t *ref, int ref_stride, int subpel_search";
- specialize qw/aom_upsampled_pred sse2/;
+ specialize qw/aom_upsampled_pred neon sse2/;
#
#
#
@@ -290,7 +290,7 @@
const MV *const mv, uint8_t *comp_pred, const uint8_t *pred, int width,
int height, int subpel_x_q3, int subpel_y_q3, const uint8_t *ref,
int ref_stride, int subpel_search";
- specialize qw/aom_comp_avg_upsampled_pred sse2/;
+ specialize qw/aom_comp_avg_upsampled_pred sse2 neon/;
add_proto qw/void aom_dist_wtd_comp_avg_upsampled_pred/, "MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
const MV *const mv, uint8_t *comp_pred, const uint8_t *pred, int width,
@@ -298,13 +298,6 @@
int ref_stride, const DIST_WTD_COMP_PARAMS *jcp_param, int subpel_search";
specialize qw/aom_dist_wtd_comp_avg_upsampled_pred ssse3/;
- add_proto qw/void aom_comp_mask_upsampled_pred/, "MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
- const MV *const mv, uint8_t *comp_pred, const uint8_t *pred, int width,
- int height, int subpel_x_q3, int subpel_y_q3, const uint8_t *ref,
- int ref_stride, const uint8_t *mask, int mask_stride, int invert_mask,
- int subpel_search";
- specialize qw/aom_comp_mask_upsampled_pred sse2/;
-
if (aom_config("CONFIG_AV1_HIGHBITDEPTH") eq "yes") {
add_proto qw/void aom_highbd_upsampled_pred/, "MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
const MV *const mv, uint8_t *comp_pred8, int width, int height, int subpel_x_q3,
@@ -402,21 +395,26 @@
# Motion search
#
if (aom_config("CONFIG_REALTIME_ONLY") ne "yes") {
- add_proto qw/void av1_apply_temporal_filter/, "const struct yv12_buffer_config *ref_frame, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, const uint8_t *pred, uint32_t *accum, uint16_t *count";
+ add_proto qw/void av1_apply_temporal_filter/, "const struct yv12_buffer_config *frame_to_filter, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum, uint16_t *count";
specialize qw/av1_apply_temporal_filter sse2 avx2 neon/;
+
+ add_proto qw/double av1_estimate_noise_from_single_plane/, "const uint8_t *src, int height, int width, int stride, int edge_thresh";
+ specialize qw/av1_estimate_noise_from_single_plane avx2/;
if (aom_config("CONFIG_AV1_HIGHBITDEPTH") eq "yes") {
- add_proto qw/void av1_highbd_apply_temporal_filter/, "const struct yv12_buffer_config *ref_frame, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, const uint8_t *pred, uint32_t *accum, uint16_t *count";
+ add_proto qw/void av1_highbd_apply_temporal_filter/, "const struct yv12_buffer_config *frame_to_filter, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum, uint16_t *count";
specialize qw/av1_highbd_apply_temporal_filter sse2 avx2/;
+
+ add_proto qw/double av1_highbd_estimate_noise_from_single_plane/, "const uint16_t *src, int height, int width, int stride, int bit_depth, int edge_thresh";
}
}
add_proto qw/void av1_quantize_b/, "const tran_low_t *coeff_ptr, intptr_t n_coeffs, const int16_t *zbin_ptr, const int16_t *round_ptr, const int16_t *quant_ptr, const int16_t *quant_shift_ptr, tran_low_t *qcoeff_ptr, tran_low_t *dqcoeff_ptr, const int16_t *dequant_ptr, uint16_t *eob_ptr, const int16_t *scan, const int16_t *iscan, const qm_val_t * qm_ptr, const qm_val_t * iqm_ptr, int log_scale";
add_proto qw/void av1_calc_indices_dim1/, "const int16_t *data, const int16_t *centroids, uint8_t *indices, int64_t *total_dist, int n, int k";
- specialize qw/av1_calc_indices_dim1 sse2 avx2/;
+ specialize qw/av1_calc_indices_dim1 sse2 avx2 neon/;
add_proto qw/void av1_calc_indices_dim2/, "const int16_t *data, const int16_t *centroids, uint8_t *indices, int64_t *total_dist, int n, int k";
- specialize qw/av1_calc_indices_dim2 sse2 avx2/;
+ specialize qw/av1_calc_indices_dim2 sse2 avx2 neon/;
# ENCODEMB INVOKE
if (aom_config("CONFIG_AV1_HIGHBITDEPTH") eq "yes") {
@@ -429,9 +427,6 @@
specialize qw/av1_highbd_quantize_fp sse4_1 avx2 neon/;
}
- add_proto qw/void av1_highbd_fwht4x4/, "const int16_t *input, tran_low_t *output, int stride";
- specialize qw/av1_highbd_fwht4x4 sse4_1 neon/;
-
# End av1_high encoder functions
# txb
@@ -452,7 +447,7 @@
specialize qw/av1_get_crc32c_value sse4_2 arm_crc32/;
if (aom_config("CONFIG_REALTIME_ONLY") ne "yes") {
- add_proto qw/void av1_compute_stats/, "int wiener_win, const uint8_t *dgd8, const uint8_t *src8, int h_start, int h_end, int v_start, int v_end, int dgd_stride, int src_stride, int64_t *M, int64_t *H, int use_downsampled_wiener_stats";
+ add_proto qw/void av1_compute_stats/, "int wiener_win, const uint8_t *dgd8, const uint8_t *src8, int16_t *dgd_avg, int16_t *src_avg, int h_start, int h_end, int v_start, int v_end, int dgd_stride, int src_stride, int64_t *M, int64_t *H, int use_downsampled_wiener_stats";
specialize qw/av1_compute_stats sse4_1 avx2/;
add_proto qw/void av1_calc_proj_params/, " const uint8_t *src8, int width, int height, int src_stride, const uint8_t *dat8, int dat_stride, int32_t *flt0, int flt0_stride, int32_t *flt1, int flt1_stride, int64_t H[2][2], int64_t C[2], const sgr_params_type *params";
specialize qw/av1_calc_proj_params sse4_1 avx2/;
@@ -594,14 +589,14 @@
specialize qw/av1_dist_wtd_convolve_x sse2 avx2 neon/;
specialize qw/av1_dist_wtd_convolve_y sse2 avx2 neon/;
if(aom_config("CONFIG_AV1_HIGHBITDEPTH") eq "yes") {
- specialize qw/av1_highbd_dist_wtd_convolve_2d sse4_1 avx2/;
- specialize qw/av1_highbd_dist_wtd_convolve_x sse4_1 avx2/;
- specialize qw/av1_highbd_dist_wtd_convolve_y sse4_1 avx2/;
- specialize qw/av1_highbd_dist_wtd_convolve_2d_copy sse4_1 avx2/;
- specialize qw/av1_highbd_convolve_2d_sr ssse3 avx2/;
- specialize qw/av1_highbd_convolve_x_sr ssse3 avx2/;
- specialize qw/av1_highbd_convolve_y_sr ssse3 avx2/;
- specialize qw/av1_highbd_convolve_2d_scale sse4_1/;
+ specialize qw/av1_highbd_dist_wtd_convolve_2d sse4_1 avx2 neon/;
+ specialize qw/av1_highbd_dist_wtd_convolve_x sse4_1 avx2 neon/;
+ specialize qw/av1_highbd_dist_wtd_convolve_y sse4_1 avx2 neon/;
+ specialize qw/av1_highbd_dist_wtd_convolve_2d_copy sse4_1 avx2 neon/;
+ specialize qw/av1_highbd_convolve_2d_sr ssse3 avx2 neon/;
+ specialize qw/av1_highbd_convolve_x_sr ssse3 avx2 neon/;
+ specialize qw/av1_highbd_convolve_y_sr ssse3 avx2 neon/;
+ specialize qw/av1_highbd_convolve_2d_scale sse4_1 neon/;
}
# INTRA_EDGE functions
diff --git a/av1/common/blockd.h b/av1/common/blockd.h
index 5f90e57..e7f1b6b 100644
--- a/av1/common/blockd.h
+++ b/av1/common/blockd.h
@@ -518,11 +518,6 @@
/*!\cond */
-#if CONFIG_DEBUG
-#define CFL_SUB8X8_VAL_MI_SIZE (4)
-#define CFL_SUB8X8_VAL_MI_SQUARE \
- (CFL_SUB8X8_VAL_MI_SIZE * CFL_SUB8X8_VAL_MI_SIZE)
-#endif // CONFIG_DEBUG
#define CFL_MAX_BLOCK_SIZE (BLOCK_32X32)
#define CFL_BUF_LINE (32)
#define CFL_BUF_LINE_I128 (CFL_BUF_LINE >> 3)
@@ -537,9 +532,10 @@
// Cache the DC_PRED when performing RDO, so it does not have to be recomputed
// for every scaling parameter
- int dc_pred_is_cached[CFL_PRED_PLANES];
- // The DC_PRED cache is disable when decoding
- int use_dc_pred_cache;
+ bool dc_pred_is_cached[CFL_PRED_PLANES];
+ // Whether the DC_PRED cache is enabled. The DC_PRED cache is disabled when
+ // decoding.
+ bool use_dc_pred_cache;
// Only cache the first row of the DC_PRED
int16_t dc_pred_cache[CFL_PRED_PLANES][CFL_BUF_LINE];
diff --git a/av1/common/cdef_block_simd.h b/av1/common/cdef_block_simd.h
index df67871..5c62201 100644
--- a/av1/common/cdef_block_simd.h
+++ b/av1/common/cdef_block_simd.h
@@ -270,6 +270,12 @@
return max;
}
+// MSVC takes far too much time optimizing these.
+// https://bugs.chromium.org/p/aomedia/issues/detail?id=3395
+#if defined(_MSC_VER) && !defined(__clang__)
+#pragma optimize("", off)
+#endif
+
CDEF_INLINE void filter_block_4x4(const int is_lowbd, void *dest, int dstride,
const uint16_t *in, int pri_strength,
int sec_strength, int dir, int pri_damping,
@@ -617,6 +623,10 @@
}
}
+#if defined(_MSC_VER) && !defined(__clang__)
+#pragma optimize("", on)
+#endif
+
SIMD_INLINE void copy_block_4xh(const int is_lowbd, void *dest, int dstride,
const uint16_t *in, int height) {
uint8_t *dst8 = (uint8_t *)dest;
@@ -674,14 +684,13 @@
int pri_damping, int sec_damping,
int coeff_shift, int block_width,
int block_height) {
- uint8_t *dst8 = (uint8_t *)dest;
if (block_width == 8) {
- filter_block_8x8(/*is_lowbd=*/1, dst8, dstride, in, pri_strength,
+ filter_block_8x8(/*is_lowbd=*/1, dest, dstride, in, pri_strength,
sec_strength, dir, pri_damping, sec_damping, coeff_shift,
block_height, /*enable_primary=*/1,
/*enable_secondary=*/1);
} else {
- filter_block_4x4(/*is_lowbd=*/1, dst8, dstride, in, pri_strength,
+ filter_block_4x4(/*is_lowbd=*/1, dest, dstride, in, pri_strength,
sec_strength, dir, pri_damping, sec_damping, coeff_shift,
block_height, /*enable_primary=*/1,
/*enable_secondary=*/1);
@@ -693,14 +702,13 @@
int pri_damping, int sec_damping,
int coeff_shift, int block_width,
int block_height) {
- uint8_t *dst8 = (uint8_t *)dest;
if (block_width == 8) {
- filter_block_8x8(/*is_lowbd=*/1, dst8, dstride, in, pri_strength,
+ filter_block_8x8(/*is_lowbd=*/1, dest, dstride, in, pri_strength,
sec_strength, dir, pri_damping, sec_damping, coeff_shift,
block_height, /*enable_primary=*/1,
/*enable_secondary=*/0);
} else {
- filter_block_4x4(/*is_lowbd=*/1, dst8, dstride, in, pri_strength,
+ filter_block_4x4(/*is_lowbd=*/1, dest, dstride, in, pri_strength,
sec_strength, dir, pri_damping, sec_damping, coeff_shift,
block_height, /*enable_primary=*/1,
/*enable_secondary=*/0);
@@ -711,14 +719,13 @@
int pri_damping, int sec_damping,
int coeff_shift, int block_width,
int block_height) {
- uint8_t *dst8 = (uint8_t *)dest;
if (block_width == 8) {
- filter_block_8x8(/*is_lowbd=*/1, dst8, dstride, in, pri_strength,
+ filter_block_8x8(/*is_lowbd=*/1, dest, dstride, in, pri_strength,
sec_strength, dir, pri_damping, sec_damping, coeff_shift,
block_height, /*enable_primary=*/0,
/*enable_secondary=*/1);
} else {
- filter_block_4x4(/*is_lowbd=*/1, dst8, dstride, in, pri_strength,
+ filter_block_4x4(/*is_lowbd=*/1, dest, dstride, in, pri_strength,
sec_strength, dir, pri_damping, sec_damping, coeff_shift,
block_height, /*enable_primary=*/0,
/*enable_secondary=*/1);
@@ -730,7 +737,6 @@
int pri_damping, int sec_damping,
int coeff_shift, int block_width,
int block_height) {
- uint8_t *dst8 = (uint8_t *)dest;
(void)pri_strength;
(void)sec_strength;
(void)dir;
@@ -740,9 +746,9 @@
(void)block_width;
if (block_width == 8) {
- copy_block_8xh(/*is_lowbd=*/1, dst8, dstride, in, block_height);
+ copy_block_8xh(/*is_lowbd=*/1, dest, dstride, in, block_height);
} else {
- copy_block_4xh(/*is_lowbd=*/1, dst8, dstride, in, block_height);
+ copy_block_4xh(/*is_lowbd=*/1, dest, dstride, in, block_height);
}
}
@@ -751,14 +757,13 @@
int pri_damping, int sec_damping,
int coeff_shift, int block_width,
int block_height) {
- uint16_t *dst16 = (uint16_t *)dest;
if (block_width == 8) {
- filter_block_8x8(/*is_lowbd=*/0, dst16, dstride, in, pri_strength,
+ filter_block_8x8(/*is_lowbd=*/0, dest, dstride, in, pri_strength,
sec_strength, dir, pri_damping, sec_damping, coeff_shift,
block_height, /*enable_primary=*/1,
/*enable_secondary=*/1);
} else {
- filter_block_4x4(/*is_lowbd=*/0, dst16, dstride, in, pri_strength,
+ filter_block_4x4(/*is_lowbd=*/0, dest, dstride, in, pri_strength,
sec_strength, dir, pri_damping, sec_damping, coeff_shift,
block_height, /*enable_primary=*/1,
/*enable_secondary=*/1);
@@ -770,14 +775,13 @@
int pri_damping, int sec_damping,
int coeff_shift, int block_width,
int block_height) {
- uint16_t *dst16 = (uint16_t *)dest;
if (block_width == 8) {
- filter_block_8x8(/*is_lowbd=*/0, dst16, dstride, in, pri_strength,
+ filter_block_8x8(/*is_lowbd=*/0, dest, dstride, in, pri_strength,
sec_strength, dir, pri_damping, sec_damping, coeff_shift,
block_height, /*enable_primary=*/1,
/*enable_secondary=*/0);
} else {
- filter_block_4x4(/*is_lowbd=*/0, dst16, dstride, in, pri_strength,
+ filter_block_4x4(/*is_lowbd=*/0, dest, dstride, in, pri_strength,
sec_strength, dir, pri_damping, sec_damping, coeff_shift,
block_height, /*enable_primary=*/1,
/*enable_secondary=*/0);
@@ -788,14 +792,13 @@
int pri_damping, int sec_damping,
int coeff_shift, int block_width,
int block_height) {
- uint16_t *dst16 = (uint16_t *)dest;
if (block_width == 8) {
- filter_block_8x8(/*is_lowbd=*/0, dst16, dstride, in, pri_strength,
+ filter_block_8x8(/*is_lowbd=*/0, dest, dstride, in, pri_strength,
sec_strength, dir, pri_damping, sec_damping, coeff_shift,
block_height, /*enable_primary=*/0,
/*enable_secondary=*/1);
} else {
- filter_block_4x4(/*is_lowbd=*/0, dst16, dstride, in, pri_strength,
+ filter_block_4x4(/*is_lowbd=*/0, dest, dstride, in, pri_strength,
sec_strength, dir, pri_damping, sec_damping, coeff_shift,
block_height, /*enable_primary=*/0,
/*enable_secondary=*/1);
@@ -807,7 +810,6 @@
int pri_damping, int sec_damping,
int coeff_shift, int block_width,
int block_height) {
- uint16_t *dst16 = (uint16_t *)dest;
(void)pri_strength;
(void)sec_strength;
(void)dir;
@@ -816,9 +818,9 @@
(void)coeff_shift;
(void)block_width;
if (block_width == 8) {
- copy_block_8xh(/*is_lowbd=*/0, dst16, dstride, in, block_height);
+ copy_block_8xh(/*is_lowbd=*/0, dest, dstride, in, block_height);
} else {
- copy_block_4xh(/*is_lowbd=*/0, dst16, dstride, in, block_height);
+ copy_block_4xh(/*is_lowbd=*/0, dest, dstride, in, block_height);
}
}
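The pragma bracket added above works around MSVC spending excessive time optimizing the CDEF filter kernels (aomedia:3395); clang-cl defines both _MSC_VER and __clang__, so it keeps full optimization. A minimal sketch of the same bracketing pattern, with a hypothetical placeholder function standing in for the real kernels:

#if defined(_MSC_VER) && !defined(__clang__)
#pragma optimize("", off)  // cl.exe only: stop optimizing from this point on.
#endif

// Hypothetical stand-in for the heavy CDEF kernels bracketed above.
static int cdef_kernel_placeholder(int x) { return x * x; }

#if defined(_MSC_VER) && !defined(__clang__)
#pragma optimize("", on)  // Restore the optimization level from the /O flags.
#endif

Note that "on" does not force optimization on; it resets to whatever the command-line /O options requested, which is why the bracket is safe in both debug and release builds.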
diff --git a/av1/common/cfl.c b/av1/common/cfl.c
index 98199cb..6d4221e 100644
--- a/av1/common/cfl.c
+++ b/av1/common/cfl.c
@@ -27,9 +27,7 @@
cfl->store_y = 0;
// The DC_PRED cache is disabled by default and is only enabled in
// cfl_rd_pick_alpha
- cfl->use_dc_pred_cache = 0;
- cfl->dc_pred_is_cached[CFL_PRED_U] = 0;
- cfl->dc_pred_is_cached[CFL_PRED_V] = 0;
+ clear_cfl_dc_pred_cache_flags(cfl);
}
void cfl_store_dc_pred(MACROBLOCKD *const xd, const uint8_t *input,
diff --git a/av1/common/cfl.h b/av1/common/cfl.h
index 0d53764..af8b833 100644
--- a/av1/common/cfl.h
+++ b/av1/common/cfl.h
@@ -61,11 +61,17 @@
return ROUND_POWER_OF_TWO_SIGNED(scaled_luma_q6, 6);
}
-static INLINE CFL_PRED_TYPE get_cfl_pred_type(PLANE_TYPE plane) {
+static INLINE CFL_PRED_TYPE get_cfl_pred_type(int plane) {
assert(plane > 0);
return (CFL_PRED_TYPE)(plane - 1);
}
+static INLINE void clear_cfl_dc_pred_cache_flags(CFL_CTX *cfl) {
+ cfl->use_dc_pred_cache = false;
+ cfl->dc_pred_is_cached[CFL_PRED_U] = false;
+ cfl->dc_pred_is_cached[CFL_PRED_V] = false;
+}
+
void cfl_predict_block(MACROBLOCKD *const xd, uint8_t *dst, int dst_stride,
TX_SIZE tx_size, int plane);
diff --git a/av1/common/convolve.c b/av1/common/convolve.c
index 54b2bb0..9bca542 100644
--- a/av1/common/convolve.c
+++ b/av1/common/convolve.c
@@ -99,7 +99,13 @@
for (int k = 0; k < filter_params_x->taps; ++k) {
sum += x_filter[k] * src_horiz[y * src_stride + x - fo_horiz + k];
}
- assert(0 <= sum && sum < (1 << (bd + FILTER_BITS + 1)));
+
+  // TODO(aomedia:3393): for the 12-tap filter, in extreme cases the result
+  // can fall outside the following range. For better prediction, clamping
+  // could be added for the 12-tap filter to ensure the horizontal filtering
+  // result fits in 16 bits. The same applies to the vertical filtering.
+ assert(filter_params_x->taps > 8 ||
+ (0 <= sum && sum < (1 << (bd + FILTER_BITS + 1))));
im_block[y * im_stride + x] =
(int16_t)ROUND_POWER_OF_TWO(sum, conv_params->round_0);
}
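These relaxed asserts keep the 16-bit intermediate-range check for filters of up to 8 taps but skip it for the 12-tap filter, whose larger positive-tap mass can overshoot the bound in extreme cases (aomedia:3393). A hedged sketch of the worst case the assert encodes, assuming FILTER_BITS == 7 and the 1 << (bd + FILTER_BITS - 1) rounding offset the horizontal pass starts from:

#include <stdint.h>

// Hedged sketch: the largest horizontal sum a filter can produce, assuming
// FILTER_BITS == 7. Positive taps see the maximum pixel, negative taps see 0.
static int32_t worst_case_horiz_sum(const int16_t *taps, int ntaps, int bd) {
  const int32_t max_pel = (1 << bd) - 1;
  int32_t sum = 1 << (bd + 7 - 1);  // Rounding offset added before the taps.
  for (int k = 0; k < ntaps; ++k) {
    if (taps[k] > 0) sum += taps[k] * max_pel;
  }
  return sum;  // The old assert required this to stay below 1 << (bd + 8).
}

Roughly: the positive taps of the standard 8-tap filters sum to well under 192, so the total stays below (64 + 192) << bd < 1 << (bd + 8); the 12-tap filter's taps do not leave that margin, hence the taps > 8 escape hatch.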
@@ -116,7 +122,8 @@
for (int k = 0; k < filter_params_y->taps; ++k) {
sum += y_filter[k] * src_vert[(y - fo_vert + k) * im_stride + x];
}
- assert(0 <= sum && sum < (1 << (offset_bits + 2)));
+ assert(filter_params_y->taps > 8 ||
+ (0 <= sum && sum < (1 << (offset_bits + 2))));
int16_t res = ROUND_POWER_OF_TWO(sum, conv_params->round_1) -
((1 << (offset_bits - conv_params->round_1)) +
(1 << (offset_bits - conv_params->round_1 - 1)));
@@ -173,6 +180,114 @@
}
}
+// This function produces output identical to av1_convolve_2d_sr_c, and is an
+// optimized version for intrabc. It uses the following 2-tap filter:
+// DECLARE_ALIGNED(256, static const int16_t,
+// av1_intrabc_bilinear_filter[2 * SUBPEL_SHIFTS]) = {
+// 128, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+// 64, 64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+// };
+void av1_convolve_2d_sr_intrabc_c(const uint8_t *src, int src_stride,
+ uint8_t *dst, int dst_stride, int w, int h,
+ const InterpFilterParams *filter_params_x,
+ const InterpFilterParams *filter_params_y,
+ const int subpel_x_qn, const int subpel_y_qn,
+ ConvolveParams *conv_params) {
+ assert(subpel_x_qn == 8);
+ assert(subpel_y_qn == 8);
+ assert(filter_params_x->taps == 2 && filter_params_y->taps == 2);
+ assert((conv_params->round_0 + conv_params->round_1) == 2 * FILTER_BITS);
+ (void)filter_params_x;
+ (void)subpel_x_qn;
+ (void)filter_params_y;
+ (void)subpel_y_qn;
+ (void)conv_params;
+
+ int16_t im_block[(MAX_SB_SIZE + MAX_FILTER_TAP - 1) * MAX_SB_SIZE];
+ int im_h = h + 1;
+ int im_stride = w;
+ assert(w <= MAX_SB_SIZE && h <= MAX_SB_SIZE);
+ const int bd = 8;
+
+ // horizontal filter
+ // explicitly operate for subpel_x_qn = 8.
+ int16_t *im = im_block;
+ for (int y = 0; y < im_h; ++y) {
+ for (int x = 0; x < w; ++x) {
+ const int32_t sum = (1 << bd) + src[x] + src[x + 1];
+ assert(0 <= sum && sum < (1 << (bd + 2)));
+ im[x] = sum;
+ }
+ src += src_stride;
+ im += im_stride;
+ }
+
+ // vertical filter
+ // explicitly operate for subpel_y_qn = 8.
+ int16_t *src_vert = im_block;
+ for (int y = 0; y < h; ++y) {
+ for (int x = 0; x < w; ++x) {
+ const int32_t sum =
+ (1 << (bd + 2)) + src_vert[x] + src_vert[im_stride + x];
+ assert(0 <= sum && sum < (1 << (bd + 4)));
+ const int16_t res =
+ ROUND_POWER_OF_TWO(sum, 2) - ((1 << bd) + (1 << (bd - 1)));
+ dst[x] = clip_pixel(res);
+ }
+ src_vert += im_stride;
+ dst += dst_stride;
+ }
+}
+
+// This function produces output identical to av1_convolve_y_sr_c, and is an
+// optimized version for intrabc.
+void av1_convolve_y_sr_intrabc_c(const uint8_t *src, int src_stride,
+ uint8_t *dst, int dst_stride, int w, int h,
+ const InterpFilterParams *filter_params_y,
+ const int subpel_y_qn) {
+ assert(subpel_y_qn == 8);
+ assert(filter_params_y->taps == 2);
+ (void)filter_params_y;
+ (void)subpel_y_qn;
+
+ // vertical filter
+ // explicitly operate for subpel_y_qn = 8.
+ for (int y = 0; y < h; ++y) {
+ for (int x = 0; x < w; ++x) {
+ const int32_t res = src[x] + src[src_stride + x];
+ dst[x] = clip_pixel(ROUND_POWER_OF_TWO(res, 1));
+ }
+ src += src_stride;
+ dst += dst_stride;
+ }
+}
+
+// This function produces output identical to av1_convolve_x_sr_c, and is an
+// optimized version for intrabc.
+void av1_convolve_x_sr_intrabc_c(const uint8_t *src, int src_stride,
+ uint8_t *dst, int dst_stride, int w, int h,
+ const InterpFilterParams *filter_params_x,
+ const int subpel_x_qn,
+ ConvolveParams *conv_params) {
+ assert(subpel_x_qn == 8);
+ assert(filter_params_x->taps == 2);
+ assert((conv_params->round_0 + conv_params->round_1) == 2 * FILTER_BITS);
+ (void)filter_params_x;
+ (void)subpel_x_qn;
+ (void)conv_params;
+
+ // horizontal filter
+ // explicitly operate for subpel_x_qn = 8.
+ for (int y = 0; y < h; ++y) {
+ for (int x = 0; x < w; ++x) {
+ const int32_t res = src[x] + src[x + 1];
+ dst[x] = clip_pixel(ROUND_POWER_OF_TWO(res, 1));
+ }
+ src += src_stride;
+ dst += dst_stride;
+ }
+}
+
void av1_dist_wtd_convolve_2d_c(const uint8_t *src, int src_stride,
uint8_t *dst, int dst_stride, int w, int h,
const InterpFilterParams *filter_params_x,
@@ -200,7 +315,8 @@
for (int k = 0; k < filter_params_x->taps; ++k) {
sum += x_filter[k] * src_horiz[y * src_stride + x - fo_horiz + k];
}
- assert(0 <= sum && sum < (1 << (bd + FILTER_BITS + 1)));
+ assert(filter_params_x->taps > 8 ||
+ (0 <= sum && sum < (1 << (bd + FILTER_BITS + 1))));
im_block[y * im_stride + x] =
(int16_t)ROUND_POWER_OF_TWO(sum, conv_params->round_0);
}
@@ -217,7 +333,8 @@
for (int k = 0; k < filter_params_y->taps; ++k) {
sum += y_filter[k] * src_vert[(y - fo_vert + k) * im_stride + x];
}
- assert(0 <= sum && sum < (1 << (offset_bits + 2)));
+ assert(filter_params_y->taps > 8 ||
+ (0 <= sum && sum < (1 << (offset_bits + 2))));
CONV_BUF_TYPE res = ROUND_POWER_OF_TWO(sum, conv_params->round_1);
if (conv_params->do_average) {
int32_t tmp = dst16[y * dst16_stride + x];
@@ -402,7 +519,8 @@
for (int k = 0; k < filter_params_x->taps; ++k) {
sum += x_filter[k] * src_x[k - fo_horiz];
}
- assert(0 <= sum && sum < (1 << (bd + FILTER_BITS + 1)));
+ assert(filter_params_x->taps > 8 ||
+ (0 <= sum && sum < (1 << (bd + FILTER_BITS + 1))));
im_block[y * im_stride + x] =
(int16_t)ROUND_POWER_OF_TWO(sum, conv_params->round_0);
}
@@ -424,7 +542,8 @@
for (int k = 0; k < filter_params_y->taps; ++k) {
sum += y_filter[k] * src_y[(k - fo_vert) * im_stride];
}
- assert(0 <= sum && sum < (1 << (offset_bits + 2)));
+ assert(filter_params_y->taps > 8 ||
+ (0 <= sum && sum < (1 << (offset_bits + 2))));
CONV_BUF_TYPE res = ROUND_POWER_OF_TWO(sum, conv_params->round_1);
if (conv_params->is_compound) {
if (conv_params->do_average) {
@@ -529,23 +648,22 @@
const InterpFilterParams *filter_params_y = interp_filters[1];
// TODO(jingning, yunqing): Add SIMD support to 2-tap filter case.
- // Do we have SIMD support to 4-tap case?
// 2-tap filter indicates that it is for IntraBC.
if (filter_params_x->taps == 2 || filter_params_y->taps == 2) {
assert(filter_params_x->taps == 2 && filter_params_y->taps == 2);
assert(!scaled);
if (subpel_x_qn && subpel_y_qn) {
- av1_convolve_2d_sr_c(src, src_stride, dst, dst_stride, w, h,
- filter_params_x, filter_params_y, subpel_x_qn,
- subpel_y_qn, conv_params);
+ av1_convolve_2d_sr_intrabc_c(src, src_stride, dst, dst_stride, w, h,
+ filter_params_x, filter_params_y,
+ subpel_x_qn, subpel_y_qn, conv_params);
return;
} else if (subpel_x_qn) {
- av1_convolve_x_sr_c(src, src_stride, dst, dst_stride, w, h,
- filter_params_x, subpel_x_qn, conv_params);
+ av1_convolve_x_sr_intrabc_c(src, src_stride, dst, dst_stride, w, h,
+ filter_params_x, subpel_x_qn, conv_params);
return;
} else if (subpel_y_qn) {
- av1_convolve_y_sr_c(src, src_stride, dst, dst_stride, w, h,
- filter_params_y, subpel_y_qn);
+ av1_convolve_y_sr_intrabc_c(src, src_stride, dst, dst_stride, w, h,
+ filter_params_y, subpel_y_qn);
return;
}
}
@@ -640,7 +758,8 @@
for (int k = 0; k < filter_params_x->taps; ++k) {
sum += x_filter[k] * src_horiz[y * src_stride + x - fo_horiz + k];
}
- assert(0 <= sum && sum < (1 << (bd + FILTER_BITS + 1)));
+ assert(filter_params_x->taps > 8 ||
+ (0 <= sum && sum < (1 << (bd + FILTER_BITS + 1))));
im_block[y * im_stride + x] =
ROUND_POWER_OF_TWO(sum, conv_params->round_0);
}
@@ -657,7 +776,8 @@
for (int k = 0; k < filter_params_y->taps; ++k) {
sum += y_filter[k] * src_vert[(y - fo_vert + k) * im_stride + x];
}
- assert(0 <= sum && sum < (1 << (offset_bits + 2)));
+ assert(filter_params_y->taps > 8 ||
+ (0 <= sum && sum < (1 << (offset_bits + 2))));
int32_t res = ROUND_POWER_OF_TWO(sum, conv_params->round_1) -
((1 << (offset_bits - conv_params->round_1)) +
(1 << (offset_bits - conv_params->round_1 - 1)));
@@ -694,7 +814,8 @@
for (k = 0; k < filter_params_x->taps; ++k) {
sum += x_filter[k] * src_horiz[y * src_stride + x - fo_horiz + k];
}
- assert(0 <= sum && sum < (1 << (bd + FILTER_BITS + 1)));
+ assert(filter_params_x->taps > 8 ||
+ (0 <= sum && sum < (1 << (bd + FILTER_BITS + 1))));
(void)bd;
im_block[y * im_stride + x] =
(int16_t)ROUND_POWER_OF_TWO(sum, conv_params->round_0);
@@ -712,7 +833,8 @@
for (k = 0; k < filter_params_y->taps; ++k) {
sum += y_filter[k] * src_vert[(y - fo_vert + k) * im_stride + x];
}
- assert(0 <= sum && sum < (1 << (offset_bits + 2)));
+ assert(filter_params_y->taps > 8 ||
+ (0 <= sum && sum < (1 << (offset_bits + 2))));
CONV_BUF_TYPE res = ROUND_POWER_OF_TWO(sum, conv_params->round_1);
if (conv_params->do_average) {
int32_t tmp = dst16[y * dst16_stride + x];
@@ -899,7 +1021,8 @@
for (int k = 0; k < filter_params_x->taps; ++k) {
sum += x_filter[k] * src_x[k - fo_horiz];
}
- assert(0 <= sum && sum < (1 << (bd + FILTER_BITS + 1)));
+ assert(filter_params_x->taps > 8 ||
+ (0 <= sum && sum < (1 << (bd + FILTER_BITS + 1))));
im_block[y * im_stride + x] =
(int16_t)ROUND_POWER_OF_TWO(sum, conv_params->round_0);
}
@@ -921,7 +1044,8 @@
for (int k = 0; k < filter_params_y->taps; ++k) {
sum += y_filter[k] * src_y[(k - fo_vert) * im_stride];
}
- assert(0 <= sum && sum < (1 << (offset_bits + 2)));
+ assert(filter_params_y->taps > 8 ||
+ (0 <= sum && sum < (1 << (offset_bits + 2))));
CONV_BUF_TYPE res = ROUND_POWER_OF_TWO(sum, conv_params->round_1);
if (conv_params->is_compound) {
if (conv_params->do_average) {
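Because intrabc always lands on the half-pel position (subpel_*_qn == 8), the {64, 64} tap pair makes every specialized path above collapse into rounded averaging of two neighboring samples: ROUND_POWER_OF_TWO(64 * a + 64 * b, 7) == (a + b + 1) >> 1. A self-contained sketch mirroring av1_convolve_x_sr_intrabc_c, with clip_pixel re-implemented locally for self-containment:

#include <stdint.h>

static uint8_t clip_pixel_u8(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

// Half-pel horizontal filter with intrabc's {64, 64} taps: the full 2-tap
// convolution reduces to a rounded average of horizontally adjacent pixels.
static void intrabc_half_pel_x(const uint8_t *src, int src_stride,
                               uint8_t *dst, int dst_stride, int w, int h) {
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      dst[x] = clip_pixel_u8((src[x] + src[x + 1] + 1) >> 1);
    }
    src += src_stride;
    dst += dst_stride;
  }
}

The 2-D path does the same thing twice with an intermediate offset of 1 << bd so the staging buffer stays non-negative; the offsets cancel in the final ROUND_POWER_OF_TWO(sum, 2) - ((1 << bd) + (1 << (bd - 1))) step, leaving a rounded average of the four source pixels.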
diff --git a/av1/common/entropymode.c b/av1/common/entropymode.c
index 49fc551..8381c1f 100644
--- a/av1/common/entropymode.c
+++ b/av1/common/entropymode.c
@@ -1066,7 +1066,7 @@
RefCntBuffer *const buf = get_ref_frame_buf(cm, i);
if (buf != NULL) buf->frame_context = *cm->fc;
}
- for (int i = 0; i < FRAME_BUFFERS; ++i)
+ for (int i = 0; i < cm->buffer_pool->num_frame_bufs; ++i)
cm->buffer_pool->frame_bufs[i].frame_context = *cm->fc;
}
}
diff --git a/av1/common/entropymode.h b/av1/common/entropymode.h
index d1b0df2..09cd6bd 100644
--- a/av1/common/entropymode.h
+++ b/av1/common/entropymode.h
@@ -190,11 +190,11 @@
void av1_setup_past_independence(struct AV1Common *cm);
// Returns (int)ceil(log2(n)).
-// NOTE: This implementation only works for n <= 2^30.
static INLINE int av1_ceil_log2(int n) {
if (n < 2) return 0;
- int i = 1, p = 2;
- while (p < n) {
+ int i = 1;
+ unsigned int p = 2;
+ while (p < (unsigned int)n) {
i++;
p = p << 1;
}
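Switching p to unsigned lets the running power of two reach 2^31 without signed overflow (with a signed p, the shift past 1 << 30 is undefined behavior), so the NOTE about n <= 2^30 can be dropped: the function is now correct for every positive int n. A standalone copy with a spot check past the old limit:

#include <assert.h>

// Standalone copy of the fixed av1_ceil_log2 above; returns (int)ceil(log2(n)).
static int ceil_log2(int n) {
  if (n < 2) return 0;
  int i = 1;
  unsigned int p = 2;
  while (p < (unsigned int)n) {
    i++;
    p = p << 1;
  }
  return i;
}

int main(void) {
  assert(ceil_log2(1) == 0);
  assert(ceil_log2(2) == 1);
  assert(ceil_log2(3) == 2);
  assert(ceil_log2((1 << 30) + 1) == 31);  // Beyond the old 2^30 limit.
  return 0;
}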
diff --git a/av1/common/enums.h b/av1/common/enums.h
index 1b952c4..fb4d756 100644
--- a/av1/common/enums.h
+++ b/av1/common/enums.h
@@ -558,8 +558,16 @@
// REF_FRAMES for the cm->ref_frame_map array, 1 scratch frame for the new
// frame in cm->cur_frame, INTER_REFS_PER_FRAME for scaled references on the
// encoder in the cpi->scaled_ref_buf array.
+// The encoder uses FRAME_BUFFERS only in GOOD and REALTIME encoding modes.
+// The decoder also uses FRAME_BUFFERS.
#define FRAME_BUFFERS (REF_FRAMES + 1 + INTER_REFS_PER_FRAME)
+// During allintra encoding, one reference frame buffer is free to be used again
+// only after another frame buffer is stored as the reference frame. Hence, it
+// is necessary and sufficient to maintain only two reference frame buffers in
+// this case.
+#define FRAME_BUFFERS_ALLINTRA 2
+
#define FWD_RF_OFFSET(ref) (ref - LAST_FRAME)
#define BWD_RF_OFFSET(ref) (ref - BWDREF_FRAME)
@@ -610,7 +618,7 @@
} RestorationType;
/*!\cond */
-// Picture prediction structures (0-12 are predefined) in scalability metadata.
+// Picture prediction structures (0-13 are predefined) in scalability metadata.
enum {
SCALABILITY_L1T2 = 0,
SCALABILITY_L1T3 = 1,
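With AV1's REF_FRAMES == 8 and INTER_REFS_PER_FRAME == 7, FRAME_BUFFERS works out to 16; the new FRAME_BUFFERS_ALLINTRA cuts the all-intra pool to 2. This is also why the entropymode.c hunk earlier switches its loop bound from the compile-time FRAME_BUFFERS to the pool's runtime num_frame_bufs. A hedged sketch of the sizing choice (num_frame_bufs_for_mode is a hypothetical helper, not libaom API):

#define REF_FRAMES 8             // AV1 reference frame map size.
#define INTER_REFS_PER_FRAME 7   // LAST..ALTREF references per frame.
#define FRAME_BUFFERS (REF_FRAMES + 1 + INTER_REFS_PER_FRAME)  // 16
#define FRAME_BUFFERS_ALLINTRA 2

// Hypothetical helper: all-intra keeps just the buffer being coded plus the
// one most recently stored as the reference, so two buffers suffice.
static int num_frame_bufs_for_mode(int is_allintra) {
  return is_allintra ? FRAME_BUFFERS_ALLINTRA : FRAME_BUFFERS;
}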
diff --git a/av1/common/mv.h b/av1/common/mv.h
index a61287b..6828834 100644
--- a/av1/common/mv.h
+++ b/av1/common/mv.h
@@ -94,11 +94,9 @@
// Bits of precision used for the model
#define WARPEDMODEL_PREC_BITS 16
-#define WARPEDMODEL_ROW3HOMO_PREC_BITS 16
#define WARPEDMODEL_TRANS_CLAMP (128 << WARPEDMODEL_PREC_BITS)
#define WARPEDMODEL_NONDIAGAFFINE_CLAMP (1 << (WARPEDMODEL_PREC_BITS - 3))
-#define WARPEDMODEL_ROW3HOMO_CLAMP (1 << (WARPEDMODEL_PREC_BITS - 2))
// Bits of subpel precision for warped interpolation
#define WARPEDPIXEL_PREC_BITS 6
@@ -108,26 +106,18 @@
#define WARPEDDIFF_PREC_BITS (WARPEDMODEL_PREC_BITS - WARPEDPIXEL_PREC_BITS)
-// Number of types used for global motion (must be >= 3 and <= TRANS_TYPES)
-// The following can be useful:
-// GLOBAL_TRANS_TYPES 3 - up to rotation-zoom
-// GLOBAL_TRANS_TYPES 4 - up to affine
-// GLOBAL_TRANS_TYPES 6 - up to hor/ver trapezoids
-// GLOBAL_TRANS_TYPES 7 - up to full homography
-#define GLOBAL_TRANS_TYPES 4
-
typedef struct {
int global_warp_allowed;
int local_warp_allowed;
} WarpTypesAllowed;
// The order of values in the wmmat matrix below is best described
-// by the homography:
+// by the affine transformation:
// [x' (m2 m3 m0 [x
// z . y' = m4 m5 m1 * y
-// 1] m6 m7 1) 1]
+// 1] 0 0 1) 1]
typedef struct {
- int32_t wmmat[6];
+ int32_t wmmat[MAX_PARAMDIM];
int16_t alpha, beta, gamma, delta;
TransformationType wmtype;
int8_t invalid;
@@ -184,19 +174,11 @@
#define GM_ALPHA_PREC_DIFF (WARPEDMODEL_PREC_BITS - GM_ALPHA_PREC_BITS)
#define GM_ALPHA_DECODE_FACTOR (1 << GM_ALPHA_PREC_DIFF)
-#define GM_ROW3HOMO_PREC_BITS 16
-#define GM_ABS_ROW3HOMO_BITS 11
-#define GM_ROW3HOMO_PREC_DIFF \
- (WARPEDMODEL_ROW3HOMO_PREC_BITS - GM_ROW3HOMO_PREC_BITS)
-#define GM_ROW3HOMO_DECODE_FACTOR (1 << GM_ROW3HOMO_PREC_DIFF)
-
#define GM_TRANS_MAX (1 << GM_ABS_TRANS_BITS)
#define GM_ALPHA_MAX (1 << GM_ABS_ALPHA_BITS)
-#define GM_ROW3HOMO_MAX (1 << GM_ABS_ROW3HOMO_BITS)
#define GM_TRANS_MIN -GM_TRANS_MAX
#define GM_ALPHA_MIN -GM_ALPHA_MAX
-#define GM_ROW3HOMO_MIN -GM_ROW3HOMO_MAX
static INLINE int block_center_x(int mi_col, BLOCK_SIZE bs) {
const int bw = block_size_wide[bs];
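After this cleanup wmmat carries only the six affine parameters (MAX_PARAMDIM presumably now equals 6), ordered as the comment shows: translation in m0/m1 and the 2x2 matrix in m2..m5, all in Q(WARPEDMODEL_PREC_BITS) fixed point. A hedged sketch applying the model to an integer pixel position:

#include <stdint.h>

#define WARPEDMODEL_PREC_BITS 16

// Hedged sketch: evaluate the affine model from the comment above at pixel
// position (x, y). wmmat = {m0, m1, m2, m3, m4, m5}:
//   x' = m2*x + m3*y + m0,  y' = m4*x + m5*y + m1   (results in Q16).
static void warp_affine_point(const int32_t wmmat[6], int x, int y,
                              int64_t *xq16, int64_t *yq16) {
  *xq16 = (int64_t)wmmat[2] * x + (int64_t)wmmat[3] * y + wmmat[0];
  *yq16 = (int64_t)wmmat[4] * x + (int64_t)wmmat[5] * y + wmmat[1];
}

The identity model is {0, 0, 1 << WARPEDMODEL_PREC_BITS, 0, 0, 1 << WARPEDMODEL_PREC_BITS}, mapping (x, y) to (x << 16, y << 16), i.e. the same position in Q16.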
diff --git a/av1/common/quant_common.c b/av1/common/quant_common.c
index e96d71a..b097628 100644
--- a/av1/common/quant_common.c
+++ b/av1/common/quant_common.c
@@ -415,21 +415,12 @@
121, 122, 130, 130, 140, 140, 150, 151, 163, 164, 176, 177, 190, 191,
204, 206, 222, 224, 230, 232, 242,
/* Size 4x8 */
- 32, 42, 75, 91, 33, 42, 69, 86, 37, 58, 84, 91, 49, 71, 103, 110, 65,
- 84, 125, 128, 80, 97, 142, 152, 91, 100, 145, 178, 104, 112, 146, 190,
- /* Size 8x4 */
32, 33, 37, 49, 65, 80, 91, 104, 42, 42, 58, 71, 84, 97, 100, 112, 75,
69, 84, 103, 125, 142, 145, 146, 91, 86, 91, 110, 128, 152, 178, 190,
+ /* Size 8x4 */
+ 32, 42, 75, 91, 33, 42, 69, 86, 37, 58, 84, 91, 49, 71, 103, 110, 65,
+ 84, 125, 128, 80, 97, 142, 152, 91, 100, 145, 178, 104, 112, 146, 190,
/* Size 8x16 */
- 32, 32, 36, 53, 65, 87, 93, 99, 31, 33, 34, 49, 59, 78, 86, 93, 32, 34,
- 36, 50, 59, 77, 82, 89, 34, 37, 42, 54, 63, 79, 80, 88, 36, 38, 48, 60,
- 68, 84, 86, 90, 44, 43, 53, 71, 79, 95, 94, 97, 48, 46, 56, 76, 85, 102,
- 105, 105, 58, 54, 63, 87, 98, 116, 112, 115, 65, 58, 68, 92, 105, 124,
- 122, 124, 79, 70, 79, 104, 118, 141, 135, 135, 82, 72, 81, 106, 121,
- 144, 149, 146, 91, 80, 88, 106, 130, 148, 162, 159, 97, 86, 94, 107,
- 128, 157, 167, 171, 103, 93, 98, 114, 131, 150, 174, 186, 110, 100, 101,
- 117, 138, 161, 183, 193, 118, 107, 105, 118, 136, 157, 182, 203,
- /* Size 16x8 */
32, 31, 32, 34, 36, 44, 48, 58, 65, 79, 82, 91, 97, 103, 110, 118, 32,
33, 34, 37, 38, 43, 46, 54, 58, 70, 72, 80, 86, 93, 100, 107, 36, 34,
36, 42, 48, 53, 56, 63, 68, 79, 81, 88, 94, 98, 101, 105, 53, 49, 50,
@@ -438,40 +429,16 @@
79, 84, 95, 102, 116, 124, 141, 144, 148, 157, 150, 161, 157, 93, 86,
82, 80, 86, 94, 105, 112, 122, 135, 149, 162, 167, 174, 183, 182, 99,
93, 89, 88, 90, 97, 105, 115, 124, 135, 146, 159, 171, 186, 193, 203,
+ /* Size 16x8 */
+ 32, 32, 36, 53, 65, 87, 93, 99, 31, 33, 34, 49, 59, 78, 86, 93, 32, 34,
+ 36, 50, 59, 77, 82, 89, 34, 37, 42, 54, 63, 79, 80, 88, 36, 38, 48, 60,
+ 68, 84, 86, 90, 44, 43, 53, 71, 79, 95, 94, 97, 48, 46, 56, 76, 85, 102,
+ 105, 105, 58, 54, 63, 87, 98, 116, 112, 115, 65, 58, 68, 92, 105, 124,
+ 122, 124, 79, 70, 79, 104, 118, 141, 135, 135, 82, 72, 81, 106, 121,
+ 144, 149, 146, 91, 80, 88, 106, 130, 148, 162, 159, 97, 86, 94, 107,
+ 128, 157, 167, 171, 103, 93, 98, 114, 131, 150, 174, 186, 110, 100, 101,
+ 117, 138, 161, 183, 193, 118, 107, 105, 118, 136, 157, 182, 203,
/* Size 16x32 */
- 32, 31, 32, 34, 36, 44, 53, 59, 65, 79, 87, 90, 93, 96, 99, 102, 31, 32,
- 32, 34, 35, 42, 51, 56, 62, 75, 82, 85, 88, 91, 94, 97, 31, 32, 33, 33,
- 34, 41, 49, 54, 59, 72, 78, 82, 86, 90, 93, 97, 31, 32, 33, 34, 35, 41,
- 49, 54, 59, 71, 78, 81, 84, 87, 90, 93, 32, 32, 34, 35, 36, 42, 50, 54,
- 59, 71, 77, 80, 82, 86, 89, 93, 32, 33, 35, 37, 38, 42, 49, 53, 58, 69,
- 75, 78, 82, 86, 89, 92, 34, 34, 37, 39, 42, 48, 54, 58, 63, 73, 79, 78,
- 80, 83, 88, 92, 35, 34, 37, 41, 45, 50, 57, 61, 65, 76, 82, 83, 84, 84,
- 87, 90, 36, 34, 38, 43, 48, 54, 60, 64, 68, 78, 84, 87, 86, 89, 90, 90,
- 39, 37, 40, 45, 50, 58, 65, 69, 73, 84, 89, 89, 91, 91, 93, 96, 44, 41,
- 43, 48, 53, 63, 71, 75, 79, 90, 95, 93, 94, 95, 97, 97, 46, 43, 44, 49,
- 55, 65, 73, 78, 82, 93, 98, 100, 98, 100, 99, 103, 48, 45, 46, 51, 56,
- 67, 76, 80, 85, 96, 102, 102, 105, 102, 105, 104, 53, 49, 50, 54, 60,
- 71, 82, 87, 92, 103, 109, 107, 107, 110, 107, 111, 58, 54, 54, 58, 63,
- 75, 87, 92, 98, 110, 116, 115, 112, 111, 115, 112, 61, 57, 56, 60, 66,
- 77, 89, 95, 101, 114, 120, 118, 119, 118, 116, 120, 65, 60, 58, 63, 68,
- 79, 92, 98, 105, 118, 124, 123, 122, 123, 124, 121, 71, 65, 63, 68, 73,
- 84, 97, 103, 111, 125, 132, 132, 130, 128, 127, 130, 79, 72, 70, 74, 79,
- 90, 104, 110, 118, 133, 141, 136, 135, 135, 135, 131, 81, 74, 71, 75,
- 80, 91, 105, 112, 119, 135, 142, 140, 140, 138, 139, 142, 82, 75, 72,
- 76, 81, 92, 106, 113, 121, 136, 144, 151, 149, 149, 146, 143, 88, 80,
- 77, 80, 85, 97, 108, 115, 126, 142, 149, 153, 153, 152, 152, 154, 91,
- 83, 80, 81, 88, 100, 106, 114, 130, 142, 148, 155, 162, 160, 159, 155,
- 94, 85, 83, 82, 91, 100, 105, 118, 131, 137, 153, 160, 165, 167, 166,
- 168, 97, 88, 86, 85, 94, 100, 107, 123, 128, 140, 157, 161, 167, 173,
- 171, 169, 100, 91, 89, 87, 97, 100, 111, 121, 127, 145, 152, 164, 173,
- 178, 182, 181, 103, 94, 93, 90, 98, 101, 114, 120, 131, 144, 150, 170,
- 174, 180, 186, 183, 107, 97, 96, 93, 100, 104, 117, 119, 136, 142, 155,
- 168, 177, 187, 191, 198, 110, 101, 100, 97, 101, 108, 117, 123, 138,
- 141, 161, 165, 183, 188, 193, 200, 114, 104, 104, 100, 103, 112, 117,
- 127, 137, 146, 159, 167, 185, 190, 201, 206, 118, 108, 107, 103, 105,
- 115, 118, 131, 136, 151, 157, 172, 182, 197, 203, 208, 122, 111, 111,
- 107, 107, 119, 119, 136, 136, 156, 156, 178, 179, 203, 204, 217,
- /* Size 32x16 */
32, 31, 31, 31, 32, 32, 34, 35, 36, 39, 44, 46, 48, 53, 58, 61, 65, 71,
79, 81, 82, 88, 91, 94, 97, 100, 103, 107, 110, 114, 118, 122, 31, 32,
32, 32, 32, 33, 34, 34, 34, 37, 41, 43, 45, 49, 54, 57, 60, 65, 72, 74,
@@ -504,34 +471,50 @@
152, 159, 166, 171, 182, 186, 191, 193, 201, 203, 204, 102, 97, 97, 93,
93, 92, 92, 90, 90, 96, 97, 103, 104, 111, 112, 120, 121, 130, 131, 142,
143, 154, 155, 168, 169, 181, 183, 198, 200, 206, 208, 217,
+ /* Size 32x16 */
+ 32, 31, 32, 34, 36, 44, 53, 59, 65, 79, 87, 90, 93, 96, 99, 102, 31, 32,
+ 32, 34, 35, 42, 51, 56, 62, 75, 82, 85, 88, 91, 94, 97, 31, 32, 33, 33,
+ 34, 41, 49, 54, 59, 72, 78, 82, 86, 90, 93, 97, 31, 32, 33, 34, 35, 41,
+ 49, 54, 59, 71, 78, 81, 84, 87, 90, 93, 32, 32, 34, 35, 36, 42, 50, 54,
+ 59, 71, 77, 80, 82, 86, 89, 93, 32, 33, 35, 37, 38, 42, 49, 53, 58, 69,
+ 75, 78, 82, 86, 89, 92, 34, 34, 37, 39, 42, 48, 54, 58, 63, 73, 79, 78,
+ 80, 83, 88, 92, 35, 34, 37, 41, 45, 50, 57, 61, 65, 76, 82, 83, 84, 84,
+ 87, 90, 36, 34, 38, 43, 48, 54, 60, 64, 68, 78, 84, 87, 86, 89, 90, 90,
+ 39, 37, 40, 45, 50, 58, 65, 69, 73, 84, 89, 89, 91, 91, 93, 96, 44, 41,
+ 43, 48, 53, 63, 71, 75, 79, 90, 95, 93, 94, 95, 97, 97, 46, 43, 44, 49,
+ 55, 65, 73, 78, 82, 93, 98, 100, 98, 100, 99, 103, 48, 45, 46, 51, 56,
+ 67, 76, 80, 85, 96, 102, 102, 105, 102, 105, 104, 53, 49, 50, 54, 60,
+ 71, 82, 87, 92, 103, 109, 107, 107, 110, 107, 111, 58, 54, 54, 58, 63,
+ 75, 87, 92, 98, 110, 116, 115, 112, 111, 115, 112, 61, 57, 56, 60, 66,
+ 77, 89, 95, 101, 114, 120, 118, 119, 118, 116, 120, 65, 60, 58, 63, 68,
+ 79, 92, 98, 105, 118, 124, 123, 122, 123, 124, 121, 71, 65, 63, 68, 73,
+ 84, 97, 103, 111, 125, 132, 132, 130, 128, 127, 130, 79, 72, 70, 74, 79,
+ 90, 104, 110, 118, 133, 141, 136, 135, 135, 135, 131, 81, 74, 71, 75,
+ 80, 91, 105, 112, 119, 135, 142, 140, 140, 138, 139, 142, 82, 75, 72,
+ 76, 81, 92, 106, 113, 121, 136, 144, 151, 149, 149, 146, 143, 88, 80,
+ 77, 80, 85, 97, 108, 115, 126, 142, 149, 153, 153, 152, 152, 154, 91,
+ 83, 80, 81, 88, 100, 106, 114, 130, 142, 148, 155, 162, 160, 159, 155,
+ 94, 85, 83, 82, 91, 100, 105, 118, 131, 137, 153, 160, 165, 167, 166,
+ 168, 97, 88, 86, 85, 94, 100, 107, 123, 128, 140, 157, 161, 167, 173,
+ 171, 169, 100, 91, 89, 87, 97, 100, 111, 121, 127, 145, 152, 164, 173,
+ 178, 182, 181, 103, 94, 93, 90, 98, 101, 114, 120, 131, 144, 150, 170,
+ 174, 180, 186, 183, 107, 97, 96, 93, 100, 104, 117, 119, 136, 142, 155,
+ 168, 177, 187, 191, 198, 110, 101, 100, 97, 101, 108, 117, 123, 138,
+ 141, 161, 165, 183, 188, 193, 200, 114, 104, 104, 100, 103, 112, 117,
+ 127, 137, 146, 159, 167, 185, 190, 201, 206, 118, 108, 107, 103, 105,
+ 115, 118, 131, 136, 151, 157, 172, 182, 197, 203, 208, 122, 111, 111,
+ 107, 107, 119, 119, 136, 136, 156, 156, 178, 179, 203, 204, 217,
/* Size 4x16 */
- 31, 44, 79, 96, 32, 41, 72, 90, 32, 42, 71, 86, 34, 48, 73, 83, 34, 54,
- 78, 89, 41, 63, 90, 95, 45, 67, 96, 102, 54, 75, 110, 111, 60, 79, 118,
- 123, 72, 90, 133, 135, 75, 92, 136, 149, 83, 100, 142, 160, 88, 100,
- 140, 173, 94, 101, 144, 180, 101, 108, 141, 188, 108, 115, 151, 197,
- /* Size 16x4 */
31, 32, 32, 34, 34, 41, 45, 54, 60, 72, 75, 83, 88, 94, 101, 108, 44,
41, 42, 48, 54, 63, 67, 75, 79, 90, 92, 100, 100, 101, 108, 115, 79, 72,
71, 73, 78, 90, 96, 110, 118, 133, 136, 142, 140, 144, 141, 151, 96, 90,
86, 83, 89, 95, 102, 111, 123, 135, 149, 160, 173, 180, 188, 197,
+ /* Size 16x4 */
+ 31, 44, 79, 96, 32, 41, 72, 90, 32, 42, 71, 86, 34, 48, 73, 83, 34, 54,
+ 78, 89, 41, 63, 90, 95, 45, 67, 96, 102, 54, 75, 110, 111, 60, 79, 118,
+ 123, 72, 90, 133, 135, 75, 92, 136, 149, 83, 100, 142, 160, 88, 100,
+ 140, 173, 94, 101, 144, 180, 101, 108, 141, 188, 108, 115, 151, 197,
/* Size 8x32 */
- 32, 32, 36, 53, 65, 87, 93, 99, 31, 32, 35, 51, 62, 82, 88, 94, 31, 33,
- 34, 49, 59, 78, 86, 93, 31, 33, 35, 49, 59, 78, 84, 90, 32, 34, 36, 50,
- 59, 77, 82, 89, 32, 35, 38, 49, 58, 75, 82, 89, 34, 37, 42, 54, 63, 79,
- 80, 88, 35, 37, 45, 57, 65, 82, 84, 87, 36, 38, 48, 60, 68, 84, 86, 90,
- 39, 40, 50, 65, 73, 89, 91, 93, 44, 43, 53, 71, 79, 95, 94, 97, 46, 44,
- 55, 73, 82, 98, 98, 99, 48, 46, 56, 76, 85, 102, 105, 105, 53, 50, 60,
- 82, 92, 109, 107, 107, 58, 54, 63, 87, 98, 116, 112, 115, 61, 56, 66,
- 89, 101, 120, 119, 116, 65, 58, 68, 92, 105, 124, 122, 124, 71, 63, 73,
- 97, 111, 132, 130, 127, 79, 70, 79, 104, 118, 141, 135, 135, 81, 71, 80,
- 105, 119, 142, 140, 139, 82, 72, 81, 106, 121, 144, 149, 146, 88, 77,
- 85, 108, 126, 149, 153, 152, 91, 80, 88, 106, 130, 148, 162, 159, 94,
- 83, 91, 105, 131, 153, 165, 166, 97, 86, 94, 107, 128, 157, 167, 171,
- 100, 89, 97, 111, 127, 152, 173, 182, 103, 93, 98, 114, 131, 150, 174,
- 186, 107, 96, 100, 117, 136, 155, 177, 191, 110, 100, 101, 117, 138,
- 161, 183, 193, 114, 104, 103, 117, 137, 159, 185, 201, 118, 107, 105,
- 118, 136, 157, 182, 203, 122, 111, 107, 119, 136, 156, 179, 204,
- /* Size 32x8 */
32, 31, 31, 31, 32, 32, 34, 35, 36, 39, 44, 46, 48, 53, 58, 61, 65, 71,
79, 81, 82, 88, 91, 94, 97, 100, 103, 107, 110, 114, 118, 122, 32, 32,
33, 33, 34, 35, 37, 37, 38, 40, 43, 44, 46, 50, 54, 56, 58, 63, 70, 71,
@@ -547,7 +530,24 @@
84, 86, 91, 94, 98, 105, 107, 112, 119, 122, 130, 135, 140, 149, 153,
162, 165, 167, 173, 174, 177, 183, 185, 182, 179, 99, 94, 93, 90, 89,
89, 88, 87, 90, 93, 97, 99, 105, 107, 115, 116, 124, 127, 135, 139, 146,
- 152, 159, 166, 171, 182, 186, 191, 193, 201, 203, 204 },
+ 152, 159, 166, 171, 182, 186, 191, 193, 201, 203, 204,
+ /* Size 32x8 */
+ 32, 32, 36, 53, 65, 87, 93, 99, 31, 32, 35, 51, 62, 82, 88, 94, 31, 33,
+ 34, 49, 59, 78, 86, 93, 31, 33, 35, 49, 59, 78, 84, 90, 32, 34, 36, 50,
+ 59, 77, 82, 89, 32, 35, 38, 49, 58, 75, 82, 89, 34, 37, 42, 54, 63, 79,
+ 80, 88, 35, 37, 45, 57, 65, 82, 84, 87, 36, 38, 48, 60, 68, 84, 86, 90,
+ 39, 40, 50, 65, 73, 89, 91, 93, 44, 43, 53, 71, 79, 95, 94, 97, 46, 44,
+ 55, 73, 82, 98, 98, 99, 48, 46, 56, 76, 85, 102, 105, 105, 53, 50, 60,
+ 82, 92, 109, 107, 107, 58, 54, 63, 87, 98, 116, 112, 115, 61, 56, 66,
+ 89, 101, 120, 119, 116, 65, 58, 68, 92, 105, 124, 122, 124, 71, 63, 73,
+ 97, 111, 132, 130, 127, 79, 70, 79, 104, 118, 141, 135, 135, 81, 71, 80,
+ 105, 119, 142, 140, 139, 82, 72, 81, 106, 121, 144, 149, 146, 88, 77,
+ 85, 108, 126, 149, 153, 152, 91, 80, 88, 106, 130, 148, 162, 159, 94,
+ 83, 91, 105, 131, 153, 165, 166, 97, 86, 94, 107, 128, 157, 167, 171,
+ 100, 89, 97, 111, 127, 152, 173, 182, 103, 93, 98, 114, 131, 150, 174,
+ 186, 107, 96, 100, 117, 136, 155, 177, 191, 110, 100, 101, 117, 138,
+ 161, 183, 193, 114, 104, 103, 117, 137, 159, 185, 201, 118, 107, 105,
+ 118, 136, 157, 182, 203, 122, 111, 107, 119, 136, 156, 179, 204 },
{ /* Chroma */
/* Size 4x4 */
35, 46, 57, 66, 46, 60, 69, 71, 57, 69, 90, 90, 66, 71, 90, 109,
@@ -633,21 +633,12 @@
77, 78, 82, 82, 86, 87, 92, 92, 96, 97, 102, 102, 107, 107, 112, 113,
115, 115, 118,
/* Size 4x8 */
- 31, 47, 60, 66, 40, 45, 54, 61, 46, 56, 64, 64, 48, 61, 75, 73, 54, 65,
- 85, 82, 61, 69, 92, 92, 64, 68, 90, 102, 68, 71, 87, 105,
- /* Size 8x4 */
31, 40, 46, 48, 54, 61, 64, 68, 47, 45, 56, 61, 65, 69, 68, 71, 60, 54,
64, 75, 85, 92, 90, 87, 66, 61, 64, 73, 82, 92, 102, 105,
+ /* Size 8x4 */
+ 31, 47, 60, 66, 40, 45, 54, 61, 46, 56, 64, 64, 48, 61, 75, 73, 54, 65,
+ 85, 82, 61, 69, 92, 92, 64, 68, 90, 102, 68, 71, 87, 105,
/* Size 8x16 */
- 32, 37, 48, 52, 57, 66, 68, 71, 30, 40, 46, 48, 52, 60, 63, 66, 33, 43,
- 47, 47, 51, 59, 60, 63, 42, 47, 50, 50, 53, 60, 59, 62, 49, 48, 53, 54,
- 57, 62, 62, 62, 49, 46, 53, 61, 64, 69, 66, 66, 50, 46, 54, 64, 67, 73,
- 72, 70, 54, 49, 55, 68, 73, 80, 76, 75, 57, 50, 56, 70, 76, 84, 80, 79,
- 63, 55, 60, 75, 82, 92, 87, 84, 64, 56, 61, 75, 83, 93, 93, 89, 68, 59,
- 64, 74, 86, 94, 98, 94, 70, 62, 66, 73, 83, 96, 99, 98, 72, 64, 66, 75,
- 83, 92, 101, 104, 74, 67, 66, 74, 84, 94, 103, 106, 76, 69, 67, 73, 82,
- 91, 101, 109,
- /* Size 16x8 */
32, 30, 33, 42, 49, 49, 50, 54, 57, 63, 64, 68, 70, 72, 74, 76, 37, 40,
43, 47, 48, 46, 46, 49, 50, 55, 56, 59, 62, 64, 67, 69, 48, 46, 47, 50,
53, 53, 54, 55, 56, 60, 61, 64, 66, 66, 66, 67, 52, 48, 47, 50, 54, 61,
@@ -656,37 +647,16 @@
93, 94, 96, 92, 94, 91, 68, 63, 60, 59, 62, 66, 72, 76, 80, 87, 93, 98,
99, 101, 103, 101, 71, 66, 63, 62, 62, 66, 70, 75, 79, 84, 89, 94, 98,
104, 106, 109,
+ /* Size 16x8 */
+ 32, 37, 48, 52, 57, 66, 68, 71, 30, 40, 46, 48, 52, 60, 63, 66, 33, 43,
+ 47, 47, 51, 59, 60, 63, 42, 47, 50, 50, 53, 60, 59, 62, 49, 48, 53, 54,
+ 57, 62, 62, 62, 49, 46, 53, 61, 64, 69, 66, 66, 50, 46, 54, 64, 67, 73,
+ 72, 70, 54, 49, 55, 68, 73, 80, 76, 75, 57, 50, 56, 70, 76, 84, 80, 79,
+ 63, 55, 60, 75, 82, 92, 87, 84, 64, 56, 61, 75, 83, 93, 93, 89, 68, 59,
+ 64, 74, 86, 94, 98, 94, 70, 62, 66, 73, 83, 96, 99, 98, 72, 64, 66, 75,
+ 83, 92, 101, 104, 74, 67, 66, 74, 84, 94, 103, 106, 76, 69, 67, 73, 82,
+ 91, 101, 109,
/* Size 16x32 */
- 32, 31, 37, 42, 48, 49, 52, 54, 57, 63, 66, 67, 68, 69, 71, 72, 31, 31,
- 38, 42, 47, 47, 50, 52, 54, 60, 63, 64, 65, 66, 67, 68, 30, 32, 40, 42,
- 46, 45, 48, 50, 52, 57, 60, 62, 63, 65, 66, 68, 32, 34, 41, 44, 46, 45,
- 48, 49, 51, 57, 59, 61, 62, 63, 64, 65, 33, 36, 43, 45, 47, 46, 47, 49,
- 51, 56, 59, 60, 60, 62, 63, 65, 37, 40, 47, 47, 47, 45, 47, 48, 50, 54,
- 57, 58, 60, 61, 62, 63, 42, 43, 47, 48, 50, 49, 50, 52, 53, 57, 60, 58,
- 59, 60, 62, 63, 45, 44, 47, 49, 51, 51, 52, 54, 55, 59, 61, 61, 61, 60,
- 61, 61, 49, 46, 48, 50, 53, 53, 54, 55, 57, 60, 62, 63, 62, 63, 62, 62,
- 48, 46, 47, 50, 53, 56, 57, 59, 60, 64, 66, 65, 65, 64, 64, 65, 49, 45,
- 46, 49, 53, 58, 61, 62, 64, 67, 69, 67, 66, 66, 66, 65, 49, 46, 46, 49,
- 53, 59, 62, 64, 65, 69, 71, 70, 68, 68, 67, 68, 50, 46, 46, 50, 54, 59,
- 64, 65, 67, 71, 73, 72, 72, 70, 70, 69, 52, 48, 47, 50, 54, 61, 66, 68,
- 71, 75, 77, 74, 73, 73, 71, 72, 54, 50, 49, 52, 55, 62, 68, 71, 73, 78,
- 80, 78, 76, 74, 75, 73, 55, 51, 49, 52, 56, 63, 69, 72, 75, 80, 82, 80,
- 79, 78, 76, 77, 57, 52, 50, 53, 56, 64, 70, 73, 76, 82, 84, 82, 80, 80,
- 79, 77, 60, 54, 52, 55, 58, 65, 72, 75, 79, 85, 88, 86, 84, 82, 81, 81,
- 63, 57, 55, 58, 60, 67, 75, 78, 82, 89, 92, 88, 87, 85, 84, 81, 64, 58,
- 55, 58, 61, 68, 75, 78, 82, 89, 92, 90, 89, 87, 86, 86, 64, 59, 56, 58,
- 61, 68, 75, 79, 83, 90, 93, 95, 93, 91, 89, 87, 67, 61, 58, 60, 63, 69,
- 76, 79, 85, 92, 95, 96, 94, 92, 91, 91, 68, 62, 59, 60, 64, 71, 74, 78,
- 86, 91, 94, 96, 98, 96, 94, 91, 69, 62, 60, 60, 65, 70, 72, 79, 85, 88,
- 95, 98, 99, 98, 97, 96, 70, 63, 62, 60, 66, 69, 73, 81, 83, 89, 96, 97,
- 99, 101, 98, 97, 71, 64, 63, 61, 67, 68, 74, 79, 82, 90, 93, 98, 102,
- 102, 102, 101, 72, 65, 64, 62, 66, 68, 75, 78, 83, 89, 92, 100, 101,
- 103, 104, 102, 73, 66, 65, 63, 66, 69, 75, 76, 84, 87, 93, 98, 102, 105,
- 106, 107, 74, 67, 67, 64, 66, 70, 74, 77, 84, 86, 94, 96, 103, 105, 106,
- 107, 75, 68, 68, 65, 66, 71, 74, 78, 83, 87, 93, 96, 103, 105, 109, 109,
- 76, 69, 69, 66, 67, 72, 73, 80, 82, 88, 91, 97, 101, 107, 109, 110, 77,
- 70, 70, 67, 67, 73, 73, 81, 81, 90, 90, 99, 99, 108, 108, 113,
- /* Size 32x16 */
32, 31, 30, 32, 33, 37, 42, 45, 49, 48, 49, 49, 50, 52, 54, 55, 57, 60,
63, 64, 64, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 31, 31, 32, 34,
36, 40, 43, 44, 46, 46, 45, 46, 46, 48, 50, 51, 52, 54, 57, 58, 59, 61,
@@ -716,33 +686,47 @@
76, 79, 81, 84, 86, 89, 91, 94, 97, 98, 102, 104, 106, 106, 109, 109,
108, 72, 68, 68, 65, 65, 63, 63, 61, 62, 65, 65, 68, 69, 72, 73, 77, 77,
81, 81, 86, 87, 91, 91, 96, 97, 101, 102, 107, 107, 109, 110, 113,
+ /* Size 32x16 */
+ 32, 31, 37, 42, 48, 49, 52, 54, 57, 63, 66, 67, 68, 69, 71, 72, 31, 31,
+ 38, 42, 47, 47, 50, 52, 54, 60, 63, 64, 65, 66, 67, 68, 30, 32, 40, 42,
+ 46, 45, 48, 50, 52, 57, 60, 62, 63, 65, 66, 68, 32, 34, 41, 44, 46, 45,
+ 48, 49, 51, 57, 59, 61, 62, 63, 64, 65, 33, 36, 43, 45, 47, 46, 47, 49,
+ 51, 56, 59, 60, 60, 62, 63, 65, 37, 40, 47, 47, 47, 45, 47, 48, 50, 54,
+ 57, 58, 60, 61, 62, 63, 42, 43, 47, 48, 50, 49, 50, 52, 53, 57, 60, 58,
+ 59, 60, 62, 63, 45, 44, 47, 49, 51, 51, 52, 54, 55, 59, 61, 61, 61, 60,
+ 61, 61, 49, 46, 48, 50, 53, 53, 54, 55, 57, 60, 62, 63, 62, 63, 62, 62,
+ 48, 46, 47, 50, 53, 56, 57, 59, 60, 64, 66, 65, 65, 64, 64, 65, 49, 45,
+ 46, 49, 53, 58, 61, 62, 64, 67, 69, 67, 66, 66, 66, 65, 49, 46, 46, 49,
+ 53, 59, 62, 64, 65, 69, 71, 70, 68, 68, 67, 68, 50, 46, 46, 50, 54, 59,
+ 64, 65, 67, 71, 73, 72, 72, 70, 70, 69, 52, 48, 47, 50, 54, 61, 66, 68,
+ 71, 75, 77, 74, 73, 73, 71, 72, 54, 50, 49, 52, 55, 62, 68, 71, 73, 78,
+ 80, 78, 76, 74, 75, 73, 55, 51, 49, 52, 56, 63, 69, 72, 75, 80, 82, 80,
+ 79, 78, 76, 77, 57, 52, 50, 53, 56, 64, 70, 73, 76, 82, 84, 82, 80, 80,
+ 79, 77, 60, 54, 52, 55, 58, 65, 72, 75, 79, 85, 88, 86, 84, 82, 81, 81,
+ 63, 57, 55, 58, 60, 67, 75, 78, 82, 89, 92, 88, 87, 85, 84, 81, 64, 58,
+ 55, 58, 61, 68, 75, 78, 82, 89, 92, 90, 89, 87, 86, 86, 64, 59, 56, 58,
+ 61, 68, 75, 79, 83, 90, 93, 95, 93, 91, 89, 87, 67, 61, 58, 60, 63, 69,
+ 76, 79, 85, 92, 95, 96, 94, 92, 91, 91, 68, 62, 59, 60, 64, 71, 74, 78,
+ 86, 91, 94, 96, 98, 96, 94, 91, 69, 62, 60, 60, 65, 70, 72, 79, 85, 88,
+ 95, 98, 99, 98, 97, 96, 70, 63, 62, 60, 66, 69, 73, 81, 83, 89, 96, 97,
+ 99, 101, 98, 97, 71, 64, 63, 61, 67, 68, 74, 79, 82, 90, 93, 98, 102,
+ 102, 102, 101, 72, 65, 64, 62, 66, 68, 75, 78, 83, 89, 92, 100, 101,
+ 103, 104, 102, 73, 66, 65, 63, 66, 69, 75, 76, 84, 87, 93, 98, 102, 105,
+ 106, 107, 74, 67, 67, 64, 66, 70, 74, 77, 84, 86, 94, 96, 103, 105, 106,
+ 107, 75, 68, 68, 65, 66, 71, 74, 78, 83, 87, 93, 96, 103, 105, 109, 109,
+ 76, 69, 69, 66, 67, 72, 73, 80, 82, 88, 91, 97, 101, 107, 109, 110, 77,
+ 70, 70, 67, 67, 73, 73, 81, 81, 90, 90, 99, 99, 108, 108, 113,
/* Size 4x16 */
- 31, 49, 63, 69, 32, 45, 57, 65, 36, 46, 56, 62, 43, 49, 57, 60, 46, 53,
- 60, 63, 45, 58, 67, 66, 46, 59, 71, 70, 50, 62, 78, 74, 52, 64, 82, 80,
- 57, 67, 89, 85, 59, 68, 90, 91, 62, 71, 91, 96, 63, 69, 89, 101, 65, 68,
- 89, 103, 67, 70, 86, 105, 69, 72, 88, 107,
- /* Size 16x4 */
31, 32, 36, 43, 46, 45, 46, 50, 52, 57, 59, 62, 63, 65, 67, 69, 49, 45,
46, 49, 53, 58, 59, 62, 64, 67, 68, 71, 69, 68, 70, 72, 63, 57, 56, 57,
60, 67, 71, 78, 82, 89, 90, 91, 89, 89, 86, 88, 69, 65, 62, 60, 63, 66,
70, 74, 80, 85, 91, 96, 101, 103, 105, 107,
+ /* Size 16x4 */
+ 31, 49, 63, 69, 32, 45, 57, 65, 36, 46, 56, 62, 43, 49, 57, 60, 46, 53,
+ 60, 63, 45, 58, 67, 66, 46, 59, 71, 70, 50, 62, 78, 74, 52, 64, 82, 80,
+ 57, 67, 89, 85, 59, 68, 90, 91, 62, 71, 91, 96, 63, 69, 89, 101, 65, 68,
+ 89, 103, 67, 70, 86, 105, 69, 72, 88, 107,
/* Size 8x32 */
- 32, 37, 48, 52, 57, 66, 68, 71, 31, 38, 47, 50, 54, 63, 65, 67, 30, 40,
- 46, 48, 52, 60, 63, 66, 32, 41, 46, 48, 51, 59, 62, 64, 33, 43, 47, 47,
- 51, 59, 60, 63, 37, 47, 47, 47, 50, 57, 60, 62, 42, 47, 50, 50, 53, 60,
- 59, 62, 45, 47, 51, 52, 55, 61, 61, 61, 49, 48, 53, 54, 57, 62, 62, 62,
- 48, 47, 53, 57, 60, 66, 65, 64, 49, 46, 53, 61, 64, 69, 66, 66, 49, 46,
- 53, 62, 65, 71, 68, 67, 50, 46, 54, 64, 67, 73, 72, 70, 52, 47, 54, 66,
- 71, 77, 73, 71, 54, 49, 55, 68, 73, 80, 76, 75, 55, 49, 56, 69, 75, 82,
- 79, 76, 57, 50, 56, 70, 76, 84, 80, 79, 60, 52, 58, 72, 79, 88, 84, 81,
- 63, 55, 60, 75, 82, 92, 87, 84, 64, 55, 61, 75, 82, 92, 89, 86, 64, 56,
- 61, 75, 83, 93, 93, 89, 67, 58, 63, 76, 85, 95, 94, 91, 68, 59, 64, 74,
- 86, 94, 98, 94, 69, 60, 65, 72, 85, 95, 99, 97, 70, 62, 66, 73, 83, 96,
- 99, 98, 71, 63, 67, 74, 82, 93, 102, 102, 72, 64, 66, 75, 83, 92, 101,
- 104, 73, 65, 66, 75, 84, 93, 102, 106, 74, 67, 66, 74, 84, 94, 103, 106,
- 75, 68, 66, 74, 83, 93, 103, 109, 76, 69, 67, 73, 82, 91, 101, 109, 77,
- 70, 67, 73, 81, 90, 99, 108,
- /* Size 32x8 */
32, 31, 30, 32, 33, 37, 42, 45, 49, 48, 49, 49, 50, 52, 54, 55, 57, 60,
63, 64, 64, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 37, 38, 40, 41,
43, 47, 47, 47, 48, 47, 46, 46, 46, 47, 49, 49, 50, 52, 55, 55, 56, 58,
@@ -757,7 +741,23 @@
59, 61, 62, 65, 66, 68, 72, 73, 76, 79, 80, 84, 87, 89, 93, 94, 98, 99,
99, 102, 101, 102, 103, 103, 101, 99, 71, 67, 66, 64, 63, 62, 62, 61,
62, 64, 66, 67, 70, 71, 75, 76, 79, 81, 84, 86, 89, 91, 94, 97, 98, 102,
- 104, 106, 106, 109, 109, 108 },
+ 104, 106, 106, 109, 109, 108,
+ /* Size 32x8 */
+ 32, 37, 48, 52, 57, 66, 68, 71, 31, 38, 47, 50, 54, 63, 65, 67, 30, 40,
+ 46, 48, 52, 60, 63, 66, 32, 41, 46, 48, 51, 59, 62, 64, 33, 43, 47, 47,
+ 51, 59, 60, 63, 37, 47, 47, 47, 50, 57, 60, 62, 42, 47, 50, 50, 53, 60,
+ 59, 62, 45, 47, 51, 52, 55, 61, 61, 61, 49, 48, 53, 54, 57, 62, 62, 62,
+ 48, 47, 53, 57, 60, 66, 65, 64, 49, 46, 53, 61, 64, 69, 66, 66, 49, 46,
+ 53, 62, 65, 71, 68, 67, 50, 46, 54, 64, 67, 73, 72, 70, 52, 47, 54, 66,
+ 71, 77, 73, 71, 54, 49, 55, 68, 73, 80, 76, 75, 55, 49, 56, 69, 75, 82,
+ 79, 76, 57, 50, 56, 70, 76, 84, 80, 79, 60, 52, 58, 72, 79, 88, 84, 81,
+ 63, 55, 60, 75, 82, 92, 87, 84, 64, 55, 61, 75, 82, 92, 89, 86, 64, 56,
+ 61, 75, 83, 93, 93, 89, 67, 58, 63, 76, 85, 95, 94, 91, 68, 59, 64, 74,
+ 86, 94, 98, 94, 69, 60, 65, 72, 85, 95, 99, 97, 70, 62, 66, 73, 83, 96,
+ 99, 98, 71, 63, 67, 74, 82, 93, 102, 102, 72, 64, 66, 75, 83, 92, 101,
+ 104, 73, 65, 66, 75, 84, 93, 102, 106, 74, 67, 66, 74, 84, 94, 103, 106,
+ 75, 68, 66, 74, 83, 93, 103, 109, 76, 69, 67, 73, 82, 91, 101, 109, 77,
+ 70, 67, 73, 81, 90, 99, 108 },
},
{
{ /* Luma */
@@ -851,21 +851,12 @@
121, 129, 130, 139, 140, 151, 151, 162, 162, 175, 176, 187, 188, 203,
204, 210, 211, 219,
/* Size 4x8 */
- 32, 42, 69, 88, 33, 42, 64, 83, 36, 56, 77, 88, 46, 67, 93, 105, 60, 79,
- 112, 122, 75, 92, 130, 144, 86, 95, 136, 167, 98, 105, 136, 177,
- /* Size 8x4 */
32, 33, 36, 46, 60, 75, 86, 98, 42, 42, 56, 67, 79, 92, 95, 105, 69, 64,
77, 93, 112, 130, 136, 136, 88, 83, 88, 105, 122, 144, 167, 177,
+ /* Size 8x4 */
+ 32, 42, 69, 88, 33, 42, 64, 83, 36, 56, 77, 88, 46, 67, 93, 105, 60, 79,
+ 112, 122, 75, 92, 130, 144, 86, 95, 136, 167, 98, 105, 136, 177,
/* Size 8x16 */
- 32, 32, 36, 47, 65, 79, 90, 96, 31, 32, 35, 44, 60, 72, 84, 90, 32, 34,
- 36, 45, 59, 71, 80, 87, 32, 35, 40, 47, 60, 71, 78, 85, 36, 37, 48, 56,
- 68, 78, 83, 87, 39, 40, 50, 60, 73, 84, 91, 94, 47, 45, 56, 69, 84, 95,
- 101, 101, 53, 50, 60, 75, 92, 103, 108, 110, 61, 56, 65, 81, 100, 113,
- 116, 118, 71, 64, 73, 89, 111, 125, 129, 129, 79, 70, 79, 95, 118, 133,
- 142, 138, 86, 76, 84, 100, 124, 140, 153, 150, 92, 82, 89, 101, 121,
- 148, 157, 161, 98, 88, 93, 108, 124, 141, 163, 174, 104, 94, 95, 110,
- 129, 151, 171, 181, 110, 100, 98, 111, 127, 147, 169, 188,
- /* Size 16x8 */
32, 31, 32, 32, 36, 39, 47, 53, 61, 71, 79, 86, 92, 98, 104, 110, 32,
32, 34, 35, 37, 40, 45, 50, 56, 64, 70, 76, 82, 88, 94, 100, 36, 35, 36,
40, 48, 50, 56, 60, 65, 73, 79, 84, 89, 93, 95, 98, 47, 44, 45, 47, 56,
@@ -874,40 +865,16 @@
95, 103, 113, 125, 133, 140, 148, 141, 151, 147, 90, 84, 80, 78, 83, 91,
101, 108, 116, 129, 142, 153, 157, 163, 171, 169, 96, 90, 87, 85, 87,
94, 101, 110, 118, 129, 138, 150, 161, 174, 181, 188,
+ /* Size 16x8 */
+ 32, 32, 36, 47, 65, 79, 90, 96, 31, 32, 35, 44, 60, 72, 84, 90, 32, 34,
+ 36, 45, 59, 71, 80, 87, 32, 35, 40, 47, 60, 71, 78, 85, 36, 37, 48, 56,
+ 68, 78, 83, 87, 39, 40, 50, 60, 73, 84, 91, 94, 47, 45, 56, 69, 84, 95,
+ 101, 101, 53, 50, 60, 75, 92, 103, 108, 110, 61, 56, 65, 81, 100, 113,
+ 116, 118, 71, 64, 73, 89, 111, 125, 129, 129, 79, 70, 79, 95, 118, 133,
+ 142, 138, 86, 76, 84, 100, 124, 140, 153, 150, 92, 82, 89, 101, 121,
+ 148, 157, 161, 98, 88, 93, 108, 124, 141, 163, 174, 104, 94, 95, 110,
+ 129, 151, 171, 181, 110, 100, 98, 111, 127, 147, 169, 188,
/* Size 16x32 */
- 32, 31, 32, 32, 36, 44, 47, 53, 65, 73, 79, 87, 90, 93, 96, 99, 31, 32,
- 32, 33, 35, 42, 45, 51, 62, 69, 75, 83, 86, 88, 91, 94, 31, 32, 32, 33,
- 35, 41, 44, 49, 60, 67, 72, 80, 84, 87, 90, 94, 31, 32, 33, 33, 35, 41,
- 44, 49, 59, 66, 71, 79, 82, 84, 87, 90, 32, 32, 34, 34, 36, 42, 45, 50,
- 59, 65, 71, 78, 80, 83, 87, 90, 32, 33, 35, 36, 38, 42, 45, 49, 58, 64,
- 69, 76, 80, 83, 86, 88, 32, 33, 35, 36, 40, 44, 47, 51, 60, 66, 71, 76,
- 78, 81, 85, 89, 34, 34, 36, 38, 42, 48, 50, 54, 63, 69, 73, 80, 82, 81,
- 84, 86, 36, 34, 37, 40, 48, 54, 56, 60, 68, 74, 78, 84, 83, 86, 87, 87,
- 38, 36, 39, 41, 49, 56, 58, 63, 71, 77, 81, 86, 88, 88, 90, 93, 39, 37,
- 40, 42, 50, 58, 60, 65, 73, 79, 84, 90, 91, 92, 94, 93, 44, 41, 42, 45,
- 53, 63, 66, 71, 79, 85, 90, 96, 94, 96, 96, 99, 47, 44, 45, 47, 56, 66,
- 69, 75, 84, 90, 95, 99, 101, 98, 101, 99, 49, 46, 47, 48, 57, 67, 71,
- 77, 86, 93, 97, 103, 103, 105, 102, 106, 53, 49, 50, 51, 60, 71, 75, 82,
- 92, 99, 103, 111, 108, 107, 110, 107, 58, 54, 54, 55, 63, 75, 79, 87,
- 98, 105, 110, 114, 114, 113, 111, 115, 61, 56, 56, 57, 65, 77, 81, 89,
- 100, 107, 113, 118, 116, 117, 118, 116, 65, 60, 59, 60, 68, 79, 84, 92,
- 105, 112, 118, 126, 124, 122, 121, 124, 71, 65, 64, 65, 73, 84, 89, 97,
- 111, 119, 125, 130, 129, 129, 129, 125, 76, 69, 68, 69, 76, 88, 92, 101,
- 115, 123, 130, 134, 134, 131, 132, 135, 79, 72, 70, 71, 79, 90, 95, 104,
- 118, 127, 133, 143, 142, 141, 138, 136, 82, 75, 73, 74, 81, 92, 97, 106,
- 121, 130, 136, 146, 145, 144, 144, 145, 86, 78, 76, 77, 84, 95, 100,
- 109, 124, 133, 140, 147, 153, 151, 150, 146, 89, 81, 79, 78, 87, 95, 99,
- 112, 124, 130, 145, 152, 156, 157, 156, 158, 92, 84, 82, 80, 89, 95,
- 101, 116, 121, 132, 148, 151, 157, 163, 161, 159, 95, 86, 85, 83, 92,
- 95, 105, 114, 120, 136, 143, 155, 163, 167, 171, 170, 98, 89, 88, 85,
- 93, 95, 108, 113, 124, 136, 141, 160, 163, 169, 174, 171, 101, 92, 91,
- 88, 94, 98, 110, 112, 128, 133, 146, 158, 166, 175, 179, 185, 104, 95,
- 94, 91, 95, 101, 110, 115, 129, 132, 151, 154, 171, 175, 181, 186, 107,
- 98, 97, 94, 96, 105, 110, 119, 128, 136, 149, 156, 173, 177, 188, 192,
- 110, 101, 100, 97, 98, 108, 111, 123, 127, 141, 147, 161, 169, 183, 188,
- 193, 114, 104, 104, 100, 100, 111, 111, 126, 127, 145, 145, 166, 166,
- 189, 190, 201,
- /* Size 32x16 */
32, 31, 31, 31, 32, 32, 32, 34, 36, 38, 39, 44, 47, 49, 53, 58, 61, 65,
71, 76, 79, 82, 86, 89, 92, 95, 98, 101, 104, 107, 110, 114, 31, 32, 32,
32, 32, 33, 33, 34, 34, 36, 37, 41, 44, 46, 49, 54, 56, 60, 65, 69, 72,
@@ -940,34 +907,50 @@
188, 188, 190, 99, 94, 94, 90, 90, 88, 89, 86, 87, 93, 93, 99, 99, 106,
107, 115, 116, 124, 125, 135, 136, 145, 146, 158, 159, 170, 171, 185,
186, 192, 193, 201,
+ /* Size 32x16 */
+ 32, 31, 32, 32, 36, 44, 47, 53, 65, 73, 79, 87, 90, 93, 96, 99, 31, 32,
+ 32, 33, 35, 42, 45, 51, 62, 69, 75, 83, 86, 88, 91, 94, 31, 32, 32, 33,
+ 35, 41, 44, 49, 60, 67, 72, 80, 84, 87, 90, 94, 31, 32, 33, 33, 35, 41,
+ 44, 49, 59, 66, 71, 79, 82, 84, 87, 90, 32, 32, 34, 34, 36, 42, 45, 50,
+ 59, 65, 71, 78, 80, 83, 87, 90, 32, 33, 35, 36, 38, 42, 45, 49, 58, 64,
+ 69, 76, 80, 83, 86, 88, 32, 33, 35, 36, 40, 44, 47, 51, 60, 66, 71, 76,
+ 78, 81, 85, 89, 34, 34, 36, 38, 42, 48, 50, 54, 63, 69, 73, 80, 82, 81,
+ 84, 86, 36, 34, 37, 40, 48, 54, 56, 60, 68, 74, 78, 84, 83, 86, 87, 87,
+ 38, 36, 39, 41, 49, 56, 58, 63, 71, 77, 81, 86, 88, 88, 90, 93, 39, 37,
+ 40, 42, 50, 58, 60, 65, 73, 79, 84, 90, 91, 92, 94, 93, 44, 41, 42, 45,
+ 53, 63, 66, 71, 79, 85, 90, 96, 94, 96, 96, 99, 47, 44, 45, 47, 56, 66,
+ 69, 75, 84, 90, 95, 99, 101, 98, 101, 99, 49, 46, 47, 48, 57, 67, 71,
+ 77, 86, 93, 97, 103, 103, 105, 102, 106, 53, 49, 50, 51, 60, 71, 75, 82,
+ 92, 99, 103, 111, 108, 107, 110, 107, 58, 54, 54, 55, 63, 75, 79, 87,
+ 98, 105, 110, 114, 114, 113, 111, 115, 61, 56, 56, 57, 65, 77, 81, 89,
+ 100, 107, 113, 118, 116, 117, 118, 116, 65, 60, 59, 60, 68, 79, 84, 92,
+ 105, 112, 118, 126, 124, 122, 121, 124, 71, 65, 64, 65, 73, 84, 89, 97,
+ 111, 119, 125, 130, 129, 129, 129, 125, 76, 69, 68, 69, 76, 88, 92, 101,
+ 115, 123, 130, 134, 134, 131, 132, 135, 79, 72, 70, 71, 79, 90, 95, 104,
+ 118, 127, 133, 143, 142, 141, 138, 136, 82, 75, 73, 74, 81, 92, 97, 106,
+ 121, 130, 136, 146, 145, 144, 144, 145, 86, 78, 76, 77, 84, 95, 100,
+ 109, 124, 133, 140, 147, 153, 151, 150, 146, 89, 81, 79, 78, 87, 95, 99,
+ 112, 124, 130, 145, 152, 156, 157, 156, 158, 92, 84, 82, 80, 89, 95,
+ 101, 116, 121, 132, 148, 151, 157, 163, 161, 159, 95, 86, 85, 83, 92,
+ 95, 105, 114, 120, 136, 143, 155, 163, 167, 171, 170, 98, 89, 88, 85,
+ 93, 95, 108, 113, 124, 136, 141, 160, 163, 169, 174, 171, 101, 92, 91,
+ 88, 94, 98, 110, 112, 128, 133, 146, 158, 166, 175, 179, 185, 104, 95,
+ 94, 91, 95, 101, 110, 115, 129, 132, 151, 154, 171, 175, 181, 186, 107,
+ 98, 97, 94, 96, 105, 110, 119, 128, 136, 149, 156, 173, 177, 188, 192,
+ 110, 101, 100, 97, 98, 108, 111, 123, 127, 141, 147, 161, 169, 183, 188,
+ 193, 114, 104, 104, 100, 100, 111, 111, 126, 127, 145, 145, 166, 166,
+ 189, 190, 201,
/* Size 4x16 */
- 31, 44, 73, 93, 32, 41, 67, 87, 32, 42, 65, 83, 33, 44, 66, 81, 34, 54,
- 74, 86, 37, 58, 79, 92, 44, 66, 90, 98, 49, 71, 99, 107, 56, 77, 107,
- 117, 65, 84, 119, 129, 72, 90, 127, 141, 78, 95, 133, 151, 84, 95, 132,
- 163, 89, 95, 136, 169, 95, 101, 132, 175, 101, 108, 141, 183,
- /* Size 16x4 */
31, 32, 32, 33, 34, 37, 44, 49, 56, 65, 72, 78, 84, 89, 95, 101, 44, 41,
42, 44, 54, 58, 66, 71, 77, 84, 90, 95, 95, 95, 101, 108, 73, 67, 65,
66, 74, 79, 90, 99, 107, 119, 127, 133, 132, 136, 132, 141, 93, 87, 83,
81, 86, 92, 98, 107, 117, 129, 141, 151, 163, 169, 175, 183,
+ /* Size 16x4 */
+ 31, 44, 73, 93, 32, 41, 67, 87, 32, 42, 65, 83, 33, 44, 66, 81, 34, 54,
+ 74, 86, 37, 58, 79, 92, 44, 66, 90, 98, 49, 71, 99, 107, 56, 77, 107,
+ 117, 65, 84, 119, 129, 72, 90, 127, 141, 78, 95, 133, 151, 84, 95, 132,
+ 163, 89, 95, 136, 169, 95, 101, 132, 175, 101, 108, 141, 183,
/* Size 8x32 */
- 32, 32, 36, 47, 65, 79, 90, 96, 31, 32, 35, 45, 62, 75, 86, 91, 31, 32,
- 35, 44, 60, 72, 84, 90, 31, 33, 35, 44, 59, 71, 82, 87, 32, 34, 36, 45,
- 59, 71, 80, 87, 32, 35, 38, 45, 58, 69, 80, 86, 32, 35, 40, 47, 60, 71,
- 78, 85, 34, 36, 42, 50, 63, 73, 82, 84, 36, 37, 48, 56, 68, 78, 83, 87,
- 38, 39, 49, 58, 71, 81, 88, 90, 39, 40, 50, 60, 73, 84, 91, 94, 44, 42,
- 53, 66, 79, 90, 94, 96, 47, 45, 56, 69, 84, 95, 101, 101, 49, 47, 57,
- 71, 86, 97, 103, 102, 53, 50, 60, 75, 92, 103, 108, 110, 58, 54, 63, 79,
- 98, 110, 114, 111, 61, 56, 65, 81, 100, 113, 116, 118, 65, 59, 68, 84,
- 105, 118, 124, 121, 71, 64, 73, 89, 111, 125, 129, 129, 76, 68, 76, 92,
- 115, 130, 134, 132, 79, 70, 79, 95, 118, 133, 142, 138, 82, 73, 81, 97,
- 121, 136, 145, 144, 86, 76, 84, 100, 124, 140, 153, 150, 89, 79, 87, 99,
- 124, 145, 156, 156, 92, 82, 89, 101, 121, 148, 157, 161, 95, 85, 92,
- 105, 120, 143, 163, 171, 98, 88, 93, 108, 124, 141, 163, 174, 101, 91,
- 94, 110, 128, 146, 166, 179, 104, 94, 95, 110, 129, 151, 171, 181, 107,
- 97, 96, 110, 128, 149, 173, 188, 110, 100, 98, 111, 127, 147, 169, 188,
- 114, 104, 100, 111, 127, 145, 166, 190,
- /* Size 32x8 */
32, 31, 31, 31, 32, 32, 32, 34, 36, 38, 39, 44, 47, 49, 53, 58, 61, 65,
71, 76, 79, 82, 86, 89, 92, 95, 98, 101, 104, 107, 110, 114, 32, 32, 32,
33, 34, 35, 35, 36, 37, 39, 40, 42, 45, 47, 50, 54, 56, 59, 64, 68, 70,
@@ -983,7 +966,24 @@
101, 103, 108, 114, 116, 124, 129, 134, 142, 145, 153, 156, 157, 163,
163, 166, 171, 173, 169, 166, 96, 91, 90, 87, 87, 86, 85, 84, 87, 90,
94, 96, 101, 102, 110, 111, 118, 121, 129, 132, 138, 144, 150, 156, 161,
- 171, 174, 179, 181, 188, 188, 190 },
+ 171, 174, 179, 181, 188, 188, 190,
+ /* Size 32x8 */
+ 32, 32, 36, 47, 65, 79, 90, 96, 31, 32, 35, 45, 62, 75, 86, 91, 31, 32,
+ 35, 44, 60, 72, 84, 90, 31, 33, 35, 44, 59, 71, 82, 87, 32, 34, 36, 45,
+ 59, 71, 80, 87, 32, 35, 38, 45, 58, 69, 80, 86, 32, 35, 40, 47, 60, 71,
+ 78, 85, 34, 36, 42, 50, 63, 73, 82, 84, 36, 37, 48, 56, 68, 78, 83, 87,
+ 38, 39, 49, 58, 71, 81, 88, 90, 39, 40, 50, 60, 73, 84, 91, 94, 44, 42,
+ 53, 66, 79, 90, 94, 96, 47, 45, 56, 69, 84, 95, 101, 101, 49, 47, 57,
+ 71, 86, 97, 103, 102, 53, 50, 60, 75, 92, 103, 108, 110, 58, 54, 63, 79,
+ 98, 110, 114, 111, 61, 56, 65, 81, 100, 113, 116, 118, 65, 59, 68, 84,
+ 105, 118, 124, 121, 71, 64, 73, 89, 111, 125, 129, 129, 76, 68, 76, 92,
+ 115, 130, 134, 132, 79, 70, 79, 95, 118, 133, 142, 138, 82, 73, 81, 97,
+ 121, 136, 145, 144, 86, 76, 84, 100, 124, 140, 153, 150, 89, 79, 87, 99,
+ 124, 145, 156, 156, 92, 82, 89, 101, 121, 148, 157, 161, 95, 85, 92,
+ 105, 120, 143, 163, 171, 98, 88, 93, 108, 124, 141, 163, 174, 101, 91,
+ 94, 110, 128, 146, 166, 179, 104, 94, 95, 110, 129, 151, 171, 181, 107,
+ 97, 96, 110, 128, 149, 173, 188, 110, 100, 98, 111, 127, 147, 169, 188,
+ 114, 104, 100, 111, 127, 145, 166, 190 },
{ /* Chroma */
/* Size 4x4 */
33, 45, 56, 64, 45, 58, 66, 69, 56, 66, 86, 87, 64, 69, 87, 105,
@@ -1068,21 +1068,12 @@
71, 71, 68, 68, 66, 66, 64, 64, 68, 68, 71, 71, 75, 75, 79, 79, 83, 84,
88, 89, 93, 93, 98, 98, 102, 103, 108, 108, 110, 110, 113,
/* Size 4x8 */
- 31, 47, 57, 65, 40, 45, 52, 61, 46, 55, 61, 63, 47, 60, 70, 72, 52, 64,
- 79, 81, 59, 68, 87, 90, 63, 66, 88, 99, 66, 69, 85, 102,
- /* Size 8x4 */
31, 40, 46, 47, 52, 59, 63, 66, 47, 45, 55, 60, 64, 68, 66, 69, 57, 52,
61, 70, 79, 87, 88, 85, 65, 61, 63, 72, 81, 90, 99, 102,
+ /* Size 8x4 */
+ 31, 47, 57, 65, 40, 45, 52, 61, 46, 55, 61, 63, 47, 60, 70, 72, 52, 64,
+ 79, 81, 59, 68, 87, 90, 63, 66, 88, 99, 66, 69, 85, 102,
/* Size 8x16 */
- 32, 35, 48, 50, 57, 63, 68, 70, 30, 38, 46, 46, 52, 58, 63, 65, 33, 41,
- 47, 46, 51, 56, 60, 63, 39, 46, 48, 47, 51, 55, 58, 61, 49, 48, 53, 54,
- 57, 60, 61, 61, 48, 46, 53, 56, 60, 64, 65, 65, 50, 46, 54, 61, 66, 70,
- 71, 69, 52, 47, 54, 63, 71, 75, 75, 74, 55, 49, 56, 65, 74, 79, 79, 78,
- 60, 53, 58, 68, 79, 85, 85, 82, 63, 55, 60, 70, 82, 89, 91, 87, 66, 58,
- 62, 72, 84, 91, 95, 91, 68, 60, 64, 71, 81, 94, 97, 96, 70, 62, 65, 73,
- 81, 89, 98, 101, 72, 65, 65, 72, 82, 92, 100, 103, 74, 67, 65, 71, 79,
- 89, 98, 105,
- /* Size 16x8 */
32, 30, 33, 39, 49, 48, 50, 52, 55, 60, 63, 66, 68, 70, 72, 74, 35, 38,
41, 46, 48, 46, 46, 47, 49, 53, 55, 58, 60, 62, 65, 67, 48, 46, 47, 48,
53, 53, 54, 54, 56, 58, 60, 62, 64, 65, 65, 65, 50, 46, 46, 47, 54, 56,
@@ -1091,37 +1082,16 @@
89, 91, 94, 89, 92, 89, 68, 63, 60, 58, 61, 65, 71, 75, 79, 85, 91, 95,
97, 98, 100, 98, 70, 65, 63, 61, 61, 65, 69, 74, 78, 82, 87, 91, 96,
101, 103, 105,
+ /* Size 16x8 */
+ 32, 35, 48, 50, 57, 63, 68, 70, 30, 38, 46, 46, 52, 58, 63, 65, 33, 41,
+ 47, 46, 51, 56, 60, 63, 39, 46, 48, 47, 51, 55, 58, 61, 49, 48, 53, 54,
+ 57, 60, 61, 61, 48, 46, 53, 56, 60, 64, 65, 65, 50, 46, 54, 61, 66, 70,
+ 71, 69, 52, 47, 54, 63, 71, 75, 75, 74, 55, 49, 56, 65, 74, 79, 79, 78,
+ 60, 53, 58, 68, 79, 85, 85, 82, 63, 55, 60, 70, 82, 89, 91, 87, 66, 58,
+ 62, 72, 84, 91, 95, 91, 68, 60, 64, 71, 81, 94, 97, 96, 70, 62, 65, 73,
+ 81, 89, 98, 101, 72, 65, 65, 72, 82, 92, 100, 103, 74, 67, 65, 71, 79,
+ 89, 98, 105,
/* Size 16x32 */
- 32, 31, 35, 38, 48, 49, 50, 52, 57, 61, 63, 67, 68, 69, 70, 71, 31, 31,
- 37, 40, 47, 47, 48, 50, 54, 57, 60, 63, 64, 65, 66, 67, 30, 32, 38, 40,
- 46, 45, 46, 48, 52, 55, 58, 61, 63, 64, 65, 67, 31, 33, 38, 41, 46, 45,
- 46, 48, 52, 55, 57, 60, 61, 62, 63, 64, 33, 36, 41, 44, 47, 46, 46, 47,
- 51, 54, 56, 59, 60, 61, 63, 64, 37, 40, 45, 47, 47, 45, 46, 47, 50, 52,
- 54, 57, 59, 61, 62, 62, 39, 41, 46, 47, 48, 47, 47, 48, 51, 54, 55, 57,
- 58, 59, 61, 62, 42, 43, 46, 48, 50, 49, 50, 50, 53, 56, 57, 60, 60, 59,
- 60, 60, 49, 46, 48, 49, 53, 53, 54, 54, 57, 59, 60, 63, 61, 62, 61, 61,
- 48, 46, 47, 48, 53, 55, 55, 56, 58, 61, 62, 64, 64, 63, 63, 64, 48, 46,
- 46, 48, 53, 56, 56, 57, 60, 62, 64, 66, 65, 65, 65, 64, 49, 45, 45, 47,
- 53, 58, 59, 61, 64, 66, 67, 69, 67, 67, 66, 67, 50, 46, 46, 48, 54, 59,
- 61, 63, 66, 68, 70, 71, 71, 68, 69, 67, 51, 47, 47, 48, 54, 60, 61, 64,
- 68, 70, 71, 73, 72, 72, 70, 71, 52, 48, 47, 48, 54, 61, 63, 66, 71, 73,
- 75, 77, 75, 73, 74, 71, 54, 50, 49, 50, 55, 62, 65, 68, 73, 76, 78, 79,
- 78, 76, 74, 75, 55, 51, 49, 50, 56, 63, 65, 69, 74, 77, 79, 81, 79, 78,
- 78, 75, 57, 52, 50, 51, 56, 64, 66, 70, 76, 79, 82, 85, 83, 81, 79, 79,
- 60, 54, 53, 53, 58, 65, 68, 72, 79, 82, 85, 87, 85, 84, 82, 80, 62, 56,
- 54, 55, 60, 66, 69, 74, 81, 84, 87, 88, 87, 85, 84, 84, 63, 57, 55, 56,
- 60, 67, 70, 75, 82, 86, 89, 92, 91, 89, 87, 84, 64, 59, 56, 57, 61, 68,
- 71, 75, 83, 87, 90, 93, 92, 90, 89, 89, 66, 60, 58, 58, 62, 69, 72, 76,
- 84, 88, 91, 94, 95, 93, 91, 89, 67, 61, 59, 58, 63, 68, 71, 78, 83, 86,
- 93, 96, 96, 96, 94, 94, 68, 62, 60, 59, 64, 67, 71, 79, 81, 86, 94, 95,
- 97, 98, 96, 94, 69, 63, 61, 60, 65, 66, 72, 77, 80, 88, 91, 96, 99, 99,
- 100, 98, 70, 64, 62, 60, 65, 66, 73, 76, 81, 87, 89, 97, 98, 100, 101,
- 99, 71, 65, 64, 61, 65, 67, 73, 74, 82, 85, 90, 95, 99, 102, 103, 104,
- 72, 65, 65, 62, 65, 68, 72, 75, 82, 83, 92, 93, 100, 102, 103, 104, 73,
- 66, 66, 63, 65, 69, 72, 76, 81, 85, 90, 93, 100, 102, 105, 106, 74, 67,
- 67, 64, 65, 70, 71, 77, 79, 86, 89, 94, 98, 103, 105, 106, 75, 68, 68,
- 65, 65, 71, 71, 78, 78, 87, 87, 96, 96, 105, 105, 109,
- /* Size 32x16 */
32, 31, 30, 31, 33, 37, 39, 42, 49, 48, 48, 49, 50, 51, 52, 54, 55, 57,
60, 62, 63, 64, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 31, 31, 32, 33,
36, 40, 41, 43, 46, 46, 46, 45, 46, 47, 48, 50, 51, 52, 54, 56, 57, 59,
@@ -1151,33 +1121,47 @@
79, 82, 84, 87, 89, 91, 94, 96, 100, 101, 103, 103, 105, 105, 105, 71,
67, 67, 64, 64, 62, 62, 60, 61, 64, 64, 67, 67, 71, 71, 75, 75, 79, 80,
84, 84, 89, 89, 94, 94, 98, 99, 104, 104, 106, 106, 109,
+ /* Size 32x16 */
+ 32, 31, 35, 38, 48, 49, 50, 52, 57, 61, 63, 67, 68, 69, 70, 71, 31, 31,
+ 37, 40, 47, 47, 48, 50, 54, 57, 60, 63, 64, 65, 66, 67, 30, 32, 38, 40,
+ 46, 45, 46, 48, 52, 55, 58, 61, 63, 64, 65, 67, 31, 33, 38, 41, 46, 45,
+ 46, 48, 52, 55, 57, 60, 61, 62, 63, 64, 33, 36, 41, 44, 47, 46, 46, 47,
+ 51, 54, 56, 59, 60, 61, 63, 64, 37, 40, 45, 47, 47, 45, 46, 47, 50, 52,
+ 54, 57, 59, 61, 62, 62, 39, 41, 46, 47, 48, 47, 47, 48, 51, 54, 55, 57,
+ 58, 59, 61, 62, 42, 43, 46, 48, 50, 49, 50, 50, 53, 56, 57, 60, 60, 59,
+ 60, 60, 49, 46, 48, 49, 53, 53, 54, 54, 57, 59, 60, 63, 61, 62, 61, 61,
+ 48, 46, 47, 48, 53, 55, 55, 56, 58, 61, 62, 64, 64, 63, 63, 64, 48, 46,
+ 46, 48, 53, 56, 56, 57, 60, 62, 64, 66, 65, 65, 65, 64, 49, 45, 45, 47,
+ 53, 58, 59, 61, 64, 66, 67, 69, 67, 67, 66, 67, 50, 46, 46, 48, 54, 59,
+ 61, 63, 66, 68, 70, 71, 71, 68, 69, 67, 51, 47, 47, 48, 54, 60, 61, 64,
+ 68, 70, 71, 73, 72, 72, 70, 71, 52, 48, 47, 48, 54, 61, 63, 66, 71, 73,
+ 75, 77, 75, 73, 74, 71, 54, 50, 49, 50, 55, 62, 65, 68, 73, 76, 78, 79,
+ 78, 76, 74, 75, 55, 51, 49, 50, 56, 63, 65, 69, 74, 77, 79, 81, 79, 78,
+ 78, 75, 57, 52, 50, 51, 56, 64, 66, 70, 76, 79, 82, 85, 83, 81, 79, 79,
+ 60, 54, 53, 53, 58, 65, 68, 72, 79, 82, 85, 87, 85, 84, 82, 80, 62, 56,
+ 54, 55, 60, 66, 69, 74, 81, 84, 87, 88, 87, 85, 84, 84, 63, 57, 55, 56,
+ 60, 67, 70, 75, 82, 86, 89, 92, 91, 89, 87, 84, 64, 59, 56, 57, 61, 68,
+ 71, 75, 83, 87, 90, 93, 92, 90, 89, 89, 66, 60, 58, 58, 62, 69, 72, 76,
+ 84, 88, 91, 94, 95, 93, 91, 89, 67, 61, 59, 58, 63, 68, 71, 78, 83, 86,
+ 93, 96, 96, 96, 94, 94, 68, 62, 60, 59, 64, 67, 71, 79, 81, 86, 94, 95,
+ 97, 98, 96, 94, 69, 63, 61, 60, 65, 66, 72, 77, 80, 88, 91, 96, 99, 99,
+ 100, 98, 70, 64, 62, 60, 65, 66, 73, 76, 81, 87, 89, 97, 98, 100, 101,
+ 99, 71, 65, 64, 61, 65, 67, 73, 74, 82, 85, 90, 95, 99, 102, 103, 104,
+ 72, 65, 65, 62, 65, 68, 72, 75, 82, 83, 92, 93, 100, 102, 103, 104, 73,
+ 66, 66, 63, 65, 69, 72, 76, 81, 85, 90, 93, 100, 102, 105, 106, 74, 67,
+ 67, 64, 65, 70, 71, 77, 79, 86, 89, 94, 98, 103, 105, 106, 75, 68, 68,
+ 65, 65, 71, 71, 78, 78, 87, 87, 96, 96, 105, 105, 109,
/* Size 4x16 */
- 31, 49, 61, 69, 32, 45, 55, 64, 36, 46, 54, 61, 41, 47, 54, 59, 46, 53,
- 59, 62, 46, 56, 62, 65, 46, 59, 68, 68, 48, 61, 73, 73, 51, 63, 77, 78,
- 54, 65, 82, 84, 57, 67, 86, 89, 60, 69, 88, 93, 62, 67, 86, 98, 64, 66,
- 87, 100, 65, 68, 83, 102, 67, 70, 86, 103,
- /* Size 16x4 */
31, 32, 36, 41, 46, 46, 46, 48, 51, 54, 57, 60, 62, 64, 65, 67, 49, 45,
46, 47, 53, 56, 59, 61, 63, 65, 67, 69, 67, 66, 68, 70, 61, 55, 54, 54,
59, 62, 68, 73, 77, 82, 86, 88, 86, 87, 83, 86, 69, 64, 61, 59, 62, 65,
68, 73, 78, 84, 89, 93, 98, 100, 102, 103,
+ /* Size 16x4 */
+ 31, 49, 61, 69, 32, 45, 55, 64, 36, 46, 54, 61, 41, 47, 54, 59, 46, 53,
+ 59, 62, 46, 56, 62, 65, 46, 59, 68, 68, 48, 61, 73, 73, 51, 63, 77, 78,
+ 54, 65, 82, 84, 57, 67, 86, 89, 60, 69, 88, 93, 62, 67, 86, 98, 64, 66,
+ 87, 100, 65, 68, 83, 102, 67, 70, 86, 103,
/* Size 8x32 */
- 32, 35, 48, 50, 57, 63, 68, 70, 31, 37, 47, 48, 54, 60, 64, 66, 30, 38,
- 46, 46, 52, 58, 63, 65, 31, 38, 46, 46, 52, 57, 61, 63, 33, 41, 47, 46,
- 51, 56, 60, 63, 37, 45, 47, 46, 50, 54, 59, 62, 39, 46, 48, 47, 51, 55,
- 58, 61, 42, 46, 50, 50, 53, 57, 60, 60, 49, 48, 53, 54, 57, 60, 61, 61,
- 48, 47, 53, 55, 58, 62, 64, 63, 48, 46, 53, 56, 60, 64, 65, 65, 49, 45,
- 53, 59, 64, 67, 67, 66, 50, 46, 54, 61, 66, 70, 71, 69, 51, 47, 54, 61,
- 68, 71, 72, 70, 52, 47, 54, 63, 71, 75, 75, 74, 54, 49, 55, 65, 73, 78,
- 78, 74, 55, 49, 56, 65, 74, 79, 79, 78, 57, 50, 56, 66, 76, 82, 83, 79,
- 60, 53, 58, 68, 79, 85, 85, 82, 62, 54, 60, 69, 81, 87, 87, 84, 63, 55,
- 60, 70, 82, 89, 91, 87, 64, 56, 61, 71, 83, 90, 92, 89, 66, 58, 62, 72,
- 84, 91, 95, 91, 67, 59, 63, 71, 83, 93, 96, 94, 68, 60, 64, 71, 81, 94,
- 97, 96, 69, 61, 65, 72, 80, 91, 99, 100, 70, 62, 65, 73, 81, 89, 98,
- 101, 71, 64, 65, 73, 82, 90, 99, 103, 72, 65, 65, 72, 82, 92, 100, 103,
- 73, 66, 65, 72, 81, 90, 100, 105, 74, 67, 65, 71, 79, 89, 98, 105, 75,
- 68, 65, 71, 78, 87, 96, 105,
- /* Size 32x8 */
32, 31, 30, 31, 33, 37, 39, 42, 49, 48, 48, 49, 50, 51, 52, 54, 55, 57,
60, 62, 63, 64, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 35, 37, 38, 38,
41, 45, 46, 46, 48, 47, 46, 45, 46, 47, 47, 49, 49, 50, 53, 54, 55, 56,
@@ -1192,7 +1176,23 @@
58, 60, 61, 64, 65, 67, 71, 72, 75, 78, 79, 83, 85, 87, 91, 92, 95, 96,
97, 99, 98, 99, 100, 100, 98, 96, 70, 66, 65, 63, 63, 62, 61, 60, 61,
63, 65, 66, 69, 70, 74, 74, 78, 79, 82, 84, 87, 89, 91, 94, 96, 100,
- 101, 103, 103, 105, 105, 105 },
+ 101, 103, 103, 105, 105, 105,
+ /* Size 32x8 */
+ 32, 35, 48, 50, 57, 63, 68, 70, 31, 37, 47, 48, 54, 60, 64, 66, 30, 38,
+ 46, 46, 52, 58, 63, 65, 31, 38, 46, 46, 52, 57, 61, 63, 33, 41, 47, 46,
+ 51, 56, 60, 63, 37, 45, 47, 46, 50, 54, 59, 62, 39, 46, 48, 47, 51, 55,
+ 58, 61, 42, 46, 50, 50, 53, 57, 60, 60, 49, 48, 53, 54, 57, 60, 61, 61,
+ 48, 47, 53, 55, 58, 62, 64, 63, 48, 46, 53, 56, 60, 64, 65, 65, 49, 45,
+ 53, 59, 64, 67, 67, 66, 50, 46, 54, 61, 66, 70, 71, 69, 51, 47, 54, 61,
+ 68, 71, 72, 70, 52, 47, 54, 63, 71, 75, 75, 74, 54, 49, 55, 65, 73, 78,
+ 78, 74, 55, 49, 56, 65, 74, 79, 79, 78, 57, 50, 56, 66, 76, 82, 83, 79,
+ 60, 53, 58, 68, 79, 85, 85, 82, 62, 54, 60, 69, 81, 87, 87, 84, 63, 55,
+ 60, 70, 82, 89, 91, 87, 64, 56, 61, 71, 83, 90, 92, 89, 66, 58, 62, 72,
+ 84, 91, 95, 91, 67, 59, 63, 71, 83, 93, 96, 94, 68, 60, 64, 71, 81, 94,
+ 97, 96, 69, 61, 65, 72, 80, 91, 99, 100, 70, 62, 65, 73, 81, 89, 98,
+ 101, 71, 64, 65, 73, 82, 90, 99, 103, 72, 65, 65, 72, 82, 92, 100, 103,
+ 73, 66, 65, 72, 81, 90, 100, 105, 74, 67, 65, 71, 79, 89, 98, 105, 75,
+ 68, 65, 71, 78, 87, 96, 105 },
},
{
{ /* Luma */
@@ -1284,21 +1284,12 @@
101, 97, 97, 95, 95, 93, 93, 99, 99, 105, 105, 112, 112, 120, 120, 129,
129, 139, 140, 149, 149, 161, 161, 172, 172, 185, 186, 191, 192, 199,
/* Size 4x8 */
- 32, 38, 62, 86, 32, 40, 58, 80, 34, 51, 68, 85, 44, 61, 85, 101, 54, 69,
- 98, 117, 72, 84, 118, 136, 82, 89, 129, 157, 92, 98, 127, 165,
- /* Size 8x4 */
32, 32, 34, 44, 54, 72, 82, 92, 38, 40, 51, 61, 69, 84, 89, 98, 62, 58,
68, 85, 98, 118, 129, 127, 86, 80, 85, 101, 117, 136, 157, 165,
+ /* Size 8x4 */
+ 32, 38, 62, 86, 32, 40, 58, 80, 34, 51, 68, 85, 44, 61, 85, 101, 54, 69,
+ 98, 117, 72, 84, 118, 136, 82, 89, 129, 157, 92, 98, 127, 165,
/* Size 8x16 */
- 32, 32, 36, 44, 58, 79, 88, 93, 31, 32, 35, 41, 54, 73, 81, 88, 32, 33,
- 36, 42, 53, 71, 78, 84, 32, 34, 38, 42, 52, 69, 76, 82, 34, 36, 44, 50,
- 59, 75, 81, 84, 39, 39, 50, 58, 68, 84, 88, 90, 44, 42, 53, 63, 74, 90,
- 97, 97, 49, 46, 57, 67, 81, 97, 104, 105, 57, 53, 63, 74, 90, 108, 111,
- 113, 65, 59, 68, 79, 97, 118, 123, 122, 71, 64, 73, 84, 102, 125, 135,
- 131, 81, 72, 80, 91, 110, 135, 145, 141, 87, 77, 85, 96, 114, 140, 148,
- 151, 92, 83, 88, 102, 117, 133, 153, 163, 98, 88, 89, 103, 121, 141,
- 160, 169, 103, 94, 92, 103, 119, 137, 158, 175,
- /* Size 16x8 */
32, 31, 32, 32, 34, 39, 44, 49, 57, 65, 71, 81, 87, 92, 98, 103, 32, 32,
33, 34, 36, 39, 42, 46, 53, 59, 64, 72, 77, 83, 88, 94, 36, 35, 36, 38,
44, 50, 53, 57, 63, 68, 73, 80, 85, 88, 89, 92, 44, 41, 42, 42, 50, 58,
@@ -1307,39 +1298,16 @@
97, 108, 118, 125, 135, 140, 133, 141, 137, 88, 81, 78, 76, 81, 88, 97,
104, 111, 123, 135, 145, 148, 153, 160, 158, 93, 88, 84, 82, 84, 90, 97,
105, 113, 122, 131, 141, 151, 163, 169, 175,
+ /* Size 16x8 */
+ 32, 32, 36, 44, 58, 79, 88, 93, 31, 32, 35, 41, 54, 73, 81, 88, 32, 33,
+ 36, 42, 53, 71, 78, 84, 32, 34, 38, 42, 52, 69, 76, 82, 34, 36, 44, 50,
+ 59, 75, 81, 84, 39, 39, 50, 58, 68, 84, 88, 90, 44, 42, 53, 63, 74, 90,
+ 97, 97, 49, 46, 57, 67, 81, 97, 104, 105, 57, 53, 63, 74, 90, 108, 111,
+ 113, 65, 59, 68, 79, 97, 118, 123, 122, 71, 64, 73, 84, 102, 125, 135,
+ 131, 81, 72, 80, 91, 110, 135, 145, 141, 87, 77, 85, 96, 114, 140, 148,
+ 151, 92, 83, 88, 102, 117, 133, 153, 163, 98, 88, 89, 103, 121, 141,
+ 160, 169, 103, 94, 92, 103, 119, 137, 158, 175,
/* Size 16x32 */
- 32, 31, 32, 32, 36, 39, 44, 53, 58, 65, 79, 81, 88, 90, 93, 96, 31, 32,
- 32, 32, 35, 38, 42, 51, 55, 62, 75, 77, 83, 86, 88, 91, 31, 32, 32, 32,
- 35, 38, 41, 50, 54, 60, 73, 75, 81, 84, 88, 91, 31, 32, 32, 33, 34, 37,
- 41, 49, 53, 59, 72, 74, 79, 82, 84, 87, 32, 32, 33, 34, 36, 39, 42, 50,
- 53, 59, 71, 72, 78, 81, 84, 87, 32, 32, 34, 34, 37, 40, 42, 49, 53, 58,
- 70, 71, 77, 80, 83, 85, 32, 33, 34, 35, 38, 40, 42, 49, 52, 58, 69, 70,
- 76, 78, 82, 86, 34, 34, 35, 37, 42, 45, 48, 54, 57, 63, 73, 75, 79, 79,
- 81, 83, 34, 34, 36, 37, 44, 47, 50, 56, 59, 65, 75, 77, 81, 83, 84, 84,
- 36, 34, 37, 38, 48, 51, 54, 60, 63, 68, 78, 80, 85, 85, 86, 89, 39, 37,
- 39, 40, 50, 54, 58, 65, 68, 73, 84, 85, 88, 89, 90, 89, 40, 38, 40, 41,
- 51, 55, 59, 67, 70, 75, 85, 87, 91, 92, 92, 95, 44, 41, 42, 43, 53, 58,
- 63, 71, 74, 79, 90, 91, 97, 94, 97, 95, 47, 44, 45, 46, 56, 61, 66, 75,
- 79, 85, 95, 97, 99, 101, 98, 102, 49, 46, 46, 47, 57, 62, 67, 77, 81,
- 86, 97, 99, 104, 102, 105, 102, 53, 49, 50, 50, 60, 65, 71, 82, 86, 92,
- 103, 105, 109, 108, 106, 110, 57, 53, 53, 53, 63, 68, 74, 86, 90, 97,
- 108, 110, 111, 112, 113, 110, 59, 54, 54, 54, 64, 69, 75, 87, 91, 98,
- 111, 112, 119, 117, 115, 118, 65, 60, 59, 58, 68, 73, 79, 92, 97, 105,
- 118, 119, 123, 123, 122, 119, 69, 63, 62, 62, 71, 76, 83, 96, 100, 109,
- 122, 124, 127, 125, 125, 128, 71, 65, 64, 63, 73, 78, 84, 97, 102, 111,
- 125, 127, 135, 134, 131, 129, 79, 72, 71, 70, 79, 84, 90, 104, 109, 118,
- 133, 135, 137, 136, 136, 137, 81, 74, 72, 71, 80, 85, 91, 105, 110, 120,
- 135, 137, 145, 143, 141, 138, 82, 75, 73, 72, 81, 86, 92, 106, 111, 121,
- 136, 139, 147, 148, 147, 149, 87, 79, 77, 76, 85, 90, 96, 110, 114, 125,
- 140, 143, 148, 154, 151, 149, 90, 82, 80, 78, 87, 89, 99, 108, 113, 129,
- 135, 146, 153, 157, 160, 159, 92, 84, 83, 81, 88, 90, 102, 106, 117,
- 128, 133, 150, 153, 158, 163, 160, 95, 87, 85, 83, 88, 92, 103, 105,
- 120, 125, 137, 148, 155, 164, 168, 173, 98, 89, 88, 85, 89, 95, 103,
- 108, 121, 124, 141, 144, 160, 164, 169, 174, 100, 92, 91, 88, 90, 98,
- 103, 111, 120, 127, 139, 146, 161, 165, 175, 179, 103, 94, 94, 90, 92,
- 101, 103, 114, 119, 131, 137, 150, 158, 170, 175, 180, 106, 97, 97, 93,
- 93, 104, 104, 118, 118, 135, 135, 154, 155, 175, 176, 187,
- /* Size 32x16 */
32, 31, 31, 31, 32, 32, 32, 34, 34, 36, 39, 40, 44, 47, 49, 53, 57, 59,
65, 69, 71, 79, 81, 82, 87, 90, 92, 95, 98, 100, 103, 106, 31, 32, 32,
32, 32, 32, 33, 34, 34, 34, 37, 38, 41, 44, 46, 49, 53, 54, 60, 63, 65,
@@ -1371,34 +1339,49 @@
136, 141, 147, 151, 160, 163, 168, 169, 175, 175, 176, 96, 91, 91, 87,
87, 85, 86, 83, 84, 89, 89, 95, 95, 102, 102, 110, 110, 118, 119, 128,
129, 137, 138, 149, 149, 159, 160, 173, 174, 179, 180, 187,
+ /* Size 32x16 */
+ 32, 31, 32, 32, 36, 39, 44, 53, 58, 65, 79, 81, 88, 90, 93, 96, 31, 32,
+ 32, 32, 35, 38, 42, 51, 55, 62, 75, 77, 83, 86, 88, 91, 31, 32, 32, 32,
+ 35, 38, 41, 50, 54, 60, 73, 75, 81, 84, 88, 91, 31, 32, 32, 33, 34, 37,
+ 41, 49, 53, 59, 72, 74, 79, 82, 84, 87, 32, 32, 33, 34, 36, 39, 42, 50,
+ 53, 59, 71, 72, 78, 81, 84, 87, 32, 32, 34, 34, 37, 40, 42, 49, 53, 58,
+ 70, 71, 77, 80, 83, 85, 32, 33, 34, 35, 38, 40, 42, 49, 52, 58, 69, 70,
+ 76, 78, 82, 86, 34, 34, 35, 37, 42, 45, 48, 54, 57, 63, 73, 75, 79, 79,
+ 81, 83, 34, 34, 36, 37, 44, 47, 50, 56, 59, 65, 75, 77, 81, 83, 84, 84,
+ 36, 34, 37, 38, 48, 51, 54, 60, 63, 68, 78, 80, 85, 85, 86, 89, 39, 37,
+ 39, 40, 50, 54, 58, 65, 68, 73, 84, 85, 88, 89, 90, 89, 40, 38, 40, 41,
+ 51, 55, 59, 67, 70, 75, 85, 87, 91, 92, 92, 95, 44, 41, 42, 43, 53, 58,
+ 63, 71, 74, 79, 90, 91, 97, 94, 97, 95, 47, 44, 45, 46, 56, 61, 66, 75,
+ 79, 85, 95, 97, 99, 101, 98, 102, 49, 46, 46, 47, 57, 62, 67, 77, 81,
+ 86, 97, 99, 104, 102, 105, 102, 53, 49, 50, 50, 60, 65, 71, 82, 86, 92,
+ 103, 105, 109, 108, 106, 110, 57, 53, 53, 53, 63, 68, 74, 86, 90, 97,
+ 108, 110, 111, 112, 113, 110, 59, 54, 54, 54, 64, 69, 75, 87, 91, 98,
+ 111, 112, 119, 117, 115, 118, 65, 60, 59, 58, 68, 73, 79, 92, 97, 105,
+ 118, 119, 123, 123, 122, 119, 69, 63, 62, 62, 71, 76, 83, 96, 100, 109,
+ 122, 124, 127, 125, 125, 128, 71, 65, 64, 63, 73, 78, 84, 97, 102, 111,
+ 125, 127, 135, 134, 131, 129, 79, 72, 71, 70, 79, 84, 90, 104, 109, 118,
+ 133, 135, 137, 136, 136, 137, 81, 74, 72, 71, 80, 85, 91, 105, 110, 120,
+ 135, 137, 145, 143, 141, 138, 82, 75, 73, 72, 81, 86, 92, 106, 111, 121,
+ 136, 139, 147, 148, 147, 149, 87, 79, 77, 76, 85, 90, 96, 110, 114, 125,
+ 140, 143, 148, 154, 151, 149, 90, 82, 80, 78, 87, 89, 99, 108, 113, 129,
+ 135, 146, 153, 157, 160, 159, 92, 84, 83, 81, 88, 90, 102, 106, 117,
+ 128, 133, 150, 153, 158, 163, 160, 95, 87, 85, 83, 88, 92, 103, 105,
+ 120, 125, 137, 148, 155, 164, 168, 173, 98, 89, 88, 85, 89, 95, 103,
+ 108, 121, 124, 141, 144, 160, 164, 169, 174, 100, 92, 91, 88, 90, 98,
+ 103, 111, 120, 127, 139, 146, 161, 165, 175, 179, 103, 94, 94, 90, 92,
+ 101, 103, 114, 119, 131, 137, 150, 158, 170, 175, 180, 106, 97, 97, 93,
+ 93, 104, 104, 118, 118, 135, 135, 154, 155, 175, 176, 187,
/* Size 4x16 */
- 31, 39, 65, 90, 32, 38, 60, 84, 32, 39, 59, 81, 33, 40, 58, 78, 34, 47,
- 65, 83, 37, 54, 73, 89, 41, 58, 79, 94, 46, 62, 86, 102, 53, 68, 97,
- 112, 60, 73, 105, 123, 65, 78, 111, 134, 74, 85, 120, 143, 79, 90, 125,
- 154, 84, 90, 128, 158, 89, 95, 124, 164, 94, 101, 131, 170,
- /* Size 16x4 */
31, 32, 32, 33, 34, 37, 41, 46, 53, 60, 65, 74, 79, 84, 89, 94, 39, 38,
39, 40, 47, 54, 58, 62, 68, 73, 78, 85, 90, 90, 95, 101, 65, 60, 59, 58,
65, 73, 79, 86, 97, 105, 111, 120, 125, 128, 124, 131, 90, 84, 81, 78,
83, 89, 94, 102, 112, 123, 134, 143, 154, 158, 164, 170,
+ /* Size 16x4 */
+ 31, 39, 65, 90, 32, 38, 60, 84, 32, 39, 59, 81, 33, 40, 58, 78, 34, 47,
+ 65, 83, 37, 54, 73, 89, 41, 58, 79, 94, 46, 62, 86, 102, 53, 68, 97,
+ 112, 60, 73, 105, 123, 65, 78, 111, 134, 74, 85, 120, 143, 79, 90, 125,
+ 154, 84, 90, 128, 158, 89, 95, 124, 164, 94, 101, 131, 170,
/* Size 8x32 */
- 32, 32, 36, 44, 58, 79, 88, 93, 31, 32, 35, 42, 55, 75, 83, 88, 31, 32,
- 35, 41, 54, 73, 81, 88, 31, 32, 34, 41, 53, 72, 79, 84, 32, 33, 36, 42,
- 53, 71, 78, 84, 32, 34, 37, 42, 53, 70, 77, 83, 32, 34, 38, 42, 52, 69,
- 76, 82, 34, 35, 42, 48, 57, 73, 79, 81, 34, 36, 44, 50, 59, 75, 81, 84,
- 36, 37, 48, 54, 63, 78, 85, 86, 39, 39, 50, 58, 68, 84, 88, 90, 40, 40,
- 51, 59, 70, 85, 91, 92, 44, 42, 53, 63, 74, 90, 97, 97, 47, 45, 56, 66,
- 79, 95, 99, 98, 49, 46, 57, 67, 81, 97, 104, 105, 53, 50, 60, 71, 86,
- 103, 109, 106, 57, 53, 63, 74, 90, 108, 111, 113, 59, 54, 64, 75, 91,
- 111, 119, 115, 65, 59, 68, 79, 97, 118, 123, 122, 69, 62, 71, 83, 100,
- 122, 127, 125, 71, 64, 73, 84, 102, 125, 135, 131, 79, 71, 79, 90, 109,
- 133, 137, 136, 81, 72, 80, 91, 110, 135, 145, 141, 82, 73, 81, 92, 111,
- 136, 147, 147, 87, 77, 85, 96, 114, 140, 148, 151, 90, 80, 87, 99, 113,
- 135, 153, 160, 92, 83, 88, 102, 117, 133, 153, 163, 95, 85, 88, 103,
- 120, 137, 155, 168, 98, 88, 89, 103, 121, 141, 160, 169, 100, 91, 90,
- 103, 120, 139, 161, 175, 103, 94, 92, 103, 119, 137, 158, 175, 106, 97,
- 93, 104, 118, 135, 155, 176,
- /* Size 32x8 */
32, 31, 31, 31, 32, 32, 32, 34, 34, 36, 39, 40, 44, 47, 49, 53, 57, 59,
65, 69, 71, 79, 81, 82, 87, 90, 92, 95, 98, 100, 103, 106, 32, 32, 32,
32, 33, 34, 34, 35, 36, 37, 39, 40, 42, 45, 46, 50, 53, 54, 59, 62, 64,
@@ -1414,7 +1397,24 @@
99, 104, 109, 111, 119, 123, 127, 135, 137, 145, 147, 148, 153, 153,
155, 160, 161, 158, 155, 93, 88, 88, 84, 84, 83, 82, 81, 84, 86, 90, 92,
97, 98, 105, 106, 113, 115, 122, 125, 131, 136, 141, 147, 151, 160, 163,
- 168, 169, 175, 175, 176 },
+ 168, 169, 175, 175, 176,
+ /* Size 32x8 */
+ 32, 32, 36, 44, 58, 79, 88, 93, 31, 32, 35, 42, 55, 75, 83, 88, 31, 32,
+ 35, 41, 54, 73, 81, 88, 31, 32, 34, 41, 53, 72, 79, 84, 32, 33, 36, 42,
+ 53, 71, 78, 84, 32, 34, 37, 42, 53, 70, 77, 83, 32, 34, 38, 42, 52, 69,
+ 76, 82, 34, 35, 42, 48, 57, 73, 79, 81, 34, 36, 44, 50, 59, 75, 81, 84,
+ 36, 37, 48, 54, 63, 78, 85, 86, 39, 39, 50, 58, 68, 84, 88, 90, 40, 40,
+ 51, 59, 70, 85, 91, 92, 44, 42, 53, 63, 74, 90, 97, 97, 47, 45, 56, 66,
+ 79, 95, 99, 98, 49, 46, 57, 67, 81, 97, 104, 105, 53, 50, 60, 71, 86,
+ 103, 109, 106, 57, 53, 63, 74, 90, 108, 111, 113, 59, 54, 64, 75, 91,
+ 111, 119, 115, 65, 59, 68, 79, 97, 118, 123, 122, 69, 62, 71, 83, 100,
+ 122, 127, 125, 71, 64, 73, 84, 102, 125, 135, 131, 79, 71, 79, 90, 109,
+ 133, 137, 136, 81, 72, 80, 91, 110, 135, 145, 141, 82, 73, 81, 92, 111,
+ 136, 147, 147, 87, 77, 85, 96, 114, 140, 148, 151, 90, 80, 87, 99, 113,
+ 135, 153, 160, 92, 83, 88, 102, 117, 133, 153, 163, 95, 85, 88, 103,
+ 120, 137, 155, 168, 98, 88, 89, 103, 121, 141, 160, 169, 100, 91, 90,
+ 103, 120, 139, 161, 175, 103, 94, 92, 103, 119, 137, 158, 175, 106, 97,
+ 93, 104, 118, 135, 155, 176 },
{ /* Chroma */
/* Size 4x4 */
32, 45, 53, 63, 45, 55, 62, 67, 53, 62, 80, 84, 63, 67, 84, 101,
@@ -1499,21 +1499,12 @@
62, 66, 66, 69, 69, 72, 73, 76, 77, 81, 81, 85, 85, 89, 90, 94, 94, 99,
99, 104, 104, 106, 106, 108,
/* Size 4x8 */
- 31, 47, 54, 64, 38, 46, 50, 60, 46, 53, 57, 62, 46, 56, 66, 71, 50, 59,
- 74, 79, 57, 64, 82, 88, 61, 65, 85, 97, 65, 67, 82, 99,
- /* Size 8x4 */
31, 38, 46, 46, 50, 57, 61, 65, 47, 46, 53, 56, 59, 64, 65, 67, 54, 50,
57, 66, 74, 82, 85, 82, 64, 60, 62, 71, 79, 88, 97, 99,
+ /* Size 8x4 */
+ 31, 47, 54, 64, 38, 46, 50, 60, 46, 53, 57, 62, 46, 56, 66, 71, 50, 59,
+ 74, 79, 57, 64, 82, 88, 61, 65, 85, 97, 65, 67, 82, 99,
/* Size 8x16 */
- 32, 34, 48, 49, 54, 63, 67, 69, 31, 36, 46, 46, 50, 58, 62, 65, 33, 40,
- 47, 46, 49, 56, 59, 62, 37, 44, 47, 45, 48, 54, 57, 60, 44, 46, 51, 51,
- 53, 59, 60, 61, 48, 46, 53, 56, 58, 64, 64, 64, 49, 45, 53, 58, 62, 67,
- 70, 68, 51, 47, 54, 60, 65, 71, 73, 72, 54, 49, 55, 62, 70, 77, 77, 76,
- 57, 51, 56, 64, 73, 82, 83, 81, 60, 53, 58, 65, 75, 85, 89, 85, 64, 57,
- 61, 68, 78, 89, 93, 89, 66, 59, 63, 69, 79, 91, 94, 93, 68, 61, 63, 71,
- 79, 87, 96, 98, 70, 63, 63, 70, 80, 89, 97, 100, 72, 65, 63, 69, 77, 86,
- 95, 102,
- /* Size 16x8 */
32, 31, 33, 37, 44, 48, 49, 51, 54, 57, 60, 64, 66, 68, 70, 72, 34, 36,
40, 44, 46, 46, 45, 47, 49, 51, 53, 57, 59, 61, 63, 65, 48, 46, 47, 47,
51, 53, 53, 54, 55, 56, 58, 61, 63, 63, 63, 63, 49, 46, 46, 45, 51, 56,
@@ -1522,37 +1513,16 @@
85, 89, 91, 87, 89, 86, 67, 62, 59, 57, 60, 64, 70, 73, 77, 83, 89, 93,
94, 96, 97, 95, 69, 65, 62, 60, 61, 64, 68, 72, 76, 81, 85, 89, 93, 98,
100, 102,
+ /* Size 16x8 */
+ 32, 34, 48, 49, 54, 63, 67, 69, 31, 36, 46, 46, 50, 58, 62, 65, 33, 40,
+ 47, 46, 49, 56, 59, 62, 37, 44, 47, 45, 48, 54, 57, 60, 44, 46, 51, 51,
+ 53, 59, 60, 61, 48, 46, 53, 56, 58, 64, 64, 64, 49, 45, 53, 58, 62, 67,
+ 70, 68, 51, 47, 54, 60, 65, 71, 73, 72, 54, 49, 55, 62, 70, 77, 77, 76,
+ 57, 51, 56, 64, 73, 82, 83, 81, 60, 53, 58, 65, 75, 85, 89, 85, 64, 57,
+ 61, 68, 78, 89, 93, 89, 66, 59, 63, 69, 79, 91, 94, 93, 68, 61, 63, 71,
+ 79, 87, 96, 98, 70, 63, 63, 70, 80, 89, 97, 100, 72, 65, 63, 69, 77, 86,
+ 95, 102,
/* Size 16x32 */
- 32, 31, 34, 37, 48, 48, 49, 52, 54, 57, 63, 64, 67, 68, 69, 69, 31, 31,
- 35, 38, 47, 47, 47, 50, 51, 54, 60, 61, 63, 64, 65, 66, 31, 32, 36, 39,
- 46, 46, 46, 48, 50, 53, 58, 59, 62, 63, 65, 66, 30, 32, 36, 40, 46, 45,
- 45, 48, 49, 52, 57, 58, 60, 61, 62, 63, 33, 36, 40, 43, 47, 46, 46, 47,
- 49, 51, 56, 57, 59, 60, 62, 63, 35, 38, 42, 45, 47, 46, 45, 47, 48, 50,
- 55, 56, 58, 60, 61, 61, 37, 40, 44, 47, 47, 46, 45, 47, 48, 50, 54, 55,
- 57, 58, 60, 61, 42, 43, 45, 47, 50, 50, 49, 50, 51, 53, 57, 58, 59, 58,
- 59, 59, 44, 44, 46, 47, 51, 51, 51, 52, 53, 54, 59, 59, 60, 61, 61, 60,
- 49, 46, 47, 48, 53, 53, 53, 54, 55, 57, 60, 61, 63, 62, 62, 63, 48, 46,
- 46, 47, 53, 54, 56, 57, 58, 60, 64, 64, 64, 64, 64, 63, 48, 45, 46, 46,
- 53, 55, 56, 58, 59, 61, 65, 65, 66, 66, 65, 66, 49, 45, 45, 46, 53, 56,
- 58, 61, 62, 64, 67, 68, 70, 67, 68, 66, 50, 46, 46, 46, 54, 56, 59, 63,
- 65, 66, 70, 71, 70, 71, 68, 70, 51, 47, 47, 47, 54, 57, 60, 64, 65, 68,
- 71, 72, 73, 71, 72, 70, 52, 48, 47, 47, 54, 57, 61, 66, 68, 71, 75, 75,
- 76, 75, 73, 73, 54, 49, 49, 48, 55, 58, 62, 68, 70, 73, 77, 78, 77, 77,
- 76, 74, 54, 50, 49, 49, 55, 59, 62, 68, 70, 74, 78, 79, 81, 79, 77, 78,
- 57, 52, 51, 50, 56, 60, 64, 70, 73, 76, 82, 82, 83, 82, 81, 78, 59, 54,
- 52, 52, 58, 61, 65, 72, 74, 78, 84, 85, 85, 83, 82, 82, 60, 54, 53, 52,
- 58, 62, 65, 72, 75, 79, 85, 86, 89, 87, 85, 82, 63, 57, 56, 55, 60, 64,
- 67, 75, 77, 82, 89, 90, 90, 88, 87, 86, 64, 58, 57, 55, 61, 64, 68, 75,
- 78, 82, 89, 90, 93, 91, 89, 87, 64, 59, 57, 56, 61, 65, 68, 75, 78, 83,
- 90, 91, 94, 93, 92, 91, 66, 60, 59, 57, 63, 66, 69, 77, 79, 84, 91, 93,
- 94, 95, 93, 91, 67, 61, 60, 58, 63, 65, 70, 75, 78, 85, 88, 93, 96, 97,
- 97, 95, 68, 62, 61, 59, 63, 64, 71, 74, 79, 84, 87, 94, 96, 97, 98, 96,
- 69, 63, 62, 60, 63, 65, 71, 72, 80, 82, 88, 93, 96, 99, 100, 101, 70,
- 64, 63, 60, 63, 66, 70, 73, 80, 81, 89, 90, 97, 99, 100, 101, 71, 65,
- 64, 61, 63, 67, 70, 74, 78, 82, 88, 90, 97, 99, 102, 103, 72, 65, 65,
- 62, 63, 68, 69, 75, 77, 83, 86, 92, 95, 100, 102, 103, 73, 66, 66, 63,
- 63, 69, 69, 76, 76, 84, 84, 93, 93, 101, 101, 105,
- /* Size 32x16 */
32, 31, 31, 30, 33, 35, 37, 42, 44, 49, 48, 48, 49, 50, 51, 52, 54, 54,
57, 59, 60, 63, 64, 64, 66, 67, 68, 69, 70, 71, 72, 73, 31, 31, 32, 32,
36, 38, 40, 43, 44, 46, 46, 45, 45, 46, 47, 48, 49, 50, 52, 54, 54, 57,
@@ -1582,33 +1552,47 @@
82, 85, 87, 89, 92, 93, 97, 98, 100, 100, 102, 102, 101, 69, 66, 66, 63,
63, 61, 61, 59, 60, 63, 63, 66, 66, 70, 70, 73, 74, 78, 78, 82, 82, 86,
87, 91, 91, 95, 96, 101, 101, 103, 103, 105,
+ /* Size 32x16 */
+ 32, 31, 34, 37, 48, 48, 49, 52, 54, 57, 63, 64, 67, 68, 69, 69, 31, 31,
+ 35, 38, 47, 47, 47, 50, 51, 54, 60, 61, 63, 64, 65, 66, 31, 32, 36, 39,
+ 46, 46, 46, 48, 50, 53, 58, 59, 62, 63, 65, 66, 30, 32, 36, 40, 46, 45,
+ 45, 48, 49, 52, 57, 58, 60, 61, 62, 63, 33, 36, 40, 43, 47, 46, 46, 47,
+ 49, 51, 56, 57, 59, 60, 62, 63, 35, 38, 42, 45, 47, 46, 45, 47, 48, 50,
+ 55, 56, 58, 60, 61, 61, 37, 40, 44, 47, 47, 46, 45, 47, 48, 50, 54, 55,
+ 57, 58, 60, 61, 42, 43, 45, 47, 50, 50, 49, 50, 51, 53, 57, 58, 59, 58,
+ 59, 59, 44, 44, 46, 47, 51, 51, 51, 52, 53, 54, 59, 59, 60, 61, 61, 60,
+ 49, 46, 47, 48, 53, 53, 53, 54, 55, 57, 60, 61, 63, 62, 62, 63, 48, 46,
+ 46, 47, 53, 54, 56, 57, 58, 60, 64, 64, 64, 64, 64, 63, 48, 45, 46, 46,
+ 53, 55, 56, 58, 59, 61, 65, 65, 66, 66, 65, 66, 49, 45, 45, 46, 53, 56,
+ 58, 61, 62, 64, 67, 68, 70, 67, 68, 66, 50, 46, 46, 46, 54, 56, 59, 63,
+ 65, 66, 70, 71, 70, 71, 68, 70, 51, 47, 47, 47, 54, 57, 60, 64, 65, 68,
+ 71, 72, 73, 71, 72, 70, 52, 48, 47, 47, 54, 57, 61, 66, 68, 71, 75, 75,
+ 76, 75, 73, 73, 54, 49, 49, 48, 55, 58, 62, 68, 70, 73, 77, 78, 77, 77,
+ 76, 74, 54, 50, 49, 49, 55, 59, 62, 68, 70, 74, 78, 79, 81, 79, 77, 78,
+ 57, 52, 51, 50, 56, 60, 64, 70, 73, 76, 82, 82, 83, 82, 81, 78, 59, 54,
+ 52, 52, 58, 61, 65, 72, 74, 78, 84, 85, 85, 83, 82, 82, 60, 54, 53, 52,
+ 58, 62, 65, 72, 75, 79, 85, 86, 89, 87, 85, 82, 63, 57, 56, 55, 60, 64,
+ 67, 75, 77, 82, 89, 90, 90, 88, 87, 86, 64, 58, 57, 55, 61, 64, 68, 75,
+ 78, 82, 89, 90, 93, 91, 89, 87, 64, 59, 57, 56, 61, 65, 68, 75, 78, 83,
+ 90, 91, 94, 93, 92, 91, 66, 60, 59, 57, 63, 66, 69, 77, 79, 84, 91, 93,
+ 94, 95, 93, 91, 67, 61, 60, 58, 63, 65, 70, 75, 78, 85, 88, 93, 96, 97,
+ 97, 95, 68, 62, 61, 59, 63, 64, 71, 74, 79, 84, 87, 94, 96, 97, 98, 96,
+ 69, 63, 62, 60, 63, 65, 71, 72, 80, 82, 88, 93, 96, 99, 100, 101, 70,
+ 64, 63, 60, 63, 66, 70, 73, 80, 81, 89, 90, 97, 99, 100, 101, 71, 65,
+ 64, 61, 63, 67, 70, 74, 78, 82, 88, 90, 97, 99, 102, 103, 72, 65, 65,
+ 62, 63, 68, 69, 75, 77, 83, 86, 92, 95, 100, 102, 103, 73, 66, 66, 63,
+ 63, 69, 69, 76, 76, 84, 84, 93, 93, 101, 101, 105,
/* Size 4x16 */
- 31, 48, 57, 68, 32, 46, 53, 63, 36, 46, 51, 60, 40, 46, 50, 58, 44, 51,
- 54, 61, 46, 54, 60, 64, 45, 56, 64, 67, 47, 57, 68, 71, 49, 58, 73, 77,
- 52, 60, 76, 82, 54, 62, 79, 87, 58, 64, 82, 91, 60, 66, 84, 95, 62, 64,
- 84, 97, 64, 66, 81, 99, 65, 68, 83, 100,
- /* Size 16x4 */
31, 32, 36, 40, 44, 46, 45, 47, 49, 52, 54, 58, 60, 62, 64, 65, 48, 46,
46, 46, 51, 54, 56, 57, 58, 60, 62, 64, 66, 64, 66, 68, 57, 53, 51, 50,
54, 60, 64, 68, 73, 76, 79, 82, 84, 84, 81, 83, 68, 63, 60, 58, 61, 64,
67, 71, 77, 82, 87, 91, 95, 97, 99, 100,
+ /* Size 16x4 */
+ 31, 48, 57, 68, 32, 46, 53, 63, 36, 46, 51, 60, 40, 46, 50, 58, 44, 51,
+ 54, 61, 46, 54, 60, 64, 45, 56, 64, 67, 47, 57, 68, 71, 49, 58, 73, 77,
+ 52, 60, 76, 82, 54, 62, 79, 87, 58, 64, 82, 91, 60, 66, 84, 95, 62, 64,
+ 84, 97, 64, 66, 81, 99, 65, 68, 83, 100,
/* Size 8x32 */
- 32, 34, 48, 49, 54, 63, 67, 69, 31, 35, 47, 47, 51, 60, 63, 65, 31, 36,
- 46, 46, 50, 58, 62, 65, 30, 36, 46, 45, 49, 57, 60, 62, 33, 40, 47, 46,
- 49, 56, 59, 62, 35, 42, 47, 45, 48, 55, 58, 61, 37, 44, 47, 45, 48, 54,
- 57, 60, 42, 45, 50, 49, 51, 57, 59, 59, 44, 46, 51, 51, 53, 59, 60, 61,
- 49, 47, 53, 53, 55, 60, 63, 62, 48, 46, 53, 56, 58, 64, 64, 64, 48, 46,
- 53, 56, 59, 65, 66, 65, 49, 45, 53, 58, 62, 67, 70, 68, 50, 46, 54, 59,
- 65, 70, 70, 68, 51, 47, 54, 60, 65, 71, 73, 72, 52, 47, 54, 61, 68, 75,
- 76, 73, 54, 49, 55, 62, 70, 77, 77, 76, 54, 49, 55, 62, 70, 78, 81, 77,
- 57, 51, 56, 64, 73, 82, 83, 81, 59, 52, 58, 65, 74, 84, 85, 82, 60, 53,
- 58, 65, 75, 85, 89, 85, 63, 56, 60, 67, 77, 89, 90, 87, 64, 57, 61, 68,
- 78, 89, 93, 89, 64, 57, 61, 68, 78, 90, 94, 92, 66, 59, 63, 69, 79, 91,
- 94, 93, 67, 60, 63, 70, 78, 88, 96, 97, 68, 61, 63, 71, 79, 87, 96, 98,
- 69, 62, 63, 71, 80, 88, 96, 100, 70, 63, 63, 70, 80, 89, 97, 100, 71,
- 64, 63, 70, 78, 88, 97, 102, 72, 65, 63, 69, 77, 86, 95, 102, 73, 66,
- 63, 69, 76, 84, 93, 101,
- /* Size 32x8 */
32, 31, 31, 30, 33, 35, 37, 42, 44, 49, 48, 48, 49, 50, 51, 52, 54, 54,
57, 59, 60, 63, 64, 64, 66, 67, 68, 69, 70, 71, 72, 73, 34, 35, 36, 36,
40, 42, 44, 45, 46, 47, 46, 46, 45, 46, 47, 47, 49, 49, 51, 52, 53, 56,
@@ -1623,7 +1607,23 @@
57, 59, 60, 63, 64, 66, 70, 70, 73, 76, 77, 81, 83, 85, 89, 90, 93, 94,
94, 96, 96, 96, 97, 97, 95, 93, 69, 65, 65, 62, 62, 61, 60, 59, 61, 62,
64, 65, 68, 68, 72, 73, 76, 77, 81, 82, 85, 87, 89, 92, 93, 97, 98, 100,
- 100, 102, 102, 101 },
+ 100, 102, 102, 101,
+ /* Size 32x8 */
+ 32, 34, 48, 49, 54, 63, 67, 69, 31, 35, 47, 47, 51, 60, 63, 65, 31, 36,
+ 46, 46, 50, 58, 62, 65, 30, 36, 46, 45, 49, 57, 60, 62, 33, 40, 47, 46,
+ 49, 56, 59, 62, 35, 42, 47, 45, 48, 55, 58, 61, 37, 44, 47, 45, 48, 54,
+ 57, 60, 42, 45, 50, 49, 51, 57, 59, 59, 44, 46, 51, 51, 53, 59, 60, 61,
+ 49, 47, 53, 53, 55, 60, 63, 62, 48, 46, 53, 56, 58, 64, 64, 64, 48, 46,
+ 53, 56, 59, 65, 66, 65, 49, 45, 53, 58, 62, 67, 70, 68, 50, 46, 54, 59,
+ 65, 70, 70, 68, 51, 47, 54, 60, 65, 71, 73, 72, 52, 47, 54, 61, 68, 75,
+ 76, 73, 54, 49, 55, 62, 70, 77, 77, 76, 54, 49, 55, 62, 70, 78, 81, 77,
+ 57, 51, 56, 64, 73, 82, 83, 81, 59, 52, 58, 65, 74, 84, 85, 82, 60, 53,
+ 58, 65, 75, 85, 89, 85, 63, 56, 60, 67, 77, 89, 90, 87, 64, 57, 61, 68,
+ 78, 89, 93, 89, 64, 57, 61, 68, 78, 90, 94, 92, 66, 59, 63, 69, 79, 91,
+ 94, 93, 67, 60, 63, 70, 78, 88, 96, 97, 68, 61, 63, 71, 79, 87, 96, 98,
+ 69, 62, 63, 71, 80, 88, 96, 100, 70, 63, 63, 70, 80, 89, 97, 100, 71,
+ 64, 63, 70, 78, 88, 97, 102, 72, 65, 63, 69, 77, 86, 95, 102, 73, 66,
+ 63, 69, 76, 84, 93, 101 },
},
{
{ /* Luma */
@@ -1714,21 +1714,12 @@
89, 89, 86, 86, 92, 92, 97, 98, 104, 104, 111, 111, 119, 119, 128, 129,
137, 137, 147, 148, 157, 158, 169, 170, 174, 175, 181,
/* Size 4x8 */
- 32, 35, 59, 83, 32, 36, 57, 78, 34, 47, 65, 82, 41, 53, 78, 97, 51, 61,
- 92, 111, 65, 73, 108, 129, 75, 81, 117, 148, 86, 92, 119, 154,
- /* Size 8x4 */
32, 32, 34, 41, 51, 65, 75, 86, 35, 36, 47, 53, 61, 73, 81, 92, 59, 57,
65, 78, 92, 108, 117, 119, 83, 78, 82, 97, 111, 129, 148, 154,
+ /* Size 8x4 */
+ 32, 35, 59, 83, 32, 36, 57, 78, 34, 47, 65, 82, 41, 53, 78, 97, 51, 61,
+ 92, 111, 65, 73, 108, 129, 75, 81, 117, 148, 86, 92, 119, 154,
/* Size 8x16 */
- 32, 31, 35, 44, 53, 65, 82, 90, 31, 32, 34, 41, 50, 61, 76, 85, 31, 33,
- 35, 42, 49, 59, 73, 81, 32, 34, 37, 42, 49, 58, 71, 79, 34, 35, 41, 48,
- 54, 63, 76, 81, 36, 36, 46, 54, 60, 68, 80, 87, 41, 40, 49, 60, 67, 76,
- 88, 93, 47, 44, 53, 66, 75, 84, 97, 101, 53, 50, 57, 71, 82, 92, 106,
- 108, 58, 54, 61, 75, 87, 98, 112, 116, 65, 59, 66, 79, 92, 105, 120,
- 124, 74, 67, 73, 86, 100, 113, 131, 134, 82, 73, 79, 92, 105, 120, 139,
- 142, 87, 78, 83, 96, 110, 125, 144, 153, 92, 83, 84, 97, 114, 132, 150,
- 157, 97, 88, 86, 97, 111, 128, 147, 163,
- /* Size 16x8 */
32, 31, 31, 32, 34, 36, 41, 47, 53, 58, 65, 74, 82, 87, 92, 97, 31, 32,
33, 34, 35, 36, 40, 44, 50, 54, 59, 67, 73, 78, 83, 88, 35, 34, 35, 37,
41, 46, 49, 53, 57, 61, 66, 73, 79, 83, 84, 86, 44, 41, 42, 42, 48, 54,
@@ -1737,39 +1728,16 @@
98, 105, 113, 120, 125, 132, 128, 82, 76, 73, 71, 76, 80, 88, 97, 106,
112, 120, 131, 139, 144, 150, 147, 90, 85, 81, 79, 81, 87, 93, 101, 108,
116, 124, 134, 142, 153, 157, 163,
+ /* Size 16x8 */
+ 32, 31, 35, 44, 53, 65, 82, 90, 31, 32, 34, 41, 50, 61, 76, 85, 31, 33,
+ 35, 42, 49, 59, 73, 81, 32, 34, 37, 42, 49, 58, 71, 79, 34, 35, 41, 48,
+ 54, 63, 76, 81, 36, 36, 46, 54, 60, 68, 80, 87, 41, 40, 49, 60, 67, 76,
+ 88, 93, 47, 44, 53, 66, 75, 84, 97, 101, 53, 50, 57, 71, 82, 92, 106,
+ 108, 58, 54, 61, 75, 87, 98, 112, 116, 65, 59, 66, 79, 92, 105, 120,
+ 124, 74, 67, 73, 86, 100, 113, 131, 134, 82, 73, 79, 92, 105, 120, 139,
+ 142, 87, 78, 83, 96, 110, 125, 144, 153, 92, 83, 84, 97, 114, 132, 150,
+ 157, 97, 88, 86, 97, 111, 128, 147, 163,
/* Size 16x32 */
- 32, 31, 31, 32, 35, 36, 44, 47, 53, 62, 65, 79, 82, 88, 90, 93, 31, 32,
- 32, 32, 35, 35, 42, 45, 51, 59, 62, 75, 78, 83, 86, 88, 31, 32, 32, 32,
- 34, 35, 41, 45, 50, 58, 61, 74, 76, 82, 85, 88, 31, 32, 32, 33, 34, 34,
- 41, 44, 49, 57, 59, 72, 74, 79, 82, 84, 31, 32, 33, 34, 35, 36, 42, 44,
- 49, 57, 59, 71, 73, 79, 81, 84, 32, 32, 33, 34, 36, 36, 42, 45, 50, 57,
- 59, 71, 73, 78, 80, 82, 32, 33, 34, 35, 37, 38, 42, 45, 49, 56, 58, 69,
- 71, 76, 79, 83, 32, 33, 34, 36, 39, 40, 44, 47, 51, 58, 60, 71, 73, 76,
- 78, 80, 34, 34, 35, 37, 41, 42, 48, 50, 54, 61, 63, 73, 76, 81, 81, 80,
- 35, 34, 36, 38, 45, 47, 52, 55, 59, 65, 67, 77, 79, 82, 83, 86, 36, 34,
- 36, 38, 46, 48, 54, 56, 60, 66, 68, 78, 80, 85, 87, 86, 39, 37, 39, 40,
- 48, 50, 58, 60, 65, 71, 73, 84, 86, 89, 88, 91, 41, 39, 40, 41, 49, 51,
- 60, 62, 67, 74, 76, 86, 88, 91, 93, 91, 44, 41, 42, 43, 51, 53, 63, 66,
- 71, 78, 79, 90, 92, 97, 94, 97, 47, 44, 44, 45, 53, 56, 66, 69, 75, 82,
- 84, 95, 97, 98, 101, 98, 48, 45, 45, 46, 54, 56, 67, 70, 76, 83, 85, 96,
- 98, 104, 101, 105, 53, 49, 50, 50, 57, 60, 71, 75, 82, 90, 92, 103, 106,
- 107, 108, 105, 55, 51, 51, 51, 59, 61, 72, 77, 84, 92, 94, 106, 108,
- 111, 110, 112, 58, 54, 54, 54, 61, 63, 75, 79, 87, 95, 98, 110, 112,
- 117, 116, 113, 63, 58, 58, 57, 65, 67, 78, 83, 91, 100, 103, 116, 118,
- 119, 119, 121, 65, 60, 59, 58, 66, 68, 79, 84, 92, 102, 105, 118, 120,
- 127, 124, 122, 71, 65, 64, 63, 71, 73, 84, 89, 97, 108, 111, 125, 127,
- 129, 129, 130, 74, 68, 67, 66, 73, 75, 86, 91, 100, 110, 113, 128, 131,
- 135, 134, 130, 79, 72, 71, 70, 77, 79, 90, 95, 104, 115, 118, 133, 136,
- 140, 139, 140, 82, 75, 73, 72, 79, 81, 92, 97, 105, 117, 120, 136, 139,
- 145, 142, 140, 82, 75, 74, 72, 79, 81, 92, 97, 106, 117, 121, 136, 139,
- 148, 150, 149, 87, 79, 78, 76, 83, 85, 96, 100, 110, 120, 125, 141, 144,
- 148, 153, 150, 89, 82, 81, 78, 83, 87, 97, 99, 113, 118, 128, 139, 145,
- 153, 157, 161, 92, 84, 83, 80, 84, 89, 97, 101, 114, 116, 132, 135, 150,
- 153, 157, 162, 94, 86, 85, 82, 85, 92, 97, 104, 112, 119, 130, 136, 151,
- 154, 163, 166, 97, 88, 88, 85, 86, 94, 97, 107, 111, 123, 128, 140, 147,
- 159, 163, 167, 99, 91, 91, 87, 87, 97, 97, 110, 110, 126, 126, 144, 144,
- 163, 163, 173,
- /* Size 32x16 */
32, 31, 31, 31, 31, 32, 32, 32, 34, 35, 36, 39, 41, 44, 47, 48, 53, 55,
58, 63, 65, 71, 74, 79, 82, 82, 87, 89, 92, 94, 97, 99, 31, 32, 32, 32,
32, 32, 33, 33, 34, 34, 34, 37, 39, 41, 44, 45, 49, 51, 54, 58, 60, 65,
@@ -1801,34 +1769,49 @@
157, 157, 163, 163, 163, 93, 88, 88, 84, 84, 82, 83, 80, 80, 86, 86, 91,
91, 97, 98, 105, 105, 112, 113, 121, 122, 130, 130, 140, 140, 149, 150,
161, 162, 166, 167, 173,
+ /* Size 32x16 */
+ 32, 31, 31, 32, 35, 36, 44, 47, 53, 62, 65, 79, 82, 88, 90, 93, 31, 32,
+ 32, 32, 35, 35, 42, 45, 51, 59, 62, 75, 78, 83, 86, 88, 31, 32, 32, 32,
+ 34, 35, 41, 45, 50, 58, 61, 74, 76, 82, 85, 88, 31, 32, 32, 33, 34, 34,
+ 41, 44, 49, 57, 59, 72, 74, 79, 82, 84, 31, 32, 33, 34, 35, 36, 42, 44,
+ 49, 57, 59, 71, 73, 79, 81, 84, 32, 32, 33, 34, 36, 36, 42, 45, 50, 57,
+ 59, 71, 73, 78, 80, 82, 32, 33, 34, 35, 37, 38, 42, 45, 49, 56, 58, 69,
+ 71, 76, 79, 83, 32, 33, 34, 36, 39, 40, 44, 47, 51, 58, 60, 71, 73, 76,
+ 78, 80, 34, 34, 35, 37, 41, 42, 48, 50, 54, 61, 63, 73, 76, 81, 81, 80,
+ 35, 34, 36, 38, 45, 47, 52, 55, 59, 65, 67, 77, 79, 82, 83, 86, 36, 34,
+ 36, 38, 46, 48, 54, 56, 60, 66, 68, 78, 80, 85, 87, 86, 39, 37, 39, 40,
+ 48, 50, 58, 60, 65, 71, 73, 84, 86, 89, 88, 91, 41, 39, 40, 41, 49, 51,
+ 60, 62, 67, 74, 76, 86, 88, 91, 93, 91, 44, 41, 42, 43, 51, 53, 63, 66,
+ 71, 78, 79, 90, 92, 97, 94, 97, 47, 44, 44, 45, 53, 56, 66, 69, 75, 82,
+ 84, 95, 97, 98, 101, 98, 48, 45, 45, 46, 54, 56, 67, 70, 76, 83, 85, 96,
+ 98, 104, 101, 105, 53, 49, 50, 50, 57, 60, 71, 75, 82, 90, 92, 103, 106,
+ 107, 108, 105, 55, 51, 51, 51, 59, 61, 72, 77, 84, 92, 94, 106, 108,
+ 111, 110, 112, 58, 54, 54, 54, 61, 63, 75, 79, 87, 95, 98, 110, 112,
+ 117, 116, 113, 63, 58, 58, 57, 65, 67, 78, 83, 91, 100, 103, 116, 118,
+ 119, 119, 121, 65, 60, 59, 58, 66, 68, 79, 84, 92, 102, 105, 118, 120,
+ 127, 124, 122, 71, 65, 64, 63, 71, 73, 84, 89, 97, 108, 111, 125, 127,
+ 129, 129, 130, 74, 68, 67, 66, 73, 75, 86, 91, 100, 110, 113, 128, 131,
+ 135, 134, 130, 79, 72, 71, 70, 77, 79, 90, 95, 104, 115, 118, 133, 136,
+ 140, 139, 140, 82, 75, 73, 72, 79, 81, 92, 97, 105, 117, 120, 136, 139,
+ 145, 142, 140, 82, 75, 74, 72, 79, 81, 92, 97, 106, 117, 121, 136, 139,
+ 148, 150, 149, 87, 79, 78, 76, 83, 85, 96, 100, 110, 120, 125, 141, 144,
+ 148, 153, 150, 89, 82, 81, 78, 83, 87, 97, 99, 113, 118, 128, 139, 145,
+ 153, 157, 161, 92, 84, 83, 80, 84, 89, 97, 101, 114, 116, 132, 135, 150,
+ 153, 157, 162, 94, 86, 85, 82, 85, 92, 97, 104, 112, 119, 130, 136, 151,
+ 154, 163, 166, 97, 88, 88, 85, 86, 94, 97, 107, 111, 123, 128, 140, 147,
+ 159, 163, 167, 99, 91, 91, 87, 87, 97, 97, 110, 110, 126, 126, 144, 144,
+ 163, 163, 173,
/* Size 4x16 */
- 31, 36, 62, 88, 32, 35, 58, 82, 32, 36, 57, 79, 33, 38, 56, 76, 34, 42,
- 61, 81, 34, 48, 66, 85, 39, 51, 74, 91, 44, 56, 82, 98, 49, 60, 90, 107,
- 54, 63, 95, 117, 60, 68, 102, 127, 68, 75, 110, 135, 75, 81, 117, 145,
- 79, 85, 120, 148, 84, 89, 116, 153, 88, 94, 123, 159,
- /* Size 16x4 */
31, 32, 32, 33, 34, 34, 39, 44, 49, 54, 60, 68, 75, 79, 84, 88, 36, 35,
36, 38, 42, 48, 51, 56, 60, 63, 68, 75, 81, 85, 89, 94, 62, 58, 57, 56,
61, 66, 74, 82, 90, 95, 102, 110, 117, 120, 116, 123, 88, 82, 79, 76,
81, 85, 91, 98, 107, 117, 127, 135, 145, 148, 153, 159,
+ /* Size 16x4 */
+ 31, 36, 62, 88, 32, 35, 58, 82, 32, 36, 57, 79, 33, 38, 56, 76, 34, 42,
+ 61, 81, 34, 48, 66, 85, 39, 51, 74, 91, 44, 56, 82, 98, 49, 60, 90, 107,
+ 54, 63, 95, 117, 60, 68, 102, 127, 68, 75, 110, 135, 75, 81, 117, 145,
+ 79, 85, 120, 148, 84, 89, 116, 153, 88, 94, 123, 159,
/* Size 8x32 */
- 32, 31, 35, 44, 53, 65, 82, 90, 31, 32, 35, 42, 51, 62, 78, 86, 31, 32,
- 34, 41, 50, 61, 76, 85, 31, 32, 34, 41, 49, 59, 74, 82, 31, 33, 35, 42,
- 49, 59, 73, 81, 32, 33, 36, 42, 50, 59, 73, 80, 32, 34, 37, 42, 49, 58,
- 71, 79, 32, 34, 39, 44, 51, 60, 73, 78, 34, 35, 41, 48, 54, 63, 76, 81,
- 35, 36, 45, 52, 59, 67, 79, 83, 36, 36, 46, 54, 60, 68, 80, 87, 39, 39,
- 48, 58, 65, 73, 86, 88, 41, 40, 49, 60, 67, 76, 88, 93, 44, 42, 51, 63,
- 71, 79, 92, 94, 47, 44, 53, 66, 75, 84, 97, 101, 48, 45, 54, 67, 76, 85,
- 98, 101, 53, 50, 57, 71, 82, 92, 106, 108, 55, 51, 59, 72, 84, 94, 108,
- 110, 58, 54, 61, 75, 87, 98, 112, 116, 63, 58, 65, 78, 91, 103, 118,
- 119, 65, 59, 66, 79, 92, 105, 120, 124, 71, 64, 71, 84, 97, 111, 127,
- 129, 74, 67, 73, 86, 100, 113, 131, 134, 79, 71, 77, 90, 104, 118, 136,
- 139, 82, 73, 79, 92, 105, 120, 139, 142, 82, 74, 79, 92, 106, 121, 139,
- 150, 87, 78, 83, 96, 110, 125, 144, 153, 89, 81, 83, 97, 113, 128, 145,
- 157, 92, 83, 84, 97, 114, 132, 150, 157, 94, 85, 85, 97, 112, 130, 151,
- 163, 97, 88, 86, 97, 111, 128, 147, 163, 99, 91, 87, 97, 110, 126, 144,
- 163,
- /* Size 32x8 */
32, 31, 31, 31, 31, 32, 32, 32, 34, 35, 36, 39, 41, 44, 47, 48, 53, 55,
58, 63, 65, 71, 74, 79, 82, 82, 87, 89, 92, 94, 97, 99, 31, 32, 32, 32,
33, 33, 34, 34, 35, 36, 36, 39, 40, 42, 44, 45, 50, 51, 54, 58, 59, 64,
@@ -1844,7 +1827,24 @@
108, 112, 118, 120, 127, 131, 136, 139, 139, 144, 145, 150, 151, 147,
144, 90, 86, 85, 82, 81, 80, 79, 78, 81, 83, 87, 88, 93, 94, 101, 101,
108, 110, 116, 119, 124, 129, 134, 139, 142, 150, 153, 157, 157, 163,
- 163, 163 },
+ 163, 163,
+ /* Size 32x8 */
+ 32, 31, 35, 44, 53, 65, 82, 90, 31, 32, 35, 42, 51, 62, 78, 86, 31, 32,
+ 34, 41, 50, 61, 76, 85, 31, 32, 34, 41, 49, 59, 74, 82, 31, 33, 35, 42,
+ 49, 59, 73, 81, 32, 33, 36, 42, 50, 59, 73, 80, 32, 34, 37, 42, 49, 58,
+ 71, 79, 32, 34, 39, 44, 51, 60, 73, 78, 34, 35, 41, 48, 54, 63, 76, 81,
+ 35, 36, 45, 52, 59, 67, 79, 83, 36, 36, 46, 54, 60, 68, 80, 87, 39, 39,
+ 48, 58, 65, 73, 86, 88, 41, 40, 49, 60, 67, 76, 88, 93, 44, 42, 51, 63,
+ 71, 79, 92, 94, 47, 44, 53, 66, 75, 84, 97, 101, 48, 45, 54, 67, 76, 85,
+ 98, 101, 53, 50, 57, 71, 82, 92, 106, 108, 55, 51, 59, 72, 84, 94, 108,
+ 110, 58, 54, 61, 75, 87, 98, 112, 116, 63, 58, 65, 78, 91, 103, 118,
+ 119, 65, 59, 66, 79, 92, 105, 120, 124, 71, 64, 71, 84, 97, 111, 127,
+ 129, 74, 67, 73, 86, 100, 113, 131, 134, 79, 71, 77, 90, 104, 118, 136,
+ 139, 82, 73, 79, 92, 105, 120, 139, 142, 82, 74, 79, 92, 106, 121, 139,
+ 150, 87, 78, 83, 96, 110, 125, 144, 153, 89, 81, 83, 97, 113, 128, 145,
+ 157, 92, 83, 84, 97, 114, 132, 150, 157, 94, 85, 85, 97, 112, 130, 151,
+ 163, 97, 88, 86, 97, 111, 128, 147, 163, 99, 91, 87, 97, 110, 126, 144,
+ 163 },
{ /* Chroma */
/* Size 4x4 */
32, 45, 51, 61, 45, 54, 59, 65, 51, 59, 75, 81, 61, 65, 81, 97,
@@ -1929,21 +1929,12 @@
70, 70, 74, 74, 78, 78, 82, 82, 86, 86, 91, 91, 95, 95, 100, 100, 101,
101, 104,
/* Size 4x8 */
- 31, 47, 53, 63, 36, 47, 50, 59, 46, 52, 55, 61, 45, 53, 63, 70, 49, 55,
- 71, 77, 54, 58, 77, 86, 59, 61, 81, 94, 63, 65, 80, 95,
- /* Size 8x4 */
31, 36, 46, 45, 49, 54, 59, 63, 47, 47, 52, 53, 55, 58, 61, 65, 53, 50,
55, 63, 71, 77, 81, 80, 63, 59, 61, 70, 77, 86, 94, 95,
+ /* Size 8x4 */
+ 31, 47, 53, 63, 36, 47, 50, 59, 46, 52, 55, 61, 45, 53, 63, 70, 49, 55,
+ 71, 77, 54, 58, 77, 86, 59, 61, 81, 94, 63, 65, 80, 95,
/* Size 8x16 */
- 32, 33, 45, 49, 52, 57, 64, 68, 31, 34, 45, 46, 49, 53, 60, 64, 33, 37,
- 46, 45, 47, 51, 57, 61, 37, 43, 47, 45, 47, 50, 55, 59, 42, 44, 49, 49,
- 50, 53, 58, 60, 49, 47, 52, 53, 54, 57, 61, 63, 48, 46, 51, 57, 59, 61,
- 66, 67, 50, 46, 52, 59, 63, 66, 71, 71, 52, 47, 53, 61, 66, 71, 75, 74,
- 54, 49, 54, 62, 68, 73, 79, 79, 57, 51, 55, 64, 70, 76, 83, 83, 61, 55,
- 58, 66, 73, 80, 87, 87, 64, 57, 60, 68, 75, 83, 91, 91, 66, 59, 61, 69,
- 77, 84, 93, 95, 68, 61, 61, 68, 77, 86, 94, 97, 70, 63, 61, 67, 75, 83,
- 92, 98,
- /* Size 16x8 */
32, 31, 33, 37, 42, 49, 48, 50, 52, 54, 57, 61, 64, 66, 68, 70, 33, 34,
37, 43, 44, 47, 46, 46, 47, 49, 51, 55, 57, 59, 61, 63, 45, 45, 46, 47,
49, 52, 51, 52, 53, 54, 55, 58, 60, 61, 61, 61, 49, 46, 45, 45, 49, 53,
@@ -1952,37 +1943,16 @@
76, 80, 83, 84, 86, 83, 64, 60, 57, 55, 58, 61, 66, 71, 75, 79, 83, 87,
91, 93, 94, 92, 68, 64, 61, 59, 60, 63, 67, 71, 74, 79, 83, 87, 91, 95,
97, 98,
+ /* Size 16x8 */
+ 32, 33, 45, 49, 52, 57, 64, 68, 31, 34, 45, 46, 49, 53, 60, 64, 33, 37,
+ 46, 45, 47, 51, 57, 61, 37, 43, 47, 45, 47, 50, 55, 59, 42, 44, 49, 49,
+ 50, 53, 58, 60, 49, 47, 52, 53, 54, 57, 61, 63, 48, 46, 51, 57, 59, 61,
+ 66, 67, 50, 46, 52, 59, 63, 66, 71, 71, 52, 47, 53, 61, 66, 71, 75, 74,
+ 54, 49, 54, 62, 68, 73, 79, 79, 57, 51, 55, 64, 70, 76, 83, 83, 61, 55,
+ 58, 66, 73, 80, 87, 87, 64, 57, 60, 68, 75, 83, 91, 91, 66, 59, 61, 69,
+ 77, 84, 93, 95, 68, 61, 61, 68, 77, 86, 94, 97, 70, 63, 61, 67, 75, 83,
+ 92, 98,
/* Size 16x32 */
- 32, 31, 33, 37, 45, 48, 49, 50, 52, 56, 57, 63, 64, 67, 68, 68, 31, 31,
- 34, 38, 45, 47, 47, 48, 50, 53, 54, 60, 61, 63, 64, 65, 31, 32, 34, 39,
- 45, 46, 46, 47, 49, 52, 53, 59, 60, 62, 64, 65, 30, 32, 35, 40, 44, 46,
- 45, 46, 48, 51, 52, 57, 58, 60, 61, 62, 33, 35, 37, 42, 46, 47, 45, 46,
- 47, 50, 51, 56, 57, 60, 61, 62, 33, 36, 38, 43, 46, 47, 46, 46, 47, 50,
- 51, 56, 57, 59, 60, 60, 37, 40, 43, 47, 47, 47, 45, 46, 47, 49, 50, 54,
- 55, 57, 59, 61, 39, 41, 43, 47, 48, 48, 47, 47, 48, 50, 51, 55, 56, 57,
- 58, 59, 42, 43, 44, 47, 49, 50, 49, 50, 50, 53, 53, 57, 58, 60, 60, 59,
- 47, 46, 46, 48, 51, 52, 53, 53, 53, 55, 56, 60, 61, 61, 61, 62, 49, 46,
- 47, 48, 52, 53, 53, 54, 54, 56, 57, 60, 61, 63, 63, 62, 48, 46, 46, 47,
- 51, 53, 56, 56, 57, 59, 60, 64, 64, 65, 64, 65, 48, 45, 46, 46, 51, 53,
- 57, 57, 59, 61, 61, 65, 66, 66, 67, 65, 49, 45, 45, 46, 51, 53, 58, 59,
- 61, 63, 64, 67, 68, 70, 67, 68, 50, 46, 46, 46, 52, 54, 59, 61, 63, 65,
- 66, 70, 71, 70, 71, 68, 50, 46, 46, 46, 52, 54, 59, 61, 64, 66, 67, 71,
- 71, 73, 71, 72, 52, 48, 47, 47, 53, 54, 61, 63, 66, 70, 71, 75, 75, 75,
- 74, 72, 53, 49, 48, 48, 53, 55, 61, 64, 67, 71, 72, 76, 77, 77, 75, 76,
- 54, 50, 49, 49, 54, 55, 62, 65, 68, 72, 73, 78, 79, 80, 79, 76, 56, 51,
- 51, 50, 55, 56, 63, 66, 70, 74, 76, 81, 82, 81, 80, 80, 57, 52, 51, 50,
- 55, 56, 64, 66, 70, 75, 76, 82, 83, 85, 83, 80, 60, 54, 54, 52, 57, 58,
- 65, 68, 72, 77, 79, 85, 86, 86, 85, 84, 61, 56, 55, 53, 58, 59, 66, 69,
- 73, 79, 80, 86, 87, 89, 87, 84, 63, 57, 56, 55, 59, 60, 67, 70, 75, 80,
- 82, 89, 90, 91, 89, 89, 64, 58, 57, 56, 60, 61, 68, 71, 75, 81, 83, 90,
- 91, 93, 91, 89, 64, 59, 58, 56, 60, 61, 68, 71, 75, 81, 83, 90, 91, 94,
- 94, 93, 66, 60, 59, 57, 61, 63, 69, 72, 77, 82, 84, 92, 93, 94, 95, 93,
- 67, 61, 60, 58, 61, 63, 69, 70, 78, 80, 85, 90, 93, 96, 97, 97, 68, 62,
- 61, 59, 61, 64, 68, 71, 77, 79, 86, 88, 94, 96, 97, 98, 69, 63, 62, 59,
- 61, 65, 68, 72, 76, 80, 85, 88, 94, 95, 99, 99, 70, 63, 63, 60, 61, 66,
- 67, 73, 75, 81, 83, 89, 92, 97, 98, 99, 70, 64, 64, 61, 61, 67, 67, 74,
- 74, 82, 82, 90, 90, 98, 98, 102,
- /* Size 32x16 */
32, 31, 31, 30, 33, 33, 37, 39, 42, 47, 49, 48, 48, 49, 50, 50, 52, 53,
54, 56, 57, 60, 61, 63, 64, 64, 66, 67, 68, 69, 70, 70, 31, 31, 32, 32,
35, 36, 40, 41, 43, 46, 46, 46, 45, 45, 46, 46, 48, 49, 50, 51, 52, 54,
@@ -2012,33 +1982,47 @@
83, 85, 87, 89, 91, 94, 95, 97, 97, 99, 98, 98, 68, 65, 65, 62, 62, 60,
61, 59, 59, 62, 62, 65, 65, 68, 68, 72, 72, 76, 76, 80, 80, 84, 84, 89,
89, 93, 93, 97, 98, 99, 99, 102,
+ /* Size 32x16 */
+ 32, 31, 33, 37, 45, 48, 49, 50, 52, 56, 57, 63, 64, 67, 68, 68, 31, 31,
+ 34, 38, 45, 47, 47, 48, 50, 53, 54, 60, 61, 63, 64, 65, 31, 32, 34, 39,
+ 45, 46, 46, 47, 49, 52, 53, 59, 60, 62, 64, 65, 30, 32, 35, 40, 44, 46,
+ 45, 46, 48, 51, 52, 57, 58, 60, 61, 62, 33, 35, 37, 42, 46, 47, 45, 46,
+ 47, 50, 51, 56, 57, 60, 61, 62, 33, 36, 38, 43, 46, 47, 46, 46, 47, 50,
+ 51, 56, 57, 59, 60, 60, 37, 40, 43, 47, 47, 47, 45, 46, 47, 49, 50, 54,
+ 55, 57, 59, 61, 39, 41, 43, 47, 48, 48, 47, 47, 48, 50, 51, 55, 56, 57,
+ 58, 59, 42, 43, 44, 47, 49, 50, 49, 50, 50, 53, 53, 57, 58, 60, 60, 59,
+ 47, 46, 46, 48, 51, 52, 53, 53, 53, 55, 56, 60, 61, 61, 61, 62, 49, 46,
+ 47, 48, 52, 53, 53, 54, 54, 56, 57, 60, 61, 63, 63, 62, 48, 46, 46, 47,
+ 51, 53, 56, 56, 57, 59, 60, 64, 64, 65, 64, 65, 48, 45, 46, 46, 51, 53,
+ 57, 57, 59, 61, 61, 65, 66, 66, 67, 65, 49, 45, 45, 46, 51, 53, 58, 59,
+ 61, 63, 64, 67, 68, 70, 67, 68, 50, 46, 46, 46, 52, 54, 59, 61, 63, 65,
+ 66, 70, 71, 70, 71, 68, 50, 46, 46, 46, 52, 54, 59, 61, 64, 66, 67, 71,
+ 71, 73, 71, 72, 52, 48, 47, 47, 53, 54, 61, 63, 66, 70, 71, 75, 75, 75,
+ 74, 72, 53, 49, 48, 48, 53, 55, 61, 64, 67, 71, 72, 76, 77, 77, 75, 76,
+ 54, 50, 49, 49, 54, 55, 62, 65, 68, 72, 73, 78, 79, 80, 79, 76, 56, 51,
+ 51, 50, 55, 56, 63, 66, 70, 74, 76, 81, 82, 81, 80, 80, 57, 52, 51, 50,
+ 55, 56, 64, 66, 70, 75, 76, 82, 83, 85, 83, 80, 60, 54, 54, 52, 57, 58,
+ 65, 68, 72, 77, 79, 85, 86, 86, 85, 84, 61, 56, 55, 53, 58, 59, 66, 69,
+ 73, 79, 80, 86, 87, 89, 87, 84, 63, 57, 56, 55, 59, 60, 67, 70, 75, 80,
+ 82, 89, 90, 91, 89, 89, 64, 58, 57, 56, 60, 61, 68, 71, 75, 81, 83, 90,
+ 91, 93, 91, 89, 64, 59, 58, 56, 60, 61, 68, 71, 75, 81, 83, 90, 91, 94,
+ 94, 93, 66, 60, 59, 57, 61, 63, 69, 72, 77, 82, 84, 92, 93, 94, 95, 93,
+ 67, 61, 60, 58, 61, 63, 69, 70, 78, 80, 85, 90, 93, 96, 97, 97, 68, 62,
+ 61, 59, 61, 64, 68, 71, 77, 79, 86, 88, 94, 96, 97, 98, 69, 63, 62, 59,
+ 61, 65, 68, 72, 76, 80, 85, 88, 94, 95, 99, 99, 70, 63, 63, 60, 61, 66,
+ 67, 73, 75, 81, 83, 89, 92, 97, 98, 99, 70, 64, 64, 61, 61, 67, 67, 74,
+ 74, 82, 82, 90, 90, 98, 98, 102,
/* Size 4x16 */
- 31, 48, 56, 67, 32, 46, 52, 62, 35, 47, 50, 60, 40, 47, 49, 57, 43, 50,
- 53, 60, 46, 53, 56, 63, 45, 53, 61, 66, 46, 54, 65, 70, 48, 54, 70, 75,
- 50, 55, 72, 80, 52, 56, 75, 85, 56, 59, 79, 89, 58, 61, 81, 93, 60, 63,
- 82, 94, 62, 64, 79, 96, 63, 66, 81, 97,
- /* Size 16x4 */
31, 32, 35, 40, 43, 46, 45, 46, 48, 50, 52, 56, 58, 60, 62, 63, 48, 46,
47, 47, 50, 53, 53, 54, 54, 55, 56, 59, 61, 63, 64, 66, 56, 52, 50, 49,
53, 56, 61, 65, 70, 72, 75, 79, 81, 82, 79, 81, 67, 62, 60, 57, 60, 63,
66, 70, 75, 80, 85, 89, 93, 94, 96, 97,
+ /* Size 16x4 */
+ 31, 48, 56, 67, 32, 46, 52, 62, 35, 47, 50, 60, 40, 47, 49, 57, 43, 50,
+ 53, 60, 46, 53, 56, 63, 45, 53, 61, 66, 46, 54, 65, 70, 48, 54, 70, 75,
+ 50, 55, 72, 80, 52, 56, 75, 85, 56, 59, 79, 89, 58, 61, 81, 93, 60, 63,
+ 82, 94, 62, 64, 79, 96, 63, 66, 81, 97,
/* Size 8x32 */
- 32, 33, 45, 49, 52, 57, 64, 68, 31, 34, 45, 47, 50, 54, 61, 64, 31, 34,
- 45, 46, 49, 53, 60, 64, 30, 35, 44, 45, 48, 52, 58, 61, 33, 37, 46, 45,
- 47, 51, 57, 61, 33, 38, 46, 46, 47, 51, 57, 60, 37, 43, 47, 45, 47, 50,
- 55, 59, 39, 43, 48, 47, 48, 51, 56, 58, 42, 44, 49, 49, 50, 53, 58, 60,
- 47, 46, 51, 53, 53, 56, 61, 61, 49, 47, 52, 53, 54, 57, 61, 63, 48, 46,
- 51, 56, 57, 60, 64, 64, 48, 46, 51, 57, 59, 61, 66, 67, 49, 45, 51, 58,
- 61, 64, 68, 67, 50, 46, 52, 59, 63, 66, 71, 71, 50, 46, 52, 59, 64, 67,
- 71, 71, 52, 47, 53, 61, 66, 71, 75, 74, 53, 48, 53, 61, 67, 72, 77, 75,
- 54, 49, 54, 62, 68, 73, 79, 79, 56, 51, 55, 63, 70, 76, 82, 80, 57, 51,
- 55, 64, 70, 76, 83, 83, 60, 54, 57, 65, 72, 79, 86, 85, 61, 55, 58, 66,
- 73, 80, 87, 87, 63, 56, 59, 67, 75, 82, 90, 89, 64, 57, 60, 68, 75, 83,
- 91, 91, 64, 58, 60, 68, 75, 83, 91, 94, 66, 59, 61, 69, 77, 84, 93, 95,
- 67, 60, 61, 69, 78, 85, 93, 97, 68, 61, 61, 68, 77, 86, 94, 97, 69, 62,
- 61, 68, 76, 85, 94, 99, 70, 63, 61, 67, 75, 83, 92, 98, 70, 64, 61, 67,
- 74, 82, 90, 98,
- /* Size 32x8 */
32, 31, 31, 30, 33, 33, 37, 39, 42, 47, 49, 48, 48, 49, 50, 50, 52, 53,
54, 56, 57, 60, 61, 63, 64, 64, 66, 67, 68, 69, 70, 70, 33, 34, 34, 35,
37, 38, 43, 43, 44, 46, 47, 46, 46, 45, 46, 46, 47, 48, 49, 51, 51, 54,
@@ -2053,7 +2037,23 @@
55, 56, 58, 61, 61, 64, 66, 68, 71, 71, 75, 77, 79, 82, 83, 86, 87, 90,
91, 91, 93, 93, 94, 94, 92, 90, 68, 64, 64, 61, 61, 60, 59, 58, 60, 61,
63, 64, 67, 67, 71, 71, 74, 75, 79, 80, 83, 85, 87, 89, 91, 94, 95, 97,
- 97, 99, 98, 98 },
+ 97, 99, 98, 98,
+ /* Size 32x8 */
+ 32, 33, 45, 49, 52, 57, 64, 68, 31, 34, 45, 47, 50, 54, 61, 64, 31, 34,
+ 45, 46, 49, 53, 60, 64, 30, 35, 44, 45, 48, 52, 58, 61, 33, 37, 46, 45,
+ 47, 51, 57, 61, 33, 38, 46, 46, 47, 51, 57, 60, 37, 43, 47, 45, 47, 50,
+ 55, 59, 39, 43, 48, 47, 48, 51, 56, 58, 42, 44, 49, 49, 50, 53, 58, 60,
+ 47, 46, 51, 53, 53, 56, 61, 61, 49, 47, 52, 53, 54, 57, 61, 63, 48, 46,
+ 51, 56, 57, 60, 64, 64, 48, 46, 51, 57, 59, 61, 66, 67, 49, 45, 51, 58,
+ 61, 64, 68, 67, 50, 46, 52, 59, 63, 66, 71, 71, 50, 46, 52, 59, 64, 67,
+ 71, 71, 52, 47, 53, 61, 66, 71, 75, 74, 53, 48, 53, 61, 67, 72, 77, 75,
+ 54, 49, 54, 62, 68, 73, 79, 79, 56, 51, 55, 63, 70, 76, 82, 80, 57, 51,
+ 55, 64, 70, 76, 83, 83, 60, 54, 57, 65, 72, 79, 86, 85, 61, 55, 58, 66,
+ 73, 80, 87, 87, 63, 56, 59, 67, 75, 82, 90, 89, 64, 57, 60, 68, 75, 83,
+ 91, 91, 64, 58, 60, 68, 75, 83, 91, 94, 66, 59, 61, 69, 77, 84, 93, 95,
+ 67, 60, 61, 69, 78, 85, 93, 97, 68, 61, 61, 68, 77, 86, 94, 97, 69, 62,
+ 61, 68, 76, 85, 94, 99, 70, 63, 61, 67, 75, 83, 92, 98, 70, 64, 61, 67,
+ 74, 82, 90, 98 },
},
{
{ /* Luma */
@@ -2142,21 +2142,12 @@
84, 84, 83, 83, 80, 81, 86, 86, 91, 91, 96, 97, 103, 103, 110, 110, 118,
119, 126, 126, 135, 136, 144, 144, 155, 155, 159, 159, 164,
/* Size 4x8 */
- 32, 35, 51, 77, 32, 36, 50, 72, 34, 42, 54, 75, 38, 51, 67, 87, 48, 59,
- 80, 103, 60, 68, 92, 119, 72, 79, 104, 135, 81, 86, 112, 144,
- /* Size 8x4 */
32, 32, 34, 38, 48, 60, 72, 81, 35, 36, 42, 51, 59, 68, 79, 86, 51, 50,
54, 67, 80, 92, 104, 112, 77, 72, 75, 87, 103, 119, 135, 144,
+ /* Size 8x4 */
+ 32, 35, 51, 77, 32, 36, 50, 72, 34, 42, 54, 75, 38, 51, 67, 87, 48, 59,
+ 80, 103, 60, 68, 92, 119, 72, 79, 104, 135, 81, 86, 112, 144,
/* Size 8x16 */
- 32, 31, 33, 40, 51, 65, 79, 87, 31, 32, 33, 39, 49, 61, 74, 82, 31, 32,
- 34, 38, 47, 59, 71, 79, 32, 33, 36, 40, 48, 58, 69, 77, 33, 34, 38, 44,
- 52, 62, 72, 78, 36, 35, 42, 51, 58, 68, 78, 84, 39, 38, 44, 54, 63, 73,
- 84, 89, 44, 41, 46, 59, 69, 79, 90, 96, 48, 45, 50, 62, 74, 85, 96, 103,
- 53, 49, 53, 66, 79, 92, 103, 111, 58, 54, 57, 70, 84, 98, 110, 118, 66,
- 60, 63, 75, 90, 106, 119, 126, 74, 67, 69, 81, 97, 113, 128, 134, 81,
- 73, 75, 86, 102, 120, 135, 143, 86, 78, 78, 90, 106, 124, 140, 147, 91,
- 82, 80, 90, 103, 119, 137, 151,
- /* Size 16x8 */
32, 31, 31, 32, 33, 36, 39, 44, 48, 53, 58, 66, 74, 81, 86, 91, 31, 32,
32, 33, 34, 35, 38, 41, 45, 49, 54, 60, 67, 73, 78, 82, 33, 33, 34, 36,
38, 42, 44, 46, 50, 53, 57, 63, 69, 75, 78, 80, 40, 39, 38, 40, 44, 51,
@@ -2165,38 +2156,16 @@
92, 98, 106, 113, 120, 124, 119, 79, 74, 71, 69, 72, 78, 84, 90, 96,
103, 110, 119, 128, 135, 140, 137, 87, 82, 79, 77, 78, 84, 89, 96, 103,
111, 118, 126, 134, 143, 147, 151,
+ /* Size 16x8 */
+ 32, 31, 33, 40, 51, 65, 79, 87, 31, 32, 33, 39, 49, 61, 74, 82, 31, 32,
+ 34, 38, 47, 59, 71, 79, 32, 33, 36, 40, 48, 58, 69, 77, 33, 34, 38, 44,
+ 52, 62, 72, 78, 36, 35, 42, 51, 58, 68, 78, 84, 39, 38, 44, 54, 63, 73,
+ 84, 89, 44, 41, 46, 59, 69, 79, 90, 96, 48, 45, 50, 62, 74, 85, 96, 103,
+ 53, 49, 53, 66, 79, 92, 103, 111, 58, 54, 57, 70, 84, 98, 110, 118, 66,
+ 60, 63, 75, 90, 106, 119, 126, 74, 67, 69, 81, 97, 113, 128, 134, 81,
+ 73, 75, 86, 102, 120, 135, 143, 86, 78, 78, 90, 106, 124, 140, 147, 91,
+ 82, 80, 90, 103, 119, 137, 151,
/* Size 16x32 */
- 32, 31, 31, 32, 33, 36, 40, 44, 51, 53, 65, 66, 79, 81, 87, 90, 31, 32,
- 32, 32, 33, 35, 39, 42, 49, 51, 62, 63, 75, 77, 83, 85, 31, 32, 32, 32,
- 33, 35, 39, 42, 49, 51, 61, 62, 74, 76, 82, 85, 31, 32, 32, 33, 33, 34,
- 38, 41, 47, 49, 59, 60, 72, 74, 79, 81, 31, 32, 32, 33, 34, 35, 38, 41,
- 47, 49, 59, 60, 71, 73, 79, 81, 32, 32, 33, 34, 35, 36, 39, 42, 48, 50,
- 59, 60, 71, 72, 78, 80, 32, 32, 33, 35, 36, 37, 40, 42, 48, 49, 58, 59,
- 69, 71, 77, 80, 32, 33, 33, 35, 36, 38, 41, 42, 48, 49, 58, 59, 69, 70,
- 75, 77, 33, 33, 34, 36, 38, 41, 44, 46, 52, 53, 62, 63, 72, 74, 78, 78,
- 34, 34, 34, 37, 39, 42, 45, 48, 53, 54, 63, 64, 73, 75, 80, 83, 36, 34,
- 35, 38, 42, 48, 51, 54, 58, 60, 68, 69, 78, 80, 84, 83, 36, 35, 35, 38,
- 42, 48, 51, 54, 59, 60, 68, 69, 79, 80, 85, 87, 39, 37, 38, 40, 44, 50,
- 54, 58, 63, 65, 73, 74, 84, 85, 89, 88, 40, 38, 39, 41, 45, 51, 56, 59,
- 65, 67, 75, 76, 85, 87, 90, 93, 44, 41, 41, 43, 46, 53, 59, 63, 69, 71,
- 79, 80, 90, 91, 96, 93, 46, 43, 43, 44, 48, 55, 60, 65, 72, 73, 82, 83,
- 93, 94, 97, 100, 48, 45, 45, 46, 50, 56, 62, 67, 74, 76, 85, 86, 96, 98,
- 103, 100, 52, 48, 48, 49, 52, 59, 65, 70, 78, 80, 90, 91, 101, 103, 105,
- 107, 53, 49, 49, 50, 53, 60, 66, 71, 79, 82, 92, 93, 103, 105, 111, 107,
- 58, 53, 53, 53, 57, 63, 69, 74, 83, 86, 97, 98, 109, 111, 113, 115, 58,
- 54, 54, 54, 57, 63, 70, 75, 84, 87, 98, 99, 110, 112, 118, 115, 65, 60,
- 59, 58, 62, 68, 74, 79, 89, 92, 105, 106, 118, 119, 122, 123, 66, 61,
- 60, 59, 63, 69, 75, 80, 90, 93, 106, 107, 119, 121, 126, 123, 71, 65,
- 65, 63, 67, 73, 79, 84, 94, 97, 111, 112, 125, 127, 131, 132, 74, 68,
- 67, 66, 69, 75, 81, 86, 97, 100, 113, 115, 128, 130, 134, 132, 79, 72,
- 72, 70, 73, 79, 85, 90, 101, 104, 118, 119, 133, 135, 141, 140, 81, 74,
- 73, 71, 75, 80, 86, 91, 102, 105, 120, 121, 135, 137, 143, 140, 82, 75,
- 74, 72, 75, 81, 87, 92, 103, 106, 121, 122, 136, 139, 147, 151, 86, 78,
- 78, 75, 78, 84, 90, 95, 106, 109, 124, 125, 140, 142, 147, 151, 88, 81,
- 80, 77, 80, 86, 90, 98, 105, 112, 122, 127, 140, 144, 152, 155, 91, 83,
- 82, 79, 80, 88, 90, 100, 103, 114, 119, 130, 137, 148, 151, 155, 93, 85,
- 85, 81, 81, 90, 90, 102, 103, 117, 117, 134, 134, 151, 152, 160,
- /* Size 32x16 */
32, 31, 31, 31, 31, 32, 32, 32, 33, 34, 36, 36, 39, 40, 44, 46, 48, 52,
53, 58, 58, 65, 66, 71, 74, 79, 81, 82, 86, 88, 91, 93, 31, 32, 32, 32,
32, 32, 32, 33, 33, 34, 34, 35, 37, 38, 41, 43, 45, 48, 49, 53, 54, 60,
@@ -2228,33 +2197,48 @@
152, 90, 85, 85, 81, 81, 80, 80, 77, 78, 83, 83, 87, 88, 93, 93, 100,
100, 107, 107, 115, 115, 123, 123, 132, 132, 140, 140, 151, 151, 155,
155, 160,
+ /* Size 32x16 */
+ 32, 31, 31, 32, 33, 36, 40, 44, 51, 53, 65, 66, 79, 81, 87, 90, 31, 32,
+ 32, 32, 33, 35, 39, 42, 49, 51, 62, 63, 75, 77, 83, 85, 31, 32, 32, 32,
+ 33, 35, 39, 42, 49, 51, 61, 62, 74, 76, 82, 85, 31, 32, 32, 33, 33, 34,
+ 38, 41, 47, 49, 59, 60, 72, 74, 79, 81, 31, 32, 32, 33, 34, 35, 38, 41,
+ 47, 49, 59, 60, 71, 73, 79, 81, 32, 32, 33, 34, 35, 36, 39, 42, 48, 50,
+ 59, 60, 71, 72, 78, 80, 32, 32, 33, 35, 36, 37, 40, 42, 48, 49, 58, 59,
+ 69, 71, 77, 80, 32, 33, 33, 35, 36, 38, 41, 42, 48, 49, 58, 59, 69, 70,
+ 75, 77, 33, 33, 34, 36, 38, 41, 44, 46, 52, 53, 62, 63, 72, 74, 78, 78,
+ 34, 34, 34, 37, 39, 42, 45, 48, 53, 54, 63, 64, 73, 75, 80, 83, 36, 34,
+ 35, 38, 42, 48, 51, 54, 58, 60, 68, 69, 78, 80, 84, 83, 36, 35, 35, 38,
+ 42, 48, 51, 54, 59, 60, 68, 69, 79, 80, 85, 87, 39, 37, 38, 40, 44, 50,
+ 54, 58, 63, 65, 73, 74, 84, 85, 89, 88, 40, 38, 39, 41, 45, 51, 56, 59,
+ 65, 67, 75, 76, 85, 87, 90, 93, 44, 41, 41, 43, 46, 53, 59, 63, 69, 71,
+ 79, 80, 90, 91, 96, 93, 46, 43, 43, 44, 48, 55, 60, 65, 72, 73, 82, 83,
+ 93, 94, 97, 100, 48, 45, 45, 46, 50, 56, 62, 67, 74, 76, 85, 86, 96, 98,
+ 103, 100, 52, 48, 48, 49, 52, 59, 65, 70, 78, 80, 90, 91, 101, 103, 105,
+ 107, 53, 49, 49, 50, 53, 60, 66, 71, 79, 82, 92, 93, 103, 105, 111, 107,
+ 58, 53, 53, 53, 57, 63, 69, 74, 83, 86, 97, 98, 109, 111, 113, 115, 58,
+ 54, 54, 54, 57, 63, 70, 75, 84, 87, 98, 99, 110, 112, 118, 115, 65, 60,
+ 59, 58, 62, 68, 74, 79, 89, 92, 105, 106, 118, 119, 122, 123, 66, 61,
+ 60, 59, 63, 69, 75, 80, 90, 93, 106, 107, 119, 121, 126, 123, 71, 65,
+ 65, 63, 67, 73, 79, 84, 94, 97, 111, 112, 125, 127, 131, 132, 74, 68,
+ 67, 66, 69, 75, 81, 86, 97, 100, 113, 115, 128, 130, 134, 132, 79, 72,
+ 72, 70, 73, 79, 85, 90, 101, 104, 118, 119, 133, 135, 141, 140, 81, 74,
+ 73, 71, 75, 80, 86, 91, 102, 105, 120, 121, 135, 137, 143, 140, 82, 75,
+ 74, 72, 75, 81, 87, 92, 103, 106, 121, 122, 136, 139, 147, 151, 86, 78,
+ 78, 75, 78, 84, 90, 95, 106, 109, 124, 125, 140, 142, 147, 151, 88, 81,
+ 80, 77, 80, 86, 90, 98, 105, 112, 122, 127, 140, 144, 152, 155, 91, 83,
+ 82, 79, 80, 88, 90, 100, 103, 114, 119, 130, 137, 148, 151, 155, 93, 85,
+ 85, 81, 81, 90, 90, 102, 103, 117, 117, 134, 134, 151, 152, 160,
/* Size 4x16 */
- 31, 36, 53, 81, 32, 35, 51, 76, 32, 35, 49, 73, 32, 37, 49, 71, 33, 41,
- 53, 74, 34, 48, 60, 80, 37, 50, 65, 85, 41, 53, 71, 91, 45, 56, 76, 98,
- 49, 60, 82, 105, 54, 63, 87, 112, 61, 69, 93, 121, 68, 75, 100, 130, 74,
- 80, 105, 137, 78, 84, 109, 142, 83, 88, 114, 148,
- /* Size 16x4 */
31, 32, 32, 32, 33, 34, 37, 41, 45, 49, 54, 61, 68, 74, 78, 83, 36, 35,
35, 37, 41, 48, 50, 53, 56, 60, 63, 69, 75, 80, 84, 88, 53, 51, 49, 49,
53, 60, 65, 71, 76, 82, 87, 93, 100, 105, 109, 114, 81, 76, 73, 71, 74,
80, 85, 91, 98, 105, 112, 121, 130, 137, 142, 148,
+ /* Size 16x4 */
+ 31, 36, 53, 81, 32, 35, 51, 76, 32, 35, 49, 73, 32, 37, 49, 71, 33, 41,
+ 53, 74, 34, 48, 60, 80, 37, 50, 65, 85, 41, 53, 71, 91, 45, 56, 76, 98,
+ 49, 60, 82, 105, 54, 63, 87, 112, 61, 69, 93, 121, 68, 75, 100, 130, 74,
+ 80, 105, 137, 78, 84, 109, 142, 83, 88, 114, 148,
/* Size 8x32 */
- 32, 31, 33, 40, 51, 65, 79, 87, 31, 32, 33, 39, 49, 62, 75, 83, 31, 32,
- 33, 39, 49, 61, 74, 82, 31, 32, 33, 38, 47, 59, 72, 79, 31, 32, 34, 38,
- 47, 59, 71, 79, 32, 33, 35, 39, 48, 59, 71, 78, 32, 33, 36, 40, 48, 58,
- 69, 77, 32, 33, 36, 41, 48, 58, 69, 75, 33, 34, 38, 44, 52, 62, 72, 78,
- 34, 34, 39, 45, 53, 63, 73, 80, 36, 35, 42, 51, 58, 68, 78, 84, 36, 35,
- 42, 51, 59, 68, 79, 85, 39, 38, 44, 54, 63, 73, 84, 89, 40, 39, 45, 56,
- 65, 75, 85, 90, 44, 41, 46, 59, 69, 79, 90, 96, 46, 43, 48, 60, 72, 82,
- 93, 97, 48, 45, 50, 62, 74, 85, 96, 103, 52, 48, 52, 65, 78, 90, 101,
- 105, 53, 49, 53, 66, 79, 92, 103, 111, 58, 53, 57, 69, 83, 97, 109, 113,
- 58, 54, 57, 70, 84, 98, 110, 118, 65, 59, 62, 74, 89, 105, 118, 122, 66,
- 60, 63, 75, 90, 106, 119, 126, 71, 65, 67, 79, 94, 111, 125, 131, 74,
- 67, 69, 81, 97, 113, 128, 134, 79, 72, 73, 85, 101, 118, 133, 141, 81,
- 73, 75, 86, 102, 120, 135, 143, 82, 74, 75, 87, 103, 121, 136, 147, 86,
- 78, 78, 90, 106, 124, 140, 147, 88, 80, 80, 90, 105, 122, 140, 152, 91,
- 82, 80, 90, 103, 119, 137, 151, 93, 85, 81, 90, 103, 117, 134, 152,
- /* Size 32x8 */
32, 31, 31, 31, 31, 32, 32, 32, 33, 34, 36, 36, 39, 40, 44, 46, 48, 52,
53, 58, 58, 65, 66, 71, 74, 79, 81, 82, 86, 88, 91, 93, 31, 32, 32, 32,
32, 33, 33, 33, 34, 34, 35, 35, 38, 39, 41, 43, 45, 48, 49, 53, 54, 59,
@@ -2270,7 +2254,23 @@
103, 109, 110, 118, 119, 125, 128, 133, 135, 136, 140, 140, 137, 134,
87, 83, 82, 79, 79, 78, 77, 75, 78, 80, 84, 85, 89, 90, 96, 97, 103,
105, 111, 113, 118, 122, 126, 131, 134, 141, 143, 147, 147, 152, 151,
- 152 },
+ 152,
+ /* Size 32x8 */
+ 32, 31, 33, 40, 51, 65, 79, 87, 31, 32, 33, 39, 49, 62, 75, 83, 31, 32,
+ 33, 39, 49, 61, 74, 82, 31, 32, 33, 38, 47, 59, 72, 79, 31, 32, 34, 38,
+ 47, 59, 71, 79, 32, 33, 35, 39, 48, 59, 71, 78, 32, 33, 36, 40, 48, 58,
+ 69, 77, 32, 33, 36, 41, 48, 58, 69, 75, 33, 34, 38, 44, 52, 62, 72, 78,
+ 34, 34, 39, 45, 53, 63, 73, 80, 36, 35, 42, 51, 58, 68, 78, 84, 36, 35,
+ 42, 51, 59, 68, 79, 85, 39, 38, 44, 54, 63, 73, 84, 89, 40, 39, 45, 56,
+ 65, 75, 85, 90, 44, 41, 46, 59, 69, 79, 90, 96, 46, 43, 48, 60, 72, 82,
+ 93, 97, 48, 45, 50, 62, 74, 85, 96, 103, 52, 48, 52, 65, 78, 90, 101,
+ 105, 53, 49, 53, 66, 79, 92, 103, 111, 58, 53, 57, 69, 83, 97, 109, 113,
+ 58, 54, 57, 70, 84, 98, 110, 118, 65, 59, 62, 74, 89, 105, 118, 122, 66,
+ 60, 63, 75, 90, 106, 119, 126, 71, 65, 67, 79, 94, 111, 125, 131, 74,
+ 67, 69, 81, 97, 113, 128, 134, 79, 72, 73, 85, 101, 118, 133, 141, 81,
+ 73, 75, 86, 102, 120, 135, 143, 82, 74, 75, 87, 103, 121, 136, 147, 86,
+ 78, 78, 90, 106, 124, 140, 147, 88, 80, 80, 90, 105, 122, 140, 152, 91,
+ 82, 80, 90, 103, 119, 137, 151, 93, 85, 81, 90, 103, 117, 134, 152 },
{ /* Chroma */
/* Size 4x4 */
32, 46, 49, 58, 46, 53, 55, 62, 49, 55, 70, 78, 58, 62, 78, 91,
@@ -2354,21 +2354,12 @@
98, 97, 69, 65, 65, 62, 62, 61, 61, 58, 59, 62, 62, 65, 65, 68, 68, 71,
71, 75, 75, 79, 79, 83, 83, 87, 87, 91, 91, 96, 96, 97, 97, 99,
/* Size 4x8 */
- 31, 47, 50, 61, 36, 47, 47, 57, 43, 50, 50, 58, 45, 53, 58, 65, 47, 54,
- 66, 74, 52, 56, 70, 82, 57, 60, 75, 90, 61, 63, 77, 93,
- /* Size 8x4 */
31, 36, 43, 45, 47, 52, 57, 61, 47, 47, 50, 53, 54, 56, 60, 63, 50, 47,
50, 58, 66, 70, 75, 77, 61, 57, 58, 65, 74, 82, 90, 93,
+ /* Size 8x4 */
+ 31, 47, 50, 61, 36, 47, 47, 57, 43, 50, 50, 58, 45, 53, 58, 65, 47, 54,
+ 66, 74, 52, 56, 70, 82, 57, 60, 75, 90, 61, 63, 77, 93,
/* Size 8x16 */
- 32, 32, 40, 49, 51, 57, 63, 67, 31, 33, 41, 47, 49, 54, 59, 63, 31, 35,
- 43, 46, 47, 51, 57, 60, 35, 39, 46, 46, 47, 50, 55, 58, 41, 43, 48, 49,
- 49, 52, 57, 59, 49, 47, 50, 53, 54, 57, 60, 62, 48, 46, 49, 54, 57, 60,
- 64, 65, 49, 45, 48, 56, 61, 64, 67, 69, 50, 46, 49, 57, 63, 67, 71, 73,
- 52, 48, 50, 58, 65, 71, 75, 77, 54, 50, 51, 59, 67, 73, 78, 81, 57, 52,
- 53, 61, 69, 77, 82, 85, 61, 55, 56, 63, 72, 80, 86, 88, 64, 58, 58, 65,
- 73, 82, 89, 92, 66, 59, 59, 66, 75, 84, 91, 94, 68, 61, 59, 65, 72, 81,
- 89, 95,
- /* Size 16x8 */
32, 31, 31, 35, 41, 49, 48, 49, 50, 52, 54, 57, 61, 64, 66, 68, 32, 33,
35, 39, 43, 47, 46, 45, 46, 48, 50, 52, 55, 58, 59, 61, 40, 41, 43, 46,
48, 50, 49, 48, 49, 50, 51, 53, 56, 58, 59, 59, 49, 47, 46, 46, 49, 53,
@@ -2377,37 +2368,16 @@
73, 77, 80, 82, 84, 81, 63, 59, 57, 55, 57, 60, 64, 67, 71, 75, 78, 82,
86, 89, 91, 89, 67, 63, 60, 58, 59, 62, 65, 69, 73, 77, 81, 85, 88, 92,
94, 95,
+ /* Size 16x8 */
+ 32, 32, 40, 49, 51, 57, 63, 67, 31, 33, 41, 47, 49, 54, 59, 63, 31, 35,
+ 43, 46, 47, 51, 57, 60, 35, 39, 46, 46, 47, 50, 55, 58, 41, 43, 48, 49,
+ 49, 52, 57, 59, 49, 47, 50, 53, 54, 57, 60, 62, 48, 46, 49, 54, 57, 60,
+ 64, 65, 49, 45, 48, 56, 61, 64, 67, 69, 50, 46, 49, 57, 63, 67, 71, 73,
+ 52, 48, 50, 58, 65, 71, 75, 77, 54, 50, 51, 59, 67, 73, 78, 81, 57, 52,
+ 53, 61, 69, 77, 82, 85, 61, 55, 56, 63, 72, 80, 86, 88, 64, 58, 58, 65,
+ 73, 82, 89, 92, 66, 59, 59, 66, 75, 84, 91, 94, 68, 61, 59, 65, 72, 81,
+ 89, 95,
/* Size 16x32 */
- 32, 31, 32, 37, 40, 48, 49, 49, 51, 52, 57, 58, 63, 64, 67, 67, 31, 31,
- 33, 38, 41, 47, 47, 47, 49, 50, 54, 55, 60, 61, 63, 64, 31, 31, 33, 38,
- 41, 47, 47, 47, 49, 49, 54, 54, 59, 60, 63, 64, 30, 32, 33, 40, 42, 46,
- 45, 45, 47, 48, 52, 52, 57, 58, 60, 61, 31, 33, 35, 41, 43, 46, 46, 45,
- 47, 48, 51, 52, 57, 57, 60, 61, 33, 36, 37, 43, 44, 47, 46, 46, 47, 47,
- 51, 52, 56, 57, 59, 60, 35, 38, 39, 45, 46, 47, 46, 45, 47, 47, 50, 51,
- 55, 56, 58, 60, 37, 40, 41, 47, 47, 47, 46, 45, 46, 47, 50, 50, 54, 55,
- 57, 58, 41, 42, 43, 47, 48, 49, 49, 48, 49, 50, 52, 53, 57, 57, 59, 58,
- 42, 43, 43, 47, 48, 50, 49, 49, 50, 50, 53, 54, 57, 58, 60, 61, 49, 46,
- 47, 48, 50, 53, 53, 53, 54, 54, 57, 57, 60, 61, 62, 61, 49, 46, 47, 48,
- 50, 53, 53, 54, 54, 55, 57, 57, 61, 61, 63, 64, 48, 46, 46, 47, 49, 53,
- 54, 56, 57, 57, 60, 60, 64, 64, 65, 64, 48, 45, 46, 46, 49, 53, 55, 56,
- 58, 58, 61, 61, 65, 65, 66, 67, 49, 45, 45, 46, 48, 53, 56, 58, 61, 61,
- 64, 64, 67, 68, 69, 67, 49, 46, 46, 46, 49, 53, 57, 59, 62, 62, 65, 66,
- 69, 69, 70, 70, 50, 46, 46, 46, 49, 54, 57, 59, 63, 64, 67, 67, 71, 71,
- 73, 71, 51, 47, 47, 47, 49, 54, 58, 61, 64, 66, 69, 70, 73, 74, 74, 74,
- 52, 48, 48, 47, 50, 54, 58, 61, 65, 66, 71, 71, 75, 75, 77, 74, 54, 50,
- 49, 48, 51, 55, 59, 62, 67, 68, 73, 73, 77, 78, 78, 78, 54, 50, 50, 49,
- 51, 55, 59, 62, 67, 68, 73, 74, 78, 78, 81, 78, 57, 52, 52, 50, 52, 56,
- 60, 64, 69, 70, 76, 77, 82, 82, 83, 82, 57, 52, 52, 51, 53, 57, 61, 64,
- 69, 71, 77, 77, 82, 83, 85, 82, 60, 54, 54, 52, 55, 58, 62, 65, 71, 72,
- 79, 79, 85, 86, 87, 86, 61, 56, 55, 53, 56, 59, 63, 66, 72, 73, 80, 81,
- 86, 87, 88, 86, 63, 57, 57, 55, 57, 60, 64, 67, 73, 75, 82, 82, 89, 90,
- 92, 90, 64, 58, 58, 55, 58, 61, 65, 68, 73, 75, 82, 83, 89, 90, 92, 90,
- 64, 59, 58, 56, 58, 61, 65, 68, 74, 75, 83, 83, 90, 91, 94, 95, 66, 60,
- 59, 57, 59, 62, 66, 69, 75, 76, 84, 85, 91, 92, 94, 95, 67, 61, 60, 58,
- 59, 63, 66, 70, 74, 77, 82, 85, 91, 93, 96, 96, 68, 62, 61, 58, 59, 64,
- 65, 71, 72, 78, 81, 86, 89, 94, 95, 96, 68, 62, 62, 59, 59, 65, 65, 71,
- 71, 79, 79, 87, 87, 95, 95, 98,
- /* Size 32x16 */
32, 31, 31, 30, 31, 33, 35, 37, 41, 42, 49, 49, 48, 48, 49, 49, 50, 51,
52, 54, 54, 57, 57, 60, 61, 63, 64, 64, 66, 67, 68, 68, 31, 31, 31, 32,
33, 36, 38, 40, 42, 43, 46, 46, 46, 45, 45, 46, 46, 47, 48, 50, 50, 52,
@@ -2437,33 +2407,47 @@
81, 83, 85, 87, 88, 92, 92, 94, 94, 96, 95, 95, 67, 64, 64, 61, 61, 60,
60, 58, 58, 61, 61, 64, 64, 67, 67, 70, 71, 74, 74, 78, 78, 82, 82, 86,
86, 90, 90, 95, 95, 96, 96, 98,
+ /* Size 32x16 */
+ 32, 31, 32, 37, 40, 48, 49, 49, 51, 52, 57, 58, 63, 64, 67, 67, 31, 31,
+ 33, 38, 41, 47, 47, 47, 49, 50, 54, 55, 60, 61, 63, 64, 31, 31, 33, 38,
+ 41, 47, 47, 47, 49, 49, 54, 54, 59, 60, 63, 64, 30, 32, 33, 40, 42, 46,
+ 45, 45, 47, 48, 52, 52, 57, 58, 60, 61, 31, 33, 35, 41, 43, 46, 46, 45,
+ 47, 48, 51, 52, 57, 57, 60, 61, 33, 36, 37, 43, 44, 47, 46, 46, 47, 47,
+ 51, 52, 56, 57, 59, 60, 35, 38, 39, 45, 46, 47, 46, 45, 47, 47, 50, 51,
+ 55, 56, 58, 60, 37, 40, 41, 47, 47, 47, 46, 45, 46, 47, 50, 50, 54, 55,
+ 57, 58, 41, 42, 43, 47, 48, 49, 49, 48, 49, 50, 52, 53, 57, 57, 59, 58,
+ 42, 43, 43, 47, 48, 50, 49, 49, 50, 50, 53, 54, 57, 58, 60, 61, 49, 46,
+ 47, 48, 50, 53, 53, 53, 54, 54, 57, 57, 60, 61, 62, 61, 49, 46, 47, 48,
+ 50, 53, 53, 54, 54, 55, 57, 57, 61, 61, 63, 64, 48, 46, 46, 47, 49, 53,
+ 54, 56, 57, 57, 60, 60, 64, 64, 65, 64, 48, 45, 46, 46, 49, 53, 55, 56,
+ 58, 58, 61, 61, 65, 65, 66, 67, 49, 45, 45, 46, 48, 53, 56, 58, 61, 61,
+ 64, 64, 67, 68, 69, 67, 49, 46, 46, 46, 49, 53, 57, 59, 62, 62, 65, 66,
+ 69, 69, 70, 70, 50, 46, 46, 46, 49, 54, 57, 59, 63, 64, 67, 67, 71, 71,
+ 73, 71, 51, 47, 47, 47, 49, 54, 58, 61, 64, 66, 69, 70, 73, 74, 74, 74,
+ 52, 48, 48, 47, 50, 54, 58, 61, 65, 66, 71, 71, 75, 75, 77, 74, 54, 50,
+ 49, 48, 51, 55, 59, 62, 67, 68, 73, 73, 77, 78, 78, 78, 54, 50, 50, 49,
+ 51, 55, 59, 62, 67, 68, 73, 74, 78, 78, 81, 78, 57, 52, 52, 50, 52, 56,
+ 60, 64, 69, 70, 76, 77, 82, 82, 83, 82, 57, 52, 52, 51, 53, 57, 61, 64,
+ 69, 71, 77, 77, 82, 83, 85, 82, 60, 54, 54, 52, 55, 58, 62, 65, 71, 72,
+ 79, 79, 85, 86, 87, 86, 61, 56, 55, 53, 56, 59, 63, 66, 72, 73, 80, 81,
+ 86, 87, 88, 86, 63, 57, 57, 55, 57, 60, 64, 67, 73, 75, 82, 82, 89, 90,
+ 92, 90, 64, 58, 58, 55, 58, 61, 65, 68, 73, 75, 82, 83, 89, 90, 92, 90,
+ 64, 59, 58, 56, 58, 61, 65, 68, 74, 75, 83, 83, 90, 91, 94, 95, 66, 60,
+ 59, 57, 59, 62, 66, 69, 75, 76, 84, 85, 91, 92, 94, 95, 67, 61, 60, 58,
+ 59, 63, 66, 70, 74, 77, 82, 85, 91, 93, 96, 96, 68, 62, 61, 58, 59, 64,
+ 65, 71, 72, 78, 81, 86, 89, 94, 95, 96, 68, 62, 62, 59, 59, 65, 65, 71,
+ 71, 79, 79, 87, 87, 95, 95, 98,
/* Size 4x16 */
- 31, 48, 52, 64, 31, 47, 49, 60, 33, 46, 48, 57, 38, 47, 47, 56, 42, 49,
- 50, 57, 46, 53, 54, 61, 46, 53, 57, 64, 45, 53, 61, 68, 46, 54, 64, 71,
- 48, 54, 66, 75, 50, 55, 68, 78, 52, 57, 71, 83, 56, 59, 73, 87, 58, 61,
- 75, 90, 60, 62, 76, 92, 62, 64, 78, 94,
- /* Size 16x4 */
31, 31, 33, 38, 42, 46, 46, 45, 46, 48, 50, 52, 56, 58, 60, 62, 48, 47,
46, 47, 49, 53, 53, 53, 54, 54, 55, 57, 59, 61, 62, 64, 52, 49, 48, 47,
50, 54, 57, 61, 64, 66, 68, 71, 73, 75, 76, 78, 64, 60, 57, 56, 57, 61,
64, 68, 71, 75, 78, 83, 87, 90, 92, 94,
+ /* Size 16x4 */
+ 31, 48, 52, 64, 31, 47, 49, 60, 33, 46, 48, 57, 38, 47, 47, 56, 42, 49,
+ 50, 57, 46, 53, 54, 61, 46, 53, 57, 64, 45, 53, 61, 68, 46, 54, 64, 71,
+ 48, 54, 66, 75, 50, 55, 68, 78, 52, 57, 71, 83, 56, 59, 73, 87, 58, 61,
+ 75, 90, 60, 62, 76, 92, 62, 64, 78, 94,
/* Size 8x32 */
- 32, 32, 40, 49, 51, 57, 63, 67, 31, 33, 41, 47, 49, 54, 60, 63, 31, 33,
- 41, 47, 49, 54, 59, 63, 30, 33, 42, 45, 47, 52, 57, 60, 31, 35, 43, 46,
- 47, 51, 57, 60, 33, 37, 44, 46, 47, 51, 56, 59, 35, 39, 46, 46, 47, 50,
- 55, 58, 37, 41, 47, 46, 46, 50, 54, 57, 41, 43, 48, 49, 49, 52, 57, 59,
- 42, 43, 48, 49, 50, 53, 57, 60, 49, 47, 50, 53, 54, 57, 60, 62, 49, 47,
- 50, 53, 54, 57, 61, 63, 48, 46, 49, 54, 57, 60, 64, 65, 48, 46, 49, 55,
- 58, 61, 65, 66, 49, 45, 48, 56, 61, 64, 67, 69, 49, 46, 49, 57, 62, 65,
- 69, 70, 50, 46, 49, 57, 63, 67, 71, 73, 51, 47, 49, 58, 64, 69, 73, 74,
- 52, 48, 50, 58, 65, 71, 75, 77, 54, 49, 51, 59, 67, 73, 77, 78, 54, 50,
- 51, 59, 67, 73, 78, 81, 57, 52, 52, 60, 69, 76, 82, 83, 57, 52, 53, 61,
- 69, 77, 82, 85, 60, 54, 55, 62, 71, 79, 85, 87, 61, 55, 56, 63, 72, 80,
- 86, 88, 63, 57, 57, 64, 73, 82, 89, 92, 64, 58, 58, 65, 73, 82, 89, 92,
- 64, 58, 58, 65, 74, 83, 90, 94, 66, 59, 59, 66, 75, 84, 91, 94, 67, 60,
- 59, 66, 74, 82, 91, 96, 68, 61, 59, 65, 72, 81, 89, 95, 68, 62, 59, 65,
- 71, 79, 87, 95,
- /* Size 32x8 */
32, 31, 31, 30, 31, 33, 35, 37, 41, 42, 49, 49, 48, 48, 49, 49, 50, 51,
52, 54, 54, 57, 57, 60, 61, 63, 64, 64, 66, 67, 68, 68, 32, 33, 33, 33,
35, 37, 39, 41, 43, 43, 47, 47, 46, 46, 45, 46, 46, 47, 48, 49, 50, 52,
@@ -2478,7 +2462,23 @@
55, 54, 57, 57, 60, 61, 64, 65, 67, 69, 71, 73, 75, 77, 78, 82, 82, 85,
86, 89, 89, 90, 91, 91, 89, 87, 67, 63, 63, 60, 60, 59, 58, 57, 59, 60,
62, 63, 65, 66, 69, 70, 73, 74, 77, 78, 81, 83, 85, 87, 88, 92, 92, 94,
- 94, 96, 95, 95 },
+ 94, 96, 95, 95,
+ /* Size 32x8 */
+ 32, 32, 40, 49, 51, 57, 63, 67, 31, 33, 41, 47, 49, 54, 60, 63, 31, 33,
+ 41, 47, 49, 54, 59, 63, 30, 33, 42, 45, 47, 52, 57, 60, 31, 35, 43, 46,
+ 47, 51, 57, 60, 33, 37, 44, 46, 47, 51, 56, 59, 35, 39, 46, 46, 47, 50,
+ 55, 58, 37, 41, 47, 46, 46, 50, 54, 57, 41, 43, 48, 49, 49, 52, 57, 59,
+ 42, 43, 48, 49, 50, 53, 57, 60, 49, 47, 50, 53, 54, 57, 60, 62, 49, 47,
+ 50, 53, 54, 57, 61, 63, 48, 46, 49, 54, 57, 60, 64, 65, 48, 46, 49, 55,
+ 58, 61, 65, 66, 49, 45, 48, 56, 61, 64, 67, 69, 49, 46, 49, 57, 62, 65,
+ 69, 70, 50, 46, 49, 57, 63, 67, 71, 73, 51, 47, 49, 58, 64, 69, 73, 74,
+ 52, 48, 50, 58, 65, 71, 75, 77, 54, 49, 51, 59, 67, 73, 77, 78, 54, 50,
+ 51, 59, 67, 73, 78, 81, 57, 52, 52, 60, 69, 76, 82, 83, 57, 52, 53, 61,
+ 69, 77, 82, 85, 60, 54, 55, 62, 71, 79, 85, 87, 61, 55, 56, 63, 72, 80,
+ 86, 88, 63, 57, 57, 64, 73, 82, 89, 92, 64, 58, 58, 65, 73, 82, 89, 92,
+ 64, 58, 58, 65, 74, 83, 90, 94, 66, 59, 59, 66, 75, 84, 91, 94, 67, 60,
+ 59, 66, 74, 82, 91, 96, 68, 61, 59, 65, 72, 81, 89, 95, 68, 62, 59, 65,
+ 71, 79, 87, 95 },
},
{
{ /* Luma */
@@ -2566,21 +2566,12 @@
79, 79, 77, 77, 75, 75, 80, 80, 84, 84, 90, 90, 96, 96, 102, 102, 109,
109, 116, 116, 124, 124, 132, 132, 141, 141, 144, 144, 149,
/* Size 4x8 */
- 32, 35, 51, 75, 32, 36, 50, 71, 34, 42, 54, 73, 37, 50, 65, 84, 45, 56,
- 76, 96, 54, 63, 87, 110, 65, 73, 97, 125, 75, 81, 106, 136,
- /* Size 8x4 */
32, 32, 34, 37, 45, 54, 65, 75, 35, 36, 42, 50, 56, 63, 73, 81, 51, 50,
54, 65, 76, 87, 97, 106, 75, 71, 73, 84, 96, 110, 125, 136,
+ /* Size 8x4 */
+ 32, 35, 51, 75, 32, 36, 50, 71, 34, 42, 54, 73, 37, 50, 65, 84, 45, 56,
+ 76, 96, 54, 63, 87, 110, 65, 73, 97, 125, 75, 81, 106, 136,
/* Size 8x16 */
- 32, 31, 32, 36, 44, 53, 65, 79, 31, 32, 32, 35, 42, 51, 62, 75, 31, 32,
- 33, 34, 41, 49, 59, 72, 32, 32, 34, 36, 42, 50, 59, 71, 32, 33, 35, 38,
- 42, 49, 58, 69, 34, 34, 37, 42, 48, 54, 63, 73, 36, 34, 38, 48, 54, 60,
- 68, 78, 39, 37, 40, 50, 58, 65, 73, 84, 44, 41, 43, 53, 63, 71, 79, 90,
- 48, 45, 46, 56, 67, 76, 85, 96, 53, 49, 50, 60, 71, 82, 92, 103, 58, 54,
- 54, 63, 75, 87, 98, 110, 65, 60, 58, 68, 79, 92, 105, 118, 71, 65, 63,
- 73, 84, 97, 111, 125, 79, 72, 70, 79, 90, 104, 118, 133, 82, 75, 72, 81,
- 92, 106, 121, 136,
- /* Size 16x8 */
32, 31, 31, 32, 32, 34, 36, 39, 44, 48, 53, 58, 65, 71, 79, 82, 31, 32,
32, 32, 33, 34, 34, 37, 41, 45, 49, 54, 60, 65, 72, 75, 32, 32, 33, 34,
35, 37, 38, 40, 43, 46, 50, 54, 58, 63, 70, 72, 36, 35, 34, 36, 38, 42,
@@ -2589,38 +2580,16 @@
82, 87, 92, 97, 104, 106, 65, 62, 59, 59, 58, 63, 68, 73, 79, 85, 92,
98, 105, 111, 118, 121, 79, 75, 72, 71, 69, 73, 78, 84, 90, 96, 103,
110, 118, 125, 133, 136,
+ /* Size 16x8 */
+ 32, 31, 32, 36, 44, 53, 65, 79, 31, 32, 32, 35, 42, 51, 62, 75, 31, 32,
+ 33, 34, 41, 49, 59, 72, 32, 32, 34, 36, 42, 50, 59, 71, 32, 33, 35, 38,
+ 42, 49, 58, 69, 34, 34, 37, 42, 48, 54, 63, 73, 36, 34, 38, 48, 54, 60,
+ 68, 78, 39, 37, 40, 50, 58, 65, 73, 84, 44, 41, 43, 53, 63, 71, 79, 90,
+ 48, 45, 46, 56, 67, 76, 85, 96, 53, 49, 50, 60, 71, 82, 92, 103, 58, 54,
+ 54, 63, 75, 87, 98, 110, 65, 60, 58, 68, 79, 92, 105, 118, 71, 65, 63,
+ 73, 84, 97, 111, 125, 79, 72, 70, 79, 90, 104, 118, 133, 82, 75, 72, 81,
+ 92, 106, 121, 136,
/* Size 16x32 */
- 32, 31, 31, 32, 32, 36, 36, 44, 44, 53, 53, 65, 65, 79, 79, 87, 31, 32,
- 32, 32, 32, 35, 35, 42, 42, 51, 51, 62, 62, 75, 75, 82, 31, 32, 32, 32,
- 32, 35, 35, 42, 42, 51, 51, 62, 62, 75, 75, 82, 31, 32, 32, 33, 33, 34,
- 34, 41, 41, 49, 49, 59, 59, 72, 72, 78, 31, 32, 32, 33, 33, 34, 34, 41,
- 41, 49, 49, 59, 59, 72, 72, 78, 32, 32, 32, 34, 34, 36, 36, 42, 42, 50,
- 50, 59, 59, 71, 71, 77, 32, 32, 32, 34, 34, 36, 36, 42, 42, 50, 50, 59,
- 59, 71, 71, 77, 32, 33, 33, 35, 35, 38, 38, 42, 42, 49, 49, 58, 58, 69,
- 69, 75, 32, 33, 33, 35, 35, 38, 38, 42, 42, 49, 49, 58, 58, 69, 69, 75,
- 34, 34, 34, 37, 37, 42, 42, 48, 48, 54, 54, 63, 63, 73, 73, 79, 34, 34,
- 34, 37, 37, 42, 42, 48, 48, 54, 54, 63, 63, 73, 73, 79, 36, 34, 34, 38,
- 38, 48, 48, 54, 54, 60, 60, 68, 68, 78, 78, 84, 36, 34, 34, 38, 38, 48,
- 48, 54, 54, 60, 60, 68, 68, 78, 78, 84, 39, 37, 37, 40, 40, 50, 50, 58,
- 58, 65, 65, 73, 73, 84, 84, 89, 39, 37, 37, 40, 40, 50, 50, 58, 58, 65,
- 65, 73, 73, 84, 84, 89, 44, 41, 41, 43, 43, 53, 53, 63, 63, 71, 71, 79,
- 79, 90, 90, 95, 44, 41, 41, 43, 43, 53, 53, 63, 63, 71, 71, 79, 79, 90,
- 90, 95, 48, 45, 45, 46, 46, 56, 56, 67, 67, 76, 76, 85, 85, 96, 96, 102,
- 48, 45, 45, 46, 46, 56, 56, 67, 67, 76, 76, 85, 85, 96, 96, 102, 53, 49,
- 49, 50, 50, 60, 60, 71, 71, 82, 82, 92, 92, 103, 103, 109, 53, 49, 49,
- 50, 50, 60, 60, 71, 71, 82, 82, 92, 92, 103, 103, 109, 58, 54, 54, 54,
- 54, 63, 63, 75, 75, 87, 87, 98, 98, 110, 110, 116, 58, 54, 54, 54, 54,
- 63, 63, 75, 75, 87, 87, 98, 98, 110, 110, 116, 65, 60, 60, 58, 58, 68,
- 68, 79, 79, 92, 92, 105, 105, 118, 118, 124, 65, 60, 60, 58, 58, 68, 68,
- 79, 79, 92, 92, 105, 105, 118, 118, 124, 71, 65, 65, 63, 63, 73, 73, 84,
- 84, 97, 97, 111, 111, 125, 125, 132, 71, 65, 65, 63, 63, 73, 73, 84, 84,
- 97, 97, 111, 111, 125, 125, 132, 79, 72, 72, 70, 70, 79, 79, 90, 90,
- 104, 104, 118, 118, 133, 133, 141, 79, 72, 72, 70, 70, 79, 79, 90, 90,
- 104, 104, 118, 118, 133, 133, 141, 82, 75, 75, 72, 72, 81, 81, 92, 92,
- 106, 106, 121, 121, 136, 136, 144, 82, 75, 75, 72, 72, 81, 81, 92, 92,
- 106, 106, 121, 121, 136, 136, 144, 87, 79, 79, 76, 76, 84, 84, 96, 96,
- 109, 109, 124, 124, 141, 141, 149,
- /* Size 32x16 */
32, 31, 31, 31, 31, 32, 32, 32, 32, 34, 34, 36, 36, 39, 39, 44, 44, 48,
48, 53, 53, 58, 58, 65, 65, 71, 71, 79, 79, 82, 82, 87, 31, 32, 32, 32,
32, 32, 32, 33, 33, 34, 34, 34, 34, 37, 37, 41, 41, 45, 45, 49, 49, 54,
@@ -2651,33 +2620,48 @@
125, 125, 133, 133, 136, 136, 141, 87, 82, 82, 78, 78, 77, 77, 75, 75,
79, 79, 84, 84, 89, 89, 95, 95, 102, 102, 109, 109, 116, 116, 124, 124,
132, 132, 141, 141, 144, 144, 149,
+ /* Size 32x16 */
+ 32, 31, 31, 32, 32, 36, 36, 44, 44, 53, 53, 65, 65, 79, 79, 87, 31, 32,
+ 32, 32, 32, 35, 35, 42, 42, 51, 51, 62, 62, 75, 75, 82, 31, 32, 32, 32,
+ 32, 35, 35, 42, 42, 51, 51, 62, 62, 75, 75, 82, 31, 32, 32, 33, 33, 34,
+ 34, 41, 41, 49, 49, 59, 59, 72, 72, 78, 31, 32, 32, 33, 33, 34, 34, 41,
+ 41, 49, 49, 59, 59, 72, 72, 78, 32, 32, 32, 34, 34, 36, 36, 42, 42, 50,
+ 50, 59, 59, 71, 71, 77, 32, 32, 32, 34, 34, 36, 36, 42, 42, 50, 50, 59,
+ 59, 71, 71, 77, 32, 33, 33, 35, 35, 38, 38, 42, 42, 49, 49, 58, 58, 69,
+ 69, 75, 32, 33, 33, 35, 35, 38, 38, 42, 42, 49, 49, 58, 58, 69, 69, 75,
+ 34, 34, 34, 37, 37, 42, 42, 48, 48, 54, 54, 63, 63, 73, 73, 79, 34, 34,
+ 34, 37, 37, 42, 42, 48, 48, 54, 54, 63, 63, 73, 73, 79, 36, 34, 34, 38,
+ 38, 48, 48, 54, 54, 60, 60, 68, 68, 78, 78, 84, 36, 34, 34, 38, 38, 48,
+ 48, 54, 54, 60, 60, 68, 68, 78, 78, 84, 39, 37, 37, 40, 40, 50, 50, 58,
+ 58, 65, 65, 73, 73, 84, 84, 89, 39, 37, 37, 40, 40, 50, 50, 58, 58, 65,
+ 65, 73, 73, 84, 84, 89, 44, 41, 41, 43, 43, 53, 53, 63, 63, 71, 71, 79,
+ 79, 90, 90, 95, 44, 41, 41, 43, 43, 53, 53, 63, 63, 71, 71, 79, 79, 90,
+ 90, 95, 48, 45, 45, 46, 46, 56, 56, 67, 67, 76, 76, 85, 85, 96, 96, 102,
+ 48, 45, 45, 46, 46, 56, 56, 67, 67, 76, 76, 85, 85, 96, 96, 102, 53, 49,
+ 49, 50, 50, 60, 60, 71, 71, 82, 82, 92, 92, 103, 103, 109, 53, 49, 49,
+ 50, 50, 60, 60, 71, 71, 82, 82, 92, 92, 103, 103, 109, 58, 54, 54, 54,
+ 54, 63, 63, 75, 75, 87, 87, 98, 98, 110, 110, 116, 58, 54, 54, 54, 54,
+ 63, 63, 75, 75, 87, 87, 98, 98, 110, 110, 116, 65, 60, 60, 58, 58, 68,
+ 68, 79, 79, 92, 92, 105, 105, 118, 118, 124, 65, 60, 60, 58, 58, 68, 68,
+ 79, 79, 92, 92, 105, 105, 118, 118, 124, 71, 65, 65, 63, 63, 73, 73, 84,
+ 84, 97, 97, 111, 111, 125, 125, 132, 71, 65, 65, 63, 63, 73, 73, 84, 84,
+ 97, 97, 111, 111, 125, 125, 132, 79, 72, 72, 70, 70, 79, 79, 90, 90,
+ 104, 104, 118, 118, 133, 133, 141, 79, 72, 72, 70, 70, 79, 79, 90, 90,
+ 104, 104, 118, 118, 133, 133, 141, 82, 75, 75, 72, 72, 81, 81, 92, 92,
+ 106, 106, 121, 121, 136, 136, 144, 82, 75, 75, 72, 72, 81, 81, 92, 92,
+ 106, 106, 121, 121, 136, 136, 144, 87, 79, 79, 76, 76, 84, 84, 96, 96,
+ 109, 109, 124, 124, 141, 141, 149,
/* Size 4x16 */
- 31, 36, 53, 79, 32, 35, 51, 75, 32, 34, 49, 72, 32, 36, 50, 71, 33, 38,
- 49, 69, 34, 42, 54, 73, 34, 48, 60, 78, 37, 50, 65, 84, 41, 53, 71, 90,
- 45, 56, 76, 96, 49, 60, 82, 103, 54, 63, 87, 110, 60, 68, 92, 118, 65,
- 73, 97, 125, 72, 79, 104, 133, 75, 81, 106, 136,
- /* Size 16x4 */
31, 32, 32, 32, 33, 34, 34, 37, 41, 45, 49, 54, 60, 65, 72, 75, 36, 35,
34, 36, 38, 42, 48, 50, 53, 56, 60, 63, 68, 73, 79, 81, 53, 51, 49, 50,
49, 54, 60, 65, 71, 76, 82, 87, 92, 97, 104, 106, 79, 75, 72, 71, 69,
73, 78, 84, 90, 96, 103, 110, 118, 125, 133, 136,
+ /* Size 16x4 */
+ 31, 36, 53, 79, 32, 35, 51, 75, 32, 34, 49, 72, 32, 36, 50, 71, 33, 38,
+ 49, 69, 34, 42, 54, 73, 34, 48, 60, 78, 37, 50, 65, 84, 41, 53, 71, 90,
+ 45, 56, 76, 96, 49, 60, 82, 103, 54, 63, 87, 110, 60, 68, 92, 118, 65,
+ 73, 97, 125, 72, 79, 104, 133, 75, 81, 106, 136,
/* Size 8x32 */
- 32, 31, 32, 36, 44, 53, 65, 79, 31, 32, 32, 35, 42, 51, 62, 75, 31, 32,
- 32, 35, 42, 51, 62, 75, 31, 32, 33, 34, 41, 49, 59, 72, 31, 32, 33, 34,
- 41, 49, 59, 72, 32, 32, 34, 36, 42, 50, 59, 71, 32, 32, 34, 36, 42, 50,
- 59, 71, 32, 33, 35, 38, 42, 49, 58, 69, 32, 33, 35, 38, 42, 49, 58, 69,
- 34, 34, 37, 42, 48, 54, 63, 73, 34, 34, 37, 42, 48, 54, 63, 73, 36, 34,
- 38, 48, 54, 60, 68, 78, 36, 34, 38, 48, 54, 60, 68, 78, 39, 37, 40, 50,
- 58, 65, 73, 84, 39, 37, 40, 50, 58, 65, 73, 84, 44, 41, 43, 53, 63, 71,
- 79, 90, 44, 41, 43, 53, 63, 71, 79, 90, 48, 45, 46, 56, 67, 76, 85, 96,
- 48, 45, 46, 56, 67, 76, 85, 96, 53, 49, 50, 60, 71, 82, 92, 103, 53, 49,
- 50, 60, 71, 82, 92, 103, 58, 54, 54, 63, 75, 87, 98, 110, 58, 54, 54,
- 63, 75, 87, 98, 110, 65, 60, 58, 68, 79, 92, 105, 118, 65, 60, 58, 68,
- 79, 92, 105, 118, 71, 65, 63, 73, 84, 97, 111, 125, 71, 65, 63, 73, 84,
- 97, 111, 125, 79, 72, 70, 79, 90, 104, 118, 133, 79, 72, 70, 79, 90,
- 104, 118, 133, 82, 75, 72, 81, 92, 106, 121, 136, 82, 75, 72, 81, 92,
- 106, 121, 136, 87, 79, 76, 84, 96, 109, 124, 141,
- /* Size 32x8 */
32, 31, 31, 31, 31, 32, 32, 32, 32, 34, 34, 36, 36, 39, 39, 44, 44, 48,
48, 53, 53, 58, 58, 65, 65, 71, 71, 79, 79, 82, 82, 87, 31, 32, 32, 32,
32, 32, 32, 33, 33, 34, 34, 34, 34, 37, 37, 41, 41, 45, 45, 49, 49, 54,
@@ -2692,7 +2676,23 @@
59, 59, 58, 58, 63, 63, 68, 68, 73, 73, 79, 79, 85, 85, 92, 92, 98, 98,
105, 105, 111, 111, 118, 118, 121, 121, 124, 79, 75, 75, 72, 72, 71, 71,
69, 69, 73, 73, 78, 78, 84, 84, 90, 90, 96, 96, 103, 103, 110, 110, 118,
- 118, 125, 125, 133, 133, 136, 136, 141 },
+ 118, 125, 125, 133, 133, 136, 136, 141,
+ /* Size 32x8 */
+ 32, 31, 32, 36, 44, 53, 65, 79, 31, 32, 32, 35, 42, 51, 62, 75, 31, 32,
+ 32, 35, 42, 51, 62, 75, 31, 32, 33, 34, 41, 49, 59, 72, 31, 32, 33, 34,
+ 41, 49, 59, 72, 32, 32, 34, 36, 42, 50, 59, 71, 32, 32, 34, 36, 42, 50,
+ 59, 71, 32, 33, 35, 38, 42, 49, 58, 69, 32, 33, 35, 38, 42, 49, 58, 69,
+ 34, 34, 37, 42, 48, 54, 63, 73, 34, 34, 37, 42, 48, 54, 63, 73, 36, 34,
+ 38, 48, 54, 60, 68, 78, 36, 34, 38, 48, 54, 60, 68, 78, 39, 37, 40, 50,
+ 58, 65, 73, 84, 39, 37, 40, 50, 58, 65, 73, 84, 44, 41, 43, 53, 63, 71,
+ 79, 90, 44, 41, 43, 53, 63, 71, 79, 90, 48, 45, 46, 56, 67, 76, 85, 96,
+ 48, 45, 46, 56, 67, 76, 85, 96, 53, 49, 50, 60, 71, 82, 92, 103, 53, 49,
+ 50, 60, 71, 82, 92, 103, 58, 54, 54, 63, 75, 87, 98, 110, 58, 54, 54,
+ 63, 75, 87, 98, 110, 65, 60, 58, 68, 79, 92, 105, 118, 65, 60, 58, 68,
+ 79, 92, 105, 118, 71, 65, 63, 73, 84, 97, 111, 125, 71, 65, 63, 73, 84,
+ 97, 111, 125, 79, 72, 70, 79, 90, 104, 118, 133, 79, 72, 70, 79, 90,
+ 104, 118, 133, 82, 75, 72, 81, 92, 106, 121, 136, 82, 75, 72, 81, 92,
+ 106, 121, 136, 87, 79, 76, 84, 96, 109, 124, 141 },
{ /* Chroma */
/* Size 4x4 */
32, 46, 47, 57, 46, 53, 54, 60, 47, 54, 66, 75, 57, 60, 75, 89,
@@ -2776,21 +2776,12 @@
91, 93, 67, 63, 63, 60, 60, 59, 59, 57, 57, 60, 60, 62, 62, 66, 66, 69,
69, 72, 72, 76, 76, 80, 80, 84, 84, 88, 88, 92, 92, 93, 93, 95,
/* Size 4x8 */
- 31, 47, 50, 60, 36, 47, 47, 56, 43, 50, 50, 57, 46, 53, 57, 64, 46, 54,
- 64, 71, 50, 55, 68, 78, 54, 58, 72, 85, 59, 61, 75, 90,
- /* Size 8x4 */
31, 36, 43, 46, 46, 50, 54, 59, 47, 47, 50, 53, 54, 55, 58, 61, 50, 47,
50, 57, 64, 68, 72, 75, 60, 56, 57, 64, 71, 78, 85, 90,
+ /* Size 8x4 */
+ 31, 47, 50, 60, 36, 47, 47, 56, 43, 50, 50, 57, 46, 53, 57, 64, 46, 54,
+ 64, 71, 50, 55, 68, 78, 54, 58, 72, 85, 59, 61, 75, 90,
/* Size 8x16 */
- 32, 31, 37, 48, 49, 52, 57, 63, 31, 31, 38, 47, 47, 50, 54, 60, 30, 32,
- 40, 46, 45, 48, 52, 57, 33, 36, 43, 47, 46, 47, 51, 56, 37, 40, 47, 47,
- 45, 47, 50, 54, 42, 43, 47, 50, 49, 50, 53, 57, 49, 46, 48, 53, 53, 54,
- 57, 60, 48, 46, 47, 53, 56, 57, 60, 64, 49, 45, 46, 53, 58, 61, 64, 67,
- 50, 46, 46, 54, 59, 64, 67, 71, 52, 48, 47, 54, 61, 66, 71, 75, 54, 50,
- 49, 55, 62, 68, 73, 78, 57, 52, 50, 56, 64, 70, 76, 82, 60, 54, 52, 58,
- 65, 72, 79, 85, 63, 57, 55, 60, 67, 75, 82, 89, 64, 59, 56, 61, 68, 75,
- 83, 90,
- /* Size 16x8 */
32, 31, 30, 33, 37, 42, 49, 48, 49, 50, 52, 54, 57, 60, 63, 64, 31, 31,
32, 36, 40, 43, 46, 46, 45, 46, 48, 50, 52, 54, 57, 59, 37, 38, 40, 43,
47, 47, 48, 47, 46, 46, 47, 49, 50, 52, 55, 56, 48, 47, 46, 47, 47, 50,
@@ -2799,37 +2790,16 @@
66, 68, 70, 72, 75, 75, 57, 54, 52, 51, 50, 53, 57, 60, 64, 67, 71, 73,
76, 79, 82, 83, 63, 60, 57, 56, 54, 57, 60, 64, 67, 71, 75, 78, 82, 85,
89, 90,
+ /* Size 16x8 */
+ 32, 31, 37, 48, 49, 52, 57, 63, 31, 31, 38, 47, 47, 50, 54, 60, 30, 32,
+ 40, 46, 45, 48, 52, 57, 33, 36, 43, 47, 46, 47, 51, 56, 37, 40, 47, 47,
+ 45, 47, 50, 54, 42, 43, 47, 50, 49, 50, 53, 57, 49, 46, 48, 53, 53, 54,
+ 57, 60, 48, 46, 47, 53, 56, 57, 60, 64, 49, 45, 46, 53, 58, 61, 64, 67,
+ 50, 46, 46, 54, 59, 64, 67, 71, 52, 48, 47, 54, 61, 66, 71, 75, 54, 50,
+ 49, 55, 62, 68, 73, 78, 57, 52, 50, 56, 64, 70, 76, 82, 60, 54, 52, 58,
+ 65, 72, 79, 85, 63, 57, 55, 60, 67, 75, 82, 89, 64, 59, 56, 61, 68, 75,
+ 83, 90,
/* Size 16x32 */
- 32, 31, 31, 37, 37, 48, 48, 49, 49, 52, 52, 57, 57, 63, 63, 66, 31, 31,
- 31, 38, 38, 47, 47, 47, 47, 50, 50, 54, 54, 60, 60, 63, 31, 31, 31, 38,
- 38, 47, 47, 47, 47, 50, 50, 54, 54, 60, 60, 63, 30, 32, 32, 40, 40, 46,
- 46, 45, 45, 48, 48, 52, 52, 57, 57, 60, 30, 32, 32, 40, 40, 46, 46, 45,
- 45, 48, 48, 52, 52, 57, 57, 60, 33, 36, 36, 43, 43, 47, 47, 46, 46, 47,
- 47, 51, 51, 56, 56, 59, 33, 36, 36, 43, 43, 47, 47, 46, 46, 47, 47, 51,
- 51, 56, 56, 59, 37, 40, 40, 47, 47, 47, 47, 45, 45, 47, 47, 50, 50, 54,
- 54, 57, 37, 40, 40, 47, 47, 47, 47, 45, 45, 47, 47, 50, 50, 54, 54, 57,
- 42, 43, 43, 47, 47, 50, 50, 49, 49, 50, 50, 53, 53, 57, 57, 60, 42, 43,
- 43, 47, 47, 50, 50, 49, 49, 50, 50, 53, 53, 57, 57, 60, 49, 46, 46, 48,
- 48, 53, 53, 53, 53, 54, 54, 57, 57, 60, 60, 62, 49, 46, 46, 48, 48, 53,
- 53, 53, 53, 54, 54, 57, 57, 60, 60, 62, 48, 46, 46, 47, 47, 53, 53, 56,
- 56, 57, 57, 60, 60, 64, 64, 66, 48, 46, 46, 47, 47, 53, 53, 56, 56, 57,
- 57, 60, 60, 64, 64, 66, 49, 45, 45, 46, 46, 53, 53, 58, 58, 61, 61, 64,
- 64, 67, 67, 69, 49, 45, 45, 46, 46, 53, 53, 58, 58, 61, 61, 64, 64, 67,
- 67, 69, 50, 46, 46, 46, 46, 54, 54, 59, 59, 64, 64, 67, 67, 71, 71, 73,
- 50, 46, 46, 46, 46, 54, 54, 59, 59, 64, 64, 67, 67, 71, 71, 73, 52, 48,
- 48, 47, 47, 54, 54, 61, 61, 66, 66, 71, 71, 75, 75, 77, 52, 48, 48, 47,
- 47, 54, 54, 61, 61, 66, 66, 71, 71, 75, 75, 77, 54, 50, 50, 49, 49, 55,
- 55, 62, 62, 68, 68, 73, 73, 78, 78, 80, 54, 50, 50, 49, 49, 55, 55, 62,
- 62, 68, 68, 73, 73, 78, 78, 80, 57, 52, 52, 50, 50, 56, 56, 64, 64, 70,
- 70, 76, 76, 82, 82, 84, 57, 52, 52, 50, 50, 56, 56, 64, 64, 70, 70, 76,
- 76, 82, 82, 84, 60, 54, 54, 52, 52, 58, 58, 65, 65, 72, 72, 79, 79, 85,
- 85, 88, 60, 54, 54, 52, 52, 58, 58, 65, 65, 72, 72, 79, 79, 85, 85, 88,
- 63, 57, 57, 55, 55, 60, 60, 67, 67, 75, 75, 82, 82, 89, 89, 92, 63, 57,
- 57, 55, 55, 60, 60, 67, 67, 75, 75, 82, 82, 89, 89, 92, 64, 59, 59, 56,
- 56, 61, 61, 68, 68, 75, 75, 83, 83, 90, 90, 93, 64, 59, 59, 56, 56, 61,
- 61, 68, 68, 75, 75, 83, 83, 90, 90, 93, 66, 60, 60, 57, 57, 63, 63, 69,
- 69, 77, 77, 84, 84, 92, 92, 95,
- /* Size 32x16 */
32, 31, 31, 30, 30, 33, 33, 37, 37, 42, 42, 49, 49, 48, 48, 49, 49, 50,
50, 52, 52, 54, 54, 57, 57, 60, 60, 63, 63, 64, 64, 66, 31, 31, 31, 32,
32, 36, 36, 40, 40, 43, 43, 46, 46, 46, 46, 45, 45, 46, 46, 48, 48, 50,
@@ -2859,33 +2829,47 @@
75, 78, 78, 82, 82, 85, 85, 89, 89, 90, 90, 92, 66, 63, 63, 60, 60, 59,
59, 57, 57, 60, 60, 62, 62, 66, 66, 69, 69, 73, 73, 77, 77, 80, 80, 84,
84, 88, 88, 92, 92, 93, 93, 95,
+ /* Size 32x16 */
+ 32, 31, 31, 37, 37, 48, 48, 49, 49, 52, 52, 57, 57, 63, 63, 66, 31, 31,
+ 31, 38, 38, 47, 47, 47, 47, 50, 50, 54, 54, 60, 60, 63, 31, 31, 31, 38,
+ 38, 47, 47, 47, 47, 50, 50, 54, 54, 60, 60, 63, 30, 32, 32, 40, 40, 46,
+ 46, 45, 45, 48, 48, 52, 52, 57, 57, 60, 30, 32, 32, 40, 40, 46, 46, 45,
+ 45, 48, 48, 52, 52, 57, 57, 60, 33, 36, 36, 43, 43, 47, 47, 46, 46, 47,
+ 47, 51, 51, 56, 56, 59, 33, 36, 36, 43, 43, 47, 47, 46, 46, 47, 47, 51,
+ 51, 56, 56, 59, 37, 40, 40, 47, 47, 47, 47, 45, 45, 47, 47, 50, 50, 54,
+ 54, 57, 37, 40, 40, 47, 47, 47, 47, 45, 45, 47, 47, 50, 50, 54, 54, 57,
+ 42, 43, 43, 47, 47, 50, 50, 49, 49, 50, 50, 53, 53, 57, 57, 60, 42, 43,
+ 43, 47, 47, 50, 50, 49, 49, 50, 50, 53, 53, 57, 57, 60, 49, 46, 46, 48,
+ 48, 53, 53, 53, 53, 54, 54, 57, 57, 60, 60, 62, 49, 46, 46, 48, 48, 53,
+ 53, 53, 53, 54, 54, 57, 57, 60, 60, 62, 48, 46, 46, 47, 47, 53, 53, 56,
+ 56, 57, 57, 60, 60, 64, 64, 66, 48, 46, 46, 47, 47, 53, 53, 56, 56, 57,
+ 57, 60, 60, 64, 64, 66, 49, 45, 45, 46, 46, 53, 53, 58, 58, 61, 61, 64,
+ 64, 67, 67, 69, 49, 45, 45, 46, 46, 53, 53, 58, 58, 61, 61, 64, 64, 67,
+ 67, 69, 50, 46, 46, 46, 46, 54, 54, 59, 59, 64, 64, 67, 67, 71, 71, 73,
+ 50, 46, 46, 46, 46, 54, 54, 59, 59, 64, 64, 67, 67, 71, 71, 73, 52, 48,
+ 48, 47, 47, 54, 54, 61, 61, 66, 66, 71, 71, 75, 75, 77, 52, 48, 48, 47,
+ 47, 54, 54, 61, 61, 66, 66, 71, 71, 75, 75, 77, 54, 50, 50, 49, 49, 55,
+ 55, 62, 62, 68, 68, 73, 73, 78, 78, 80, 54, 50, 50, 49, 49, 55, 55, 62,
+ 62, 68, 68, 73, 73, 78, 78, 80, 57, 52, 52, 50, 50, 56, 56, 64, 64, 70,
+ 70, 76, 76, 82, 82, 84, 57, 52, 52, 50, 50, 56, 56, 64, 64, 70, 70, 76,
+ 76, 82, 82, 84, 60, 54, 54, 52, 52, 58, 58, 65, 65, 72, 72, 79, 79, 85,
+ 85, 88, 60, 54, 54, 52, 52, 58, 58, 65, 65, 72, 72, 79, 79, 85, 85, 88,
+ 63, 57, 57, 55, 55, 60, 60, 67, 67, 75, 75, 82, 82, 89, 89, 92, 63, 57,
+ 57, 55, 55, 60, 60, 67, 67, 75, 75, 82, 82, 89, 89, 92, 64, 59, 59, 56,
+ 56, 61, 61, 68, 68, 75, 75, 83, 83, 90, 90, 93, 64, 59, 59, 56, 56, 61,
+ 61, 68, 68, 75, 75, 83, 83, 90, 90, 93, 66, 60, 60, 57, 57, 63, 63, 69,
+ 69, 77, 77, 84, 84, 92, 92, 95,
/* Size 4x16 */
- 31, 48, 52, 63, 31, 47, 50, 60, 32, 46, 48, 57, 36, 47, 47, 56, 40, 47,
- 47, 54, 43, 50, 50, 57, 46, 53, 54, 60, 46, 53, 57, 64, 45, 53, 61, 67,
- 46, 54, 64, 71, 48, 54, 66, 75, 50, 55, 68, 78, 52, 56, 70, 82, 54, 58,
- 72, 85, 57, 60, 75, 89, 59, 61, 75, 90,
- /* Size 16x4 */
31, 31, 32, 36, 40, 43, 46, 46, 45, 46, 48, 50, 52, 54, 57, 59, 48, 47,
46, 47, 47, 50, 53, 53, 53, 54, 54, 55, 56, 58, 60, 61, 52, 50, 48, 47,
47, 50, 54, 57, 61, 64, 66, 68, 70, 72, 75, 75, 63, 60, 57, 56, 54, 57,
60, 64, 67, 71, 75, 78, 82, 85, 89, 90,
+ /* Size 16x4 */
+ 31, 48, 52, 63, 31, 47, 50, 60, 32, 46, 48, 57, 36, 47, 47, 56, 40, 47,
+ 47, 54, 43, 50, 50, 57, 46, 53, 54, 60, 46, 53, 57, 64, 45, 53, 61, 67,
+ 46, 54, 64, 71, 48, 54, 66, 75, 50, 55, 68, 78, 52, 56, 70, 82, 54, 58,
+ 72, 85, 57, 60, 75, 89, 59, 61, 75, 90,
/* Size 8x32 */
- 32, 31, 37, 48, 49, 52, 57, 63, 31, 31, 38, 47, 47, 50, 54, 60, 31, 31,
- 38, 47, 47, 50, 54, 60, 30, 32, 40, 46, 45, 48, 52, 57, 30, 32, 40, 46,
- 45, 48, 52, 57, 33, 36, 43, 47, 46, 47, 51, 56, 33, 36, 43, 47, 46, 47,
- 51, 56, 37, 40, 47, 47, 45, 47, 50, 54, 37, 40, 47, 47, 45, 47, 50, 54,
- 42, 43, 47, 50, 49, 50, 53, 57, 42, 43, 47, 50, 49, 50, 53, 57, 49, 46,
- 48, 53, 53, 54, 57, 60, 49, 46, 48, 53, 53, 54, 57, 60, 48, 46, 47, 53,
- 56, 57, 60, 64, 48, 46, 47, 53, 56, 57, 60, 64, 49, 45, 46, 53, 58, 61,
- 64, 67, 49, 45, 46, 53, 58, 61, 64, 67, 50, 46, 46, 54, 59, 64, 67, 71,
- 50, 46, 46, 54, 59, 64, 67, 71, 52, 48, 47, 54, 61, 66, 71, 75, 52, 48,
- 47, 54, 61, 66, 71, 75, 54, 50, 49, 55, 62, 68, 73, 78, 54, 50, 49, 55,
- 62, 68, 73, 78, 57, 52, 50, 56, 64, 70, 76, 82, 57, 52, 50, 56, 64, 70,
- 76, 82, 60, 54, 52, 58, 65, 72, 79, 85, 60, 54, 52, 58, 65, 72, 79, 85,
- 63, 57, 55, 60, 67, 75, 82, 89, 63, 57, 55, 60, 67, 75, 82, 89, 64, 59,
- 56, 61, 68, 75, 83, 90, 64, 59, 56, 61, 68, 75, 83, 90, 66, 60, 57, 63,
- 69, 77, 84, 92,
- /* Size 32x8 */
32, 31, 31, 30, 30, 33, 33, 37, 37, 42, 42, 49, 49, 48, 48, 49, 49, 50,
50, 52, 52, 54, 54, 57, 57, 60, 60, 63, 63, 64, 64, 66, 31, 31, 31, 32,
32, 36, 36, 40, 40, 43, 43, 46, 46, 46, 46, 45, 45, 46, 46, 48, 48, 50,
@@ -2900,7 +2884,23 @@
51, 50, 50, 53, 53, 57, 57, 60, 60, 64, 64, 67, 67, 71, 71, 73, 73, 76,
76, 79, 79, 82, 82, 83, 83, 84, 63, 60, 60, 57, 57, 56, 56, 54, 54, 57,
57, 60, 60, 64, 64, 67, 67, 71, 71, 75, 75, 78, 78, 82, 82, 85, 85, 89,
- 89, 90, 90, 92 },
+ 89, 90, 90, 92,
+ /* Size 32x8 */
+ 32, 31, 37, 48, 49, 52, 57, 63, 31, 31, 38, 47, 47, 50, 54, 60, 31, 31,
+ 38, 47, 47, 50, 54, 60, 30, 32, 40, 46, 45, 48, 52, 57, 30, 32, 40, 46,
+ 45, 48, 52, 57, 33, 36, 43, 47, 46, 47, 51, 56, 33, 36, 43, 47, 46, 47,
+ 51, 56, 37, 40, 47, 47, 45, 47, 50, 54, 37, 40, 47, 47, 45, 47, 50, 54,
+ 42, 43, 47, 50, 49, 50, 53, 57, 42, 43, 47, 50, 49, 50, 53, 57, 49, 46,
+ 48, 53, 53, 54, 57, 60, 49, 46, 48, 53, 53, 54, 57, 60, 48, 46, 47, 53,
+ 56, 57, 60, 64, 48, 46, 47, 53, 56, 57, 60, 64, 49, 45, 46, 53, 58, 61,
+ 64, 67, 49, 45, 46, 53, 58, 61, 64, 67, 50, 46, 46, 54, 59, 64, 67, 71,
+ 50, 46, 46, 54, 59, 64, 67, 71, 52, 48, 47, 54, 61, 66, 71, 75, 52, 48,
+ 47, 54, 61, 66, 71, 75, 54, 50, 49, 55, 62, 68, 73, 78, 54, 50, 49, 55,
+ 62, 68, 73, 78, 57, 52, 50, 56, 64, 70, 76, 82, 57, 52, 50, 56, 64, 70,
+ 76, 82, 60, 54, 52, 58, 65, 72, 79, 85, 60, 54, 52, 58, 65, 72, 79, 85,
+ 63, 57, 55, 60, 67, 75, 82, 89, 63, 57, 55, 60, 67, 75, 82, 89, 64, 59,
+ 56, 61, 68, 75, 83, 90, 64, 59, 56, 61, 68, 75, 83, 90, 66, 60, 57, 63,
+ 69, 77, 84, 92 },
},
{
{ /* Luma */
@@ -2988,21 +2988,12 @@
84, 86, 90, 91, 96, 96, 103, 104, 108, 110, 114, 118, 120, 125, 126,
134, 134,
/* Size 4x8 */
- 32, 34, 43, 62, 32, 34, 42, 59, 33, 37, 44, 58, 35, 43, 54, 68, 41, 48,
- 64, 79, 49, 54, 71, 91, 57, 60, 78, 101, 66, 68, 86, 111,
- /* Size 8x4 */
32, 32, 33, 35, 41, 49, 57, 66, 34, 34, 37, 43, 48, 54, 60, 68, 43, 42,
44, 54, 64, 71, 78, 86, 62, 59, 58, 68, 79, 91, 101, 111,
+ /* Size 8x4 */
+ 32, 34, 43, 62, 32, 34, 42, 59, 33, 37, 44, 58, 35, 43, 54, 68, 41, 48,
+ 64, 79, 49, 54, 71, 91, 57, 60, 78, 101, 66, 68, 86, 111,
/* Size 8x16 */
- 32, 31, 32, 36, 44, 53, 62, 73, 31, 32, 32, 35, 42, 51, 59, 69, 31, 32,
- 33, 34, 41, 49, 57, 66, 32, 32, 34, 36, 42, 50, 57, 65, 32, 33, 35, 38,
- 42, 49, 56, 64, 34, 34, 37, 42, 48, 54, 61, 69, 35, 34, 38, 47, 52, 59,
- 65, 73, 38, 36, 40, 49, 56, 63, 69, 77, 41, 39, 41, 51, 60, 67, 74, 81,
- 44, 42, 43, 54, 64, 72, 79, 86, 48, 45, 46, 56, 67, 76, 83, 91, 53, 49,
- 50, 60, 71, 82, 90, 99, 58, 54, 54, 63, 75, 87, 95, 105, 65, 60, 58, 68,
- 79, 92, 102, 112, 71, 65, 63, 73, 84, 97, 108, 119, 79, 72, 70, 79, 90,
- 104, 115, 127,
- /* Size 16x8 */
32, 31, 31, 32, 32, 34, 35, 38, 41, 44, 48, 53, 58, 65, 71, 79, 31, 32,
32, 32, 33, 34, 34, 36, 39, 42, 45, 49, 54, 60, 65, 72, 32, 32, 33, 34,
35, 37, 38, 40, 41, 43, 46, 50, 54, 58, 63, 70, 36, 35, 34, 36, 38, 42,
@@ -3011,37 +3002,16 @@
76, 82, 87, 92, 97, 104, 62, 59, 57, 57, 56, 61, 65, 69, 74, 79, 83, 90,
95, 102, 108, 115, 73, 69, 66, 65, 64, 69, 73, 77, 81, 86, 91, 99, 105,
112, 119, 127,
+ /* Size 16x8 */
+ 32, 31, 32, 36, 44, 53, 62, 73, 31, 32, 32, 35, 42, 51, 59, 69, 31, 32,
+ 33, 34, 41, 49, 57, 66, 32, 32, 34, 36, 42, 50, 57, 65, 32, 33, 35, 38,
+ 42, 49, 56, 64, 34, 34, 37, 42, 48, 54, 61, 69, 35, 34, 38, 47, 52, 59,
+ 65, 73, 38, 36, 40, 49, 56, 63, 69, 77, 41, 39, 41, 51, 60, 67, 74, 81,
+ 44, 42, 43, 54, 64, 72, 79, 86, 48, 45, 46, 56, 67, 76, 83, 91, 53, 49,
+ 50, 60, 71, 82, 90, 99, 58, 54, 54, 63, 75, 87, 95, 105, 65, 60, 58, 68,
+ 79, 92, 102, 112, 71, 65, 63, 73, 84, 97, 108, 119, 79, 72, 70, 79, 90,
+ 104, 115, 127,
/* Size 16x32 */
- 32, 31, 31, 32, 32, 34, 36, 38, 44, 44, 53, 53, 62, 65, 73, 79, 31, 32,
- 32, 32, 32, 34, 35, 37, 42, 43, 51, 51, 60, 62, 70, 75, 31, 32, 32, 32,
- 32, 34, 35, 37, 42, 43, 51, 51, 59, 62, 69, 75, 31, 32, 32, 32, 32, 33,
- 35, 36, 41, 42, 50, 50, 58, 60, 67, 73, 31, 32, 32, 32, 33, 33, 34, 36,
- 41, 41, 49, 49, 57, 59, 66, 72, 31, 32, 32, 33, 33, 34, 35, 37, 41, 42,
- 49, 49, 57, 59, 66, 71, 32, 32, 32, 33, 34, 35, 36, 38, 42, 43, 50, 50,
- 57, 59, 65, 71, 32, 32, 32, 34, 34, 35, 37, 38, 42, 43, 49, 49, 56, 59,
- 65, 70, 32, 32, 33, 34, 35, 37, 38, 39, 42, 43, 49, 49, 56, 58, 64, 69,
- 32, 33, 33, 34, 35, 37, 39, 40, 43, 44, 50, 50, 56, 58, 64, 69, 34, 34,
- 34, 36, 37, 39, 42, 44, 48, 48, 54, 54, 61, 63, 69, 73, 34, 34, 34, 36,
- 37, 39, 42, 44, 48, 48, 54, 54, 61, 63, 69, 73, 35, 34, 34, 37, 38, 42,
- 47, 48, 52, 53, 59, 59, 65, 67, 73, 77, 36, 35, 34, 37, 38, 43, 48, 49,
- 54, 54, 60, 60, 66, 68, 74, 78, 38, 36, 36, 38, 40, 44, 49, 51, 56, 57,
- 63, 63, 69, 71, 77, 81, 39, 38, 37, 40, 40, 45, 50, 52, 58, 58, 65, 65,
- 71, 73, 79, 84, 41, 39, 39, 41, 41, 46, 51, 54, 60, 60, 67, 67, 74, 76,
- 81, 86, 44, 41, 41, 42, 43, 48, 53, 56, 63, 64, 71, 71, 78, 79, 85, 90,
- 44, 42, 42, 43, 43, 48, 54, 56, 64, 64, 72, 72, 79, 81, 86, 91, 48, 45,
- 45, 46, 46, 51, 56, 59, 67, 67, 76, 76, 83, 85, 91, 96, 48, 45, 45, 46,
- 46, 51, 56, 59, 67, 67, 76, 76, 83, 85, 91, 96, 53, 49, 49, 49, 49, 54,
- 59, 62, 71, 71, 81, 81, 89, 91, 98, 103, 53, 50, 49, 50, 50, 54, 60, 63,
- 71, 72, 82, 82, 90, 92, 99, 103, 57, 53, 52, 52, 52, 57, 62, 65, 74, 75,
- 85, 85, 94, 96, 103, 108, 58, 54, 54, 54, 54, 58, 63, 67, 75, 76, 87,
- 87, 95, 98, 105, 110, 61, 57, 57, 56, 56, 60, 66, 69, 77, 78, 89, 89,
- 98, 101, 108, 114, 65, 60, 60, 59, 58, 63, 68, 71, 79, 80, 92, 92, 102,
- 105, 112, 118, 67, 62, 61, 60, 60, 64, 69, 72, 81, 82, 94, 94, 103, 106,
- 114, 120, 71, 66, 65, 64, 63, 68, 73, 76, 84, 85, 97, 97, 108, 111, 119,
- 125, 72, 66, 66, 64, 64, 68, 73, 76, 85, 86, 98, 98, 108, 111, 119, 125,
- 79, 73, 72, 71, 70, 74, 79, 82, 90, 91, 104, 104, 115, 118, 127, 133,
- 79, 73, 72, 71, 70, 74, 79, 82, 90, 91, 104, 104, 115, 118, 127, 133,
- /* Size 32x16 */
32, 31, 31, 31, 31, 31, 32, 32, 32, 32, 34, 34, 35, 36, 38, 39, 41, 44,
44, 48, 48, 53, 53, 57, 58, 61, 65, 67, 71, 72, 79, 79, 31, 32, 32, 32,
32, 32, 32, 32, 32, 33, 34, 34, 34, 35, 36, 38, 39, 41, 42, 45, 45, 49,
@@ -3072,33 +3042,47 @@
127, 127, 79, 75, 75, 73, 72, 71, 71, 70, 69, 69, 73, 73, 77, 78, 81,
84, 86, 90, 91, 96, 96, 103, 103, 108, 110, 114, 118, 120, 125, 125,
133, 133,
+ /* Size 32x16 */
+ 32, 31, 31, 32, 32, 34, 36, 38, 44, 44, 53, 53, 62, 65, 73, 79, 31, 32,
+ 32, 32, 32, 34, 35, 37, 42, 43, 51, 51, 60, 62, 70, 75, 31, 32, 32, 32,
+ 32, 34, 35, 37, 42, 43, 51, 51, 59, 62, 69, 75, 31, 32, 32, 32, 32, 33,
+ 35, 36, 41, 42, 50, 50, 58, 60, 67, 73, 31, 32, 32, 32, 33, 33, 34, 36,
+ 41, 41, 49, 49, 57, 59, 66, 72, 31, 32, 32, 33, 33, 34, 35, 37, 41, 42,
+ 49, 49, 57, 59, 66, 71, 32, 32, 32, 33, 34, 35, 36, 38, 42, 43, 50, 50,
+ 57, 59, 65, 71, 32, 32, 32, 34, 34, 35, 37, 38, 42, 43, 49, 49, 56, 59,
+ 65, 70, 32, 32, 33, 34, 35, 37, 38, 39, 42, 43, 49, 49, 56, 58, 64, 69,
+ 32, 33, 33, 34, 35, 37, 39, 40, 43, 44, 50, 50, 56, 58, 64, 69, 34, 34,
+ 34, 36, 37, 39, 42, 44, 48, 48, 54, 54, 61, 63, 69, 73, 34, 34, 34, 36,
+ 37, 39, 42, 44, 48, 48, 54, 54, 61, 63, 69, 73, 35, 34, 34, 37, 38, 42,
+ 47, 48, 52, 53, 59, 59, 65, 67, 73, 77, 36, 35, 34, 37, 38, 43, 48, 49,
+ 54, 54, 60, 60, 66, 68, 74, 78, 38, 36, 36, 38, 40, 44, 49, 51, 56, 57,
+ 63, 63, 69, 71, 77, 81, 39, 38, 37, 40, 40, 45, 50, 52, 58, 58, 65, 65,
+ 71, 73, 79, 84, 41, 39, 39, 41, 41, 46, 51, 54, 60, 60, 67, 67, 74, 76,
+ 81, 86, 44, 41, 41, 42, 43, 48, 53, 56, 63, 64, 71, 71, 78, 79, 85, 90,
+ 44, 42, 42, 43, 43, 48, 54, 56, 64, 64, 72, 72, 79, 81, 86, 91, 48, 45,
+ 45, 46, 46, 51, 56, 59, 67, 67, 76, 76, 83, 85, 91, 96, 48, 45, 45, 46,
+ 46, 51, 56, 59, 67, 67, 76, 76, 83, 85, 91, 96, 53, 49, 49, 49, 49, 54,
+ 59, 62, 71, 71, 81, 81, 89, 91, 98, 103, 53, 50, 49, 50, 50, 54, 60, 63,
+ 71, 72, 82, 82, 90, 92, 99, 103, 57, 53, 52, 52, 52, 57, 62, 65, 74, 75,
+ 85, 85, 94, 96, 103, 108, 58, 54, 54, 54, 54, 58, 63, 67, 75, 76, 87,
+ 87, 95, 98, 105, 110, 61, 57, 57, 56, 56, 60, 66, 69, 77, 78, 89, 89,
+ 98, 101, 108, 114, 65, 60, 60, 59, 58, 63, 68, 71, 79, 80, 92, 92, 102,
+ 105, 112, 118, 67, 62, 61, 60, 60, 64, 69, 72, 81, 82, 94, 94, 103, 106,
+ 114, 120, 71, 66, 65, 64, 63, 68, 73, 76, 84, 85, 97, 97, 108, 111, 119,
+ 125, 72, 66, 66, 64, 64, 68, 73, 76, 85, 86, 98, 98, 108, 111, 119, 125,
+ 79, 73, 72, 71, 70, 74, 79, 82, 90, 91, 104, 104, 115, 118, 127, 133,
+ 79, 73, 72, 71, 70, 74, 79, 82, 90, 91, 104, 104, 115, 118, 127, 133,
/* Size 4x16 */
- 31, 34, 44, 65, 32, 34, 43, 62, 32, 33, 41, 59, 32, 35, 43, 59, 32, 37,
- 43, 58, 34, 39, 48, 63, 34, 42, 53, 67, 36, 44, 57, 71, 39, 46, 60, 76,
- 42, 48, 64, 81, 45, 51, 67, 85, 50, 54, 72, 92, 54, 58, 76, 98, 60, 63,
- 80, 105, 66, 68, 85, 111, 73, 74, 91, 118,
- /* Size 16x4 */
31, 32, 32, 32, 32, 34, 34, 36, 39, 42, 45, 50, 54, 60, 66, 73, 34, 34,
33, 35, 37, 39, 42, 44, 46, 48, 51, 54, 58, 63, 68, 74, 44, 43, 41, 43,
43, 48, 53, 57, 60, 64, 67, 72, 76, 80, 85, 91, 65, 62, 59, 59, 58, 63,
67, 71, 76, 81, 85, 92, 98, 105, 111, 118,
+ /* Size 16x4 */
+ 31, 34, 44, 65, 32, 34, 43, 62, 32, 33, 41, 59, 32, 35, 43, 59, 32, 37,
+ 43, 58, 34, 39, 48, 63, 34, 42, 53, 67, 36, 44, 57, 71, 39, 46, 60, 76,
+ 42, 48, 64, 81, 45, 51, 67, 85, 50, 54, 72, 92, 54, 58, 76, 98, 60, 63,
+ 80, 105, 66, 68, 85, 111, 73, 74, 91, 118,
/* Size 8x32 */
- 32, 31, 32, 36, 44, 53, 62, 73, 31, 32, 32, 35, 42, 51, 60, 70, 31, 32,
- 32, 35, 42, 51, 59, 69, 31, 32, 32, 35, 41, 50, 58, 67, 31, 32, 33, 34,
- 41, 49, 57, 66, 31, 32, 33, 35, 41, 49, 57, 66, 32, 32, 34, 36, 42, 50,
- 57, 65, 32, 32, 34, 37, 42, 49, 56, 65, 32, 33, 35, 38, 42, 49, 56, 64,
- 32, 33, 35, 39, 43, 50, 56, 64, 34, 34, 37, 42, 48, 54, 61, 69, 34, 34,
- 37, 42, 48, 54, 61, 69, 35, 34, 38, 47, 52, 59, 65, 73, 36, 34, 38, 48,
- 54, 60, 66, 74, 38, 36, 40, 49, 56, 63, 69, 77, 39, 37, 40, 50, 58, 65,
- 71, 79, 41, 39, 41, 51, 60, 67, 74, 81, 44, 41, 43, 53, 63, 71, 78, 85,
- 44, 42, 43, 54, 64, 72, 79, 86, 48, 45, 46, 56, 67, 76, 83, 91, 48, 45,
- 46, 56, 67, 76, 83, 91, 53, 49, 49, 59, 71, 81, 89, 98, 53, 49, 50, 60,
- 71, 82, 90, 99, 57, 52, 52, 62, 74, 85, 94, 103, 58, 54, 54, 63, 75, 87,
- 95, 105, 61, 57, 56, 66, 77, 89, 98, 108, 65, 60, 58, 68, 79, 92, 102,
- 112, 67, 61, 60, 69, 81, 94, 103, 114, 71, 65, 63, 73, 84, 97, 108, 119,
- 72, 66, 64, 73, 85, 98, 108, 119, 79, 72, 70, 79, 90, 104, 115, 127, 79,
- 72, 70, 79, 90, 104, 115, 127,
- /* Size 32x8 */
32, 31, 31, 31, 31, 31, 32, 32, 32, 32, 34, 34, 35, 36, 38, 39, 41, 44,
44, 48, 48, 53, 53, 57, 58, 61, 65, 67, 71, 72, 79, 79, 31, 32, 32, 32,
32, 32, 32, 32, 33, 33, 34, 34, 34, 34, 36, 37, 39, 41, 42, 45, 45, 49,
@@ -3113,7 +3097,23 @@
57, 57, 56, 56, 56, 61, 61, 65, 66, 69, 71, 74, 78, 79, 83, 83, 89, 90,
94, 95, 98, 102, 103, 108, 108, 115, 115, 73, 70, 69, 67, 66, 66, 65,
65, 64, 64, 69, 69, 73, 74, 77, 79, 81, 85, 86, 91, 91, 98, 99, 103,
- 105, 108, 112, 114, 119, 119, 127, 127 },
+ 105, 108, 112, 114, 119, 119, 127, 127,
+ /* Size 32x8 */
+ 32, 31, 32, 36, 44, 53, 62, 73, 31, 32, 32, 35, 42, 51, 60, 70, 31, 32,
+ 32, 35, 42, 51, 59, 69, 31, 32, 32, 35, 41, 50, 58, 67, 31, 32, 33, 34,
+ 41, 49, 57, 66, 31, 32, 33, 35, 41, 49, 57, 66, 32, 32, 34, 36, 42, 50,
+ 57, 65, 32, 32, 34, 37, 42, 49, 56, 65, 32, 33, 35, 38, 42, 49, 56, 64,
+ 32, 33, 35, 39, 43, 50, 56, 64, 34, 34, 37, 42, 48, 54, 61, 69, 34, 34,
+ 37, 42, 48, 54, 61, 69, 35, 34, 38, 47, 52, 59, 65, 73, 36, 34, 38, 48,
+ 54, 60, 66, 74, 38, 36, 40, 49, 56, 63, 69, 77, 39, 37, 40, 50, 58, 65,
+ 71, 79, 41, 39, 41, 51, 60, 67, 74, 81, 44, 41, 43, 53, 63, 71, 78, 85,
+ 44, 42, 43, 54, 64, 72, 79, 86, 48, 45, 46, 56, 67, 76, 83, 91, 48, 45,
+ 46, 56, 67, 76, 83, 91, 53, 49, 49, 59, 71, 81, 89, 98, 53, 49, 50, 60,
+ 71, 82, 90, 99, 57, 52, 52, 62, 74, 85, 94, 103, 58, 54, 54, 63, 75, 87,
+ 95, 105, 61, 57, 56, 66, 77, 89, 98, 108, 65, 60, 58, 68, 79, 92, 102,
+ 112, 67, 61, 60, 69, 81, 94, 103, 114, 71, 65, 63, 73, 84, 97, 108, 119,
+ 72, 66, 64, 73, 85, 98, 108, 119, 79, 72, 70, 79, 90, 104, 115, 127, 79,
+ 72, 70, 79, 90, 104, 115, 127 },
{ /* Chroma */
/* Size 4x4 */
31, 42, 47, 53, 42, 48, 50, 54, 47, 50, 61, 67, 53, 54, 67, 78,
@@ -3197,21 +3197,12 @@
89, 89, 63, 60, 60, 58, 57, 57, 56, 55, 54, 55, 57, 57, 60, 60, 62, 63,
65, 67, 68, 71, 71, 74, 75, 77, 78, 80, 82, 83, 85, 85, 89, 89,
/* Size 4x8 */
- 31, 42, 47, 54, 33, 44, 45, 51, 40, 47, 46, 50, 47, 50, 54, 57, 45, 49,
- 59, 64, 48, 50, 61, 70, 51, 52, 63, 75, 55, 55, 66, 79,
- /* Size 8x4 */
31, 33, 40, 47, 45, 48, 51, 55, 42, 44, 47, 50, 49, 50, 52, 55, 47, 45,
46, 54, 59, 61, 63, 66, 54, 51, 50, 57, 64, 70, 75, 79,
+ /* Size 8x4 */
+ 31, 42, 47, 54, 33, 44, 45, 51, 40, 47, 46, 50, 47, 50, 54, 57, 45, 49,
+ 59, 64, 48, 50, 61, 70, 51, 52, 63, 75, 55, 55, 66, 79,
/* Size 8x16 */
- 32, 31, 37, 48, 49, 52, 56, 61, 31, 31, 38, 47, 47, 50, 53, 57, 30, 32,
- 40, 46, 45, 48, 51, 55, 33, 36, 43, 47, 46, 47, 50, 54, 37, 40, 47, 47,
- 45, 47, 49, 52, 42, 43, 47, 50, 49, 50, 53, 56, 47, 46, 48, 52, 53, 53,
- 55, 58, 48, 46, 47, 53, 55, 56, 58, 61, 48, 45, 46, 53, 57, 59, 61, 63,
- 49, 45, 46, 53, 58, 62, 64, 66, 50, 46, 46, 54, 59, 64, 66, 69, 52, 48,
- 47, 54, 61, 66, 70, 73, 54, 50, 49, 55, 62, 68, 72, 76, 57, 52, 50, 56,
- 64, 70, 75, 79, 60, 54, 52, 58, 65, 72, 77, 82, 63, 57, 55, 60, 67, 75,
- 80, 86,
- /* Size 16x8 */
32, 31, 30, 33, 37, 42, 47, 48, 48, 49, 50, 52, 54, 57, 60, 63, 31, 31,
32, 36, 40, 43, 46, 46, 45, 45, 46, 48, 50, 52, 54, 57, 37, 38, 40, 43,
47, 47, 48, 47, 46, 46, 46, 47, 49, 50, 52, 55, 48, 47, 46, 47, 47, 50,
@@ -3220,37 +3211,16 @@
64, 66, 68, 70, 72, 75, 56, 53, 51, 50, 49, 53, 55, 58, 61, 64, 66, 70,
72, 75, 77, 80, 61, 57, 55, 54, 52, 56, 58, 61, 63, 66, 69, 73, 76, 79,
82, 86,
+ /* Size 16x8 */
+ 32, 31, 37, 48, 49, 52, 56, 61, 31, 31, 38, 47, 47, 50, 53, 57, 30, 32,
+ 40, 46, 45, 48, 51, 55, 33, 36, 43, 47, 46, 47, 50, 54, 37, 40, 47, 47,
+ 45, 47, 49, 52, 42, 43, 47, 50, 49, 50, 53, 56, 47, 46, 48, 52, 53, 53,
+ 55, 58, 48, 46, 47, 53, 55, 56, 58, 61, 48, 45, 46, 53, 57, 59, 61, 63,
+ 49, 45, 46, 53, 58, 62, 64, 66, 50, 46, 46, 54, 59, 64, 66, 69, 52, 48,
+ 47, 54, 61, 66, 70, 73, 54, 50, 49, 55, 62, 68, 72, 76, 57, 52, 50, 56,
+ 64, 70, 75, 79, 60, 54, 52, 58, 65, 72, 77, 82, 63, 57, 55, 60, 67, 75,
+ 80, 86,
/* Size 16x32 */
- 32, 31, 31, 35, 37, 42, 48, 48, 49, 49, 52, 52, 56, 57, 61, 63, 31, 31,
- 31, 36, 38, 42, 47, 47, 47, 47, 50, 50, 54, 54, 58, 60, 31, 31, 31, 36,
- 38, 42, 47, 47, 47, 47, 50, 50, 53, 54, 57, 60, 30, 32, 32, 37, 39, 42,
- 46, 46, 46, 46, 48, 48, 52, 52, 56, 58, 30, 32, 32, 37, 40, 42, 46, 46,
- 45, 45, 48, 48, 51, 52, 55, 57, 32, 33, 34, 39, 41, 44, 46, 46, 45, 45,
- 48, 48, 51, 51, 54, 57, 33, 35, 36, 40, 43, 45, 47, 46, 46, 46, 47, 47,
- 50, 51, 54, 56, 34, 37, 37, 42, 44, 45, 47, 47, 45, 46, 47, 47, 50, 51,
- 53, 55, 37, 40, 40, 45, 47, 47, 47, 47, 45, 46, 47, 47, 49, 50, 52, 54,
- 37, 40, 40, 45, 47, 47, 48, 47, 46, 46, 47, 47, 49, 50, 53, 55, 42, 43,
- 43, 46, 47, 48, 50, 50, 49, 49, 50, 50, 53, 53, 56, 57, 42, 43, 43, 46,
- 47, 48, 50, 50, 49, 49, 50, 50, 53, 53, 56, 57, 47, 46, 46, 47, 48, 50,
- 52, 52, 53, 53, 53, 53, 55, 56, 58, 60, 49, 47, 46, 47, 48, 50, 53, 53,
- 53, 54, 54, 54, 56, 57, 59, 60, 48, 46, 46, 47, 47, 50, 53, 53, 55, 55,
- 56, 56, 58, 58, 61, 62, 48, 46, 46, 46, 47, 50, 53, 54, 56, 56, 57, 57,
- 59, 60, 62, 64, 48, 46, 45, 46, 46, 49, 53, 54, 57, 57, 59, 59, 61, 61,
- 63, 65, 49, 45, 45, 45, 46, 49, 53, 55, 58, 59, 61, 61, 63, 64, 66, 67,
- 49, 46, 45, 46, 46, 49, 53, 55, 58, 59, 62, 62, 64, 64, 66, 68, 50, 47,
- 46, 46, 46, 50, 54, 55, 59, 60, 64, 64, 66, 67, 69, 71, 50, 47, 46, 46,
- 46, 50, 54, 55, 59, 60, 64, 64, 66, 67, 69, 71, 52, 48, 48, 47, 47, 50,
- 54, 56, 61, 61, 66, 66, 69, 70, 72, 74, 52, 48, 48, 47, 47, 50, 54, 56,
- 61, 61, 66, 66, 70, 71, 73, 75, 53, 50, 49, 48, 48, 51, 55, 57, 62, 62,
- 68, 68, 71, 72, 75, 77, 54, 50, 50, 49, 49, 52, 55, 57, 62, 63, 68, 68,
- 72, 73, 76, 78, 55, 51, 51, 50, 49, 52, 56, 58, 63, 63, 69, 69, 74, 75,
- 78, 80, 57, 52, 52, 51, 50, 53, 56, 58, 64, 64, 70, 70, 75, 76, 79, 82,
- 58, 53, 53, 51, 51, 54, 57, 59, 64, 65, 71, 71, 76, 77, 80, 83, 60, 55,
- 54, 53, 52, 55, 58, 60, 65, 66, 72, 72, 77, 79, 82, 85, 60, 55, 55, 53,
- 53, 55, 59, 60, 65, 66, 73, 73, 78, 79, 83, 85, 63, 58, 57, 56, 55, 58,
- 60, 62, 67, 68, 75, 75, 80, 82, 86, 89, 63, 58, 57, 56, 55, 58, 60, 62,
- 67, 68, 75, 75, 80, 82, 86, 89,
- /* Size 32x16 */
32, 31, 31, 30, 30, 32, 33, 34, 37, 37, 42, 42, 47, 49, 48, 48, 48, 49,
49, 50, 50, 52, 52, 53, 54, 55, 57, 58, 60, 60, 63, 63, 31, 31, 31, 32,
32, 33, 35, 37, 40, 40, 43, 43, 46, 47, 46, 46, 46, 45, 46, 47, 47, 48,
@@ -3280,33 +3250,47 @@
69, 72, 73, 75, 76, 78, 79, 80, 82, 83, 86, 86, 63, 60, 60, 58, 57, 57,
56, 55, 54, 55, 57, 57, 60, 60, 62, 64, 65, 67, 68, 71, 71, 74, 75, 77,
78, 80, 82, 83, 85, 85, 89, 89,
+ /* Size 32x16 */
+ 32, 31, 31, 35, 37, 42, 48, 48, 49, 49, 52, 52, 56, 57, 61, 63, 31, 31,
+ 31, 36, 38, 42, 47, 47, 47, 47, 50, 50, 54, 54, 58, 60, 31, 31, 31, 36,
+ 38, 42, 47, 47, 47, 47, 50, 50, 53, 54, 57, 60, 30, 32, 32, 37, 39, 42,
+ 46, 46, 46, 46, 48, 48, 52, 52, 56, 58, 30, 32, 32, 37, 40, 42, 46, 46,
+ 45, 45, 48, 48, 51, 52, 55, 57, 32, 33, 34, 39, 41, 44, 46, 46, 45, 45,
+ 48, 48, 51, 51, 54, 57, 33, 35, 36, 40, 43, 45, 47, 46, 46, 46, 47, 47,
+ 50, 51, 54, 56, 34, 37, 37, 42, 44, 45, 47, 47, 45, 46, 47, 47, 50, 51,
+ 53, 55, 37, 40, 40, 45, 47, 47, 47, 47, 45, 46, 47, 47, 49, 50, 52, 54,
+ 37, 40, 40, 45, 47, 47, 48, 47, 46, 46, 47, 47, 49, 50, 53, 55, 42, 43,
+ 43, 46, 47, 48, 50, 50, 49, 49, 50, 50, 53, 53, 56, 57, 42, 43, 43, 46,
+ 47, 48, 50, 50, 49, 49, 50, 50, 53, 53, 56, 57, 47, 46, 46, 47, 48, 50,
+ 52, 52, 53, 53, 53, 53, 55, 56, 58, 60, 49, 47, 46, 47, 48, 50, 53, 53,
+ 53, 54, 54, 54, 56, 57, 59, 60, 48, 46, 46, 47, 47, 50, 53, 53, 55, 55,
+ 56, 56, 58, 58, 61, 62, 48, 46, 46, 46, 47, 50, 53, 54, 56, 56, 57, 57,
+ 59, 60, 62, 64, 48, 46, 45, 46, 46, 49, 53, 54, 57, 57, 59, 59, 61, 61,
+ 63, 65, 49, 45, 45, 45, 46, 49, 53, 55, 58, 59, 61, 61, 63, 64, 66, 67,
+ 49, 46, 45, 46, 46, 49, 53, 55, 58, 59, 62, 62, 64, 64, 66, 68, 50, 47,
+ 46, 46, 46, 50, 54, 55, 59, 60, 64, 64, 66, 67, 69, 71, 50, 47, 46, 46,
+ 46, 50, 54, 55, 59, 60, 64, 64, 66, 67, 69, 71, 52, 48, 48, 47, 47, 50,
+ 54, 56, 61, 61, 66, 66, 69, 70, 72, 74, 52, 48, 48, 47, 47, 50, 54, 56,
+ 61, 61, 66, 66, 70, 71, 73, 75, 53, 50, 49, 48, 48, 51, 55, 57, 62, 62,
+ 68, 68, 71, 72, 75, 77, 54, 50, 50, 49, 49, 52, 55, 57, 62, 63, 68, 68,
+ 72, 73, 76, 78, 55, 51, 51, 50, 49, 52, 56, 58, 63, 63, 69, 69, 74, 75,
+ 78, 80, 57, 52, 52, 51, 50, 53, 56, 58, 64, 64, 70, 70, 75, 76, 79, 82,
+ 58, 53, 53, 51, 51, 54, 57, 59, 64, 65, 71, 71, 76, 77, 80, 83, 60, 55,
+ 54, 53, 52, 55, 58, 60, 65, 66, 72, 72, 77, 79, 82, 85, 60, 55, 55, 53,
+ 53, 55, 59, 60, 65, 66, 73, 73, 78, 79, 83, 85, 63, 58, 57, 56, 55, 58,
+ 60, 62, 67, 68, 75, 75, 80, 82, 86, 89, 63, 58, 57, 56, 55, 58, 60, 62,
+ 67, 68, 75, 75, 80, 82, 86, 89,
/* Size 4x16 */
- 31, 42, 49, 57, 31, 42, 47, 54, 32, 42, 45, 52, 35, 45, 46, 51, 40, 47,
- 46, 50, 43, 48, 49, 53, 46, 50, 53, 56, 46, 50, 55, 58, 46, 49, 57, 61,
- 46, 49, 59, 64, 47, 50, 60, 67, 48, 50, 61, 71, 50, 52, 63, 73, 52, 53,
- 64, 76, 55, 55, 66, 79, 58, 58, 68, 82,
- /* Size 16x4 */
31, 31, 32, 35, 40, 43, 46, 46, 46, 46, 47, 48, 50, 52, 55, 58, 42, 42,
42, 45, 47, 48, 50, 50, 49, 49, 50, 50, 52, 53, 55, 58, 49, 47, 45, 46,
46, 49, 53, 55, 57, 59, 60, 61, 63, 64, 66, 68, 57, 54, 52, 51, 50, 53,
56, 58, 61, 64, 67, 71, 73, 76, 79, 82,
+ /* Size 16x4 */
+ 31, 42, 49, 57, 31, 42, 47, 54, 32, 42, 45, 52, 35, 45, 46, 51, 40, 47,
+ 46, 50, 43, 48, 49, 53, 46, 50, 53, 56, 46, 50, 55, 58, 46, 49, 57, 61,
+ 46, 49, 59, 64, 47, 50, 60, 67, 48, 50, 61, 71, 50, 52, 63, 73, 52, 53,
+ 64, 76, 55, 55, 66, 79, 58, 58, 68, 82,
/* Size 8x32 */
- 32, 31, 37, 48, 49, 52, 56, 61, 31, 31, 38, 47, 47, 50, 54, 58, 31, 31,
- 38, 47, 47, 50, 53, 57, 30, 32, 39, 46, 46, 48, 52, 56, 30, 32, 40, 46,
- 45, 48, 51, 55, 32, 34, 41, 46, 45, 48, 51, 54, 33, 36, 43, 47, 46, 47,
- 50, 54, 34, 37, 44, 47, 45, 47, 50, 53, 37, 40, 47, 47, 45, 47, 49, 52,
- 37, 40, 47, 48, 46, 47, 49, 53, 42, 43, 47, 50, 49, 50, 53, 56, 42, 43,
- 47, 50, 49, 50, 53, 56, 47, 46, 48, 52, 53, 53, 55, 58, 49, 46, 48, 53,
- 53, 54, 56, 59, 48, 46, 47, 53, 55, 56, 58, 61, 48, 46, 47, 53, 56, 57,
- 59, 62, 48, 45, 46, 53, 57, 59, 61, 63, 49, 45, 46, 53, 58, 61, 63, 66,
- 49, 45, 46, 53, 58, 62, 64, 66, 50, 46, 46, 54, 59, 64, 66, 69, 50, 46,
- 46, 54, 59, 64, 66, 69, 52, 48, 47, 54, 61, 66, 69, 72, 52, 48, 47, 54,
- 61, 66, 70, 73, 53, 49, 48, 55, 62, 68, 71, 75, 54, 50, 49, 55, 62, 68,
- 72, 76, 55, 51, 49, 56, 63, 69, 74, 78, 57, 52, 50, 56, 64, 70, 75, 79,
- 58, 53, 51, 57, 64, 71, 76, 80, 60, 54, 52, 58, 65, 72, 77, 82, 60, 55,
- 53, 59, 65, 73, 78, 83, 63, 57, 55, 60, 67, 75, 80, 86, 63, 57, 55, 60,
- 67, 75, 80, 86,
- /* Size 32x8 */
32, 31, 31, 30, 30, 32, 33, 34, 37, 37, 42, 42, 47, 49, 48, 48, 48, 49,
49, 50, 50, 52, 52, 53, 54, 55, 57, 58, 60, 60, 63, 63, 31, 31, 31, 32,
32, 34, 36, 37, 40, 40, 43, 43, 46, 46, 46, 46, 45, 45, 45, 46, 46, 48,
@@ -3321,7 +3305,23 @@
50, 50, 49, 49, 53, 53, 55, 56, 58, 59, 61, 63, 64, 66, 66, 69, 70, 71,
72, 74, 75, 76, 77, 78, 80, 80, 61, 58, 57, 56, 55, 54, 54, 53, 52, 53,
56, 56, 58, 59, 61, 62, 63, 66, 66, 69, 69, 72, 73, 75, 76, 78, 79, 80,
- 82, 83, 86, 86 },
+ 82, 83, 86, 86,
+ /* Size 32x8 */
+ 32, 31, 37, 48, 49, 52, 56, 61, 31, 31, 38, 47, 47, 50, 54, 58, 31, 31,
+ 38, 47, 47, 50, 53, 57, 30, 32, 39, 46, 46, 48, 52, 56, 30, 32, 40, 46,
+ 45, 48, 51, 55, 32, 34, 41, 46, 45, 48, 51, 54, 33, 36, 43, 47, 46, 47,
+ 50, 54, 34, 37, 44, 47, 45, 47, 50, 53, 37, 40, 47, 47, 45, 47, 49, 52,
+ 37, 40, 47, 48, 46, 47, 49, 53, 42, 43, 47, 50, 49, 50, 53, 56, 42, 43,
+ 47, 50, 49, 50, 53, 56, 47, 46, 48, 52, 53, 53, 55, 58, 49, 46, 48, 53,
+ 53, 54, 56, 59, 48, 46, 47, 53, 55, 56, 58, 61, 48, 46, 47, 53, 56, 57,
+ 59, 62, 48, 45, 46, 53, 57, 59, 61, 63, 49, 45, 46, 53, 58, 61, 63, 66,
+ 49, 45, 46, 53, 58, 62, 64, 66, 50, 46, 46, 54, 59, 64, 66, 69, 50, 46,
+ 46, 54, 59, 64, 66, 69, 52, 48, 47, 54, 61, 66, 69, 72, 52, 48, 47, 54,
+ 61, 66, 70, 73, 53, 49, 48, 55, 62, 68, 71, 75, 54, 50, 49, 55, 62, 68,
+ 72, 76, 55, 51, 49, 56, 63, 69, 74, 78, 57, 52, 50, 56, 64, 70, 75, 79,
+ 58, 53, 51, 57, 64, 71, 76, 80, 60, 54, 52, 58, 65, 72, 77, 82, 60, 55,
+ 53, 59, 65, 73, 78, 83, 63, 57, 55, 60, 67, 75, 80, 86, 63, 57, 55, 60,
+ 67, 75, 80, 86 },
},
{
{ /* Luma */
@@ -3408,21 +3408,12 @@
69, 72, 72, 76, 77, 79, 83, 83, 88, 89, 92, 96, 96, 101, 102, 105, 109,
109, 114,
/* Size 4x8 */
- 32, 32, 42, 56, 32, 33, 41, 53, 32, 35, 42, 52, 34, 37, 50, 59, 38, 40,
- 58, 68, 44, 45, 66, 78, 50, 50, 71, 86, 61, 58, 79, 97,
- /* Size 8x4 */
32, 32, 32, 34, 38, 44, 50, 61, 32, 33, 35, 37, 40, 45, 50, 58, 42, 41,
42, 50, 58, 66, 71, 79, 56, 53, 52, 59, 68, 78, 86, 97,
+ /* Size 8x4 */
+ 32, 32, 42, 56, 32, 33, 41, 53, 32, 35, 42, 52, 34, 37, 50, 59, 38, 40,
+ 58, 68, 44, 45, 66, 78, 50, 50, 71, 86, 61, 58, 79, 97,
/* Size 8x16 */
- 32, 31, 32, 35, 39, 44, 53, 65, 31, 32, 32, 35, 38, 42, 51, 62, 31, 32,
- 33, 34, 37, 41, 49, 59, 31, 32, 34, 35, 38, 42, 49, 59, 32, 32, 34, 36,
- 39, 42, 49, 58, 32, 33, 35, 37, 40, 42, 49, 58, 34, 34, 37, 41, 44, 48,
- 54, 63, 36, 34, 38, 46, 50, 54, 60, 68, 38, 37, 40, 47, 52, 57, 64, 72,
- 41, 39, 41, 49, 54, 60, 67, 76, 44, 41, 43, 51, 57, 63, 71, 79, 48, 45,
- 46, 54, 60, 67, 76, 85, 53, 49, 50, 57, 64, 71, 82, 92, 57, 53, 53, 60,
- 67, 74, 86, 97, 61, 56, 56, 63, 69, 77, 89, 100, 65, 60, 58, 66, 72, 79,
- 92, 105,
- /* Size 16x8 */
32, 31, 31, 31, 32, 32, 34, 36, 38, 41, 44, 48, 53, 57, 61, 65, 31, 32,
32, 32, 32, 33, 34, 34, 37, 39, 41, 45, 49, 53, 56, 60, 32, 32, 33, 34,
34, 35, 37, 38, 40, 41, 43, 46, 50, 53, 56, 58, 35, 35, 34, 35, 36, 37,
@@ -3431,37 +3422,16 @@
63, 67, 71, 74, 77, 79, 53, 51, 49, 49, 49, 49, 54, 60, 64, 67, 71, 76,
82, 86, 89, 92, 65, 62, 59, 59, 58, 58, 63, 68, 72, 76, 79, 85, 92, 97,
100, 105,
+ /* Size 16x8 */
+ 32, 31, 32, 35, 39, 44, 53, 65, 31, 32, 32, 35, 38, 42, 51, 62, 31, 32,
+ 33, 34, 37, 41, 49, 59, 31, 32, 34, 35, 38, 42, 49, 59, 32, 32, 34, 36,
+ 39, 42, 49, 58, 32, 33, 35, 37, 40, 42, 49, 58, 34, 34, 37, 41, 44, 48,
+ 54, 63, 36, 34, 38, 46, 50, 54, 60, 68, 38, 37, 40, 47, 52, 57, 64, 72,
+ 41, 39, 41, 49, 54, 60, 67, 76, 44, 41, 43, 51, 57, 63, 71, 79, 48, 45,
+ 46, 54, 60, 67, 76, 85, 53, 49, 50, 57, 64, 71, 82, 92, 57, 53, 53, 60,
+ 67, 74, 86, 97, 61, 56, 56, 63, 69, 77, 89, 100, 65, 60, 58, 66, 72, 79,
+ 92, 105,
/* Size 16x32 */
- 32, 31, 31, 31, 32, 32, 35, 36, 39, 44, 44, 51, 53, 58, 65, 65, 31, 32,
- 32, 32, 32, 32, 35, 35, 38, 42, 42, 49, 52, 56, 63, 63, 31, 32, 32, 32,
- 32, 32, 35, 35, 38, 42, 42, 49, 51, 55, 62, 62, 31, 32, 32, 32, 32, 32,
- 34, 35, 37, 41, 41, 48, 50, 54, 61, 61, 31, 32, 32, 32, 33, 33, 34, 34,
- 37, 41, 41, 47, 49, 53, 59, 59, 31, 32, 32, 32, 33, 33, 34, 34, 37, 41,
- 41, 47, 49, 53, 59, 59, 31, 32, 32, 33, 34, 34, 35, 36, 38, 42, 42, 48,
- 49, 53, 59, 59, 32, 32, 32, 33, 34, 34, 36, 36, 38, 42, 42, 48, 50, 53,
- 59, 59, 32, 32, 32, 33, 34, 34, 36, 37, 39, 42, 42, 48, 49, 53, 58, 58,
- 32, 32, 33, 34, 35, 35, 37, 38, 40, 42, 42, 48, 49, 52, 58, 58, 32, 32,
- 33, 34, 35, 35, 37, 38, 40, 42, 42, 48, 49, 52, 58, 58, 33, 33, 33, 35,
- 36, 36, 40, 41, 43, 46, 46, 52, 53, 56, 62, 62, 34, 34, 34, 35, 37, 37,
- 41, 42, 44, 48, 48, 53, 54, 57, 63, 63, 34, 34, 34, 35, 37, 37, 43, 44,
- 46, 50, 50, 55, 56, 59, 65, 65, 36, 35, 34, 36, 38, 38, 46, 48, 50, 54,
- 54, 58, 60, 63, 68, 68, 36, 35, 34, 36, 38, 38, 46, 48, 50, 54, 54, 58,
- 60, 63, 68, 68, 38, 37, 37, 38, 40, 40, 47, 50, 52, 57, 57, 62, 64, 67,
- 72, 72, 39, 38, 37, 39, 40, 40, 48, 50, 53, 58, 58, 63, 65, 68, 73, 73,
- 41, 39, 39, 40, 41, 41, 49, 51, 54, 60, 60, 66, 67, 70, 76, 76, 44, 41,
- 41, 42, 43, 43, 51, 53, 57, 63, 63, 69, 71, 74, 79, 79, 44, 41, 41, 42,
- 43, 43, 51, 53, 57, 63, 63, 69, 71, 74, 79, 79, 47, 44, 44, 44, 45, 45,
- 53, 56, 59, 66, 66, 73, 75, 78, 84, 84, 48, 45, 45, 45, 46, 46, 54, 56,
- 60, 67, 67, 74, 76, 79, 85, 85, 50, 47, 46, 47, 47, 47, 55, 58, 61, 68,
- 68, 76, 78, 82, 88, 88, 53, 50, 49, 50, 50, 50, 57, 60, 64, 71, 71, 79,
- 82, 86, 92, 92, 53, 50, 49, 50, 50, 50, 57, 60, 64, 71, 71, 79, 82, 86,
- 92, 92, 57, 54, 53, 53, 53, 53, 60, 63, 67, 74, 74, 83, 86, 90, 97, 97,
- 58, 55, 54, 54, 54, 54, 61, 63, 68, 75, 75, 84, 87, 91, 98, 98, 61, 57,
- 56, 56, 56, 56, 63, 65, 69, 77, 77, 86, 89, 93, 100, 100, 65, 61, 60,
- 59, 58, 58, 66, 68, 72, 79, 79, 89, 92, 97, 105, 105, 65, 61, 60, 59,
- 58, 58, 66, 68, 72, 79, 79, 89, 92, 97, 105, 105, 70, 65, 64, 63, 62,
- 62, 70, 72, 76, 83, 83, 93, 96, 101, 109, 109,
- /* Size 32x16 */
32, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 33, 34, 34, 36, 36, 38, 39,
41, 44, 44, 47, 48, 50, 53, 53, 57, 58, 61, 65, 65, 70, 31, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 33, 34, 34, 35, 35, 37, 38, 39, 41, 41, 44,
@@ -3491,33 +3461,47 @@
79, 84, 85, 88, 92, 92, 97, 98, 100, 105, 105, 109, 65, 63, 62, 61, 59,
59, 59, 59, 58, 58, 58, 62, 63, 65, 68, 68, 72, 73, 76, 79, 79, 84, 85,
88, 92, 92, 97, 98, 100, 105, 105, 109,
+ /* Size 32x16 */
+ 32, 31, 31, 31, 32, 32, 35, 36, 39, 44, 44, 51, 53, 58, 65, 65, 31, 32,
+ 32, 32, 32, 32, 35, 35, 38, 42, 42, 49, 52, 56, 63, 63, 31, 32, 32, 32,
+ 32, 32, 35, 35, 38, 42, 42, 49, 51, 55, 62, 62, 31, 32, 32, 32, 32, 32,
+ 34, 35, 37, 41, 41, 48, 50, 54, 61, 61, 31, 32, 32, 32, 33, 33, 34, 34,
+ 37, 41, 41, 47, 49, 53, 59, 59, 31, 32, 32, 32, 33, 33, 34, 34, 37, 41,
+ 41, 47, 49, 53, 59, 59, 31, 32, 32, 33, 34, 34, 35, 36, 38, 42, 42, 48,
+ 49, 53, 59, 59, 32, 32, 32, 33, 34, 34, 36, 36, 38, 42, 42, 48, 50, 53,
+ 59, 59, 32, 32, 32, 33, 34, 34, 36, 37, 39, 42, 42, 48, 49, 53, 58, 58,
+ 32, 32, 33, 34, 35, 35, 37, 38, 40, 42, 42, 48, 49, 52, 58, 58, 32, 32,
+ 33, 34, 35, 35, 37, 38, 40, 42, 42, 48, 49, 52, 58, 58, 33, 33, 33, 35,
+ 36, 36, 40, 41, 43, 46, 46, 52, 53, 56, 62, 62, 34, 34, 34, 35, 37, 37,
+ 41, 42, 44, 48, 48, 53, 54, 57, 63, 63, 34, 34, 34, 35, 37, 37, 43, 44,
+ 46, 50, 50, 55, 56, 59, 65, 65, 36, 35, 34, 36, 38, 38, 46, 48, 50, 54,
+ 54, 58, 60, 63, 68, 68, 36, 35, 34, 36, 38, 38, 46, 48, 50, 54, 54, 58,
+ 60, 63, 68, 68, 38, 37, 37, 38, 40, 40, 47, 50, 52, 57, 57, 62, 64, 67,
+ 72, 72, 39, 38, 37, 39, 40, 40, 48, 50, 53, 58, 58, 63, 65, 68, 73, 73,
+ 41, 39, 39, 40, 41, 41, 49, 51, 54, 60, 60, 66, 67, 70, 76, 76, 44, 41,
+ 41, 42, 43, 43, 51, 53, 57, 63, 63, 69, 71, 74, 79, 79, 44, 41, 41, 42,
+ 43, 43, 51, 53, 57, 63, 63, 69, 71, 74, 79, 79, 47, 44, 44, 44, 45, 45,
+ 53, 56, 59, 66, 66, 73, 75, 78, 84, 84, 48, 45, 45, 45, 46, 46, 54, 56,
+ 60, 67, 67, 74, 76, 79, 85, 85, 50, 47, 46, 47, 47, 47, 55, 58, 61, 68,
+ 68, 76, 78, 82, 88, 88, 53, 50, 49, 50, 50, 50, 57, 60, 64, 71, 71, 79,
+ 82, 86, 92, 92, 53, 50, 49, 50, 50, 50, 57, 60, 64, 71, 71, 79, 82, 86,
+ 92, 92, 57, 54, 53, 53, 53, 53, 60, 63, 67, 74, 74, 83, 86, 90, 97, 97,
+ 58, 55, 54, 54, 54, 54, 61, 63, 68, 75, 75, 84, 87, 91, 98, 98, 61, 57,
+ 56, 56, 56, 56, 63, 65, 69, 77, 77, 86, 89, 93, 100, 100, 65, 61, 60,
+ 59, 58, 58, 66, 68, 72, 79, 79, 89, 92, 97, 105, 105, 65, 61, 60, 59,
+ 58, 58, 66, 68, 72, 79, 79, 89, 92, 97, 105, 105, 70, 65, 64, 63, 62,
+ 62, 70, 72, 76, 83, 83, 93, 96, 101, 109, 109,
/* Size 4x16 */
- 31, 32, 44, 58, 32, 32, 42, 55, 32, 33, 41, 53, 32, 34, 42, 53, 32, 34,
- 42, 53, 32, 35, 42, 52, 34, 37, 48, 57, 35, 38, 54, 63, 37, 40, 57, 67,
- 39, 41, 60, 70, 41, 43, 63, 74, 45, 46, 67, 79, 50, 50, 71, 86, 54, 53,
- 74, 90, 57, 56, 77, 93, 61, 58, 79, 97,
- /* Size 16x4 */
31, 32, 32, 32, 32, 32, 34, 35, 37, 39, 41, 45, 50, 54, 57, 61, 32, 32,
33, 34, 34, 35, 37, 38, 40, 41, 43, 46, 50, 53, 56, 58, 44, 42, 41, 42,
42, 42, 48, 54, 57, 60, 63, 67, 71, 74, 77, 79, 58, 55, 53, 53, 53, 52,
57, 63, 67, 70, 74, 79, 86, 90, 93, 97,
+ /* Size 16x4 */
+ 31, 32, 44, 58, 32, 32, 42, 55, 32, 33, 41, 53, 32, 34, 42, 53, 32, 34,
+ 42, 53, 32, 35, 42, 52, 34, 37, 48, 57, 35, 38, 54, 63, 37, 40, 57, 67,
+ 39, 41, 60, 70, 41, 43, 63, 74, 45, 46, 67, 79, 50, 50, 71, 86, 54, 53,
+ 74, 90, 57, 56, 77, 93, 61, 58, 79, 97,
/* Size 8x32 */
- 32, 31, 32, 35, 39, 44, 53, 65, 31, 32, 32, 35, 38, 42, 52, 63, 31, 32,
- 32, 35, 38, 42, 51, 62, 31, 32, 32, 34, 37, 41, 50, 61, 31, 32, 33, 34,
- 37, 41, 49, 59, 31, 32, 33, 34, 37, 41, 49, 59, 31, 32, 34, 35, 38, 42,
- 49, 59, 32, 32, 34, 36, 38, 42, 50, 59, 32, 32, 34, 36, 39, 42, 49, 58,
- 32, 33, 35, 37, 40, 42, 49, 58, 32, 33, 35, 37, 40, 42, 49, 58, 33, 33,
- 36, 40, 43, 46, 53, 62, 34, 34, 37, 41, 44, 48, 54, 63, 34, 34, 37, 43,
- 46, 50, 56, 65, 36, 34, 38, 46, 50, 54, 60, 68, 36, 34, 38, 46, 50, 54,
- 60, 68, 38, 37, 40, 47, 52, 57, 64, 72, 39, 37, 40, 48, 53, 58, 65, 73,
- 41, 39, 41, 49, 54, 60, 67, 76, 44, 41, 43, 51, 57, 63, 71, 79, 44, 41,
- 43, 51, 57, 63, 71, 79, 47, 44, 45, 53, 59, 66, 75, 84, 48, 45, 46, 54,
- 60, 67, 76, 85, 50, 46, 47, 55, 61, 68, 78, 88, 53, 49, 50, 57, 64, 71,
- 82, 92, 53, 49, 50, 57, 64, 71, 82, 92, 57, 53, 53, 60, 67, 74, 86, 97,
- 58, 54, 54, 61, 68, 75, 87, 98, 61, 56, 56, 63, 69, 77, 89, 100, 65, 60,
- 58, 66, 72, 79, 92, 105, 65, 60, 58, 66, 72, 79, 92, 105, 70, 64, 62,
- 70, 76, 83, 96, 109,
- /* Size 32x8 */
32, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 33, 34, 34, 36, 36, 38, 39,
41, 44, 44, 47, 48, 50, 53, 53, 57, 58, 61, 65, 65, 70, 31, 32, 32, 32,
32, 32, 32, 32, 32, 33, 33, 33, 34, 34, 34, 34, 37, 37, 39, 41, 41, 44,
@@ -3532,7 +3516,23 @@
49, 50, 49, 49, 49, 53, 54, 56, 60, 60, 64, 65, 67, 71, 71, 75, 76, 78,
82, 82, 86, 87, 89, 92, 92, 96, 65, 63, 62, 61, 59, 59, 59, 59, 58, 58,
58, 62, 63, 65, 68, 68, 72, 73, 76, 79, 79, 84, 85, 88, 92, 92, 97, 98,
- 100, 105, 105, 109 },
+ 100, 105, 105, 109,
+ /* Size 32x8 */
+ 32, 31, 32, 35, 39, 44, 53, 65, 31, 32, 32, 35, 38, 42, 52, 63, 31, 32,
+ 32, 35, 38, 42, 51, 62, 31, 32, 32, 34, 37, 41, 50, 61, 31, 32, 33, 34,
+ 37, 41, 49, 59, 31, 32, 33, 34, 37, 41, 49, 59, 31, 32, 34, 35, 38, 42,
+ 49, 59, 32, 32, 34, 36, 38, 42, 50, 59, 32, 32, 34, 36, 39, 42, 49, 58,
+ 32, 33, 35, 37, 40, 42, 49, 58, 32, 33, 35, 37, 40, 42, 49, 58, 33, 33,
+ 36, 40, 43, 46, 53, 62, 34, 34, 37, 41, 44, 48, 54, 63, 34, 34, 37, 43,
+ 46, 50, 56, 65, 36, 34, 38, 46, 50, 54, 60, 68, 36, 34, 38, 46, 50, 54,
+ 60, 68, 38, 37, 40, 47, 52, 57, 64, 72, 39, 37, 40, 48, 53, 58, 65, 73,
+ 41, 39, 41, 49, 54, 60, 67, 76, 44, 41, 43, 51, 57, 63, 71, 79, 44, 41,
+ 43, 51, 57, 63, 71, 79, 47, 44, 45, 53, 59, 66, 75, 84, 48, 45, 46, 54,
+ 60, 67, 76, 85, 50, 46, 47, 55, 61, 68, 78, 88, 53, 49, 50, 57, 64, 71,
+ 82, 92, 53, 49, 50, 57, 64, 71, 82, 92, 57, 53, 53, 60, 67, 74, 86, 97,
+ 58, 54, 54, 61, 68, 75, 87, 98, 61, 56, 56, 63, 69, 77, 89, 100, 65, 60,
+ 58, 66, 72, 79, 92, 105, 65, 60, 58, 66, 72, 79, 92, 105, 70, 64, 62,
+ 70, 76, 83, 96, 109 },
{ /* Chroma */
/* Size 4x4 */
31, 41, 46, 51, 41, 48, 48, 51, 46, 48, 58, 62, 51, 51, 62, 71,
@@ -3616,21 +3616,12 @@
76, 78, 59, 57, 56, 55, 54, 54, 53, 53, 52, 51, 51, 54, 55, 56, 58, 58,
60, 61, 63, 65, 65, 67, 68, 70, 72, 72, 74, 75, 76, 78, 78, 80,
/* Size 4x8 */
- 31, 38, 47, 52, 32, 40, 45, 49, 39, 47, 45, 48, 44, 47, 51, 53, 46, 47,
- 56, 58, 47, 46, 59, 64, 48, 47, 61, 68, 53, 50, 64, 73,
- /* Size 8x4 */
31, 32, 39, 44, 46, 47, 48, 53, 38, 40, 47, 47, 47, 46, 47, 50, 47, 45,
45, 51, 56, 59, 61, 64, 52, 49, 48, 53, 58, 64, 68, 73,
+ /* Size 8x4 */
+ 31, 38, 47, 52, 32, 40, 45, 49, 39, 47, 45, 48, 44, 47, 51, 53, 46, 47,
+ 56, 58, 47, 46, 59, 64, 48, 47, 61, 68, 53, 50, 64, 73,
/* Size 8x16 */
- 32, 31, 37, 45, 48, 49, 52, 57, 31, 31, 38, 45, 47, 47, 50, 54, 30, 32,
- 40, 44, 45, 45, 48, 52, 33, 35, 42, 46, 46, 45, 47, 51, 35, 37, 44, 46,
- 46, 45, 47, 51, 37, 40, 47, 47, 47, 45, 47, 50, 42, 43, 47, 49, 50, 49,
- 50, 53, 49, 46, 48, 52, 53, 53, 54, 57, 48, 46, 47, 51, 54, 55, 57, 59,
- 48, 45, 46, 51, 54, 57, 59, 61, 49, 45, 46, 51, 55, 58, 61, 64, 50, 46,
- 46, 52, 56, 59, 64, 67, 52, 48, 47, 53, 57, 61, 66, 71, 54, 49, 48, 54,
- 58, 62, 68, 73, 55, 51, 49, 54, 58, 63, 69, 74, 57, 52, 50, 55, 59, 64,
- 70, 76,
- /* Size 16x8 */
32, 31, 30, 33, 35, 37, 42, 49, 48, 48, 49, 50, 52, 54, 55, 57, 31, 31,
32, 35, 37, 40, 43, 46, 46, 45, 45, 46, 48, 49, 51, 52, 37, 38, 40, 42,
44, 47, 47, 48, 47, 46, 46, 46, 47, 48, 49, 50, 45, 45, 44, 46, 46, 47,
@@ -3639,37 +3630,16 @@
58, 59, 61, 62, 63, 64, 52, 50, 48, 47, 47, 47, 50, 54, 57, 59, 61, 64,
66, 68, 69, 70, 57, 54, 52, 51, 51, 50, 53, 57, 59, 61, 64, 67, 71, 73,
74, 76,
+ /* Size 16x8 */
+ 32, 31, 37, 45, 48, 49, 52, 57, 31, 31, 38, 45, 47, 47, 50, 54, 30, 32,
+ 40, 44, 45, 45, 48, 52, 33, 35, 42, 46, 46, 45, 47, 51, 35, 37, 44, 46,
+ 46, 45, 47, 51, 37, 40, 47, 47, 47, 45, 47, 50, 42, 43, 47, 49, 50, 49,
+ 50, 53, 49, 46, 48, 52, 53, 53, 54, 57, 48, 46, 47, 51, 54, 55, 57, 59,
+ 48, 45, 46, 51, 54, 57, 59, 61, 49, 45, 46, 51, 55, 58, 61, 64, 50, 46,
+ 46, 52, 56, 59, 64, 67, 52, 48, 47, 53, 57, 61, 66, 71, 54, 49, 48, 54,
+ 58, 62, 68, 73, 55, 51, 49, 54, 58, 63, 69, 74, 57, 52, 50, 55, 59, 64,
+ 70, 76,
/* Size 16x32 */
- 32, 31, 31, 33, 37, 37, 45, 48, 48, 49, 49, 51, 52, 54, 57, 57, 31, 31,
- 31, 34, 38, 38, 45, 47, 47, 47, 47, 50, 50, 52, 55, 55, 31, 31, 31, 34,
- 38, 38, 45, 47, 47, 47, 47, 49, 50, 51, 54, 54, 31, 31, 32, 34, 39, 39,
- 45, 46, 46, 46, 46, 48, 49, 51, 53, 53, 30, 32, 32, 35, 40, 40, 44, 46,
- 45, 45, 45, 47, 48, 49, 52, 52, 30, 32, 32, 35, 40, 40, 44, 46, 45, 45,
- 45, 47, 48, 49, 52, 52, 33, 34, 35, 37, 42, 42, 46, 47, 46, 45, 45, 47,
- 47, 49, 51, 51, 33, 35, 36, 38, 43, 43, 46, 47, 46, 46, 46, 47, 47, 49,
- 51, 51, 35, 37, 37, 40, 44, 44, 46, 47, 46, 45, 45, 47, 47, 48, 51, 51,
- 37, 39, 40, 43, 47, 47, 47, 47, 47, 45, 45, 46, 47, 48, 50, 50, 37, 39,
- 40, 43, 47, 47, 47, 47, 47, 45, 45, 46, 47, 48, 50, 50, 41, 42, 42, 44,
- 47, 47, 49, 49, 49, 48, 48, 49, 50, 51, 52, 52, 42, 42, 43, 44, 47, 47,
- 49, 50, 50, 49, 49, 50, 50, 51, 53, 53, 44, 44, 44, 45, 47, 47, 50, 51,
- 51, 51, 51, 52, 52, 53, 54, 54, 49, 47, 46, 47, 48, 48, 52, 53, 53, 53,
- 53, 54, 54, 55, 57, 57, 49, 47, 46, 47, 48, 48, 52, 53, 53, 53, 53, 54,
- 54, 55, 57, 57, 48, 46, 46, 46, 47, 47, 51, 53, 54, 55, 55, 56, 57, 58,
- 59, 59, 48, 46, 46, 46, 47, 47, 51, 53, 54, 56, 56, 57, 57, 58, 60, 60,
- 48, 46, 45, 46, 46, 46, 51, 53, 54, 57, 57, 58, 59, 60, 61, 61, 49, 46,
- 45, 45, 46, 46, 51, 53, 55, 58, 58, 61, 61, 62, 64, 64, 49, 46, 45, 45,
- 46, 46, 51, 53, 55, 58, 58, 61, 61, 62, 64, 64, 50, 47, 46, 46, 46, 46,
- 52, 54, 56, 59, 59, 62, 63, 64, 66, 66, 50, 47, 46, 46, 46, 46, 52, 54,
- 56, 59, 59, 63, 64, 65, 67, 67, 51, 48, 47, 47, 47, 47, 52, 54, 56, 60,
- 60, 64, 65, 66, 68, 68, 52, 48, 48, 47, 47, 47, 53, 54, 57, 61, 61, 65,
- 66, 68, 71, 71, 52, 48, 48, 47, 47, 47, 53, 54, 57, 61, 61, 65, 66, 68,
- 71, 71, 54, 50, 49, 49, 48, 48, 54, 55, 58, 62, 62, 67, 68, 70, 73, 73,
- 54, 51, 50, 49, 49, 49, 54, 55, 58, 62, 62, 67, 68, 70, 73, 73, 55, 51,
- 51, 50, 49, 49, 54, 56, 58, 63, 63, 68, 69, 71, 74, 74, 57, 53, 52, 51,
- 50, 50, 55, 56, 59, 64, 64, 69, 70, 73, 76, 76, 57, 53, 52, 51, 50, 50,
- 55, 56, 59, 64, 64, 69, 70, 73, 76, 76, 59, 55, 54, 53, 52, 52, 57, 58,
- 61, 65, 65, 70, 72, 74, 78, 78,
- /* Size 32x16 */
32, 31, 31, 31, 30, 30, 33, 33, 35, 37, 37, 41, 42, 44, 49, 49, 48, 48,
48, 49, 49, 50, 50, 51, 52, 52, 54, 54, 55, 57, 57, 59, 31, 31, 31, 31,
32, 32, 34, 35, 37, 39, 39, 42, 42, 44, 47, 47, 46, 46, 46, 46, 46, 47,
@@ -3699,33 +3669,47 @@
64, 66, 67, 68, 71, 71, 73, 73, 74, 76, 76, 78, 57, 55, 54, 53, 52, 52,
51, 51, 51, 50, 50, 52, 53, 54, 57, 57, 59, 60, 61, 64, 64, 66, 67, 68,
71, 71, 73, 73, 74, 76, 76, 78,
+ /* Size 32x16 */
+ 32, 31, 31, 33, 37, 37, 45, 48, 48, 49, 49, 51, 52, 54, 57, 57, 31, 31,
+ 31, 34, 38, 38, 45, 47, 47, 47, 47, 50, 50, 52, 55, 55, 31, 31, 31, 34,
+ 38, 38, 45, 47, 47, 47, 47, 49, 50, 51, 54, 54, 31, 31, 32, 34, 39, 39,
+ 45, 46, 46, 46, 46, 48, 49, 51, 53, 53, 30, 32, 32, 35, 40, 40, 44, 46,
+ 45, 45, 45, 47, 48, 49, 52, 52, 30, 32, 32, 35, 40, 40, 44, 46, 45, 45,
+ 45, 47, 48, 49, 52, 52, 33, 34, 35, 37, 42, 42, 46, 47, 46, 45, 45, 47,
+ 47, 49, 51, 51, 33, 35, 36, 38, 43, 43, 46, 47, 46, 46, 46, 47, 47, 49,
+ 51, 51, 35, 37, 37, 40, 44, 44, 46, 47, 46, 45, 45, 47, 47, 48, 51, 51,
+ 37, 39, 40, 43, 47, 47, 47, 47, 47, 45, 45, 46, 47, 48, 50, 50, 37, 39,
+ 40, 43, 47, 47, 47, 47, 47, 45, 45, 46, 47, 48, 50, 50, 41, 42, 42, 44,
+ 47, 47, 49, 49, 49, 48, 48, 49, 50, 51, 52, 52, 42, 42, 43, 44, 47, 47,
+ 49, 50, 50, 49, 49, 50, 50, 51, 53, 53, 44, 44, 44, 45, 47, 47, 50, 51,
+ 51, 51, 51, 52, 52, 53, 54, 54, 49, 47, 46, 47, 48, 48, 52, 53, 53, 53,
+ 53, 54, 54, 55, 57, 57, 49, 47, 46, 47, 48, 48, 52, 53, 53, 53, 53, 54,
+ 54, 55, 57, 57, 48, 46, 46, 46, 47, 47, 51, 53, 54, 55, 55, 56, 57, 58,
+ 59, 59, 48, 46, 46, 46, 47, 47, 51, 53, 54, 56, 56, 57, 57, 58, 60, 60,
+ 48, 46, 45, 46, 46, 46, 51, 53, 54, 57, 57, 58, 59, 60, 61, 61, 49, 46,
+ 45, 45, 46, 46, 51, 53, 55, 58, 58, 61, 61, 62, 64, 64, 49, 46, 45, 45,
+ 46, 46, 51, 53, 55, 58, 58, 61, 61, 62, 64, 64, 50, 47, 46, 46, 46, 46,
+ 52, 54, 56, 59, 59, 62, 63, 64, 66, 66, 50, 47, 46, 46, 46, 46, 52, 54,
+ 56, 59, 59, 63, 64, 65, 67, 67, 51, 48, 47, 47, 47, 47, 52, 54, 56, 60,
+ 60, 64, 65, 66, 68, 68, 52, 48, 48, 47, 47, 47, 53, 54, 57, 61, 61, 65,
+ 66, 68, 71, 71, 52, 48, 48, 47, 47, 47, 53, 54, 57, 61, 61, 65, 66, 68,
+ 71, 71, 54, 50, 49, 49, 48, 48, 54, 55, 58, 62, 62, 67, 68, 70, 73, 73,
+ 54, 51, 50, 49, 49, 49, 54, 55, 58, 62, 62, 67, 68, 70, 73, 73, 55, 51,
+ 51, 50, 49, 49, 54, 56, 58, 63, 63, 68, 69, 71, 74, 74, 57, 53, 52, 51,
+ 50, 50, 55, 56, 59, 64, 64, 69, 70, 73, 76, 76, 57, 53, 52, 51, 50, 50,
+ 55, 56, 59, 64, 64, 69, 70, 73, 76, 76, 59, 55, 54, 53, 52, 52, 57, 58,
+ 61, 65, 65, 70, 72, 74, 78, 78,
/* Size 4x16 */
- 31, 37, 49, 54, 31, 38, 47, 51, 32, 40, 45, 49, 34, 42, 45, 49, 37, 44,
- 45, 48, 39, 47, 45, 48, 42, 47, 49, 51, 47, 48, 53, 55, 46, 47, 55, 58,
- 46, 46, 57, 60, 46, 46, 58, 62, 47, 46, 59, 65, 48, 47, 61, 68, 50, 48,
- 62, 70, 51, 49, 63, 71, 53, 50, 64, 73,
- /* Size 16x4 */
31, 31, 32, 34, 37, 39, 42, 47, 46, 46, 46, 47, 48, 50, 51, 53, 37, 38,
40, 42, 44, 47, 47, 48, 47, 46, 46, 46, 47, 48, 49, 50, 49, 47, 45, 45,
45, 45, 49, 53, 55, 57, 58, 59, 61, 62, 63, 64, 54, 51, 49, 49, 48, 48,
51, 55, 58, 60, 62, 65, 68, 70, 71, 73,
+ /* Size 16x4 */
+ 31, 37, 49, 54, 31, 38, 47, 51, 32, 40, 45, 49, 34, 42, 45, 49, 37, 44,
+ 45, 48, 39, 47, 45, 48, 42, 47, 49, 51, 47, 48, 53, 55, 46, 47, 55, 58,
+ 46, 46, 57, 60, 46, 46, 58, 62, 47, 46, 59, 65, 48, 47, 61, 68, 50, 48,
+ 62, 70, 51, 49, 63, 71, 53, 50, 64, 73,
/* Size 8x32 */
- 32, 31, 37, 45, 48, 49, 52, 57, 31, 31, 38, 45, 47, 47, 50, 55, 31, 31,
- 38, 45, 47, 47, 50, 54, 31, 32, 39, 45, 46, 46, 49, 53, 30, 32, 40, 44,
- 45, 45, 48, 52, 30, 32, 40, 44, 45, 45, 48, 52, 33, 35, 42, 46, 46, 45,
- 47, 51, 33, 36, 43, 46, 46, 46, 47, 51, 35, 37, 44, 46, 46, 45, 47, 51,
- 37, 40, 47, 47, 47, 45, 47, 50, 37, 40, 47, 47, 47, 45, 47, 50, 41, 42,
- 47, 49, 49, 48, 50, 52, 42, 43, 47, 49, 50, 49, 50, 53, 44, 44, 47, 50,
- 51, 51, 52, 54, 49, 46, 48, 52, 53, 53, 54, 57, 49, 46, 48, 52, 53, 53,
- 54, 57, 48, 46, 47, 51, 54, 55, 57, 59, 48, 46, 47, 51, 54, 56, 57, 60,
- 48, 45, 46, 51, 54, 57, 59, 61, 49, 45, 46, 51, 55, 58, 61, 64, 49, 45,
- 46, 51, 55, 58, 61, 64, 50, 46, 46, 52, 56, 59, 63, 66, 50, 46, 46, 52,
- 56, 59, 64, 67, 51, 47, 47, 52, 56, 60, 65, 68, 52, 48, 47, 53, 57, 61,
- 66, 71, 52, 48, 47, 53, 57, 61, 66, 71, 54, 49, 48, 54, 58, 62, 68, 73,
- 54, 50, 49, 54, 58, 62, 68, 73, 55, 51, 49, 54, 58, 63, 69, 74, 57, 52,
- 50, 55, 59, 64, 70, 76, 57, 52, 50, 55, 59, 64, 70, 76, 59, 54, 52, 57,
- 61, 65, 72, 78,
- /* Size 32x8 */
32, 31, 31, 31, 30, 30, 33, 33, 35, 37, 37, 41, 42, 44, 49, 49, 48, 48,
48, 49, 49, 50, 50, 51, 52, 52, 54, 54, 55, 57, 57, 59, 31, 31, 31, 32,
32, 32, 35, 36, 37, 40, 40, 42, 43, 44, 46, 46, 46, 46, 45, 45, 45, 46,
@@ -3740,7 +3724,23 @@
47, 47, 47, 47, 47, 50, 50, 52, 54, 54, 57, 57, 59, 61, 61, 63, 64, 65,
66, 66, 68, 68, 69, 70, 70, 72, 57, 55, 54, 53, 52, 52, 51, 51, 51, 50,
50, 52, 53, 54, 57, 57, 59, 60, 61, 64, 64, 66, 67, 68, 71, 71, 73, 73,
- 74, 76, 76, 78 },
+ 74, 76, 76, 78,
+ /* Size 32x8 */
+ 32, 31, 37, 45, 48, 49, 52, 57, 31, 31, 38, 45, 47, 47, 50, 55, 31, 31,
+ 38, 45, 47, 47, 50, 54, 31, 32, 39, 45, 46, 46, 49, 53, 30, 32, 40, 44,
+ 45, 45, 48, 52, 30, 32, 40, 44, 45, 45, 48, 52, 33, 35, 42, 46, 46, 45,
+ 47, 51, 33, 36, 43, 46, 46, 46, 47, 51, 35, 37, 44, 46, 46, 45, 47, 51,
+ 37, 40, 47, 47, 47, 45, 47, 50, 37, 40, 47, 47, 47, 45, 47, 50, 41, 42,
+ 47, 49, 49, 48, 50, 52, 42, 43, 47, 49, 50, 49, 50, 53, 44, 44, 47, 50,
+ 51, 51, 52, 54, 49, 46, 48, 52, 53, 53, 54, 57, 49, 46, 48, 52, 53, 53,
+ 54, 57, 48, 46, 47, 51, 54, 55, 57, 59, 48, 46, 47, 51, 54, 56, 57, 60,
+ 48, 45, 46, 51, 54, 57, 59, 61, 49, 45, 46, 51, 55, 58, 61, 64, 49, 45,
+ 46, 51, 55, 58, 61, 64, 50, 46, 46, 52, 56, 59, 63, 66, 50, 46, 46, 52,
+ 56, 59, 64, 67, 51, 47, 47, 52, 56, 60, 65, 68, 52, 48, 47, 53, 57, 61,
+ 66, 71, 52, 48, 47, 53, 57, 61, 66, 71, 54, 49, 48, 54, 58, 62, 68, 73,
+ 54, 50, 49, 54, 58, 62, 68, 73, 55, 51, 49, 54, 58, 63, 69, 74, 57, 52,
+ 50, 55, 59, 64, 70, 76, 57, 52, 50, 55, 59, 64, 70, 76, 59, 54, 52, 57,
+ 61, 65, 72, 78 },
},
{
{ /* Luma */
@@ -3826,21 +3826,12 @@
92, 92, 59, 57, 56, 56, 54, 54, 54, 54, 54, 54, 53, 53, 55, 58, 58, 61,
64, 64, 67, 69, 69, 73, 75, 76, 79, 80, 81, 86, 87, 88, 92, 92,
/* Size 4x8 */
- 32, 32, 37, 52, 32, 33, 36, 49, 32, 34, 38, 49, 34, 37, 44, 54, 35, 38,
- 49, 60, 40, 42, 55, 69, 46, 46, 59, 76, 52, 51, 64, 83,
- /* Size 8x4 */
32, 32, 32, 34, 35, 40, 46, 52, 32, 33, 34, 37, 38, 42, 46, 51, 37, 36,
38, 44, 49, 55, 59, 64, 52, 49, 49, 54, 60, 69, 76, 83,
+ /* Size 8x4 */
+ 32, 32, 37, 52, 32, 33, 36, 49, 32, 34, 38, 49, 34, 37, 44, 54, 35, 38,
+ 49, 60, 40, 42, 55, 69, 46, 46, 59, 76, 52, 51, 64, 83,
/* Size 8x16 */
- 32, 31, 32, 32, 36, 44, 47, 53, 31, 32, 32, 33, 35, 42, 45, 51, 31, 32,
- 32, 33, 35, 41, 44, 49, 31, 32, 33, 33, 35, 41, 44, 49, 32, 32, 34, 34,
- 36, 42, 45, 50, 32, 33, 35, 36, 38, 42, 45, 49, 32, 33, 35, 36, 40, 44,
- 47, 51, 34, 34, 36, 38, 42, 48, 50, 54, 36, 34, 37, 40, 48, 54, 56, 60,
- 38, 36, 39, 41, 49, 56, 58, 63, 39, 37, 40, 42, 50, 58, 60, 65, 44, 41,
- 42, 45, 53, 63, 66, 71, 47, 44, 45, 47, 56, 66, 69, 75, 49, 46, 47, 48,
- 57, 67, 71, 77, 53, 49, 50, 51, 60, 71, 75, 82, 58, 54, 54, 55, 63, 75,
- 79, 87,
- /* Size 16x8 */
32, 31, 31, 31, 32, 32, 32, 34, 36, 38, 39, 44, 47, 49, 53, 58, 31, 32,
32, 32, 32, 33, 33, 34, 34, 36, 37, 41, 44, 46, 49, 54, 32, 32, 32, 33,
34, 35, 35, 36, 37, 39, 40, 42, 45, 47, 50, 54, 32, 33, 33, 33, 34, 36,
@@ -3849,37 +3840,16 @@
58, 63, 66, 67, 71, 75, 47, 45, 44, 44, 45, 45, 47, 50, 56, 58, 60, 66,
69, 71, 75, 79, 53, 51, 49, 49, 50, 49, 51, 54, 60, 63, 65, 71, 75, 77,
82, 87,
+ /* Size 16x8 */
+ 32, 31, 32, 32, 36, 44, 47, 53, 31, 32, 32, 33, 35, 42, 45, 51, 31, 32,
+ 32, 33, 35, 41, 44, 49, 31, 32, 33, 33, 35, 41, 44, 49, 32, 32, 34, 34,
+ 36, 42, 45, 50, 32, 33, 35, 36, 38, 42, 45, 49, 32, 33, 35, 36, 40, 44,
+ 47, 51, 34, 34, 36, 38, 42, 48, 50, 54, 36, 34, 37, 40, 48, 54, 56, 60,
+ 38, 36, 39, 41, 49, 56, 58, 63, 39, 37, 40, 42, 50, 58, 60, 65, 44, 41,
+ 42, 45, 53, 63, 66, 71, 47, 44, 45, 47, 56, 66, 69, 75, 49, 46, 47, 48,
+ 57, 67, 71, 77, 53, 49, 50, 51, 60, 71, 75, 82, 58, 54, 54, 55, 63, 75,
+ 79, 87,
/* Size 16x32 */
- 32, 31, 31, 31, 32, 32, 32, 35, 36, 38, 44, 44, 47, 53, 53, 59, 31, 32,
- 32, 32, 32, 32, 33, 35, 35, 37, 43, 43, 46, 52, 52, 57, 31, 32, 32, 32,
- 32, 32, 33, 35, 35, 37, 42, 42, 45, 51, 51, 56, 31, 32, 32, 32, 32, 32,
- 33, 35, 35, 37, 42, 42, 45, 51, 51, 56, 31, 32, 32, 32, 32, 32, 33, 34,
- 35, 36, 41, 41, 44, 49, 49, 54, 31, 32, 32, 32, 32, 33, 33, 34, 34, 36,
- 41, 41, 44, 49, 49, 54, 31, 32, 32, 32, 33, 33, 33, 35, 35, 36, 41, 41,
- 44, 49, 49, 54, 32, 32, 32, 32, 33, 34, 34, 36, 36, 38, 42, 42, 45, 49,
- 49, 54, 32, 32, 32, 33, 34, 34, 34, 36, 36, 38, 42, 42, 45, 50, 50, 54,
- 32, 32, 32, 33, 34, 34, 35, 37, 37, 38, 42, 42, 45, 49, 49, 54, 32, 32,
- 33, 33, 35, 35, 36, 38, 38, 39, 42, 42, 45, 49, 49, 53, 32, 32, 33, 33,
- 35, 35, 36, 38, 38, 39, 42, 42, 45, 49, 49, 53, 32, 33, 33, 33, 35, 36,
- 36, 39, 40, 41, 44, 44, 47, 51, 51, 55, 34, 34, 34, 34, 36, 37, 38, 42,
- 42, 44, 48, 48, 50, 54, 54, 58, 34, 34, 34, 34, 36, 37, 38, 42, 42, 44,
- 48, 48, 50, 54, 54, 58, 35, 34, 34, 34, 37, 37, 39, 44, 45, 46, 50, 50,
- 53, 57, 57, 61, 36, 35, 34, 35, 37, 38, 40, 47, 48, 49, 54, 54, 56, 60,
- 60, 64, 36, 35, 34, 35, 37, 38, 40, 47, 48, 49, 54, 54, 56, 60, 60, 64,
- 38, 37, 36, 37, 39, 40, 41, 48, 49, 51, 56, 56, 58, 63, 63, 67, 39, 38,
- 37, 38, 40, 40, 42, 49, 50, 52, 58, 58, 60, 65, 65, 69, 39, 38, 37, 38,
- 40, 40, 42, 49, 50, 52, 58, 58, 60, 65, 65, 69, 42, 40, 40, 40, 42, 42,
- 44, 51, 52, 55, 61, 61, 64, 69, 69, 73, 44, 42, 41, 41, 42, 43, 45, 52,
- 53, 56, 63, 63, 66, 71, 71, 75, 44, 42, 41, 41, 43, 43, 45, 52, 54, 56,
- 63, 63, 66, 72, 72, 76, 47, 45, 44, 44, 45, 45, 47, 54, 56, 58, 66, 66,
- 69, 75, 75, 79, 48, 46, 45, 45, 46, 46, 48, 55, 56, 59, 67, 67, 70, 76,
- 76, 80, 49, 47, 46, 46, 47, 47, 48, 56, 57, 60, 67, 67, 71, 77, 77, 81,
- 53, 50, 49, 49, 49, 49, 51, 58, 59, 62, 71, 71, 74, 81, 81, 86, 53, 51,
- 49, 49, 50, 50, 51, 59, 60, 63, 71, 71, 75, 82, 82, 87, 55, 52, 51, 51,
- 51, 51, 53, 60, 61, 64, 72, 72, 76, 83, 83, 88, 58, 55, 54, 54, 54, 54,
- 55, 62, 63, 67, 75, 75, 79, 87, 87, 92, 58, 55, 54, 54, 54, 54, 55, 62,
- 63, 67, 75, 75, 79, 87, 87, 92,
- /* Size 32x16 */
32, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 34, 34, 35, 36, 36,
38, 39, 39, 42, 44, 44, 47, 48, 49, 53, 53, 55, 58, 58, 31, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 33, 34, 34, 34, 35, 35, 37, 38, 38, 40,
@@ -3909,33 +3879,47 @@
65, 69, 71, 72, 75, 76, 77, 81, 82, 83, 87, 87, 59, 57, 56, 56, 54, 54,
54, 54, 54, 54, 53, 53, 55, 58, 58, 61, 64, 64, 67, 69, 69, 73, 75, 76,
79, 80, 81, 86, 87, 88, 92, 92,
+ /* Size 32x16 */
+ 32, 31, 31, 31, 32, 32, 32, 35, 36, 38, 44, 44, 47, 53, 53, 59, 31, 32,
+ 32, 32, 32, 32, 33, 35, 35, 37, 43, 43, 46, 52, 52, 57, 31, 32, 32, 32,
+ 32, 32, 33, 35, 35, 37, 42, 42, 45, 51, 51, 56, 31, 32, 32, 32, 32, 32,
+ 33, 35, 35, 37, 42, 42, 45, 51, 51, 56, 31, 32, 32, 32, 32, 32, 33, 34,
+ 35, 36, 41, 41, 44, 49, 49, 54, 31, 32, 32, 32, 32, 33, 33, 34, 34, 36,
+ 41, 41, 44, 49, 49, 54, 31, 32, 32, 32, 33, 33, 33, 35, 35, 36, 41, 41,
+ 44, 49, 49, 54, 32, 32, 32, 32, 33, 34, 34, 36, 36, 38, 42, 42, 45, 49,
+ 49, 54, 32, 32, 32, 33, 34, 34, 34, 36, 36, 38, 42, 42, 45, 50, 50, 54,
+ 32, 32, 32, 33, 34, 34, 35, 37, 37, 38, 42, 42, 45, 49, 49, 54, 32, 32,
+ 33, 33, 35, 35, 36, 38, 38, 39, 42, 42, 45, 49, 49, 53, 32, 32, 33, 33,
+ 35, 35, 36, 38, 38, 39, 42, 42, 45, 49, 49, 53, 32, 33, 33, 33, 35, 36,
+ 36, 39, 40, 41, 44, 44, 47, 51, 51, 55, 34, 34, 34, 34, 36, 37, 38, 42,
+ 42, 44, 48, 48, 50, 54, 54, 58, 34, 34, 34, 34, 36, 37, 38, 42, 42, 44,
+ 48, 48, 50, 54, 54, 58, 35, 34, 34, 34, 37, 37, 39, 44, 45, 46, 50, 50,
+ 53, 57, 57, 61, 36, 35, 34, 35, 37, 38, 40, 47, 48, 49, 54, 54, 56, 60,
+ 60, 64, 36, 35, 34, 35, 37, 38, 40, 47, 48, 49, 54, 54, 56, 60, 60, 64,
+ 38, 37, 36, 37, 39, 40, 41, 48, 49, 51, 56, 56, 58, 63, 63, 67, 39, 38,
+ 37, 38, 40, 40, 42, 49, 50, 52, 58, 58, 60, 65, 65, 69, 39, 38, 37, 38,
+ 40, 40, 42, 49, 50, 52, 58, 58, 60, 65, 65, 69, 42, 40, 40, 40, 42, 42,
+ 44, 51, 52, 55, 61, 61, 64, 69, 69, 73, 44, 42, 41, 41, 42, 43, 45, 52,
+ 53, 56, 63, 63, 66, 71, 71, 75, 44, 42, 41, 41, 43, 43, 45, 52, 54, 56,
+ 63, 63, 66, 72, 72, 76, 47, 45, 44, 44, 45, 45, 47, 54, 56, 58, 66, 66,
+ 69, 75, 75, 79, 48, 46, 45, 45, 46, 46, 48, 55, 56, 59, 67, 67, 70, 76,
+ 76, 80, 49, 47, 46, 46, 47, 47, 48, 56, 57, 60, 67, 67, 71, 77, 77, 81,
+ 53, 50, 49, 49, 49, 49, 51, 58, 59, 62, 71, 71, 74, 81, 81, 86, 53, 51,
+ 49, 49, 50, 50, 51, 59, 60, 63, 71, 71, 75, 82, 82, 87, 55, 52, 51, 51,
+ 51, 51, 53, 60, 61, 64, 72, 72, 76, 83, 83, 88, 58, 55, 54, 54, 54, 54,
+ 55, 62, 63, 67, 75, 75, 79, 87, 87, 92, 58, 55, 54, 54, 54, 54, 55, 62,
+ 63, 67, 75, 75, 79, 87, 87, 92,
/* Size 4x16 */
- 31, 32, 38, 53, 32, 32, 37, 51, 32, 32, 36, 49, 32, 33, 36, 49, 32, 34,
- 38, 50, 32, 35, 39, 49, 33, 36, 41, 51, 34, 37, 44, 54, 35, 38, 49, 60,
- 37, 40, 51, 63, 38, 40, 52, 65, 42, 43, 56, 71, 45, 45, 58, 75, 47, 47,
- 60, 77, 51, 50, 63, 82, 55, 54, 67, 87,
- /* Size 16x4 */
31, 32, 32, 32, 32, 32, 33, 34, 35, 37, 38, 42, 45, 47, 51, 55, 32, 32,
32, 33, 34, 35, 36, 37, 38, 40, 40, 43, 45, 47, 50, 54, 38, 37, 36, 36,
38, 39, 41, 44, 49, 51, 52, 56, 58, 60, 63, 67, 53, 51, 49, 49, 50, 49,
51, 54, 60, 63, 65, 71, 75, 77, 82, 87,
+ /* Size 16x4 */
+ 31, 32, 38, 53, 32, 32, 37, 51, 32, 32, 36, 49, 32, 33, 36, 49, 32, 34,
+ 38, 50, 32, 35, 39, 49, 33, 36, 41, 51, 34, 37, 44, 54, 35, 38, 49, 60,
+ 37, 40, 51, 63, 38, 40, 52, 65, 42, 43, 56, 71, 45, 45, 58, 75, 47, 47,
+ 60, 77, 51, 50, 63, 82, 55, 54, 67, 87,
/* Size 8x32 */
- 32, 31, 32, 32, 36, 44, 47, 53, 31, 32, 32, 33, 35, 43, 46, 52, 31, 32,
- 32, 33, 35, 42, 45, 51, 31, 32, 32, 33, 35, 42, 45, 51, 31, 32, 32, 33,
- 35, 41, 44, 49, 31, 32, 32, 33, 34, 41, 44, 49, 31, 32, 33, 33, 35, 41,
- 44, 49, 32, 32, 33, 34, 36, 42, 45, 49, 32, 32, 34, 34, 36, 42, 45, 50,
- 32, 32, 34, 35, 37, 42, 45, 49, 32, 33, 35, 36, 38, 42, 45, 49, 32, 33,
- 35, 36, 38, 42, 45, 49, 32, 33, 35, 36, 40, 44, 47, 51, 34, 34, 36, 38,
- 42, 48, 50, 54, 34, 34, 36, 38, 42, 48, 50, 54, 35, 34, 37, 39, 45, 50,
- 53, 57, 36, 34, 37, 40, 48, 54, 56, 60, 36, 34, 37, 40, 48, 54, 56, 60,
- 38, 36, 39, 41, 49, 56, 58, 63, 39, 37, 40, 42, 50, 58, 60, 65, 39, 37,
- 40, 42, 50, 58, 60, 65, 42, 40, 42, 44, 52, 61, 64, 69, 44, 41, 42, 45,
- 53, 63, 66, 71, 44, 41, 43, 45, 54, 63, 66, 72, 47, 44, 45, 47, 56, 66,
- 69, 75, 48, 45, 46, 48, 56, 67, 70, 76, 49, 46, 47, 48, 57, 67, 71, 77,
- 53, 49, 49, 51, 59, 71, 74, 81, 53, 49, 50, 51, 60, 71, 75, 82, 55, 51,
- 51, 53, 61, 72, 76, 83, 58, 54, 54, 55, 63, 75, 79, 87, 58, 54, 54, 55,
- 63, 75, 79, 87,
- /* Size 32x8 */
32, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 34, 34, 35, 36, 36,
38, 39, 39, 42, 44, 44, 47, 48, 49, 53, 53, 55, 58, 58, 31, 32, 32, 32,
32, 32, 32, 32, 32, 32, 33, 33, 33, 34, 34, 34, 34, 34, 36, 37, 37, 40,
@@ -3950,7 +3934,23 @@
44, 45, 45, 45, 45, 45, 47, 50, 50, 53, 56, 56, 58, 60, 60, 64, 66, 66,
69, 70, 71, 74, 75, 76, 79, 79, 53, 52, 51, 51, 49, 49, 49, 49, 50, 49,
49, 49, 51, 54, 54, 57, 60, 60, 63, 65, 65, 69, 71, 72, 75, 76, 77, 81,
- 82, 83, 87, 87 },
+ 82, 83, 87, 87,
+ /* Size 32x8 */
+ 32, 31, 32, 32, 36, 44, 47, 53, 31, 32, 32, 33, 35, 43, 46, 52, 31, 32,
+ 32, 33, 35, 42, 45, 51, 31, 32, 32, 33, 35, 42, 45, 51, 31, 32, 32, 33,
+ 35, 41, 44, 49, 31, 32, 32, 33, 34, 41, 44, 49, 31, 32, 33, 33, 35, 41,
+ 44, 49, 32, 32, 33, 34, 36, 42, 45, 49, 32, 32, 34, 34, 36, 42, 45, 50,
+ 32, 32, 34, 35, 37, 42, 45, 49, 32, 33, 35, 36, 38, 42, 45, 49, 32, 33,
+ 35, 36, 38, 42, 45, 49, 32, 33, 35, 36, 40, 44, 47, 51, 34, 34, 36, 38,
+ 42, 48, 50, 54, 34, 34, 36, 38, 42, 48, 50, 54, 35, 34, 37, 39, 45, 50,
+ 53, 57, 36, 34, 37, 40, 48, 54, 56, 60, 36, 34, 37, 40, 48, 54, 56, 60,
+ 38, 36, 39, 41, 49, 56, 58, 63, 39, 37, 40, 42, 50, 58, 60, 65, 39, 37,
+ 40, 42, 50, 58, 60, 65, 42, 40, 42, 44, 52, 61, 64, 69, 44, 41, 42, 45,
+ 53, 63, 66, 71, 44, 41, 43, 45, 54, 63, 66, 72, 47, 44, 45, 47, 56, 66,
+ 69, 75, 48, 45, 46, 48, 56, 67, 70, 76, 49, 46, 47, 48, 57, 67, 71, 77,
+ 53, 49, 49, 51, 59, 71, 74, 81, 53, 49, 50, 51, 60, 71, 75, 82, 55, 51,
+ 51, 53, 61, 72, 76, 83, 58, 54, 54, 55, 63, 75, 79, 87, 58, 54, 54, 55,
+ 63, 75, 79, 87 },
{ /* Chroma */
/* Size 4x4 */
31, 38, 47, 49, 38, 47, 46, 46, 47, 46, 54, 57, 49, 46, 57, 66,
@@ -4034,21 +4034,12 @@
71, 71, 54, 53, 52, 52, 50, 49, 49, 49, 49, 49, 48, 48, 49, 52, 52, 53,
55, 55, 57, 58, 58, 61, 62, 63, 64, 65, 66, 68, 68, 69, 71, 71,
/* Size 4x8 */
- 31, 38, 47, 50, 31, 40, 46, 48, 36, 44, 47, 47, 42, 47, 50, 50, 47, 48,
- 53, 54, 46, 46, 54, 60, 48, 46, 55, 64, 50, 48, 56, 67,
- /* Size 8x4 */
31, 31, 36, 42, 47, 46, 48, 50, 38, 40, 44, 47, 48, 46, 46, 48, 47, 46,
47, 50, 53, 54, 55, 56, 50, 48, 47, 50, 54, 60, 64, 67,
+ /* Size 8x4 */
+ 31, 38, 47, 50, 31, 40, 46, 48, 36, 44, 47, 47, 42, 47, 50, 50, 47, 48,
+ 53, 54, 46, 46, 54, 60, 48, 46, 55, 64, 50, 48, 56, 67,
/* Size 8x16 */
- 32, 31, 35, 38, 48, 49, 50, 52, 31, 31, 37, 40, 47, 47, 48, 50, 30, 32,
- 38, 40, 46, 45, 46, 48, 31, 33, 38, 41, 46, 45, 46, 48, 33, 36, 41, 44,
- 47, 46, 46, 47, 37, 40, 45, 47, 47, 45, 46, 47, 39, 41, 46, 47, 48, 47,
- 47, 48, 42, 43, 46, 48, 50, 49, 50, 50, 49, 46, 48, 49, 53, 53, 54, 54,
- 48, 46, 47, 48, 53, 55, 55, 56, 48, 46, 46, 48, 53, 56, 56, 57, 49, 45,
- 45, 47, 53, 58, 59, 61, 50, 46, 46, 48, 54, 59, 61, 63, 51, 47, 47, 48,
- 54, 60, 61, 64, 52, 48, 47, 48, 54, 61, 63, 66, 54, 50, 49, 50, 55, 62,
- 65, 68,
- /* Size 16x8 */
32, 31, 30, 31, 33, 37, 39, 42, 49, 48, 48, 49, 50, 51, 52, 54, 31, 31,
32, 33, 36, 40, 41, 43, 46, 46, 46, 45, 46, 47, 48, 50, 35, 37, 38, 38,
41, 45, 46, 46, 48, 47, 46, 45, 46, 47, 47, 49, 38, 40, 40, 41, 44, 47,
@@ -4057,37 +4048,16 @@
56, 58, 59, 60, 61, 62, 50, 48, 46, 46, 46, 46, 47, 50, 54, 55, 56, 59,
61, 61, 63, 65, 52, 50, 48, 48, 47, 47, 48, 50, 54, 56, 57, 61, 63, 64,
66, 68,
+ /* Size 16x8 */
+ 32, 31, 35, 38, 48, 49, 50, 52, 31, 31, 37, 40, 47, 47, 48, 50, 30, 32,
+ 38, 40, 46, 45, 46, 48, 31, 33, 38, 41, 46, 45, 46, 48, 33, 36, 41, 44,
+ 47, 46, 46, 47, 37, 40, 45, 47, 47, 45, 46, 47, 39, 41, 46, 47, 48, 47,
+ 47, 48, 42, 43, 46, 48, 50, 49, 50, 50, 49, 46, 48, 49, 53, 53, 54, 54,
+ 48, 46, 47, 48, 53, 55, 55, 56, 48, 46, 46, 48, 53, 56, 56, 57, 49, 45,
+ 45, 47, 53, 58, 59, 61, 50, 46, 46, 48, 54, 59, 61, 63, 51, 47, 47, 48,
+ 54, 60, 61, 64, 52, 48, 47, 48, 54, 61, 63, 66, 54, 50, 49, 50, 55, 62,
+ 65, 68,
/* Size 16x32 */
- 32, 31, 31, 31, 35, 37, 38, 47, 48, 48, 49, 49, 50, 52, 52, 54, 31, 31,
- 31, 32, 36, 38, 39, 46, 47, 47, 48, 48, 49, 50, 50, 53, 31, 31, 31, 32,
- 37, 38, 40, 46, 47, 47, 47, 47, 48, 50, 50, 52, 31, 31, 31, 32, 37, 38,
- 40, 46, 47, 47, 47, 47, 48, 50, 50, 52, 30, 31, 32, 32, 38, 39, 40, 45,
- 46, 46, 45, 45, 46, 48, 48, 50, 30, 31, 32, 33, 38, 40, 41, 45, 46, 46,
- 45, 45, 46, 48, 48, 50, 31, 32, 33, 33, 38, 40, 41, 45, 46, 46, 45, 45,
- 46, 48, 48, 50, 33, 35, 35, 36, 41, 43, 43, 46, 47, 46, 45, 45, 46, 47,
- 47, 49, 33, 35, 36, 36, 41, 43, 44, 46, 47, 46, 46, 46, 46, 47, 47, 49,
- 34, 36, 37, 37, 42, 44, 45, 47, 47, 47, 45, 45, 46, 47, 47, 49, 37, 39,
- 40, 41, 45, 47, 47, 47, 47, 47, 45, 45, 46, 47, 47, 48, 37, 39, 40, 41,
- 45, 47, 47, 47, 47, 47, 45, 45, 46, 47, 47, 48, 39, 40, 41, 42, 46, 47,
- 47, 48, 48, 48, 47, 47, 47, 48, 48, 50, 42, 42, 43, 43, 46, 47, 48, 50,
- 50, 50, 49, 49, 50, 50, 50, 52, 42, 42, 43, 43, 46, 47, 48, 50, 50, 50,
- 49, 49, 50, 50, 50, 52, 45, 45, 44, 45, 47, 47, 48, 51, 51, 51, 51, 51,
- 52, 52, 52, 54, 49, 47, 46, 47, 48, 48, 49, 52, 53, 53, 53, 53, 54, 54,
- 54, 55, 49, 47, 46, 47, 48, 48, 49, 52, 53, 53, 53, 53, 54, 54, 54, 55,
- 48, 47, 46, 46, 47, 47, 48, 52, 53, 53, 55, 55, 55, 56, 56, 57, 48, 46,
- 46, 46, 46, 47, 48, 52, 53, 54, 56, 56, 56, 57, 57, 59, 48, 46, 46, 46,
- 46, 47, 48, 52, 53, 54, 56, 56, 56, 57, 57, 59, 49, 46, 45, 45, 46, 46,
- 47, 52, 53, 54, 57, 57, 58, 60, 60, 61, 49, 46, 45, 45, 45, 46, 47, 52,
- 53, 55, 58, 58, 59, 61, 61, 62, 49, 46, 45, 45, 46, 46, 47, 52, 53, 55,
- 58, 58, 60, 61, 61, 63, 50, 47, 46, 46, 46, 46, 48, 53, 54, 55, 59, 59,
- 61, 63, 63, 65, 50, 48, 46, 46, 46, 46, 48, 53, 54, 55, 59, 59, 61, 64,
- 64, 65, 51, 48, 47, 47, 47, 47, 48, 53, 54, 55, 60, 60, 61, 64, 64, 66,
- 52, 49, 48, 48, 47, 47, 48, 53, 54, 56, 61, 61, 63, 66, 66, 68, 52, 49,
- 48, 48, 47, 47, 48, 53, 54, 56, 61, 61, 63, 66, 66, 68, 53, 50, 48, 48,
- 48, 48, 49, 54, 54, 56, 61, 61, 63, 67, 67, 69, 54, 51, 50, 50, 49, 49,
- 50, 55, 55, 57, 62, 62, 65, 68, 68, 71, 54, 51, 50, 50, 49, 49, 50, 55,
- 55, 57, 62, 62, 65, 68, 68, 71,
- /* Size 32x16 */
32, 31, 31, 31, 30, 30, 31, 33, 33, 34, 37, 37, 39, 42, 42, 45, 49, 49,
48, 48, 48, 49, 49, 49, 50, 50, 51, 52, 52, 53, 54, 54, 31, 31, 31, 31,
31, 31, 32, 35, 35, 36, 39, 39, 40, 42, 42, 45, 47, 47, 47, 46, 46, 46,
@@ -4117,33 +4087,47 @@
57, 60, 61, 61, 63, 64, 64, 66, 66, 67, 68, 68, 54, 53, 52, 52, 50, 50,
50, 49, 49, 49, 48, 48, 50, 52, 52, 54, 55, 55, 57, 59, 59, 61, 62, 63,
65, 65, 66, 68, 68, 69, 71, 71,
+ /* Size 32x16 */
+ 32, 31, 31, 31, 35, 37, 38, 47, 48, 48, 49, 49, 50, 52, 52, 54, 31, 31,
+ 31, 32, 36, 38, 39, 46, 47, 47, 48, 48, 49, 50, 50, 53, 31, 31, 31, 32,
+ 37, 38, 40, 46, 47, 47, 47, 47, 48, 50, 50, 52, 31, 31, 31, 32, 37, 38,
+ 40, 46, 47, 47, 47, 47, 48, 50, 50, 52, 30, 31, 32, 32, 38, 39, 40, 45,
+ 46, 46, 45, 45, 46, 48, 48, 50, 30, 31, 32, 33, 38, 40, 41, 45, 46, 46,
+ 45, 45, 46, 48, 48, 50, 31, 32, 33, 33, 38, 40, 41, 45, 46, 46, 45, 45,
+ 46, 48, 48, 50, 33, 35, 35, 36, 41, 43, 43, 46, 47, 46, 45, 45, 46, 47,
+ 47, 49, 33, 35, 36, 36, 41, 43, 44, 46, 47, 46, 46, 46, 46, 47, 47, 49,
+ 34, 36, 37, 37, 42, 44, 45, 47, 47, 47, 45, 45, 46, 47, 47, 49, 37, 39,
+ 40, 41, 45, 47, 47, 47, 47, 47, 45, 45, 46, 47, 47, 48, 37, 39, 40, 41,
+ 45, 47, 47, 47, 47, 47, 45, 45, 46, 47, 47, 48, 39, 40, 41, 42, 46, 47,
+ 47, 48, 48, 48, 47, 47, 47, 48, 48, 50, 42, 42, 43, 43, 46, 47, 48, 50,
+ 50, 50, 49, 49, 50, 50, 50, 52, 42, 42, 43, 43, 46, 47, 48, 50, 50, 50,
+ 49, 49, 50, 50, 50, 52, 45, 45, 44, 45, 47, 47, 48, 51, 51, 51, 51, 51,
+ 52, 52, 52, 54, 49, 47, 46, 47, 48, 48, 49, 52, 53, 53, 53, 53, 54, 54,
+ 54, 55, 49, 47, 46, 47, 48, 48, 49, 52, 53, 53, 53, 53, 54, 54, 54, 55,
+ 48, 47, 46, 46, 47, 47, 48, 52, 53, 53, 55, 55, 55, 56, 56, 57, 48, 46,
+ 46, 46, 46, 47, 48, 52, 53, 54, 56, 56, 56, 57, 57, 59, 48, 46, 46, 46,
+ 46, 47, 48, 52, 53, 54, 56, 56, 56, 57, 57, 59, 49, 46, 45, 45, 46, 46,
+ 47, 52, 53, 54, 57, 57, 58, 60, 60, 61, 49, 46, 45, 45, 45, 46, 47, 52,
+ 53, 55, 58, 58, 59, 61, 61, 62, 49, 46, 45, 45, 46, 46, 47, 52, 53, 55,
+ 58, 58, 60, 61, 61, 63, 50, 47, 46, 46, 46, 46, 48, 53, 54, 55, 59, 59,
+ 61, 63, 63, 65, 50, 48, 46, 46, 46, 46, 48, 53, 54, 55, 59, 59, 61, 64,
+ 64, 65, 51, 48, 47, 47, 47, 47, 48, 53, 54, 55, 60, 60, 61, 64, 64, 66,
+ 52, 49, 48, 48, 47, 47, 48, 53, 54, 56, 61, 61, 63, 66, 66, 68, 52, 49,
+ 48, 48, 47, 47, 48, 53, 54, 56, 61, 61, 63, 66, 66, 68, 53, 50, 48, 48,
+ 48, 48, 49, 54, 54, 56, 61, 61, 63, 67, 67, 69, 54, 51, 50, 50, 49, 49,
+ 50, 55, 55, 57, 62, 62, 65, 68, 68, 71, 54, 51, 50, 50, 49, 49, 50, 55,
+ 55, 57, 62, 62, 65, 68, 68, 71,
/* Size 4x16 */
- 31, 37, 48, 52, 31, 38, 47, 50, 31, 39, 46, 48, 32, 40, 46, 48, 35, 43,
- 46, 47, 39, 47, 47, 47, 40, 47, 48, 48, 42, 47, 50, 50, 47, 48, 53, 54,
- 47, 47, 53, 56, 46, 47, 54, 57, 46, 46, 55, 61, 47, 46, 55, 63, 48, 47,
- 55, 64, 49, 47, 56, 66, 51, 49, 57, 68,
- /* Size 16x4 */
31, 31, 31, 32, 35, 39, 40, 42, 47, 47, 46, 46, 47, 48, 49, 51, 37, 38,
39, 40, 43, 47, 47, 47, 48, 47, 47, 46, 46, 47, 47, 49, 48, 47, 46, 46,
46, 47, 48, 50, 53, 53, 54, 55, 55, 55, 56, 57, 52, 50, 48, 48, 47, 47,
48, 50, 54, 56, 57, 61, 63, 64, 66, 68,
+ /* Size 16x4 */
+ 31, 37, 48, 52, 31, 38, 47, 50, 31, 39, 46, 48, 32, 40, 46, 48, 35, 43,
+ 46, 47, 39, 47, 47, 47, 40, 47, 48, 48, 42, 47, 50, 50, 47, 48, 53, 54,
+ 47, 47, 53, 56, 46, 47, 54, 57, 46, 46, 55, 61, 47, 46, 55, 63, 48, 47,
+ 55, 64, 49, 47, 56, 66, 51, 49, 57, 68,
/* Size 8x32 */
- 32, 31, 35, 38, 48, 49, 50, 52, 31, 31, 36, 39, 47, 48, 49, 50, 31, 31,
- 37, 40, 47, 47, 48, 50, 31, 31, 37, 40, 47, 47, 48, 50, 30, 32, 38, 40,
- 46, 45, 46, 48, 30, 32, 38, 41, 46, 45, 46, 48, 31, 33, 38, 41, 46, 45,
- 46, 48, 33, 35, 41, 43, 47, 45, 46, 47, 33, 36, 41, 44, 47, 46, 46, 47,
- 34, 37, 42, 45, 47, 45, 46, 47, 37, 40, 45, 47, 47, 45, 46, 47, 37, 40,
- 45, 47, 47, 45, 46, 47, 39, 41, 46, 47, 48, 47, 47, 48, 42, 43, 46, 48,
- 50, 49, 50, 50, 42, 43, 46, 48, 50, 49, 50, 50, 45, 44, 47, 48, 51, 51,
- 52, 52, 49, 46, 48, 49, 53, 53, 54, 54, 49, 46, 48, 49, 53, 53, 54, 54,
- 48, 46, 47, 48, 53, 55, 55, 56, 48, 46, 46, 48, 53, 56, 56, 57, 48, 46,
- 46, 48, 53, 56, 56, 57, 49, 45, 46, 47, 53, 57, 58, 60, 49, 45, 45, 47,
- 53, 58, 59, 61, 49, 45, 46, 47, 53, 58, 60, 61, 50, 46, 46, 48, 54, 59,
- 61, 63, 50, 46, 46, 48, 54, 59, 61, 64, 51, 47, 47, 48, 54, 60, 61, 64,
- 52, 48, 47, 48, 54, 61, 63, 66, 52, 48, 47, 48, 54, 61, 63, 66, 53, 48,
- 48, 49, 54, 61, 63, 67, 54, 50, 49, 50, 55, 62, 65, 68, 54, 50, 49, 50,
- 55, 62, 65, 68,
- /* Size 32x8 */
32, 31, 31, 31, 30, 30, 31, 33, 33, 34, 37, 37, 39, 42, 42, 45, 49, 49,
48, 48, 48, 49, 49, 49, 50, 50, 51, 52, 52, 53, 54, 54, 31, 31, 31, 31,
32, 32, 33, 35, 36, 37, 40, 40, 41, 43, 43, 44, 46, 46, 46, 46, 46, 45,
@@ -4158,7 +4142,23 @@
46, 46, 46, 46, 46, 46, 47, 50, 50, 52, 54, 54, 55, 56, 56, 58, 59, 60,
61, 61, 61, 63, 63, 63, 65, 65, 52, 50, 50, 50, 48, 48, 48, 47, 47, 47,
47, 47, 48, 50, 50, 52, 54, 54, 56, 57, 57, 60, 61, 61, 63, 64, 64, 66,
- 66, 67, 68, 68 },
+ 66, 67, 68, 68,
+ /* Size 32x8 */
+ 32, 31, 35, 38, 48, 49, 50, 52, 31, 31, 36, 39, 47, 48, 49, 50, 31, 31,
+ 37, 40, 47, 47, 48, 50, 31, 31, 37, 40, 47, 47, 48, 50, 30, 32, 38, 40,
+ 46, 45, 46, 48, 30, 32, 38, 41, 46, 45, 46, 48, 31, 33, 38, 41, 46, 45,
+ 46, 48, 33, 35, 41, 43, 47, 45, 46, 47, 33, 36, 41, 44, 47, 46, 46, 47,
+ 34, 37, 42, 45, 47, 45, 46, 47, 37, 40, 45, 47, 47, 45, 46, 47, 37, 40,
+ 45, 47, 47, 45, 46, 47, 39, 41, 46, 47, 48, 47, 47, 48, 42, 43, 46, 48,
+ 50, 49, 50, 50, 42, 43, 46, 48, 50, 49, 50, 50, 45, 44, 47, 48, 51, 51,
+ 52, 52, 49, 46, 48, 49, 53, 53, 54, 54, 49, 46, 48, 49, 53, 53, 54, 54,
+ 48, 46, 47, 48, 53, 55, 55, 56, 48, 46, 46, 48, 53, 56, 56, 57, 48, 46,
+ 46, 48, 53, 56, 56, 57, 49, 45, 46, 47, 53, 57, 58, 60, 49, 45, 45, 47,
+ 53, 58, 59, 61, 49, 45, 46, 47, 53, 58, 60, 61, 50, 46, 46, 48, 54, 59,
+ 61, 63, 50, 46, 46, 48, 54, 59, 61, 64, 51, 47, 47, 48, 54, 60, 61, 64,
+ 52, 48, 47, 48, 54, 61, 63, 66, 52, 48, 47, 48, 54, 61, 63, 66, 53, 48,
+ 48, 49, 54, 61, 63, 67, 54, 50, 49, 50, 55, 62, 65, 68, 54, 50, 49, 50,
+ 55, 62, 65, 68 },
},
{
{ /* Luma */
@@ -4244,21 +4244,12 @@
71, 74, 51, 50, 49, 49, 48, 47, 47, 47, 48, 48, 48, 48, 48, 48, 50, 53,
53, 54, 57, 58, 58, 61, 63, 63, 66, 69, 69, 70, 73, 74, 74, 77,
/* Size 4x8 */
- 31, 32, 35, 43, 32, 33, 34, 41, 32, 34, 36, 42, 32, 35, 38, 42, 34, 37,
- 43, 49, 37, 40, 49, 56, 42, 43, 53, 63, 46, 46, 56, 67,
- /* Size 8x4 */
31, 32, 32, 32, 34, 37, 42, 46, 32, 33, 34, 35, 37, 40, 43, 46, 35, 34,
36, 38, 43, 49, 53, 56, 43, 41, 42, 42, 49, 56, 63, 67,
+ /* Size 8x4 */
+ 31, 32, 35, 43, 32, 33, 34, 41, 32, 34, 36, 42, 32, 35, 38, 42, 34, 37,
+ 43, 49, 37, 40, 49, 56, 42, 43, 53, 63, 46, 46, 56, 67,
/* Size 8x16 */
- 32, 31, 31, 32, 35, 36, 44, 47, 31, 32, 32, 32, 35, 35, 42, 45, 31, 32,
- 32, 32, 34, 35, 41, 45, 31, 32, 32, 33, 34, 34, 41, 44, 31, 32, 33, 34,
- 35, 36, 42, 44, 32, 32, 33, 34, 36, 36, 42, 45, 32, 33, 34, 35, 37, 38,
- 42, 45, 32, 33, 34, 36, 39, 40, 44, 47, 34, 34, 35, 37, 41, 42, 48, 50,
- 35, 34, 36, 38, 45, 47, 52, 55, 36, 34, 36, 38, 46, 48, 54, 56, 39, 37,
- 39, 40, 48, 50, 58, 60, 41, 39, 40, 41, 49, 51, 60, 62, 44, 41, 42, 43,
- 51, 53, 63, 66, 47, 44, 44, 45, 53, 56, 66, 69, 48, 45, 45, 46, 54, 56,
- 67, 70,
- /* Size 16x8 */
32, 31, 31, 31, 31, 32, 32, 32, 34, 35, 36, 39, 41, 44, 47, 48, 31, 32,
32, 32, 32, 32, 33, 33, 34, 34, 34, 37, 39, 41, 44, 45, 31, 32, 32, 32,
33, 33, 34, 34, 35, 36, 36, 39, 40, 42, 44, 45, 32, 32, 32, 33, 34, 34,
@@ -4267,37 +4258,16 @@
48, 50, 51, 53, 56, 56, 44, 42, 41, 41, 42, 42, 42, 44, 48, 52, 54, 58,
60, 63, 66, 67, 47, 45, 45, 44, 44, 45, 45, 47, 50, 55, 56, 60, 62, 66,
69, 70,
+ /* Size 16x8 */
+ 32, 31, 31, 32, 35, 36, 44, 47, 31, 32, 32, 32, 35, 35, 42, 45, 31, 32,
+ 32, 32, 34, 35, 41, 45, 31, 32, 32, 33, 34, 34, 41, 44, 31, 32, 33, 34,
+ 35, 36, 42, 44, 32, 32, 33, 34, 36, 36, 42, 45, 32, 33, 34, 35, 37, 38,
+ 42, 45, 32, 33, 34, 36, 39, 40, 44, 47, 34, 34, 35, 37, 41, 42, 48, 50,
+ 35, 34, 36, 38, 45, 47, 52, 55, 36, 34, 36, 38, 46, 48, 54, 56, 39, 37,
+ 39, 40, 48, 50, 58, 60, 41, 39, 40, 41, 49, 51, 60, 62, 44, 41, 42, 43,
+ 51, 53, 63, 66, 47, 44, 44, 45, 53, 56, 66, 69, 48, 45, 45, 46, 54, 56,
+ 67, 70,
/* Size 16x32 */
- 32, 31, 31, 31, 31, 32, 32, 32, 35, 36, 36, 40, 44, 44, 47, 53, 31, 31,
- 32, 32, 32, 32, 32, 33, 35, 35, 35, 39, 43, 43, 46, 52, 31, 32, 32, 32,
- 32, 32, 32, 33, 35, 35, 35, 39, 42, 42, 45, 51, 31, 32, 32, 32, 32, 32,
- 32, 33, 35, 35, 35, 39, 42, 42, 45, 51, 31, 32, 32, 32, 32, 32, 32, 33,
- 34, 35, 35, 39, 41, 41, 45, 50, 31, 32, 32, 32, 32, 33, 33, 33, 34, 34,
- 34, 38, 41, 41, 44, 49, 31, 32, 32, 32, 32, 33, 33, 33, 34, 34, 34, 38,
- 41, 41, 44, 49, 31, 32, 32, 32, 32, 33, 33, 33, 34, 35, 35, 38, 41, 41,
- 44, 49, 31, 32, 32, 32, 33, 34, 34, 34, 35, 36, 36, 39, 42, 42, 44, 49,
- 32, 32, 32, 32, 33, 34, 34, 34, 36, 36, 36, 39, 42, 42, 45, 50, 32, 32,
- 32, 32, 33, 34, 34, 34, 36, 36, 36, 39, 42, 42, 45, 50, 32, 32, 32, 32,
- 33, 35, 35, 35, 37, 37, 37, 40, 42, 42, 45, 49, 32, 32, 33, 33, 34, 35,
- 35, 36, 37, 38, 38, 41, 42, 42, 45, 49, 32, 32, 33, 33, 34, 35, 35, 36,
- 37, 38, 38, 41, 42, 42, 45, 49, 32, 33, 33, 33, 34, 36, 36, 36, 39, 40,
- 40, 42, 44, 44, 47, 51, 34, 34, 34, 34, 35, 37, 37, 38, 41, 42, 42, 45,
- 48, 48, 50, 54, 34, 34, 34, 34, 35, 37, 37, 38, 41, 42, 42, 45, 48, 48,
- 50, 54, 34, 34, 34, 34, 35, 37, 37, 38, 42, 43, 43, 46, 49, 49, 51, 55,
- 35, 35, 34, 34, 36, 38, 38, 39, 45, 47, 47, 50, 52, 52, 55, 59, 36, 35,
- 34, 34, 36, 38, 38, 40, 46, 48, 48, 51, 54, 54, 56, 60, 36, 35, 34, 34,
- 36, 38, 38, 40, 46, 48, 48, 51, 54, 54, 56, 60, 38, 37, 36, 36, 37, 40,
- 40, 41, 47, 49, 49, 53, 56, 56, 58, 63, 39, 38, 37, 37, 39, 40, 40, 42,
- 48, 50, 50, 54, 58, 58, 60, 65, 39, 38, 37, 37, 39, 40, 40, 42, 48, 50,
- 50, 54, 58, 58, 60, 65, 41, 40, 39, 39, 40, 41, 41, 43, 49, 51, 51, 56,
- 60, 60, 62, 67, 44, 42, 41, 41, 42, 43, 43, 45, 51, 53, 53, 59, 63, 63,
- 66, 71, 44, 42, 41, 41, 42, 43, 43, 45, 51, 53, 53, 59, 63, 63, 66, 71,
- 44, 43, 42, 42, 42, 43, 43, 45, 51, 54, 54, 59, 64, 64, 67, 72, 47, 45,
- 44, 44, 44, 45, 45, 47, 53, 56, 56, 61, 66, 66, 69, 75, 48, 46, 45, 45,
- 45, 46, 46, 48, 54, 56, 56, 62, 67, 67, 70, 76, 48, 46, 45, 45, 45, 46,
- 46, 48, 54, 56, 56, 62, 67, 67, 70, 76, 51, 49, 47, 47, 48, 48, 48, 50,
- 56, 58, 58, 64, 69, 69, 73, 79,
- /* Size 32x16 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 34, 34, 34,
35, 36, 36, 38, 39, 39, 41, 44, 44, 44, 47, 48, 48, 51, 31, 31, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 34, 34, 34, 35, 35, 35, 37,
@@ -4327,33 +4297,47 @@
56, 58, 60, 60, 62, 66, 66, 67, 69, 70, 70, 73, 53, 52, 51, 51, 50, 49,
49, 49, 49, 50, 50, 49, 49, 49, 51, 54, 54, 55, 59, 60, 60, 63, 65, 65,
67, 71, 71, 72, 75, 76, 76, 79,
+ /* Size 32x16 */
+ 32, 31, 31, 31, 31, 32, 32, 32, 35, 36, 36, 40, 44, 44, 47, 53, 31, 31,
+ 32, 32, 32, 32, 32, 33, 35, 35, 35, 39, 43, 43, 46, 52, 31, 32, 32, 32,
+ 32, 32, 32, 33, 35, 35, 35, 39, 42, 42, 45, 51, 31, 32, 32, 32, 32, 32,
+ 32, 33, 35, 35, 35, 39, 42, 42, 45, 51, 31, 32, 32, 32, 32, 32, 32, 33,
+ 34, 35, 35, 39, 41, 41, 45, 50, 31, 32, 32, 32, 32, 33, 33, 33, 34, 34,
+ 34, 38, 41, 41, 44, 49, 31, 32, 32, 32, 32, 33, 33, 33, 34, 34, 34, 38,
+ 41, 41, 44, 49, 31, 32, 32, 32, 32, 33, 33, 33, 34, 35, 35, 38, 41, 41,
+ 44, 49, 31, 32, 32, 32, 33, 34, 34, 34, 35, 36, 36, 39, 42, 42, 44, 49,
+ 32, 32, 32, 32, 33, 34, 34, 34, 36, 36, 36, 39, 42, 42, 45, 50, 32, 32,
+ 32, 32, 33, 34, 34, 34, 36, 36, 36, 39, 42, 42, 45, 50, 32, 32, 32, 32,
+ 33, 35, 35, 35, 37, 37, 37, 40, 42, 42, 45, 49, 32, 32, 33, 33, 34, 35,
+ 35, 36, 37, 38, 38, 41, 42, 42, 45, 49, 32, 32, 33, 33, 34, 35, 35, 36,
+ 37, 38, 38, 41, 42, 42, 45, 49, 32, 33, 33, 33, 34, 36, 36, 36, 39, 40,
+ 40, 42, 44, 44, 47, 51, 34, 34, 34, 34, 35, 37, 37, 38, 41, 42, 42, 45,
+ 48, 48, 50, 54, 34, 34, 34, 34, 35, 37, 37, 38, 41, 42, 42, 45, 48, 48,
+ 50, 54, 34, 34, 34, 34, 35, 37, 37, 38, 42, 43, 43, 46, 49, 49, 51, 55,
+ 35, 35, 34, 34, 36, 38, 38, 39, 45, 47, 47, 50, 52, 52, 55, 59, 36, 35,
+ 34, 34, 36, 38, 38, 40, 46, 48, 48, 51, 54, 54, 56, 60, 36, 35, 34, 34,
+ 36, 38, 38, 40, 46, 48, 48, 51, 54, 54, 56, 60, 38, 37, 36, 36, 37, 40,
+ 40, 41, 47, 49, 49, 53, 56, 56, 58, 63, 39, 38, 37, 37, 39, 40, 40, 42,
+ 48, 50, 50, 54, 58, 58, 60, 65, 39, 38, 37, 37, 39, 40, 40, 42, 48, 50,
+ 50, 54, 58, 58, 60, 65, 41, 40, 39, 39, 40, 41, 41, 43, 49, 51, 51, 56,
+ 60, 60, 62, 67, 44, 42, 41, 41, 42, 43, 43, 45, 51, 53, 53, 59, 63, 63,
+ 66, 71, 44, 42, 41, 41, 42, 43, 43, 45, 51, 53, 53, 59, 63, 63, 66, 71,
+ 44, 43, 42, 42, 42, 43, 43, 45, 51, 54, 54, 59, 64, 64, 67, 72, 47, 45,
+ 44, 44, 44, 45, 45, 47, 53, 56, 56, 61, 66, 66, 69, 75, 48, 46, 45, 45,
+ 45, 46, 46, 48, 54, 56, 56, 62, 67, 67, 70, 76, 48, 46, 45, 45, 45, 46,
+ 46, 48, 54, 56, 56, 62, 67, 67, 70, 76, 51, 49, 47, 47, 48, 48, 48, 50,
+ 56, 58, 58, 64, 69, 69, 73, 79,
/* Size 4x16 */
- 31, 32, 36, 44, 32, 32, 35, 42, 32, 32, 35, 41, 32, 33, 34, 41, 32, 34,
- 36, 42, 32, 34, 36, 42, 32, 35, 38, 42, 33, 36, 40, 44, 34, 37, 42, 48,
- 35, 38, 47, 52, 35, 38, 48, 54, 38, 40, 50, 58, 40, 41, 51, 60, 42, 43,
- 53, 63, 45, 45, 56, 66, 46, 46, 56, 67,
- /* Size 16x4 */
31, 32, 32, 32, 32, 32, 32, 33, 34, 35, 35, 38, 40, 42, 45, 46, 32, 32,
32, 33, 34, 34, 35, 36, 37, 38, 38, 40, 41, 43, 45, 46, 36, 35, 35, 34,
36, 36, 38, 40, 42, 47, 48, 50, 51, 53, 56, 56, 44, 42, 41, 41, 42, 42,
42, 44, 48, 52, 54, 58, 60, 63, 66, 67,
+ /* Size 16x4 */
+ 31, 32, 36, 44, 32, 32, 35, 42, 32, 32, 35, 41, 32, 33, 34, 41, 32, 34,
+ 36, 42, 32, 34, 36, 42, 32, 35, 38, 42, 33, 36, 40, 44, 34, 37, 42, 48,
+ 35, 38, 47, 52, 35, 38, 48, 54, 38, 40, 50, 58, 40, 41, 51, 60, 42, 43,
+ 53, 63, 45, 45, 56, 66, 46, 46, 56, 67,
/* Size 8x32 */
- 32, 31, 31, 32, 35, 36, 44, 47, 31, 32, 32, 32, 35, 35, 43, 46, 31, 32,
- 32, 32, 35, 35, 42, 45, 31, 32, 32, 32, 35, 35, 42, 45, 31, 32, 32, 32,
- 34, 35, 41, 45, 31, 32, 32, 33, 34, 34, 41, 44, 31, 32, 32, 33, 34, 34,
- 41, 44, 31, 32, 32, 33, 34, 35, 41, 44, 31, 32, 33, 34, 35, 36, 42, 44,
- 32, 32, 33, 34, 36, 36, 42, 45, 32, 32, 33, 34, 36, 36, 42, 45, 32, 32,
- 33, 35, 37, 37, 42, 45, 32, 33, 34, 35, 37, 38, 42, 45, 32, 33, 34, 35,
- 37, 38, 42, 45, 32, 33, 34, 36, 39, 40, 44, 47, 34, 34, 35, 37, 41, 42,
- 48, 50, 34, 34, 35, 37, 41, 42, 48, 50, 34, 34, 35, 37, 42, 43, 49, 51,
- 35, 34, 36, 38, 45, 47, 52, 55, 36, 34, 36, 38, 46, 48, 54, 56, 36, 34,
- 36, 38, 46, 48, 54, 56, 38, 36, 37, 40, 47, 49, 56, 58, 39, 37, 39, 40,
- 48, 50, 58, 60, 39, 37, 39, 40, 48, 50, 58, 60, 41, 39, 40, 41, 49, 51,
- 60, 62, 44, 41, 42, 43, 51, 53, 63, 66, 44, 41, 42, 43, 51, 53, 63, 66,
- 44, 42, 42, 43, 51, 54, 64, 67, 47, 44, 44, 45, 53, 56, 66, 69, 48, 45,
- 45, 46, 54, 56, 67, 70, 48, 45, 45, 46, 54, 56, 67, 70, 51, 47, 48, 48,
- 56, 58, 69, 73,
- /* Size 32x8 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 34, 34, 34,
35, 36, 36, 38, 39, 39, 41, 44, 44, 44, 47, 48, 48, 51, 31, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 34, 34, 34, 34, 34, 34, 36,
@@ -4368,7 +4352,23 @@
41, 41, 42, 42, 42, 42, 42, 42, 44, 48, 48, 49, 52, 54, 54, 56, 58, 58,
60, 63, 63, 64, 66, 67, 67, 69, 47, 46, 45, 45, 45, 44, 44, 44, 44, 45,
45, 45, 45, 45, 47, 50, 50, 51, 55, 56, 56, 58, 60, 60, 62, 66, 66, 67,
- 69, 70, 70, 73 },
+ 69, 70, 70, 73,
+ /* Size 32x8 */
+ 32, 31, 31, 32, 35, 36, 44, 47, 31, 32, 32, 32, 35, 35, 43, 46, 31, 32,
+ 32, 32, 35, 35, 42, 45, 31, 32, 32, 32, 35, 35, 42, 45, 31, 32, 32, 32,
+ 34, 35, 41, 45, 31, 32, 32, 33, 34, 34, 41, 44, 31, 32, 32, 33, 34, 34,
+ 41, 44, 31, 32, 32, 33, 34, 35, 41, 44, 31, 32, 33, 34, 35, 36, 42, 44,
+ 32, 32, 33, 34, 36, 36, 42, 45, 32, 32, 33, 34, 36, 36, 42, 45, 32, 32,
+ 33, 35, 37, 37, 42, 45, 32, 33, 34, 35, 37, 38, 42, 45, 32, 33, 34, 35,
+ 37, 38, 42, 45, 32, 33, 34, 36, 39, 40, 44, 47, 34, 34, 35, 37, 41, 42,
+ 48, 50, 34, 34, 35, 37, 41, 42, 48, 50, 34, 34, 35, 37, 42, 43, 49, 51,
+ 35, 34, 36, 38, 45, 47, 52, 55, 36, 34, 36, 38, 46, 48, 54, 56, 36, 34,
+ 36, 38, 46, 48, 54, 56, 38, 36, 37, 40, 47, 49, 56, 58, 39, 37, 39, 40,
+ 48, 50, 58, 60, 39, 37, 39, 40, 48, 50, 58, 60, 41, 39, 40, 41, 49, 51,
+ 60, 62, 44, 41, 42, 43, 51, 53, 63, 66, 44, 41, 42, 43, 51, 53, 63, 66,
+ 44, 42, 42, 43, 51, 54, 64, 67, 47, 44, 44, 45, 53, 56, 66, 69, 48, 45,
+ 45, 46, 54, 56, 67, 70, 48, 45, 45, 46, 54, 56, 67, 70, 51, 47, 48, 48,
+ 56, 58, 69, 73 },
{ /* Chroma */
/* Size 4x4 */
31, 37, 47, 47, 37, 44, 47, 45, 47, 47, 53, 53, 47, 45, 53, 59,
@@ -4452,21 +4452,12 @@
61, 63, 51, 50, 49, 49, 48, 47, 47, 47, 47, 47, 47, 47, 46, 46, 48, 50,
50, 51, 53, 54, 54, 56, 57, 57, 58, 60, 60, 61, 62, 63, 63, 64,
/* Size 4x8 */
- 31, 38, 47, 48, 31, 40, 46, 45, 35, 43, 47, 46, 39, 47, 47, 45, 43, 47,
- 50, 50, 47, 47, 53, 55, 46, 46, 53, 58, 48, 46, 54, 59,
- /* Size 8x4 */
31, 31, 35, 39, 43, 47, 46, 48, 38, 40, 43, 47, 47, 47, 46, 46, 47, 46,
47, 47, 50, 53, 53, 54, 48, 45, 46, 45, 50, 55, 58, 59,
+ /* Size 8x4 */
+ 31, 38, 47, 48, 31, 40, 46, 45, 35, 43, 47, 46, 39, 47, 47, 45, 43, 47,
+ 50, 50, 47, 47, 53, 55, 46, 46, 53, 58, 48, 46, 54, 59,
/* Size 8x16 */
- 32, 31, 33, 37, 45, 48, 49, 50, 31, 31, 34, 38, 45, 47, 47, 48, 31, 32,
- 34, 39, 45, 46, 46, 47, 30, 32, 35, 40, 44, 46, 45, 46, 33, 35, 37, 42,
- 46, 47, 45, 46, 33, 36, 38, 43, 46, 47, 46, 46, 37, 40, 43, 47, 47, 47,
- 45, 46, 39, 41, 43, 47, 48, 48, 47, 47, 42, 43, 44, 47, 49, 50, 49, 50,
- 47, 46, 46, 48, 51, 52, 53, 53, 49, 46, 47, 48, 52, 53, 53, 54, 48, 46,
- 46, 47, 51, 53, 56, 56, 48, 45, 46, 46, 51, 53, 57, 57, 49, 45, 45, 46,
- 51, 53, 58, 59, 50, 46, 46, 46, 52, 54, 59, 61, 50, 46, 46, 46, 52, 54,
- 59, 61,
- /* Size 16x8 */
32, 31, 31, 30, 33, 33, 37, 39, 42, 47, 49, 48, 48, 49, 50, 50, 31, 31,
32, 32, 35, 36, 40, 41, 43, 46, 46, 46, 45, 45, 46, 46, 33, 34, 34, 35,
37, 38, 43, 43, 44, 46, 47, 46, 46, 45, 46, 46, 37, 38, 39, 40, 42, 43,
@@ -4475,37 +4466,16 @@
53, 53, 53, 53, 54, 54, 49, 47, 46, 45, 45, 46, 45, 47, 49, 53, 53, 56,
57, 58, 59, 59, 50, 48, 47, 46, 46, 46, 46, 47, 50, 53, 54, 56, 57, 59,
61, 61,
+ /* Size 16x8 */
+ 32, 31, 33, 37, 45, 48, 49, 50, 31, 31, 34, 38, 45, 47, 47, 48, 31, 32,
+ 34, 39, 45, 46, 46, 47, 30, 32, 35, 40, 44, 46, 45, 46, 33, 35, 37, 42,
+ 46, 47, 45, 46, 33, 36, 38, 43, 46, 47, 46, 46, 37, 40, 43, 47, 47, 47,
+ 45, 46, 39, 41, 43, 47, 48, 48, 47, 47, 42, 43, 44, 47, 49, 50, 49, 50,
+ 47, 46, 46, 48, 51, 52, 53, 53, 49, 46, 47, 48, 52, 53, 53, 54, 48, 46,
+ 46, 47, 51, 53, 56, 56, 48, 45, 46, 46, 51, 53, 57, 57, 49, 45, 45, 46,
+ 51, 53, 58, 59, 50, 46, 46, 46, 52, 54, 59, 61, 50, 46, 46, 46, 52, 54,
+ 59, 61,
/* Size 16x32 */
- 32, 31, 31, 31, 33, 37, 37, 38, 45, 48, 48, 49, 49, 49, 50, 52, 31, 31,
- 31, 31, 33, 38, 38, 39, 45, 47, 47, 48, 48, 48, 49, 51, 31, 31, 31, 31,
- 34, 38, 38, 40, 45, 47, 47, 47, 47, 47, 48, 50, 31, 31, 31, 31, 34, 38,
- 38, 40, 45, 47, 47, 47, 47, 47, 48, 50, 31, 31, 32, 32, 34, 39, 39, 40,
- 45, 46, 46, 46, 46, 46, 47, 49, 30, 31, 32, 32, 35, 40, 40, 41, 44, 46,
- 46, 45, 45, 45, 46, 48, 30, 31, 32, 32, 35, 40, 40, 41, 44, 46, 46, 45,
- 45, 45, 46, 48, 31, 32, 33, 33, 35, 40, 40, 41, 45, 46, 46, 45, 45, 45,
- 46, 48, 33, 34, 35, 35, 37, 42, 42, 43, 46, 47, 47, 46, 45, 45, 46, 47,
- 33, 35, 36, 36, 38, 43, 43, 44, 46, 47, 47, 46, 46, 46, 46, 47, 33, 35,
- 36, 36, 38, 43, 43, 44, 46, 47, 47, 46, 46, 46, 46, 47, 35, 37, 38, 38,
- 41, 45, 45, 46, 47, 47, 47, 46, 45, 45, 46, 47, 37, 39, 40, 40, 43, 47,
- 47, 47, 47, 47, 47, 46, 45, 45, 46, 47, 37, 39, 40, 40, 43, 47, 47, 47,
- 47, 47, 47, 46, 45, 45, 46, 47, 39, 40, 41, 41, 43, 47, 47, 47, 48, 48,
- 48, 47, 47, 47, 47, 48, 42, 42, 43, 43, 44, 47, 47, 48, 49, 50, 50, 49,
- 49, 49, 50, 50, 42, 42, 43, 43, 44, 47, 47, 48, 49, 50, 50, 49, 49, 49,
- 50, 50, 43, 43, 43, 43, 45, 47, 47, 48, 50, 50, 50, 50, 50, 50, 50, 51,
- 47, 46, 46, 46, 46, 48, 48, 48, 51, 52, 52, 52, 53, 53, 53, 53, 49, 47,
- 46, 46, 47, 48, 48, 49, 52, 53, 53, 53, 53, 53, 54, 54, 49, 47, 46, 46,
- 47, 48, 48, 49, 52, 53, 53, 53, 53, 53, 54, 54, 48, 47, 46, 46, 46, 47,
- 47, 48, 52, 53, 53, 54, 55, 55, 55, 56, 48, 47, 46, 46, 46, 47, 47, 48,
- 51, 53, 53, 54, 56, 56, 56, 57, 48, 47, 46, 46, 46, 47, 47, 48, 51, 53,
- 53, 54, 56, 56, 56, 57, 48, 47, 45, 45, 46, 46, 46, 47, 51, 53, 53, 55,
- 57, 57, 57, 59, 49, 46, 45, 45, 45, 46, 46, 47, 51, 53, 53, 56, 58, 58,
- 59, 61, 49, 46, 45, 45, 45, 46, 46, 47, 51, 53, 53, 56, 58, 58, 59, 61,
- 49, 47, 45, 45, 45, 46, 46, 47, 52, 53, 53, 56, 58, 58, 60, 62, 50, 48,
- 46, 46, 46, 46, 46, 48, 52, 54, 54, 57, 59, 59, 61, 63, 50, 48, 46, 46,
- 46, 46, 46, 48, 52, 54, 54, 57, 59, 59, 61, 64, 50, 48, 46, 46, 46, 46,
- 46, 48, 52, 54, 54, 57, 59, 59, 61, 64, 51, 49, 47, 47, 47, 47, 47, 48,
- 52, 54, 54, 58, 60, 60, 62, 65,
- /* Size 32x16 */
32, 31, 31, 31, 31, 30, 30, 31, 33, 33, 33, 35, 37, 37, 39, 42, 42, 43,
47, 49, 49, 48, 48, 48, 48, 49, 49, 49, 50, 50, 50, 51, 31, 31, 31, 31,
31, 31, 31, 32, 34, 35, 35, 37, 39, 39, 40, 42, 42, 43, 46, 47, 47, 47,
@@ -4535,33 +4505,47 @@
54, 55, 56, 56, 57, 59, 59, 60, 61, 61, 61, 62, 52, 51, 50, 50, 49, 48,
48, 48, 47, 47, 47, 47, 47, 47, 48, 50, 50, 51, 53, 54, 54, 56, 57, 57,
59, 61, 61, 62, 63, 64, 64, 65,
+ /* Size 32x16 */
+ 32, 31, 31, 31, 33, 37, 37, 38, 45, 48, 48, 49, 49, 49, 50, 52, 31, 31,
+ 31, 31, 33, 38, 38, 39, 45, 47, 47, 48, 48, 48, 49, 51, 31, 31, 31, 31,
+ 34, 38, 38, 40, 45, 47, 47, 47, 47, 47, 48, 50, 31, 31, 31, 31, 34, 38,
+ 38, 40, 45, 47, 47, 47, 47, 47, 48, 50, 31, 31, 32, 32, 34, 39, 39, 40,
+ 45, 46, 46, 46, 46, 46, 47, 49, 30, 31, 32, 32, 35, 40, 40, 41, 44, 46,
+ 46, 45, 45, 45, 46, 48, 30, 31, 32, 32, 35, 40, 40, 41, 44, 46, 46, 45,
+ 45, 45, 46, 48, 31, 32, 33, 33, 35, 40, 40, 41, 45, 46, 46, 45, 45, 45,
+ 46, 48, 33, 34, 35, 35, 37, 42, 42, 43, 46, 47, 47, 46, 45, 45, 46, 47,
+ 33, 35, 36, 36, 38, 43, 43, 44, 46, 47, 47, 46, 46, 46, 46, 47, 33, 35,
+ 36, 36, 38, 43, 43, 44, 46, 47, 47, 46, 46, 46, 46, 47, 35, 37, 38, 38,
+ 41, 45, 45, 46, 47, 47, 47, 46, 45, 45, 46, 47, 37, 39, 40, 40, 43, 47,
+ 47, 47, 47, 47, 47, 46, 45, 45, 46, 47, 37, 39, 40, 40, 43, 47, 47, 47,
+ 47, 47, 47, 46, 45, 45, 46, 47, 39, 40, 41, 41, 43, 47, 47, 47, 48, 48,
+ 48, 47, 47, 47, 47, 48, 42, 42, 43, 43, 44, 47, 47, 48, 49, 50, 50, 49,
+ 49, 49, 50, 50, 42, 42, 43, 43, 44, 47, 47, 48, 49, 50, 50, 49, 49, 49,
+ 50, 50, 43, 43, 43, 43, 45, 47, 47, 48, 50, 50, 50, 50, 50, 50, 50, 51,
+ 47, 46, 46, 46, 46, 48, 48, 48, 51, 52, 52, 52, 53, 53, 53, 53, 49, 47,
+ 46, 46, 47, 48, 48, 49, 52, 53, 53, 53, 53, 53, 54, 54, 49, 47, 46, 46,
+ 47, 48, 48, 49, 52, 53, 53, 53, 53, 53, 54, 54, 48, 47, 46, 46, 46, 47,
+ 47, 48, 52, 53, 53, 54, 55, 55, 55, 56, 48, 47, 46, 46, 46, 47, 47, 48,
+ 51, 53, 53, 54, 56, 56, 56, 57, 48, 47, 46, 46, 46, 47, 47, 48, 51, 53,
+ 53, 54, 56, 56, 56, 57, 48, 47, 45, 45, 46, 46, 46, 47, 51, 53, 53, 55,
+ 57, 57, 57, 59, 49, 46, 45, 45, 45, 46, 46, 47, 51, 53, 53, 56, 58, 58,
+ 59, 61, 49, 46, 45, 45, 45, 46, 46, 47, 51, 53, 53, 56, 58, 58, 59, 61,
+ 49, 47, 45, 45, 45, 46, 46, 47, 52, 53, 53, 56, 58, 58, 60, 62, 50, 48,
+ 46, 46, 46, 46, 46, 48, 52, 54, 54, 57, 59, 59, 61, 63, 50, 48, 46, 46,
+ 46, 46, 46, 48, 52, 54, 54, 57, 59, 59, 61, 64, 50, 48, 46, 46, 46, 46,
+ 46, 48, 52, 54, 54, 57, 59, 59, 61, 64, 51, 49, 47, 47, 47, 47, 47, 48,
+ 52, 54, 54, 58, 60, 60, 62, 65,
/* Size 4x16 */
- 31, 37, 48, 49, 31, 38, 47, 47, 31, 39, 46, 46, 31, 40, 46, 45, 34, 42,
- 47, 45, 35, 43, 47, 46, 39, 47, 47, 45, 40, 47, 48, 47, 42, 47, 50, 49,
- 46, 48, 52, 53, 47, 48, 53, 53, 47, 47, 53, 56, 47, 46, 53, 57, 46, 46,
- 53, 58, 48, 46, 54, 59, 48, 46, 54, 59,
- /* Size 16x4 */
31, 31, 31, 31, 34, 35, 39, 40, 42, 46, 47, 47, 47, 46, 48, 48, 37, 38,
39, 40, 42, 43, 47, 47, 47, 48, 48, 47, 46, 46, 46, 46, 48, 47, 46, 46,
47, 47, 47, 48, 50, 52, 53, 53, 53, 53, 54, 54, 49, 47, 46, 45, 45, 46,
45, 47, 49, 53, 53, 56, 57, 58, 59, 59,
+ /* Size 16x4 */
+ 31, 37, 48, 49, 31, 38, 47, 47, 31, 39, 46, 46, 31, 40, 46, 45, 34, 42,
+ 47, 45, 35, 43, 47, 46, 39, 47, 47, 45, 40, 47, 48, 47, 42, 47, 50, 49,
+ 46, 48, 52, 53, 47, 48, 53, 53, 47, 47, 53, 56, 47, 46, 53, 57, 46, 46,
+ 53, 58, 48, 46, 54, 59, 48, 46, 54, 59,
/* Size 8x32 */
- 32, 31, 33, 37, 45, 48, 49, 50, 31, 31, 33, 38, 45, 47, 48, 49, 31, 31,
- 34, 38, 45, 47, 47, 48, 31, 31, 34, 38, 45, 47, 47, 48, 31, 32, 34, 39,
- 45, 46, 46, 47, 30, 32, 35, 40, 44, 46, 45, 46, 30, 32, 35, 40, 44, 46,
- 45, 46, 31, 33, 35, 40, 45, 46, 45, 46, 33, 35, 37, 42, 46, 47, 45, 46,
- 33, 36, 38, 43, 46, 47, 46, 46, 33, 36, 38, 43, 46, 47, 46, 46, 35, 38,
- 41, 45, 47, 47, 45, 46, 37, 40, 43, 47, 47, 47, 45, 46, 37, 40, 43, 47,
- 47, 47, 45, 46, 39, 41, 43, 47, 48, 48, 47, 47, 42, 43, 44, 47, 49, 50,
- 49, 50, 42, 43, 44, 47, 49, 50, 49, 50, 43, 43, 45, 47, 50, 50, 50, 50,
- 47, 46, 46, 48, 51, 52, 53, 53, 49, 46, 47, 48, 52, 53, 53, 54, 49, 46,
- 47, 48, 52, 53, 53, 54, 48, 46, 46, 47, 52, 53, 55, 55, 48, 46, 46, 47,
- 51, 53, 56, 56, 48, 46, 46, 47, 51, 53, 56, 56, 48, 45, 46, 46, 51, 53,
- 57, 57, 49, 45, 45, 46, 51, 53, 58, 59, 49, 45, 45, 46, 51, 53, 58, 59,
- 49, 45, 45, 46, 52, 53, 58, 60, 50, 46, 46, 46, 52, 54, 59, 61, 50, 46,
- 46, 46, 52, 54, 59, 61, 50, 46, 46, 46, 52, 54, 59, 61, 51, 47, 47, 47,
- 52, 54, 60, 62,
- /* Size 32x8 */
32, 31, 31, 31, 31, 30, 30, 31, 33, 33, 33, 35, 37, 37, 39, 42, 42, 43,
47, 49, 49, 48, 48, 48, 48, 49, 49, 49, 50, 50, 50, 51, 31, 31, 31, 31,
32, 32, 32, 33, 35, 36, 36, 38, 40, 40, 41, 43, 43, 43, 46, 46, 46, 46,
@@ -4576,7 +4560,23 @@
45, 45, 45, 46, 46, 45, 45, 45, 47, 49, 49, 50, 53, 53, 53, 55, 56, 56,
57, 58, 58, 58, 59, 59, 59, 60, 50, 49, 48, 48, 47, 46, 46, 46, 46, 46,
46, 46, 46, 46, 47, 50, 50, 50, 53, 54, 54, 55, 56, 56, 57, 59, 59, 60,
- 61, 61, 61, 62 },
+ 61, 61, 61, 62,
+ /* Size 32x8 */
+ 32, 31, 33, 37, 45, 48, 49, 50, 31, 31, 33, 38, 45, 47, 48, 49, 31, 31,
+ 34, 38, 45, 47, 47, 48, 31, 31, 34, 38, 45, 47, 47, 48, 31, 32, 34, 39,
+ 45, 46, 46, 47, 30, 32, 35, 40, 44, 46, 45, 46, 30, 32, 35, 40, 44, 46,
+ 45, 46, 31, 33, 35, 40, 45, 46, 45, 46, 33, 35, 37, 42, 46, 47, 45, 46,
+ 33, 36, 38, 43, 46, 47, 46, 46, 33, 36, 38, 43, 46, 47, 46, 46, 35, 38,
+ 41, 45, 47, 47, 45, 46, 37, 40, 43, 47, 47, 47, 45, 46, 37, 40, 43, 47,
+ 47, 47, 45, 46, 39, 41, 43, 47, 48, 48, 47, 47, 42, 43, 44, 47, 49, 50,
+ 49, 50, 42, 43, 44, 47, 49, 50, 49, 50, 43, 43, 45, 47, 50, 50, 50, 50,
+ 47, 46, 46, 48, 51, 52, 53, 53, 49, 46, 47, 48, 52, 53, 53, 54, 49, 46,
+ 47, 48, 52, 53, 53, 54, 48, 46, 46, 47, 52, 53, 55, 55, 48, 46, 46, 47,
+ 51, 53, 56, 56, 48, 46, 46, 47, 51, 53, 56, 56, 48, 45, 46, 46, 51, 53,
+ 57, 57, 49, 45, 45, 46, 51, 53, 58, 59, 49, 45, 45, 46, 51, 53, 58, 59,
+ 49, 45, 45, 46, 52, 53, 58, 60, 50, 46, 46, 46, 52, 54, 59, 61, 50, 46,
+ 46, 46, 52, 54, 59, 61, 50, 46, 46, 46, 52, 54, 59, 61, 51, 47, 47, 47,
+ 52, 54, 60, 62 },
},
{
{ /* Luma */
@@ -4662,21 +4662,12 @@
63, 63, 44, 43, 42, 42, 42, 41, 41, 41, 41, 41, 42, 42, 42, 42, 42, 42,
42, 45, 47, 47, 47, 50, 54, 54, 54, 56, 58, 58, 58, 60, 63, 63,
/* Size 4x8 */
- 31, 32, 34, 39, 32, 32, 34, 38, 32, 33, 34, 38, 32, 33, 36, 40, 33, 34,
- 38, 42, 34, 36, 41, 47, 37, 38, 44, 52, 40, 40, 46, 56,
- /* Size 8x4 */
31, 32, 32, 32, 33, 34, 37, 40, 32, 32, 33, 33, 34, 36, 38, 40, 34, 34,
34, 36, 38, 41, 44, 46, 39, 38, 38, 40, 42, 47, 52, 56,
+ /* Size 8x4 */
+ 31, 32, 34, 39, 32, 32, 34, 38, 32, 33, 34, 38, 32, 33, 36, 40, 33, 34,
+ 38, 42, 34, 36, 41, 47, 37, 38, 44, 52, 40, 40, 46, 56,
/* Size 8x16 */
- 32, 31, 31, 32, 32, 36, 36, 44, 31, 32, 32, 32, 32, 35, 35, 42, 31, 32,
- 32, 32, 32, 35, 35, 42, 31, 32, 32, 33, 33, 34, 34, 41, 31, 32, 32, 33,
- 33, 34, 34, 41, 32, 32, 32, 34, 34, 36, 36, 42, 32, 32, 32, 34, 34, 36,
- 36, 42, 32, 33, 33, 35, 35, 38, 38, 42, 32, 33, 33, 35, 35, 38, 38, 42,
- 34, 34, 34, 37, 37, 42, 42, 48, 34, 34, 34, 37, 37, 42, 42, 48, 36, 34,
- 34, 38, 38, 48, 48, 54, 36, 34, 34, 38, 38, 48, 48, 54, 39, 37, 37, 40,
- 40, 50, 50, 58, 39, 37, 37, 40, 40, 50, 50, 58, 44, 41, 41, 43, 43, 53,
- 53, 63,
- /* Size 16x8 */
32, 31, 31, 31, 31, 32, 32, 32, 32, 34, 34, 36, 36, 39, 39, 44, 31, 32,
32, 32, 32, 32, 32, 33, 33, 34, 34, 34, 34, 37, 37, 41, 31, 32, 32, 32,
32, 32, 32, 33, 33, 34, 34, 34, 34, 37, 37, 41, 32, 32, 32, 33, 33, 34,
@@ -4685,37 +4676,16 @@
42, 48, 48, 50, 50, 53, 36, 35, 35, 34, 34, 36, 36, 38, 38, 42, 42, 48,
48, 50, 50, 53, 44, 42, 42, 41, 41, 42, 42, 42, 42, 48, 48, 54, 54, 58,
58, 63,
+ /* Size 16x8 */
+ 32, 31, 31, 32, 32, 36, 36, 44, 31, 32, 32, 32, 32, 35, 35, 42, 31, 32,
+ 32, 32, 32, 35, 35, 42, 31, 32, 32, 33, 33, 34, 34, 41, 31, 32, 32, 33,
+ 33, 34, 34, 41, 32, 32, 32, 34, 34, 36, 36, 42, 32, 32, 32, 34, 34, 36,
+ 36, 42, 32, 33, 33, 35, 35, 38, 38, 42, 32, 33, 33, 35, 35, 38, 38, 42,
+ 34, 34, 34, 37, 37, 42, 42, 48, 34, 34, 34, 37, 37, 42, 42, 48, 36, 34,
+ 34, 38, 38, 48, 48, 54, 36, 34, 34, 38, 38, 48, 48, 54, 39, 37, 37, 40,
+ 40, 50, 50, 58, 39, 37, 37, 40, 40, 50, 50, 58, 44, 41, 41, 43, 43, 53,
+ 53, 63,
/* Size 16x32 */
- 32, 31, 31, 31, 31, 32, 32, 32, 32, 34, 36, 36, 36, 39, 44, 44, 31, 31,
- 31, 31, 31, 32, 32, 32, 32, 34, 35, 35, 35, 39, 43, 43, 31, 32, 32, 32,
- 32, 32, 32, 32, 32, 34, 35, 35, 35, 38, 42, 42, 31, 32, 32, 32, 32, 32,
- 32, 32, 32, 34, 35, 35, 35, 38, 42, 42, 31, 32, 32, 32, 32, 32, 32, 32,
- 32, 34, 35, 35, 35, 38, 42, 42, 31, 32, 32, 32, 32, 32, 32, 32, 32, 34,
- 35, 35, 35, 38, 41, 41, 31, 32, 32, 32, 32, 32, 33, 33, 33, 33, 34, 34,
- 34, 37, 41, 41, 31, 32, 32, 32, 32, 32, 33, 33, 33, 33, 34, 34, 34, 37,
- 41, 41, 31, 32, 32, 32, 32, 32, 33, 33, 33, 33, 34, 34, 34, 37, 41, 41,
- 31, 32, 32, 32, 32, 33, 33, 33, 33, 34, 35, 35, 35, 38, 41, 41, 32, 32,
- 32, 32, 32, 33, 34, 34, 34, 35, 36, 36, 36, 39, 42, 42, 32, 32, 32, 32,
- 32, 33, 34, 34, 34, 35, 36, 36, 36, 39, 42, 42, 32, 32, 32, 32, 32, 33,
- 34, 34, 34, 35, 36, 36, 36, 39, 42, 42, 32, 32, 32, 32, 32, 33, 34, 34,
- 34, 36, 37, 37, 37, 40, 42, 42, 32, 32, 33, 33, 33, 34, 35, 35, 35, 37,
- 38, 38, 38, 40, 42, 42, 32, 32, 33, 33, 33, 34, 35, 35, 35, 37, 38, 38,
- 38, 40, 42, 42, 32, 32, 33, 33, 33, 34, 35, 35, 35, 37, 38, 38, 38, 40,
- 42, 42, 33, 33, 33, 33, 33, 34, 36, 36, 36, 38, 40, 40, 40, 42, 45, 45,
- 34, 34, 34, 34, 34, 35, 37, 37, 37, 39, 42, 42, 42, 45, 48, 48, 34, 34,
- 34, 34, 34, 35, 37, 37, 37, 39, 42, 42, 42, 45, 48, 48, 34, 34, 34, 34,
- 34, 35, 37, 37, 37, 39, 42, 42, 42, 45, 48, 48, 35, 34, 34, 34, 34, 36,
- 37, 37, 37, 41, 45, 45, 45, 47, 50, 50, 36, 35, 34, 34, 34, 36, 38, 38,
- 38, 43, 48, 48, 48, 51, 54, 54, 36, 35, 34, 34, 34, 36, 38, 38, 38, 43,
- 48, 48, 48, 51, 54, 54, 36, 35, 34, 34, 34, 36, 38, 38, 38, 43, 48, 48,
- 48, 51, 54, 54, 37, 37, 36, 36, 36, 38, 39, 39, 39, 44, 49, 49, 49, 52,
- 56, 56, 39, 38, 37, 37, 37, 39, 40, 40, 40, 45, 50, 50, 50, 54, 58, 58,
- 39, 38, 37, 37, 37, 39, 40, 40, 40, 45, 50, 50, 50, 54, 58, 58, 39, 38,
- 37, 37, 37, 39, 40, 40, 40, 45, 50, 50, 50, 54, 58, 58, 41, 40, 39, 39,
- 39, 40, 42, 42, 42, 46, 52, 52, 52, 56, 60, 60, 44, 42, 41, 41, 41, 42,
- 43, 43, 43, 48, 53, 53, 53, 58, 63, 63, 44, 42, 41, 41, 41, 42, 43, 43,
- 43, 48, 53, 53, 53, 58, 63, 63,
- /* Size 32x16 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 33,
34, 34, 34, 35, 36, 36, 36, 37, 39, 39, 39, 41, 44, 44, 31, 31, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 34, 34, 34, 34,
@@ -4745,33 +4715,47 @@
48, 50, 54, 54, 54, 56, 58, 58, 58, 60, 63, 63, 44, 43, 42, 42, 42, 41,
41, 41, 41, 41, 42, 42, 42, 42, 42, 42, 42, 45, 48, 48, 48, 50, 54, 54,
54, 56, 58, 58, 58, 60, 63, 63,
+ /* Size 32x16 */
+ 32, 31, 31, 31, 31, 32, 32, 32, 32, 34, 36, 36, 36, 39, 44, 44, 31, 31,
+ 31, 31, 31, 32, 32, 32, 32, 34, 35, 35, 35, 39, 43, 43, 31, 32, 32, 32,
+ 32, 32, 32, 32, 32, 34, 35, 35, 35, 38, 42, 42, 31, 32, 32, 32, 32, 32,
+ 32, 32, 32, 34, 35, 35, 35, 38, 42, 42, 31, 32, 32, 32, 32, 32, 32, 32,
+ 32, 34, 35, 35, 35, 38, 42, 42, 31, 32, 32, 32, 32, 32, 32, 32, 32, 34,
+ 35, 35, 35, 38, 41, 41, 31, 32, 32, 32, 32, 32, 33, 33, 33, 33, 34, 34,
+ 34, 37, 41, 41, 31, 32, 32, 32, 32, 32, 33, 33, 33, 33, 34, 34, 34, 37,
+ 41, 41, 31, 32, 32, 32, 32, 32, 33, 33, 33, 33, 34, 34, 34, 37, 41, 41,
+ 31, 32, 32, 32, 32, 33, 33, 33, 33, 34, 35, 35, 35, 38, 41, 41, 32, 32,
+ 32, 32, 32, 33, 34, 34, 34, 35, 36, 36, 36, 39, 42, 42, 32, 32, 32, 32,
+ 32, 33, 34, 34, 34, 35, 36, 36, 36, 39, 42, 42, 32, 32, 32, 32, 32, 33,
+ 34, 34, 34, 35, 36, 36, 36, 39, 42, 42, 32, 32, 32, 32, 32, 33, 34, 34,
+ 34, 36, 37, 37, 37, 40, 42, 42, 32, 32, 33, 33, 33, 34, 35, 35, 35, 37,
+ 38, 38, 38, 40, 42, 42, 32, 32, 33, 33, 33, 34, 35, 35, 35, 37, 38, 38,
+ 38, 40, 42, 42, 32, 32, 33, 33, 33, 34, 35, 35, 35, 37, 38, 38, 38, 40,
+ 42, 42, 33, 33, 33, 33, 33, 34, 36, 36, 36, 38, 40, 40, 40, 42, 45, 45,
+ 34, 34, 34, 34, 34, 35, 37, 37, 37, 39, 42, 42, 42, 45, 48, 48, 34, 34,
+ 34, 34, 34, 35, 37, 37, 37, 39, 42, 42, 42, 45, 48, 48, 34, 34, 34, 34,
+ 34, 35, 37, 37, 37, 39, 42, 42, 42, 45, 48, 48, 35, 34, 34, 34, 34, 36,
+ 37, 37, 37, 41, 45, 45, 45, 47, 50, 50, 36, 35, 34, 34, 34, 36, 38, 38,
+ 38, 43, 48, 48, 48, 51, 54, 54, 36, 35, 34, 34, 34, 36, 38, 38, 38, 43,
+ 48, 48, 48, 51, 54, 54, 36, 35, 34, 34, 34, 36, 38, 38, 38, 43, 48, 48,
+ 48, 51, 54, 54, 37, 37, 36, 36, 36, 38, 39, 39, 39, 44, 49, 49, 49, 52,
+ 56, 56, 39, 38, 37, 37, 37, 39, 40, 40, 40, 45, 50, 50, 50, 54, 58, 58,
+ 39, 38, 37, 37, 37, 39, 40, 40, 40, 45, 50, 50, 50, 54, 58, 58, 39, 38,
+ 37, 37, 37, 39, 40, 40, 40, 45, 50, 50, 50, 54, 58, 58, 41, 40, 39, 39,
+ 39, 40, 42, 42, 42, 46, 52, 52, 52, 56, 60, 60, 44, 42, 41, 41, 41, 42,
+ 43, 43, 43, 48, 53, 53, 53, 58, 63, 63, 44, 42, 41, 41, 41, 42, 43, 43,
+ 43, 48, 53, 53, 53, 58, 63, 63,
/* Size 4x16 */
- 31, 32, 34, 39, 32, 32, 34, 38, 32, 32, 34, 38, 32, 32, 33, 37, 32, 32,
- 33, 37, 32, 33, 35, 39, 32, 33, 35, 39, 32, 34, 37, 40, 32, 34, 37, 40,
- 34, 35, 39, 45, 34, 35, 39, 45, 35, 36, 43, 51, 35, 36, 43, 51, 38, 39,
- 45, 54, 38, 39, 45, 54, 42, 42, 48, 58,
- /* Size 16x4 */
31, 32, 32, 32, 32, 32, 32, 32, 32, 34, 34, 35, 35, 38, 38, 42, 32, 32,
32, 32, 32, 33, 33, 34, 34, 35, 35, 36, 36, 39, 39, 42, 34, 34, 34, 33,
33, 35, 35, 37, 37, 39, 39, 43, 43, 45, 45, 48, 39, 38, 38, 37, 37, 39,
39, 40, 40, 45, 45, 51, 51, 54, 54, 58,
+ /* Size 16x4 */
+ 31, 32, 34, 39, 32, 32, 34, 38, 32, 32, 34, 38, 32, 32, 33, 37, 32, 32,
+ 33, 37, 32, 33, 35, 39, 32, 33, 35, 39, 32, 34, 37, 40, 32, 34, 37, 40,
+ 34, 35, 39, 45, 34, 35, 39, 45, 35, 36, 43, 51, 35, 36, 43, 51, 38, 39,
+ 45, 54, 38, 39, 45, 54, 42, 42, 48, 58,
/* Size 8x32 */
- 32, 31, 31, 32, 32, 36, 36, 44, 31, 31, 31, 32, 32, 35, 35, 43, 31, 32,
- 32, 32, 32, 35, 35, 42, 31, 32, 32, 32, 32, 35, 35, 42, 31, 32, 32, 32,
- 32, 35, 35, 42, 31, 32, 32, 32, 32, 35, 35, 41, 31, 32, 32, 33, 33, 34,
- 34, 41, 31, 32, 32, 33, 33, 34, 34, 41, 31, 32, 32, 33, 33, 34, 34, 41,
- 31, 32, 32, 33, 33, 35, 35, 41, 32, 32, 32, 34, 34, 36, 36, 42, 32, 32,
- 32, 34, 34, 36, 36, 42, 32, 32, 32, 34, 34, 36, 36, 42, 32, 32, 32, 34,
- 34, 37, 37, 42, 32, 33, 33, 35, 35, 38, 38, 42, 32, 33, 33, 35, 35, 38,
- 38, 42, 32, 33, 33, 35, 35, 38, 38, 42, 33, 33, 33, 36, 36, 40, 40, 45,
- 34, 34, 34, 37, 37, 42, 42, 48, 34, 34, 34, 37, 37, 42, 42, 48, 34, 34,
- 34, 37, 37, 42, 42, 48, 35, 34, 34, 37, 37, 45, 45, 50, 36, 34, 34, 38,
- 38, 48, 48, 54, 36, 34, 34, 38, 38, 48, 48, 54, 36, 34, 34, 38, 38, 48,
- 48, 54, 37, 36, 36, 39, 39, 49, 49, 56, 39, 37, 37, 40, 40, 50, 50, 58,
- 39, 37, 37, 40, 40, 50, 50, 58, 39, 37, 37, 40, 40, 50, 50, 58, 41, 39,
- 39, 42, 42, 52, 52, 60, 44, 41, 41, 43, 43, 53, 53, 63, 44, 41, 41, 43,
- 43, 53, 53, 63,
- /* Size 32x8 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 33,
34, 34, 34, 35, 36, 36, 36, 37, 39, 39, 39, 41, 44, 44, 31, 31, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 34, 34, 34, 34,
@@ -4786,7 +4770,23 @@
34, 34, 34, 35, 36, 36, 36, 37, 38, 38, 38, 40, 42, 42, 42, 45, 48, 48,
48, 49, 50, 50, 50, 52, 53, 53, 44, 43, 42, 42, 42, 41, 41, 41, 41, 41,
42, 42, 42, 42, 42, 42, 42, 45, 48, 48, 48, 50, 54, 54, 54, 56, 58, 58,
- 58, 60, 63, 63 },
+ 58, 60, 63, 63,
+ /* Size 32x8 */
+ 32, 31, 31, 32, 32, 36, 36, 44, 31, 31, 31, 32, 32, 35, 35, 43, 31, 32,
+ 32, 32, 32, 35, 35, 42, 31, 32, 32, 32, 32, 35, 35, 42, 31, 32, 32, 32,
+ 32, 35, 35, 42, 31, 32, 32, 32, 32, 35, 35, 41, 31, 32, 32, 33, 33, 34,
+ 34, 41, 31, 32, 32, 33, 33, 34, 34, 41, 31, 32, 32, 33, 33, 34, 34, 41,
+ 31, 32, 32, 33, 33, 35, 35, 41, 32, 32, 32, 34, 34, 36, 36, 42, 32, 32,
+ 32, 34, 34, 36, 36, 42, 32, 32, 32, 34, 34, 36, 36, 42, 32, 32, 32, 34,
+ 34, 37, 37, 42, 32, 33, 33, 35, 35, 38, 38, 42, 32, 33, 33, 35, 35, 38,
+ 38, 42, 32, 33, 33, 35, 35, 38, 38, 42, 33, 33, 33, 36, 36, 40, 40, 45,
+ 34, 34, 34, 37, 37, 42, 42, 48, 34, 34, 34, 37, 37, 42, 42, 48, 34, 34,
+ 34, 37, 37, 42, 42, 48, 35, 34, 34, 37, 37, 45, 45, 50, 36, 34, 34, 38,
+ 38, 48, 48, 54, 36, 34, 34, 38, 38, 48, 48, 54, 36, 34, 34, 38, 38, 48,
+ 48, 54, 37, 36, 36, 39, 39, 49, 49, 56, 39, 37, 37, 40, 40, 50, 50, 58,
+ 39, 37, 37, 40, 40, 50, 50, 58, 39, 37, 37, 40, 40, 50, 50, 58, 41, 39,
+ 39, 42, 42, 52, 52, 60, 44, 41, 41, 43, 43, 53, 53, 63, 44, 41, 41, 43,
+ 43, 53, 53, 63 },
{ /* Chroma */
/* Size 4x4 */
31, 34, 42, 47, 34, 39, 45, 46, 42, 45, 48, 49, 47, 46, 49, 54,
@@ -4870,21 +4870,12 @@
58, 58, 49, 48, 47, 47, 47, 46, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
45, 47, 49, 49, 49, 51, 53, 53, 53, 54, 55, 55, 55, 57, 58, 58,
/* Size 4x8 */
- 31, 34, 42, 48, 31, 35, 42, 46, 33, 37, 44, 46, 36, 41, 46, 46, 40, 44,
- 48, 48, 45, 46, 49, 51, 47, 47, 50, 54, 47, 46, 49, 55,
- /* Size 8x4 */
31, 31, 33, 36, 40, 45, 47, 47, 34, 35, 37, 41, 44, 46, 47, 46, 42, 42,
44, 46, 48, 49, 50, 49, 48, 46, 46, 46, 48, 51, 54, 55,
+ /* Size 8x4 */
+ 31, 34, 42, 48, 31, 35, 42, 46, 33, 37, 44, 46, 36, 41, 46, 46, 40, 44,
+ 48, 48, 45, 46, 49, 51, 47, 47, 50, 54, 47, 46, 49, 55,
/* Size 8x16 */
- 32, 31, 31, 37, 37, 48, 48, 49, 31, 31, 31, 38, 38, 47, 47, 47, 31, 31,
- 31, 38, 38, 47, 47, 47, 30, 32, 32, 40, 40, 46, 46, 45, 30, 32, 32, 40,
- 40, 46, 46, 45, 33, 36, 36, 43, 43, 47, 47, 46, 33, 36, 36, 43, 43, 47,
- 47, 46, 37, 40, 40, 47, 47, 47, 47, 45, 37, 40, 40, 47, 47, 47, 47, 45,
- 42, 43, 43, 47, 47, 50, 50, 49, 42, 43, 43, 47, 47, 50, 50, 49, 49, 46,
- 46, 48, 48, 53, 53, 53, 49, 46, 46, 48, 48, 53, 53, 53, 48, 46, 46, 47,
- 47, 53, 53, 56, 48, 46, 46, 47, 47, 53, 53, 56, 49, 45, 45, 46, 46, 53,
- 53, 58,
- /* Size 16x8 */
32, 31, 31, 30, 30, 33, 33, 37, 37, 42, 42, 49, 49, 48, 48, 49, 31, 31,
31, 32, 32, 36, 36, 40, 40, 43, 43, 46, 46, 46, 46, 45, 31, 31, 31, 32,
32, 36, 36, 40, 40, 43, 43, 46, 46, 46, 46, 45, 37, 38, 38, 40, 40, 43,
@@ -4893,37 +4884,16 @@
50, 53, 53, 53, 53, 53, 48, 47, 47, 46, 46, 47, 47, 47, 47, 50, 50, 53,
53, 53, 53, 53, 49, 47, 47, 45, 45, 46, 46, 45, 45, 49, 49, 53, 53, 56,
56, 58,
+ /* Size 16x8 */
+ 32, 31, 31, 37, 37, 48, 48, 49, 31, 31, 31, 38, 38, 47, 47, 47, 31, 31,
+ 31, 38, 38, 47, 47, 47, 30, 32, 32, 40, 40, 46, 46, 45, 30, 32, 32, 40,
+ 40, 46, 46, 45, 33, 36, 36, 43, 43, 47, 47, 46, 33, 36, 36, 43, 43, 47,
+ 47, 46, 37, 40, 40, 47, 47, 47, 47, 45, 37, 40, 40, 47, 47, 47, 47, 45,
+ 42, 43, 43, 47, 47, 50, 50, 49, 42, 43, 43, 47, 47, 50, 50, 49, 49, 46,
+ 46, 48, 48, 53, 53, 53, 49, 46, 46, 48, 48, 53, 53, 53, 48, 46, 46, 47,
+ 47, 53, 53, 56, 48, 46, 46, 47, 47, 53, 53, 56, 49, 45, 45, 46, 46, 53,
+ 53, 58,
/* Size 16x32 */
- 32, 31, 31, 31, 31, 33, 37, 37, 37, 42, 48, 48, 48, 48, 49, 49, 31, 31,
- 31, 31, 31, 34, 37, 37, 37, 42, 47, 47, 47, 48, 48, 48, 31, 31, 31, 31,
- 31, 34, 38, 38, 38, 42, 47, 47, 47, 47, 47, 47, 31, 31, 31, 31, 31, 34,
- 38, 38, 38, 42, 47, 47, 47, 47, 47, 47, 31, 31, 31, 31, 31, 34, 38, 38,
- 38, 42, 47, 47, 47, 47, 47, 47, 31, 31, 32, 32, 32, 35, 39, 39, 39, 42,
- 46, 46, 46, 46, 46, 46, 30, 31, 32, 32, 32, 35, 40, 40, 40, 42, 46, 46,
- 46, 45, 45, 45, 30, 31, 32, 32, 32, 35, 40, 40, 40, 42, 46, 46, 46, 45,
- 45, 45, 30, 31, 32, 32, 32, 35, 40, 40, 40, 42, 46, 46, 46, 45, 45, 45,
- 32, 33, 34, 34, 34, 37, 41, 41, 41, 44, 46, 46, 46, 46, 45, 45, 33, 34,
- 36, 36, 36, 39, 43, 43, 43, 45, 47, 47, 47, 46, 46, 46, 33, 34, 36, 36,
- 36, 39, 43, 43, 43, 45, 47, 47, 47, 46, 46, 46, 33, 34, 36, 36, 36, 39,
- 43, 43, 43, 45, 47, 47, 47, 46, 46, 46, 35, 36, 38, 38, 38, 41, 45, 45,
- 45, 46, 47, 47, 47, 46, 45, 45, 37, 38, 40, 40, 40, 43, 47, 47, 47, 47,
- 47, 47, 47, 46, 45, 45, 37, 38, 40, 40, 40, 43, 47, 47, 47, 47, 47, 47,
- 47, 46, 45, 45, 37, 38, 40, 40, 40, 43, 47, 47, 47, 47, 47, 47, 47, 46,
- 45, 45, 39, 40, 41, 41, 41, 44, 47, 47, 47, 48, 49, 49, 49, 48, 47, 47,
- 42, 42, 43, 43, 43, 45, 47, 47, 47, 48, 50, 50, 50, 50, 49, 49, 42, 42,
- 43, 43, 43, 45, 47, 47, 47, 48, 50, 50, 50, 50, 49, 49, 42, 42, 43, 43,
- 43, 45, 47, 47, 47, 48, 50, 50, 50, 50, 49, 49, 45, 45, 44, 44, 44, 46,
- 47, 47, 47, 49, 51, 51, 51, 51, 51, 51, 49, 48, 46, 46, 46, 47, 48, 48,
- 48, 50, 53, 53, 53, 53, 53, 53, 49, 48, 46, 46, 46, 47, 48, 48, 48, 50,
- 53, 53, 53, 53, 53, 53, 49, 48, 46, 46, 46, 47, 48, 48, 48, 50, 53, 53,
- 53, 53, 53, 53, 48, 47, 46, 46, 46, 47, 47, 47, 47, 50, 53, 53, 53, 54,
- 54, 54, 48, 47, 46, 46, 46, 46, 47, 47, 47, 50, 53, 53, 53, 54, 56, 56,
- 48, 47, 46, 46, 46, 46, 47, 47, 47, 50, 53, 53, 53, 54, 56, 56, 48, 47,
- 46, 46, 46, 46, 47, 47, 47, 50, 53, 53, 53, 54, 56, 56, 48, 47, 45, 45,
- 45, 46, 46, 46, 46, 49, 53, 53, 53, 55, 57, 57, 49, 47, 45, 45, 45, 45,
- 46, 46, 46, 49, 53, 53, 53, 56, 58, 58, 49, 47, 45, 45, 45, 45, 46, 46,
- 46, 49, 53, 53, 53, 56, 58, 58,
- /* Size 32x16 */
32, 31, 31, 31, 31, 31, 30, 30, 30, 32, 33, 33, 33, 35, 37, 37, 37, 39,
42, 42, 42, 45, 49, 49, 49, 48, 48, 48, 48, 48, 49, 49, 31, 31, 31, 31,
31, 31, 31, 31, 31, 33, 34, 34, 34, 36, 38, 38, 38, 40, 42, 42, 42, 45,
@@ -4953,33 +4923,47 @@
49, 51, 53, 53, 53, 54, 56, 56, 56, 57, 58, 58, 49, 48, 47, 47, 47, 46,
45, 45, 45, 45, 46, 46, 46, 45, 45, 45, 45, 47, 49, 49, 49, 51, 53, 53,
53, 54, 56, 56, 56, 57, 58, 58,
+ /* Size 32x16 */
+ 32, 31, 31, 31, 31, 33, 37, 37, 37, 42, 48, 48, 48, 48, 49, 49, 31, 31,
+ 31, 31, 31, 34, 37, 37, 37, 42, 47, 47, 47, 48, 48, 48, 31, 31, 31, 31,
+ 31, 34, 38, 38, 38, 42, 47, 47, 47, 47, 47, 47, 31, 31, 31, 31, 31, 34,
+ 38, 38, 38, 42, 47, 47, 47, 47, 47, 47, 31, 31, 31, 31, 31, 34, 38, 38,
+ 38, 42, 47, 47, 47, 47, 47, 47, 31, 31, 32, 32, 32, 35, 39, 39, 39, 42,
+ 46, 46, 46, 46, 46, 46, 30, 31, 32, 32, 32, 35, 40, 40, 40, 42, 46, 46,
+ 46, 45, 45, 45, 30, 31, 32, 32, 32, 35, 40, 40, 40, 42, 46, 46, 46, 45,
+ 45, 45, 30, 31, 32, 32, 32, 35, 40, 40, 40, 42, 46, 46, 46, 45, 45, 45,
+ 32, 33, 34, 34, 34, 37, 41, 41, 41, 44, 46, 46, 46, 46, 45, 45, 33, 34,
+ 36, 36, 36, 39, 43, 43, 43, 45, 47, 47, 47, 46, 46, 46, 33, 34, 36, 36,
+ 36, 39, 43, 43, 43, 45, 47, 47, 47, 46, 46, 46, 33, 34, 36, 36, 36, 39,
+ 43, 43, 43, 45, 47, 47, 47, 46, 46, 46, 35, 36, 38, 38, 38, 41, 45, 45,
+ 45, 46, 47, 47, 47, 46, 45, 45, 37, 38, 40, 40, 40, 43, 47, 47, 47, 47,
+ 47, 47, 47, 46, 45, 45, 37, 38, 40, 40, 40, 43, 47, 47, 47, 47, 47, 47,
+ 47, 46, 45, 45, 37, 38, 40, 40, 40, 43, 47, 47, 47, 47, 47, 47, 47, 46,
+ 45, 45, 39, 40, 41, 41, 41, 44, 47, 47, 47, 48, 49, 49, 49, 48, 47, 47,
+ 42, 42, 43, 43, 43, 45, 47, 47, 47, 48, 50, 50, 50, 50, 49, 49, 42, 42,
+ 43, 43, 43, 45, 47, 47, 47, 48, 50, 50, 50, 50, 49, 49, 42, 42, 43, 43,
+ 43, 45, 47, 47, 47, 48, 50, 50, 50, 50, 49, 49, 45, 45, 44, 44, 44, 46,
+ 47, 47, 47, 49, 51, 51, 51, 51, 51, 51, 49, 48, 46, 46, 46, 47, 48, 48,
+ 48, 50, 53, 53, 53, 53, 53, 53, 49, 48, 46, 46, 46, 47, 48, 48, 48, 50,
+ 53, 53, 53, 53, 53, 53, 49, 48, 46, 46, 46, 47, 48, 48, 48, 50, 53, 53,
+ 53, 53, 53, 53, 48, 47, 46, 46, 46, 47, 47, 47, 47, 50, 53, 53, 53, 54,
+ 54, 54, 48, 47, 46, 46, 46, 46, 47, 47, 47, 50, 53, 53, 53, 54, 56, 56,
+ 48, 47, 46, 46, 46, 46, 47, 47, 47, 50, 53, 53, 53, 54, 56, 56, 48, 47,
+ 46, 46, 46, 46, 47, 47, 47, 50, 53, 53, 53, 54, 56, 56, 48, 47, 45, 45,
+ 45, 46, 46, 46, 46, 49, 53, 53, 53, 55, 57, 57, 49, 47, 45, 45, 45, 45,
+ 46, 46, 46, 49, 53, 53, 53, 56, 58, 58, 49, 47, 45, 45, 45, 45, 46, 46,
+ 46, 49, 53, 53, 53, 56, 58, 58,
/* Size 4x16 */
- 31, 33, 42, 48, 31, 34, 42, 47, 31, 34, 42, 47, 31, 35, 42, 45, 31, 35,
- 42, 45, 34, 39, 45, 46, 34, 39, 45, 46, 38, 43, 47, 46, 38, 43, 47, 46,
- 42, 45, 48, 50, 42, 45, 48, 50, 48, 47, 50, 53, 48, 47, 50, 53, 47, 46,
- 50, 54, 47, 46, 50, 54, 47, 45, 49, 56,
- /* Size 16x4 */
31, 31, 31, 31, 31, 34, 34, 38, 38, 42, 42, 48, 48, 47, 47, 47, 33, 34,
34, 35, 35, 39, 39, 43, 43, 45, 45, 47, 47, 46, 46, 45, 42, 42, 42, 42,
42, 45, 45, 47, 47, 48, 48, 50, 50, 50, 50, 49, 48, 47, 47, 45, 45, 46,
46, 46, 46, 50, 50, 53, 53, 54, 54, 56,
+ /* Size 16x4 */
+ 31, 33, 42, 48, 31, 34, 42, 47, 31, 34, 42, 47, 31, 35, 42, 45, 31, 35,
+ 42, 45, 34, 39, 45, 46, 34, 39, 45, 46, 38, 43, 47, 46, 38, 43, 47, 46,
+ 42, 45, 48, 50, 42, 45, 48, 50, 48, 47, 50, 53, 48, 47, 50, 53, 47, 46,
+ 50, 54, 47, 46, 50, 54, 47, 45, 49, 56,
/* Size 8x32 */
- 32, 31, 31, 37, 37, 48, 48, 49, 31, 31, 31, 37, 37, 47, 47, 48, 31, 31,
- 31, 38, 38, 47, 47, 47, 31, 31, 31, 38, 38, 47, 47, 47, 31, 31, 31, 38,
- 38, 47, 47, 47, 31, 32, 32, 39, 39, 46, 46, 46, 30, 32, 32, 40, 40, 46,
- 46, 45, 30, 32, 32, 40, 40, 46, 46, 45, 30, 32, 32, 40, 40, 46, 46, 45,
- 32, 34, 34, 41, 41, 46, 46, 45, 33, 36, 36, 43, 43, 47, 47, 46, 33, 36,
- 36, 43, 43, 47, 47, 46, 33, 36, 36, 43, 43, 47, 47, 46, 35, 38, 38, 45,
- 45, 47, 47, 45, 37, 40, 40, 47, 47, 47, 47, 45, 37, 40, 40, 47, 47, 47,
- 47, 45, 37, 40, 40, 47, 47, 47, 47, 45, 39, 41, 41, 47, 47, 49, 49, 47,
- 42, 43, 43, 47, 47, 50, 50, 49, 42, 43, 43, 47, 47, 50, 50, 49, 42, 43,
- 43, 47, 47, 50, 50, 49, 45, 44, 44, 47, 47, 51, 51, 51, 49, 46, 46, 48,
- 48, 53, 53, 53, 49, 46, 46, 48, 48, 53, 53, 53, 49, 46, 46, 48, 48, 53,
- 53, 53, 48, 46, 46, 47, 47, 53, 53, 54, 48, 46, 46, 47, 47, 53, 53, 56,
- 48, 46, 46, 47, 47, 53, 53, 56, 48, 46, 46, 47, 47, 53, 53, 56, 48, 45,
- 45, 46, 46, 53, 53, 57, 49, 45, 45, 46, 46, 53, 53, 58, 49, 45, 45, 46,
- 46, 53, 53, 58,
- /* Size 32x8 */
32, 31, 31, 31, 31, 31, 30, 30, 30, 32, 33, 33, 33, 35, 37, 37, 37, 39,
42, 42, 42, 45, 49, 49, 49, 48, 48, 48, 48, 48, 49, 49, 31, 31, 31, 31,
31, 32, 32, 32, 32, 34, 36, 36, 36, 38, 40, 40, 40, 41, 43, 43, 43, 44,
@@ -4994,7 +4978,23 @@
46, 46, 46, 46, 47, 47, 47, 47, 47, 47, 47, 49, 50, 50, 50, 51, 53, 53,
53, 53, 53, 53, 53, 53, 53, 53, 49, 48, 47, 47, 47, 46, 45, 45, 45, 45,
46, 46, 46, 45, 45, 45, 45, 47, 49, 49, 49, 51, 53, 53, 53, 54, 56, 56,
- 56, 57, 58, 58 },
+ 56, 57, 58, 58,
+ /* Size 32x8 */
+ 32, 31, 31, 37, 37, 48, 48, 49, 31, 31, 31, 37, 37, 47, 47, 48, 31, 31,
+ 31, 38, 38, 47, 47, 47, 31, 31, 31, 38, 38, 47, 47, 47, 31, 31, 31, 38,
+ 38, 47, 47, 47, 31, 32, 32, 39, 39, 46, 46, 46, 30, 32, 32, 40, 40, 46,
+ 46, 45, 30, 32, 32, 40, 40, 46, 46, 45, 30, 32, 32, 40, 40, 46, 46, 45,
+ 32, 34, 34, 41, 41, 46, 46, 45, 33, 36, 36, 43, 43, 47, 47, 46, 33, 36,
+ 36, 43, 43, 47, 47, 46, 33, 36, 36, 43, 43, 47, 47, 46, 35, 38, 38, 45,
+ 45, 47, 47, 45, 37, 40, 40, 47, 47, 47, 47, 45, 37, 40, 40, 47, 47, 47,
+ 47, 45, 37, 40, 40, 47, 47, 47, 47, 45, 39, 41, 41, 47, 47, 49, 49, 47,
+ 42, 43, 43, 47, 47, 50, 50, 49, 42, 43, 43, 47, 47, 50, 50, 49, 42, 43,
+ 43, 47, 47, 50, 50, 49, 45, 44, 44, 47, 47, 51, 51, 51, 49, 46, 46, 48,
+ 48, 53, 53, 53, 49, 46, 46, 48, 48, 53, 53, 53, 49, 46, 46, 48, 48, 53,
+ 53, 53, 48, 46, 46, 47, 47, 53, 53, 54, 48, 46, 46, 47, 47, 53, 53, 56,
+ 48, 46, 46, 47, 47, 53, 53, 56, 48, 46, 46, 47, 47, 53, 53, 56, 48, 45,
+ 45, 46, 46, 53, 53, 57, 49, 45, 45, 46, 46, 53, 53, 58, 49, 45, 45, 46,
+ 46, 53, 53, 58 },
},
{
{ /* Luma */
@@ -5080,21 +5080,12 @@
48, 49, 37, 37, 36, 36, 36, 36, 36, 35, 35, 35, 35, 36, 37, 37, 37, 37,
38, 39, 39, 39, 39, 41, 42, 43, 43, 43, 45, 48, 49, 49, 49, 50,
/* Size 4x8 */
- 31, 31, 32, 35, 32, 32, 32, 35, 32, 32, 33, 34, 32, 32, 34, 36, 32, 33,
- 35, 38, 33, 33, 36, 40, 34, 34, 37, 42, 35, 34, 38, 48,
- /* Size 8x4 */
31, 32, 32, 32, 32, 33, 34, 35, 31, 32, 32, 32, 33, 33, 34, 34, 32, 32,
33, 34, 35, 36, 37, 38, 35, 35, 34, 36, 38, 40, 42, 48,
+ /* Size 8x4 */
+ 31, 31, 32, 35, 32, 32, 32, 35, 32, 32, 33, 34, 32, 32, 34, 36, 32, 33,
+ 35, 38, 33, 33, 36, 40, 34, 34, 37, 42, 35, 34, 38, 48,
/* Size 8x16 */
- 32, 31, 31, 31, 32, 32, 35, 36, 31, 32, 32, 32, 32, 32, 35, 35, 31, 32,
- 32, 32, 32, 32, 35, 35, 31, 32, 32, 32, 32, 32, 34, 35, 31, 32, 32, 32,
- 33, 33, 34, 34, 31, 32, 32, 32, 33, 33, 34, 34, 31, 32, 32, 33, 34, 34,
- 35, 36, 32, 32, 32, 33, 34, 34, 36, 36, 32, 32, 32, 33, 34, 34, 36, 37,
- 32, 32, 33, 34, 35, 35, 37, 38, 32, 32, 33, 34, 35, 35, 37, 38, 33, 33,
- 33, 35, 36, 36, 40, 41, 34, 34, 34, 35, 37, 37, 41, 42, 34, 34, 34, 35,
- 37, 37, 43, 44, 36, 35, 34, 36, 38, 38, 46, 48, 36, 35, 34, 36, 38, 38,
- 46, 48,
- /* Size 16x8 */
32, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 33, 34, 34, 36, 36, 31, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 34, 34, 35, 35, 31, 32, 32, 32,
32, 32, 32, 32, 32, 33, 33, 33, 34, 34, 34, 34, 31, 32, 32, 32, 32, 32,
@@ -5103,37 +5094,16 @@
35, 36, 37, 37, 38, 38, 35, 35, 35, 34, 34, 34, 35, 36, 36, 37, 37, 40,
41, 43, 46, 46, 36, 35, 35, 35, 34, 34, 36, 36, 37, 38, 38, 41, 42, 44,
48, 48,
+ /* Size 16x8 */
+ 32, 31, 31, 31, 32, 32, 35, 36, 31, 32, 32, 32, 32, 32, 35, 35, 31, 32,
+ 32, 32, 32, 32, 35, 35, 31, 32, 32, 32, 32, 32, 34, 35, 31, 32, 32, 32,
+ 33, 33, 34, 34, 31, 32, 32, 32, 33, 33, 34, 34, 31, 32, 32, 33, 34, 34,
+ 35, 36, 32, 32, 32, 33, 34, 34, 36, 36, 32, 32, 32, 33, 34, 34, 36, 37,
+ 32, 32, 33, 34, 35, 35, 37, 38, 32, 32, 33, 34, 35, 35, 37, 38, 33, 33,
+ 33, 35, 36, 36, 40, 41, 34, 34, 34, 35, 37, 37, 41, 42, 34, 34, 34, 35,
+ 37, 37, 43, 44, 36, 35, 34, 36, 38, 38, 46, 48, 36, 35, 34, 36, 38, 38,
+ 46, 48,
/* Size 16x32 */
- 32, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 33, 35, 36, 36, 36, 31, 31,
- 31, 31, 31, 31, 32, 32, 32, 32, 32, 33, 35, 35, 35, 35, 31, 31, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 33, 35, 35, 35, 35, 31, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 33, 35, 35, 35, 35, 31, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 33, 35, 35, 35, 35, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 33, 35, 35, 35, 35, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33,
- 34, 35, 35, 35, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 34, 35,
- 35, 35, 31, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 34, 34, 34, 34,
- 31, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 34, 34, 34, 34, 31, 32,
- 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 34, 34, 34, 34, 31, 32, 32, 32,
- 32, 32, 33, 33, 33, 33, 33, 34, 35, 35, 35, 35, 31, 32, 32, 32, 32, 32,
- 33, 33, 34, 34, 34, 34, 35, 36, 36, 36, 32, 32, 32, 32, 32, 32, 33, 34,
- 34, 34, 34, 35, 36, 36, 36, 36, 32, 32, 32, 32, 32, 32, 33, 34, 34, 34,
- 34, 35, 36, 36, 36, 36, 32, 32, 32, 32, 32, 32, 33, 34, 34, 34, 34, 35,
- 36, 36, 36, 36, 32, 32, 32, 32, 32, 32, 33, 34, 34, 34, 34, 35, 36, 37,
- 37, 37, 32, 32, 32, 33, 33, 33, 33, 34, 35, 35, 35, 36, 37, 38, 38, 38,
- 32, 32, 32, 33, 33, 33, 34, 35, 35, 35, 35, 36, 37, 38, 38, 38, 32, 32,
- 32, 33, 33, 33, 34, 35, 35, 35, 35, 36, 37, 38, 38, 38, 32, 32, 32, 33,
- 33, 33, 34, 35, 35, 35, 35, 36, 37, 38, 38, 38, 32, 33, 33, 33, 33, 33,
- 34, 35, 36, 36, 36, 37, 39, 40, 40, 40, 33, 33, 33, 33, 33, 33, 35, 36,
- 36, 36, 36, 38, 40, 41, 41, 41, 34, 34, 34, 34, 34, 34, 35, 36, 37, 37,
- 37, 39, 41, 42, 42, 42, 34, 34, 34, 34, 34, 34, 35, 36, 37, 37, 37, 39,
- 41, 42, 42, 42, 34, 34, 34, 34, 34, 34, 35, 36, 37, 37, 37, 39, 41, 42,
- 42, 42, 34, 34, 34, 34, 34, 34, 35, 37, 37, 37, 37, 40, 43, 44, 44, 44,
- 35, 35, 34, 34, 34, 34, 36, 37, 38, 38, 38, 41, 45, 47, 47, 47, 36, 35,
- 35, 34, 34, 34, 36, 37, 38, 38, 38, 42, 46, 48, 48, 48, 36, 35, 35, 34,
- 34, 34, 36, 37, 38, 38, 38, 42, 46, 48, 48, 48, 36, 35, 35, 34, 34, 34,
- 36, 37, 38, 38, 38, 42, 46, 48, 48, 48, 37, 36, 36, 36, 36, 36, 37, 38,
- 39, 39, 39, 42, 46, 49, 49, 49,
- /* Size 32x16 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32,
32, 32, 32, 32, 33, 34, 34, 34, 34, 35, 36, 36, 36, 37, 31, 31, 31, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33,
@@ -5163,33 +5133,47 @@
38, 40, 41, 42, 42, 42, 44, 47, 48, 48, 48, 49, 36, 35, 35, 35, 35, 35,
35, 35, 34, 34, 34, 35, 36, 36, 36, 36, 37, 38, 38, 38, 38, 40, 41, 42,
42, 42, 44, 47, 48, 48, 48, 49,
+ /* Size 32x16 */
+ 32, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 33, 35, 36, 36, 36, 31, 31,
+ 31, 31, 31, 31, 32, 32, 32, 32, 32, 33, 35, 35, 35, 35, 31, 31, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 33, 35, 35, 35, 35, 31, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 33, 35, 35, 35, 35, 31, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 33, 35, 35, 35, 35, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 33, 35, 35, 35, 35, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33,
+ 34, 35, 35, 35, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 34, 35,
+ 35, 35, 31, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 34, 34, 34, 34,
+ 31, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 34, 34, 34, 34, 31, 32,
+ 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 34, 34, 34, 34, 31, 32, 32, 32,
+ 32, 32, 33, 33, 33, 33, 33, 34, 35, 35, 35, 35, 31, 32, 32, 32, 32, 32,
+ 33, 33, 34, 34, 34, 34, 35, 36, 36, 36, 32, 32, 32, 32, 32, 32, 33, 34,
+ 34, 34, 34, 35, 36, 36, 36, 36, 32, 32, 32, 32, 32, 32, 33, 34, 34, 34,
+ 34, 35, 36, 36, 36, 36, 32, 32, 32, 32, 32, 32, 33, 34, 34, 34, 34, 35,
+ 36, 36, 36, 36, 32, 32, 32, 32, 32, 32, 33, 34, 34, 34, 34, 35, 36, 37,
+ 37, 37, 32, 32, 32, 33, 33, 33, 33, 34, 35, 35, 35, 36, 37, 38, 38, 38,
+ 32, 32, 32, 33, 33, 33, 34, 35, 35, 35, 35, 36, 37, 38, 38, 38, 32, 32,
+ 32, 33, 33, 33, 34, 35, 35, 35, 35, 36, 37, 38, 38, 38, 32, 32, 32, 33,
+ 33, 33, 34, 35, 35, 35, 35, 36, 37, 38, 38, 38, 32, 33, 33, 33, 33, 33,
+ 34, 35, 36, 36, 36, 37, 39, 40, 40, 40, 33, 33, 33, 33, 33, 33, 35, 36,
+ 36, 36, 36, 38, 40, 41, 41, 41, 34, 34, 34, 34, 34, 34, 35, 36, 37, 37,
+ 37, 39, 41, 42, 42, 42, 34, 34, 34, 34, 34, 34, 35, 36, 37, 37, 37, 39,
+ 41, 42, 42, 42, 34, 34, 34, 34, 34, 34, 35, 36, 37, 37, 37, 39, 41, 42,
+ 42, 42, 34, 34, 34, 34, 34, 34, 35, 37, 37, 37, 37, 40, 43, 44, 44, 44,
+ 35, 35, 34, 34, 34, 34, 36, 37, 38, 38, 38, 41, 45, 47, 47, 47, 36, 35,
+ 35, 34, 34, 34, 36, 37, 38, 38, 38, 42, 46, 48, 48, 48, 36, 35, 35, 34,
+ 34, 34, 36, 37, 38, 38, 38, 42, 46, 48, 48, 48, 36, 35, 35, 34, 34, 34,
+ 36, 37, 38, 38, 38, 42, 46, 48, 48, 48, 37, 36, 36, 36, 36, 36, 37, 38,
+ 39, 39, 39, 42, 46, 49, 49, 49,
/* Size 4x16 */
- 31, 31, 32, 36, 31, 32, 32, 35, 32, 32, 32, 35, 32, 32, 32, 35, 32, 32,
- 33, 34, 32, 32, 33, 34, 32, 32, 34, 36, 32, 32, 34, 36, 32, 32, 34, 37,
- 32, 33, 35, 38, 32, 33, 35, 38, 33, 33, 36, 41, 34, 34, 37, 42, 34, 34,
- 37, 44, 35, 34, 38, 48, 35, 34, 38, 48,
- /* Size 16x4 */
31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 34, 34, 35, 35, 31, 32,
32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 34, 34, 34, 34, 32, 32, 32, 32,
33, 33, 34, 34, 34, 35, 35, 36, 37, 37, 38, 38, 36, 35, 35, 35, 34, 34,
36, 36, 37, 38, 38, 41, 42, 44, 48, 48,
+ /* Size 16x4 */
+ 31, 31, 32, 36, 31, 32, 32, 35, 32, 32, 32, 35, 32, 32, 32, 35, 32, 32,
+ 33, 34, 32, 32, 33, 34, 32, 32, 34, 36, 32, 32, 34, 36, 32, 32, 34, 37,
+ 32, 33, 35, 38, 32, 33, 35, 38, 33, 33, 36, 41, 34, 34, 37, 42, 34, 34,
+ 37, 44, 35, 34, 38, 48, 35, 34, 38, 48,
/* Size 8x32 */
- 32, 31, 31, 31, 32, 32, 35, 36, 31, 31, 31, 32, 32, 32, 35, 35, 31, 32,
- 32, 32, 32, 32, 35, 35, 31, 32, 32, 32, 32, 32, 35, 35, 31, 32, 32, 32,
- 32, 32, 35, 35, 31, 32, 32, 32, 32, 32, 35, 35, 31, 32, 32, 32, 32, 32,
- 34, 35, 31, 32, 32, 32, 32, 32, 34, 35, 31, 32, 32, 32, 33, 33, 34, 34,
- 31, 32, 32, 32, 33, 33, 34, 34, 31, 32, 32, 32, 33, 33, 34, 34, 31, 32,
- 32, 33, 33, 33, 35, 35, 31, 32, 32, 33, 34, 34, 35, 36, 32, 32, 32, 33,
- 34, 34, 36, 36, 32, 32, 32, 33, 34, 34, 36, 36, 32, 32, 32, 33, 34, 34,
- 36, 36, 32, 32, 32, 33, 34, 34, 36, 37, 32, 32, 33, 33, 35, 35, 37, 38,
- 32, 32, 33, 34, 35, 35, 37, 38, 32, 32, 33, 34, 35, 35, 37, 38, 32, 32,
- 33, 34, 35, 35, 37, 38, 32, 33, 33, 34, 36, 36, 39, 40, 33, 33, 33, 35,
- 36, 36, 40, 41, 34, 34, 34, 35, 37, 37, 41, 42, 34, 34, 34, 35, 37, 37,
- 41, 42, 34, 34, 34, 35, 37, 37, 41, 42, 34, 34, 34, 35, 37, 37, 43, 44,
- 35, 34, 34, 36, 38, 38, 45, 47, 36, 35, 34, 36, 38, 38, 46, 48, 36, 35,
- 34, 36, 38, 38, 46, 48, 36, 35, 34, 36, 38, 38, 46, 48, 37, 36, 36, 37,
- 39, 39, 46, 49,
- /* Size 32x8 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32,
32, 32, 32, 32, 33, 34, 34, 34, 34, 35, 36, 36, 36, 37, 31, 31, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33,
@@ -5204,7 +5188,23 @@
34, 34, 34, 34, 34, 35, 35, 36, 36, 36, 36, 37, 37, 37, 37, 39, 40, 41,
41, 41, 43, 45, 46, 46, 46, 46, 36, 35, 35, 35, 35, 35, 35, 35, 34, 34,
34, 35, 36, 36, 36, 36, 37, 38, 38, 38, 38, 40, 41, 42, 42, 42, 44, 47,
- 48, 48, 48, 49 },
+ 48, 48, 48, 49,
+ /* Size 32x8 */
+ 32, 31, 31, 31, 32, 32, 35, 36, 31, 31, 31, 32, 32, 32, 35, 35, 31, 32,
+ 32, 32, 32, 32, 35, 35, 31, 32, 32, 32, 32, 32, 35, 35, 31, 32, 32, 32,
+ 32, 32, 35, 35, 31, 32, 32, 32, 32, 32, 35, 35, 31, 32, 32, 32, 32, 32,
+ 34, 35, 31, 32, 32, 32, 32, 32, 34, 35, 31, 32, 32, 32, 33, 33, 34, 34,
+ 31, 32, 32, 32, 33, 33, 34, 34, 31, 32, 32, 32, 33, 33, 34, 34, 31, 32,
+ 32, 33, 33, 33, 35, 35, 31, 32, 32, 33, 34, 34, 35, 36, 32, 32, 32, 33,
+ 34, 34, 36, 36, 32, 32, 32, 33, 34, 34, 36, 36, 32, 32, 32, 33, 34, 34,
+ 36, 36, 32, 32, 32, 33, 34, 34, 36, 37, 32, 32, 33, 33, 35, 35, 37, 38,
+ 32, 32, 33, 34, 35, 35, 37, 38, 32, 32, 33, 34, 35, 35, 37, 38, 32, 32,
+ 33, 34, 35, 35, 37, 38, 32, 33, 33, 34, 36, 36, 39, 40, 33, 33, 33, 35,
+ 36, 36, 40, 41, 34, 34, 34, 35, 37, 37, 41, 42, 34, 34, 34, 35, 37, 37,
+ 41, 42, 34, 34, 34, 35, 37, 37, 41, 42, 34, 34, 34, 35, 37, 37, 43, 44,
+ 35, 34, 34, 36, 38, 38, 45, 47, 36, 35, 34, 36, 38, 38, 46, 48, 36, 35,
+ 34, 36, 38, 38, 46, 48, 36, 35, 34, 36, 38, 38, 46, 48, 37, 36, 36, 37,
+ 39, 39, 46, 49 },
{ /* Chroma */
/* Size 4x4 */
31, 32, 38, 46, 32, 34, 41, 46, 38, 41, 47, 47, 46, 46, 47, 52,
@@ -5288,21 +5288,12 @@
53, 53, 49, 48, 47, 47, 47, 47, 47, 46, 46, 46, 46, 46, 46, 47, 47, 47,
47, 47, 47, 47, 47, 48, 49, 50, 50, 50, 51, 52, 53, 53, 53, 53,
/* Size 4x8 */
- 31, 31, 37, 48, 31, 31, 38, 47, 31, 32, 40, 46, 34, 36, 43, 47, 37, 39,
- 46, 47, 39, 41, 47, 48, 42, 43, 47, 50, 48, 46, 48, 53,
- /* Size 8x4 */
31, 31, 31, 34, 37, 39, 42, 48, 31, 31, 32, 36, 39, 41, 43, 46, 37, 38,
40, 43, 46, 47, 47, 48, 48, 47, 46, 47, 47, 48, 50, 53,
+ /* Size 8x4 */
+ 31, 31, 37, 48, 31, 31, 38, 47, 31, 32, 40, 46, 34, 36, 43, 47, 37, 39,
+ 46, 47, 39, 41, 47, 48, 42, 43, 47, 50, 48, 46, 48, 53,
/* Size 8x16 */
- 32, 31, 31, 33, 37, 37, 45, 48, 31, 31, 31, 34, 38, 38, 45, 47, 31, 31,
- 31, 34, 38, 38, 45, 47, 31, 31, 32, 34, 39, 39, 45, 46, 30, 32, 32, 35,
- 40, 40, 44, 46, 30, 32, 32, 35, 40, 40, 44, 46, 33, 34, 35, 37, 42, 42,
- 46, 47, 33, 35, 36, 38, 43, 43, 46, 47, 35, 37, 37, 40, 44, 44, 46, 47,
- 37, 39, 40, 43, 47, 47, 47, 47, 37, 39, 40, 43, 47, 47, 47, 47, 41, 42,
- 42, 44, 47, 47, 49, 49, 42, 42, 43, 44, 47, 47, 49, 50, 44, 44, 44, 45,
- 47, 47, 50, 51, 49, 47, 46, 47, 48, 48, 52, 53, 49, 47, 46, 47, 48, 48,
- 52, 53,
- /* Size 16x8 */
32, 31, 31, 31, 30, 30, 33, 33, 35, 37, 37, 41, 42, 44, 49, 49, 31, 31,
31, 31, 32, 32, 34, 35, 37, 39, 39, 42, 42, 44, 47, 47, 31, 31, 31, 32,
32, 32, 35, 36, 37, 40, 40, 42, 43, 44, 46, 46, 33, 34, 34, 34, 35, 35,
@@ -5311,37 +5302,16 @@
47, 47, 47, 47, 48, 48, 45, 45, 45, 45, 44, 44, 46, 46, 46, 47, 47, 49,
49, 50, 52, 52, 48, 47, 47, 46, 46, 46, 47, 47, 47, 47, 47, 49, 50, 51,
53, 53,
+ /* Size 16x8 */
+ 32, 31, 31, 33, 37, 37, 45, 48, 31, 31, 31, 34, 38, 38, 45, 47, 31, 31,
+ 31, 34, 38, 38, 45, 47, 31, 31, 32, 34, 39, 39, 45, 46, 30, 32, 32, 35,
+ 40, 40, 44, 46, 30, 32, 32, 35, 40, 40, 44, 46, 33, 34, 35, 37, 42, 42,
+ 46, 47, 33, 35, 36, 38, 43, 43, 46, 47, 35, 37, 37, 40, 44, 44, 46, 47,
+ 37, 39, 40, 43, 47, 47, 47, 47, 37, 39, 40, 43, 47, 47, 47, 47, 41, 42,
+ 42, 44, 47, 47, 49, 49, 42, 42, 43, 44, 47, 47, 49, 50, 44, 44, 44, 45,
+ 47, 47, 50, 51, 49, 47, 46, 47, 48, 48, 52, 53, 49, 47, 46, 47, 48, 48,
+ 52, 53,
/* Size 16x32 */
- 32, 31, 31, 31, 31, 31, 33, 35, 37, 37, 37, 40, 45, 48, 48, 48, 31, 31,
- 31, 31, 31, 31, 33, 36, 37, 37, 37, 41, 45, 48, 48, 48, 31, 31, 31, 31,
- 31, 31, 34, 36, 38, 38, 38, 41, 45, 47, 47, 47, 31, 31, 31, 31, 31, 31,
- 34, 37, 38, 38, 38, 41, 45, 47, 47, 47, 31, 31, 31, 31, 31, 31, 34, 37,
- 38, 38, 38, 41, 45, 47, 47, 47, 31, 31, 31, 31, 31, 31, 34, 37, 38, 38,
- 38, 41, 45, 47, 47, 47, 31, 31, 31, 32, 32, 32, 34, 37, 39, 39, 39, 41,
- 45, 46, 46, 46, 30, 31, 31, 32, 32, 32, 34, 38, 39, 39, 39, 42, 44, 46,
- 46, 46, 30, 31, 32, 32, 32, 32, 35, 38, 40, 40, 40, 42, 44, 46, 46, 46,
- 30, 31, 32, 32, 32, 32, 35, 38, 40, 40, 40, 42, 44, 46, 46, 46, 30, 31,
- 32, 32, 32, 32, 35, 38, 40, 40, 40, 42, 44, 46, 46, 46, 31, 32, 33, 33,
- 33, 33, 36, 39, 41, 41, 41, 43, 45, 46, 46, 46, 33, 34, 34, 35, 35, 35,
- 37, 40, 42, 42, 42, 44, 46, 47, 47, 47, 33, 34, 35, 36, 36, 36, 38, 41,
- 43, 43, 43, 44, 46, 47, 47, 47, 33, 34, 35, 36, 36, 36, 38, 41, 43, 43,
- 43, 44, 46, 47, 47, 47, 33, 34, 35, 36, 36, 36, 38, 41, 43, 43, 43, 44,
- 46, 47, 47, 47, 35, 36, 37, 37, 37, 37, 40, 43, 44, 44, 44, 45, 46, 47,
- 47, 47, 36, 37, 38, 39, 39, 39, 42, 44, 46, 46, 46, 47, 47, 47, 47, 47,
- 37, 38, 39, 40, 40, 40, 43, 45, 47, 47, 47, 47, 47, 47, 47, 47, 37, 38,
- 39, 40, 40, 40, 43, 45, 47, 47, 47, 47, 47, 47, 47, 47, 37, 38, 39, 40,
- 40, 40, 43, 45, 47, 47, 47, 47, 47, 47, 47, 47, 39, 39, 40, 41, 41, 41,
- 43, 46, 47, 47, 47, 48, 48, 48, 48, 48, 41, 41, 42, 42, 42, 42, 44, 46,
- 47, 47, 47, 48, 49, 49, 49, 49, 42, 42, 42, 43, 43, 43, 44, 46, 47, 47,
- 47, 48, 49, 50, 50, 50, 42, 42, 42, 43, 43, 43, 44, 46, 47, 47, 47, 48,
- 49, 50, 50, 50, 42, 42, 42, 43, 43, 43, 44, 46, 47, 47, 47, 48, 49, 50,
- 50, 50, 44, 44, 44, 44, 44, 44, 45, 47, 47, 47, 47, 49, 50, 51, 51, 51,
- 47, 46, 46, 46, 46, 46, 46, 47, 48, 48, 48, 49, 51, 52, 52, 52, 49, 48,
- 47, 46, 46, 46, 47, 48, 48, 48, 48, 50, 52, 53, 53, 53, 49, 48, 47, 46,
- 46, 46, 47, 48, 48, 48, 48, 50, 52, 53, 53, 53, 49, 48, 47, 46, 46, 46,
- 47, 48, 48, 48, 48, 50, 52, 53, 53, 53, 49, 48, 47, 46, 46, 46, 47, 47,
- 47, 47, 47, 49, 52, 53, 53, 53,
- /* Size 32x16 */
32, 31, 31, 31, 31, 31, 31, 30, 30, 30, 30, 31, 33, 33, 33, 33, 35, 36,
37, 37, 37, 39, 41, 42, 42, 42, 44, 47, 49, 49, 49, 49, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 32, 34, 34, 34, 34, 36, 37, 38, 38, 38, 39,
@@ -5371,33 +5341,47 @@
47, 48, 49, 50, 50, 50, 51, 52, 53, 53, 53, 53, 48, 48, 47, 47, 47, 47,
46, 46, 46, 46, 46, 46, 47, 47, 47, 47, 47, 47, 47, 47, 47, 48, 49, 50,
50, 50, 51, 52, 53, 53, 53, 53,
+ /* Size 32x16 */
+ 32, 31, 31, 31, 31, 31, 33, 35, 37, 37, 37, 40, 45, 48, 48, 48, 31, 31,
+ 31, 31, 31, 31, 33, 36, 37, 37, 37, 41, 45, 48, 48, 48, 31, 31, 31, 31,
+ 31, 31, 34, 36, 38, 38, 38, 41, 45, 47, 47, 47, 31, 31, 31, 31, 31, 31,
+ 34, 37, 38, 38, 38, 41, 45, 47, 47, 47, 31, 31, 31, 31, 31, 31, 34, 37,
+ 38, 38, 38, 41, 45, 47, 47, 47, 31, 31, 31, 31, 31, 31, 34, 37, 38, 38,
+ 38, 41, 45, 47, 47, 47, 31, 31, 31, 32, 32, 32, 34, 37, 39, 39, 39, 41,
+ 45, 46, 46, 46, 30, 31, 31, 32, 32, 32, 34, 38, 39, 39, 39, 42, 44, 46,
+ 46, 46, 30, 31, 32, 32, 32, 32, 35, 38, 40, 40, 40, 42, 44, 46, 46, 46,
+ 30, 31, 32, 32, 32, 32, 35, 38, 40, 40, 40, 42, 44, 46, 46, 46, 30, 31,
+ 32, 32, 32, 32, 35, 38, 40, 40, 40, 42, 44, 46, 46, 46, 31, 32, 33, 33,
+ 33, 33, 36, 39, 41, 41, 41, 43, 45, 46, 46, 46, 33, 34, 34, 35, 35, 35,
+ 37, 40, 42, 42, 42, 44, 46, 47, 47, 47, 33, 34, 35, 36, 36, 36, 38, 41,
+ 43, 43, 43, 44, 46, 47, 47, 47, 33, 34, 35, 36, 36, 36, 38, 41, 43, 43,
+ 43, 44, 46, 47, 47, 47, 33, 34, 35, 36, 36, 36, 38, 41, 43, 43, 43, 44,
+ 46, 47, 47, 47, 35, 36, 37, 37, 37, 37, 40, 43, 44, 44, 44, 45, 46, 47,
+ 47, 47, 36, 37, 38, 39, 39, 39, 42, 44, 46, 46, 46, 47, 47, 47, 47, 47,
+ 37, 38, 39, 40, 40, 40, 43, 45, 47, 47, 47, 47, 47, 47, 47, 47, 37, 38,
+ 39, 40, 40, 40, 43, 45, 47, 47, 47, 47, 47, 47, 47, 47, 37, 38, 39, 40,
+ 40, 40, 43, 45, 47, 47, 47, 47, 47, 47, 47, 47, 39, 39, 40, 41, 41, 41,
+ 43, 46, 47, 47, 47, 48, 48, 48, 48, 48, 41, 41, 42, 42, 42, 42, 44, 46,
+ 47, 47, 47, 48, 49, 49, 49, 49, 42, 42, 42, 43, 43, 43, 44, 46, 47, 47,
+ 47, 48, 49, 50, 50, 50, 42, 42, 42, 43, 43, 43, 44, 46, 47, 47, 47, 48,
+ 49, 50, 50, 50, 42, 42, 42, 43, 43, 43, 44, 46, 47, 47, 47, 48, 49, 50,
+ 50, 50, 44, 44, 44, 44, 44, 44, 45, 47, 47, 47, 47, 49, 50, 51, 51, 51,
+ 47, 46, 46, 46, 46, 46, 46, 47, 48, 48, 48, 49, 51, 52, 52, 52, 49, 48,
+ 47, 46, 46, 46, 47, 48, 48, 48, 48, 50, 52, 53, 53, 53, 49, 48, 47, 46,
+ 46, 46, 47, 48, 48, 48, 48, 50, 52, 53, 53, 53, 49, 48, 47, 46, 46, 46,
+ 47, 48, 48, 48, 48, 50, 52, 53, 53, 53, 49, 48, 47, 46, 46, 46, 47, 47,
+ 47, 47, 47, 49, 52, 53, 53, 53,
/* Size 4x16 */
- 31, 31, 37, 48, 31, 31, 38, 47, 31, 31, 38, 47, 31, 32, 39, 46, 31, 32,
- 40, 46, 31, 32, 40, 46, 34, 35, 42, 47, 34, 36, 43, 47, 36, 37, 44, 47,
- 38, 40, 47, 47, 38, 40, 47, 47, 41, 42, 47, 49, 42, 43, 47, 50, 44, 44,
- 47, 51, 48, 46, 48, 53, 48, 46, 48, 53,
- /* Size 16x4 */
31, 31, 31, 31, 31, 31, 34, 34, 36, 38, 38, 41, 42, 44, 48, 48, 31, 31,
31, 32, 32, 32, 35, 36, 37, 40, 40, 42, 43, 44, 46, 46, 37, 38, 38, 39,
40, 40, 42, 43, 44, 47, 47, 47, 47, 47, 48, 48, 48, 47, 47, 46, 46, 46,
47, 47, 47, 47, 47, 49, 50, 51, 53, 53,
+ /* Size 16x4 */
+ 31, 31, 37, 48, 31, 31, 38, 47, 31, 31, 38, 47, 31, 32, 39, 46, 31, 32,
+ 40, 46, 31, 32, 40, 46, 34, 35, 42, 47, 34, 36, 43, 47, 36, 37, 44, 47,
+ 38, 40, 47, 47, 38, 40, 47, 47, 41, 42, 47, 49, 42, 43, 47, 50, 44, 44,
+ 47, 51, 48, 46, 48, 53, 48, 46, 48, 53,
/* Size 8x32 */
- 32, 31, 31, 33, 37, 37, 45, 48, 31, 31, 31, 33, 37, 37, 45, 48, 31, 31,
- 31, 34, 38, 38, 45, 47, 31, 31, 31, 34, 38, 38, 45, 47, 31, 31, 31, 34,
- 38, 38, 45, 47, 31, 31, 31, 34, 38, 38, 45, 47, 31, 31, 32, 34, 39, 39,
- 45, 46, 30, 31, 32, 34, 39, 39, 44, 46, 30, 32, 32, 35, 40, 40, 44, 46,
- 30, 32, 32, 35, 40, 40, 44, 46, 30, 32, 32, 35, 40, 40, 44, 46, 31, 33,
- 33, 36, 41, 41, 45, 46, 33, 34, 35, 37, 42, 42, 46, 47, 33, 35, 36, 38,
- 43, 43, 46, 47, 33, 35, 36, 38, 43, 43, 46, 47, 33, 35, 36, 38, 43, 43,
- 46, 47, 35, 37, 37, 40, 44, 44, 46, 47, 36, 38, 39, 42, 46, 46, 47, 47,
- 37, 39, 40, 43, 47, 47, 47, 47, 37, 39, 40, 43, 47, 47, 47, 47, 37, 39,
- 40, 43, 47, 47, 47, 47, 39, 40, 41, 43, 47, 47, 48, 48, 41, 42, 42, 44,
- 47, 47, 49, 49, 42, 42, 43, 44, 47, 47, 49, 50, 42, 42, 43, 44, 47, 47,
- 49, 50, 42, 42, 43, 44, 47, 47, 49, 50, 44, 44, 44, 45, 47, 47, 50, 51,
- 47, 46, 46, 46, 48, 48, 51, 52, 49, 47, 46, 47, 48, 48, 52, 53, 49, 47,
- 46, 47, 48, 48, 52, 53, 49, 47, 46, 47, 48, 48, 52, 53, 49, 47, 46, 47,
- 47, 47, 52, 53,
- /* Size 32x8 */
32, 31, 31, 31, 31, 31, 31, 30, 30, 30, 30, 31, 33, 33, 33, 33, 35, 36,
37, 37, 37, 39, 41, 42, 42, 42, 44, 47, 49, 49, 49, 49, 31, 31, 31, 31,
31, 31, 31, 31, 32, 32, 32, 33, 34, 35, 35, 35, 37, 38, 39, 39, 39, 40,
@@ -5412,7 +5396,23 @@
45, 44, 44, 44, 44, 45, 46, 46, 46, 46, 46, 47, 47, 47, 47, 48, 49, 49,
49, 49, 50, 51, 52, 52, 52, 52, 48, 48, 47, 47, 47, 47, 46, 46, 46, 46,
46, 46, 47, 47, 47, 47, 47, 47, 47, 47, 47, 48, 49, 50, 50, 50, 51, 52,
- 53, 53, 53, 53 },
+ 53, 53, 53, 53,
+ /* Size 32x8 */
+ 32, 31, 31, 33, 37, 37, 45, 48, 31, 31, 31, 33, 37, 37, 45, 48, 31, 31,
+ 31, 34, 38, 38, 45, 47, 31, 31, 31, 34, 38, 38, 45, 47, 31, 31, 31, 34,
+ 38, 38, 45, 47, 31, 31, 31, 34, 38, 38, 45, 47, 31, 31, 32, 34, 39, 39,
+ 45, 46, 30, 31, 32, 34, 39, 39, 44, 46, 30, 32, 32, 35, 40, 40, 44, 46,
+ 30, 32, 32, 35, 40, 40, 44, 46, 30, 32, 32, 35, 40, 40, 44, 46, 31, 33,
+ 33, 36, 41, 41, 45, 46, 33, 34, 35, 37, 42, 42, 46, 47, 33, 35, 36, 38,
+ 43, 43, 46, 47, 33, 35, 36, 38, 43, 43, 46, 47, 33, 35, 36, 38, 43, 43,
+ 46, 47, 35, 37, 37, 40, 44, 44, 46, 47, 36, 38, 39, 42, 46, 46, 47, 47,
+ 37, 39, 40, 43, 47, 47, 47, 47, 37, 39, 40, 43, 47, 47, 47, 47, 37, 39,
+ 40, 43, 47, 47, 47, 47, 39, 40, 41, 43, 47, 47, 48, 48, 41, 42, 42, 44,
+ 47, 47, 49, 49, 42, 42, 43, 44, 47, 47, 49, 50, 42, 42, 43, 44, 47, 47,
+ 49, 50, 42, 42, 43, 44, 47, 47, 49, 50, 44, 44, 44, 45, 47, 47, 50, 51,
+ 47, 46, 46, 46, 48, 48, 51, 52, 49, 47, 46, 47, 48, 48, 52, 53, 49, 47,
+ 46, 47, 48, 48, 52, 53, 49, 47, 46, 47, 48, 48, 52, 53, 49, 47, 46, 47,
+ 47, 47, 52, 53 },
},
{
{ /* Luma */
@@ -5498,21 +5498,12 @@
39, 39, 34, 34, 34, 34, 34, 34, 34, 34, 34, 33, 33, 33, 33, 33, 34, 34,
35, 35, 35, 35, 35, 35, 36, 36, 37, 37, 37, 37, 38, 38, 39, 39,
/* Size 4x8 */
- 31, 31, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32,
- 33, 34, 32, 32, 34, 34, 32, 33, 34, 35, 33, 33, 35, 36,
- /* Size 8x4 */
31, 31, 32, 32, 32, 32, 32, 33, 31, 32, 32, 32, 32, 32, 33, 33, 32, 32,
32, 32, 33, 34, 34, 35, 32, 32, 32, 33, 34, 34, 35, 36,
+ /* Size 8x4 */
+ 31, 31, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32,
+ 33, 34, 32, 32, 34, 34, 32, 33, 34, 35, 33, 33, 35, 36,
/* Size 8x16 */
- 32, 31, 31, 31, 31, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 33, 31, 32,
- 32, 32, 32, 32, 32, 33, 31, 32, 32, 32, 32, 32, 32, 33, 31, 32, 32, 32,
- 32, 32, 32, 33, 31, 32, 32, 32, 32, 33, 33, 33, 31, 32, 32, 32, 32, 33,
- 33, 33, 31, 32, 32, 32, 32, 33, 33, 33, 31, 32, 32, 32, 33, 34, 34, 34,
- 32, 32, 32, 32, 33, 34, 34, 34, 32, 32, 32, 32, 33, 34, 34, 34, 32, 32,
- 32, 32, 33, 35, 35, 35, 32, 32, 33, 33, 34, 35, 35, 36, 32, 32, 33, 33,
- 34, 35, 35, 36, 32, 33, 33, 33, 34, 36, 36, 36, 34, 34, 34, 34, 35, 37,
- 37, 38,
- /* Size 16x8 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 34, 31, 31,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 34, 31, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 34, 31, 32, 32, 32, 32, 32,
@@ -5521,37 +5512,16 @@
34, 35, 35, 35, 36, 37, 32, 32, 32, 32, 32, 33, 33, 33, 34, 34, 34, 35,
35, 35, 36, 37, 32, 33, 33, 33, 33, 33, 33, 33, 34, 34, 34, 35, 36, 36,
36, 38,
+ /* Size 16x8 */
+ 32, 31, 31, 31, 31, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 33, 31, 32,
+ 32, 32, 32, 32, 32, 33, 31, 32, 32, 32, 32, 32, 32, 33, 31, 32, 32, 32,
+ 32, 32, 32, 33, 31, 32, 32, 32, 32, 33, 33, 33, 31, 32, 32, 32, 32, 33,
+ 33, 33, 31, 32, 32, 32, 32, 33, 33, 33, 31, 32, 32, 32, 33, 34, 34, 34,
+ 32, 32, 32, 32, 33, 34, 34, 34, 32, 32, 32, 32, 33, 34, 34, 34, 32, 32,
+ 32, 32, 33, 35, 35, 35, 32, 32, 33, 33, 34, 35, 35, 36, 32, 32, 33, 33,
+ 34, 35, 35, 36, 32, 33, 33, 33, 34, 36, 36, 36, 34, 34, 34, 34, 35, 37,
+ 37, 38,
/* Size 16x32 */
- 32, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 34, 31, 31,
- 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 33, 34, 31, 31, 31, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 34, 31, 31, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 33, 34, 31, 31, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 33, 34, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 33, 34, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 33, 34, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 33, 34, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 34,
- 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 31, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 31, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 31, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 31, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 33, 33, 33, 33, 33, 33, 31, 32, 32, 32, 32, 32, 32, 32, 32, 33,
- 33, 33, 33, 33, 33, 34, 31, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33,
- 33, 33, 34, 34, 31, 32, 32, 32, 32, 32, 32, 32, 33, 33, 34, 34, 34, 34,
- 34, 35, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 34, 34, 34, 34, 34, 35,
- 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 34, 34, 34, 34, 34, 35, 32, 32,
- 32, 32, 32, 32, 32, 33, 33, 33, 34, 34, 34, 34, 34, 35, 32, 32, 32, 32,
- 32, 32, 32, 33, 33, 33, 34, 34, 34, 34, 34, 35, 32, 32, 32, 32, 32, 32,
- 32, 33, 33, 34, 34, 34, 34, 34, 35, 35, 32, 32, 32, 32, 32, 32, 32, 33,
- 33, 34, 35, 35, 35, 35, 35, 36, 32, 32, 32, 32, 33, 33, 33, 33, 33, 34,
- 35, 35, 35, 35, 36, 36, 32, 32, 32, 32, 33, 33, 33, 33, 34, 34, 35, 35,
- 35, 35, 36, 37, 32, 32, 32, 32, 33, 33, 33, 33, 34, 34, 35, 35, 35, 35,
- 36, 37, 32, 32, 32, 32, 33, 33, 33, 33, 34, 34, 35, 35, 35, 35, 36, 37,
- 32, 32, 32, 33, 33, 33, 33, 33, 34, 34, 35, 35, 35, 35, 36, 37, 32, 33,
- 33, 33, 33, 33, 33, 33, 34, 35, 36, 36, 36, 36, 36, 38, 33, 33, 33, 33,
- 33, 33, 33, 34, 34, 35, 36, 36, 36, 36, 37, 38, 34, 34, 34, 34, 34, 34,
- 34, 34, 35, 36, 37, 37, 37, 37, 38, 39, 34, 34, 34, 34, 34, 34, 34, 34,
- 35, 36, 37, 37, 37, 37, 38, 39,
- /* Size 32x16 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 34, 34, 31, 31, 31, 31,
31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
@@ -5581,33 +5551,47 @@
34, 35, 35, 36, 36, 36, 36, 36, 36, 37, 38, 38, 34, 34, 34, 34, 34, 34,
34, 34, 34, 33, 33, 33, 33, 33, 34, 34, 35, 35, 35, 35, 35, 35, 36, 36,
37, 37, 37, 37, 38, 38, 39, 39,
+ /* Size 32x16 */
+ 32, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 34, 31, 31,
+ 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 33, 34, 31, 31, 31, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 34, 31, 31, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 33, 34, 31, 31, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 33, 34, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 33, 34, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 33, 34, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 33, 34, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 34,
+ 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 31, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 31, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 31, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 31, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 33, 33, 33, 33, 33, 33, 31, 32, 32, 32, 32, 32, 32, 32, 32, 33,
+ 33, 33, 33, 33, 33, 34, 31, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33,
+ 33, 33, 34, 34, 31, 32, 32, 32, 32, 32, 32, 32, 33, 33, 34, 34, 34, 34,
+ 34, 35, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 34, 34, 34, 34, 34, 35,
+ 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 34, 34, 34, 34, 34, 35, 32, 32,
+ 32, 32, 32, 32, 32, 33, 33, 33, 34, 34, 34, 34, 34, 35, 32, 32, 32, 32,
+ 32, 32, 32, 33, 33, 33, 34, 34, 34, 34, 34, 35, 32, 32, 32, 32, 32, 32,
+ 32, 33, 33, 34, 34, 34, 34, 34, 35, 35, 32, 32, 32, 32, 32, 32, 32, 33,
+ 33, 34, 35, 35, 35, 35, 35, 36, 32, 32, 32, 32, 33, 33, 33, 33, 33, 34,
+ 35, 35, 35, 35, 36, 36, 32, 32, 32, 32, 33, 33, 33, 33, 34, 34, 35, 35,
+ 35, 35, 36, 37, 32, 32, 32, 32, 33, 33, 33, 33, 34, 34, 35, 35, 35, 35,
+ 36, 37, 32, 32, 32, 32, 33, 33, 33, 33, 34, 34, 35, 35, 35, 35, 36, 37,
+ 32, 32, 32, 33, 33, 33, 33, 33, 34, 34, 35, 35, 35, 35, 36, 37, 32, 33,
+ 33, 33, 33, 33, 33, 33, 34, 35, 36, 36, 36, 36, 36, 38, 33, 33, 33, 33,
+ 33, 33, 33, 34, 34, 35, 36, 36, 36, 36, 37, 38, 34, 34, 34, 34, 34, 34,
+ 34, 34, 35, 36, 37, 37, 37, 37, 38, 39, 34, 34, 34, 34, 34, 34, 34, 34,
+ 35, 36, 37, 37, 37, 37, 38, 39,
/* Size 4x16 */
- 31, 31, 32, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32,
- 32, 32, 32, 32, 32, 33, 32, 32, 32, 33, 32, 32, 33, 33, 32, 32, 33, 34,
- 32, 32, 33, 34, 32, 32, 33, 34, 32, 32, 34, 35, 32, 33, 34, 35, 32, 33,
- 34, 35, 33, 33, 35, 36, 34, 34, 36, 37,
- /* Size 16x4 */
31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 34, 31, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 34, 32, 32, 32, 32,
32, 32, 32, 33, 33, 33, 33, 34, 34, 34, 35, 36, 32, 32, 32, 32, 32, 33,
33, 33, 34, 34, 34, 35, 35, 35, 36, 37,
+ /* Size 16x4 */
+ 31, 31, 32, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32,
+ 32, 32, 32, 32, 32, 33, 32, 32, 32, 33, 32, 32, 33, 33, 32, 32, 33, 34,
+ 32, 32, 33, 34, 32, 32, 33, 34, 32, 32, 34, 35, 32, 33, 34, 35, 32, 33,
+ 34, 35, 33, 33, 35, 36, 34, 34, 36, 37,
/* Size 8x32 */
- 32, 31, 31, 31, 31, 32, 32, 32, 31, 31, 31, 31, 32, 32, 32, 33, 31, 31,
- 32, 32, 32, 32, 32, 33, 31, 32, 32, 32, 32, 32, 32, 33, 31, 32, 32, 32,
- 32, 32, 32, 33, 31, 32, 32, 32, 32, 32, 32, 33, 31, 32, 32, 32, 32, 32,
- 32, 33, 31, 32, 32, 32, 32, 32, 32, 33, 31, 32, 32, 32, 32, 32, 32, 33,
- 31, 32, 32, 32, 32, 32, 32, 33, 31, 32, 32, 32, 32, 33, 33, 33, 31, 32,
- 32, 32, 32, 33, 33, 33, 31, 32, 32, 32, 32, 33, 33, 33, 31, 32, 32, 32,
- 32, 33, 33, 33, 31, 32, 32, 32, 32, 33, 33, 33, 31, 32, 32, 32, 33, 33,
- 33, 34, 31, 32, 32, 32, 33, 34, 34, 34, 32, 32, 32, 32, 33, 34, 34, 34,
- 32, 32, 32, 32, 33, 34, 34, 34, 32, 32, 32, 32, 33, 34, 34, 34, 32, 32,
- 32, 32, 33, 34, 34, 34, 32, 32, 32, 32, 33, 34, 34, 35, 32, 32, 32, 32,
- 33, 35, 35, 35, 32, 32, 33, 33, 33, 35, 35, 36, 32, 32, 33, 33, 34, 35,
- 35, 36, 32, 32, 33, 33, 34, 35, 35, 36, 32, 32, 33, 33, 34, 35, 35, 36,
- 32, 32, 33, 33, 34, 35, 35, 36, 32, 33, 33, 33, 34, 36, 36, 36, 33, 33,
- 33, 33, 34, 36, 36, 37, 34, 34, 34, 34, 35, 37, 37, 38, 34, 34, 34, 34,
- 35, 37, 37, 38,
- /* Size 32x8 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 34, 34, 31, 31, 31, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
@@ -5622,7 +5606,23 @@
32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34, 34, 34, 35, 35,
35, 35, 35, 35, 36, 36, 37, 37, 32, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 34, 34, 34, 34, 34, 34, 35, 35, 36, 36, 36, 36, 36,
- 36, 37, 38, 38 },
+ 36, 37, 38, 38,
+ /* Size 32x8 */
+ 32, 31, 31, 31, 31, 32, 32, 32, 31, 31, 31, 31, 32, 32, 32, 33, 31, 31,
+ 32, 32, 32, 32, 32, 33, 31, 32, 32, 32, 32, 32, 32, 33, 31, 32, 32, 32,
+ 32, 32, 32, 33, 31, 32, 32, 32, 32, 32, 32, 33, 31, 32, 32, 32, 32, 32,
+ 32, 33, 31, 32, 32, 32, 32, 32, 32, 33, 31, 32, 32, 32, 32, 32, 32, 33,
+ 31, 32, 32, 32, 32, 32, 32, 33, 31, 32, 32, 32, 32, 33, 33, 33, 31, 32,
+ 32, 32, 32, 33, 33, 33, 31, 32, 32, 32, 32, 33, 33, 33, 31, 32, 32, 32,
+ 32, 33, 33, 33, 31, 32, 32, 32, 32, 33, 33, 33, 31, 32, 32, 32, 33, 33,
+ 33, 34, 31, 32, 32, 32, 33, 34, 34, 34, 32, 32, 32, 32, 33, 34, 34, 34,
+ 32, 32, 32, 32, 33, 34, 34, 34, 32, 32, 32, 32, 33, 34, 34, 34, 32, 32,
+ 32, 32, 33, 34, 34, 34, 32, 32, 32, 32, 33, 34, 34, 35, 32, 32, 32, 32,
+ 33, 35, 35, 35, 32, 32, 33, 33, 33, 35, 35, 36, 32, 32, 33, 33, 34, 35,
+ 35, 36, 32, 32, 33, 33, 34, 35, 35, 36, 32, 32, 33, 33, 34, 35, 35, 36,
+ 32, 32, 33, 33, 34, 35, 35, 36, 32, 33, 33, 33, 34, 36, 36, 36, 33, 33,
+ 33, 33, 34, 36, 36, 37, 34, 34, 34, 34, 35, 37, 37, 38, 34, 34, 34, 34,
+ 35, 37, 37, 38 },
{ /* Chroma */
/* Size 4x4 */
31, 31, 34, 38, 31, 32, 35, 40, 34, 35, 39, 43, 38, 40, 43, 47,
@@ -5706,21 +5706,12 @@
48, 48, 41, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 43, 43,
44, 45, 45, 45, 45, 45, 46, 47, 47, 47, 47, 47, 48, 48, 48, 48,
/* Size 4x8 */
- 31, 31, 35, 37, 31, 31, 36, 38, 31, 32, 37, 39, 31, 32, 37, 40, 34, 36,
- 40, 43, 35, 37, 42, 44, 38, 40, 45, 47, 41, 42, 45, 47,
- /* Size 8x4 */
31, 31, 31, 31, 34, 35, 38, 41, 31, 31, 32, 32, 36, 37, 40, 42, 35, 36,
37, 37, 40, 42, 45, 45, 37, 38, 39, 40, 43, 44, 47, 47,
+ /* Size 8x4 */
+ 31, 31, 35, 37, 31, 31, 36, 38, 31, 32, 37, 39, 31, 32, 37, 40, 34, 36,
+ 40, 43, 35, 37, 42, 44, 38, 40, 45, 47, 41, 42, 45, 47,
/* Size 8x16 */
- 32, 31, 31, 31, 33, 37, 37, 38, 31, 31, 31, 31, 33, 38, 38, 39, 31, 31,
- 31, 31, 34, 38, 38, 40, 31, 31, 31, 31, 34, 38, 38, 40, 31, 31, 32, 32,
- 34, 39, 39, 40, 30, 31, 32, 32, 35, 40, 40, 41, 30, 31, 32, 32, 35, 40,
- 40, 41, 31, 32, 33, 33, 35, 40, 40, 41, 33, 34, 35, 35, 37, 42, 42, 43,
- 33, 35, 36, 36, 38, 43, 43, 44, 33, 35, 36, 36, 38, 43, 43, 44, 35, 37,
- 38, 38, 41, 45, 45, 46, 37, 39, 40, 40, 43, 47, 47, 47, 37, 39, 40, 40,
- 43, 47, 47, 47, 39, 40, 41, 41, 43, 47, 47, 47, 42, 42, 43, 43, 44, 47,
- 47, 48,
- /* Size 16x8 */
32, 31, 31, 31, 31, 30, 30, 31, 33, 33, 33, 35, 37, 37, 39, 42, 31, 31,
31, 31, 31, 31, 31, 32, 34, 35, 35, 37, 39, 39, 40, 42, 31, 31, 31, 31,
32, 32, 32, 33, 35, 36, 36, 38, 40, 40, 41, 43, 31, 31, 31, 31, 32, 32,
@@ -5729,37 +5720,16 @@
43, 45, 47, 47, 47, 47, 37, 38, 38, 38, 39, 40, 40, 40, 42, 43, 43, 45,
47, 47, 47, 47, 38, 39, 40, 40, 40, 41, 41, 41, 43, 44, 44, 46, 47, 47,
47, 48,
+ /* Size 16x8 */
+ 32, 31, 31, 31, 33, 37, 37, 38, 31, 31, 31, 31, 33, 38, 38, 39, 31, 31,
+ 31, 31, 34, 38, 38, 40, 31, 31, 31, 31, 34, 38, 38, 40, 31, 31, 32, 32,
+ 34, 39, 39, 40, 30, 31, 32, 32, 35, 40, 40, 41, 30, 31, 32, 32, 35, 40,
+ 40, 41, 31, 32, 33, 33, 35, 40, 40, 41, 33, 34, 35, 35, 37, 42, 42, 43,
+ 33, 35, 36, 36, 38, 43, 43, 44, 33, 35, 36, 36, 38, 43, 43, 44, 35, 37,
+ 38, 38, 41, 45, 45, 46, 37, 39, 40, 40, 43, 47, 47, 47, 37, 39, 40, 40,
+ 43, 47, 47, 47, 39, 40, 41, 41, 43, 47, 47, 47, 42, 42, 43, 43, 44, 47,
+ 47, 48,
/* Size 16x32 */
- 32, 31, 31, 31, 31, 31, 31, 31, 33, 35, 37, 37, 37, 37, 38, 42, 31, 31,
- 31, 31, 31, 31, 31, 31, 33, 35, 37, 37, 37, 37, 39, 42, 31, 31, 31, 31,
- 31, 31, 31, 32, 33, 35, 38, 38, 38, 38, 39, 42, 31, 31, 31, 31, 31, 31,
- 31, 32, 34, 36, 38, 38, 38, 38, 40, 42, 31, 31, 31, 31, 31, 31, 31, 32,
- 34, 36, 38, 38, 38, 38, 40, 42, 31, 31, 31, 31, 31, 31, 31, 32, 34, 36,
- 38, 38, 38, 38, 40, 42, 31, 31, 31, 31, 31, 31, 31, 32, 34, 36, 38, 38,
- 38, 38, 40, 42, 31, 31, 31, 31, 31, 31, 31, 32, 34, 36, 38, 38, 38, 38,
- 40, 42, 31, 31, 31, 31, 32, 32, 32, 32, 34, 36, 39, 39, 39, 39, 40, 42,
- 30, 31, 31, 32, 32, 32, 32, 32, 34, 37, 39, 39, 39, 39, 40, 42, 30, 31,
- 31, 32, 32, 32, 32, 33, 35, 37, 40, 40, 40, 40, 41, 42, 30, 31, 31, 32,
- 32, 32, 32, 33, 35, 37, 40, 40, 40, 40, 41, 42, 30, 31, 31, 32, 32, 32,
- 32, 33, 35, 37, 40, 40, 40, 40, 41, 42, 30, 31, 31, 32, 32, 32, 32, 33,
- 35, 37, 40, 40, 40, 40, 41, 42, 31, 31, 32, 32, 33, 33, 33, 33, 35, 38,
- 40, 40, 40, 40, 41, 43, 32, 32, 33, 33, 34, 34, 34, 34, 36, 39, 41, 41,
- 41, 41, 42, 44, 33, 33, 34, 35, 35, 35, 35, 35, 37, 40, 42, 42, 42, 42,
- 43, 44, 33, 34, 35, 35, 36, 36, 36, 36, 38, 40, 43, 43, 43, 43, 44, 45,
- 33, 34, 35, 35, 36, 36, 36, 36, 38, 40, 43, 43, 43, 43, 44, 45, 33, 34,
- 35, 35, 36, 36, 36, 36, 38, 40, 43, 43, 43, 43, 44, 45, 33, 34, 35, 35,
- 36, 36, 36, 36, 38, 40, 43, 43, 43, 43, 44, 45, 34, 35, 36, 37, 37, 37,
- 37, 37, 39, 42, 44, 44, 44, 44, 45, 45, 35, 36, 37, 38, 38, 38, 38, 39,
- 41, 43, 45, 45, 45, 45, 46, 46, 36, 37, 38, 39, 39, 39, 39, 40, 42, 44,
- 47, 47, 47, 47, 47, 47, 37, 38, 39, 40, 40, 40, 40, 41, 43, 45, 47, 47,
- 47, 47, 47, 47, 37, 38, 39, 40, 40, 40, 40, 41, 43, 45, 47, 47, 47, 47,
- 47, 47, 37, 38, 39, 40, 40, 40, 40, 41, 43, 45, 47, 47, 47, 47, 47, 47,
- 37, 38, 39, 40, 40, 40, 40, 41, 43, 45, 47, 47, 47, 47, 47, 47, 39, 39,
- 40, 41, 41, 41, 41, 42, 43, 45, 47, 47, 47, 47, 47, 48, 40, 41, 41, 42,
- 42, 42, 42, 42, 44, 45, 47, 47, 47, 47, 47, 48, 42, 42, 42, 43, 43, 43,
- 43, 43, 44, 46, 47, 47, 47, 47, 48, 48, 42, 42, 42, 43, 43, 43, 43, 43,
- 44, 46, 47, 47, 47, 47, 48, 48,
- /* Size 32x16 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 30, 30, 30, 30, 30, 31, 32, 33, 33,
33, 33, 33, 34, 35, 36, 37, 37, 37, 37, 39, 40, 42, 42, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 33, 34, 34, 34, 34, 35,
@@ -5789,33 +5759,47 @@
44, 45, 46, 47, 47, 47, 47, 47, 47, 47, 48, 48, 42, 42, 42, 42, 42, 42,
42, 42, 42, 42, 42, 42, 42, 42, 43, 44, 44, 45, 45, 45, 45, 45, 46, 47,
47, 47, 47, 47, 48, 48, 48, 48,
+ /* Size 32x16 */
+ 32, 31, 31, 31, 31, 31, 31, 31, 33, 35, 37, 37, 37, 37, 38, 42, 31, 31,
+ 31, 31, 31, 31, 31, 31, 33, 35, 37, 37, 37, 37, 39, 42, 31, 31, 31, 31,
+ 31, 31, 31, 32, 33, 35, 38, 38, 38, 38, 39, 42, 31, 31, 31, 31, 31, 31,
+ 31, 32, 34, 36, 38, 38, 38, 38, 40, 42, 31, 31, 31, 31, 31, 31, 31, 32,
+ 34, 36, 38, 38, 38, 38, 40, 42, 31, 31, 31, 31, 31, 31, 31, 32, 34, 36,
+ 38, 38, 38, 38, 40, 42, 31, 31, 31, 31, 31, 31, 31, 32, 34, 36, 38, 38,
+ 38, 38, 40, 42, 31, 31, 31, 31, 31, 31, 31, 32, 34, 36, 38, 38, 38, 38,
+ 40, 42, 31, 31, 31, 31, 32, 32, 32, 32, 34, 36, 39, 39, 39, 39, 40, 42,
+ 30, 31, 31, 32, 32, 32, 32, 32, 34, 37, 39, 39, 39, 39, 40, 42, 30, 31,
+ 31, 32, 32, 32, 32, 33, 35, 37, 40, 40, 40, 40, 41, 42, 30, 31, 31, 32,
+ 32, 32, 32, 33, 35, 37, 40, 40, 40, 40, 41, 42, 30, 31, 31, 32, 32, 32,
+ 32, 33, 35, 37, 40, 40, 40, 40, 41, 42, 30, 31, 31, 32, 32, 32, 32, 33,
+ 35, 37, 40, 40, 40, 40, 41, 42, 31, 31, 32, 32, 33, 33, 33, 33, 35, 38,
+ 40, 40, 40, 40, 41, 43, 32, 32, 33, 33, 34, 34, 34, 34, 36, 39, 41, 41,
+ 41, 41, 42, 44, 33, 33, 34, 35, 35, 35, 35, 35, 37, 40, 42, 42, 42, 42,
+ 43, 44, 33, 34, 35, 35, 36, 36, 36, 36, 38, 40, 43, 43, 43, 43, 44, 45,
+ 33, 34, 35, 35, 36, 36, 36, 36, 38, 40, 43, 43, 43, 43, 44, 45, 33, 34,
+ 35, 35, 36, 36, 36, 36, 38, 40, 43, 43, 43, 43, 44, 45, 33, 34, 35, 35,
+ 36, 36, 36, 36, 38, 40, 43, 43, 43, 43, 44, 45, 34, 35, 36, 37, 37, 37,
+ 37, 37, 39, 42, 44, 44, 44, 44, 45, 45, 35, 36, 37, 38, 38, 38, 38, 39,
+ 41, 43, 45, 45, 45, 45, 46, 46, 36, 37, 38, 39, 39, 39, 39, 40, 42, 44,
+ 47, 47, 47, 47, 47, 47, 37, 38, 39, 40, 40, 40, 40, 41, 43, 45, 47, 47,
+ 47, 47, 47, 47, 37, 38, 39, 40, 40, 40, 40, 41, 43, 45, 47, 47, 47, 47,
+ 47, 47, 37, 38, 39, 40, 40, 40, 40, 41, 43, 45, 47, 47, 47, 47, 47, 47,
+ 37, 38, 39, 40, 40, 40, 40, 41, 43, 45, 47, 47, 47, 47, 47, 47, 39, 39,
+ 40, 41, 41, 41, 41, 42, 43, 45, 47, 47, 47, 47, 47, 48, 40, 41, 41, 42,
+ 42, 42, 42, 42, 44, 45, 47, 47, 47, 47, 47, 48, 42, 42, 42, 43, 43, 43,
+ 43, 43, 44, 46, 47, 47, 47, 47, 48, 48, 42, 42, 42, 43, 43, 43, 43, 43,
+ 44, 46, 47, 47, 47, 47, 48, 48,
/* Size 4x16 */
- 31, 31, 35, 37, 31, 31, 35, 38, 31, 31, 36, 38, 31, 31, 36, 38, 31, 32,
- 36, 39, 31, 32, 37, 40, 31, 32, 37, 40, 31, 33, 38, 40, 33, 35, 40, 42,
- 34, 36, 40, 43, 34, 36, 40, 43, 36, 38, 43, 45, 38, 40, 45, 47, 38, 40,
- 45, 47, 39, 41, 45, 47, 42, 43, 46, 47,
- /* Size 16x4 */
31, 31, 31, 31, 31, 31, 31, 31, 33, 34, 34, 36, 38, 38, 39, 42, 31, 31,
31, 31, 32, 32, 32, 33, 35, 36, 36, 38, 40, 40, 41, 43, 35, 35, 36, 36,
36, 37, 37, 38, 40, 40, 40, 43, 45, 45, 45, 46, 37, 38, 38, 38, 39, 40,
40, 40, 42, 43, 43, 45, 47, 47, 47, 47,
+ /* Size 16x4 */
+ 31, 31, 35, 37, 31, 31, 35, 38, 31, 31, 36, 38, 31, 31, 36, 38, 31, 32,
+ 36, 39, 31, 32, 37, 40, 31, 32, 37, 40, 31, 33, 38, 40, 33, 35, 40, 42,
+ 34, 36, 40, 43, 34, 36, 40, 43, 36, 38, 43, 45, 38, 40, 45, 47, 38, 40,
+ 45, 47, 39, 41, 45, 47, 42, 43, 46, 47,
/* Size 8x32 */
- 32, 31, 31, 31, 33, 37, 37, 38, 31, 31, 31, 31, 33, 37, 37, 39, 31, 31,
- 31, 31, 33, 38, 38, 39, 31, 31, 31, 31, 34, 38, 38, 40, 31, 31, 31, 31,
- 34, 38, 38, 40, 31, 31, 31, 31, 34, 38, 38, 40, 31, 31, 31, 31, 34, 38,
- 38, 40, 31, 31, 31, 31, 34, 38, 38, 40, 31, 31, 32, 32, 34, 39, 39, 40,
- 30, 31, 32, 32, 34, 39, 39, 40, 30, 31, 32, 32, 35, 40, 40, 41, 30, 31,
- 32, 32, 35, 40, 40, 41, 30, 31, 32, 32, 35, 40, 40, 41, 30, 31, 32, 32,
- 35, 40, 40, 41, 31, 32, 33, 33, 35, 40, 40, 41, 32, 33, 34, 34, 36, 41,
- 41, 42, 33, 34, 35, 35, 37, 42, 42, 43, 33, 35, 36, 36, 38, 43, 43, 44,
- 33, 35, 36, 36, 38, 43, 43, 44, 33, 35, 36, 36, 38, 43, 43, 44, 33, 35,
- 36, 36, 38, 43, 43, 44, 34, 36, 37, 37, 39, 44, 44, 45, 35, 37, 38, 38,
- 41, 45, 45, 46, 36, 38, 39, 39, 42, 47, 47, 47, 37, 39, 40, 40, 43, 47,
- 47, 47, 37, 39, 40, 40, 43, 47, 47, 47, 37, 39, 40, 40, 43, 47, 47, 47,
- 37, 39, 40, 40, 43, 47, 47, 47, 39, 40, 41, 41, 43, 47, 47, 47, 40, 41,
- 42, 42, 44, 47, 47, 47, 42, 42, 43, 43, 44, 47, 47, 48, 42, 42, 43, 43,
- 44, 47, 47, 48,
- /* Size 32x8 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 30, 30, 30, 30, 30, 31, 32, 33, 33,
33, 33, 33, 34, 35, 36, 37, 37, 37, 37, 39, 40, 42, 42, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 33, 34, 35, 35, 35, 35, 36,
@@ -5830,7 +5814,23 @@
38, 38, 39, 39, 40, 40, 40, 40, 40, 41, 42, 43, 43, 43, 43, 44, 45, 47,
47, 47, 47, 47, 47, 47, 47, 47, 38, 39, 39, 40, 40, 40, 40, 40, 40, 40,
41, 41, 41, 41, 41, 42, 43, 44, 44, 44, 44, 45, 46, 47, 47, 47, 47, 47,
- 47, 47, 48, 48 },
+ 47, 47, 48, 48,
+ /* Size 32x8 */
+ 32, 31, 31, 31, 33, 37, 37, 38, 31, 31, 31, 31, 33, 37, 37, 39, 31, 31,
+ 31, 31, 33, 38, 38, 39, 31, 31, 31, 31, 34, 38, 38, 40, 31, 31, 31, 31,
+ 34, 38, 38, 40, 31, 31, 31, 31, 34, 38, 38, 40, 31, 31, 31, 31, 34, 38,
+ 38, 40, 31, 31, 31, 31, 34, 38, 38, 40, 31, 31, 32, 32, 34, 39, 39, 40,
+ 30, 31, 32, 32, 34, 39, 39, 40, 30, 31, 32, 32, 35, 40, 40, 41, 30, 31,
+ 32, 32, 35, 40, 40, 41, 30, 31, 32, 32, 35, 40, 40, 41, 30, 31, 32, 32,
+ 35, 40, 40, 41, 31, 32, 33, 33, 35, 40, 40, 41, 32, 33, 34, 34, 36, 41,
+ 41, 42, 33, 34, 35, 35, 37, 42, 42, 43, 33, 35, 36, 36, 38, 43, 43, 44,
+ 33, 35, 36, 36, 38, 43, 43, 44, 33, 35, 36, 36, 38, 43, 43, 44, 33, 35,
+ 36, 36, 38, 43, 43, 44, 34, 36, 37, 37, 39, 44, 44, 45, 35, 37, 38, 38,
+ 41, 45, 45, 46, 36, 38, 39, 39, 42, 47, 47, 47, 37, 39, 40, 40, 43, 47,
+ 47, 47, 37, 39, 40, 40, 43, 47, 47, 47, 37, 39, 40, 40, 43, 47, 47, 47,
+ 37, 39, 40, 40, 43, 47, 47, 47, 39, 40, 41, 41, 43, 47, 47, 47, 40, 41,
+ 42, 42, 44, 47, 47, 47, 42, 42, 43, 43, 44, 47, 47, 48, 42, 42, 43, 43,
+ 44, 47, 47, 48 },
},
{
{ /* Luma */
@@ -5916,21 +5916,12 @@
33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
/* Size 4x8 */
- 31, 31, 31, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32,
- 32, 32, 31, 32, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33,
- /* Size 8x4 */
31, 31, 31, 31, 31, 31, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33,
+ /* Size 8x4 */
+ 31, 31, 31, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32,
+ 32, 32, 31, 32, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33,
/* Size 8x16 */
- 32, 31, 31, 31, 31, 31, 31, 32, 31, 31, 31, 31, 31, 31, 32, 32, 31, 31,
- 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32,
- 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32,
- 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32,
- 31, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32,
- 32, 32, 32, 32, 33, 33, 31, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32,
- 32, 32, 33, 34, 32, 32, 32, 32, 32, 32, 33, 34, 32, 32, 32, 32, 32, 32,
- 33, 34,
- /* Size 16x8 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 31, 31,
31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32,
@@ -5939,37 +5930,16 @@
32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33,
33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 34,
34, 34,
+ /* Size 16x8 */
+ 32, 31, 31, 31, 31, 31, 31, 32, 31, 31, 31, 31, 31, 31, 32, 32, 31, 31,
+ 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32,
+ 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32,
+ 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32,
+ 31, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32,
+ 32, 32, 32, 32, 33, 33, 31, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32,
+ 32, 32, 33, 34, 32, 32, 32, 32, 32, 32, 33, 34, 32, 32, 32, 32, 32, 32,
+ 33, 34,
/* Size 16x32 */
- 32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 31, 31, 31, 31, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 33, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 33, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33,
- 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 31, 31,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 31, 31, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 31, 31, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 31, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 33, 33, 33, 33, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 33, 33, 33, 33, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33,
- 33, 33, 33, 34, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33,
- 34, 34, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 34, 34,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 34, 34, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 34, 34, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 34, 34, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 33, 33, 33, 34, 34, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 33, 33, 33, 34, 34,
- /* Size 32x16 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
@@ -5999,33 +5969,47 @@
32, 33, 33, 33, 33, 34, 34, 34, 34, 34, 34, 34, 32, 32, 32, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 33, 33, 33,
34, 34, 34, 34, 34, 34, 34, 34,
+ /* Size 32x16 */
+ 32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 31, 31, 31, 31, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 33, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 33, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33,
+ 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 31, 31,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 31, 31, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 31, 31, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 31, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 33, 33, 33, 33, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 33, 33, 33, 33, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33,
+ 33, 33, 33, 34, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33,
+ 34, 34, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 34, 34,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 34, 34, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 34, 34, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 34, 34, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 33, 33, 33, 34, 34, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 33, 33, 33, 34, 34,
/* Size 4x16 */
- 31, 31, 31, 32, 31, 31, 31, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32,
- 32, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32, 32, 32,
- 31, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 33, 32, 32,
- 32, 33, 32, 32, 32, 33, 32, 32, 32, 33,
- /* Size 16x4 */
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 31, 31,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
32, 32, 32, 32, 32, 33, 33, 33, 33, 33,
+ /* Size 16x4 */
+ 31, 31, 31, 32, 31, 31, 31, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32,
+ 32, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32, 32, 32,
+ 31, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 33, 32, 32,
+ 32, 33, 32, 32, 32, 33, 32, 32, 32, 33,
/* Size 8x32 */
- 32, 31, 31, 31, 31, 31, 31, 32, 31, 31, 31, 31, 31, 31, 32, 32, 31, 31,
- 31, 31, 31, 31, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32,
- 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32,
- 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32,
- 31, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32,
- 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32,
- 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32,
- 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32,
- 31, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32,
- 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 33, 31, 32, 32, 32,
- 32, 32, 33, 33, 31, 32, 32, 32, 32, 32, 33, 33, 31, 32, 32, 32, 32, 32,
- 33, 33, 32, 32, 32, 32, 32, 32, 33, 34, 32, 32, 32, 32, 32, 32, 33, 34,
- 32, 32, 32, 32, 32, 32, 33, 34, 32, 32, 32, 32, 32, 32, 33, 34, 32, 32,
- 32, 32, 32, 32, 33, 34, 32, 32, 32, 32, 32, 32, 33, 34, 32, 32, 32, 32,
- 32, 32, 33, 34,
- /* Size 32x8 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31,
31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
@@ -6040,7 +6024,23 @@
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 34, 34, 34,
- 34, 34, 34, 34 },
+ 34, 34, 34, 34,
+ /* Size 32x8 */
+ 32, 31, 31, 31, 31, 31, 31, 32, 31, 31, 31, 31, 31, 31, 32, 32, 31, 31,
+ 31, 31, 31, 31, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32,
+ 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32,
+ 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32,
+ 31, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32,
+ 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32,
+ 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32,
+ 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32,
+ 31, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 32, 31, 32,
+ 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 32, 32, 32, 33, 31, 32, 32, 32,
+ 32, 32, 33, 33, 31, 32, 32, 32, 32, 32, 33, 33, 31, 32, 32, 32, 32, 32,
+ 33, 33, 32, 32, 32, 32, 32, 32, 33, 34, 32, 32, 32, 32, 32, 32, 33, 34,
+ 32, 32, 32, 32, 32, 32, 33, 34, 32, 32, 32, 32, 32, 32, 33, 34, 32, 32,
+ 32, 32, 32, 32, 33, 34, 32, 32, 32, 32, 32, 32, 33, 34, 32, 32, 32, 32,
+ 32, 32, 33, 34 },
{ /* Chroma */
/* Size 4x4 */
31, 31, 31, 34, 31, 31, 31, 35, 31, 31, 32, 35, 34, 35, 35, 39,
@@ -6124,21 +6124,12 @@
39, 40, 34, 34, 34, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 36, 36, 36,
36, 36, 36, 36, 36, 37, 37, 38, 39, 40, 40, 40, 40, 40, 40, 40,
/* Size 4x8 */
- 31, 31, 31, 34, 31, 31, 31, 35, 31, 31, 31, 35, 31, 32, 32, 36, 31, 32,
- 32, 36, 31, 33, 33, 37, 34, 36, 36, 40, 34, 36, 36, 40,
- /* Size 8x4 */
31, 31, 31, 31, 31, 31, 34, 34, 31, 31, 31, 32, 32, 33, 36, 36, 31, 31,
31, 32, 32, 33, 36, 36, 34, 35, 35, 36, 36, 37, 40, 40,
+ /* Size 8x4 */
+ 31, 31, 31, 34, 31, 31, 31, 35, 31, 31, 31, 35, 31, 32, 32, 36, 31, 32,
+ 32, 36, 31, 33, 33, 37, 34, 36, 36, 40, 34, 36, 36, 40,
/* Size 8x16 */
- 32, 31, 31, 31, 31, 31, 33, 35, 31, 31, 31, 31, 31, 31, 33, 36, 31, 31,
- 31, 31, 31, 31, 34, 36, 31, 31, 31, 31, 31, 31, 34, 37, 31, 31, 31, 31,
- 31, 31, 34, 37, 31, 31, 31, 31, 31, 31, 34, 37, 31, 31, 31, 32, 32, 32,
- 34, 37, 30, 31, 31, 32, 32, 32, 34, 38, 30, 31, 32, 32, 32, 32, 35, 38,
- 30, 31, 32, 32, 32, 32, 35, 38, 30, 31, 32, 32, 32, 32, 35, 38, 31, 32,
- 33, 33, 33, 33, 36, 39, 33, 34, 34, 35, 35, 35, 37, 40, 33, 34, 35, 36,
- 36, 36, 38, 41, 33, 34, 35, 36, 36, 36, 38, 41, 33, 34, 35, 36, 36, 36,
- 38, 41,
- /* Size 16x8 */
32, 31, 31, 31, 31, 31, 31, 30, 30, 30, 30, 31, 33, 33, 33, 33, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 34, 34, 34, 34, 31, 31, 31, 31,
31, 31, 31, 31, 32, 32, 32, 33, 34, 35, 35, 35, 31, 31, 31, 31, 31, 31,
@@ -6147,37 +6138,16 @@
32, 33, 35, 36, 36, 36, 33, 33, 34, 34, 34, 34, 34, 34, 35, 35, 35, 36,
37, 38, 38, 38, 35, 36, 36, 37, 37, 37, 37, 38, 38, 38, 38, 39, 40, 41,
41, 41,
+ /* Size 16x8 */
+ 32, 31, 31, 31, 31, 31, 33, 35, 31, 31, 31, 31, 31, 31, 33, 36, 31, 31,
+ 31, 31, 31, 31, 34, 36, 31, 31, 31, 31, 31, 31, 34, 37, 31, 31, 31, 31,
+ 31, 31, 34, 37, 31, 31, 31, 31, 31, 31, 34, 37, 31, 31, 31, 32, 32, 32,
+ 34, 37, 30, 31, 31, 32, 32, 32, 34, 38, 30, 31, 32, 32, 32, 32, 35, 38,
+ 30, 31, 32, 32, 32, 32, 35, 38, 30, 31, 32, 32, 32, 32, 35, 38, 31, 32,
+ 33, 33, 33, 33, 36, 39, 33, 34, 34, 35, 35, 35, 37, 40, 33, 34, 35, 36,
+ 36, 36, 38, 41, 33, 34, 35, 36, 36, 36, 38, 41, 33, 34, 35, 36, 36, 36,
+ 38, 41,
/* Size 16x32 */
- 32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 33, 34, 35, 37, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 33, 34, 35, 37, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 32, 33, 34, 36, 37, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 32, 33, 35, 36, 38, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 32, 34, 35, 36, 38, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 33, 34, 35, 37, 38, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 33,
- 34, 35, 37, 38, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 33, 34, 35,
- 37, 38, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 33, 34, 35, 37, 38,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 33, 34, 35, 37, 38, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 33, 34, 35, 37, 38, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 33, 34, 35, 37, 38, 31, 31, 31, 31, 31, 32,
- 32, 32, 32, 32, 32, 33, 34, 36, 37, 39, 31, 31, 31, 31, 31, 32, 32, 32,
- 32, 32, 32, 33, 34, 36, 37, 39, 30, 31, 31, 31, 31, 32, 32, 32, 32, 32,
- 32, 33, 34, 36, 38, 39, 30, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 33,
- 35, 36, 38, 40, 30, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 33, 35, 36,
- 38, 40, 30, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 33, 35, 36, 38, 40,
- 30, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 33, 35, 36, 38, 40, 30, 31,
- 31, 31, 32, 32, 32, 32, 32, 32, 32, 33, 35, 36, 38, 40, 30, 31, 31, 31,
- 32, 32, 32, 32, 32, 32, 32, 33, 35, 36, 38, 40, 31, 31, 31, 32, 32, 33,
- 33, 33, 33, 33, 33, 34, 35, 37, 38, 40, 31, 32, 32, 33, 33, 33, 33, 33,
- 33, 33, 33, 35, 36, 37, 39, 41, 32, 32, 33, 33, 34, 34, 34, 34, 34, 34,
- 34, 35, 37, 38, 40, 41, 33, 33, 34, 34, 34, 35, 35, 35, 35, 35, 35, 36,
- 37, 39, 40, 42, 33, 34, 34, 35, 35, 36, 36, 36, 36, 36, 36, 37, 38, 40,
- 41, 43, 33, 34, 34, 35, 35, 36, 36, 36, 36, 36, 36, 37, 38, 40, 41, 43,
- 33, 34, 34, 35, 35, 36, 36, 36, 36, 36, 36, 37, 38, 40, 41, 43, 33, 34,
- 34, 35, 35, 36, 36, 36, 36, 36, 36, 37, 38, 40, 41, 43, 33, 34, 34, 35,
- 35, 36, 36, 36, 36, 36, 36, 37, 38, 40, 41, 43, 33, 34, 34, 35, 35, 36,
- 36, 36, 36, 36, 36, 37, 38, 40, 41, 43, 34, 34, 35, 35, 36, 36, 36, 36,
- 36, 36, 36, 38, 39, 40, 42, 44,
- /* Size 32x16 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 30, 30, 30, 30,
30, 30, 30, 31, 31, 32, 33, 33, 33, 33, 33, 33, 33, 34, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
@@ -6207,33 +6177,47 @@
38, 38, 39, 40, 40, 41, 41, 41, 41, 41, 41, 42, 37, 37, 37, 38, 38, 38,
38, 38, 38, 38, 38, 38, 39, 39, 39, 40, 40, 40, 40, 40, 40, 40, 41, 41,
42, 43, 43, 43, 43, 43, 43, 44,
+ /* Size 32x16 */
+ 32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 33, 34, 35, 37, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 33, 34, 35, 37, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 32, 33, 34, 36, 37, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 32, 33, 35, 36, 38, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 32, 34, 35, 36, 38, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 33, 34, 35, 37, 38, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 33,
+ 34, 35, 37, 38, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 33, 34, 35,
+ 37, 38, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 33, 34, 35, 37, 38,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 33, 34, 35, 37, 38, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 33, 34, 35, 37, 38, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 33, 34, 35, 37, 38, 31, 31, 31, 31, 31, 32,
+ 32, 32, 32, 32, 32, 33, 34, 36, 37, 39, 31, 31, 31, 31, 31, 32, 32, 32,
+ 32, 32, 32, 33, 34, 36, 37, 39, 30, 31, 31, 31, 31, 32, 32, 32, 32, 32,
+ 32, 33, 34, 36, 38, 39, 30, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 33,
+ 35, 36, 38, 40, 30, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 33, 35, 36,
+ 38, 40, 30, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 33, 35, 36, 38, 40,
+ 30, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 33, 35, 36, 38, 40, 30, 31,
+ 31, 31, 32, 32, 32, 32, 32, 32, 32, 33, 35, 36, 38, 40, 30, 31, 31, 31,
+ 32, 32, 32, 32, 32, 32, 32, 33, 35, 36, 38, 40, 31, 31, 31, 32, 32, 33,
+ 33, 33, 33, 33, 33, 34, 35, 37, 38, 40, 31, 32, 32, 33, 33, 33, 33, 33,
+ 33, 33, 33, 35, 36, 37, 39, 41, 32, 32, 33, 33, 34, 34, 34, 34, 34, 34,
+ 34, 35, 37, 38, 40, 41, 33, 33, 34, 34, 34, 35, 35, 35, 35, 35, 35, 36,
+ 37, 39, 40, 42, 33, 34, 34, 35, 35, 36, 36, 36, 36, 36, 36, 37, 38, 40,
+ 41, 43, 33, 34, 34, 35, 35, 36, 36, 36, 36, 36, 36, 37, 38, 40, 41, 43,
+ 33, 34, 34, 35, 35, 36, 36, 36, 36, 36, 36, 37, 38, 40, 41, 43, 33, 34,
+ 34, 35, 35, 36, 36, 36, 36, 36, 36, 37, 38, 40, 41, 43, 33, 34, 34, 35,
+ 35, 36, 36, 36, 36, 36, 36, 37, 38, 40, 41, 43, 33, 34, 34, 35, 35, 36,
+ 36, 36, 36, 36, 36, 37, 38, 40, 41, 43, 34, 34, 35, 35, 36, 36, 36, 36,
+ 36, 36, 36, 38, 39, 40, 42, 44,
/* Size 4x16 */
- 31, 31, 31, 34, 31, 31, 31, 34, 31, 31, 31, 35, 31, 31, 31, 35, 31, 31,
- 31, 35, 31, 31, 31, 35, 31, 32, 32, 36, 31, 32, 32, 36, 31, 32, 32, 36,
- 31, 32, 32, 36, 31, 32, 32, 36, 32, 33, 33, 37, 33, 35, 35, 39, 34, 36,
- 36, 40, 34, 36, 36, 40, 34, 36, 36, 40,
- /* Size 16x4 */
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 33, 34, 34, 34, 31, 31,
31, 31, 31, 31, 32, 32, 32, 32, 32, 33, 35, 36, 36, 36, 31, 31, 31, 31,
31, 31, 32, 32, 32, 32, 32, 33, 35, 36, 36, 36, 34, 34, 35, 35, 35, 35,
36, 36, 36, 36, 36, 37, 39, 40, 40, 40,
+ /* Size 16x4 */
+ 31, 31, 31, 34, 31, 31, 31, 34, 31, 31, 31, 35, 31, 31, 31, 35, 31, 31,
+ 31, 35, 31, 31, 31, 35, 31, 32, 32, 36, 31, 32, 32, 36, 31, 32, 32, 36,
+ 31, 32, 32, 36, 31, 32, 32, 36, 32, 33, 33, 37, 33, 35, 35, 39, 34, 36,
+ 36, 40, 34, 36, 36, 40, 34, 36, 36, 40,
/* Size 8x32 */
- 32, 31, 31, 31, 31, 31, 33, 35, 31, 31, 31, 31, 31, 31, 33, 35, 31, 31,
- 31, 31, 31, 31, 33, 36, 31, 31, 31, 31, 31, 31, 33, 36, 31, 31, 31, 31,
- 31, 31, 34, 36, 31, 31, 31, 31, 31, 31, 34, 37, 31, 31, 31, 31, 31, 31,
- 34, 37, 31, 31, 31, 31, 31, 31, 34, 37, 31, 31, 31, 31, 31, 31, 34, 37,
- 31, 31, 31, 31, 31, 31, 34, 37, 31, 31, 31, 31, 31, 31, 34, 37, 31, 31,
- 31, 31, 31, 31, 34, 37, 31, 31, 31, 32, 32, 32, 34, 37, 31, 31, 31, 32,
- 32, 32, 34, 37, 30, 31, 31, 32, 32, 32, 34, 38, 30, 31, 32, 32, 32, 32,
- 35, 38, 30, 31, 32, 32, 32, 32, 35, 38, 30, 31, 32, 32, 32, 32, 35, 38,
- 30, 31, 32, 32, 32, 32, 35, 38, 30, 31, 32, 32, 32, 32, 35, 38, 30, 31,
- 32, 32, 32, 32, 35, 38, 31, 31, 32, 33, 33, 33, 35, 38, 31, 32, 33, 33,
- 33, 33, 36, 39, 32, 33, 34, 34, 34, 34, 37, 40, 33, 34, 34, 35, 35, 35,
- 37, 40, 33, 34, 35, 36, 36, 36, 38, 41, 33, 34, 35, 36, 36, 36, 38, 41,
- 33, 34, 35, 36, 36, 36, 38, 41, 33, 34, 35, 36, 36, 36, 38, 41, 33, 34,
- 35, 36, 36, 36, 38, 41, 33, 34, 35, 36, 36, 36, 38, 41, 34, 35, 36, 36,
- 36, 36, 39, 42,
- /* Size 32x8 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 30, 30, 30, 30,
30, 30, 30, 31, 31, 32, 33, 33, 33, 33, 33, 33, 33, 34, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
@@ -6248,7 +6232,23 @@
34, 34, 34, 34, 34, 34, 34, 34, 34, 35, 35, 35, 35, 35, 35, 35, 36, 37,
37, 38, 38, 38, 38, 38, 38, 39, 35, 35, 36, 36, 36, 37, 37, 37, 37, 37,
37, 37, 37, 37, 38, 38, 38, 38, 38, 38, 38, 38, 39, 40, 40, 41, 41, 41,
- 41, 41, 41, 42 },
+ 41, 41, 41, 42,
+ /* Size 32x8 */
+ 32, 31, 31, 31, 31, 31, 33, 35, 31, 31, 31, 31, 31, 31, 33, 35, 31, 31,
+ 31, 31, 31, 31, 33, 36, 31, 31, 31, 31, 31, 31, 33, 36, 31, 31, 31, 31,
+ 31, 31, 34, 36, 31, 31, 31, 31, 31, 31, 34, 37, 31, 31, 31, 31, 31, 31,
+ 34, 37, 31, 31, 31, 31, 31, 31, 34, 37, 31, 31, 31, 31, 31, 31, 34, 37,
+ 31, 31, 31, 31, 31, 31, 34, 37, 31, 31, 31, 31, 31, 31, 34, 37, 31, 31,
+ 31, 31, 31, 31, 34, 37, 31, 31, 31, 32, 32, 32, 34, 37, 31, 31, 31, 32,
+ 32, 32, 34, 37, 30, 31, 31, 32, 32, 32, 34, 38, 30, 31, 32, 32, 32, 32,
+ 35, 38, 30, 31, 32, 32, 32, 32, 35, 38, 30, 31, 32, 32, 32, 32, 35, 38,
+ 30, 31, 32, 32, 32, 32, 35, 38, 30, 31, 32, 32, 32, 32, 35, 38, 30, 31,
+ 32, 32, 32, 32, 35, 38, 31, 31, 32, 33, 33, 33, 35, 38, 31, 32, 33, 33,
+ 33, 33, 36, 39, 32, 33, 34, 34, 34, 34, 37, 40, 33, 34, 34, 35, 35, 35,
+ 37, 40, 33, 34, 35, 36, 36, 36, 38, 41, 33, 34, 35, 36, 36, 36, 38, 41,
+ 33, 34, 35, 36, 36, 36, 38, 41, 33, 34, 35, 36, 36, 36, 38, 41, 33, 34,
+ 35, 36, 36, 36, 38, 41, 33, 34, 35, 36, 36, 36, 38, 41, 34, 35, 36, 36,
+ 36, 36, 39, 42 },
},
{
{ /* Luma */
@@ -6334,22 +6334,13 @@
32, 32, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
/* Size 4x8 */
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32,
- 32, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32, 32, 32,
- /* Size 8x4 */
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31,
32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32,
+ /* Size 8x4 */
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32,
+ 32, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32, 32, 32,
/* Size 8x16 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 31, 31, 31, 32,
- 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32,
- 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32,
- 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31,
- 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32,
- 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32,
- 32, 32,
- /* Size 16x8 */
- 32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 32, 32, 32, 32, 32,
@@ -6357,37 +6348,16 @@
32, 32, 32, 32, 32, 32, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32,
32, 32, 32, 32, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
32, 32,
- /* Size 16x32 */
+ /* Size 16x8 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32,
- 32, 32, 32, 32, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31,
- 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31,
- 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31,
- 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32,
- /* Size 32x16 */
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 31, 31, 31, 32,
+ 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32,
+ 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32,
+ 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31,
+ 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32,
+ 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32,
+ 32, 32,
+ /* Size 16x32 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
@@ -6417,35 +6387,49 @@
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 31, 31,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32,
+ /* Size 32x16 */
+ 32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32,
+ 32, 32, 32, 32, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31,
+ 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31,
+ 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31,
+ 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32,
/* Size 4x16 */
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 31, 32,
- 32, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32, 32, 32,
- 31, 32, 32, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32,
- 32, 32, 31, 32, 32, 32, 31, 32, 32, 32,
- /* Size 16x4 */
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ /* Size 16x4 */
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 31, 32,
+ 32, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32, 32, 32,
+ 31, 32, 32, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31, 32,
+ 32, 32, 31, 32, 32, 32, 31, 32, 32, 32,
/* Size 8x32 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32,
- 32, 32, 31, 31, 31, 32, 32, 32, 32, 32, 31, 31, 31, 32, 32, 32, 32, 32,
- 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31,
- 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32,
- 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32,
- 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32,
- 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31,
- 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32,
- 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32,
- 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32,
- 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31,
- 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32,
- 32, 32, 32, 32,
- /* Size 32x8 */
- 32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
@@ -6458,6 +6442,22 @@
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32,
+ /* Size 32x8 */
+ 32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32,
+ 32, 32, 31, 31, 31, 32, 32, 32, 32, 32, 31, 31, 31, 32, 32, 32, 32, 32,
+ 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31,
+ 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32,
+ 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32,
+ 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32,
+ 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31,
+ 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32,
+ 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32,
+ 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32,
+ 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31,
+ 32, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32, 32, 32, 31, 31, 32, 32,
32, 32, 32, 32 },
{ /* Chroma */
/* Size 4x4 */
@@ -6542,21 +6542,12 @@
32, 32, 30, 30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32,
/* Size 4x8 */
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 30, 31, 32, 32,
- /* Size 8x4 */
31, 31, 31, 31, 31, 31, 31, 30, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 32, 32, 31, 31, 31, 31, 31, 31, 32, 32,
+ /* Size 8x4 */
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 30, 31, 32, 32,
/* Size 8x16 */
- 32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 31, 31, 31, 31,
- 31, 32, 32, 32, 30, 31, 31, 31, 31, 32, 32, 32, 30, 31, 31, 31, 32, 32,
- 32, 32,
- /* Size 16x8 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 30, 30, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
@@ -6565,37 +6556,16 @@
31, 31, 32, 32, 32, 32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
32, 32, 32, 32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32,
32, 32,
- /* Size 16x32 */
+ /* Size 16x8 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32,
- 32, 32, 32, 32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32,
- 32, 32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32,
- 30, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 30, 31,
- 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 30, 30, 31, 31,
- 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 30, 30, 31, 31, 31, 31,
- 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 30, 30, 31, 31, 31, 31, 31, 31,
- 32, 32, 32, 32, 32, 32, 32, 32,
- /* Size 32x16 */
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 31, 31, 31, 31,
+ 31, 32, 32, 32, 30, 31, 31, 31, 31, 32, 32, 32, 30, 31, 31, 31, 32, 32,
+ 32, 32,
+ /* Size 16x32 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 30, 30, 30, 30, 30, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
@@ -6625,17 +6595,7 @@
31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32,
32, 32, 32, 32, 32, 32, 32, 32,
- /* Size 4x16 */
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 31, 31,
- 32, 32, 31, 31, 32, 32, 30, 31, 32, 32,
- /* Size 16x4 */
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 30, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 32, 32, 32, 32,
- /* Size 8x32 */
+ /* Size 32x16 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
@@ -6646,12 +6606,36 @@
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
- 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 31, 31, 31, 31, 31, 32,
- 32, 32, 31, 31, 31, 31, 31, 32, 32, 32, 31, 31, 31, 31, 31, 32, 32, 32,
- 30, 31, 31, 31, 31, 32, 32, 32, 30, 31, 31, 31, 31, 32, 32, 32, 30, 31,
- 31, 31, 32, 32, 32, 32, 30, 31, 31, 31, 32, 32, 32, 32, 30, 31, 31, 31,
- 32, 32, 32, 32,
- /* Size 32x8 */
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32,
+ 32, 32, 32, 32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32,
+ 32, 32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32,
+ 30, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 30, 31,
+ 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 30, 30, 31, 31,
+ 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 30, 30, 31, 31, 31, 31,
+ 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 30, 30, 31, 31, 31, 31, 31, 31,
+ 32, 32, 32, 32, 32, 32, 32, 32,
+ /* Size 4x16 */
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 30, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 32, 32, 32, 32,
+ /* Size 16x4 */
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 31, 31,
+ 32, 32, 31, 31, 32, 32, 30, 31, 32, 32,
+ /* Size 8x32 */
32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 30, 30, 30, 30, 30, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
@@ -6666,6 +6650,22 @@
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32,
32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32,
+ /* Size 32x8 */
+ 32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
+ 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 31, 31, 31, 31, 31, 32,
+ 32, 32, 31, 31, 31, 31, 31, 32, 32, 32, 31, 31, 31, 31, 31, 32, 32, 32,
+ 30, 31, 31, 31, 31, 32, 32, 32, 30, 31, 31, 31, 31, 32, 32, 32, 30, 31,
+ 31, 31, 32, 32, 32, 32, 30, 31, 31, 31, 32, 32, 32, 32, 30, 31, 31, 31,
32, 32, 32, 32 },
},
};
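
As an aside for readers of the tables above: each /* Size NxM */ block is a
row-major matrix of per-coefficient quantization-matrix weights. Below is a
minimal C sketch, assuming libaom's usual 5-bit fixed-point convention
(AOM_QM_BITS == 5, so a weight of 32 is unity), of how one such weight could
scale a dequantization step. The helper and constant names here are
hypothetical illustrations, not the library's API.

#include <stdint.h>

#define QM_BITS 5 /* assumed weight precision: a weight of 32 means 1.0x */

/* Hypothetical helper: apply one row-major matrix weight to a dequant
 * step, with round-to-nearest on the fixed-point shift. A weight of 32
 * leaves the step unchanged; 47 scales it by ~1.47x; 31 by ~0.97x. */
static int32_t apply_qm_weight(int32_t dq_step, uint8_t weight) {
  return (int32_t)(((int64_t)dq_step * weight + (1 << (QM_BITS - 1))) >>
                   QM_BITS);
}

int main(void) {
  /* Example: step 100 with weight 47 -> (100 * 47 + 16) >> 5 == 147. */
  return apply_qm_weight(100, 47) == 147 ? 0 : 1;
}

Under this convention, the 30-48 range seen in the weight tables above
corresponds to per-coefficient scale factors of roughly 0.94x to 1.5x.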
@@ -6748,20 +6748,12 @@
4, 4, 8, 9, 9, 9, 9, 9, 9, 10, 10, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 6, 6,
6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4,
/* Size 4x8 */
- 32, 24, 14, 11, 31, 24, 15, 12, 28, 18, 12, 11, 21, 14, 10, 9, 16, 12,
- 8, 8, 13, 11, 7, 7, 11, 10, 7, 6, 10, 9, 7, 5,
- /* Size 8x4 */
32, 31, 28, 21, 16, 13, 11, 10, 24, 24, 18, 14, 12, 11, 10, 9, 14, 15,
12, 10, 8, 7, 7, 7, 11, 12, 11, 9, 8, 7, 6, 5,
+ /* Size 8x4 */
+ 32, 24, 14, 11, 31, 24, 15, 12, 28, 18, 12, 11, 21, 14, 10, 9, 16, 12,
+ 8, 8, 13, 11, 7, 7, 11, 10, 7, 6, 10, 9, 7, 5,
/* Size 8x16 */
- 32, 32, 28, 19, 16, 12, 11, 10, 33, 31, 30, 21, 17, 13, 12, 11, 32, 30,
- 28, 20, 17, 13, 12, 12, 30, 28, 24, 19, 16, 13, 13, 12, 28, 27, 21, 17,
- 15, 12, 12, 11, 23, 24, 19, 14, 13, 11, 11, 11, 21, 22, 18, 13, 12, 10,
- 10, 10, 18, 19, 16, 12, 10, 9, 9, 9, 16, 18, 15, 11, 10, 8, 8, 8, 13,
- 15, 13, 10, 9, 7, 8, 8, 12, 14, 13, 10, 8, 7, 7, 7, 11, 13, 12, 10, 8,
- 7, 6, 6, 11, 12, 11, 10, 8, 7, 6, 6, 10, 11, 10, 9, 8, 7, 6, 6, 9, 10,
- 10, 9, 7, 6, 6, 5, 9, 10, 10, 9, 8, 7, 6, 5,
- /* Size 16x8 */
32, 33, 32, 30, 28, 23, 21, 18, 16, 13, 12, 11, 11, 10, 9, 9, 32, 31,
30, 28, 27, 24, 22, 19, 18, 15, 14, 13, 12, 11, 10, 10, 28, 30, 28, 24,
21, 19, 18, 16, 15, 13, 13, 12, 11, 10, 10, 10, 19, 21, 20, 19, 17, 14,
@@ -6769,34 +6761,15 @@
9, 8, 8, 8, 8, 7, 8, 12, 13, 13, 13, 12, 11, 10, 9, 8, 7, 7, 7, 7, 7, 6,
7, 11, 12, 12, 13, 12, 11, 10, 9, 8, 8, 7, 6, 6, 6, 6, 6, 10, 11, 12,
12, 11, 11, 10, 9, 8, 8, 7, 6, 6, 6, 5, 5,
+ /* Size 16x8 */
+ 32, 32, 28, 19, 16, 12, 11, 10, 33, 31, 30, 21, 17, 13, 12, 11, 32, 30,
+ 28, 20, 17, 13, 12, 12, 30, 28, 24, 19, 16, 13, 13, 12, 28, 27, 21, 17,
+ 15, 12, 12, 11, 23, 24, 19, 14, 13, 11, 11, 11, 21, 22, 18, 13, 12, 10,
+ 10, 10, 18, 19, 16, 12, 10, 9, 9, 9, 16, 18, 15, 11, 10, 8, 8, 8, 13,
+ 15, 13, 10, 9, 7, 8, 8, 12, 14, 13, 10, 8, 7, 7, 7, 11, 13, 12, 10, 8,
+ 7, 6, 6, 11, 12, 11, 10, 8, 7, 6, 6, 10, 11, 10, 9, 8, 7, 6, 6, 9, 10,
+ 10, 9, 7, 6, 6, 5, 9, 10, 10, 9, 8, 7, 6, 5,
/* Size 16x32 */
- 32, 33, 32, 30, 28, 23, 19, 17, 16, 13, 12, 11, 11, 11, 10, 10, 33, 32,
- 32, 30, 29, 24, 20, 18, 17, 14, 12, 12, 12, 11, 11, 11, 33, 32, 31, 31,
- 30, 25, 21, 19, 17, 14, 13, 12, 12, 11, 11, 11, 33, 32, 31, 30, 29, 25,
- 21, 19, 17, 14, 13, 13, 12, 12, 11, 11, 32, 32, 30, 29, 28, 24, 20, 19,
- 17, 14, 13, 13, 12, 12, 12, 11, 32, 31, 29, 28, 27, 24, 21, 19, 18, 15,
- 14, 13, 12, 12, 12, 11, 30, 30, 28, 26, 24, 21, 19, 18, 16, 14, 13, 13,
- 13, 12, 12, 11, 29, 30, 28, 25, 23, 20, 18, 17, 16, 13, 12, 12, 12, 12,
- 12, 11, 28, 30, 27, 24, 21, 19, 17, 16, 15, 13, 12, 12, 12, 12, 11, 11,
- 26, 28, 26, 23, 20, 18, 16, 15, 14, 12, 12, 12, 11, 11, 11, 11, 23, 25,
- 24, 21, 19, 16, 14, 14, 13, 11, 11, 11, 11, 11, 11, 11, 22, 24, 23, 21,
- 19, 16, 14, 13, 12, 11, 10, 10, 10, 10, 10, 10, 21, 23, 22, 20, 18, 15,
- 13, 13, 12, 11, 10, 10, 10, 10, 10, 10, 19, 21, 20, 19, 17, 14, 12, 12,
- 11, 10, 9, 10, 10, 9, 10, 9, 18, 19, 19, 18, 16, 14, 12, 11, 10, 9, 9,
- 9, 9, 9, 9, 9, 17, 18, 18, 17, 16, 13, 12, 11, 10, 9, 9, 9, 9, 9, 9, 9,
- 16, 17, 18, 16, 15, 13, 11, 10, 10, 9, 8, 8, 8, 8, 8, 8, 14, 16, 16, 15,
- 14, 12, 11, 10, 9, 8, 8, 8, 8, 8, 8, 8, 13, 14, 15, 14, 13, 11, 10, 9,
- 9, 8, 7, 8, 8, 8, 8, 8, 13, 14, 14, 14, 13, 11, 10, 9, 9, 8, 7, 7, 7, 7,
- 7, 7, 12, 14, 14, 13, 13, 11, 10, 9, 8, 8, 7, 7, 7, 7, 7, 7, 12, 13, 13,
- 13, 12, 11, 9, 9, 8, 7, 7, 7, 7, 7, 7, 7, 11, 12, 13, 13, 12, 10, 10, 9,
- 8, 7, 7, 7, 6, 6, 6, 7, 11, 12, 12, 12, 11, 10, 10, 9, 8, 7, 7, 6, 6, 6,
- 6, 6, 11, 12, 12, 12, 11, 10, 10, 8, 8, 7, 7, 6, 6, 6, 6, 6, 10, 11, 12,
- 12, 11, 10, 9, 8, 8, 7, 7, 6, 6, 6, 6, 6, 10, 11, 11, 11, 10, 10, 9, 9,
- 8, 7, 7, 6, 6, 6, 6, 6, 10, 11, 11, 11, 10, 10, 9, 9, 8, 7, 7, 6, 6, 5,
- 5, 5, 9, 10, 10, 11, 10, 9, 9, 8, 7, 7, 6, 6, 6, 5, 5, 5, 9, 10, 10, 10,
- 10, 9, 9, 8, 7, 7, 6, 6, 6, 5, 5, 5, 9, 9, 10, 10, 10, 9, 9, 8, 8, 7, 7,
- 6, 6, 5, 5, 5, 8, 9, 9, 10, 10, 9, 9, 8, 8, 7, 7, 6, 6, 5, 5, 5,
- /* Size 32x16 */
32, 33, 33, 33, 32, 32, 30, 29, 28, 26, 23, 22, 21, 19, 18, 17, 16, 14,
13, 13, 12, 12, 11, 11, 11, 10, 10, 10, 9, 9, 9, 8, 33, 32, 32, 32, 32,
31, 30, 30, 30, 28, 25, 24, 23, 21, 19, 18, 17, 16, 14, 14, 14, 13, 12,
@@ -6824,32 +6797,44 @@
8, 8, 8, 7, 7, 7, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 10, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 10, 10, 9, 9, 9, 8, 8, 8, 7, 7, 7, 7, 6, 6, 6, 6, 5,
5, 5, 5, 5,
+ /* Size 32x16 */
+ 32, 33, 32, 30, 28, 23, 19, 17, 16, 13, 12, 11, 11, 11, 10, 10, 33, 32,
+ 32, 30, 29, 24, 20, 18, 17, 14, 12, 12, 12, 11, 11, 11, 33, 32, 31, 31,
+ 30, 25, 21, 19, 17, 14, 13, 12, 12, 11, 11, 11, 33, 32, 31, 30, 29, 25,
+ 21, 19, 17, 14, 13, 13, 12, 12, 11, 11, 32, 32, 30, 29, 28, 24, 20, 19,
+ 17, 14, 13, 13, 12, 12, 12, 11, 32, 31, 29, 28, 27, 24, 21, 19, 18, 15,
+ 14, 13, 12, 12, 12, 11, 30, 30, 28, 26, 24, 21, 19, 18, 16, 14, 13, 13,
+ 13, 12, 12, 11, 29, 30, 28, 25, 23, 20, 18, 17, 16, 13, 12, 12, 12, 12,
+ 12, 11, 28, 30, 27, 24, 21, 19, 17, 16, 15, 13, 12, 12, 12, 12, 11, 11,
+ 26, 28, 26, 23, 20, 18, 16, 15, 14, 12, 12, 12, 11, 11, 11, 11, 23, 25,
+ 24, 21, 19, 16, 14, 14, 13, 11, 11, 11, 11, 11, 11, 11, 22, 24, 23, 21,
+ 19, 16, 14, 13, 12, 11, 10, 10, 10, 10, 10, 10, 21, 23, 22, 20, 18, 15,
+ 13, 13, 12, 11, 10, 10, 10, 10, 10, 10, 19, 21, 20, 19, 17, 14, 12, 12,
+ 11, 10, 9, 10, 10, 9, 10, 9, 18, 19, 19, 18, 16, 14, 12, 11, 10, 9, 9,
+ 9, 9, 9, 9, 9, 17, 18, 18, 17, 16, 13, 12, 11, 10, 9, 9, 9, 9, 9, 9, 9,
+ 16, 17, 18, 16, 15, 13, 11, 10, 10, 9, 8, 8, 8, 8, 8, 8, 14, 16, 16, 15,
+ 14, 12, 11, 10, 9, 8, 8, 8, 8, 8, 8, 8, 13, 14, 15, 14, 13, 11, 10, 9,
+ 9, 8, 7, 8, 8, 8, 8, 8, 13, 14, 14, 14, 13, 11, 10, 9, 9, 8, 7, 7, 7, 7,
+ 7, 7, 12, 14, 14, 13, 13, 11, 10, 9, 8, 8, 7, 7, 7, 7, 7, 7, 12, 13, 13,
+ 13, 12, 11, 9, 9, 8, 7, 7, 7, 7, 7, 7, 7, 11, 12, 13, 13, 12, 10, 10, 9,
+ 8, 7, 7, 7, 6, 6, 6, 7, 11, 12, 12, 12, 11, 10, 10, 9, 8, 7, 7, 6, 6, 6,
+ 6, 6, 11, 12, 12, 12, 11, 10, 10, 8, 8, 7, 7, 6, 6, 6, 6, 6, 10, 11, 12,
+ 12, 11, 10, 9, 8, 8, 7, 7, 6, 6, 6, 6, 6, 10, 11, 11, 11, 10, 10, 9, 9,
+ 8, 7, 7, 6, 6, 6, 6, 6, 10, 11, 11, 11, 10, 10, 9, 9, 8, 7, 7, 6, 6, 5,
+ 5, 5, 9, 10, 10, 11, 10, 9, 9, 8, 7, 7, 6, 6, 6, 5, 5, 5, 9, 10, 10, 10,
+ 10, 9, 9, 8, 7, 7, 6, 6, 6, 5, 5, 5, 9, 9, 10, 10, 10, 9, 9, 8, 8, 7, 7,
+ 6, 6, 5, 5, 5, 8, 9, 9, 10, 10, 9, 9, 8, 8, 7, 7, 6, 6, 5, 5, 5,
/* Size 4x16 */
- 33, 23, 13, 11, 32, 25, 14, 11, 32, 24, 14, 12, 30, 21, 14, 12, 30, 19,
- 13, 12, 25, 16, 11, 11, 23, 15, 11, 10, 19, 14, 9, 9, 17, 13, 9, 8, 14,
- 11, 8, 8, 14, 11, 8, 7, 12, 10, 7, 6, 12, 10, 7, 6, 11, 10, 7, 6, 10, 9,
- 7, 5, 9, 9, 7, 5,
- /* Size 16x4 */
33, 32, 32, 30, 30, 25, 23, 19, 17, 14, 14, 12, 12, 11, 10, 9, 23, 25,
24, 21, 19, 16, 15, 14, 13, 11, 11, 10, 10, 10, 9, 9, 13, 14, 14, 14,
13, 11, 11, 9, 9, 8, 8, 7, 7, 7, 7, 7, 11, 11, 12, 12, 12, 11, 10, 9, 8,
8, 7, 6, 6, 6, 5, 5,
+ /* Size 16x4 */
+ 33, 23, 13, 11, 32, 25, 14, 11, 32, 24, 14, 12, 30, 21, 14, 12, 30, 19,
+ 13, 12, 25, 16, 11, 11, 23, 15, 11, 10, 19, 14, 9, 9, 17, 13, 9, 8, 14,
+ 11, 8, 8, 14, 11, 8, 7, 12, 10, 7, 6, 12, 10, 7, 6, 11, 10, 7, 6, 10, 9,
+ 7, 5, 9, 9, 7, 5,
/* Size 8x32 */
- 32, 32, 28, 19, 16, 12, 11, 10, 33, 32, 29, 20, 17, 12, 12, 11, 33, 31,
- 30, 21, 17, 13, 12, 11, 33, 31, 29, 21, 17, 13, 12, 11, 32, 30, 28, 20,
- 17, 13, 12, 12, 32, 29, 27, 21, 18, 14, 12, 12, 30, 28, 24, 19, 16, 13,
- 13, 12, 29, 28, 23, 18, 16, 12, 12, 12, 28, 27, 21, 17, 15, 12, 12, 11,
- 26, 26, 20, 16, 14, 12, 11, 11, 23, 24, 19, 14, 13, 11, 11, 11, 22, 23,
- 19, 14, 12, 10, 10, 10, 21, 22, 18, 13, 12, 10, 10, 10, 19, 20, 17, 12,
- 11, 9, 10, 10, 18, 19, 16, 12, 10, 9, 9, 9, 17, 18, 16, 12, 10, 9, 9, 9,
- 16, 18, 15, 11, 10, 8, 8, 8, 14, 16, 14, 11, 9, 8, 8, 8, 13, 15, 13, 10,
- 9, 7, 8, 8, 13, 14, 13, 10, 9, 7, 7, 7, 12, 14, 13, 10, 8, 7, 7, 7, 12,
- 13, 12, 9, 8, 7, 7, 7, 11, 13, 12, 10, 8, 7, 6, 6, 11, 12, 11, 10, 8, 7,
- 6, 6, 11, 12, 11, 10, 8, 7, 6, 6, 10, 12, 11, 9, 8, 7, 6, 6, 10, 11, 10,
- 9, 8, 7, 6, 6, 10, 11, 10, 9, 8, 7, 6, 5, 9, 10, 10, 9, 7, 6, 6, 5, 9,
- 10, 10, 9, 7, 6, 6, 5, 9, 10, 10, 9, 8, 7, 6, 5, 8, 9, 10, 9, 8, 7, 6,
- 5,
- /* Size 32x8 */
32, 33, 33, 33, 32, 32, 30, 29, 28, 26, 23, 22, 21, 19, 18, 17, 16, 14,
13, 13, 12, 12, 11, 11, 11, 10, 10, 10, 9, 9, 9, 8, 32, 32, 31, 31, 30,
29, 28, 28, 27, 26, 24, 23, 22, 20, 19, 18, 18, 16, 15, 14, 14, 13, 13,
@@ -6863,7 +6848,22 @@
11, 12, 12, 12, 12, 12, 13, 12, 12, 11, 11, 10, 10, 10, 9, 9, 8, 8, 8,
7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 10, 11, 11, 11, 12, 12, 12, 12,
11, 11, 11, 10, 10, 10, 9, 9, 8, 8, 8, 7, 7, 7, 6, 6, 6, 6, 6, 5, 5, 5,
- 5, 5 },
+ 5, 5,
+ /* Size 32x8 */
+ 32, 32, 28, 19, 16, 12, 11, 10, 33, 32, 29, 20, 17, 12, 12, 11, 33, 31,
+ 30, 21, 17, 13, 12, 11, 33, 31, 29, 21, 17, 13, 12, 11, 32, 30, 28, 20,
+ 17, 13, 12, 12, 32, 29, 27, 21, 18, 14, 12, 12, 30, 28, 24, 19, 16, 13,
+ 13, 12, 29, 28, 23, 18, 16, 12, 12, 12, 28, 27, 21, 17, 15, 12, 12, 11,
+ 26, 26, 20, 16, 14, 12, 11, 11, 23, 24, 19, 14, 13, 11, 11, 11, 22, 23,
+ 19, 14, 12, 10, 10, 10, 21, 22, 18, 13, 12, 10, 10, 10, 19, 20, 17, 12,
+ 11, 9, 10, 10, 18, 19, 16, 12, 10, 9, 9, 9, 17, 18, 16, 12, 10, 9, 9, 9,
+ 16, 18, 15, 11, 10, 8, 8, 8, 14, 16, 14, 11, 9, 8, 8, 8, 13, 15, 13, 10,
+ 9, 7, 8, 8, 13, 14, 13, 10, 9, 7, 7, 7, 12, 14, 13, 10, 8, 7, 7, 7, 12,
+ 13, 12, 9, 8, 7, 7, 7, 11, 13, 12, 10, 8, 7, 6, 6, 11, 12, 11, 10, 8, 7,
+ 6, 6, 11, 12, 11, 10, 8, 7, 6, 6, 10, 12, 11, 9, 8, 7, 6, 6, 10, 11, 10,
+ 9, 8, 7, 6, 6, 10, 11, 10, 9, 8, 7, 6, 5, 9, 10, 10, 9, 7, 6, 6, 5, 9,
+ 10, 10, 9, 7, 6, 6, 5, 9, 10, 10, 9, 8, 7, 6, 5, 8, 9, 10, 9, 8, 7, 6,
+ 5 },
{ /* Chroma */
/* Size 4x4 */
29, 22, 18, 16, 22, 17, 15, 14, 18, 15, 11, 11, 16, 14, 11, 9,
@@ -6947,21 +6947,12 @@
15, 15, 15, 15, 16, 16, 15, 15, 14, 14, 13, 13, 12, 12, 12, 12, 11, 11,
11, 11, 10, 10, 10, 10, 9, 9, 9, 9, 9,
/* Size 4x8 */
- 33, 22, 17, 16, 26, 23, 19, 17, 22, 18, 16, 16, 21, 17, 14, 14, 19, 16,
- 12, 12, 17, 15, 11, 11, 16, 15, 11, 10, 15, 14, 12, 10,
- /* Size 8x4 */
33, 26, 22, 21, 19, 17, 16, 15, 22, 23, 18, 17, 16, 15, 15, 14, 17, 19,
16, 14, 12, 11, 11, 12, 16, 17, 16, 14, 12, 11, 10, 10,
+ /* Size 8x4 */
+ 33, 22, 17, 16, 26, 23, 19, 17, 22, 18, 16, 16, 21, 17, 14, 14, 19, 16,
+ 12, 12, 17, 15, 11, 11, 16, 15, 11, 10, 15, 14, 12, 10,
/* Size 8x16 */
- 32, 28, 21, 20, 18, 16, 15, 14, 34, 26, 22, 21, 20, 17, 16, 16, 31, 24,
- 22, 22, 20, 17, 17, 16, 24, 22, 20, 20, 19, 17, 17, 17, 21, 21, 19, 19,
- 18, 17, 17, 17, 21, 22, 19, 17, 16, 15, 16, 16, 20, 22, 19, 16, 15, 14,
- 14, 15, 19, 21, 19, 15, 14, 13, 13, 14, 18, 20, 18, 15, 13, 12, 13, 13,
- 16, 19, 17, 14, 12, 11, 12, 12, 16, 18, 17, 14, 12, 11, 11, 12, 15, 17,
- 16, 14, 12, 11, 10, 11, 15, 17, 16, 14, 12, 11, 10, 10, 14, 16, 16, 14,
- 12, 11, 10, 10, 14, 15, 16, 14, 12, 11, 10, 10, 13, 15, 15, 14, 12, 11,
- 10, 9,
- /* Size 16x8 */
32, 34, 31, 24, 21, 21, 20, 19, 18, 16, 16, 15, 15, 14, 14, 13, 28, 26,
24, 22, 21, 22, 22, 21, 20, 19, 18, 17, 17, 16, 15, 15, 21, 22, 22, 20,
19, 19, 19, 19, 18, 17, 17, 16, 16, 16, 16, 15, 20, 21, 22, 20, 19, 17,
@@ -6970,37 +6961,16 @@
11, 11, 11, 11, 11, 11, 15, 16, 17, 17, 17, 16, 14, 13, 13, 12, 11, 10,
10, 10, 10, 10, 14, 16, 16, 17, 17, 16, 15, 14, 13, 12, 12, 11, 10, 10,
10, 9,
+ /* Size 16x8 */
+ 32, 28, 21, 20, 18, 16, 15, 14, 34, 26, 22, 21, 20, 17, 16, 16, 31, 24,
+ 22, 22, 20, 17, 17, 16, 24, 22, 20, 20, 19, 17, 17, 17, 21, 21, 19, 19,
+ 18, 17, 17, 17, 21, 22, 19, 17, 16, 15, 16, 16, 20, 22, 19, 16, 15, 14,
+ 14, 15, 19, 21, 19, 15, 14, 13, 13, 14, 18, 20, 18, 15, 13, 12, 13, 13,
+ 16, 19, 17, 14, 12, 11, 12, 12, 16, 18, 17, 14, 12, 11, 11, 12, 15, 17,
+ 16, 14, 12, 11, 10, 11, 15, 17, 16, 14, 12, 11, 10, 10, 14, 16, 16, 14,
+ 12, 11, 10, 10, 14, 15, 16, 14, 12, 11, 10, 10, 13, 15, 15, 14, 12, 11,
+ 10, 9,
/* Size 16x32 */
- 32, 33, 28, 24, 21, 21, 20, 19, 18, 16, 16, 15, 15, 15, 14, 14, 33, 33,
- 27, 24, 22, 22, 20, 20, 19, 17, 16, 16, 16, 16, 15, 15, 34, 32, 26, 24,
- 22, 23, 21, 20, 20, 18, 17, 17, 16, 16, 16, 15, 32, 30, 25, 23, 22, 23,
- 21, 21, 20, 18, 17, 17, 17, 16, 16, 16, 31, 28, 24, 23, 22, 22, 22, 21,
- 20, 18, 17, 17, 17, 17, 16, 16, 28, 26, 22, 22, 22, 23, 22, 21, 20, 19,
- 18, 18, 17, 17, 17, 16, 24, 24, 22, 21, 20, 21, 20, 20, 19, 18, 17, 18,
- 17, 17, 17, 16, 23, 23, 22, 21, 20, 20, 20, 19, 19, 17, 17, 17, 17, 17,
- 17, 17, 21, 22, 21, 20, 19, 19, 19, 19, 18, 17, 17, 16, 17, 16, 17, 17,
- 21, 22, 22, 20, 19, 18, 18, 17, 17, 16, 16, 16, 16, 16, 16, 16, 21, 23,
- 22, 21, 19, 18, 17, 17, 16, 15, 15, 15, 16, 16, 16, 16, 21, 22, 22, 21,
- 19, 17, 17, 16, 16, 15, 14, 15, 15, 15, 15, 15, 20, 22, 22, 20, 19, 17,
- 16, 16, 15, 14, 14, 14, 14, 15, 15, 15, 20, 21, 22, 20, 19, 17, 16, 15,
- 14, 14, 13, 14, 14, 14, 14, 14, 19, 20, 21, 20, 19, 17, 15, 14, 14, 13,
- 13, 13, 13, 14, 14, 14, 19, 20, 21, 20, 18, 16, 15, 14, 14, 13, 12, 13,
- 13, 13, 13, 13, 18, 20, 20, 19, 18, 16, 15, 14, 13, 12, 12, 12, 13, 13,
- 13, 13, 17, 19, 20, 19, 18, 16, 14, 14, 13, 12, 12, 12, 12, 12, 13, 13,
- 16, 18, 19, 18, 17, 15, 14, 13, 12, 12, 11, 12, 12, 12, 12, 13, 16, 18,
- 19, 18, 17, 15, 14, 13, 12, 12, 11, 11, 12, 12, 12, 12, 16, 17, 18, 18,
- 17, 15, 14, 13, 12, 11, 11, 11, 11, 11, 12, 12, 15, 17, 18, 17, 16, 15,
- 13, 13, 12, 11, 11, 11, 11, 11, 11, 11, 15, 17, 17, 17, 16, 14, 14, 13,
- 12, 11, 11, 11, 10, 11, 11, 11, 15, 17, 17, 17, 16, 15, 14, 13, 12, 12,
- 11, 10, 10, 10, 11, 11, 15, 16, 17, 17, 16, 15, 14, 13, 12, 12, 11, 11,
- 10, 10, 10, 11, 14, 16, 16, 17, 15, 15, 14, 13, 12, 11, 11, 10, 10, 10,
- 10, 10, 14, 16, 16, 17, 16, 15, 14, 13, 12, 12, 11, 10, 10, 10, 10, 10,
- 14, 16, 16, 16, 16, 15, 14, 13, 12, 12, 11, 10, 10, 10, 10, 10, 14, 15,
- 15, 16, 16, 15, 14, 13, 12, 12, 11, 11, 10, 10, 10, 10, 14, 15, 15, 16,
- 16, 14, 14, 13, 12, 12, 11, 11, 10, 10, 9, 9, 13, 15, 15, 16, 15, 14,
- 14, 13, 12, 12, 11, 11, 10, 10, 9, 9, 13, 15, 15, 15, 15, 14, 14, 13,
- 13, 11, 11, 10, 10, 9, 9, 9,
- /* Size 32x16 */
32, 33, 34, 32, 31, 28, 24, 23, 21, 21, 21, 21, 20, 20, 19, 19, 18, 17,
16, 16, 16, 15, 15, 15, 15, 14, 14, 14, 14, 14, 13, 13, 33, 33, 32, 30,
28, 26, 24, 23, 22, 22, 23, 22, 22, 21, 20, 20, 20, 19, 18, 18, 17, 17,
@@ -7030,33 +7000,47 @@
12, 11, 11, 11, 10, 10, 10, 10, 10, 9, 9, 9, 14, 15, 15, 16, 16, 16, 16,
17, 17, 16, 16, 15, 15, 14, 14, 13, 13, 13, 13, 12, 12, 11, 11, 11, 11,
10, 10, 10, 10, 9, 9, 9,
+ /* Size 32x16 */
+ 32, 33, 28, 24, 21, 21, 20, 19, 18, 16, 16, 15, 15, 15, 14, 14, 33, 33,
+ 27, 24, 22, 22, 20, 20, 19, 17, 16, 16, 16, 16, 15, 15, 34, 32, 26, 24,
+ 22, 23, 21, 20, 20, 18, 17, 17, 16, 16, 16, 15, 32, 30, 25, 23, 22, 23,
+ 21, 21, 20, 18, 17, 17, 17, 16, 16, 16, 31, 28, 24, 23, 22, 22, 22, 21,
+ 20, 18, 17, 17, 17, 17, 16, 16, 28, 26, 22, 22, 22, 23, 22, 21, 20, 19,
+ 18, 18, 17, 17, 17, 16, 24, 24, 22, 21, 20, 21, 20, 20, 19, 18, 17, 18,
+ 17, 17, 17, 16, 23, 23, 22, 21, 20, 20, 20, 19, 19, 17, 17, 17, 17, 17,
+ 17, 17, 21, 22, 21, 20, 19, 19, 19, 19, 18, 17, 17, 16, 17, 16, 17, 17,
+ 21, 22, 22, 20, 19, 18, 18, 17, 17, 16, 16, 16, 16, 16, 16, 16, 21, 23,
+ 22, 21, 19, 18, 17, 17, 16, 15, 15, 15, 16, 16, 16, 16, 21, 22, 22, 21,
+ 19, 17, 17, 16, 16, 15, 14, 15, 15, 15, 15, 15, 20, 22, 22, 20, 19, 17,
+ 16, 16, 15, 14, 14, 14, 14, 15, 15, 15, 20, 21, 22, 20, 19, 17, 16, 15,
+ 14, 14, 13, 14, 14, 14, 14, 14, 19, 20, 21, 20, 19, 17, 15, 14, 14, 13,
+ 13, 13, 13, 14, 14, 14, 19, 20, 21, 20, 18, 16, 15, 14, 14, 13, 12, 13,
+ 13, 13, 13, 13, 18, 20, 20, 19, 18, 16, 15, 14, 13, 12, 12, 12, 13, 13,
+ 13, 13, 17, 19, 20, 19, 18, 16, 14, 14, 13, 12, 12, 12, 12, 12, 13, 13,
+ 16, 18, 19, 18, 17, 15, 14, 13, 12, 12, 11, 12, 12, 12, 12, 13, 16, 18,
+ 19, 18, 17, 15, 14, 13, 12, 12, 11, 11, 12, 12, 12, 12, 16, 17, 18, 18,
+ 17, 15, 14, 13, 12, 11, 11, 11, 11, 11, 12, 12, 15, 17, 18, 17, 16, 15,
+ 13, 13, 12, 11, 11, 11, 11, 11, 11, 11, 15, 17, 17, 17, 16, 14, 14, 13,
+ 12, 11, 11, 11, 10, 11, 11, 11, 15, 17, 17, 17, 16, 15, 14, 13, 12, 12,
+ 11, 10, 10, 10, 11, 11, 15, 16, 17, 17, 16, 15, 14, 13, 12, 12, 11, 11,
+ 10, 10, 10, 11, 14, 16, 16, 17, 15, 15, 14, 13, 12, 11, 11, 10, 10, 10,
+ 10, 10, 14, 16, 16, 17, 16, 15, 14, 13, 12, 12, 11, 10, 10, 10, 10, 10,
+ 14, 16, 16, 16, 16, 15, 14, 13, 12, 12, 11, 10, 10, 10, 10, 10, 14, 15,
+ 15, 16, 16, 15, 14, 13, 12, 12, 11, 11, 10, 10, 10, 10, 14, 15, 15, 16,
+ 16, 14, 14, 13, 12, 12, 11, 11, 10, 10, 9, 9, 13, 15, 15, 16, 15, 14,
+ 14, 13, 12, 12, 11, 11, 10, 10, 9, 9, 13, 15, 15, 15, 15, 14, 14, 13,
+ 13, 11, 11, 10, 10, 9, 9, 9,
/* Size 4x16 */
- 33, 21, 16, 15, 32, 23, 18, 16, 28, 22, 18, 17, 24, 21, 18, 17, 22, 19,
- 17, 16, 23, 18, 15, 16, 22, 17, 14, 15, 20, 17, 13, 14, 20, 16, 12, 13,
- 18, 15, 12, 12, 17, 15, 11, 11, 17, 14, 11, 11, 16, 15, 12, 10, 16, 15,
- 12, 10, 15, 15, 12, 10, 15, 14, 12, 10,
- /* Size 16x4 */
33, 32, 28, 24, 22, 23, 22, 20, 20, 18, 17, 17, 16, 16, 15, 15, 21, 23,
22, 21, 19, 18, 17, 17, 16, 15, 15, 14, 15, 15, 15, 14, 16, 18, 18, 18,
17, 15, 14, 13, 12, 12, 11, 11, 12, 12, 12, 12, 15, 16, 17, 17, 16, 16,
15, 14, 13, 12, 11, 11, 10, 10, 10, 10,
+ /* Size 16x4 */
+ 33, 21, 16, 15, 32, 23, 18, 16, 28, 22, 18, 17, 24, 21, 18, 17, 22, 19,
+ 17, 16, 23, 18, 15, 16, 22, 17, 14, 15, 20, 17, 13, 14, 20, 16, 12, 13,
+ 18, 15, 12, 12, 17, 15, 11, 11, 17, 14, 11, 11, 16, 15, 12, 10, 16, 15,
+ 12, 10, 15, 15, 12, 10, 15, 14, 12, 10,
/* Size 8x32 */
- 32, 28, 21, 20, 18, 16, 15, 14, 33, 27, 22, 20, 19, 16, 16, 15, 34, 26,
- 22, 21, 20, 17, 16, 16, 32, 25, 22, 21, 20, 17, 17, 16, 31, 24, 22, 22,
- 20, 17, 17, 16, 28, 22, 22, 22, 20, 18, 17, 17, 24, 22, 20, 20, 19, 17,
- 17, 17, 23, 22, 20, 20, 19, 17, 17, 17, 21, 21, 19, 19, 18, 17, 17, 17,
- 21, 22, 19, 18, 17, 16, 16, 16, 21, 22, 19, 17, 16, 15, 16, 16, 21, 22,
- 19, 17, 16, 14, 15, 15, 20, 22, 19, 16, 15, 14, 14, 15, 20, 22, 19, 16,
- 14, 13, 14, 14, 19, 21, 19, 15, 14, 13, 13, 14, 19, 21, 18, 15, 14, 12,
- 13, 13, 18, 20, 18, 15, 13, 12, 13, 13, 17, 20, 18, 14, 13, 12, 12, 13,
- 16, 19, 17, 14, 12, 11, 12, 12, 16, 19, 17, 14, 12, 11, 12, 12, 16, 18,
- 17, 14, 12, 11, 11, 12, 15, 18, 16, 13, 12, 11, 11, 11, 15, 17, 16, 14,
- 12, 11, 10, 11, 15, 17, 16, 14, 12, 11, 10, 11, 15, 17, 16, 14, 12, 11,
- 10, 10, 14, 16, 15, 14, 12, 11, 10, 10, 14, 16, 16, 14, 12, 11, 10, 10,
- 14, 16, 16, 14, 12, 11, 10, 10, 14, 15, 16, 14, 12, 11, 10, 10, 14, 15,
- 16, 14, 12, 11, 10, 9, 13, 15, 15, 14, 12, 11, 10, 9, 13, 15, 15, 14,
- 13, 11, 10, 9,
- /* Size 32x8 */
32, 33, 34, 32, 31, 28, 24, 23, 21, 21, 21, 21, 20, 20, 19, 19, 18, 17,
16, 16, 16, 15, 15, 15, 15, 14, 14, 14, 14, 14, 13, 13, 28, 27, 26, 25,
24, 22, 22, 22, 21, 22, 22, 22, 22, 22, 21, 21, 20, 20, 19, 19, 18, 18,
@@ -7071,7 +7055,23 @@
17, 17, 17, 16, 16, 15, 14, 14, 13, 13, 13, 12, 12, 12, 11, 11, 10, 10,
10, 10, 10, 10, 10, 10, 10, 10, 14, 15, 16, 16, 16, 17, 17, 17, 17, 16,
16, 15, 15, 14, 14, 13, 13, 13, 12, 12, 12, 11, 11, 11, 10, 10, 10, 10,
- 10, 9, 9, 9 },
+ 10, 9, 9, 9,
+ /* Size 32x8 */
+ 32, 28, 21, 20, 18, 16, 15, 14, 33, 27, 22, 20, 19, 16, 16, 15, 34, 26,
+ 22, 21, 20, 17, 16, 16, 32, 25, 22, 21, 20, 17, 17, 16, 31, 24, 22, 22,
+ 20, 17, 17, 16, 28, 22, 22, 22, 20, 18, 17, 17, 24, 22, 20, 20, 19, 17,
+ 17, 17, 23, 22, 20, 20, 19, 17, 17, 17, 21, 21, 19, 19, 18, 17, 17, 17,
+ 21, 22, 19, 18, 17, 16, 16, 16, 21, 22, 19, 17, 16, 15, 16, 16, 21, 22,
+ 19, 17, 16, 14, 15, 15, 20, 22, 19, 16, 15, 14, 14, 15, 20, 22, 19, 16,
+ 14, 13, 14, 14, 19, 21, 19, 15, 14, 13, 13, 14, 19, 21, 18, 15, 14, 12,
+ 13, 13, 18, 20, 18, 15, 13, 12, 13, 13, 17, 20, 18, 14, 13, 12, 12, 13,
+ 16, 19, 17, 14, 12, 11, 12, 12, 16, 19, 17, 14, 12, 11, 12, 12, 16, 18,
+ 17, 14, 12, 11, 11, 12, 15, 18, 16, 13, 12, 11, 11, 11, 15, 17, 16, 14,
+ 12, 11, 10, 11, 15, 17, 16, 14, 12, 11, 10, 11, 15, 17, 16, 14, 12, 11,
+ 10, 10, 14, 16, 15, 14, 12, 11, 10, 10, 14, 16, 16, 14, 12, 11, 10, 10,
+ 14, 16, 16, 14, 12, 11, 10, 10, 14, 15, 16, 14, 12, 11, 10, 10, 14, 15,
+ 16, 14, 12, 11, 10, 9, 13, 15, 15, 14, 12, 11, 10, 9, 13, 15, 15, 14,
+ 13, 11, 10, 9 },
},
{
{ /* Luma */
@@ -7152,20 +7152,12 @@
5, 5, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 9, 9, 9, 8, 8, 8, 7, 7,
7, 7, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5,
/* Size 4x8 */
- 32, 24, 15, 12, 31, 24, 16, 12, 28, 18, 13, 12, 22, 15, 11, 10, 17, 13,
- 9, 8, 14, 11, 8, 7, 12, 11, 8, 6, 10, 10, 8, 6,
- /* Size 8x4 */
32, 31, 28, 22, 17, 14, 12, 10, 24, 24, 18, 15, 13, 11, 11, 10, 15, 16,
13, 11, 9, 8, 8, 8, 12, 12, 12, 10, 8, 7, 6, 6,
+ /* Size 8x4 */
+ 32, 24, 15, 12, 31, 24, 16, 12, 28, 18, 13, 12, 22, 15, 11, 10, 17, 13,
+ 9, 8, 14, 11, 8, 7, 12, 11, 8, 6, 10, 10, 8, 6,
/* Size 8x16 */
- 32, 32, 28, 22, 16, 13, 11, 11, 33, 32, 29, 23, 17, 14, 12, 11, 32, 30,
- 28, 23, 17, 14, 13, 12, 32, 29, 26, 22, 17, 14, 13, 12, 28, 28, 21, 18,
- 15, 13, 12, 12, 26, 26, 20, 17, 14, 12, 11, 11, 22, 23, 18, 15, 12, 11,
- 10, 10, 19, 20, 17, 14, 11, 10, 9, 9, 17, 18, 16, 13, 10, 9, 9, 9, 14,
- 16, 14, 12, 9, 8, 8, 8, 13, 15, 13, 11, 9, 8, 7, 7, 12, 13, 12, 10, 8,
- 7, 7, 7, 11, 12, 12, 10, 8, 7, 7, 6, 10, 12, 11, 9, 8, 7, 6, 6, 10, 11,
- 11, 9, 8, 7, 6, 6, 9, 10, 10, 9, 8, 7, 6, 5,
- /* Size 16x8 */
32, 33, 32, 32, 28, 26, 22, 19, 17, 14, 13, 12, 11, 10, 10, 9, 32, 32,
30, 29, 28, 26, 23, 20, 18, 16, 15, 13, 12, 12, 11, 10, 28, 29, 28, 26,
21, 20, 18, 17, 16, 14, 13, 12, 12, 11, 11, 10, 22, 23, 23, 22, 18, 17,
@@ -7173,35 +7165,15 @@
9, 9, 8, 8, 8, 8, 8, 13, 14, 14, 14, 13, 12, 11, 10, 9, 8, 8, 7, 7, 7,
7, 7, 11, 12, 13, 13, 12, 11, 10, 9, 9, 8, 7, 7, 7, 6, 6, 6, 11, 11, 12,
12, 12, 11, 10, 9, 9, 8, 7, 7, 6, 6, 6, 5,
+ /* Size 16x8 */
+ 32, 32, 28, 22, 16, 13, 11, 11, 33, 32, 29, 23, 17, 14, 12, 11, 32, 30,
+ 28, 23, 17, 14, 13, 12, 32, 29, 26, 22, 17, 14, 13, 12, 28, 28, 21, 18,
+ 15, 13, 12, 12, 26, 26, 20, 17, 14, 12, 11, 11, 22, 23, 18, 15, 12, 11,
+ 10, 10, 19, 20, 17, 14, 11, 10, 9, 9, 17, 18, 16, 13, 10, 9, 9, 9, 14,
+ 16, 14, 12, 9, 8, 8, 8, 13, 15, 13, 11, 9, 8, 7, 7, 12, 13, 12, 10, 8,
+ 7, 7, 7, 11, 12, 12, 10, 8, 7, 7, 6, 10, 12, 11, 9, 8, 7, 6, 6, 10, 11,
+ 11, 9, 8, 7, 6, 6, 9, 10, 10, 9, 8, 7, 6, 5,
/* Size 16x32 */
- 32, 33, 32, 32, 28, 23, 22, 19, 16, 14, 13, 12, 11, 11, 11, 10, 33, 32,
- 32, 31, 29, 24, 23, 20, 17, 15, 14, 12, 12, 12, 11, 11, 33, 32, 32, 31,
- 29, 25, 23, 21, 17, 15, 14, 13, 12, 12, 11, 11, 33, 32, 31, 31, 29, 25,
- 23, 21, 17, 16, 14, 13, 12, 12, 12, 11, 32, 32, 30, 30, 28, 24, 23, 20,
- 17, 16, 14, 13, 13, 12, 12, 11, 32, 31, 29, 28, 27, 24, 23, 21, 18, 16,
- 15, 13, 13, 12, 12, 12, 32, 31, 29, 28, 26, 23, 22, 20, 17, 16, 14, 13,
- 13, 13, 12, 12, 30, 30, 28, 27, 24, 21, 20, 19, 16, 15, 14, 13, 12, 13,
- 12, 12, 28, 30, 28, 26, 21, 19, 18, 17, 15, 14, 13, 12, 12, 12, 12, 12,
- 27, 28, 26, 25, 21, 18, 18, 16, 14, 13, 13, 12, 12, 12, 11, 11, 26, 28,
- 26, 24, 20, 18, 17, 16, 14, 13, 12, 11, 11, 11, 11, 11, 23, 25, 24, 23,
- 19, 16, 16, 14, 13, 12, 11, 11, 11, 11, 11, 10, 22, 23, 23, 22, 18, 16,
- 15, 14, 12, 11, 11, 10, 10, 10, 10, 10, 21, 22, 22, 21, 18, 15, 14, 13,
- 12, 11, 11, 10, 10, 10, 10, 10, 19, 21, 20, 20, 17, 14, 14, 12, 11, 10,
- 10, 9, 9, 10, 9, 10, 18, 19, 19, 19, 16, 14, 13, 12, 10, 10, 9, 9, 9, 9,
- 9, 9, 17, 18, 18, 18, 16, 13, 13, 12, 10, 10, 9, 9, 9, 9, 9, 9, 16, 17,
- 17, 17, 15, 13, 12, 11, 10, 9, 9, 8, 8, 8, 8, 8, 14, 16, 16, 16, 14, 12,
- 12, 11, 9, 9, 8, 8, 8, 8, 8, 8, 13, 15, 15, 15, 13, 12, 11, 10, 9, 8, 8,
- 8, 8, 8, 8, 8, 13, 14, 15, 14, 13, 11, 11, 10, 9, 8, 8, 7, 7, 7, 7, 8,
- 12, 14, 14, 14, 13, 11, 11, 10, 8, 8, 8, 7, 7, 7, 7, 7, 12, 13, 13, 13,
- 12, 11, 10, 9, 8, 8, 7, 7, 7, 7, 7, 7, 12, 13, 13, 13, 12, 11, 10, 9, 8,
- 8, 7, 7, 7, 7, 7, 6, 11, 12, 12, 13, 12, 11, 10, 9, 8, 8, 7, 7, 7, 6, 6,
- 6, 11, 12, 12, 12, 11, 11, 10, 9, 9, 8, 7, 7, 6, 6, 6, 6, 10, 12, 12,
- 12, 11, 11, 9, 9, 8, 8, 7, 6, 6, 6, 6, 6, 10, 11, 11, 12, 11, 10, 9, 9,
- 8, 8, 7, 6, 6, 6, 6, 6, 10, 11, 11, 11, 11, 10, 9, 9, 8, 8, 7, 7, 6, 6,
- 6, 6, 10, 10, 11, 11, 11, 10, 9, 9, 8, 8, 7, 7, 6, 6, 5, 5, 9, 10, 10,
- 11, 10, 9, 9, 8, 8, 7, 7, 6, 6, 6, 5, 5, 9, 10, 10, 10, 10, 9, 9, 8, 8,
- 7, 7, 6, 6, 5, 5, 5,
- /* Size 32x16 */
32, 33, 33, 33, 32, 32, 32, 30, 28, 27, 26, 23, 22, 21, 19, 18, 17, 16,
14, 13, 13, 12, 12, 12, 11, 11, 10, 10, 10, 10, 9, 9, 33, 32, 32, 32,
32, 31, 31, 30, 30, 28, 28, 25, 23, 22, 21, 19, 18, 17, 16, 15, 14, 14,
@@ -7229,32 +7201,45 @@
11, 11, 11, 10, 10, 9, 9, 9, 8, 8, 8, 7, 7, 7, 7, 6, 6, 6, 6, 6, 5, 5,
5, 10, 11, 11, 11, 11, 12, 12, 12, 12, 11, 11, 10, 10, 10, 10, 9, 9, 8,
8, 8, 8, 7, 7, 6, 6, 6, 6, 6, 6, 5, 5, 5,
+ /* Size 32x16 */
+ 32, 33, 32, 32, 28, 23, 22, 19, 16, 14, 13, 12, 11, 11, 11, 10, 33, 32,
+ 32, 31, 29, 24, 23, 20, 17, 15, 14, 12, 12, 12, 11, 11, 33, 32, 32, 31,
+ 29, 25, 23, 21, 17, 15, 14, 13, 12, 12, 11, 11, 33, 32, 31, 31, 29, 25,
+ 23, 21, 17, 16, 14, 13, 12, 12, 12, 11, 32, 32, 30, 30, 28, 24, 23, 20,
+ 17, 16, 14, 13, 13, 12, 12, 11, 32, 31, 29, 28, 27, 24, 23, 21, 18, 16,
+ 15, 13, 13, 12, 12, 12, 32, 31, 29, 28, 26, 23, 22, 20, 17, 16, 14, 13,
+ 13, 13, 12, 12, 30, 30, 28, 27, 24, 21, 20, 19, 16, 15, 14, 13, 12, 13,
+ 12, 12, 28, 30, 28, 26, 21, 19, 18, 17, 15, 14, 13, 12, 12, 12, 12, 12,
+ 27, 28, 26, 25, 21, 18, 18, 16, 14, 13, 13, 12, 12, 12, 11, 11, 26, 28,
+ 26, 24, 20, 18, 17, 16, 14, 13, 12, 11, 11, 11, 11, 11, 23, 25, 24, 23,
+ 19, 16, 16, 14, 13, 12, 11, 11, 11, 11, 11, 10, 22, 23, 23, 22, 18, 16,
+ 15, 14, 12, 11, 11, 10, 10, 10, 10, 10, 21, 22, 22, 21, 18, 15, 14, 13,
+ 12, 11, 11, 10, 10, 10, 10, 10, 19, 21, 20, 20, 17, 14, 14, 12, 11, 10,
+ 10, 9, 9, 10, 9, 10, 18, 19, 19, 19, 16, 14, 13, 12, 10, 10, 9, 9, 9, 9,
+ 9, 9, 17, 18, 18, 18, 16, 13, 13, 12, 10, 10, 9, 9, 9, 9, 9, 9, 16, 17,
+ 17, 17, 15, 13, 12, 11, 10, 9, 9, 8, 8, 8, 8, 8, 14, 16, 16, 16, 14, 12,
+ 12, 11, 9, 9, 8, 8, 8, 8, 8, 8, 13, 15, 15, 15, 13, 12, 11, 10, 9, 8, 8,
+ 8, 8, 8, 8, 8, 13, 14, 15, 14, 13, 11, 11, 10, 9, 8, 8, 7, 7, 7, 7, 8,
+ 12, 14, 14, 14, 13, 11, 11, 10, 8, 8, 8, 7, 7, 7, 7, 7, 12, 13, 13, 13,
+ 12, 11, 10, 9, 8, 8, 7, 7, 7, 7, 7, 7, 12, 13, 13, 13, 12, 11, 10, 9, 8,
+ 8, 7, 7, 7, 7, 7, 6, 11, 12, 12, 13, 12, 11, 10, 9, 8, 8, 7, 7, 7, 6, 6,
+ 6, 11, 12, 12, 12, 11, 11, 10, 9, 9, 8, 7, 7, 6, 6, 6, 6, 10, 12, 12,
+ 12, 11, 11, 9, 9, 8, 8, 7, 6, 6, 6, 6, 6, 10, 11, 11, 12, 11, 10, 9, 9,
+ 8, 8, 7, 6, 6, 6, 6, 6, 10, 11, 11, 11, 11, 10, 9, 9, 8, 8, 7, 7, 6, 6,
+ 6, 6, 10, 10, 11, 11, 11, 10, 9, 9, 8, 8, 7, 7, 6, 6, 5, 5, 9, 10, 10,
+ 11, 10, 9, 9, 8, 8, 7, 7, 6, 6, 6, 5, 5, 9, 10, 10, 10, 10, 9, 9, 8, 8,
+ 7, 7, 6, 6, 5, 5, 5,
/* Size 4x16 */
- 33, 23, 14, 11, 32, 25, 15, 12, 32, 24, 16, 12, 31, 23, 16, 13, 30, 19,
- 14, 12, 28, 18, 13, 11, 23, 16, 11, 10, 21, 14, 10, 10, 18, 13, 10, 9,
- 16, 12, 9, 8, 14, 11, 8, 7, 13, 11, 8, 7, 12, 11, 8, 6, 12, 11, 8, 6,
- 11, 10, 8, 6, 10, 9, 7, 6,
- /* Size 16x4 */
33, 32, 32, 31, 30, 28, 23, 21, 18, 16, 14, 13, 12, 12, 11, 10, 23, 25,
24, 23, 19, 18, 16, 14, 13, 12, 11, 11, 11, 11, 10, 9, 14, 15, 16, 16,
14, 13, 11, 10, 10, 9, 8, 8, 8, 8, 8, 7, 11, 12, 12, 13, 12, 11, 10, 10,
9, 8, 7, 7, 6, 6, 6, 6,
+ /* Size 16x4 */
+ 33, 23, 14, 11, 32, 25, 15, 12, 32, 24, 16, 12, 31, 23, 16, 13, 30, 19,
+ 14, 12, 28, 18, 13, 11, 23, 16, 11, 10, 21, 14, 10, 10, 18, 13, 10, 9,
+ 16, 12, 9, 8, 14, 11, 8, 7, 13, 11, 8, 7, 12, 11, 8, 6, 12, 11, 8, 6,
+ 11, 10, 8, 6, 10, 9, 7, 6,
/* Size 8x32 */
- 32, 32, 28, 22, 16, 13, 11, 11, 33, 32, 29, 23, 17, 14, 12, 11, 33, 32,
- 29, 23, 17, 14, 12, 11, 33, 31, 29, 23, 17, 14, 12, 12, 32, 30, 28, 23,
- 17, 14, 13, 12, 32, 29, 27, 23, 18, 15, 13, 12, 32, 29, 26, 22, 17, 14,
- 13, 12, 30, 28, 24, 20, 16, 14, 12, 12, 28, 28, 21, 18, 15, 13, 12, 12,
- 27, 26, 21, 18, 14, 13, 12, 11, 26, 26, 20, 17, 14, 12, 11, 11, 23, 24,
- 19, 16, 13, 11, 11, 11, 22, 23, 18, 15, 12, 11, 10, 10, 21, 22, 18, 14,
- 12, 11, 10, 10, 19, 20, 17, 14, 11, 10, 9, 9, 18, 19, 16, 13, 10, 9, 9,
- 9, 17, 18, 16, 13, 10, 9, 9, 9, 16, 17, 15, 12, 10, 9, 8, 8, 14, 16, 14,
- 12, 9, 8, 8, 8, 13, 15, 13, 11, 9, 8, 8, 8, 13, 15, 13, 11, 9, 8, 7, 7,
- 12, 14, 13, 11, 8, 8, 7, 7, 12, 13, 12, 10, 8, 7, 7, 7, 12, 13, 12, 10,
- 8, 7, 7, 7, 11, 12, 12, 10, 8, 7, 7, 6, 11, 12, 11, 10, 9, 7, 6, 6, 10,
- 12, 11, 9, 8, 7, 6, 6, 10, 11, 11, 9, 8, 7, 6, 6, 10, 11, 11, 9, 8, 7,
- 6, 6, 10, 11, 11, 9, 8, 7, 6, 5, 9, 10, 10, 9, 8, 7, 6, 5, 9, 10, 10, 9,
- 8, 7, 6, 5,
- /* Size 32x8 */
32, 33, 33, 33, 32, 32, 32, 30, 28, 27, 26, 23, 22, 21, 19, 18, 17, 16,
14, 13, 13, 12, 12, 12, 11, 11, 10, 10, 10, 10, 9, 9, 32, 32, 32, 31,
30, 29, 29, 28, 28, 26, 26, 24, 23, 22, 20, 19, 18, 17, 16, 15, 15, 14,
@@ -7268,7 +7253,22 @@
7, 7, 7, 7, 7, 11, 12, 12, 12, 13, 13, 13, 12, 12, 12, 11, 11, 10, 10,
9, 9, 9, 8, 8, 8, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 11, 11, 11, 12,
12, 12, 12, 12, 12, 11, 11, 11, 10, 10, 9, 9, 9, 8, 8, 8, 7, 7, 7, 7, 6,
- 6, 6, 6, 6, 5, 5, 5 },
+ 6, 6, 6, 6, 5, 5, 5,
+ /* Size 32x8 */
+ 32, 32, 28, 22, 16, 13, 11, 11, 33, 32, 29, 23, 17, 14, 12, 11, 33, 32,
+ 29, 23, 17, 14, 12, 11, 33, 31, 29, 23, 17, 14, 12, 12, 32, 30, 28, 23,
+ 17, 14, 13, 12, 32, 29, 27, 23, 18, 15, 13, 12, 32, 29, 26, 22, 17, 14,
+ 13, 12, 30, 28, 24, 20, 16, 14, 12, 12, 28, 28, 21, 18, 15, 13, 12, 12,
+ 27, 26, 21, 18, 14, 13, 12, 11, 26, 26, 20, 17, 14, 12, 11, 11, 23, 24,
+ 19, 16, 13, 11, 11, 11, 22, 23, 18, 15, 12, 11, 10, 10, 21, 22, 18, 14,
+ 12, 11, 10, 10, 19, 20, 17, 14, 11, 10, 9, 9, 18, 19, 16, 13, 10, 9, 9,
+ 9, 17, 18, 16, 13, 10, 9, 9, 9, 16, 17, 15, 12, 10, 9, 8, 8, 14, 16, 14,
+ 12, 9, 8, 8, 8, 13, 15, 13, 11, 9, 8, 8, 8, 13, 15, 13, 11, 9, 8, 7, 7,
+ 12, 14, 13, 11, 8, 8, 7, 7, 12, 13, 12, 10, 8, 7, 7, 7, 12, 13, 12, 10,
+ 8, 7, 7, 7, 11, 12, 12, 10, 8, 7, 7, 6, 11, 12, 11, 10, 9, 7, 6, 6, 10,
+ 12, 11, 9, 8, 7, 6, 6, 10, 11, 11, 9, 8, 7, 6, 6, 10, 11, 11, 9, 8, 7,
+ 6, 6, 10, 11, 11, 9, 8, 7, 6, 5, 9, 10, 10, 9, 8, 7, 6, 5, 9, 10, 10, 9,
+ 8, 7, 6, 5 },
{ /* Chroma */
/* Size 4x4 */
31, 23, 18, 16, 23, 18, 16, 15, 18, 16, 12, 12, 16, 15, 12, 10,
@@ -7352,21 +7352,12 @@
14, 14, 14, 15, 15, 16, 16, 16, 16, 15, 15, 14, 14, 14, 14, 13, 13, 12,
12, 12, 12, 11, 11, 10, 10, 10, 10, 9, 9, 9, 9, 9,
/* Size 4x8 */
- 33, 22, 18, 16, 26, 23, 20, 17, 22, 19, 17, 16, 22, 17, 15, 14, 20, 16,
- 13, 13, 17, 15, 12, 11, 16, 16, 12, 10, 16, 15, 12, 10,
- /* Size 8x4 */
33, 26, 22, 22, 20, 17, 16, 16, 22, 23, 19, 17, 16, 15, 16, 15, 18, 20,
17, 15, 13, 12, 12, 12, 16, 17, 16, 14, 13, 11, 10, 10,
+ /* Size 8x4 */
+ 33, 22, 18, 16, 26, 23, 20, 17, 22, 19, 17, 16, 22, 17, 15, 14, 20, 16,
+ 13, 13, 17, 15, 12, 11, 16, 16, 12, 10, 16, 15, 12, 10,
/* Size 8x16 */
- 32, 29, 21, 20, 18, 16, 15, 15, 34, 27, 22, 22, 20, 18, 16, 16, 31, 25,
- 22, 22, 20, 18, 17, 16, 26, 22, 21, 22, 20, 19, 18, 17, 21, 21, 19, 19,
- 18, 17, 17, 17, 21, 22, 19, 18, 17, 16, 16, 16, 20, 22, 19, 17, 16, 15,
- 14, 15, 20, 22, 19, 16, 14, 14, 14, 14, 19, 21, 18, 16, 14, 13, 13, 13,
- 17, 19, 18, 15, 13, 12, 12, 12, 16, 19, 17, 15, 12, 12, 11, 12, 16, 18,
- 17, 14, 12, 11, 11, 11, 15, 17, 16, 14, 13, 11, 11, 11, 15, 17, 16, 14,
- 13, 12, 10, 10, 14, 16, 16, 14, 12, 11, 10, 10, 14, 15, 16, 14, 13, 12,
- 10, 10,
- /* Size 16x8 */
32, 34, 31, 26, 21, 21, 20, 20, 19, 17, 16, 16, 15, 15, 14, 14, 29, 27,
25, 22, 21, 22, 22, 22, 21, 19, 19, 18, 17, 17, 16, 15, 21, 22, 22, 21,
19, 19, 19, 19, 18, 18, 17, 17, 16, 16, 16, 16, 20, 22, 22, 22, 19, 18,
@@ -7375,37 +7366,16 @@
12, 11, 11, 12, 11, 12, 15, 16, 17, 18, 17, 16, 14, 14, 13, 12, 11, 11,
11, 10, 10, 10, 15, 16, 16, 17, 17, 16, 15, 14, 13, 12, 12, 11, 11, 10,
10, 10,
+ /* Size 16x8 */
+ 32, 29, 21, 20, 18, 16, 15, 15, 34, 27, 22, 22, 20, 18, 16, 16, 31, 25,
+ 22, 22, 20, 18, 17, 16, 26, 22, 21, 22, 20, 19, 18, 17, 21, 21, 19, 19,
+ 18, 17, 17, 17, 21, 22, 19, 18, 17, 16, 16, 16, 20, 22, 19, 17, 16, 15,
+ 14, 15, 20, 22, 19, 16, 14, 14, 14, 14, 19, 21, 18, 16, 14, 13, 13, 13,
+ 17, 19, 18, 15, 13, 12, 12, 12, 16, 19, 17, 15, 12, 12, 11, 12, 16, 18,
+ 17, 14, 12, 11, 11, 11, 15, 17, 16, 14, 13, 11, 11, 11, 15, 17, 16, 14,
+ 13, 12, 10, 10, 14, 16, 16, 14, 12, 11, 10, 10, 14, 15, 16, 14, 13, 12,
+ 10, 10,
/* Size 16x32 */
- 32, 33, 29, 27, 21, 21, 20, 20, 18, 17, 16, 15, 15, 15, 15, 14, 33, 33,
- 28, 26, 22, 22, 21, 20, 19, 18, 17, 16, 16, 16, 16, 15, 34, 32, 27, 26,
- 22, 23, 22, 21, 20, 19, 18, 17, 16, 16, 16, 15, 33, 31, 27, 25, 22, 23,
- 22, 21, 20, 19, 18, 17, 17, 17, 16, 16, 31, 28, 25, 23, 22, 22, 22, 22,
- 20, 19, 18, 17, 17, 17, 16, 16, 28, 26, 23, 22, 22, 23, 22, 22, 20, 20,
- 19, 18, 17, 17, 17, 17, 26, 25, 22, 22, 21, 22, 22, 21, 20, 19, 19, 18,
- 18, 17, 17, 17, 24, 24, 22, 21, 20, 21, 20, 20, 19, 18, 18, 17, 17, 17,
- 17, 17, 21, 22, 21, 21, 19, 19, 19, 19, 18, 17, 17, 16, 17, 17, 17, 17,
- 21, 22, 22, 21, 19, 19, 19, 18, 18, 17, 17, 16, 16, 16, 16, 16, 21, 22,
- 22, 21, 19, 18, 18, 18, 17, 17, 16, 16, 16, 16, 16, 16, 21, 23, 23, 22,
- 19, 18, 17, 17, 16, 16, 15, 15, 15, 15, 16, 15, 20, 22, 22, 21, 19, 17,
- 17, 16, 16, 15, 15, 14, 14, 15, 15, 15, 20, 22, 22, 21, 19, 17, 17, 16,
- 15, 15, 14, 14, 14, 14, 15, 14, 20, 21, 22, 21, 19, 17, 16, 16, 14, 14,
- 14, 13, 14, 14, 14, 14, 19, 20, 21, 20, 19, 17, 16, 15, 14, 13, 13, 13,
- 13, 13, 14, 14, 19, 20, 21, 20, 18, 16, 16, 15, 14, 13, 13, 13, 13, 13,
- 13, 14, 18, 20, 20, 20, 18, 16, 16, 15, 13, 13, 12, 12, 12, 13, 13, 13,
- 17, 19, 19, 19, 18, 16, 15, 14, 13, 12, 12, 12, 12, 12, 12, 13, 17, 18,
- 19, 19, 17, 16, 15, 14, 13, 12, 12, 12, 12, 12, 12, 12, 16, 18, 19, 18,
- 17, 15, 15, 14, 12, 12, 12, 11, 11, 12, 12, 12, 16, 17, 18, 18, 17, 15,
- 14, 14, 12, 12, 11, 11, 11, 11, 12, 12, 16, 17, 18, 18, 17, 15, 14, 13,
- 12, 12, 11, 11, 11, 11, 11, 12, 15, 17, 17, 18, 16, 15, 14, 13, 12, 12,
- 11, 11, 11, 11, 11, 11, 15, 17, 17, 17, 16, 15, 14, 13, 13, 12, 11, 11,
- 11, 10, 11, 11, 15, 16, 17, 17, 16, 16, 14, 13, 13, 12, 11, 11, 10, 10,
- 10, 10, 15, 16, 17, 17, 16, 16, 14, 13, 13, 12, 12, 11, 10, 10, 10, 10,
- 14, 16, 16, 17, 16, 15, 14, 14, 12, 12, 11, 11, 10, 10, 10, 10, 14, 16,
- 16, 17, 16, 15, 14, 14, 12, 12, 11, 11, 10, 10, 10, 10, 14, 16, 16, 16,
- 16, 15, 14, 13, 13, 12, 11, 11, 10, 10, 10, 10, 14, 15, 15, 16, 16, 15,
- 14, 13, 13, 12, 12, 11, 10, 10, 10, 10, 14, 15, 15, 16, 16, 14, 14, 13,
- 13, 12, 12, 11, 11, 10, 10, 9,
- /* Size 32x16 */
32, 33, 34, 33, 31, 28, 26, 24, 21, 21, 21, 21, 20, 20, 20, 19, 19, 18,
17, 17, 16, 16, 16, 15, 15, 15, 15, 14, 14, 14, 14, 14, 33, 33, 32, 31,
28, 26, 25, 24, 22, 22, 22, 23, 22, 22, 21, 20, 20, 20, 19, 18, 18, 17,
@@ -7435,33 +7405,47 @@
12, 12, 11, 11, 11, 10, 10, 10, 10, 10, 10, 10, 14, 15, 15, 16, 16, 17,
17, 17, 17, 16, 16, 15, 15, 14, 14, 14, 14, 13, 13, 12, 12, 12, 12, 11,
11, 10, 10, 10, 10, 10, 10, 9,
+ /* Size 32x16 */
+ 32, 33, 29, 27, 21, 21, 20, 20, 18, 17, 16, 15, 15, 15, 15, 14, 33, 33,
+ 28, 26, 22, 22, 21, 20, 19, 18, 17, 16, 16, 16, 16, 15, 34, 32, 27, 26,
+ 22, 23, 22, 21, 20, 19, 18, 17, 16, 16, 16, 15, 33, 31, 27, 25, 22, 23,
+ 22, 21, 20, 19, 18, 17, 17, 17, 16, 16, 31, 28, 25, 23, 22, 22, 22, 22,
+ 20, 19, 18, 17, 17, 17, 16, 16, 28, 26, 23, 22, 22, 23, 22, 22, 20, 20,
+ 19, 18, 17, 17, 17, 17, 26, 25, 22, 22, 21, 22, 22, 21, 20, 19, 19, 18,
+ 18, 17, 17, 17, 24, 24, 22, 21, 20, 21, 20, 20, 19, 18, 18, 17, 17, 17,
+ 17, 17, 21, 22, 21, 21, 19, 19, 19, 19, 18, 17, 17, 16, 17, 17, 17, 17,
+ 21, 22, 22, 21, 19, 19, 19, 18, 18, 17, 17, 16, 16, 16, 16, 16, 21, 22,
+ 22, 21, 19, 18, 18, 18, 17, 17, 16, 16, 16, 16, 16, 16, 21, 23, 23, 22,
+ 19, 18, 17, 17, 16, 16, 15, 15, 15, 15, 16, 15, 20, 22, 22, 21, 19, 17,
+ 17, 16, 16, 15, 15, 14, 14, 15, 15, 15, 20, 22, 22, 21, 19, 17, 17, 16,
+ 15, 15, 14, 14, 14, 14, 15, 14, 20, 21, 22, 21, 19, 17, 16, 16, 14, 14,
+ 14, 13, 14, 14, 14, 14, 19, 20, 21, 20, 19, 17, 16, 15, 14, 13, 13, 13,
+ 13, 13, 14, 14, 19, 20, 21, 20, 18, 16, 16, 15, 14, 13, 13, 13, 13, 13,
+ 13, 14, 18, 20, 20, 20, 18, 16, 16, 15, 13, 13, 12, 12, 12, 13, 13, 13,
+ 17, 19, 19, 19, 18, 16, 15, 14, 13, 12, 12, 12, 12, 12, 12, 13, 17, 18,
+ 19, 19, 17, 16, 15, 14, 13, 12, 12, 12, 12, 12, 12, 12, 16, 18, 19, 18,
+ 17, 15, 15, 14, 12, 12, 12, 11, 11, 12, 12, 12, 16, 17, 18, 18, 17, 15,
+ 14, 14, 12, 12, 11, 11, 11, 11, 12, 12, 16, 17, 18, 18, 17, 15, 14, 13,
+ 12, 12, 11, 11, 11, 11, 11, 12, 15, 17, 17, 18, 16, 15, 14, 13, 12, 12,
+ 11, 11, 11, 11, 11, 11, 15, 17, 17, 17, 16, 15, 14, 13, 13, 12, 11, 11,
+ 11, 10, 11, 11, 15, 16, 17, 17, 16, 16, 14, 13, 13, 12, 11, 11, 10, 10,
+ 10, 10, 15, 16, 17, 17, 16, 16, 14, 13, 13, 12, 12, 11, 10, 10, 10, 10,
+ 14, 16, 16, 17, 16, 15, 14, 14, 12, 12, 11, 11, 10, 10, 10, 10, 14, 16,
+ 16, 17, 16, 15, 14, 14, 12, 12, 11, 11, 10, 10, 10, 10, 14, 16, 16, 16,
+ 16, 15, 14, 13, 13, 12, 11, 11, 10, 10, 10, 10, 14, 15, 15, 16, 16, 15,
+ 14, 13, 13, 12, 12, 11, 10, 10, 10, 10, 14, 15, 15, 16, 16, 14, 14, 13,
+ 13, 12, 12, 11, 11, 10, 10, 9,
/* Size 4x16 */
- 33, 21, 17, 15, 32, 23, 19, 16, 28, 22, 19, 17, 25, 22, 19, 17, 22, 19,
- 17, 17, 22, 18, 17, 16, 22, 17, 15, 15, 21, 17, 14, 14, 20, 16, 13, 13,
- 19, 16, 12, 12, 18, 15, 12, 12, 17, 15, 12, 11, 17, 15, 12, 10, 16, 16,
- 12, 10, 16, 15, 12, 10, 15, 15, 12, 10,
- /* Size 16x4 */
33, 32, 28, 25, 22, 22, 22, 21, 20, 19, 18, 17, 17, 16, 16, 15, 21, 23,
22, 22, 19, 18, 17, 17, 16, 16, 15, 15, 15, 16, 15, 15, 17, 19, 19, 19,
17, 17, 15, 14, 13, 12, 12, 12, 12, 12, 12, 12, 15, 16, 17, 17, 17, 16,
15, 14, 13, 12, 12, 11, 10, 10, 10, 10,
+ /* Size 16x4 */
+ 33, 21, 17, 15, 32, 23, 19, 16, 28, 22, 19, 17, 25, 22, 19, 17, 22, 19,
+ 17, 17, 22, 18, 17, 16, 22, 17, 15, 15, 21, 17, 14, 14, 20, 16, 13, 13,
+ 19, 16, 12, 12, 18, 15, 12, 12, 17, 15, 12, 11, 17, 15, 12, 10, 16, 16,
+ 12, 10, 16, 15, 12, 10, 15, 15, 12, 10,
/* Size 8x32 */
- 32, 29, 21, 20, 18, 16, 15, 15, 33, 28, 22, 21, 19, 17, 16, 16, 34, 27,
- 22, 22, 20, 18, 16, 16, 33, 27, 22, 22, 20, 18, 17, 16, 31, 25, 22, 22,
- 20, 18, 17, 16, 28, 23, 22, 22, 20, 19, 17, 17, 26, 22, 21, 22, 20, 19,
- 18, 17, 24, 22, 20, 20, 19, 18, 17, 17, 21, 21, 19, 19, 18, 17, 17, 17,
- 21, 22, 19, 19, 18, 17, 16, 16, 21, 22, 19, 18, 17, 16, 16, 16, 21, 23,
- 19, 17, 16, 15, 15, 16, 20, 22, 19, 17, 16, 15, 14, 15, 20, 22, 19, 17,
- 15, 14, 14, 15, 20, 22, 19, 16, 14, 14, 14, 14, 19, 21, 19, 16, 14, 13,
- 13, 14, 19, 21, 18, 16, 14, 13, 13, 13, 18, 20, 18, 16, 13, 12, 12, 13,
- 17, 19, 18, 15, 13, 12, 12, 12, 17, 19, 17, 15, 13, 12, 12, 12, 16, 19,
- 17, 15, 12, 12, 11, 12, 16, 18, 17, 14, 12, 11, 11, 12, 16, 18, 17, 14,
- 12, 11, 11, 11, 15, 17, 16, 14, 12, 11, 11, 11, 15, 17, 16, 14, 13, 11,
- 11, 11, 15, 17, 16, 14, 13, 11, 10, 10, 15, 17, 16, 14, 13, 12, 10, 10,
- 14, 16, 16, 14, 12, 11, 10, 10, 14, 16, 16, 14, 12, 11, 10, 10, 14, 16,
- 16, 14, 13, 11, 10, 10, 14, 15, 16, 14, 13, 12, 10, 10, 14, 15, 16, 14,
- 13, 12, 11, 10,
- /* Size 32x8 */
32, 33, 34, 33, 31, 28, 26, 24, 21, 21, 21, 21, 20, 20, 20, 19, 19, 18,
17, 17, 16, 16, 16, 15, 15, 15, 15, 14, 14, 14, 14, 14, 29, 28, 27, 27,
25, 23, 22, 22, 21, 22, 22, 23, 22, 22, 22, 21, 21, 20, 19, 19, 19, 18,
@@ -7476,7 +7460,23 @@
18, 17, 17, 16, 16, 15, 14, 14, 14, 13, 13, 12, 12, 12, 11, 11, 11, 11,
11, 10, 10, 10, 10, 10, 10, 11, 15, 16, 16, 16, 16, 17, 17, 17, 17, 16,
16, 16, 15, 15, 14, 14, 13, 13, 12, 12, 12, 12, 11, 11, 11, 10, 10, 10,
- 10, 10, 10, 10 },
+ 10, 10, 10, 10,
+ /* Size 32x8 */
+ 32, 29, 21, 20, 18, 16, 15, 15, 33, 28, 22, 21, 19, 17, 16, 16, 34, 27,
+ 22, 22, 20, 18, 16, 16, 33, 27, 22, 22, 20, 18, 17, 16, 31, 25, 22, 22,
+ 20, 18, 17, 16, 28, 23, 22, 22, 20, 19, 17, 17, 26, 22, 21, 22, 20, 19,
+ 18, 17, 24, 22, 20, 20, 19, 18, 17, 17, 21, 21, 19, 19, 18, 17, 17, 17,
+ 21, 22, 19, 19, 18, 17, 16, 16, 21, 22, 19, 18, 17, 16, 16, 16, 21, 23,
+ 19, 17, 16, 15, 15, 16, 20, 22, 19, 17, 16, 15, 14, 15, 20, 22, 19, 17,
+ 15, 14, 14, 15, 20, 22, 19, 16, 14, 14, 14, 14, 19, 21, 19, 16, 14, 13,
+ 13, 14, 19, 21, 18, 16, 14, 13, 13, 13, 18, 20, 18, 16, 13, 12, 12, 13,
+ 17, 19, 18, 15, 13, 12, 12, 12, 17, 19, 17, 15, 13, 12, 12, 12, 16, 19,
+ 17, 15, 12, 12, 11, 12, 16, 18, 17, 14, 12, 11, 11, 12, 16, 18, 17, 14,
+ 12, 11, 11, 11, 15, 17, 16, 14, 12, 11, 11, 11, 15, 17, 16, 14, 13, 11,
+ 11, 11, 15, 17, 16, 14, 13, 11, 10, 10, 15, 17, 16, 14, 13, 12, 10, 10,
+ 14, 16, 16, 14, 12, 11, 10, 10, 14, 16, 16, 14, 12, 11, 10, 10, 14, 16,
+ 16, 14, 13, 11, 10, 10, 14, 15, 16, 14, 13, 12, 10, 10, 14, 15, 16, 14,
+ 13, 12, 11, 10 },
},
{
{ /* Luma */
@@ -7558,20 +7558,12 @@
11, 11, 11, 11, 11, 10, 10, 10, 10, 9, 9, 9, 9, 8, 8, 7, 7, 7, 7, 6, 6,
6, 6, 6, 6, 5, 5, 5,
/* Size 4x8 */
- 32, 27, 17, 12, 32, 26, 18, 13, 30, 20, 15, 12, 23, 17, 12, 10, 19, 15,
- 10, 9, 14, 12, 9, 8, 12, 12, 8, 7, 11, 10, 8, 6,
- /* Size 8x4 */
32, 32, 30, 23, 19, 14, 12, 11, 27, 26, 20, 17, 15, 12, 12, 10, 17, 18,
15, 12, 10, 9, 8, 8, 12, 13, 12, 10, 9, 8, 7, 6,
+ /* Size 8x4 */
+ 32, 27, 17, 12, 32, 26, 18, 13, 30, 20, 15, 12, 23, 17, 12, 10, 19, 15,
+ 10, 9, 14, 12, 9, 8, 12, 12, 8, 7, 11, 10, 8, 6,
/* Size 8x16 */
- 32, 32, 28, 23, 18, 13, 12, 11, 33, 32, 29, 25, 19, 14, 13, 12, 32, 31,
- 28, 24, 19, 14, 13, 12, 32, 30, 27, 24, 20, 15, 13, 12, 30, 28, 23, 20,
- 17, 14, 13, 12, 26, 26, 20, 18, 15, 12, 12, 11, 23, 24, 19, 16, 14, 11,
- 11, 11, 21, 22, 18, 15, 13, 11, 10, 10, 18, 19, 16, 14, 11, 9, 9, 9, 16,
- 17, 15, 13, 11, 9, 8, 8, 14, 16, 14, 12, 10, 8, 8, 8, 13, 14, 13, 11, 9,
- 8, 7, 7, 12, 13, 12, 11, 9, 7, 7, 7, 11, 12, 12, 10, 9, 8, 7, 6, 10, 12,
- 12, 10, 8, 7, 6, 6, 10, 11, 11, 10, 9, 7, 6, 6,
- /* Size 16x8 */
32, 33, 32, 32, 30, 26, 23, 21, 18, 16, 14, 13, 12, 11, 10, 10, 32, 32,
31, 30, 28, 26, 24, 22, 19, 17, 16, 14, 13, 12, 12, 11, 28, 29, 28, 27,
23, 20, 19, 18, 16, 15, 14, 13, 12, 12, 12, 11, 23, 25, 24, 24, 20, 18,
@@ -7579,35 +7571,15 @@
11, 11, 10, 9, 9, 9, 8, 9, 13, 14, 14, 15, 14, 12, 11, 11, 9, 9, 8, 8,
7, 8, 7, 7, 12, 13, 13, 13, 13, 12, 11, 10, 9, 8, 8, 7, 7, 7, 6, 6, 11,
12, 12, 12, 12, 11, 11, 10, 9, 8, 8, 7, 7, 6, 6, 6,
+ /* Size 16x8 */
+ 32, 32, 28, 23, 18, 13, 12, 11, 33, 32, 29, 25, 19, 14, 13, 12, 32, 31,
+ 28, 24, 19, 14, 13, 12, 32, 30, 27, 24, 20, 15, 13, 12, 30, 28, 23, 20,
+ 17, 14, 13, 12, 26, 26, 20, 18, 15, 12, 12, 11, 23, 24, 19, 16, 14, 11,
+ 11, 11, 21, 22, 18, 15, 13, 11, 10, 10, 18, 19, 16, 14, 11, 9, 9, 9, 16,
+ 17, 15, 13, 11, 9, 8, 8, 14, 16, 14, 12, 10, 8, 8, 8, 13, 14, 13, 11, 9,
+ 8, 7, 7, 12, 13, 12, 11, 9, 7, 7, 7, 11, 12, 12, 10, 9, 8, 7, 6, 10, 12,
+ 12, 10, 8, 7, 6, 6, 10, 11, 11, 10, 9, 7, 6, 6,
/* Size 16x32 */
- 32, 33, 32, 32, 28, 26, 23, 19, 18, 16, 13, 13, 12, 11, 11, 11, 33, 32,
- 32, 32, 29, 27, 24, 20, 19, 17, 14, 13, 12, 12, 12, 11, 33, 32, 32, 32,
- 29, 27, 25, 20, 19, 17, 14, 14, 13, 12, 12, 11, 33, 32, 32, 31, 30, 28,
- 25, 21, 19, 17, 14, 14, 13, 12, 12, 12, 32, 32, 31, 30, 28, 26, 24, 20,
- 19, 17, 14, 14, 13, 13, 12, 12, 32, 32, 30, 30, 28, 26, 24, 21, 19, 18,
- 15, 14, 13, 13, 12, 12, 32, 31, 30, 29, 27, 26, 24, 21, 20, 18, 15, 15,
- 13, 13, 12, 12, 30, 30, 29, 28, 24, 23, 21, 19, 18, 16, 14, 14, 13, 13,
- 13, 12, 30, 30, 28, 28, 23, 22, 20, 18, 17, 16, 14, 13, 13, 12, 12, 12,
- 28, 30, 28, 27, 21, 20, 19, 17, 16, 15, 13, 13, 12, 12, 12, 12, 26, 28,
- 26, 26, 20, 19, 18, 16, 15, 14, 12, 12, 12, 12, 11, 12, 26, 27, 26, 25,
- 20, 19, 17, 15, 15, 14, 12, 12, 11, 11, 11, 11, 23, 25, 24, 24, 19, 18,
- 16, 14, 14, 13, 11, 11, 11, 11, 11, 11, 22, 23, 23, 22, 18, 17, 16, 14,
- 13, 12, 11, 11, 10, 10, 10, 10, 21, 22, 22, 22, 18, 17, 15, 13, 13, 12,
- 11, 10, 10, 10, 10, 10, 19, 21, 20, 20, 17, 16, 14, 12, 12, 11, 10, 10,
- 9, 9, 10, 9, 18, 19, 19, 19, 16, 15, 14, 12, 11, 11, 9, 9, 9, 9, 9, 9,
- 17, 19, 19, 19, 16, 15, 14, 12, 11, 10, 9, 9, 9, 9, 9, 9, 16, 17, 17,
- 18, 15, 14, 13, 11, 11, 10, 9, 9, 8, 8, 8, 9, 15, 16, 17, 17, 14, 13,
- 12, 11, 10, 9, 8, 8, 8, 8, 8, 8, 14, 16, 16, 16, 14, 13, 12, 11, 10, 9,
- 8, 8, 8, 8, 8, 8, 13, 14, 14, 15, 13, 12, 11, 10, 9, 9, 8, 8, 7, 8, 8,
- 7, 13, 14, 14, 14, 13, 12, 11, 10, 9, 9, 8, 7, 7, 7, 7, 7, 12, 14, 14,
- 14, 13, 12, 11, 10, 9, 8, 8, 7, 7, 7, 7, 7, 12, 13, 13, 13, 12, 11, 11,
- 9, 9, 8, 7, 7, 7, 7, 7, 7, 11, 12, 13, 13, 12, 12, 10, 9, 9, 8, 8, 7, 7,
- 7, 6, 6, 11, 12, 12, 13, 12, 11, 10, 10, 9, 8, 8, 7, 7, 6, 6, 6, 11, 12,
- 12, 12, 12, 11, 10, 10, 9, 8, 7, 7, 7, 6, 6, 6, 10, 12, 12, 12, 12, 11,
- 10, 9, 8, 8, 7, 7, 6, 6, 6, 6, 10, 11, 11, 12, 11, 10, 10, 9, 9, 8, 7,
- 7, 6, 6, 6, 6, 10, 11, 11, 11, 11, 10, 10, 9, 9, 8, 7, 7, 6, 6, 6, 6,
- 10, 11, 11, 11, 11, 10, 10, 9, 9, 8, 8, 7, 7, 6, 6, 5,
- /* Size 32x16 */
32, 33, 33, 33, 32, 32, 32, 30, 30, 28, 26, 26, 23, 22, 21, 19, 18, 17,
16, 15, 14, 13, 13, 12, 12, 11, 11, 11, 10, 10, 10, 10, 33, 32, 32, 32,
32, 32, 31, 30, 30, 30, 28, 27, 25, 23, 22, 21, 19, 19, 17, 16, 16, 14,
@@ -7635,32 +7607,45 @@
12, 12, 11, 11, 11, 10, 10, 10, 9, 9, 8, 8, 8, 8, 7, 7, 7, 6, 6, 6, 6,
6, 6, 6, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 11, 11, 10, 10, 9,
9, 9, 9, 8, 8, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 5,
+ /* Size 32x16 */
+ 32, 33, 32, 32, 28, 26, 23, 19, 18, 16, 13, 13, 12, 11, 11, 11, 33, 32,
+ 32, 32, 29, 27, 24, 20, 19, 17, 14, 13, 12, 12, 12, 11, 33, 32, 32, 32,
+ 29, 27, 25, 20, 19, 17, 14, 14, 13, 12, 12, 11, 33, 32, 32, 31, 30, 28,
+ 25, 21, 19, 17, 14, 14, 13, 12, 12, 12, 32, 32, 31, 30, 28, 26, 24, 20,
+ 19, 17, 14, 14, 13, 13, 12, 12, 32, 32, 30, 30, 28, 26, 24, 21, 19, 18,
+ 15, 14, 13, 13, 12, 12, 32, 31, 30, 29, 27, 26, 24, 21, 20, 18, 15, 15,
+ 13, 13, 12, 12, 30, 30, 29, 28, 24, 23, 21, 19, 18, 16, 14, 14, 13, 13,
+ 13, 12, 30, 30, 28, 28, 23, 22, 20, 18, 17, 16, 14, 13, 13, 12, 12, 12,
+ 28, 30, 28, 27, 21, 20, 19, 17, 16, 15, 13, 13, 12, 12, 12, 12, 26, 28,
+ 26, 26, 20, 19, 18, 16, 15, 14, 12, 12, 12, 12, 11, 12, 26, 27, 26, 25,
+ 20, 19, 17, 15, 15, 14, 12, 12, 11, 11, 11, 11, 23, 25, 24, 24, 19, 18,
+ 16, 14, 14, 13, 11, 11, 11, 11, 11, 11, 22, 23, 23, 22, 18, 17, 16, 14,
+ 13, 12, 11, 11, 10, 10, 10, 10, 21, 22, 22, 22, 18, 17, 15, 13, 13, 12,
+ 11, 10, 10, 10, 10, 10, 19, 21, 20, 20, 17, 16, 14, 12, 12, 11, 10, 10,
+ 9, 9, 10, 9, 18, 19, 19, 19, 16, 15, 14, 12, 11, 11, 9, 9, 9, 9, 9, 9,
+ 17, 19, 19, 19, 16, 15, 14, 12, 11, 10, 9, 9, 9, 9, 9, 9, 16, 17, 17,
+ 18, 15, 14, 13, 11, 11, 10, 9, 9, 8, 8, 8, 9, 15, 16, 17, 17, 14, 13,
+ 12, 11, 10, 9, 8, 8, 8, 8, 8, 8, 14, 16, 16, 16, 14, 13, 12, 11, 10, 9,
+ 8, 8, 8, 8, 8, 8, 13, 14, 14, 15, 13, 12, 11, 10, 9, 9, 8, 8, 7, 8, 8,
+ 7, 13, 14, 14, 14, 13, 12, 11, 10, 9, 9, 8, 7, 7, 7, 7, 7, 12, 14, 14,
+ 14, 13, 12, 11, 10, 9, 8, 8, 7, 7, 7, 7, 7, 12, 13, 13, 13, 12, 11, 11,
+ 9, 9, 8, 7, 7, 7, 7, 7, 7, 11, 12, 13, 13, 12, 12, 10, 9, 9, 8, 8, 7, 7,
+ 7, 6, 6, 11, 12, 12, 13, 12, 11, 10, 10, 9, 8, 8, 7, 7, 6, 6, 6, 11, 12,
+ 12, 12, 12, 11, 10, 10, 9, 8, 7, 7, 7, 6, 6, 6, 10, 12, 12, 12, 12, 11,
+ 10, 9, 8, 8, 7, 7, 6, 6, 6, 6, 10, 11, 11, 12, 11, 10, 10, 9, 9, 8, 7,
+ 7, 6, 6, 6, 6, 10, 11, 11, 11, 11, 10, 10, 9, 9, 8, 7, 7, 6, 6, 6, 6,
+ 10, 11, 11, 11, 11, 10, 10, 9, 9, 8, 8, 7, 7, 6, 6, 5,
/* Size 4x16 */
- 33, 26, 16, 11, 32, 27, 17, 12, 32, 26, 17, 13, 31, 26, 18, 13, 30, 22,
- 16, 12, 28, 19, 14, 12, 25, 18, 13, 11, 22, 17, 12, 10, 19, 15, 11, 9,
- 17, 14, 10, 8, 16, 13, 9, 8, 14, 12, 9, 7, 13, 11, 8, 7, 12, 11, 8, 6,
- 12, 11, 8, 6, 11, 10, 8, 6,
- /* Size 16x4 */
33, 32, 32, 31, 30, 28, 25, 22, 19, 17, 16, 14, 13, 12, 12, 11, 26, 27,
26, 26, 22, 19, 18, 17, 15, 14, 13, 12, 11, 11, 11, 10, 16, 17, 17, 18,
16, 14, 13, 12, 11, 10, 9, 9, 8, 8, 8, 8, 11, 12, 13, 13, 12, 12, 11,
10, 9, 8, 8, 7, 7, 6, 6, 6,
+ /* Size 16x4 */
+ 33, 26, 16, 11, 32, 27, 17, 12, 32, 26, 17, 13, 31, 26, 18, 13, 30, 22,
+ 16, 12, 28, 19, 14, 12, 25, 18, 13, 11, 22, 17, 12, 10, 19, 15, 11, 9,
+ 17, 14, 10, 8, 16, 13, 9, 8, 14, 12, 9, 7, 13, 11, 8, 7, 12, 11, 8, 6,
+ 12, 11, 8, 6, 11, 10, 8, 6,
/* Size 8x32 */
- 32, 32, 28, 23, 18, 13, 12, 11, 33, 32, 29, 24, 19, 14, 12, 12, 33, 32,
- 29, 25, 19, 14, 13, 12, 33, 32, 30, 25, 19, 14, 13, 12, 32, 31, 28, 24,
- 19, 14, 13, 12, 32, 30, 28, 24, 19, 15, 13, 12, 32, 30, 27, 24, 20, 15,
- 13, 12, 30, 29, 24, 21, 18, 14, 13, 13, 30, 28, 23, 20, 17, 14, 13, 12,
- 28, 28, 21, 19, 16, 13, 12, 12, 26, 26, 20, 18, 15, 12, 12, 11, 26, 26,
- 20, 17, 15, 12, 11, 11, 23, 24, 19, 16, 14, 11, 11, 11, 22, 23, 18, 16,
- 13, 11, 10, 10, 21, 22, 18, 15, 13, 11, 10, 10, 19, 20, 17, 14, 12, 10,
- 9, 10, 18, 19, 16, 14, 11, 9, 9, 9, 17, 19, 16, 14, 11, 9, 9, 9, 16, 17,
- 15, 13, 11, 9, 8, 8, 15, 17, 14, 12, 10, 8, 8, 8, 14, 16, 14, 12, 10, 8,
- 8, 8, 13, 14, 13, 11, 9, 8, 7, 8, 13, 14, 13, 11, 9, 8, 7, 7, 12, 14,
- 13, 11, 9, 8, 7, 7, 12, 13, 12, 11, 9, 7, 7, 7, 11, 13, 12, 10, 9, 8, 7,
- 6, 11, 12, 12, 10, 9, 8, 7, 6, 11, 12, 12, 10, 9, 7, 7, 6, 10, 12, 12,
- 10, 8, 7, 6, 6, 10, 11, 11, 10, 9, 7, 6, 6, 10, 11, 11, 10, 9, 7, 6, 6,
- 10, 11, 11, 10, 9, 8, 7, 6,
- /* Size 32x8 */
32, 33, 33, 33, 32, 32, 32, 30, 30, 28, 26, 26, 23, 22, 21, 19, 18, 17,
16, 15, 14, 13, 13, 12, 12, 11, 11, 11, 10, 10, 10, 10, 32, 32, 32, 32,
31, 30, 30, 29, 28, 28, 26, 26, 24, 23, 22, 20, 19, 19, 17, 17, 16, 14,
@@ -7674,7 +7659,22 @@
8, 8, 7, 7, 7, 7, 8, 12, 12, 13, 13, 13, 13, 13, 13, 13, 12, 12, 11, 11,
10, 10, 9, 9, 9, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 7, 11, 12, 12,
12, 12, 12, 12, 13, 12, 12, 11, 11, 11, 10, 10, 10, 9, 9, 8, 8, 8, 8, 7,
- 7, 7, 6, 6, 6, 6, 6, 6, 6 },
+ 7, 7, 6, 6, 6, 6, 6, 6, 6,
+ /* Size 32x8 */
+ 32, 32, 28, 23, 18, 13, 12, 11, 33, 32, 29, 24, 19, 14, 12, 12, 33, 32,
+ 29, 25, 19, 14, 13, 12, 33, 32, 30, 25, 19, 14, 13, 12, 32, 31, 28, 24,
+ 19, 14, 13, 12, 32, 30, 28, 24, 19, 15, 13, 12, 32, 30, 27, 24, 20, 15,
+ 13, 12, 30, 29, 24, 21, 18, 14, 13, 13, 30, 28, 23, 20, 17, 14, 13, 12,
+ 28, 28, 21, 19, 16, 13, 12, 12, 26, 26, 20, 18, 15, 12, 12, 11, 26, 26,
+ 20, 17, 15, 12, 11, 11, 23, 24, 19, 16, 14, 11, 11, 11, 22, 23, 18, 16,
+ 13, 11, 10, 10, 21, 22, 18, 15, 13, 11, 10, 10, 19, 20, 17, 14, 12, 10,
+ 9, 10, 18, 19, 16, 14, 11, 9, 9, 9, 17, 19, 16, 14, 11, 9, 9, 9, 16, 17,
+ 15, 13, 11, 9, 8, 8, 15, 17, 14, 12, 10, 8, 8, 8, 14, 16, 14, 12, 10, 8,
+ 8, 8, 13, 14, 13, 11, 9, 8, 7, 8, 13, 14, 13, 11, 9, 8, 7, 7, 12, 14,
+ 13, 11, 9, 8, 7, 7, 12, 13, 12, 11, 9, 7, 7, 7, 11, 13, 12, 10, 9, 8, 7,
+ 6, 11, 12, 12, 10, 9, 8, 7, 6, 11, 12, 12, 10, 9, 7, 7, 6, 10, 12, 12,
+ 10, 8, 7, 6, 6, 10, 11, 11, 10, 9, 7, 6, 6, 10, 11, 11, 10, 9, 7, 6, 6,
+ 10, 11, 11, 10, 9, 8, 7, 6 },
{ /* Chroma */
/* Size 4x4 */
32, 23, 19, 16, 23, 19, 17, 15, 19, 17, 13, 12, 16, 15, 12, 10,
@@ -7758,21 +7758,12 @@
10, 10, 14, 15, 15, 16, 16, 16, 16, 17, 17, 16, 16, 15, 15, 14, 14, 13,
13, 13, 13, 12, 12, 12, 11, 11, 11, 10, 10, 10, 10, 10, 10, 9,
/* Size 4x8 */
- 33, 22, 19, 16, 27, 22, 20, 17, 22, 19, 18, 17, 22, 18, 16, 14, 20, 17,
- 14, 13, 18, 16, 12, 12, 17, 16, 12, 11, 16, 15, 12, 10,
- /* Size 8x4 */
33, 27, 22, 22, 20, 18, 17, 16, 22, 22, 19, 18, 17, 16, 16, 15, 19, 20,
18, 16, 14, 12, 12, 12, 16, 17, 17, 14, 13, 12, 11, 10,
+ /* Size 8x4 */
+ 33, 22, 19, 16, 27, 22, 20, 17, 22, 19, 18, 17, 22, 18, 16, 14, 20, 17,
+ 14, 13, 18, 16, 12, 12, 17, 16, 12, 11, 16, 15, 12, 10,
/* Size 8x16 */
- 32, 30, 21, 21, 19, 16, 15, 15, 33, 28, 22, 22, 20, 18, 17, 16, 31, 26,
- 22, 22, 21, 18, 17, 17, 28, 23, 22, 23, 21, 19, 18, 17, 23, 22, 20, 20,
- 19, 17, 17, 17, 21, 22, 19, 18, 18, 16, 16, 16, 21, 23, 19, 18, 17, 15,
- 15, 15, 20, 22, 19, 17, 16, 14, 14, 14, 19, 21, 19, 17, 15, 13, 13, 13,
- 18, 20, 18, 16, 14, 12, 12, 13, 17, 19, 18, 16, 14, 12, 12, 12, 16, 18,
- 17, 15, 13, 12, 11, 12, 16, 17, 16, 15, 13, 11, 11, 11, 15, 17, 16, 14,
- 13, 12, 11, 10, 15, 16, 16, 15, 13, 12, 11, 10, 14, 16, 16, 15, 13, 12,
- 11, 10,
- /* Size 16x8 */
32, 33, 31, 28, 23, 21, 21, 20, 19, 18, 17, 16, 16, 15, 15, 14, 30, 28,
26, 23, 22, 22, 23, 22, 21, 20, 19, 18, 17, 17, 16, 16, 21, 22, 22, 22,
20, 19, 19, 19, 19, 18, 18, 17, 16, 16, 16, 16, 21, 22, 22, 23, 20, 18,
@@ -7781,37 +7772,16 @@
12, 12, 11, 12, 12, 12, 15, 17, 17, 18, 17, 16, 15, 14, 13, 12, 12, 11,
11, 11, 11, 11, 15, 16, 17, 17, 17, 16, 15, 14, 13, 13, 12, 12, 11, 10,
10, 10,
+ /* Size 16x8 */
+ 32, 30, 21, 21, 19, 16, 15, 15, 33, 28, 22, 22, 20, 18, 17, 16, 31, 26,
+ 22, 22, 21, 18, 17, 17, 28, 23, 22, 23, 21, 19, 18, 17, 23, 22, 20, 20,
+ 19, 17, 17, 17, 21, 22, 19, 18, 18, 16, 16, 16, 21, 23, 19, 18, 17, 15,
+ 15, 15, 20, 22, 19, 17, 16, 14, 14, 14, 19, 21, 19, 17, 15, 13, 13, 13,
+ 18, 20, 18, 16, 14, 12, 12, 13, 17, 19, 18, 16, 14, 12, 12, 12, 16, 18,
+ 17, 15, 13, 12, 11, 12, 16, 17, 16, 15, 13, 11, 11, 11, 15, 17, 16, 14,
+ 13, 12, 11, 10, 15, 16, 16, 15, 13, 12, 11, 10, 14, 16, 16, 15, 13, 12,
+ 11, 10,
/* Size 16x32 */
- 32, 33, 30, 28, 21, 21, 21, 20, 19, 18, 16, 16, 15, 15, 15, 15, 33, 33,
- 29, 27, 22, 22, 22, 20, 20, 19, 17, 17, 16, 16, 16, 16, 33, 32, 28, 26,
- 22, 22, 22, 21, 20, 19, 18, 17, 17, 16, 16, 16, 34, 32, 28, 26, 22, 23,
- 23, 21, 21, 20, 18, 18, 17, 17, 17, 16, 31, 28, 26, 24, 22, 22, 22, 22,
- 21, 20, 18, 18, 17, 17, 17, 16, 29, 27, 24, 23, 22, 22, 23, 22, 21, 20,
- 19, 18, 18, 17, 17, 17, 28, 26, 23, 22, 22, 22, 23, 22, 21, 20, 19, 19,
- 18, 18, 17, 17, 24, 24, 23, 22, 20, 20, 21, 20, 20, 19, 18, 18, 17, 18,
- 17, 17, 23, 23, 22, 22, 20, 20, 20, 20, 19, 19, 17, 17, 17, 17, 17, 17,
- 21, 22, 22, 21, 19, 19, 19, 19, 19, 18, 17, 17, 16, 17, 17, 16, 21, 22,
- 22, 22, 19, 19, 18, 18, 18, 17, 16, 16, 16, 16, 16, 16, 21, 23, 22, 22,
- 19, 19, 18, 18, 17, 17, 16, 16, 16, 16, 16, 16, 21, 23, 23, 22, 19, 18,
- 18, 17, 17, 16, 15, 15, 15, 15, 15, 16, 20, 22, 22, 22, 19, 18, 17, 16,
- 16, 16, 15, 14, 15, 14, 15, 15, 20, 22, 22, 22, 19, 18, 17, 16, 16, 15,
- 14, 14, 14, 14, 14, 15, 20, 21, 22, 22, 19, 18, 17, 16, 15, 14, 14, 14,
- 13, 14, 14, 14, 19, 21, 21, 21, 19, 18, 17, 15, 15, 14, 13, 13, 13, 13,
- 13, 14, 19, 20, 21, 21, 19, 17, 17, 15, 15, 14, 13, 13, 13, 13, 13, 13,
- 18, 20, 20, 20, 18, 17, 16, 15, 14, 13, 12, 12, 12, 12, 13, 13, 17, 19,
- 20, 20, 18, 17, 16, 14, 14, 13, 12, 12, 12, 12, 12, 12, 17, 19, 19, 20,
- 18, 17, 16, 14, 14, 13, 12, 12, 12, 12, 12, 12, 16, 18, 18, 19, 17, 16,
- 15, 14, 13, 12, 12, 11, 11, 12, 12, 12, 16, 18, 18, 19, 17, 16, 15, 14,
- 13, 12, 12, 11, 11, 11, 12, 12, 16, 17, 18, 18, 17, 16, 15, 14, 13, 12,
- 11, 11, 11, 11, 11, 11, 16, 17, 17, 18, 16, 16, 15, 13, 13, 12, 11, 11,
- 11, 11, 11, 11, 15, 17, 17, 18, 16, 16, 15, 14, 13, 12, 12, 11, 11, 11,
- 11, 11, 15, 17, 17, 17, 16, 16, 14, 14, 13, 12, 12, 11, 11, 11, 10, 11,
- 15, 16, 17, 17, 16, 16, 14, 14, 13, 12, 12, 11, 11, 10, 10, 10, 15, 16,
- 16, 17, 16, 16, 15, 14, 13, 13, 12, 11, 11, 10, 10, 10, 14, 16, 16, 17,
- 16, 15, 15, 14, 13, 12, 12, 11, 11, 10, 10, 10, 14, 16, 16, 17, 16, 15,
- 15, 14, 13, 12, 12, 11, 11, 10, 10, 10, 14, 16, 16, 16, 16, 15, 15, 13,
- 13, 12, 12, 11, 11, 10, 10, 10,
- /* Size 32x16 */
32, 33, 33, 34, 31, 29, 28, 24, 23, 21, 21, 21, 21, 20, 20, 20, 19, 19,
18, 17, 17, 16, 16, 16, 16, 15, 15, 15, 15, 14, 14, 14, 33, 33, 32, 32,
28, 27, 26, 24, 23, 22, 22, 23, 23, 22, 22, 21, 21, 20, 20, 19, 19, 18,
@@ -7841,33 +7811,47 @@
12, 12, 12, 11, 11, 11, 10, 10, 10, 10, 10, 10, 15, 16, 16, 16, 16, 17,
17, 17, 17, 16, 16, 16, 16, 15, 15, 14, 14, 13, 13, 12, 12, 12, 12, 11,
11, 11, 11, 10, 10, 10, 10, 10,
+ /* Size 32x16 */
+ 32, 33, 30, 28, 21, 21, 21, 20, 19, 18, 16, 16, 15, 15, 15, 15, 33, 33,
+ 29, 27, 22, 22, 22, 20, 20, 19, 17, 17, 16, 16, 16, 16, 33, 32, 28, 26,
+ 22, 22, 22, 21, 20, 19, 18, 17, 17, 16, 16, 16, 34, 32, 28, 26, 22, 23,
+ 23, 21, 21, 20, 18, 18, 17, 17, 17, 16, 31, 28, 26, 24, 22, 22, 22, 22,
+ 21, 20, 18, 18, 17, 17, 17, 16, 29, 27, 24, 23, 22, 22, 23, 22, 21, 20,
+ 19, 18, 18, 17, 17, 17, 28, 26, 23, 22, 22, 22, 23, 22, 21, 20, 19, 19,
+ 18, 18, 17, 17, 24, 24, 23, 22, 20, 20, 21, 20, 20, 19, 18, 18, 17, 18,
+ 17, 17, 23, 23, 22, 22, 20, 20, 20, 20, 19, 19, 17, 17, 17, 17, 17, 17,
+ 21, 22, 22, 21, 19, 19, 19, 19, 19, 18, 17, 17, 16, 17, 17, 16, 21, 22,
+ 22, 22, 19, 19, 18, 18, 18, 17, 16, 16, 16, 16, 16, 16, 21, 23, 22, 22,
+ 19, 19, 18, 18, 17, 17, 16, 16, 16, 16, 16, 16, 21, 23, 23, 22, 19, 18,
+ 18, 17, 17, 16, 15, 15, 15, 15, 15, 16, 20, 22, 22, 22, 19, 18, 17, 16,
+ 16, 16, 15, 14, 15, 14, 15, 15, 20, 22, 22, 22, 19, 18, 17, 16, 16, 15,
+ 14, 14, 14, 14, 14, 15, 20, 21, 22, 22, 19, 18, 17, 16, 15, 14, 14, 14,
+ 13, 14, 14, 14, 19, 21, 21, 21, 19, 18, 17, 15, 15, 14, 13, 13, 13, 13,
+ 13, 14, 19, 20, 21, 21, 19, 17, 17, 15, 15, 14, 13, 13, 13, 13, 13, 13,
+ 18, 20, 20, 20, 18, 17, 16, 15, 14, 13, 12, 12, 12, 12, 13, 13, 17, 19,
+ 20, 20, 18, 17, 16, 14, 14, 13, 12, 12, 12, 12, 12, 12, 17, 19, 19, 20,
+ 18, 17, 16, 14, 14, 13, 12, 12, 12, 12, 12, 12, 16, 18, 18, 19, 17, 16,
+ 15, 14, 13, 12, 12, 11, 11, 12, 12, 12, 16, 18, 18, 19, 17, 16, 15, 14,
+ 13, 12, 12, 11, 11, 11, 12, 12, 16, 17, 18, 18, 17, 16, 15, 14, 13, 12,
+ 11, 11, 11, 11, 11, 11, 16, 17, 17, 18, 16, 16, 15, 13, 13, 12, 11, 11,
+ 11, 11, 11, 11, 15, 17, 17, 18, 16, 16, 15, 14, 13, 12, 12, 11, 11, 11,
+ 11, 11, 15, 17, 17, 17, 16, 16, 14, 14, 13, 12, 12, 11, 11, 11, 10, 11,
+ 15, 16, 17, 17, 16, 16, 14, 14, 13, 12, 12, 11, 11, 10, 10, 10, 15, 16,
+ 16, 17, 16, 16, 15, 14, 13, 13, 12, 11, 11, 10, 10, 10, 14, 16, 16, 17,
+ 16, 15, 15, 14, 13, 12, 12, 11, 11, 10, 10, 10, 14, 16, 16, 17, 16, 15,
+ 15, 14, 13, 12, 12, 11, 11, 10, 10, 10, 14, 16, 16, 16, 16, 15, 15, 13,
+ 13, 12, 12, 11, 11, 10, 10, 10,
/* Size 4x16 */
- 33, 21, 18, 15, 32, 22, 19, 16, 28, 22, 20, 17, 26, 22, 20, 18, 23, 20,
- 19, 17, 22, 19, 17, 16, 23, 18, 16, 15, 22, 18, 15, 14, 21, 18, 14, 13,
- 20, 17, 13, 12, 19, 17, 13, 12, 18, 16, 12, 11, 17, 16, 12, 11, 17, 16,
- 12, 11, 16, 16, 13, 10, 16, 15, 12, 10,
- /* Size 16x4 */
33, 32, 28, 26, 23, 22, 23, 22, 21, 20, 19, 18, 17, 17, 16, 16, 21, 22,
22, 22, 20, 19, 18, 18, 18, 17, 17, 16, 16, 16, 16, 15, 18, 19, 20, 20,
19, 17, 16, 15, 14, 13, 13, 12, 12, 12, 13, 12, 15, 16, 17, 18, 17, 16,
15, 14, 13, 12, 12, 11, 11, 11, 10, 10,
+ /* Size 16x4 */
+ 33, 21, 18, 15, 32, 22, 19, 16, 28, 22, 20, 17, 26, 22, 20, 18, 23, 20,
+ 19, 17, 22, 19, 17, 16, 23, 18, 16, 15, 22, 18, 15, 14, 21, 18, 14, 13,
+ 20, 17, 13, 12, 19, 17, 13, 12, 18, 16, 12, 11, 17, 16, 12, 11, 17, 16,
+ 12, 11, 16, 16, 13, 10, 16, 15, 12, 10,
/* Size 8x32 */
- 32, 30, 21, 21, 19, 16, 15, 15, 33, 29, 22, 22, 20, 17, 16, 16, 33, 28,
- 22, 22, 20, 18, 17, 16, 34, 28, 22, 23, 21, 18, 17, 17, 31, 26, 22, 22,
- 21, 18, 17, 17, 29, 24, 22, 23, 21, 19, 18, 17, 28, 23, 22, 23, 21, 19,
- 18, 17, 24, 23, 20, 21, 20, 18, 17, 17, 23, 22, 20, 20, 19, 17, 17, 17,
- 21, 22, 19, 19, 19, 17, 16, 17, 21, 22, 19, 18, 18, 16, 16, 16, 21, 22,
- 19, 18, 17, 16, 16, 16, 21, 23, 19, 18, 17, 15, 15, 15, 20, 22, 19, 17,
- 16, 15, 15, 15, 20, 22, 19, 17, 16, 14, 14, 14, 20, 22, 19, 17, 15, 14,
- 13, 14, 19, 21, 19, 17, 15, 13, 13, 13, 19, 21, 19, 17, 15, 13, 13, 13,
- 18, 20, 18, 16, 14, 12, 12, 13, 17, 20, 18, 16, 14, 12, 12, 12, 17, 19,
- 18, 16, 14, 12, 12, 12, 16, 18, 17, 15, 13, 12, 11, 12, 16, 18, 17, 15,
- 13, 12, 11, 12, 16, 18, 17, 15, 13, 11, 11, 11, 16, 17, 16, 15, 13, 11,
- 11, 11, 15, 17, 16, 15, 13, 12, 11, 11, 15, 17, 16, 14, 13, 12, 11, 10,
- 15, 17, 16, 14, 13, 12, 11, 10, 15, 16, 16, 15, 13, 12, 11, 10, 14, 16,
- 16, 15, 13, 12, 11, 10, 14, 16, 16, 15, 13, 12, 11, 10, 14, 16, 16, 15,
- 13, 12, 11, 10,
- /* Size 32x8 */
32, 33, 33, 34, 31, 29, 28, 24, 23, 21, 21, 21, 21, 20, 20, 20, 19, 19,
18, 17, 17, 16, 16, 16, 16, 15, 15, 15, 15, 14, 14, 14, 30, 29, 28, 28,
26, 24, 23, 23, 22, 22, 22, 22, 23, 22, 22, 22, 21, 21, 20, 20, 19, 18,
@@ -7882,7 +7866,23 @@
18, 17, 17, 16, 16, 16, 15, 15, 14, 13, 13, 13, 12, 12, 12, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 15, 16, 16, 17, 17, 17, 17, 17, 17, 17,
16, 16, 15, 15, 14, 14, 13, 13, 13, 12, 12, 12, 12, 11, 11, 11, 10, 10,
- 10, 10, 10, 10 },
+ 10, 10, 10, 10,
+ /* Size 32x8 */
+ 32, 30, 21, 21, 19, 16, 15, 15, 33, 29, 22, 22, 20, 17, 16, 16, 33, 28,
+ 22, 22, 20, 18, 17, 16, 34, 28, 22, 23, 21, 18, 17, 17, 31, 26, 22, 22,
+ 21, 18, 17, 17, 29, 24, 22, 23, 21, 19, 18, 17, 28, 23, 22, 23, 21, 19,
+ 18, 17, 24, 23, 20, 21, 20, 18, 17, 17, 23, 22, 20, 20, 19, 17, 17, 17,
+ 21, 22, 19, 19, 19, 17, 16, 17, 21, 22, 19, 18, 18, 16, 16, 16, 21, 22,
+ 19, 18, 17, 16, 16, 16, 21, 23, 19, 18, 17, 15, 15, 15, 20, 22, 19, 17,
+ 16, 15, 15, 15, 20, 22, 19, 17, 16, 14, 14, 14, 20, 22, 19, 17, 15, 14,
+ 13, 14, 19, 21, 19, 17, 15, 13, 13, 13, 19, 21, 19, 17, 15, 13, 13, 13,
+ 18, 20, 18, 16, 14, 12, 12, 13, 17, 20, 18, 16, 14, 12, 12, 12, 17, 19,
+ 18, 16, 14, 12, 12, 12, 16, 18, 17, 15, 13, 12, 11, 12, 16, 18, 17, 15,
+ 13, 12, 11, 12, 16, 18, 17, 15, 13, 11, 11, 11, 16, 17, 16, 15, 13, 11,
+ 11, 11, 15, 17, 16, 15, 13, 12, 11, 11, 15, 17, 16, 14, 13, 12, 11, 10,
+ 15, 17, 16, 14, 13, 12, 11, 10, 15, 16, 16, 15, 13, 12, 11, 10, 14, 16,
+ 16, 15, 13, 12, 11, 10, 14, 16, 16, 15, 13, 12, 11, 10, 14, 16, 16, 15,
+ 13, 12, 11, 10 },
},
{
{ /* Luma */
@@ -7964,20 +7964,12 @@
7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 10, 11, 11, 11, 11, 12, 12, 12, 12, 11,
11, 11, 10, 10, 10, 9, 9, 9, 9, 8, 8, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6,
/* Size 4x8 */
- 32, 29, 17, 12, 32, 28, 18, 13, 30, 22, 16, 12, 25, 19, 13, 11, 20, 17,
- 11, 9, 16, 14, 9, 8, 14, 13, 9, 7, 12, 11, 9, 7,
- /* Size 8x4 */
32, 32, 30, 25, 20, 16, 14, 12, 29, 28, 22, 19, 17, 14, 13, 11, 17, 18,
16, 13, 11, 9, 9, 9, 12, 13, 12, 11, 9, 8, 7, 7,
+ /* Size 8x4 */
+ 32, 29, 17, 12, 32, 28, 18, 13, 30, 22, 16, 12, 25, 19, 13, 11, 20, 17,
+ 11, 9, 16, 14, 9, 8, 14, 13, 9, 7, 12, 11, 9, 7,
/* Size 8x16 */
- 32, 33, 29, 23, 19, 16, 12, 11, 33, 32, 30, 25, 20, 17, 13, 12, 33, 31,
- 29, 24, 21, 17, 14, 13, 32, 30, 28, 24, 21, 18, 14, 13, 30, 29, 25, 21,
- 19, 16, 13, 13, 28, 28, 22, 19, 17, 15, 13, 12, 25, 26, 21, 17, 15, 13,
- 12, 11, 22, 23, 19, 16, 14, 12, 11, 10, 19, 20, 18, 14, 12, 11, 10, 9,
- 18, 19, 17, 14, 12, 10, 9, 9, 16, 17, 16, 13, 11, 10, 9, 8, 14, 15, 14,
- 12, 10, 9, 8, 8, 12, 14, 13, 11, 10, 9, 7, 7, 12, 13, 12, 11, 9, 8, 7,
- 7, 11, 12, 12, 11, 9, 8, 7, 7, 11, 12, 12, 11, 9, 8, 7, 6,
- /* Size 16x8 */
32, 33, 33, 32, 30, 28, 25, 22, 19, 18, 16, 14, 12, 12, 11, 11, 33, 32,
31, 30, 29, 28, 26, 23, 20, 19, 17, 15, 14, 13, 12, 12, 29, 30, 29, 28,
25, 22, 21, 19, 18, 17, 16, 14, 13, 12, 12, 12, 23, 25, 24, 24, 21, 19,
@@ -7985,35 +7977,15 @@
12, 12, 11, 10, 10, 9, 9, 9, 16, 17, 17, 18, 16, 15, 13, 12, 11, 10, 10,
9, 9, 8, 8, 8, 12, 13, 14, 14, 13, 13, 12, 11, 10, 9, 9, 8, 7, 7, 7, 7,
11, 12, 13, 13, 13, 12, 11, 10, 9, 9, 8, 8, 7, 7, 7, 6,
+ /* Size 16x8 */
+ 32, 33, 29, 23, 19, 16, 12, 11, 33, 32, 30, 25, 20, 17, 13, 12, 33, 31,
+ 29, 24, 21, 17, 14, 13, 32, 30, 28, 24, 21, 18, 14, 13, 30, 29, 25, 21,
+ 19, 16, 13, 13, 28, 28, 22, 19, 17, 15, 13, 12, 25, 26, 21, 17, 15, 13,
+ 12, 11, 22, 23, 19, 16, 14, 12, 11, 10, 19, 20, 18, 14, 12, 11, 10, 9,
+ 18, 19, 17, 14, 12, 10, 9, 9, 16, 17, 16, 13, 11, 10, 9, 8, 14, 15, 14,
+ 12, 10, 9, 8, 8, 12, 14, 13, 11, 10, 9, 7, 7, 12, 13, 12, 11, 9, 8, 7,
+ 7, 11, 12, 12, 11, 9, 8, 7, 7, 11, 12, 12, 11, 9, 8, 7, 6,
/* Size 16x32 */
- 32, 33, 33, 32, 29, 28, 23, 22, 19, 17, 16, 13, 12, 12, 11, 11, 33, 32,
- 32, 32, 29, 29, 24, 23, 20, 17, 17, 14, 13, 12, 12, 12, 33, 32, 32, 32,
- 30, 29, 25, 23, 20, 18, 17, 14, 13, 12, 12, 12, 33, 32, 32, 31, 30, 30,
- 25, 23, 21, 18, 17, 14, 14, 13, 12, 12, 33, 32, 31, 30, 29, 28, 24, 23,
- 21, 18, 17, 14, 14, 13, 13, 12, 32, 32, 31, 30, 28, 28, 24, 23, 20, 18,
- 17, 14, 14, 13, 13, 12, 32, 31, 30, 29, 28, 27, 24, 23, 21, 18, 18, 15,
- 14, 13, 13, 12, 32, 31, 30, 28, 26, 26, 23, 22, 20, 18, 17, 14, 14, 13,
- 13, 13, 30, 30, 29, 28, 25, 24, 21, 20, 19, 17, 16, 14, 13, 13, 13, 13,
- 29, 30, 28, 27, 23, 22, 20, 19, 17, 16, 15, 13, 13, 12, 12, 12, 28, 30,
- 28, 27, 22, 21, 19, 18, 17, 16, 15, 13, 13, 12, 12, 12, 26, 28, 26, 26,
- 21, 20, 18, 17, 16, 14, 14, 12, 12, 12, 12, 11, 25, 26, 26, 25, 21, 20,
- 17, 17, 15, 14, 13, 12, 12, 11, 11, 11, 23, 25, 24, 24, 20, 19, 16, 16,
- 14, 13, 13, 11, 11, 11, 11, 11, 22, 23, 23, 23, 19, 18, 16, 15, 14, 12,
- 12, 11, 11, 10, 10, 10, 21, 23, 23, 22, 19, 18, 15, 15, 13, 12, 12, 11,
- 10, 10, 10, 10, 19, 21, 20, 20, 18, 17, 14, 14, 12, 11, 11, 10, 10, 10,
- 9, 10, 19, 20, 20, 20, 17, 17, 14, 13, 12, 11, 11, 10, 9, 9, 9, 9, 18,
- 19, 19, 19, 17, 16, 14, 13, 12, 11, 10, 9, 9, 9, 9, 9, 16, 18, 18, 18,
- 16, 15, 13, 12, 11, 10, 10, 9, 9, 9, 9, 8, 16, 17, 17, 18, 16, 15, 13,
- 12, 11, 10, 10, 9, 9, 8, 8, 8, 14, 16, 16, 16, 14, 14, 12, 12, 11, 9, 9,
- 8, 8, 8, 8, 8, 14, 15, 15, 16, 14, 14, 12, 11, 10, 9, 9, 8, 8, 8, 8, 8,
- 13, 14, 14, 15, 13, 13, 11, 11, 10, 9, 9, 8, 8, 7, 7, 7, 12, 14, 14, 14,
- 13, 13, 11, 11, 10, 9, 9, 8, 7, 7, 7, 7, 12, 14, 14, 14, 13, 13, 11, 11,
- 10, 9, 8, 8, 7, 7, 7, 7, 12, 13, 13, 13, 12, 12, 11, 10, 9, 9, 8, 7, 7,
- 7, 7, 7, 12, 12, 13, 13, 12, 12, 11, 10, 9, 9, 8, 7, 7, 7, 7, 6, 11, 12,
- 12, 13, 12, 12, 11, 10, 9, 9, 8, 8, 7, 7, 7, 6, 11, 12, 12, 12, 12, 11,
- 11, 10, 9, 9, 8, 8, 7, 7, 6, 6, 11, 12, 12, 12, 12, 11, 11, 10, 9, 8, 8,
- 7, 7, 6, 6, 6, 10, 11, 11, 12, 12, 11, 11, 9, 9, 8, 8, 7, 7, 6, 6, 6,
- /* Size 32x16 */
32, 33, 33, 33, 33, 32, 32, 32, 30, 29, 28, 26, 25, 23, 22, 21, 19, 19,
18, 16, 16, 14, 14, 13, 12, 12, 12, 12, 11, 11, 11, 10, 33, 32, 32, 32,
32, 32, 31, 31, 30, 30, 30, 28, 26, 25, 23, 23, 21, 20, 19, 18, 17, 16,
@@ -8041,32 +8013,45 @@
12, 12, 13, 13, 13, 13, 13, 12, 12, 12, 11, 11, 10, 10, 9, 9, 9, 9, 8,
8, 8, 7, 7, 7, 7, 7, 7, 6, 6, 6, 11, 12, 12, 12, 12, 12, 12, 13, 13, 12,
12, 11, 11, 11, 10, 10, 10, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 6, 6, 6, 6, 6,
+ /* Size 32x16 */
+ 32, 33, 33, 32, 29, 28, 23, 22, 19, 17, 16, 13, 12, 12, 11, 11, 33, 32,
+ 32, 32, 29, 29, 24, 23, 20, 17, 17, 14, 13, 12, 12, 12, 33, 32, 32, 32,
+ 30, 29, 25, 23, 20, 18, 17, 14, 13, 12, 12, 12, 33, 32, 32, 31, 30, 30,
+ 25, 23, 21, 18, 17, 14, 14, 13, 12, 12, 33, 32, 31, 30, 29, 28, 24, 23,
+ 21, 18, 17, 14, 14, 13, 13, 12, 32, 32, 31, 30, 28, 28, 24, 23, 20, 18,
+ 17, 14, 14, 13, 13, 12, 32, 31, 30, 29, 28, 27, 24, 23, 21, 18, 18, 15,
+ 14, 13, 13, 12, 32, 31, 30, 28, 26, 26, 23, 22, 20, 18, 17, 14, 14, 13,
+ 13, 13, 30, 30, 29, 28, 25, 24, 21, 20, 19, 17, 16, 14, 13, 13, 13, 13,
+ 29, 30, 28, 27, 23, 22, 20, 19, 17, 16, 15, 13, 13, 12, 12, 12, 28, 30,
+ 28, 27, 22, 21, 19, 18, 17, 16, 15, 13, 13, 12, 12, 12, 26, 28, 26, 26,
+ 21, 20, 18, 17, 16, 14, 14, 12, 12, 12, 12, 11, 25, 26, 26, 25, 21, 20,
+ 17, 17, 15, 14, 13, 12, 12, 11, 11, 11, 23, 25, 24, 24, 20, 19, 16, 16,
+ 14, 13, 13, 11, 11, 11, 11, 11, 22, 23, 23, 23, 19, 18, 16, 15, 14, 12,
+ 12, 11, 11, 10, 10, 10, 21, 23, 23, 22, 19, 18, 15, 15, 13, 12, 12, 11,
+ 10, 10, 10, 10, 19, 21, 20, 20, 18, 17, 14, 14, 12, 11, 11, 10, 10, 10,
+ 9, 10, 19, 20, 20, 20, 17, 17, 14, 13, 12, 11, 11, 10, 9, 9, 9, 9, 18,
+ 19, 19, 19, 17, 16, 14, 13, 12, 11, 10, 9, 9, 9, 9, 9, 16, 18, 18, 18,
+ 16, 15, 13, 12, 11, 10, 10, 9, 9, 9, 9, 8, 16, 17, 17, 18, 16, 15, 13,
+ 12, 11, 10, 10, 9, 9, 8, 8, 8, 14, 16, 16, 16, 14, 14, 12, 12, 11, 9, 9,
+ 8, 8, 8, 8, 8, 14, 15, 15, 16, 14, 14, 12, 11, 10, 9, 9, 8, 8, 8, 8, 8,
+ 13, 14, 14, 15, 13, 13, 11, 11, 10, 9, 9, 8, 8, 7, 7, 7, 12, 14, 14, 14,
+ 13, 13, 11, 11, 10, 9, 9, 8, 7, 7, 7, 7, 12, 14, 14, 14, 13, 13, 11, 11,
+ 10, 9, 8, 8, 7, 7, 7, 7, 12, 13, 13, 13, 12, 12, 11, 10, 9, 9, 8, 7, 7,
+ 7, 7, 7, 12, 12, 13, 13, 12, 12, 11, 10, 9, 9, 8, 7, 7, 7, 7, 6, 11, 12,
+ 12, 13, 12, 12, 11, 10, 9, 9, 8, 8, 7, 7, 7, 6, 11, 12, 12, 12, 12, 11,
+ 11, 10, 9, 9, 8, 8, 7, 7, 6, 6, 11, 12, 12, 12, 12, 11, 11, 10, 9, 8, 8,
+ 7, 7, 6, 6, 6, 10, 11, 11, 12, 12, 11, 11, 9, 9, 8, 8, 7, 7, 6, 6, 6,
/* Size 4x16 */
- 33, 28, 17, 12, 32, 29, 18, 12, 32, 28, 18, 13, 31, 27, 18, 13, 30, 24,
- 17, 13, 30, 21, 16, 12, 26, 20, 14, 11, 23, 18, 12, 10, 21, 17, 11, 10,
- 19, 16, 11, 9, 17, 15, 10, 8, 15, 14, 9, 8, 14, 13, 9, 7, 13, 12, 9, 7,
- 12, 12, 9, 7, 12, 11, 8, 6,
- /* Size 16x4 */
33, 32, 32, 31, 30, 30, 26, 23, 21, 19, 17, 15, 14, 13, 12, 12, 28, 29,
28, 27, 24, 21, 20, 18, 17, 16, 15, 14, 13, 12, 12, 11, 17, 18, 18, 18,
17, 16, 14, 12, 11, 11, 10, 9, 9, 9, 9, 8, 12, 12, 13, 13, 13, 12, 11,
10, 10, 9, 8, 8, 7, 7, 7, 6,
+ /* Size 16x4 */
+ 33, 28, 17, 12, 32, 29, 18, 12, 32, 28, 18, 13, 31, 27, 18, 13, 30, 24,
+ 17, 13, 30, 21, 16, 12, 26, 20, 14, 11, 23, 18, 12, 10, 21, 17, 11, 10,
+ 19, 16, 11, 9, 17, 15, 10, 8, 15, 14, 9, 8, 14, 13, 9, 7, 13, 12, 9, 7,
+ 12, 12, 9, 7, 12, 11, 8, 6,
/* Size 8x32 */
- 32, 33, 29, 23, 19, 16, 12, 11, 33, 32, 29, 24, 20, 17, 13, 12, 33, 32,
- 30, 25, 20, 17, 13, 12, 33, 32, 30, 25, 21, 17, 14, 12, 33, 31, 29, 24,
- 21, 17, 14, 13, 32, 31, 28, 24, 20, 17, 14, 13, 32, 30, 28, 24, 21, 18,
- 14, 13, 32, 30, 26, 23, 20, 17, 14, 13, 30, 29, 25, 21, 19, 16, 13, 13,
- 29, 28, 23, 20, 17, 15, 13, 12, 28, 28, 22, 19, 17, 15, 13, 12, 26, 26,
- 21, 18, 16, 14, 12, 12, 25, 26, 21, 17, 15, 13, 12, 11, 23, 24, 20, 16,
- 14, 13, 11, 11, 22, 23, 19, 16, 14, 12, 11, 10, 21, 23, 19, 15, 13, 12,
- 10, 10, 19, 20, 18, 14, 12, 11, 10, 9, 19, 20, 17, 14, 12, 11, 9, 9, 18,
- 19, 17, 14, 12, 10, 9, 9, 16, 18, 16, 13, 11, 10, 9, 9, 16, 17, 16, 13,
- 11, 10, 9, 8, 14, 16, 14, 12, 11, 9, 8, 8, 14, 15, 14, 12, 10, 9, 8, 8,
- 13, 14, 13, 11, 10, 9, 8, 7, 12, 14, 13, 11, 10, 9, 7, 7, 12, 14, 13,
- 11, 10, 8, 7, 7, 12, 13, 12, 11, 9, 8, 7, 7, 12, 13, 12, 11, 9, 8, 7, 7,
- 11, 12, 12, 11, 9, 8, 7, 7, 11, 12, 12, 11, 9, 8, 7, 6, 11, 12, 12, 11,
- 9, 8, 7, 6, 10, 11, 12, 11, 9, 8, 7, 6,
- /* Size 32x8 */
32, 33, 33, 33, 33, 32, 32, 32, 30, 29, 28, 26, 25, 23, 22, 21, 19, 19,
18, 16, 16, 14, 14, 13, 12, 12, 12, 12, 11, 11, 11, 10, 33, 32, 32, 32,
31, 31, 30, 30, 29, 28, 28, 26, 26, 24, 23, 23, 20, 20, 19, 18, 17, 16,
@@ -8080,7 +8065,22 @@
9, 9, 9, 9, 8, 8, 8, 8, 8, 8, 8, 12, 13, 13, 14, 14, 14, 14, 14, 13, 13,
13, 12, 12, 11, 11, 10, 10, 9, 9, 9, 9, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7,
11, 12, 12, 12, 13, 13, 13, 13, 13, 12, 12, 12, 11, 11, 10, 10, 9, 9, 9,
- 9, 8, 8, 8, 7, 7, 7, 7, 7, 7, 6, 6, 6 },
+ 9, 8, 8, 8, 7, 7, 7, 7, 7, 7, 6, 6, 6,
+ /* Size 32x8 */
+ 32, 33, 29, 23, 19, 16, 12, 11, 33, 32, 29, 24, 20, 17, 13, 12, 33, 32,
+ 30, 25, 20, 17, 13, 12, 33, 32, 30, 25, 21, 17, 14, 12, 33, 31, 29, 24,
+ 21, 17, 14, 13, 32, 31, 28, 24, 20, 17, 14, 13, 32, 30, 28, 24, 21, 18,
+ 14, 13, 32, 30, 26, 23, 20, 17, 14, 13, 30, 29, 25, 21, 19, 16, 13, 13,
+ 29, 28, 23, 20, 17, 15, 13, 12, 28, 28, 22, 19, 17, 15, 13, 12, 26, 26,
+ 21, 18, 16, 14, 12, 12, 25, 26, 21, 17, 15, 13, 12, 11, 23, 24, 20, 16,
+ 14, 13, 11, 11, 22, 23, 19, 16, 14, 12, 11, 10, 21, 23, 19, 15, 13, 12,
+ 10, 10, 19, 20, 18, 14, 12, 11, 10, 9, 19, 20, 17, 14, 12, 11, 9, 9, 18,
+ 19, 17, 14, 12, 10, 9, 9, 16, 18, 16, 13, 11, 10, 9, 9, 16, 17, 16, 13,
+ 11, 10, 9, 8, 14, 16, 14, 12, 11, 9, 8, 8, 14, 15, 14, 12, 10, 9, 8, 8,
+ 13, 14, 13, 11, 10, 9, 8, 7, 12, 14, 13, 11, 10, 9, 7, 7, 12, 14, 13,
+ 11, 10, 8, 7, 7, 12, 13, 12, 11, 9, 8, 7, 7, 12, 13, 12, 11, 9, 8, 7, 7,
+ 11, 12, 12, 11, 9, 8, 7, 7, 11, 12, 12, 11, 9, 8, 7, 6, 11, 12, 12, 11,
+ 9, 8, 7, 6, 10, 11, 12, 11, 9, 8, 7, 6 },
{ /* Chroma */
/* Size 4x4 */
32, 23, 20, 17, 23, 19, 17, 16, 20, 17, 14, 13, 17, 16, 13, 11,
@@ -8164,21 +8164,12 @@
10, 10, 14, 15, 15, 16, 16, 17, 17, 17, 17, 16, 16, 15, 15, 15, 15, 14,
14, 13, 13, 12, 12, 12, 12, 11, 11, 11, 11, 10, 10, 10, 10, 10,
/* Size 4x8 */
- 33, 22, 19, 16, 28, 22, 20, 17, 22, 20, 19, 17, 23, 19, 16, 15, 21, 19,
- 14, 13, 19, 18, 13, 12, 17, 17, 13, 11, 16, 16, 13, 11,
- /* Size 8x4 */
33, 28, 22, 23, 21, 19, 17, 16, 22, 22, 20, 19, 19, 18, 17, 16, 19, 20,
19, 16, 14, 13, 13, 13, 16, 17, 17, 15, 13, 12, 11, 11,
+ /* Size 8x4 */
+ 33, 22, 19, 16, 28, 22, 20, 17, 22, 20, 19, 17, 23, 19, 16, 15, 21, 19,
+ 14, 13, 19, 18, 13, 12, 17, 17, 13, 11, 16, 16, 13, 11,
/* Size 8x16 */
- 32, 31, 23, 21, 20, 18, 16, 15, 33, 30, 23, 22, 21, 19, 17, 16, 31, 28,
- 22, 23, 22, 20, 18, 17, 28, 24, 22, 23, 22, 20, 19, 17, 24, 23, 21, 21,
- 20, 19, 18, 17, 21, 22, 20, 19, 19, 18, 17, 16, 21, 22, 20, 18, 17, 17,
- 16, 15, 20, 22, 20, 17, 16, 16, 14, 14, 20, 22, 19, 17, 16, 14, 14, 14,
- 19, 21, 19, 17, 15, 14, 13, 13, 18, 20, 19, 16, 15, 13, 12, 12, 17, 19,
- 18, 16, 14, 13, 12, 12, 16, 18, 17, 15, 14, 12, 11, 11, 16, 17, 17, 15,
- 13, 12, 11, 11, 15, 17, 17, 15, 13, 12, 11, 11, 15, 16, 17, 15, 14, 12,
- 11, 10,
- /* Size 16x8 */
32, 33, 31, 28, 24, 21, 21, 20, 20, 19, 18, 17, 16, 16, 15, 15, 31, 30,
28, 24, 23, 22, 22, 22, 22, 21, 20, 19, 18, 17, 17, 16, 23, 23, 22, 22,
21, 20, 20, 20, 19, 19, 19, 18, 17, 17, 17, 17, 21, 22, 23, 23, 21, 19,
@@ -8187,37 +8178,16 @@
13, 13, 12, 12, 12, 12, 16, 17, 18, 19, 18, 17, 16, 14, 14, 13, 12, 12,
11, 11, 11, 11, 15, 16, 17, 17, 17, 16, 15, 14, 14, 13, 12, 12, 11, 11,
11, 10,
+ /* Size 16x8 */
+ 32, 31, 23, 21, 20, 18, 16, 15, 33, 30, 23, 22, 21, 19, 17, 16, 31, 28,
+ 22, 23, 22, 20, 18, 17, 28, 24, 22, 23, 22, 20, 19, 17, 24, 23, 21, 21,
+ 20, 19, 18, 17, 21, 22, 20, 19, 19, 18, 17, 16, 21, 22, 20, 18, 17, 17,
+ 16, 15, 20, 22, 20, 17, 16, 16, 14, 14, 20, 22, 19, 17, 16, 14, 14, 14,
+ 19, 21, 19, 17, 15, 14, 13, 13, 18, 20, 19, 16, 15, 13, 12, 12, 17, 19,
+ 18, 16, 14, 13, 12, 12, 16, 18, 17, 15, 14, 12, 11, 11, 16, 17, 17, 15,
+ 13, 12, 11, 11, 15, 17, 17, 15, 13, 12, 11, 11, 15, 16, 17, 15, 14, 12,
+ 11, 10,
/* Size 16x32 */
- 32, 33, 31, 28, 23, 21, 21, 20, 20, 18, 18, 16, 16, 15, 15, 15, 33, 33,
- 30, 27, 23, 22, 22, 21, 20, 19, 19, 17, 17, 16, 16, 16, 33, 32, 30, 26,
- 23, 22, 22, 22, 21, 20, 19, 17, 17, 17, 16, 16, 34, 32, 29, 26, 23, 22,
- 23, 22, 21, 20, 20, 18, 18, 17, 17, 17, 31, 29, 28, 24, 22, 22, 23, 22,
- 22, 20, 20, 18, 18, 17, 17, 17, 31, 28, 27, 24, 22, 22, 22, 22, 22, 20,
- 20, 18, 18, 17, 17, 17, 28, 26, 24, 22, 22, 22, 23, 22, 22, 21, 20, 19,
- 19, 18, 17, 17, 26, 25, 24, 22, 21, 21, 22, 22, 21, 20, 20, 19, 18, 18,
- 18, 17, 24, 24, 23, 22, 21, 20, 21, 20, 20, 19, 19, 18, 18, 17, 17, 17,
- 22, 22, 22, 21, 20, 20, 19, 19, 19, 19, 18, 17, 17, 17, 17, 17, 21, 22,
- 22, 21, 20, 19, 19, 19, 19, 18, 18, 17, 17, 16, 16, 17, 21, 22, 22, 22,
- 20, 19, 18, 18, 18, 17, 17, 16, 16, 16, 16, 16, 21, 23, 22, 22, 20, 19,
- 18, 18, 17, 17, 17, 16, 16, 16, 15, 16, 21, 23, 23, 22, 20, 19, 18, 17,
- 17, 16, 16, 15, 15, 15, 15, 15, 20, 22, 22, 22, 20, 19, 17, 17, 16, 16,
- 16, 15, 14, 15, 14, 15, 20, 22, 22, 22, 20, 19, 17, 17, 16, 16, 15, 14,
- 14, 14, 14, 14, 20, 21, 22, 22, 19, 19, 17, 16, 16, 15, 14, 14, 14, 14,
- 14, 14, 19, 21, 21, 21, 19, 19, 17, 16, 15, 14, 14, 13, 13, 13, 14, 13,
- 19, 20, 21, 21, 19, 19, 17, 16, 15, 14, 14, 13, 13, 13, 13, 13, 18, 20,
- 20, 20, 19, 18, 16, 16, 15, 14, 13, 13, 12, 13, 13, 13, 18, 20, 20, 20,
- 19, 18, 16, 16, 15, 14, 13, 12, 12, 12, 12, 13, 17, 19, 19, 20, 18, 18,
- 16, 15, 14, 13, 13, 12, 12, 12, 12, 12, 17, 18, 19, 19, 18, 17, 16, 15,
- 14, 13, 13, 12, 12, 12, 12, 12, 16, 18, 18, 19, 17, 17, 15, 15, 14, 13,
- 12, 12, 11, 11, 12, 12, 16, 18, 18, 18, 17, 17, 15, 14, 14, 13, 12, 11,
- 11, 11, 11, 12, 16, 17, 18, 18, 17, 17, 15, 14, 14, 13, 12, 11, 11, 11,
- 11, 11, 16, 17, 17, 18, 17, 16, 15, 14, 13, 12, 12, 11, 11, 11, 11, 11,
- 15, 17, 17, 18, 17, 16, 15, 15, 13, 13, 12, 11, 11, 11, 11, 11, 15, 17,
- 17, 17, 17, 16, 15, 14, 13, 13, 12, 12, 11, 11, 11, 10, 15, 16, 17, 17,
- 17, 16, 15, 14, 13, 13, 12, 12, 11, 11, 10, 10, 15, 16, 16, 17, 17, 16,
- 15, 14, 14, 13, 12, 12, 11, 11, 10, 10, 15, 16, 16, 17, 17, 15, 15, 14,
- 14, 12, 12, 11, 11, 10, 10, 10,
- /* Size 32x16 */
32, 33, 33, 34, 31, 31, 28, 26, 24, 22, 21, 21, 21, 21, 20, 20, 20, 19,
19, 18, 18, 17, 17, 16, 16, 16, 16, 15, 15, 15, 15, 15, 33, 33, 32, 32,
29, 28, 26, 25, 24, 22, 22, 22, 23, 23, 22, 22, 21, 21, 20, 20, 20, 19,
@@ -8247,33 +8217,47 @@
12, 12, 12, 12, 11, 11, 11, 11, 11, 10, 10, 10, 15, 16, 16, 17, 17, 17,
17, 17, 17, 17, 17, 16, 16, 15, 15, 14, 14, 13, 13, 13, 13, 12, 12, 12,
12, 11, 11, 11, 10, 10, 10, 10,
+ /* Size 32x16 */
+ 32, 33, 31, 28, 23, 21, 21, 20, 20, 18, 18, 16, 16, 15, 15, 15, 33, 33,
+ 30, 27, 23, 22, 22, 21, 20, 19, 19, 17, 17, 16, 16, 16, 33, 32, 30, 26,
+ 23, 22, 22, 22, 21, 20, 19, 17, 17, 17, 16, 16, 34, 32, 29, 26, 23, 22,
+ 23, 22, 21, 20, 20, 18, 18, 17, 17, 17, 31, 29, 28, 24, 22, 22, 23, 22,
+ 22, 20, 20, 18, 18, 17, 17, 17, 31, 28, 27, 24, 22, 22, 22, 22, 22, 20,
+ 20, 18, 18, 17, 17, 17, 28, 26, 24, 22, 22, 22, 23, 22, 22, 21, 20, 19,
+ 19, 18, 17, 17, 26, 25, 24, 22, 21, 21, 22, 22, 21, 20, 20, 19, 18, 18,
+ 18, 17, 24, 24, 23, 22, 21, 20, 21, 20, 20, 19, 19, 18, 18, 17, 17, 17,
+ 22, 22, 22, 21, 20, 20, 19, 19, 19, 19, 18, 17, 17, 17, 17, 17, 21, 22,
+ 22, 21, 20, 19, 19, 19, 19, 18, 18, 17, 17, 16, 16, 17, 21, 22, 22, 22,
+ 20, 19, 18, 18, 18, 17, 17, 16, 16, 16, 16, 16, 21, 23, 22, 22, 20, 19,
+ 18, 18, 17, 17, 17, 16, 16, 16, 15, 16, 21, 23, 23, 22, 20, 19, 18, 17,
+ 17, 16, 16, 15, 15, 15, 15, 15, 20, 22, 22, 22, 20, 19, 17, 17, 16, 16,
+ 16, 15, 14, 15, 14, 15, 20, 22, 22, 22, 20, 19, 17, 17, 16, 16, 15, 14,
+ 14, 14, 14, 14, 20, 21, 22, 22, 19, 19, 17, 16, 16, 15, 14, 14, 14, 14,
+ 14, 14, 19, 21, 21, 21, 19, 19, 17, 16, 15, 14, 14, 13, 13, 13, 14, 13,
+ 19, 20, 21, 21, 19, 19, 17, 16, 15, 14, 14, 13, 13, 13, 13, 13, 18, 20,
+ 20, 20, 19, 18, 16, 16, 15, 14, 13, 13, 12, 13, 13, 13, 18, 20, 20, 20,
+ 19, 18, 16, 16, 15, 14, 13, 12, 12, 12, 12, 13, 17, 19, 19, 20, 18, 18,
+ 16, 15, 14, 13, 13, 12, 12, 12, 12, 12, 17, 18, 19, 19, 18, 17, 16, 15,
+ 14, 13, 13, 12, 12, 12, 12, 12, 16, 18, 18, 19, 17, 17, 15, 15, 14, 13,
+ 12, 12, 11, 11, 12, 12, 16, 18, 18, 18, 17, 17, 15, 14, 14, 13, 12, 11,
+ 11, 11, 11, 12, 16, 17, 18, 18, 17, 17, 15, 14, 14, 13, 12, 11, 11, 11,
+ 11, 11, 16, 17, 17, 18, 17, 16, 15, 14, 13, 12, 12, 11, 11, 11, 11, 11,
+ 15, 17, 17, 18, 17, 16, 15, 15, 13, 13, 12, 11, 11, 11, 11, 11, 15, 17,
+ 17, 17, 17, 16, 15, 14, 13, 13, 12, 12, 11, 11, 11, 10, 15, 16, 17, 17,
+ 17, 16, 15, 14, 13, 13, 12, 12, 11, 11, 10, 10, 15, 16, 16, 17, 17, 16,
+ 15, 14, 14, 13, 12, 12, 11, 11, 10, 10, 15, 16, 16, 17, 17, 15, 15, 14,
+ 14, 12, 12, 11, 11, 10, 10, 10,
/* Size 4x16 */
- 33, 21, 18, 15, 32, 22, 20, 17, 29, 22, 20, 17, 26, 22, 21, 18, 24, 20,
- 19, 17, 22, 19, 18, 16, 23, 19, 17, 16, 22, 19, 16, 15, 21, 19, 15, 14,
- 20, 19, 14, 13, 20, 18, 14, 12, 18, 17, 13, 12, 18, 17, 13, 11, 17, 16,
- 12, 11, 17, 16, 13, 11, 16, 16, 13, 11,
- /* Size 16x4 */
33, 32, 29, 26, 24, 22, 23, 22, 21, 20, 20, 18, 18, 17, 17, 16, 21, 22,
22, 22, 20, 19, 19, 19, 19, 19, 18, 17, 17, 16, 16, 16, 18, 20, 20, 21,
19, 18, 17, 16, 15, 14, 14, 13, 13, 12, 13, 13, 15, 17, 17, 18, 17, 16,
16, 15, 14, 13, 12, 12, 11, 11, 11, 11,
+ /* Size 16x4 */
+ 33, 21, 18, 15, 32, 22, 20, 17, 29, 22, 20, 17, 26, 22, 21, 18, 24, 20,
+ 19, 17, 22, 19, 18, 16, 23, 19, 17, 16, 22, 19, 16, 15, 21, 19, 15, 14,
+ 20, 19, 14, 13, 20, 18, 14, 12, 18, 17, 13, 12, 18, 17, 13, 11, 17, 16,
+ 12, 11, 17, 16, 13, 11, 16, 16, 13, 11,
/* Size 8x32 */
- 32, 31, 23, 21, 20, 18, 16, 15, 33, 30, 23, 22, 20, 19, 17, 16, 33, 30,
- 23, 22, 21, 19, 17, 16, 34, 29, 23, 23, 21, 20, 18, 17, 31, 28, 22, 23,
- 22, 20, 18, 17, 31, 27, 22, 22, 22, 20, 18, 17, 28, 24, 22, 23, 22, 20,
- 19, 17, 26, 24, 21, 22, 21, 20, 18, 18, 24, 23, 21, 21, 20, 19, 18, 17,
- 22, 22, 20, 19, 19, 18, 17, 17, 21, 22, 20, 19, 19, 18, 17, 16, 21, 22,
- 20, 18, 18, 17, 16, 16, 21, 22, 20, 18, 17, 17, 16, 15, 21, 23, 20, 18,
- 17, 16, 15, 15, 20, 22, 20, 17, 16, 16, 14, 14, 20, 22, 20, 17, 16, 15,
- 14, 14, 20, 22, 19, 17, 16, 14, 14, 14, 19, 21, 19, 17, 15, 14, 13, 14,
- 19, 21, 19, 17, 15, 14, 13, 13, 18, 20, 19, 16, 15, 13, 12, 13, 18, 20,
- 19, 16, 15, 13, 12, 12, 17, 19, 18, 16, 14, 13, 12, 12, 17, 19, 18, 16,
- 14, 13, 12, 12, 16, 18, 17, 15, 14, 12, 11, 12, 16, 18, 17, 15, 14, 12,
- 11, 11, 16, 18, 17, 15, 14, 12, 11, 11, 16, 17, 17, 15, 13, 12, 11, 11,
- 15, 17, 17, 15, 13, 12, 11, 11, 15, 17, 17, 15, 13, 12, 11, 11, 15, 17,
- 17, 15, 13, 12, 11, 10, 15, 16, 17, 15, 14, 12, 11, 10, 15, 16, 17, 15,
- 14, 12, 11, 10,
- /* Size 32x8 */
32, 33, 33, 34, 31, 31, 28, 26, 24, 22, 21, 21, 21, 21, 20, 20, 20, 19,
19, 18, 18, 17, 17, 16, 16, 16, 16, 15, 15, 15, 15, 15, 31, 30, 30, 29,
28, 27, 24, 24, 23, 22, 22, 22, 22, 23, 22, 22, 22, 21, 21, 20, 20, 19,
@@ -8288,7 +8272,23 @@
19, 18, 18, 17, 17, 16, 16, 15, 14, 14, 14, 13, 13, 12, 12, 12, 12, 11,
11, 11, 11, 11, 11, 11, 11, 11, 15, 16, 16, 17, 17, 17, 17, 18, 17, 17,
16, 16, 15, 15, 14, 14, 14, 14, 13, 13, 12, 12, 12, 12, 11, 11, 11, 11,
- 11, 10, 10, 10 },
+ 11, 10, 10, 10,
+ /* Size 32x8 */
+ 32, 31, 23, 21, 20, 18, 16, 15, 33, 30, 23, 22, 20, 19, 17, 16, 33, 30,
+ 23, 22, 21, 19, 17, 16, 34, 29, 23, 23, 21, 20, 18, 17, 31, 28, 22, 23,
+ 22, 20, 18, 17, 31, 27, 22, 22, 22, 20, 18, 17, 28, 24, 22, 23, 22, 20,
+ 19, 17, 26, 24, 21, 22, 21, 20, 18, 18, 24, 23, 21, 21, 20, 19, 18, 17,
+ 22, 22, 20, 19, 19, 18, 17, 17, 21, 22, 20, 19, 19, 18, 17, 16, 21, 22,
+ 20, 18, 18, 17, 16, 16, 21, 22, 20, 18, 17, 17, 16, 15, 21, 23, 20, 18,
+ 17, 16, 15, 15, 20, 22, 20, 17, 16, 16, 14, 14, 20, 22, 20, 17, 16, 15,
+ 14, 14, 20, 22, 19, 17, 16, 14, 14, 14, 19, 21, 19, 17, 15, 14, 13, 14,
+ 19, 21, 19, 17, 15, 14, 13, 13, 18, 20, 19, 16, 15, 13, 12, 13, 18, 20,
+ 19, 16, 15, 13, 12, 12, 17, 19, 18, 16, 14, 13, 12, 12, 17, 19, 18, 16,
+ 14, 13, 12, 12, 16, 18, 17, 15, 14, 12, 11, 12, 16, 18, 17, 15, 14, 12,
+ 11, 11, 16, 18, 17, 15, 14, 12, 11, 11, 16, 17, 17, 15, 13, 12, 11, 11,
+ 15, 17, 17, 15, 13, 12, 11, 11, 15, 17, 17, 15, 13, 12, 11, 11, 15, 17,
+ 17, 15, 13, 12, 11, 10, 15, 16, 17, 15, 14, 12, 11, 10, 15, 16, 17, 15,
+ 14, 12, 11, 10 },
},
{
{ /* Luma */
@@ -8371,20 +8371,12 @@
7, 7, 7, 7, 7, 6, 6, 11, 12, 12, 12, 12, 12, 12, 13, 13, 12, 12, 11, 11,
11, 11, 10, 10, 9, 9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 6, 6, 6,
/* Size 4x8 */
- 32, 29, 20, 13, 32, 28, 20, 14, 30, 24, 19, 14, 27, 20, 15, 12, 21, 17,
- 13, 10, 17, 15, 11, 9, 14, 13, 10, 8, 13, 12, 9, 7,
- /* Size 8x4 */
32, 32, 30, 27, 21, 17, 14, 13, 29, 28, 24, 20, 17, 15, 13, 12, 20, 20,
19, 15, 13, 11, 10, 9, 13, 14, 14, 12, 10, 9, 8, 7,
+ /* Size 8x4 */
+ 32, 29, 20, 13, 32, 28, 20, 14, 30, 24, 19, 14, 27, 20, 15, 12, 21, 17,
+ 13, 10, 17, 15, 11, 9, 14, 13, 10, 8, 13, 12, 9, 7,
/* Size 8x16 */
- 32, 33, 31, 26, 20, 16, 13, 12, 33, 32, 31, 26, 21, 17, 14, 12, 33, 32,
- 30, 27, 22, 17, 14, 13, 32, 31, 28, 26, 21, 18, 15, 13, 31, 30, 27, 23,
- 20, 17, 14, 13, 28, 29, 24, 20, 18, 15, 13, 12, 26, 27, 23, 19, 16, 14,
- 12, 12, 23, 25, 22, 17, 15, 13, 11, 11, 21, 23, 20, 17, 14, 12, 11, 10,
- 19, 21, 19, 16, 13, 11, 10, 9, 18, 19, 18, 15, 12, 10, 9, 9, 16, 17, 16,
- 14, 11, 10, 9, 8, 14, 15, 15, 13, 11, 9, 8, 8, 13, 14, 14, 12, 10, 9, 8,
- 7, 12, 13, 13, 11, 10, 8, 7, 7, 11, 12, 13, 11, 10, 9, 7, 7,
- /* Size 16x8 */
32, 33, 33, 32, 31, 28, 26, 23, 21, 19, 18, 16, 14, 13, 12, 11, 33, 32,
32, 31, 30, 29, 27, 25, 23, 21, 19, 17, 15, 14, 13, 12, 31, 31, 30, 28,
27, 24, 23, 22, 20, 19, 18, 16, 15, 14, 13, 13, 26, 26, 27, 26, 23, 20,
@@ -8392,36 +8384,15 @@
14, 13, 12, 11, 11, 10, 10, 10, 16, 17, 17, 18, 17, 15, 14, 13, 12, 11,
10, 10, 9, 9, 8, 9, 13, 14, 14, 15, 14, 13, 12, 11, 11, 10, 9, 9, 8, 8,
7, 7, 12, 12, 13, 13, 13, 12, 12, 11, 10, 9, 9, 8, 8, 7, 7, 7,
+ /* Size 16x8 */
+ 32, 33, 31, 26, 20, 16, 13, 12, 33, 32, 31, 26, 21, 17, 14, 12, 33, 32,
+ 30, 27, 22, 17, 14, 13, 32, 31, 28, 26, 21, 18, 15, 13, 31, 30, 27, 23,
+ 20, 17, 14, 13, 28, 29, 24, 20, 18, 15, 13, 12, 26, 27, 23, 19, 16, 14,
+ 12, 12, 23, 25, 22, 17, 15, 13, 11, 11, 21, 23, 20, 17, 14, 12, 11, 10,
+ 19, 21, 19, 16, 13, 11, 10, 9, 18, 19, 18, 15, 12, 10, 9, 9, 16, 17, 16,
+ 14, 11, 10, 9, 8, 14, 15, 15, 13, 11, 9, 8, 8, 13, 14, 14, 12, 10, 9, 8,
+ 7, 12, 13, 13, 11, 10, 8, 7, 7, 11, 12, 13, 11, 10, 9, 7, 7,
/* Size 16x32 */
- 32, 33, 33, 32, 31, 28, 26, 23, 20, 19, 16, 16, 13, 13, 12, 11, 33, 32,
- 32, 32, 31, 29, 26, 24, 21, 20, 17, 16, 14, 13, 12, 12, 33, 32, 32, 32,
- 31, 29, 26, 24, 21, 20, 17, 17, 14, 13, 12, 12, 33, 32, 32, 31, 31, 30,
- 27, 25, 22, 21, 17, 17, 14, 14, 13, 13, 33, 32, 32, 31, 30, 29, 27, 25,
- 22, 21, 17, 17, 14, 14, 13, 13, 32, 32, 31, 30, 29, 28, 26, 24, 21, 20,
- 17, 17, 14, 14, 13, 13, 32, 32, 31, 29, 28, 28, 26, 24, 21, 21, 18, 17,
- 15, 14, 13, 13, 32, 31, 31, 29, 28, 27, 25, 24, 21, 21, 18, 17, 15, 15,
- 14, 13, 31, 31, 30, 28, 27, 25, 23, 22, 20, 19, 17, 16, 14, 14, 13, 13,
- 30, 30, 30, 28, 26, 24, 23, 21, 19, 19, 16, 16, 14, 14, 13, 12, 28, 30,
- 29, 27, 24, 21, 20, 19, 18, 17, 15, 15, 13, 13, 12, 12, 28, 29, 29, 27,
- 24, 21, 20, 19, 17, 17, 15, 15, 13, 13, 12, 12, 26, 28, 27, 26, 23, 20,
- 19, 18, 16, 16, 14, 14, 12, 12, 12, 12, 26, 27, 26, 25, 23, 20, 18, 17,
- 16, 15, 14, 13, 12, 12, 11, 11, 23, 25, 25, 24, 22, 19, 17, 16, 15, 14,
- 13, 13, 11, 11, 11, 11, 22, 24, 24, 23, 21, 19, 17, 16, 14, 14, 12, 12,
- 11, 11, 11, 10, 21, 23, 23, 22, 20, 18, 17, 15, 14, 13, 12, 12, 11, 10,
- 10, 10, 20, 21, 21, 21, 20, 17, 16, 15, 13, 13, 11, 11, 10, 10, 10, 10,
- 19, 21, 21, 20, 19, 17, 16, 14, 13, 12, 11, 11, 10, 10, 9, 10, 18, 19,
- 19, 19, 18, 16, 15, 14, 12, 12, 11, 10, 9, 9, 9, 9, 18, 19, 19, 19, 18,
- 16, 15, 14, 12, 12, 10, 10, 9, 9, 9, 9, 16, 17, 17, 18, 17, 15, 14, 13,
- 12, 11, 10, 10, 9, 9, 8, 8, 16, 17, 17, 17, 16, 15, 14, 13, 11, 11, 10,
- 10, 9, 8, 8, 8, 14, 16, 16, 16, 15, 14, 13, 12, 11, 11, 9, 9, 8, 8, 8,
- 8, 14, 15, 15, 16, 15, 14, 13, 12, 11, 10, 9, 9, 8, 8, 8, 8, 13, 14, 14,
- 15, 14, 13, 12, 11, 10, 10, 9, 9, 8, 8, 7, 7, 13, 14, 14, 14, 14, 13,
- 12, 11, 10, 10, 9, 8, 8, 7, 7, 7, 12, 14, 14, 14, 14, 13, 12, 11, 10,
- 10, 8, 8, 8, 7, 7, 7, 12, 13, 13, 14, 13, 12, 11, 11, 10, 9, 8, 8, 7, 7,
- 7, 7, 12, 13, 13, 13, 13, 12, 11, 10, 10, 9, 8, 8, 7, 7, 7, 7, 11, 12,
- 12, 13, 13, 12, 11, 10, 10, 9, 9, 8, 7, 7, 7, 7, 11, 12, 12, 13, 13, 11,
- 11, 10, 10, 9, 9, 8, 8, 7, 7, 6,
- /* Size 32x16 */
32, 33, 33, 33, 33, 32, 32, 32, 31, 30, 28, 28, 26, 26, 23, 22, 21, 20,
19, 18, 18, 16, 16, 14, 14, 13, 13, 12, 12, 12, 11, 11, 33, 32, 32, 32,
32, 32, 32, 31, 31, 30, 30, 29, 28, 27, 25, 24, 23, 21, 21, 19, 19, 17,
@@ -8450,32 +8421,46 @@
11, 11, 11, 10, 10, 9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 11, 12,
12, 13, 13, 13, 13, 13, 13, 12, 12, 12, 12, 11, 11, 10, 10, 10, 10, 9,
9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 6,
+ /* Size 32x16 */
+ 32, 33, 33, 32, 31, 28, 26, 23, 20, 19, 16, 16, 13, 13, 12, 11, 33, 32,
+ 32, 32, 31, 29, 26, 24, 21, 20, 17, 16, 14, 13, 12, 12, 33, 32, 32, 32,
+ 31, 29, 26, 24, 21, 20, 17, 17, 14, 13, 12, 12, 33, 32, 32, 31, 31, 30,
+ 27, 25, 22, 21, 17, 17, 14, 14, 13, 13, 33, 32, 32, 31, 30, 29, 27, 25,
+ 22, 21, 17, 17, 14, 14, 13, 13, 32, 32, 31, 30, 29, 28, 26, 24, 21, 20,
+ 17, 17, 14, 14, 13, 13, 32, 32, 31, 29, 28, 28, 26, 24, 21, 21, 18, 17,
+ 15, 14, 13, 13, 32, 31, 31, 29, 28, 27, 25, 24, 21, 21, 18, 17, 15, 15,
+ 14, 13, 31, 31, 30, 28, 27, 25, 23, 22, 20, 19, 17, 16, 14, 14, 13, 13,
+ 30, 30, 30, 28, 26, 24, 23, 21, 19, 19, 16, 16, 14, 14, 13, 12, 28, 30,
+ 29, 27, 24, 21, 20, 19, 18, 17, 15, 15, 13, 13, 12, 12, 28, 29, 29, 27,
+ 24, 21, 20, 19, 17, 17, 15, 15, 13, 13, 12, 12, 26, 28, 27, 26, 23, 20,
+ 19, 18, 16, 16, 14, 14, 12, 12, 12, 12, 26, 27, 26, 25, 23, 20, 18, 17,
+ 16, 15, 14, 13, 12, 12, 11, 11, 23, 25, 25, 24, 22, 19, 17, 16, 15, 14,
+ 13, 13, 11, 11, 11, 11, 22, 24, 24, 23, 21, 19, 17, 16, 14, 14, 12, 12,
+ 11, 11, 11, 10, 21, 23, 23, 22, 20, 18, 17, 15, 14, 13, 12, 12, 11, 10,
+ 10, 10, 20, 21, 21, 21, 20, 17, 16, 15, 13, 13, 11, 11, 10, 10, 10, 10,
+ 19, 21, 21, 20, 19, 17, 16, 14, 13, 12, 11, 11, 10, 10, 9, 10, 18, 19,
+ 19, 19, 18, 16, 15, 14, 12, 12, 11, 10, 9, 9, 9, 9, 18, 19, 19, 19, 18,
+ 16, 15, 14, 12, 12, 10, 10, 9, 9, 9, 9, 16, 17, 17, 18, 17, 15, 14, 13,
+ 12, 11, 10, 10, 9, 9, 8, 8, 16, 17, 17, 17, 16, 15, 14, 13, 11, 11, 10,
+ 10, 9, 8, 8, 8, 14, 16, 16, 16, 15, 14, 13, 12, 11, 11, 9, 9, 8, 8, 8,
+ 8, 14, 15, 15, 16, 15, 14, 13, 12, 11, 10, 9, 9, 8, 8, 8, 8, 13, 14, 14,
+ 15, 14, 13, 12, 11, 10, 10, 9, 9, 8, 8, 7, 7, 13, 14, 14, 14, 14, 13,
+ 12, 11, 10, 10, 9, 8, 8, 7, 7, 7, 12, 14, 14, 14, 14, 13, 12, 11, 10,
+ 10, 8, 8, 8, 7, 7, 7, 12, 13, 13, 14, 13, 12, 11, 11, 10, 9, 8, 8, 7, 7,
+ 7, 7, 12, 13, 13, 13, 13, 12, 11, 10, 10, 9, 8, 8, 7, 7, 7, 7, 11, 12,
+ 12, 13, 13, 12, 11, 10, 10, 9, 9, 8, 7, 7, 7, 7, 11, 12, 12, 13, 13, 11,
+ 11, 10, 10, 9, 9, 8, 8, 7, 7, 6,
/* Size 4x16 */
- 33, 28, 19, 13, 32, 29, 20, 13, 32, 29, 21, 14, 32, 28, 21, 14, 31, 25,
- 19, 14, 30, 21, 17, 13, 28, 20, 16, 12, 25, 19, 14, 11, 23, 18, 13, 10,
- 21, 17, 12, 10, 19, 16, 12, 9, 17, 15, 11, 8, 15, 14, 10, 8, 14, 13, 10,
- 7, 13, 12, 9, 7, 12, 12, 9, 7,
- /* Size 16x4 */
33, 32, 32, 32, 31, 30, 28, 25, 23, 21, 19, 17, 15, 14, 13, 12, 28, 29,
29, 28, 25, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 12, 19, 20, 21, 21,
19, 17, 16, 14, 13, 12, 12, 11, 10, 10, 9, 9, 13, 13, 14, 14, 14, 13,
12, 11, 10, 10, 9, 8, 8, 7, 7, 7,
+ /* Size 16x4 */
+ 33, 28, 19, 13, 32, 29, 20, 13, 32, 29, 21, 14, 32, 28, 21, 14, 31, 25,
+ 19, 14, 30, 21, 17, 13, 28, 20, 16, 12, 25, 19, 14, 11, 23, 18, 13, 10,
+ 21, 17, 12, 10, 19, 16, 12, 9, 17, 15, 11, 8, 15, 14, 10, 8, 14, 13, 10,
+ 7, 13, 12, 9, 7, 12, 12, 9, 7,
/* Size 8x32 */
- 32, 33, 31, 26, 20, 16, 13, 12, 33, 32, 31, 26, 21, 17, 14, 12, 33, 32,
- 31, 26, 21, 17, 14, 12, 33, 32, 31, 27, 22, 17, 14, 13, 33, 32, 30, 27,
- 22, 17, 14, 13, 32, 31, 29, 26, 21, 17, 14, 13, 32, 31, 28, 26, 21, 18,
- 15, 13, 32, 31, 28, 25, 21, 18, 15, 14, 31, 30, 27, 23, 20, 17, 14, 13,
- 30, 30, 26, 23, 19, 16, 14, 13, 28, 29, 24, 20, 18, 15, 13, 12, 28, 29,
- 24, 20, 17, 15, 13, 12, 26, 27, 23, 19, 16, 14, 12, 12, 26, 26, 23, 18,
- 16, 14, 12, 11, 23, 25, 22, 17, 15, 13, 11, 11, 22, 24, 21, 17, 14, 12,
- 11, 11, 21, 23, 20, 17, 14, 12, 11, 10, 20, 21, 20, 16, 13, 11, 10, 10,
- 19, 21, 19, 16, 13, 11, 10, 9, 18, 19, 18, 15, 12, 11, 9, 9, 18, 19, 18,
- 15, 12, 10, 9, 9, 16, 17, 17, 14, 12, 10, 9, 8, 16, 17, 16, 14, 11, 10,
- 9, 8, 14, 16, 15, 13, 11, 9, 8, 8, 14, 15, 15, 13, 11, 9, 8, 8, 13, 14,
- 14, 12, 10, 9, 8, 7, 13, 14, 14, 12, 10, 9, 8, 7, 12, 14, 14, 12, 10, 8,
- 8, 7, 12, 13, 13, 11, 10, 8, 7, 7, 12, 13, 13, 11, 10, 8, 7, 7, 11, 12,
- 13, 11, 10, 9, 7, 7, 11, 12, 13, 11, 10, 9, 8, 7,
- /* Size 32x8 */
32, 33, 33, 33, 33, 32, 32, 32, 31, 30, 28, 28, 26, 26, 23, 22, 21, 20,
19, 18, 18, 16, 16, 14, 14, 13, 13, 12, 12, 12, 11, 11, 33, 32, 32, 32,
32, 31, 31, 31, 30, 30, 29, 29, 27, 26, 25, 24, 23, 21, 21, 19, 19, 17,
@@ -8489,7 +8474,22 @@
10, 10, 10, 9, 9, 9, 9, 8, 8, 8, 9, 9, 13, 14, 14, 14, 14, 14, 15, 15,
14, 14, 13, 13, 12, 12, 11, 11, 11, 10, 10, 9, 9, 9, 9, 8, 8, 8, 8, 8,
7, 7, 7, 8, 12, 12, 12, 13, 13, 13, 13, 14, 13, 13, 12, 12, 12, 11, 11,
- 11, 10, 10, 9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7 },
+ 11, 10, 10, 9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7,
+ /* Size 32x8 */
+ 32, 33, 31, 26, 20, 16, 13, 12, 33, 32, 31, 26, 21, 17, 14, 12, 33, 32,
+ 31, 26, 21, 17, 14, 12, 33, 32, 31, 27, 22, 17, 14, 13, 33, 32, 30, 27,
+ 22, 17, 14, 13, 32, 31, 29, 26, 21, 17, 14, 13, 32, 31, 28, 26, 21, 18,
+ 15, 13, 32, 31, 28, 25, 21, 18, 15, 14, 31, 30, 27, 23, 20, 17, 14, 13,
+ 30, 30, 26, 23, 19, 16, 14, 13, 28, 29, 24, 20, 18, 15, 13, 12, 28, 29,
+ 24, 20, 17, 15, 13, 12, 26, 27, 23, 19, 16, 14, 12, 12, 26, 26, 23, 18,
+ 16, 14, 12, 11, 23, 25, 22, 17, 15, 13, 11, 11, 22, 24, 21, 17, 14, 12,
+ 11, 11, 21, 23, 20, 17, 14, 12, 11, 10, 20, 21, 20, 16, 13, 11, 10, 10,
+ 19, 21, 19, 16, 13, 11, 10, 9, 18, 19, 18, 15, 12, 11, 9, 9, 18, 19, 18,
+ 15, 12, 10, 9, 9, 16, 17, 17, 14, 12, 10, 9, 8, 16, 17, 16, 14, 11, 10,
+ 9, 8, 14, 16, 15, 13, 11, 9, 8, 8, 14, 15, 15, 13, 11, 9, 8, 8, 13, 14,
+ 14, 12, 10, 9, 8, 7, 13, 14, 14, 12, 10, 9, 8, 7, 12, 14, 14, 12, 10, 8,
+ 8, 7, 12, 13, 13, 11, 10, 8, 7, 7, 12, 13, 13, 11, 10, 8, 7, 7, 11, 12,
+ 13, 11, 10, 9, 7, 7, 11, 12, 13, 11, 10, 9, 8, 7 },
{ /* Chroma */
/* Size 4x4 */
32, 22, 21, 18, 22, 19, 19, 17, 21, 19, 15, 13, 18, 17, 13, 11,
@@ -8573,21 +8573,12 @@
10, 11, 15, 16, 16, 17, 17, 17, 17, 18, 17, 17, 17, 16, 16, 15, 15, 14,
14, 14, 14, 13, 13, 12, 12, 12, 12, 11, 11, 11, 11, 11, 11, 10,
/* Size 4x8 */
- 33, 22, 20, 17, 28, 22, 22, 18, 24, 20, 20, 18, 23, 19, 18, 16, 22, 19,
- 16, 14, 20, 18, 15, 12, 18, 17, 14, 11, 17, 16, 13, 11,
- /* Size 8x4 */
33, 28, 24, 23, 22, 20, 18, 17, 22, 22, 20, 19, 19, 18, 17, 16, 20, 22,
20, 18, 16, 15, 14, 13, 17, 18, 18, 16, 14, 12, 11, 11,
+ /* Size 8x4 */
+ 33, 22, 20, 17, 28, 22, 22, 18, 24, 20, 20, 18, 23, 19, 18, 16, 22, 19,
+ 16, 14, 20, 18, 15, 12, 18, 17, 14, 11, 17, 16, 13, 11,
/* Size 8x16 */
- 32, 32, 26, 21, 20, 18, 16, 15, 33, 31, 25, 22, 21, 19, 17, 16, 33, 29,
- 24, 22, 22, 20, 18, 17, 29, 26, 22, 22, 22, 20, 19, 18, 25, 24, 21, 21,
- 21, 20, 18, 17, 21, 22, 20, 19, 19, 18, 17, 17, 21, 22, 21, 19, 18, 17,
- 16, 16, 21, 23, 21, 18, 17, 16, 15, 15, 20, 22, 21, 18, 16, 15, 14, 14,
- 20, 21, 20, 18, 16, 14, 14, 13, 19, 20, 20, 17, 15, 14, 13, 13, 18, 20,
- 19, 17, 15, 13, 12, 12, 17, 19, 18, 16, 14, 13, 12, 12, 16, 18, 18, 16,
- 14, 12, 12, 11, 16, 17, 17, 16, 14, 12, 11, 11, 15, 17, 17, 16, 14, 13,
- 12, 11,
- /* Size 16x8 */
32, 33, 33, 29, 25, 21, 21, 21, 20, 20, 19, 18, 17, 16, 16, 15, 32, 31,
29, 26, 24, 22, 22, 23, 22, 21, 20, 20, 19, 18, 17, 17, 26, 25, 24, 22,
21, 20, 21, 21, 21, 20, 20, 19, 18, 18, 17, 17, 21, 22, 22, 22, 21, 19,
@@ -8596,37 +8587,16 @@
14, 13, 13, 12, 12, 13, 16, 17, 18, 19, 18, 17, 16, 15, 14, 14, 13, 12,
12, 12, 11, 12, 15, 16, 17, 18, 17, 17, 16, 15, 14, 13, 13, 12, 12, 11,
11, 11,
+ /* Size 16x8 */
+ 32, 32, 26, 21, 20, 18, 16, 15, 33, 31, 25, 22, 21, 19, 17, 16, 33, 29,
+ 24, 22, 22, 20, 18, 17, 29, 26, 22, 22, 22, 20, 19, 18, 25, 24, 21, 21,
+ 21, 20, 18, 17, 21, 22, 20, 19, 19, 18, 17, 17, 21, 22, 21, 19, 18, 17,
+ 16, 16, 21, 23, 21, 18, 17, 16, 15, 15, 20, 22, 21, 18, 16, 15, 14, 14,
+ 20, 21, 20, 18, 16, 14, 14, 13, 19, 20, 20, 17, 15, 14, 13, 13, 18, 20,
+ 19, 17, 15, 13, 12, 12, 17, 19, 18, 16, 14, 13, 12, 12, 16, 18, 18, 16,
+ 14, 12, 12, 11, 16, 17, 17, 16, 14, 12, 11, 11, 15, 17, 17, 16, 14, 13,
+ 12, 11,
/* Size 16x32 */
- 32, 33, 32, 28, 26, 21, 21, 21, 20, 20, 18, 18, 16, 16, 15, 15, 33, 33,
- 31, 27, 25, 22, 22, 22, 21, 20, 19, 19, 17, 17, 16, 16, 33, 33, 31, 27,
- 25, 22, 22, 22, 21, 21, 19, 19, 17, 17, 16, 16, 34, 32, 31, 26, 24, 22,
- 23, 23, 22, 21, 20, 20, 18, 18, 17, 17, 33, 31, 29, 25, 24, 22, 22, 23,
- 22, 21, 20, 20, 18, 18, 17, 17, 31, 28, 28, 24, 23, 22, 22, 22, 22, 22,
- 20, 20, 18, 18, 17, 17, 29, 27, 26, 23, 22, 22, 22, 23, 22, 22, 20, 20,
- 19, 18, 18, 17, 28, 26, 25, 22, 22, 22, 22, 23, 22, 22, 20, 20, 19, 19,
- 18, 18, 25, 24, 24, 22, 21, 21, 21, 21, 21, 20, 20, 19, 18, 18, 17, 18,
- 24, 24, 24, 22, 21, 20, 21, 21, 20, 20, 19, 19, 18, 18, 17, 17, 21, 22,
- 22, 21, 20, 19, 19, 19, 19, 19, 18, 18, 17, 17, 17, 17, 21, 22, 22, 21,
- 20, 19, 19, 19, 19, 19, 18, 18, 17, 17, 16, 16, 21, 22, 22, 22, 21, 19,
- 19, 18, 18, 18, 17, 17, 16, 16, 16, 16, 21, 23, 22, 22, 21, 19, 19, 18,
- 18, 18, 17, 17, 16, 16, 16, 15, 21, 23, 23, 22, 21, 19, 18, 18, 17, 17,
- 16, 16, 15, 15, 15, 15, 21, 22, 22, 22, 21, 19, 18, 17, 17, 17, 16, 16,
- 15, 15, 15, 15, 20, 22, 22, 22, 21, 19, 18, 17, 16, 16, 15, 15, 14, 14,
- 14, 14, 20, 22, 22, 22, 21, 19, 18, 17, 16, 16, 15, 15, 14, 14, 14, 14,
- 20, 21, 21, 22, 20, 19, 18, 17, 16, 16, 14, 14, 14, 14, 13, 14, 19, 20,
- 21, 21, 20, 19, 17, 17, 15, 15, 14, 14, 13, 13, 13, 13, 19, 20, 20, 21,
- 20, 19, 17, 17, 15, 15, 14, 14, 13, 13, 13, 13, 18, 20, 20, 20, 20, 18,
- 17, 16, 15, 15, 13, 13, 12, 12, 12, 12, 18, 20, 20, 20, 19, 18, 17, 16,
- 15, 14, 13, 13, 12, 12, 12, 12, 17, 19, 19, 20, 19, 18, 17, 16, 14, 14,
- 13, 13, 12, 12, 12, 12, 17, 18, 19, 19, 18, 17, 16, 16, 14, 14, 13, 13,
- 12, 12, 12, 12, 16, 18, 18, 19, 18, 17, 16, 15, 14, 14, 12, 12, 12, 11,
- 11, 11, 16, 18, 18, 19, 18, 17, 16, 15, 14, 14, 12, 12, 12, 11, 11, 11,
- 16, 17, 18, 18, 18, 17, 16, 15, 14, 14, 12, 12, 11, 11, 11, 11, 16, 17,
- 17, 18, 17, 17, 16, 15, 14, 13, 12, 12, 11, 11, 11, 11, 15, 17, 17, 18,
- 17, 16, 16, 15, 14, 13, 12, 12, 11, 11, 11, 11, 15, 17, 17, 18, 17, 16,
- 16, 14, 14, 13, 13, 12, 12, 11, 11, 11, 15, 17, 17, 17, 17, 16, 16, 14,
- 14, 13, 13, 12, 12, 11, 11, 10,
- /* Size 32x16 */
32, 33, 33, 34, 33, 31, 29, 28, 25, 24, 21, 21, 21, 21, 21, 21, 20, 20,
20, 19, 19, 18, 18, 17, 17, 16, 16, 16, 16, 15, 15, 15, 33, 33, 33, 32,
31, 28, 27, 26, 24, 24, 22, 22, 22, 23, 23, 22, 22, 22, 21, 20, 20, 20,
@@ -8656,33 +8626,47 @@
13, 12, 12, 12, 12, 11, 11, 11, 11, 11, 11, 11, 15, 16, 16, 17, 17, 17,
17, 18, 18, 17, 17, 16, 16, 15, 15, 15, 14, 14, 14, 13, 13, 12, 12, 12,
12, 11, 11, 11, 11, 11, 11, 10,
+ /* Size 32x16 */
+ 32, 33, 32, 28, 26, 21, 21, 21, 20, 20, 18, 18, 16, 16, 15, 15, 33, 33,
+ 31, 27, 25, 22, 22, 22, 21, 20, 19, 19, 17, 17, 16, 16, 33, 33, 31, 27,
+ 25, 22, 22, 22, 21, 21, 19, 19, 17, 17, 16, 16, 34, 32, 31, 26, 24, 22,
+ 23, 23, 22, 21, 20, 20, 18, 18, 17, 17, 33, 31, 29, 25, 24, 22, 22, 23,
+ 22, 21, 20, 20, 18, 18, 17, 17, 31, 28, 28, 24, 23, 22, 22, 22, 22, 22,
+ 20, 20, 18, 18, 17, 17, 29, 27, 26, 23, 22, 22, 22, 23, 22, 22, 20, 20,
+ 19, 18, 18, 17, 28, 26, 25, 22, 22, 22, 22, 23, 22, 22, 20, 20, 19, 19,
+ 18, 18, 25, 24, 24, 22, 21, 21, 21, 21, 21, 20, 20, 19, 18, 18, 17, 18,
+ 24, 24, 24, 22, 21, 20, 21, 21, 20, 20, 19, 19, 18, 18, 17, 17, 21, 22,
+ 22, 21, 20, 19, 19, 19, 19, 19, 18, 18, 17, 17, 17, 17, 21, 22, 22, 21,
+ 20, 19, 19, 19, 19, 19, 18, 18, 17, 17, 16, 16, 21, 22, 22, 22, 21, 19,
+ 19, 18, 18, 18, 17, 17, 16, 16, 16, 16, 21, 23, 22, 22, 21, 19, 19, 18,
+ 18, 18, 17, 17, 16, 16, 16, 15, 21, 23, 23, 22, 21, 19, 18, 18, 17, 17,
+ 16, 16, 15, 15, 15, 15, 21, 22, 22, 22, 21, 19, 18, 17, 17, 17, 16, 16,
+ 15, 15, 15, 15, 20, 22, 22, 22, 21, 19, 18, 17, 16, 16, 15, 15, 14, 14,
+ 14, 14, 20, 22, 22, 22, 21, 19, 18, 17, 16, 16, 15, 15, 14, 14, 14, 14,
+ 20, 21, 21, 22, 20, 19, 18, 17, 16, 16, 14, 14, 14, 14, 13, 14, 19, 20,
+ 21, 21, 20, 19, 17, 17, 15, 15, 14, 14, 13, 13, 13, 13, 19, 20, 20, 21,
+ 20, 19, 17, 17, 15, 15, 14, 14, 13, 13, 13, 13, 18, 20, 20, 20, 20, 18,
+ 17, 16, 15, 15, 13, 13, 12, 12, 12, 12, 18, 20, 20, 20, 19, 18, 17, 16,
+ 15, 14, 13, 13, 12, 12, 12, 12, 17, 19, 19, 20, 19, 18, 17, 16, 14, 14,
+ 13, 13, 12, 12, 12, 12, 17, 18, 19, 19, 18, 17, 16, 16, 14, 14, 13, 13,
+ 12, 12, 12, 12, 16, 18, 18, 19, 18, 17, 16, 15, 14, 14, 12, 12, 12, 11,
+ 11, 11, 16, 18, 18, 19, 18, 17, 16, 15, 14, 14, 12, 12, 12, 11, 11, 11,
+ 16, 17, 18, 18, 18, 17, 16, 15, 14, 14, 12, 12, 11, 11, 11, 11, 16, 17,
+ 17, 18, 17, 17, 16, 15, 14, 13, 12, 12, 11, 11, 11, 11, 15, 17, 17, 18,
+ 17, 16, 16, 15, 14, 13, 12, 12, 11, 11, 11, 11, 15, 17, 17, 18, 17, 16,
+ 16, 14, 14, 13, 13, 12, 12, 11, 11, 11, 15, 17, 17, 17, 17, 16, 16, 14,
+ 14, 13, 13, 12, 12, 11, 11, 10,
/* Size 4x16 */
- 33, 21, 20, 16, 33, 22, 21, 17, 31, 22, 21, 18, 27, 22, 22, 18, 24, 21,
- 20, 18, 22, 19, 19, 17, 22, 19, 18, 16, 23, 19, 17, 15, 22, 19, 16, 14,
- 21, 19, 16, 14, 20, 19, 15, 13, 20, 18, 14, 12, 18, 17, 14, 12, 18, 17,
- 14, 11, 17, 17, 13, 11, 17, 16, 13, 11,
- /* Size 16x4 */
33, 33, 31, 27, 24, 22, 22, 23, 22, 21, 20, 20, 18, 18, 17, 17, 21, 22,
22, 22, 21, 19, 19, 19, 19, 19, 19, 18, 17, 17, 17, 16, 20, 21, 21, 22,
20, 19, 18, 17, 16, 16, 15, 14, 14, 14, 13, 13, 16, 17, 18, 18, 18, 17,
16, 15, 14, 14, 13, 12, 12, 11, 11, 11,
+ /* Size 16x4 */
+ 33, 21, 20, 16, 33, 22, 21, 17, 31, 22, 21, 18, 27, 22, 22, 18, 24, 21,
+ 20, 18, 22, 19, 19, 17, 22, 19, 18, 16, 23, 19, 17, 15, 22, 19, 16, 14,
+ 21, 19, 16, 14, 20, 19, 15, 13, 20, 18, 14, 12, 18, 17, 14, 12, 18, 17,
+ 14, 11, 17, 17, 13, 11, 17, 16, 13, 11,
/* Size 8x32 */
- 32, 32, 26, 21, 20, 18, 16, 15, 33, 31, 25, 22, 21, 19, 17, 16, 33, 31,
- 25, 22, 21, 19, 17, 16, 34, 31, 24, 23, 22, 20, 18, 17, 33, 29, 24, 22,
- 22, 20, 18, 17, 31, 28, 23, 22, 22, 20, 18, 17, 29, 26, 22, 22, 22, 20,
- 19, 18, 28, 25, 22, 22, 22, 20, 19, 18, 25, 24, 21, 21, 21, 20, 18, 17,
- 24, 24, 21, 21, 20, 19, 18, 17, 21, 22, 20, 19, 19, 18, 17, 17, 21, 22,
- 20, 19, 19, 18, 17, 16, 21, 22, 21, 19, 18, 17, 16, 16, 21, 22, 21, 19,
- 18, 17, 16, 16, 21, 23, 21, 18, 17, 16, 15, 15, 21, 22, 21, 18, 17, 16,
- 15, 15, 20, 22, 21, 18, 16, 15, 14, 14, 20, 22, 21, 18, 16, 15, 14, 14,
- 20, 21, 20, 18, 16, 14, 14, 13, 19, 21, 20, 17, 15, 14, 13, 13, 19, 20,
- 20, 17, 15, 14, 13, 13, 18, 20, 20, 17, 15, 13, 12, 12, 18, 20, 19, 17,
- 15, 13, 12, 12, 17, 19, 19, 17, 14, 13, 12, 12, 17, 19, 18, 16, 14, 13,
- 12, 12, 16, 18, 18, 16, 14, 12, 12, 11, 16, 18, 18, 16, 14, 12, 12, 11,
- 16, 18, 18, 16, 14, 12, 11, 11, 16, 17, 17, 16, 14, 12, 11, 11, 15, 17,
- 17, 16, 14, 12, 11, 11, 15, 17, 17, 16, 14, 13, 12, 11, 15, 17, 17, 16,
- 14, 13, 12, 11,
- /* Size 32x8 */
32, 33, 33, 34, 33, 31, 29, 28, 25, 24, 21, 21, 21, 21, 21, 21, 20, 20,
20, 19, 19, 18, 18, 17, 17, 16, 16, 16, 16, 15, 15, 15, 32, 31, 31, 31,
29, 28, 26, 25, 24, 24, 22, 22, 22, 22, 23, 22, 22, 22, 21, 21, 20, 20,
@@ -8697,7 +8681,23 @@
19, 19, 18, 18, 17, 17, 16, 16, 15, 15, 14, 14, 14, 13, 13, 12, 12, 12,
12, 12, 12, 11, 11, 11, 12, 12, 15, 16, 16, 17, 17, 17, 18, 18, 17, 17,
17, 16, 16, 16, 15, 15, 14, 14, 13, 13, 13, 12, 12, 12, 12, 11, 11, 11,
- 11, 11, 11, 11 },
+ 11, 11, 11, 11,
+ /* Size 32x8 */
+ 32, 32, 26, 21, 20, 18, 16, 15, 33, 31, 25, 22, 21, 19, 17, 16, 33, 31,
+ 25, 22, 21, 19, 17, 16, 34, 31, 24, 23, 22, 20, 18, 17, 33, 29, 24, 22,
+ 22, 20, 18, 17, 31, 28, 23, 22, 22, 20, 18, 17, 29, 26, 22, 22, 22, 20,
+ 19, 18, 28, 25, 22, 22, 22, 20, 19, 18, 25, 24, 21, 21, 21, 20, 18, 17,
+ 24, 24, 21, 21, 20, 19, 18, 17, 21, 22, 20, 19, 19, 18, 17, 17, 21, 22,
+ 20, 19, 19, 18, 17, 16, 21, 22, 21, 19, 18, 17, 16, 16, 21, 22, 21, 19,
+ 18, 17, 16, 16, 21, 23, 21, 18, 17, 16, 15, 15, 21, 22, 21, 18, 17, 16,
+ 15, 15, 20, 22, 21, 18, 16, 15, 14, 14, 20, 22, 21, 18, 16, 15, 14, 14,
+ 20, 21, 20, 18, 16, 14, 14, 13, 19, 21, 20, 17, 15, 14, 13, 13, 19, 20,
+ 20, 17, 15, 14, 13, 13, 18, 20, 20, 17, 15, 13, 12, 12, 18, 20, 19, 17,
+ 15, 13, 12, 12, 17, 19, 19, 17, 14, 13, 12, 12, 17, 19, 18, 16, 14, 13,
+ 12, 12, 16, 18, 18, 16, 14, 12, 12, 11, 16, 18, 18, 16, 14, 12, 12, 11,
+ 16, 18, 18, 16, 14, 12, 11, 11, 16, 17, 17, 16, 14, 12, 11, 11, 15, 17,
+ 17, 16, 14, 12, 11, 11, 15, 17, 17, 16, 14, 13, 12, 11, 15, 17, 17, 16,
+ 14, 13, 12, 11 },
},
{
{ /* Luma */
@@ -8781,12 +8781,20 @@
12, 12, 13, 13, 13, 13, 14, 14, 13, 13, 12, 12, 11, 11, 11, 11, 10, 10,
9, 9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7,
/* Size 4x8 */
- 32, 29, 20, 14, 32, 28, 20, 14, 30, 24, 19, 14, 28, 20, 16, 12, 23, 18,
- 13, 11, 19, 16, 12, 9, 16, 14, 11, 8, 14, 13, 10, 8,
- /* Size 8x4 */
32, 32, 30, 28, 23, 19, 16, 14, 29, 28, 24, 20, 18, 16, 14, 13, 20, 20,
19, 16, 13, 12, 11, 10, 14, 14, 14, 12, 11, 9, 8, 8,
+ /* Size 8x4 */
+ 32, 29, 20, 14, 32, 28, 20, 14, 30, 24, 19, 14, 28, 20, 16, 12, 23, 18,
+ 13, 11, 19, 16, 12, 9, 16, 14, 11, 8, 14, 13, 10, 8,
/* Size 8x16 */
+ 32, 33, 33, 32, 32, 30, 28, 26, 23, 21, 19, 18, 16, 14, 13, 12, 33, 32,
+ 32, 32, 31, 30, 30, 28, 25, 23, 21, 19, 17, 16, 14, 14, 32, 32, 31, 30,
+ 29, 28, 27, 26, 24, 22, 20, 19, 18, 16, 15, 14, 28, 29, 30, 28, 27, 24,
+ 21, 20, 19, 18, 17, 16, 15, 14, 13, 13, 23, 24, 25, 24, 24, 21, 19, 18,
+ 16, 15, 14, 14, 13, 12, 11, 11, 19, 20, 21, 20, 21, 19, 17, 16, 14, 13,
+ 12, 12, 11, 11, 10, 10, 16, 17, 17, 17, 18, 16, 15, 14, 13, 12, 11, 10,
+ 10, 9, 9, 8, 13, 14, 14, 14, 15, 14, 13, 12, 11, 11, 10, 9, 9, 8, 8, 8,
+ /* Size 16x8 */
32, 33, 32, 28, 23, 19, 16, 13, 33, 32, 32, 29, 24, 20, 17, 14, 33, 32,
31, 30, 25, 21, 17, 14, 32, 32, 30, 28, 24, 20, 17, 14, 32, 31, 29, 27,
24, 21, 18, 15, 30, 30, 28, 24, 21, 19, 16, 14, 28, 30, 27, 21, 19, 17,
@@ -8795,44 +8803,7 @@
19, 16, 14, 12, 10, 9, 16, 17, 18, 15, 13, 11, 10, 9, 14, 16, 16, 14,
12, 11, 9, 8, 13, 14, 15, 13, 11, 10, 9, 8, 12, 14, 14, 13, 11, 10, 8,
8,
- /* Size 16x8 */
- 32, 33, 33, 32, 32, 30, 28, 26, 23, 21, 19, 18, 16, 14, 13, 12, 33, 32,
- 32, 32, 31, 30, 30, 28, 25, 23, 21, 19, 17, 16, 14, 14, 32, 32, 31, 30,
- 29, 28, 27, 26, 24, 22, 20, 19, 18, 16, 15, 14, 28, 29, 30, 28, 27, 24,
- 21, 20, 19, 18, 17, 16, 15, 14, 13, 13, 23, 24, 25, 24, 24, 21, 19, 18,
- 16, 15, 14, 14, 13, 12, 11, 11, 19, 20, 21, 20, 21, 19, 17, 16, 14, 13,
- 12, 12, 11, 11, 10, 10, 16, 17, 17, 17, 18, 16, 15, 14, 13, 12, 11, 10,
- 10, 9, 9, 8, 13, 14, 14, 14, 15, 14, 13, 12, 11, 11, 10, 9, 9, 8, 8, 8,
/* Size 16x32 */
- 32, 33, 33, 32, 32, 28, 28, 23, 23, 19, 19, 16, 16, 13, 13, 12, 33, 32,
- 32, 32, 32, 29, 29, 24, 24, 20, 20, 17, 17, 14, 14, 12, 33, 32, 32, 32,
- 32, 29, 29, 24, 24, 20, 20, 17, 17, 14, 14, 12, 33, 32, 32, 31, 31, 30,
- 30, 25, 25, 21, 21, 17, 17, 14, 14, 13, 33, 32, 32, 31, 31, 30, 30, 25,
- 25, 21, 21, 17, 17, 14, 14, 13, 32, 32, 32, 30, 30, 28, 28, 24, 24, 20,
- 20, 17, 17, 14, 14, 13, 32, 32, 32, 30, 30, 28, 28, 24, 24, 20, 20, 17,
- 17, 14, 14, 13, 32, 31, 31, 29, 29, 27, 27, 24, 24, 21, 21, 18, 18, 15,
- 15, 14, 32, 31, 31, 29, 29, 27, 27, 24, 24, 21, 21, 18, 18, 15, 15, 14,
- 30, 30, 30, 28, 28, 24, 24, 21, 21, 19, 19, 16, 16, 14, 14, 13, 30, 30,
- 30, 28, 28, 24, 24, 21, 21, 19, 19, 16, 16, 14, 14, 13, 28, 30, 30, 27,
- 27, 21, 21, 19, 19, 17, 17, 15, 15, 13, 13, 12, 28, 30, 30, 27, 27, 21,
- 21, 19, 19, 17, 17, 15, 15, 13, 13, 12, 26, 28, 28, 26, 26, 20, 20, 18,
- 18, 16, 16, 14, 14, 12, 12, 12, 26, 28, 28, 26, 26, 20, 20, 18, 18, 16,
- 16, 14, 14, 12, 12, 12, 23, 25, 25, 24, 24, 19, 19, 16, 16, 14, 14, 13,
- 13, 11, 11, 11, 23, 25, 25, 24, 24, 19, 19, 16, 16, 14, 14, 13, 13, 11,
- 11, 11, 21, 23, 23, 22, 22, 18, 18, 15, 15, 13, 13, 12, 12, 11, 11, 10,
- 21, 23, 23, 22, 22, 18, 18, 15, 15, 13, 13, 12, 12, 11, 11, 10, 19, 21,
- 21, 20, 20, 17, 17, 14, 14, 12, 12, 11, 11, 10, 10, 9, 19, 21, 21, 20,
- 20, 17, 17, 14, 14, 12, 12, 11, 11, 10, 10, 9, 18, 19, 19, 19, 19, 16,
- 16, 14, 14, 12, 12, 10, 10, 9, 9, 9, 18, 19, 19, 19, 19, 16, 16, 14, 14,
- 12, 12, 10, 10, 9, 9, 9, 16, 17, 17, 18, 18, 15, 15, 13, 13, 11, 11, 10,
- 10, 9, 9, 8, 16, 17, 17, 18, 18, 15, 15, 13, 13, 11, 11, 10, 10, 9, 9,
- 8, 14, 16, 16, 16, 16, 14, 14, 12, 12, 11, 11, 9, 9, 8, 8, 8, 14, 16,
- 16, 16, 16, 14, 14, 12, 12, 11, 11, 9, 9, 8, 8, 8, 13, 14, 14, 15, 15,
- 13, 13, 11, 11, 10, 10, 9, 9, 8, 8, 7, 13, 14, 14, 15, 15, 13, 13, 11,
- 11, 10, 10, 9, 9, 8, 8, 7, 12, 14, 14, 14, 14, 13, 13, 11, 11, 10, 10,
- 8, 8, 8, 8, 7, 12, 14, 14, 14, 14, 13, 13, 11, 11, 10, 10, 8, 8, 8, 8,
- 7, 12, 13, 13, 13, 13, 12, 12, 11, 11, 9, 9, 8, 8, 7, 7, 7,
- /* Size 32x16 */
32, 33, 33, 33, 33, 32, 32, 32, 32, 30, 30, 28, 28, 26, 26, 23, 23, 21,
21, 19, 19, 18, 18, 16, 16, 14, 14, 13, 13, 12, 12, 12, 33, 32, 32, 32,
32, 32, 32, 31, 31, 30, 30, 30, 30, 28, 28, 25, 25, 23, 23, 21, 21, 19,
@@ -8861,32 +8832,46 @@
15, 14, 14, 13, 13, 12, 12, 11, 11, 11, 11, 10, 10, 9, 9, 9, 9, 8, 8, 8,
8, 8, 8, 7, 12, 12, 12, 13, 13, 13, 13, 14, 14, 13, 13, 12, 12, 12, 12,
11, 11, 10, 10, 9, 9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7,
+ /* Size 32x16 */
+ 32, 33, 33, 32, 32, 28, 28, 23, 23, 19, 19, 16, 16, 13, 13, 12, 33, 32,
+ 32, 32, 32, 29, 29, 24, 24, 20, 20, 17, 17, 14, 14, 12, 33, 32, 32, 32,
+ 32, 29, 29, 24, 24, 20, 20, 17, 17, 14, 14, 12, 33, 32, 32, 31, 31, 30,
+ 30, 25, 25, 21, 21, 17, 17, 14, 14, 13, 33, 32, 32, 31, 31, 30, 30, 25,
+ 25, 21, 21, 17, 17, 14, 14, 13, 32, 32, 32, 30, 30, 28, 28, 24, 24, 20,
+ 20, 17, 17, 14, 14, 13, 32, 32, 32, 30, 30, 28, 28, 24, 24, 20, 20, 17,
+ 17, 14, 14, 13, 32, 31, 31, 29, 29, 27, 27, 24, 24, 21, 21, 18, 18, 15,
+ 15, 14, 32, 31, 31, 29, 29, 27, 27, 24, 24, 21, 21, 18, 18, 15, 15, 14,
+ 30, 30, 30, 28, 28, 24, 24, 21, 21, 19, 19, 16, 16, 14, 14, 13, 30, 30,
+ 30, 28, 28, 24, 24, 21, 21, 19, 19, 16, 16, 14, 14, 13, 28, 30, 30, 27,
+ 27, 21, 21, 19, 19, 17, 17, 15, 15, 13, 13, 12, 28, 30, 30, 27, 27, 21,
+ 21, 19, 19, 17, 17, 15, 15, 13, 13, 12, 26, 28, 28, 26, 26, 20, 20, 18,
+ 18, 16, 16, 14, 14, 12, 12, 12, 26, 28, 28, 26, 26, 20, 20, 18, 18, 16,
+ 16, 14, 14, 12, 12, 12, 23, 25, 25, 24, 24, 19, 19, 16, 16, 14, 14, 13,
+ 13, 11, 11, 11, 23, 25, 25, 24, 24, 19, 19, 16, 16, 14, 14, 13, 13, 11,
+ 11, 11, 21, 23, 23, 22, 22, 18, 18, 15, 15, 13, 13, 12, 12, 11, 11, 10,
+ 21, 23, 23, 22, 22, 18, 18, 15, 15, 13, 13, 12, 12, 11, 11, 10, 19, 21,
+ 21, 20, 20, 17, 17, 14, 14, 12, 12, 11, 11, 10, 10, 9, 19, 21, 21, 20,
+ 20, 17, 17, 14, 14, 12, 12, 11, 11, 10, 10, 9, 18, 19, 19, 19, 19, 16,
+ 16, 14, 14, 12, 12, 10, 10, 9, 9, 9, 18, 19, 19, 19, 19, 16, 16, 14, 14,
+ 12, 12, 10, 10, 9, 9, 9, 16, 17, 17, 18, 18, 15, 15, 13, 13, 11, 11, 10,
+ 10, 9, 9, 8, 16, 17, 17, 18, 18, 15, 15, 13, 13, 11, 11, 10, 10, 9, 9,
+ 8, 14, 16, 16, 16, 16, 14, 14, 12, 12, 11, 11, 9, 9, 8, 8, 8, 14, 16,
+ 16, 16, 16, 14, 14, 12, 12, 11, 11, 9, 9, 8, 8, 8, 13, 14, 14, 15, 15,
+ 13, 13, 11, 11, 10, 10, 9, 9, 8, 8, 7, 13, 14, 14, 15, 15, 13, 13, 11,
+ 11, 10, 10, 9, 9, 8, 8, 7, 12, 14, 14, 14, 14, 13, 13, 11, 11, 10, 10,
+ 8, 8, 8, 8, 7, 12, 14, 14, 14, 14, 13, 13, 11, 11, 10, 10, 8, 8, 8, 8,
+ 7, 12, 13, 13, 13, 13, 12, 12, 11, 11, 9, 9, 8, 8, 7, 7, 7,
/* Size 4x16 */
- 33, 28, 19, 13, 32, 29, 20, 14, 32, 30, 21, 14, 32, 28, 20, 14, 31, 27,
- 21, 15, 30, 24, 19, 14, 30, 21, 17, 13, 28, 20, 16, 12, 25, 19, 14, 11,
- 23, 18, 13, 11, 21, 17, 12, 10, 19, 16, 12, 9, 17, 15, 11, 9, 16, 14,
- 11, 8, 14, 13, 10, 8, 14, 13, 10, 8,
- /* Size 16x4 */
33, 32, 32, 32, 31, 30, 30, 28, 25, 23, 21, 19, 17, 16, 14, 14, 28, 29,
30, 28, 27, 24, 21, 20, 19, 18, 17, 16, 15, 14, 13, 13, 19, 20, 21, 20,
21, 19, 17, 16, 14, 13, 12, 12, 11, 11, 10, 10, 13, 14, 14, 14, 15, 14,
13, 12, 11, 11, 10, 9, 9, 8, 8, 8,
+ /* Size 16x4 */
+ 33, 28, 19, 13, 32, 29, 20, 14, 32, 30, 21, 14, 32, 28, 20, 14, 31, 27,
+ 21, 15, 30, 24, 19, 14, 30, 21, 17, 13, 28, 20, 16, 12, 25, 19, 14, 11,
+ 23, 18, 13, 11, 21, 17, 12, 10, 19, 16, 12, 9, 17, 15, 11, 9, 16, 14,
+ 11, 8, 14, 13, 10, 8, 14, 13, 10, 8,
/* Size 8x32 */
- 32, 33, 32, 28, 23, 19, 16, 13, 33, 32, 32, 29, 24, 20, 17, 14, 33, 32,
- 32, 29, 24, 20, 17, 14, 33, 32, 31, 30, 25, 21, 17, 14, 33, 32, 31, 30,
- 25, 21, 17, 14, 32, 32, 30, 28, 24, 20, 17, 14, 32, 32, 30, 28, 24, 20,
- 17, 14, 32, 31, 29, 27, 24, 21, 18, 15, 32, 31, 29, 27, 24, 21, 18, 15,
- 30, 30, 28, 24, 21, 19, 16, 14, 30, 30, 28, 24, 21, 19, 16, 14, 28, 30,
- 27, 21, 19, 17, 15, 13, 28, 30, 27, 21, 19, 17, 15, 13, 26, 28, 26, 20,
- 18, 16, 14, 12, 26, 28, 26, 20, 18, 16, 14, 12, 23, 25, 24, 19, 16, 14,
- 13, 11, 23, 25, 24, 19, 16, 14, 13, 11, 21, 23, 22, 18, 15, 13, 12, 11,
- 21, 23, 22, 18, 15, 13, 12, 11, 19, 21, 20, 17, 14, 12, 11, 10, 19, 21,
- 20, 17, 14, 12, 11, 10, 18, 19, 19, 16, 14, 12, 10, 9, 18, 19, 19, 16,
- 14, 12, 10, 9, 16, 17, 18, 15, 13, 11, 10, 9, 16, 17, 18, 15, 13, 11,
- 10, 9, 14, 16, 16, 14, 12, 11, 9, 8, 14, 16, 16, 14, 12, 11, 9, 8, 13,
- 14, 15, 13, 11, 10, 9, 8, 13, 14, 15, 13, 11, 10, 9, 8, 12, 14, 14, 13,
- 11, 10, 8, 8, 12, 14, 14, 13, 11, 10, 8, 8, 12, 13, 13, 12, 11, 9, 8, 7,
- /* Size 32x8 */
32, 33, 33, 33, 33, 32, 32, 32, 32, 30, 30, 28, 28, 26, 26, 23, 23, 21,
21, 19, 19, 18, 18, 16, 16, 14, 14, 13, 13, 12, 12, 12, 33, 32, 32, 32,
32, 32, 32, 31, 31, 30, 30, 30, 30, 28, 28, 25, 25, 23, 23, 21, 21, 19,
@@ -8900,7 +8885,23 @@
12, 12, 12, 11, 11, 11, 11, 10, 10, 10, 10, 9, 16, 17, 17, 17, 17, 17,
17, 18, 18, 16, 16, 15, 15, 14, 14, 13, 13, 12, 12, 11, 11, 10, 10, 10,
10, 9, 9, 9, 9, 8, 8, 8, 13, 14, 14, 14, 14, 14, 14, 15, 15, 14, 14, 13,
- 13, 12, 12, 11, 11, 11, 11, 10, 10, 9, 9, 9, 9, 8, 8, 8, 8, 8, 8, 7 },
+ 13, 12, 12, 11, 11, 11, 11, 10, 10, 9, 9, 9, 9, 8, 8, 8, 8, 8, 8, 7,
+ /* Size 32x8 */
+ 32, 33, 32, 28, 23, 19, 16, 13, 33, 32, 32, 29, 24, 20, 17, 14, 33, 32,
+ 32, 29, 24, 20, 17, 14, 33, 32, 31, 30, 25, 21, 17, 14, 33, 32, 31, 30,
+ 25, 21, 17, 14, 32, 32, 30, 28, 24, 20, 17, 14, 32, 32, 30, 28, 24, 20,
+ 17, 14, 32, 31, 29, 27, 24, 21, 18, 15, 32, 31, 29, 27, 24, 21, 18, 15,
+ 30, 30, 28, 24, 21, 19, 16, 14, 30, 30, 28, 24, 21, 19, 16, 14, 28, 30,
+ 27, 21, 19, 17, 15, 13, 28, 30, 27, 21, 19, 17, 15, 13, 26, 28, 26, 20,
+ 18, 16, 14, 12, 26, 28, 26, 20, 18, 16, 14, 12, 23, 25, 24, 19, 16, 14,
+ 13, 11, 23, 25, 24, 19, 16, 14, 13, 11, 21, 23, 22, 18, 15, 13, 12, 11,
+ 21, 23, 22, 18, 15, 13, 12, 11, 19, 21, 20, 17, 14, 12, 11, 10, 19, 21,
+ 20, 17, 14, 12, 11, 10, 18, 19, 19, 16, 14, 12, 10, 9, 18, 19, 19, 16,
+ 14, 12, 10, 9, 16, 17, 18, 15, 13, 11, 10, 9, 16, 17, 18, 15, 13, 11,
+ 10, 9, 14, 16, 16, 14, 12, 11, 9, 8, 14, 16, 16, 14, 12, 11, 9, 8, 13,
+ 14, 15, 13, 11, 10, 9, 8, 13, 14, 15, 13, 11, 10, 9, 8, 12, 14, 14, 13,
+ 11, 10, 8, 8, 12, 14, 14, 13, 11, 10, 8, 8, 12, 13, 13, 12, 11, 9, 8,
+ 7 },
{ /* Chroma */
/* Size 4x4 */
32, 22, 22, 18, 22, 19, 19, 17, 22, 19, 16, 14, 18, 17, 14, 12,
@@ -8984,21 +8985,12 @@
11, 11, 15, 16, 16, 17, 17, 17, 17, 18, 18, 17, 17, 17, 17, 16, 16, 15,
15, 14, 14, 13, 13, 13, 13, 12, 12, 12, 12, 11, 11, 11, 11, 11,
/* Size 4x8 */
- 33, 22, 20, 17, 28, 22, 22, 18, 24, 20, 20, 18, 22, 19, 18, 16, 22, 19,
- 16, 14, 20, 19, 15, 13, 19, 18, 14, 12, 17, 17, 14, 11,
- /* Size 8x4 */
33, 28, 24, 22, 22, 20, 19, 17, 22, 22, 20, 19, 19, 19, 18, 17, 20, 22,
20, 18, 16, 15, 14, 14, 17, 18, 18, 16, 14, 13, 12, 11,
+ /* Size 8x4 */
+ 33, 22, 20, 17, 28, 22, 22, 18, 24, 20, 20, 18, 22, 19, 18, 16, 22, 19,
+ 16, 14, 20, 19, 15, 13, 19, 18, 14, 12, 17, 17, 14, 11,
/* Size 8x16 */
- 32, 33, 28, 21, 21, 20, 18, 16, 33, 33, 27, 22, 22, 20, 19, 17, 34, 32,
- 26, 22, 23, 21, 20, 18, 31, 28, 24, 22, 22, 22, 20, 18, 28, 26, 22, 22,
- 23, 22, 20, 19, 24, 24, 22, 20, 21, 20, 19, 18, 21, 22, 21, 19, 19, 19,
- 18, 17, 21, 22, 22, 19, 18, 18, 17, 16, 21, 23, 22, 19, 18, 17, 16, 15,
- 20, 22, 22, 19, 17, 16, 15, 14, 20, 21, 22, 19, 17, 16, 14, 14, 19, 20,
- 21, 19, 17, 15, 14, 13, 18, 20, 20, 18, 16, 15, 13, 12, 17, 19, 20, 18,
- 16, 14, 13, 12, 16, 18, 19, 17, 15, 14, 12, 12, 16, 17, 18, 17, 15, 14,
- 12, 11,
- /* Size 16x8 */
32, 33, 34, 31, 28, 24, 21, 21, 21, 20, 20, 19, 18, 17, 16, 16, 33, 33,
32, 28, 26, 24, 22, 22, 23, 22, 21, 20, 20, 19, 18, 17, 28, 27, 26, 24,
22, 22, 21, 22, 22, 22, 22, 21, 20, 20, 19, 18, 21, 22, 22, 22, 22, 20,
@@ -9007,37 +8999,16 @@
16, 15, 15, 14, 14, 14, 18, 19, 20, 20, 20, 19, 18, 17, 16, 15, 14, 14,
13, 13, 12, 12, 16, 17, 18, 18, 19, 18, 17, 16, 15, 14, 14, 13, 12, 12,
12, 11,
+ /* Size 16x8 */
+ 32, 33, 28, 21, 21, 20, 18, 16, 33, 33, 27, 22, 22, 20, 19, 17, 34, 32,
+ 26, 22, 23, 21, 20, 18, 31, 28, 24, 22, 22, 22, 20, 18, 28, 26, 22, 22,
+ 23, 22, 20, 19, 24, 24, 22, 20, 21, 20, 19, 18, 21, 22, 21, 19, 19, 19,
+ 18, 17, 21, 22, 22, 19, 18, 18, 17, 16, 21, 23, 22, 19, 18, 17, 16, 15,
+ 20, 22, 22, 19, 17, 16, 15, 14, 20, 21, 22, 19, 17, 16, 14, 14, 19, 20,
+ 21, 19, 17, 15, 14, 13, 18, 20, 20, 18, 16, 15, 13, 12, 17, 19, 20, 18,
+ 16, 14, 13, 12, 16, 18, 19, 17, 15, 14, 12, 12, 16, 17, 18, 17, 15, 14,
+ 12, 11,
/* Size 16x32 */
- 32, 33, 33, 28, 28, 21, 21, 21, 21, 20, 20, 18, 18, 16, 16, 16, 33, 33,
- 33, 27, 27, 22, 22, 22, 22, 20, 20, 19, 19, 17, 17, 16, 33, 33, 33, 27,
- 27, 22, 22, 22, 22, 20, 20, 19, 19, 17, 17, 16, 34, 32, 32, 26, 26, 22,
- 22, 23, 23, 21, 21, 20, 20, 18, 18, 17, 34, 32, 32, 26, 26, 22, 22, 23,
- 23, 21, 21, 20, 20, 18, 18, 17, 31, 28, 28, 24, 24, 22, 22, 22, 22, 22,
- 22, 20, 20, 18, 18, 17, 31, 28, 28, 24, 24, 22, 22, 22, 22, 22, 22, 20,
- 20, 18, 18, 17, 28, 26, 26, 22, 22, 22, 22, 23, 23, 22, 22, 20, 20, 19,
- 19, 18, 28, 26, 26, 22, 22, 22, 22, 23, 23, 22, 22, 20, 20, 19, 19, 18,
- 24, 24, 24, 22, 22, 20, 20, 21, 21, 20, 20, 19, 19, 18, 18, 17, 24, 24,
- 24, 22, 22, 20, 20, 21, 21, 20, 20, 19, 19, 18, 18, 17, 21, 22, 22, 21,
- 21, 19, 19, 19, 19, 19, 19, 18, 18, 17, 17, 17, 21, 22, 22, 21, 21, 19,
- 19, 19, 19, 19, 19, 18, 18, 17, 17, 17, 21, 22, 22, 22, 22, 19, 19, 18,
- 18, 18, 18, 17, 17, 16, 16, 16, 21, 22, 22, 22, 22, 19, 19, 18, 18, 18,
- 18, 17, 17, 16, 16, 16, 21, 23, 23, 22, 22, 19, 19, 18, 18, 17, 17, 16,
- 16, 15, 15, 15, 21, 23, 23, 22, 22, 19, 19, 18, 18, 17, 17, 16, 16, 15,
- 15, 15, 20, 22, 22, 22, 22, 19, 19, 17, 17, 16, 16, 15, 15, 14, 14, 14,
- 20, 22, 22, 22, 22, 19, 19, 17, 17, 16, 16, 15, 15, 14, 14, 14, 20, 21,
- 21, 22, 22, 19, 19, 17, 17, 16, 16, 14, 14, 14, 14, 13, 20, 21, 21, 22,
- 22, 19, 19, 17, 17, 16, 16, 14, 14, 14, 14, 13, 19, 20, 20, 21, 21, 19,
- 19, 17, 17, 15, 15, 14, 14, 13, 13, 13, 19, 20, 20, 21, 21, 19, 19, 17,
- 17, 15, 15, 14, 14, 13, 13, 13, 18, 20, 20, 20, 20, 18, 18, 16, 16, 15,
- 15, 13, 13, 12, 12, 12, 18, 20, 20, 20, 20, 18, 18, 16, 16, 15, 15, 13,
- 13, 12, 12, 12, 17, 19, 19, 20, 20, 18, 18, 16, 16, 14, 14, 13, 13, 12,
- 12, 12, 17, 19, 19, 20, 20, 18, 18, 16, 16, 14, 14, 13, 13, 12, 12, 12,
- 16, 18, 18, 19, 19, 17, 17, 15, 15, 14, 14, 12, 12, 12, 12, 11, 16, 18,
- 18, 19, 19, 17, 17, 15, 15, 14, 14, 12, 12, 12, 12, 11, 16, 17, 17, 18,
- 18, 17, 17, 15, 15, 14, 14, 12, 12, 11, 11, 11, 16, 17, 17, 18, 18, 17,
- 17, 15, 15, 14, 14, 12, 12, 11, 11, 11, 16, 17, 17, 18, 18, 16, 16, 15,
- 15, 13, 13, 12, 12, 11, 11, 11,
- /* Size 32x16 */
32, 33, 33, 34, 34, 31, 31, 28, 28, 24, 24, 21, 21, 21, 21, 21, 21, 20,
20, 20, 20, 19, 19, 18, 18, 17, 17, 16, 16, 16, 16, 16, 33, 33, 33, 32,
32, 28, 28, 26, 26, 24, 24, 22, 22, 22, 22, 23, 23, 22, 22, 21, 21, 20,
@@ -9067,33 +9038,47 @@
14, 13, 13, 12, 12, 12, 12, 12, 12, 11, 11, 11, 16, 16, 16, 17, 17, 17,
17, 18, 18, 17, 17, 17, 17, 16, 16, 15, 15, 14, 14, 13, 13, 13, 13, 12,
12, 12, 12, 11, 11, 11, 11, 11,
+ /* Size 32x16 */
+ 32, 33, 33, 28, 28, 21, 21, 21, 21, 20, 20, 18, 18, 16, 16, 16, 33, 33,
+ 33, 27, 27, 22, 22, 22, 22, 20, 20, 19, 19, 17, 17, 16, 33, 33, 33, 27,
+ 27, 22, 22, 22, 22, 20, 20, 19, 19, 17, 17, 16, 34, 32, 32, 26, 26, 22,
+ 22, 23, 23, 21, 21, 20, 20, 18, 18, 17, 34, 32, 32, 26, 26, 22, 22, 23,
+ 23, 21, 21, 20, 20, 18, 18, 17, 31, 28, 28, 24, 24, 22, 22, 22, 22, 22,
+ 22, 20, 20, 18, 18, 17, 31, 28, 28, 24, 24, 22, 22, 22, 22, 22, 22, 20,
+ 20, 18, 18, 17, 28, 26, 26, 22, 22, 22, 22, 23, 23, 22, 22, 20, 20, 19,
+ 19, 18, 28, 26, 26, 22, 22, 22, 22, 23, 23, 22, 22, 20, 20, 19, 19, 18,
+ 24, 24, 24, 22, 22, 20, 20, 21, 21, 20, 20, 19, 19, 18, 18, 17, 24, 24,
+ 24, 22, 22, 20, 20, 21, 21, 20, 20, 19, 19, 18, 18, 17, 21, 22, 22, 21,
+ 21, 19, 19, 19, 19, 19, 19, 18, 18, 17, 17, 17, 21, 22, 22, 21, 21, 19,
+ 19, 19, 19, 19, 19, 18, 18, 17, 17, 17, 21, 22, 22, 22, 22, 19, 19, 18,
+ 18, 18, 18, 17, 17, 16, 16, 16, 21, 22, 22, 22, 22, 19, 19, 18, 18, 18,
+ 18, 17, 17, 16, 16, 16, 21, 23, 23, 22, 22, 19, 19, 18, 18, 17, 17, 16,
+ 16, 15, 15, 15, 21, 23, 23, 22, 22, 19, 19, 18, 18, 17, 17, 16, 16, 15,
+ 15, 15, 20, 22, 22, 22, 22, 19, 19, 17, 17, 16, 16, 15, 15, 14, 14, 14,
+ 20, 22, 22, 22, 22, 19, 19, 17, 17, 16, 16, 15, 15, 14, 14, 14, 20, 21,
+ 21, 22, 22, 19, 19, 17, 17, 16, 16, 14, 14, 14, 14, 13, 20, 21, 21, 22,
+ 22, 19, 19, 17, 17, 16, 16, 14, 14, 14, 14, 13, 19, 20, 20, 21, 21, 19,
+ 19, 17, 17, 15, 15, 14, 14, 13, 13, 13, 19, 20, 20, 21, 21, 19, 19, 17,
+ 17, 15, 15, 14, 14, 13, 13, 13, 18, 20, 20, 20, 20, 18, 18, 16, 16, 15,
+ 15, 13, 13, 12, 12, 12, 18, 20, 20, 20, 20, 18, 18, 16, 16, 15, 15, 13,
+ 13, 12, 12, 12, 17, 19, 19, 20, 20, 18, 18, 16, 16, 14, 14, 13, 13, 12,
+ 12, 12, 17, 19, 19, 20, 20, 18, 18, 16, 16, 14, 14, 13, 13, 12, 12, 12,
+ 16, 18, 18, 19, 19, 17, 17, 15, 15, 14, 14, 12, 12, 12, 12, 11, 16, 18,
+ 18, 19, 19, 17, 17, 15, 15, 14, 14, 12, 12, 12, 12, 11, 16, 17, 17, 18,
+ 18, 17, 17, 15, 15, 14, 14, 12, 12, 11, 11, 11, 16, 17, 17, 18, 18, 17,
+ 17, 15, 15, 14, 14, 12, 12, 11, 11, 11, 16, 17, 17, 18, 18, 16, 16, 15,
+ 15, 13, 13, 12, 12, 11, 11, 11,
/* Size 4x16 */
- 33, 21, 20, 16, 33, 22, 20, 17, 32, 22, 21, 18, 28, 22, 22, 18, 26, 22,
- 22, 19, 24, 20, 20, 18, 22, 19, 19, 17, 22, 19, 18, 16, 23, 19, 17, 15,
- 22, 19, 16, 14, 21, 19, 16, 14, 20, 19, 15, 13, 20, 18, 15, 12, 19, 18,
- 14, 12, 18, 17, 14, 12, 17, 17, 14, 11,
- /* Size 16x4 */
33, 33, 32, 28, 26, 24, 22, 22, 23, 22, 21, 20, 20, 19, 18, 17, 21, 22,
22, 22, 22, 20, 19, 19, 19, 19, 19, 19, 18, 18, 17, 17, 20, 20, 21, 22,
22, 20, 19, 18, 17, 16, 16, 15, 15, 14, 14, 14, 16, 17, 18, 18, 19, 18,
17, 16, 15, 14, 14, 13, 12, 12, 12, 11,
+ /* Size 16x4 */
+ 33, 21, 20, 16, 33, 22, 20, 17, 32, 22, 21, 18, 28, 22, 22, 18, 26, 22,
+ 22, 19, 24, 20, 20, 18, 22, 19, 19, 17, 22, 19, 18, 16, 23, 19, 17, 15,
+ 22, 19, 16, 14, 21, 19, 16, 14, 20, 19, 15, 13, 20, 18, 15, 12, 19, 18,
+ 14, 12, 18, 17, 14, 12, 17, 17, 14, 11,
/* Size 8x32 */
- 32, 33, 28, 21, 21, 20, 18, 16, 33, 33, 27, 22, 22, 20, 19, 17, 33, 33,
- 27, 22, 22, 20, 19, 17, 34, 32, 26, 22, 23, 21, 20, 18, 34, 32, 26, 22,
- 23, 21, 20, 18, 31, 28, 24, 22, 22, 22, 20, 18, 31, 28, 24, 22, 22, 22,
- 20, 18, 28, 26, 22, 22, 23, 22, 20, 19, 28, 26, 22, 22, 23, 22, 20, 19,
- 24, 24, 22, 20, 21, 20, 19, 18, 24, 24, 22, 20, 21, 20, 19, 18, 21, 22,
- 21, 19, 19, 19, 18, 17, 21, 22, 21, 19, 19, 19, 18, 17, 21, 22, 22, 19,
- 18, 18, 17, 16, 21, 22, 22, 19, 18, 18, 17, 16, 21, 23, 22, 19, 18, 17,
- 16, 15, 21, 23, 22, 19, 18, 17, 16, 15, 20, 22, 22, 19, 17, 16, 15, 14,
- 20, 22, 22, 19, 17, 16, 15, 14, 20, 21, 22, 19, 17, 16, 14, 14, 20, 21,
- 22, 19, 17, 16, 14, 14, 19, 20, 21, 19, 17, 15, 14, 13, 19, 20, 21, 19,
- 17, 15, 14, 13, 18, 20, 20, 18, 16, 15, 13, 12, 18, 20, 20, 18, 16, 15,
- 13, 12, 17, 19, 20, 18, 16, 14, 13, 12, 17, 19, 20, 18, 16, 14, 13, 12,
- 16, 18, 19, 17, 15, 14, 12, 12, 16, 18, 19, 17, 15, 14, 12, 12, 16, 17,
- 18, 17, 15, 14, 12, 11, 16, 17, 18, 17, 15, 14, 12, 11, 16, 17, 18, 16,
- 15, 13, 12, 11,
- /* Size 32x8 */
32, 33, 33, 34, 34, 31, 31, 28, 28, 24, 24, 21, 21, 21, 21, 21, 21, 20,
20, 20, 20, 19, 19, 18, 18, 17, 17, 16, 16, 16, 16, 16, 33, 33, 33, 32,
32, 28, 28, 26, 26, 24, 24, 22, 22, 22, 22, 23, 23, 22, 22, 21, 21, 20,
@@ -9108,7 +9093,23 @@
20, 20, 20, 19, 19, 18, 18, 17, 17, 16, 16, 15, 15, 14, 14, 14, 14, 13,
13, 13, 13, 12, 12, 12, 12, 12, 16, 17, 17, 18, 18, 18, 18, 19, 19, 18,
18, 17, 17, 16, 16, 15, 15, 14, 14, 14, 14, 13, 13, 12, 12, 12, 12, 12,
- 12, 11, 11, 11 },
+ 12, 11, 11, 11,
+ /* Size 32x8 */
+ 32, 33, 28, 21, 21, 20, 18, 16, 33, 33, 27, 22, 22, 20, 19, 17, 33, 33,
+ 27, 22, 22, 20, 19, 17, 34, 32, 26, 22, 23, 21, 20, 18, 34, 32, 26, 22,
+ 23, 21, 20, 18, 31, 28, 24, 22, 22, 22, 20, 18, 31, 28, 24, 22, 22, 22,
+ 20, 18, 28, 26, 22, 22, 23, 22, 20, 19, 28, 26, 22, 22, 23, 22, 20, 19,
+ 24, 24, 22, 20, 21, 20, 19, 18, 24, 24, 22, 20, 21, 20, 19, 18, 21, 22,
+ 21, 19, 19, 19, 18, 17, 21, 22, 21, 19, 19, 19, 18, 17, 21, 22, 22, 19,
+ 18, 18, 17, 16, 21, 22, 22, 19, 18, 18, 17, 16, 21, 23, 22, 19, 18, 17,
+ 16, 15, 21, 23, 22, 19, 18, 17, 16, 15, 20, 22, 22, 19, 17, 16, 15, 14,
+ 20, 22, 22, 19, 17, 16, 15, 14, 20, 21, 22, 19, 17, 16, 14, 14, 20, 21,
+ 22, 19, 17, 16, 14, 14, 19, 20, 21, 19, 17, 15, 14, 13, 19, 20, 21, 19,
+ 17, 15, 14, 13, 18, 20, 20, 18, 16, 15, 13, 12, 18, 20, 20, 18, 16, 15,
+ 13, 12, 17, 19, 20, 18, 16, 14, 13, 12, 17, 19, 20, 18, 16, 14, 13, 12,
+ 16, 18, 19, 17, 15, 14, 12, 12, 16, 18, 19, 17, 15, 14, 12, 12, 16, 17,
+ 18, 17, 15, 14, 12, 11, 16, 17, 18, 17, 15, 14, 12, 11, 16, 17, 18, 16,
+ 15, 13, 12, 11 },
},
{
{ /* Luma */
@@ -9194,21 +9195,12 @@
14, 15, 15, 15, 14, 14, 13, 13, 12, 12, 12, 11, 11, 11, 11, 10, 10, 9,
9, 9, 9, 9, 8, 8, 8, 8,
/* Size 4x8 */
- 32, 30, 24, 17, 32, 30, 24, 17, 31, 28, 23, 18, 29, 24, 19, 15, 25, 21,
- 16, 13, 21, 19, 14, 11, 18, 17, 13, 10, 16, 15, 12, 9,
- /* Size 8x4 */
32, 32, 31, 29, 25, 21, 18, 16, 30, 30, 28, 24, 21, 19, 17, 15, 24, 24,
23, 19, 16, 14, 13, 12, 17, 17, 18, 15, 13, 11, 10, 9,
+ /* Size 8x4 */
+ 32, 30, 24, 17, 32, 30, 24, 17, 31, 28, 23, 18, 29, 24, 19, 15, 25, 21,
+ 16, 13, 21, 19, 14, 11, 18, 17, 13, 10, 16, 15, 12, 9,
/* Size 8x16 */
- 32, 33, 32, 28, 23, 19, 17, 14, 33, 32, 32, 29, 24, 20, 17, 15, 33, 32,
- 31, 30, 25, 21, 18, 16, 32, 32, 30, 28, 24, 20, 18, 16, 32, 31, 29, 27,
- 24, 21, 18, 16, 30, 30, 28, 24, 21, 19, 17, 15, 29, 30, 27, 22, 20, 17,
- 16, 14, 27, 28, 26, 21, 18, 16, 15, 13, 25, 26, 25, 20, 17, 15, 14, 13,
- 23, 24, 24, 19, 16, 14, 13, 12, 21, 23, 22, 18, 15, 13, 12, 11, 19, 21,
- 20, 17, 14, 12, 11, 10, 18, 19, 19, 16, 14, 12, 11, 10, 16, 17, 18, 15,
- 13, 11, 10, 9, 14, 16, 16, 14, 12, 11, 9, 9, 13, 14, 15, 13, 11, 10, 9,
- 8,
- /* Size 16x8 */
32, 33, 33, 32, 32, 30, 29, 27, 25, 23, 21, 19, 18, 16, 14, 13, 33, 32,
32, 32, 31, 30, 30, 28, 26, 24, 23, 21, 19, 17, 16, 14, 32, 32, 31, 30,
29, 28, 27, 26, 25, 24, 22, 20, 19, 18, 16, 15, 28, 29, 30, 28, 27, 24,
@@ -9217,37 +9209,16 @@
13, 12, 12, 11, 11, 10, 17, 17, 18, 18, 18, 17, 16, 15, 14, 13, 12, 11,
11, 10, 9, 9, 14, 15, 16, 16, 16, 15, 14, 13, 13, 12, 11, 10, 10, 9, 9,
8,
+ /* Size 16x8 */
+ 32, 33, 32, 28, 23, 19, 17, 14, 33, 32, 32, 29, 24, 20, 17, 15, 33, 32,
+ 31, 30, 25, 21, 18, 16, 32, 32, 30, 28, 24, 20, 18, 16, 32, 31, 29, 27,
+ 24, 21, 18, 16, 30, 30, 28, 24, 21, 19, 17, 15, 29, 30, 27, 22, 20, 17,
+ 16, 14, 27, 28, 26, 21, 18, 16, 15, 13, 25, 26, 25, 20, 17, 15, 14, 13,
+ 23, 24, 24, 19, 16, 14, 13, 12, 21, 23, 22, 18, 15, 13, 12, 11, 19, 21,
+ 20, 17, 14, 12, 11, 10, 18, 19, 19, 16, 14, 12, 11, 10, 16, 17, 18, 15,
+ 13, 11, 10, 9, 14, 16, 16, 14, 12, 11, 9, 9, 13, 14, 15, 13, 11, 10, 9,
+ 8,
/* Size 16x32 */
- 32, 33, 33, 32, 32, 30, 28, 27, 23, 23, 19, 19, 17, 16, 14, 13, 33, 32,
- 32, 32, 32, 30, 29, 28, 24, 24, 20, 20, 17, 17, 15, 14, 33, 32, 32, 32,
- 32, 30, 29, 28, 24, 24, 20, 20, 17, 17, 15, 14, 33, 32, 32, 32, 32, 31,
- 29, 28, 25, 24, 20, 20, 18, 17, 15, 14, 33, 32, 32, 32, 31, 31, 30, 28,
- 25, 25, 21, 21, 18, 17, 16, 14, 33, 32, 32, 31, 31, 30, 29, 28, 25, 24,
- 21, 21, 18, 17, 16, 14, 32, 32, 32, 31, 30, 29, 28, 27, 24, 24, 20, 20,
- 18, 17, 16, 14, 32, 32, 32, 30, 30, 29, 28, 27, 24, 24, 21, 21, 18, 17,
- 16, 15, 32, 32, 31, 30, 29, 28, 27, 26, 24, 24, 21, 21, 18, 18, 16, 15,
- 32, 31, 31, 30, 29, 28, 26, 26, 24, 23, 20, 20, 18, 18, 16, 15, 30, 30,
- 30, 28, 28, 26, 24, 23, 21, 21, 19, 19, 17, 16, 15, 14, 30, 30, 30, 28,
- 28, 26, 24, 23, 21, 21, 19, 19, 17, 16, 15, 14, 29, 30, 30, 28, 27, 24,
- 22, 21, 20, 19, 17, 17, 16, 15, 14, 13, 28, 29, 30, 28, 27, 24, 21, 21,
- 19, 19, 17, 17, 16, 15, 14, 13, 27, 28, 28, 27, 26, 23, 21, 20, 18, 18,
- 16, 16, 15, 14, 13, 13, 26, 27, 28, 26, 26, 23, 20, 20, 18, 18, 16, 16,
- 14, 14, 13, 12, 25, 26, 26, 25, 25, 22, 20, 19, 17, 17, 15, 15, 14, 13,
- 13, 12, 23, 25, 25, 24, 24, 21, 19, 18, 16, 16, 14, 14, 13, 13, 12, 11,
- 23, 24, 24, 24, 24, 21, 19, 18, 16, 16, 14, 14, 13, 13, 12, 11, 21, 23,
- 23, 22, 22, 20, 18, 17, 15, 15, 13, 13, 12, 12, 11, 11, 21, 23, 23, 22,
- 22, 20, 18, 17, 15, 15, 13, 13, 12, 12, 11, 11, 19, 21, 21, 21, 21, 19,
- 17, 17, 14, 14, 13, 13, 12, 11, 10, 10, 19, 20, 21, 20, 20, 19, 17, 16,
- 14, 14, 12, 12, 11, 11, 10, 10, 18, 19, 20, 20, 20, 18, 17, 16, 14, 14,
- 12, 12, 11, 11, 10, 9, 18, 19, 19, 19, 19, 18, 16, 15, 14, 13, 12, 12,
- 11, 10, 10, 9, 17, 18, 18, 18, 18, 17, 16, 15, 13, 13, 12, 12, 10, 10,
- 9, 9, 16, 17, 17, 17, 18, 16, 15, 14, 13, 13, 11, 11, 10, 10, 9, 9, 15,
- 17, 17, 17, 17, 16, 15, 14, 13, 12, 11, 11, 10, 10, 9, 9, 14, 16, 16,
- 16, 16, 15, 14, 13, 12, 12, 11, 11, 9, 9, 9, 8, 14, 16, 16, 16, 16, 15,
- 14, 13, 12, 12, 10, 10, 9, 9, 9, 8, 13, 14, 14, 14, 15, 14, 13, 12, 11,
- 11, 10, 10, 9, 9, 8, 8, 13, 14, 14, 14, 15, 14, 13, 12, 11, 11, 10, 10,
- 9, 9, 8, 8,
- /* Size 32x16 */
32, 33, 33, 33, 33, 33, 32, 32, 32, 32, 30, 30, 29, 28, 27, 26, 25, 23,
23, 21, 21, 19, 19, 18, 18, 17, 16, 15, 14, 14, 13, 13, 33, 32, 32, 32,
32, 32, 32, 32, 32, 31, 30, 30, 30, 29, 28, 27, 26, 25, 24, 23, 23, 21,
@@ -9277,33 +9248,47 @@
10, 10, 10, 9, 9, 9, 9, 9, 8, 8, 13, 14, 14, 14, 14, 14, 14, 15, 15, 15,
14, 14, 13, 13, 13, 12, 12, 11, 11, 11, 11, 10, 10, 9, 9, 9, 9, 9, 8, 8,
8, 8,
+ /* Size 32x16 */
+ 32, 33, 33, 32, 32, 30, 28, 27, 23, 23, 19, 19, 17, 16, 14, 13, 33, 32,
+ 32, 32, 32, 30, 29, 28, 24, 24, 20, 20, 17, 17, 15, 14, 33, 32, 32, 32,
+ 32, 30, 29, 28, 24, 24, 20, 20, 17, 17, 15, 14, 33, 32, 32, 32, 32, 31,
+ 29, 28, 25, 24, 20, 20, 18, 17, 15, 14, 33, 32, 32, 32, 31, 31, 30, 28,
+ 25, 25, 21, 21, 18, 17, 16, 14, 33, 32, 32, 31, 31, 30, 29, 28, 25, 24,
+ 21, 21, 18, 17, 16, 14, 32, 32, 32, 31, 30, 29, 28, 27, 24, 24, 20, 20,
+ 18, 17, 16, 14, 32, 32, 32, 30, 30, 29, 28, 27, 24, 24, 21, 21, 18, 17,
+ 16, 15, 32, 32, 31, 30, 29, 28, 27, 26, 24, 24, 21, 21, 18, 18, 16, 15,
+ 32, 31, 31, 30, 29, 28, 26, 26, 24, 23, 20, 20, 18, 18, 16, 15, 30, 30,
+ 30, 28, 28, 26, 24, 23, 21, 21, 19, 19, 17, 16, 15, 14, 30, 30, 30, 28,
+ 28, 26, 24, 23, 21, 21, 19, 19, 17, 16, 15, 14, 29, 30, 30, 28, 27, 24,
+ 22, 21, 20, 19, 17, 17, 16, 15, 14, 13, 28, 29, 30, 28, 27, 24, 21, 21,
+ 19, 19, 17, 17, 16, 15, 14, 13, 27, 28, 28, 27, 26, 23, 21, 20, 18, 18,
+ 16, 16, 15, 14, 13, 13, 26, 27, 28, 26, 26, 23, 20, 20, 18, 18, 16, 16,
+ 14, 14, 13, 12, 25, 26, 26, 25, 25, 22, 20, 19, 17, 17, 15, 15, 14, 13,
+ 13, 12, 23, 25, 25, 24, 24, 21, 19, 18, 16, 16, 14, 14, 13, 13, 12, 11,
+ 23, 24, 24, 24, 24, 21, 19, 18, 16, 16, 14, 14, 13, 13, 12, 11, 21, 23,
+ 23, 22, 22, 20, 18, 17, 15, 15, 13, 13, 12, 12, 11, 11, 21, 23, 23, 22,
+ 22, 20, 18, 17, 15, 15, 13, 13, 12, 12, 11, 11, 19, 21, 21, 21, 21, 19,
+ 17, 17, 14, 14, 13, 13, 12, 11, 10, 10, 19, 20, 21, 20, 20, 19, 17, 16,
+ 14, 14, 12, 12, 11, 11, 10, 10, 18, 19, 20, 20, 20, 18, 17, 16, 14, 14,
+ 12, 12, 11, 11, 10, 9, 18, 19, 19, 19, 19, 18, 16, 15, 14, 13, 12, 12,
+ 11, 10, 10, 9, 17, 18, 18, 18, 18, 17, 16, 15, 13, 13, 12, 12, 10, 10,
+ 9, 9, 16, 17, 17, 17, 18, 16, 15, 14, 13, 13, 11, 11, 10, 10, 9, 9, 15,
+ 17, 17, 17, 17, 16, 15, 14, 13, 12, 11, 11, 10, 10, 9, 9, 14, 16, 16,
+ 16, 16, 15, 14, 13, 12, 12, 11, 11, 9, 9, 9, 8, 14, 16, 16, 16, 16, 15,
+ 14, 13, 12, 12, 10, 10, 9, 9, 9, 8, 13, 14, 14, 14, 15, 14, 13, 12, 11,
+ 11, 10, 10, 9, 9, 8, 8, 13, 14, 14, 14, 15, 14, 13, 12, 11, 11, 10, 10,
+ 9, 9, 8, 8,
/* Size 4x16 */
- 33, 30, 23, 16, 32, 30, 24, 17, 32, 31, 25, 17, 32, 29, 24, 17, 32, 28,
- 24, 18, 30, 26, 21, 16, 30, 24, 19, 15, 28, 23, 18, 14, 26, 22, 17, 13,
- 24, 21, 16, 13, 23, 20, 15, 12, 20, 19, 14, 11, 19, 18, 13, 10, 17, 16,
- 13, 10, 16, 15, 12, 9, 14, 14, 11, 9,
- /* Size 16x4 */
33, 32, 32, 32, 32, 30, 30, 28, 26, 24, 23, 20, 19, 17, 16, 14, 30, 30,
31, 29, 28, 26, 24, 23, 22, 21, 20, 19, 18, 16, 15, 14, 23, 24, 25, 24,
24, 21, 19, 18, 17, 16, 15, 14, 13, 13, 12, 11, 16, 17, 17, 17, 18, 16,
15, 14, 13, 13, 12, 11, 10, 10, 9, 9,
+ /* Size 16x4 */
+ 33, 30, 23, 16, 32, 30, 24, 17, 32, 31, 25, 17, 32, 29, 24, 17, 32, 28,
+ 24, 18, 30, 26, 21, 16, 30, 24, 19, 15, 28, 23, 18, 14, 26, 22, 17, 13,
+ 24, 21, 16, 13, 23, 20, 15, 12, 20, 19, 14, 11, 19, 18, 13, 10, 17, 16,
+ 13, 10, 16, 15, 12, 9, 14, 14, 11, 9,
/* Size 8x32 */
- 32, 33, 32, 28, 23, 19, 17, 14, 33, 32, 32, 29, 24, 20, 17, 15, 33, 32,
- 32, 29, 24, 20, 17, 15, 33, 32, 32, 29, 25, 20, 18, 15, 33, 32, 31, 30,
- 25, 21, 18, 16, 33, 32, 31, 29, 25, 21, 18, 16, 32, 32, 30, 28, 24, 20,
- 18, 16, 32, 32, 30, 28, 24, 21, 18, 16, 32, 31, 29, 27, 24, 21, 18, 16,
- 32, 31, 29, 26, 24, 20, 18, 16, 30, 30, 28, 24, 21, 19, 17, 15, 30, 30,
- 28, 24, 21, 19, 17, 15, 29, 30, 27, 22, 20, 17, 16, 14, 28, 30, 27, 21,
- 19, 17, 16, 14, 27, 28, 26, 21, 18, 16, 15, 13, 26, 28, 26, 20, 18, 16,
- 14, 13, 25, 26, 25, 20, 17, 15, 14, 13, 23, 25, 24, 19, 16, 14, 13, 12,
- 23, 24, 24, 19, 16, 14, 13, 12, 21, 23, 22, 18, 15, 13, 12, 11, 21, 23,
- 22, 18, 15, 13, 12, 11, 19, 21, 21, 17, 14, 13, 12, 10, 19, 21, 20, 17,
- 14, 12, 11, 10, 18, 20, 20, 17, 14, 12, 11, 10, 18, 19, 19, 16, 14, 12,
- 11, 10, 17, 18, 18, 16, 13, 12, 10, 9, 16, 17, 18, 15, 13, 11, 10, 9,
- 15, 17, 17, 15, 13, 11, 10, 9, 14, 16, 16, 14, 12, 11, 9, 9, 14, 16, 16,
- 14, 12, 10, 9, 9, 13, 14, 15, 13, 11, 10, 9, 8, 13, 14, 15, 13, 11, 10,
- 9, 8,
- /* Size 32x8 */
32, 33, 33, 33, 33, 33, 32, 32, 32, 32, 30, 30, 29, 28, 27, 26, 25, 23,
23, 21, 21, 19, 19, 18, 18, 17, 16, 15, 14, 14, 13, 13, 33, 32, 32, 32,
32, 32, 32, 32, 31, 31, 30, 30, 30, 30, 28, 28, 26, 25, 24, 23, 23, 21,
@@ -9318,7 +9303,23 @@
18, 18, 18, 18, 17, 17, 16, 16, 15, 14, 14, 13, 13, 12, 12, 12, 11, 11,
11, 10, 10, 10, 9, 9, 9, 9, 14, 15, 15, 15, 16, 16, 16, 16, 16, 16, 15,
15, 14, 14, 13, 13, 13, 12, 12, 11, 11, 10, 10, 10, 10, 9, 9, 9, 9, 9,
- 8, 8 },
+ 8, 8,
+ /* Size 32x8 */
+ 32, 33, 32, 28, 23, 19, 17, 14, 33, 32, 32, 29, 24, 20, 17, 15, 33, 32,
+ 32, 29, 24, 20, 17, 15, 33, 32, 32, 29, 25, 20, 18, 15, 33, 32, 31, 30,
+ 25, 21, 18, 16, 33, 32, 31, 29, 25, 21, 18, 16, 32, 32, 30, 28, 24, 20,
+ 18, 16, 32, 32, 30, 28, 24, 21, 18, 16, 32, 31, 29, 27, 24, 21, 18, 16,
+ 32, 31, 29, 26, 24, 20, 18, 16, 30, 30, 28, 24, 21, 19, 17, 15, 30, 30,
+ 28, 24, 21, 19, 17, 15, 29, 30, 27, 22, 20, 17, 16, 14, 28, 30, 27, 21,
+ 19, 17, 16, 14, 27, 28, 26, 21, 18, 16, 15, 13, 26, 28, 26, 20, 18, 16,
+ 14, 13, 25, 26, 25, 20, 17, 15, 14, 13, 23, 25, 24, 19, 16, 14, 13, 12,
+ 23, 24, 24, 19, 16, 14, 13, 12, 21, 23, 22, 18, 15, 13, 12, 11, 21, 23,
+ 22, 18, 15, 13, 12, 11, 19, 21, 21, 17, 14, 13, 12, 10, 19, 21, 20, 17,
+ 14, 12, 11, 10, 18, 20, 20, 17, 14, 12, 11, 10, 18, 19, 19, 16, 14, 12,
+ 11, 10, 17, 18, 18, 16, 13, 12, 10, 9, 16, 17, 18, 15, 13, 11, 10, 9,
+ 15, 17, 17, 15, 13, 11, 10, 9, 14, 16, 16, 14, 12, 11, 9, 9, 14, 16, 16,
+ 14, 12, 10, 9, 9, 13, 14, 15, 13, 11, 10, 9, 8, 13, 14, 15, 13, 11, 10,
+ 9, 8 },
{ /* Chroma */
/* Size 4x4 */
33, 24, 22, 19, 24, 21, 20, 19, 22, 20, 17, 15, 19, 19, 15, 13,
@@ -9402,21 +9403,12 @@
12, 12, 16, 17, 17, 18, 18, 18, 18, 19, 19, 19, 18, 18, 17, 17, 17, 16,
16, 15, 15, 14, 14, 14, 14, 13, 13, 13, 12, 12, 12, 12, 12, 12,
/* Size 4x8 */
- 33, 24, 22, 19, 31, 23, 23, 20, 26, 22, 22, 20, 22, 20, 19, 18, 23, 21,
- 17, 16, 21, 20, 17, 15, 20, 20, 16, 14, 19, 19, 16, 13,
- /* Size 8x4 */
33, 31, 26, 22, 23, 21, 20, 19, 24, 23, 22, 20, 21, 20, 20, 19, 22, 23,
22, 19, 17, 17, 16, 16, 19, 20, 20, 18, 16, 15, 14, 13,
+ /* Size 8x4 */
+ 33, 24, 22, 19, 31, 23, 23, 20, 26, 22, 22, 20, 22, 20, 19, 18, 23, 21,
+ 17, 16, 21, 20, 17, 15, 20, 20, 16, 14, 19, 19, 16, 13,
/* Size 8x16 */
- 32, 33, 28, 21, 21, 20, 18, 17, 33, 33, 27, 22, 22, 20, 19, 18, 34, 32,
- 26, 22, 23, 21, 20, 19, 31, 28, 24, 22, 22, 22, 20, 19, 28, 26, 22, 22,
- 23, 22, 21, 20, 24, 24, 22, 20, 21, 20, 19, 18, 22, 22, 21, 20, 19, 19,
- 19, 18, 21, 22, 22, 19, 19, 18, 18, 17, 21, 23, 22, 19, 18, 17, 17, 16,
- 21, 23, 22, 19, 18, 17, 16, 16, 20, 22, 22, 19, 17, 16, 16, 15, 20, 21,
- 22, 19, 17, 16, 15, 14, 19, 20, 21, 19, 17, 15, 14, 13, 18, 20, 20, 18,
- 16, 15, 14, 13, 17, 19, 20, 18, 16, 14, 13, 12, 16, 18, 19, 17, 15, 14,
- 13, 12,
- /* Size 16x8 */
32, 33, 34, 31, 28, 24, 22, 21, 21, 21, 20, 20, 19, 18, 17, 16, 33, 33,
32, 28, 26, 24, 22, 22, 23, 23, 22, 21, 20, 20, 19, 18, 28, 27, 26, 24,
22, 22, 21, 22, 22, 22, 22, 22, 21, 20, 20, 19, 21, 22, 22, 22, 22, 20,
@@ -9425,37 +9417,16 @@
16, 16, 15, 15, 14, 14, 18, 19, 20, 20, 21, 19, 19, 18, 17, 16, 16, 15,
14, 14, 13, 13, 17, 18, 19, 19, 20, 18, 18, 17, 16, 16, 15, 14, 13, 13,
12, 12,
+ /* Size 16x8 */
+ 32, 33, 28, 21, 21, 20, 18, 17, 33, 33, 27, 22, 22, 20, 19, 18, 34, 32,
+ 26, 22, 23, 21, 20, 19, 31, 28, 24, 22, 22, 22, 20, 19, 28, 26, 22, 22,
+ 23, 22, 21, 20, 24, 24, 22, 20, 21, 20, 19, 18, 22, 22, 21, 20, 19, 19,
+ 19, 18, 21, 22, 22, 19, 19, 18, 18, 17, 21, 23, 22, 19, 18, 17, 17, 16,
+ 21, 23, 22, 19, 18, 17, 16, 16, 20, 22, 22, 19, 17, 16, 16, 15, 20, 21,
+ 22, 19, 17, 16, 15, 14, 19, 20, 21, 19, 17, 15, 14, 13, 18, 20, 20, 18,
+ 16, 15, 14, 13, 17, 19, 20, 18, 16, 14, 13, 12, 16, 18, 19, 17, 15, 14,
+ 13, 12,
/* Size 16x32 */
- 32, 33, 33, 29, 28, 24, 21, 21, 21, 21, 20, 20, 18, 18, 17, 16, 33, 33,
- 33, 28, 27, 24, 22, 22, 22, 22, 20, 20, 19, 19, 18, 17, 33, 33, 33, 28,
- 27, 24, 22, 22, 22, 22, 20, 20, 19, 19, 18, 17, 34, 32, 32, 28, 26, 24,
- 22, 22, 22, 22, 21, 21, 20, 20, 18, 18, 34, 32, 32, 28, 26, 24, 22, 22,
- 23, 23, 21, 21, 20, 20, 19, 18, 32, 31, 30, 26, 25, 23, 22, 22, 23, 23,
- 21, 21, 20, 20, 19, 18, 31, 29, 28, 26, 24, 23, 22, 22, 22, 22, 22, 22,
- 20, 20, 19, 18, 30, 28, 28, 24, 23, 23, 22, 22, 23, 22, 22, 22, 20, 20,
- 19, 19, 28, 26, 26, 23, 22, 22, 22, 22, 23, 22, 22, 22, 21, 20, 20, 19,
- 28, 26, 26, 23, 22, 22, 21, 22, 22, 22, 22, 22, 21, 20, 19, 19, 24, 24,
- 24, 22, 22, 21, 20, 20, 21, 21, 20, 20, 19, 19, 18, 18, 24, 24, 24, 22,
- 22, 21, 20, 20, 21, 21, 20, 20, 19, 19, 18, 18, 22, 22, 22, 22, 21, 20,
- 20, 20, 19, 19, 19, 19, 19, 18, 18, 17, 21, 22, 22, 22, 21, 20, 19, 19,
- 19, 19, 19, 19, 18, 18, 17, 17, 21, 22, 22, 22, 22, 20, 19, 19, 19, 19,
- 18, 18, 18, 18, 17, 17, 21, 22, 22, 22, 22, 20, 19, 19, 18, 18, 18, 18,
- 17, 17, 17, 16, 21, 22, 23, 22, 22, 21, 19, 19, 18, 18, 17, 17, 17, 17,
- 16, 16, 21, 23, 23, 23, 22, 21, 19, 19, 18, 17, 17, 17, 16, 16, 16, 15,
- 21, 22, 23, 22, 22, 21, 19, 19, 18, 17, 17, 17, 16, 16, 16, 15, 20, 22,
- 22, 22, 22, 20, 19, 19, 17, 17, 16, 16, 16, 15, 15, 14, 20, 22, 22, 22,
- 22, 20, 19, 19, 17, 17, 16, 16, 16, 15, 15, 14, 20, 21, 21, 22, 22, 20,
- 19, 18, 17, 17, 16, 16, 15, 15, 14, 14, 20, 21, 21, 22, 22, 20, 19, 18,
- 17, 17, 16, 16, 15, 14, 14, 14, 19, 20, 21, 21, 21, 20, 19, 18, 17, 17,
- 15, 15, 14, 14, 14, 13, 19, 20, 20, 21, 21, 20, 19, 18, 17, 16, 15, 15,
- 14, 14, 13, 13, 19, 20, 20, 20, 21, 20, 18, 18, 16, 16, 15, 15, 14, 14,
- 13, 13, 18, 20, 20, 20, 20, 19, 18, 18, 16, 16, 15, 15, 14, 13, 13, 12,
- 18, 19, 19, 20, 20, 19, 18, 17, 16, 16, 14, 14, 13, 13, 13, 12, 17, 19,
- 19, 19, 20, 19, 18, 17, 16, 16, 14, 14, 13, 13, 12, 12, 17, 19, 19, 19,
- 19, 19, 17, 17, 16, 16, 14, 14, 13, 13, 12, 12, 16, 18, 18, 18, 19, 18,
- 17, 17, 15, 15, 14, 14, 13, 12, 12, 12, 16, 18, 18, 18, 19, 18, 17, 17,
- 15, 15, 14, 14, 13, 12, 12, 12,
- /* Size 32x16 */
32, 33, 33, 34, 34, 32, 31, 30, 28, 28, 24, 24, 22, 21, 21, 21, 21, 21,
21, 20, 20, 20, 20, 19, 19, 19, 18, 18, 17, 17, 16, 16, 33, 33, 33, 32,
32, 31, 29, 28, 26, 26, 24, 24, 22, 22, 22, 22, 22, 23, 22, 22, 22, 21,
@@ -9485,33 +9456,47 @@
15, 14, 14, 14, 13, 13, 13, 13, 12, 12, 12, 12, 16, 17, 17, 18, 18, 18,
18, 19, 19, 19, 18, 18, 17, 17, 17, 16, 16, 15, 15, 14, 14, 14, 14, 13,
13, 13, 12, 12, 12, 12, 12, 12,
+ /* Size 32x16 */
+ 32, 33, 33, 29, 28, 24, 21, 21, 21, 21, 20, 20, 18, 18, 17, 16, 33, 33,
+ 33, 28, 27, 24, 22, 22, 22, 22, 20, 20, 19, 19, 18, 17, 33, 33, 33, 28,
+ 27, 24, 22, 22, 22, 22, 20, 20, 19, 19, 18, 17, 34, 32, 32, 28, 26, 24,
+ 22, 22, 22, 22, 21, 21, 20, 20, 18, 18, 34, 32, 32, 28, 26, 24, 22, 22,
+ 23, 23, 21, 21, 20, 20, 19, 18, 32, 31, 30, 26, 25, 23, 22, 22, 23, 23,
+ 21, 21, 20, 20, 19, 18, 31, 29, 28, 26, 24, 23, 22, 22, 22, 22, 22, 22,
+ 20, 20, 19, 18, 30, 28, 28, 24, 23, 23, 22, 22, 23, 22, 22, 22, 20, 20,
+ 19, 19, 28, 26, 26, 23, 22, 22, 22, 22, 23, 22, 22, 22, 21, 20, 20, 19,
+ 28, 26, 26, 23, 22, 22, 21, 22, 22, 22, 22, 22, 21, 20, 19, 19, 24, 24,
+ 24, 22, 22, 21, 20, 20, 21, 21, 20, 20, 19, 19, 18, 18, 24, 24, 24, 22,
+ 22, 21, 20, 20, 21, 21, 20, 20, 19, 19, 18, 18, 22, 22, 22, 22, 21, 20,
+ 20, 20, 19, 19, 19, 19, 19, 18, 18, 17, 21, 22, 22, 22, 21, 20, 19, 19,
+ 19, 19, 19, 19, 18, 18, 17, 17, 21, 22, 22, 22, 22, 20, 19, 19, 19, 19,
+ 18, 18, 18, 18, 17, 17, 21, 22, 22, 22, 22, 20, 19, 19, 18, 18, 18, 18,
+ 17, 17, 17, 16, 21, 22, 23, 22, 22, 21, 19, 19, 18, 18, 17, 17, 17, 17,
+ 16, 16, 21, 23, 23, 23, 22, 21, 19, 19, 18, 17, 17, 17, 16, 16, 16, 15,
+ 21, 22, 23, 22, 22, 21, 19, 19, 18, 17, 17, 17, 16, 16, 16, 15, 20, 22,
+ 22, 22, 22, 20, 19, 19, 17, 17, 16, 16, 16, 15, 15, 14, 20, 22, 22, 22,
+ 22, 20, 19, 19, 17, 17, 16, 16, 16, 15, 15, 14, 20, 21, 21, 22, 22, 20,
+ 19, 18, 17, 17, 16, 16, 15, 15, 14, 14, 20, 21, 21, 22, 22, 20, 19, 18,
+ 17, 17, 16, 16, 15, 14, 14, 14, 19, 20, 21, 21, 21, 20, 19, 18, 17, 17,
+ 15, 15, 14, 14, 14, 13, 19, 20, 20, 21, 21, 20, 19, 18, 17, 16, 15, 15,
+ 14, 14, 13, 13, 19, 20, 20, 20, 21, 20, 18, 18, 16, 16, 15, 15, 14, 14,
+ 13, 13, 18, 20, 20, 20, 20, 19, 18, 18, 16, 16, 15, 15, 14, 13, 13, 12,
+ 18, 19, 19, 20, 20, 19, 18, 17, 16, 16, 14, 14, 13, 13, 13, 12, 17, 19,
+ 19, 19, 20, 19, 18, 17, 16, 16, 14, 14, 13, 13, 12, 12, 17, 19, 19, 19,
+ 19, 19, 17, 17, 16, 16, 14, 14, 13, 13, 12, 12, 16, 18, 18, 18, 19, 18,
+ 17, 17, 15, 15, 14, 14, 13, 12, 12, 12, 16, 18, 18, 18, 19, 18, 17, 17,
+ 15, 15, 14, 14, 13, 12, 12, 12,
/* Size 4x16 */
- 33, 24, 21, 18, 33, 24, 22, 19, 32, 24, 23, 20, 29, 23, 22, 20, 26, 22,
- 22, 20, 24, 21, 21, 19, 22, 20, 19, 18, 22, 20, 19, 18, 22, 21, 18, 17,
- 22, 21, 17, 16, 22, 20, 17, 15, 21, 20, 17, 14, 20, 20, 16, 14, 20, 19,
- 16, 13, 19, 19, 16, 13, 18, 18, 15, 12,
- /* Size 16x4 */
33, 33, 32, 29, 26, 24, 22, 22, 22, 22, 22, 21, 20, 20, 19, 18, 24, 24,
24, 23, 22, 21, 20, 20, 21, 21, 20, 20, 20, 19, 19, 18, 21, 22, 23, 22,
22, 21, 19, 19, 18, 17, 17, 17, 16, 16, 16, 15, 18, 19, 20, 20, 20, 19,
18, 18, 17, 16, 15, 14, 14, 13, 13, 12,
+ /* Size 16x4 */
+ 33, 24, 21, 18, 33, 24, 22, 19, 32, 24, 23, 20, 29, 23, 22, 20, 26, 22,
+ 22, 20, 24, 21, 21, 19, 22, 20, 19, 18, 22, 20, 19, 18, 22, 21, 18, 17,
+ 22, 21, 17, 16, 22, 20, 17, 15, 21, 20, 17, 14, 20, 20, 16, 14, 20, 19,
+ 16, 13, 19, 19, 16, 13, 18, 18, 15, 12,
/* Size 8x32 */
- 32, 33, 28, 21, 21, 20, 18, 17, 33, 33, 27, 22, 22, 20, 19, 18, 33, 33,
- 27, 22, 22, 20, 19, 18, 34, 32, 26, 22, 22, 21, 20, 18, 34, 32, 26, 22,
- 23, 21, 20, 19, 32, 30, 25, 22, 23, 21, 20, 19, 31, 28, 24, 22, 22, 22,
- 20, 19, 30, 28, 23, 22, 23, 22, 20, 19, 28, 26, 22, 22, 23, 22, 21, 20,
- 28, 26, 22, 21, 22, 22, 21, 19, 24, 24, 22, 20, 21, 20, 19, 18, 24, 24,
- 22, 20, 21, 20, 19, 18, 22, 22, 21, 20, 19, 19, 19, 18, 21, 22, 21, 19,
- 19, 19, 18, 17, 21, 22, 22, 19, 19, 18, 18, 17, 21, 22, 22, 19, 18, 18,
- 17, 17, 21, 23, 22, 19, 18, 17, 17, 16, 21, 23, 22, 19, 18, 17, 16, 16,
- 21, 23, 22, 19, 18, 17, 16, 16, 20, 22, 22, 19, 17, 16, 16, 15, 20, 22,
- 22, 19, 17, 16, 16, 15, 20, 21, 22, 19, 17, 16, 15, 14, 20, 21, 22, 19,
- 17, 16, 15, 14, 19, 21, 21, 19, 17, 15, 14, 14, 19, 20, 21, 19, 17, 15,
- 14, 13, 19, 20, 21, 18, 16, 15, 14, 13, 18, 20, 20, 18, 16, 15, 14, 13,
- 18, 19, 20, 18, 16, 14, 13, 13, 17, 19, 20, 18, 16, 14, 13, 12, 17, 19,
- 19, 17, 16, 14, 13, 12, 16, 18, 19, 17, 15, 14, 13, 12, 16, 18, 19, 17,
- 15, 14, 13, 12,
- /* Size 32x8 */
32, 33, 33, 34, 34, 32, 31, 30, 28, 28, 24, 24, 22, 21, 21, 21, 21, 21,
21, 20, 20, 20, 20, 19, 19, 19, 18, 18, 17, 17, 16, 16, 33, 33, 33, 32,
32, 30, 28, 28, 26, 26, 24, 24, 22, 22, 22, 22, 23, 23, 23, 22, 22, 21,
@@ -9526,7 +9511,23 @@
20, 20, 21, 21, 19, 19, 19, 18, 18, 17, 17, 16, 16, 16, 16, 15, 15, 14,
14, 14, 14, 13, 13, 13, 13, 13, 17, 18, 18, 18, 19, 19, 19, 19, 20, 19,
18, 18, 18, 17, 17, 17, 16, 16, 16, 15, 15, 14, 14, 14, 13, 13, 13, 13,
- 12, 12, 12, 12 },
+ 12, 12, 12, 12,
+ /* Size 32x8 */
+ 32, 33, 28, 21, 21, 20, 18, 17, 33, 33, 27, 22, 22, 20, 19, 18, 33, 33,
+ 27, 22, 22, 20, 19, 18, 34, 32, 26, 22, 22, 21, 20, 18, 34, 32, 26, 22,
+ 23, 21, 20, 19, 32, 30, 25, 22, 23, 21, 20, 19, 31, 28, 24, 22, 22, 22,
+ 20, 19, 30, 28, 23, 22, 23, 22, 20, 19, 28, 26, 22, 22, 23, 22, 21, 20,
+ 28, 26, 22, 21, 22, 22, 21, 19, 24, 24, 22, 20, 21, 20, 19, 18, 24, 24,
+ 22, 20, 21, 20, 19, 18, 22, 22, 21, 20, 19, 19, 19, 18, 21, 22, 21, 19,
+ 19, 19, 18, 17, 21, 22, 22, 19, 19, 18, 18, 17, 21, 22, 22, 19, 18, 18,
+ 17, 17, 21, 23, 22, 19, 18, 17, 17, 16, 21, 23, 22, 19, 18, 17, 16, 16,
+ 21, 23, 22, 19, 18, 17, 16, 16, 20, 22, 22, 19, 17, 16, 16, 15, 20, 22,
+ 22, 19, 17, 16, 16, 15, 20, 21, 22, 19, 17, 16, 15, 14, 20, 21, 22, 19,
+ 17, 16, 15, 14, 19, 21, 21, 19, 17, 15, 14, 14, 19, 20, 21, 19, 17, 15,
+ 14, 13, 19, 20, 21, 18, 16, 15, 14, 13, 18, 20, 20, 18, 16, 15, 14, 13,
+ 18, 19, 20, 18, 16, 14, 13, 13, 17, 19, 20, 18, 16, 14, 13, 12, 17, 19,
+ 19, 17, 16, 14, 13, 12, 16, 18, 19, 17, 15, 14, 13, 12, 16, 18, 19, 17,
+ 15, 14, 13, 12 },
},
{
{ /* Luma */
@@ -9612,21 +9613,12 @@
10, 9, 15, 15, 15, 16, 16, 16, 16, 16, 16, 17, 17, 16, 15, 15, 14, 14,
13, 13, 13, 12, 12, 12, 12, 11, 11, 11, 10, 10, 10, 9, 9, 9,
/* Size 4x8 */
- 32, 32, 24, 18, 32, 31, 25, 19, 32, 29, 24, 20, 30, 28, 20, 17, 27, 26,
- 18, 15, 23, 23, 16, 13, 20, 20, 14, 12, 17, 18, 13, 11,
- /* Size 8x4 */
32, 32, 32, 30, 27, 23, 20, 17, 32, 31, 29, 28, 26, 23, 20, 18, 24, 25,
24, 20, 18, 16, 14, 13, 18, 19, 20, 17, 15, 13, 12, 11,
+ /* Size 8x4 */
+ 32, 32, 24, 18, 32, 31, 25, 19, 32, 29, 24, 20, 30, 28, 20, 17, 27, 26,
+ 18, 15, 23, 23, 16, 13, 20, 20, 14, 12, 17, 18, 13, 11,
/* Size 8x16 */
- 32, 33, 32, 29, 26, 23, 19, 16, 33, 32, 32, 29, 27, 24, 20, 17, 33, 32,
- 31, 30, 28, 25, 21, 17, 33, 32, 30, 29, 27, 24, 21, 17, 32, 32, 30, 28,
- 26, 24, 21, 18, 32, 31, 29, 28, 26, 24, 21, 18, 30, 30, 28, 25, 23, 21,
- 19, 16, 28, 30, 27, 22, 20, 19, 17, 15, 27, 28, 26, 22, 20, 18, 16, 14,
- 25, 26, 25, 21, 19, 17, 15, 13, 23, 25, 24, 20, 18, 16, 14, 13, 21, 23,
- 22, 19, 17, 15, 13, 12, 19, 21, 20, 18, 16, 14, 12, 11, 18, 19, 19, 17,
- 15, 14, 12, 11, 17, 18, 18, 16, 15, 13, 12, 10, 16, 17, 18, 16, 14, 13,
- 11, 10,
- /* Size 16x8 */
32, 33, 33, 33, 32, 32, 30, 28, 27, 25, 23, 21, 19, 18, 17, 16, 33, 32,
32, 32, 32, 31, 30, 30, 28, 26, 25, 23, 21, 19, 18, 17, 32, 32, 31, 30,
30, 29, 28, 27, 26, 25, 24, 22, 20, 19, 18, 18, 29, 29, 30, 29, 28, 28,
@@ -9635,37 +9627,16 @@
16, 15, 14, 14, 13, 13, 19, 20, 21, 21, 21, 21, 19, 17, 16, 15, 14, 13,
12, 12, 12, 11, 16, 17, 17, 17, 18, 18, 16, 15, 14, 13, 13, 12, 11, 11,
10, 10,
+ /* Size 16x8 */
+ 32, 33, 32, 29, 26, 23, 19, 16, 33, 32, 32, 29, 27, 24, 20, 17, 33, 32,
+ 31, 30, 28, 25, 21, 17, 33, 32, 30, 29, 27, 24, 21, 17, 32, 32, 30, 28,
+ 26, 24, 21, 18, 32, 31, 29, 28, 26, 24, 21, 18, 30, 30, 28, 25, 23, 21,
+ 19, 16, 28, 30, 27, 22, 20, 19, 17, 15, 27, 28, 26, 22, 20, 18, 16, 14,
+ 25, 26, 25, 21, 19, 17, 15, 13, 23, 25, 24, 20, 18, 16, 14, 13, 21, 23,
+ 22, 19, 17, 15, 13, 12, 19, 21, 20, 18, 16, 14, 12, 11, 18, 19, 19, 17,
+ 15, 14, 12, 11, 17, 18, 18, 16, 15, 13, 12, 10, 16, 17, 18, 16, 14, 13,
+ 11, 10,
/* Size 16x32 */
- 32, 33, 33, 33, 32, 32, 29, 28, 26, 23, 23, 20, 19, 18, 16, 16, 33, 32,
- 32, 32, 32, 32, 29, 29, 27, 24, 24, 21, 20, 18, 16, 16, 33, 32, 32, 32,
- 32, 32, 29, 29, 27, 24, 24, 21, 20, 19, 17, 17, 33, 32, 32, 32, 32, 32,
- 30, 29, 28, 25, 25, 21, 20, 19, 17, 17, 33, 32, 32, 32, 31, 31, 30, 30,
- 28, 25, 25, 22, 21, 19, 17, 17, 33, 32, 32, 32, 31, 31, 30, 30, 28, 25,
- 25, 22, 21, 19, 17, 17, 33, 32, 32, 31, 30, 30, 29, 28, 27, 24, 24, 21,
- 21, 19, 17, 17, 32, 32, 32, 31, 30, 30, 28, 28, 27, 24, 24, 21, 20, 19,
- 17, 17, 32, 32, 32, 31, 30, 30, 28, 28, 26, 24, 24, 21, 21, 19, 18, 18,
- 32, 32, 31, 30, 29, 29, 28, 27, 26, 24, 24, 21, 21, 20, 18, 18, 32, 32,
- 31, 30, 29, 29, 28, 27, 26, 24, 24, 21, 21, 20, 18, 18, 31, 31, 31, 29,
- 28, 28, 26, 25, 24, 22, 22, 20, 19, 18, 17, 17, 30, 30, 30, 29, 28, 28,
- 25, 24, 23, 21, 21, 19, 19, 18, 16, 16, 30, 30, 30, 29, 28, 28, 24, 23,
- 22, 20, 20, 19, 18, 17, 16, 16, 28, 29, 30, 28, 27, 27, 22, 21, 20, 19,
- 19, 18, 17, 16, 15, 15, 28, 29, 30, 28, 27, 27, 22, 21, 20, 19, 19, 18,
- 17, 16, 15, 15, 27, 28, 28, 27, 26, 26, 22, 20, 20, 18, 18, 17, 16, 15,
- 14, 14, 26, 27, 28, 26, 26, 26, 21, 20, 19, 18, 18, 16, 16, 15, 14, 14,
- 25, 26, 26, 26, 25, 25, 21, 20, 19, 17, 17, 16, 15, 15, 13, 13, 23, 25,
- 25, 24, 24, 24, 20, 19, 18, 16, 16, 15, 14, 14, 13, 13, 23, 25, 25, 24,
- 24, 24, 20, 19, 18, 16, 16, 15, 14, 14, 13, 13, 22, 23, 23, 23, 23, 23,
- 19, 18, 17, 16, 16, 14, 14, 13, 12, 12, 21, 23, 23, 23, 22, 22, 19, 18,
- 17, 15, 15, 14, 13, 13, 12, 12, 20, 22, 22, 22, 22, 22, 19, 18, 17, 15,
- 15, 13, 13, 12, 12, 12, 19, 20, 21, 20, 20, 20, 18, 17, 16, 14, 14, 13,
- 12, 12, 11, 11, 19, 20, 21, 20, 20, 20, 18, 17, 16, 14, 14, 13, 12, 12,
- 11, 11, 18, 19, 19, 19, 19, 19, 17, 16, 15, 14, 14, 12, 12, 11, 11, 11,
- 18, 19, 19, 19, 19, 19, 17, 16, 15, 14, 14, 12, 12, 11, 10, 10, 17, 18,
- 18, 18, 18, 18, 16, 16, 15, 13, 13, 12, 12, 11, 10, 10, 16, 17, 17, 17,
- 18, 18, 16, 15, 14, 13, 13, 12, 11, 11, 10, 10, 16, 17, 17, 17, 18, 18,
- 16, 15, 14, 13, 13, 12, 11, 11, 10, 10, 15, 16, 16, 16, 17, 17, 15, 14,
- 13, 12, 12, 11, 11, 10, 9, 9,
- /* Size 32x16 */
32, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 31, 30, 30, 28, 28, 27, 26,
25, 23, 23, 22, 21, 20, 19, 19, 18, 18, 17, 16, 16, 15, 33, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 31, 30, 30, 29, 29, 28, 27, 26, 25, 25, 23,
@@ -9695,33 +9666,47 @@
13, 12, 12, 12, 11, 11, 11, 10, 10, 10, 10, 9, 16, 16, 17, 17, 17, 17,
17, 17, 18, 18, 18, 17, 16, 16, 15, 15, 14, 14, 13, 13, 13, 12, 12, 12,
11, 11, 11, 10, 10, 10, 10, 9,
+ /* Size 32x16 */
+ 32, 33, 33, 33, 32, 32, 29, 28, 26, 23, 23, 20, 19, 18, 16, 16, 33, 32,
+ 32, 32, 32, 32, 29, 29, 27, 24, 24, 21, 20, 18, 16, 16, 33, 32, 32, 32,
+ 32, 32, 29, 29, 27, 24, 24, 21, 20, 19, 17, 17, 33, 32, 32, 32, 32, 32,
+ 30, 29, 28, 25, 25, 21, 20, 19, 17, 17, 33, 32, 32, 32, 31, 31, 30, 30,
+ 28, 25, 25, 22, 21, 19, 17, 17, 33, 32, 32, 32, 31, 31, 30, 30, 28, 25,
+ 25, 22, 21, 19, 17, 17, 33, 32, 32, 31, 30, 30, 29, 28, 27, 24, 24, 21,
+ 21, 19, 17, 17, 32, 32, 32, 31, 30, 30, 28, 28, 27, 24, 24, 21, 20, 19,
+ 17, 17, 32, 32, 32, 31, 30, 30, 28, 28, 26, 24, 24, 21, 21, 19, 18, 18,
+ 32, 32, 31, 30, 29, 29, 28, 27, 26, 24, 24, 21, 21, 20, 18, 18, 32, 32,
+ 31, 30, 29, 29, 28, 27, 26, 24, 24, 21, 21, 20, 18, 18, 31, 31, 31, 29,
+ 28, 28, 26, 25, 24, 22, 22, 20, 19, 18, 17, 17, 30, 30, 30, 29, 28, 28,
+ 25, 24, 23, 21, 21, 19, 19, 18, 16, 16, 30, 30, 30, 29, 28, 28, 24, 23,
+ 22, 20, 20, 19, 18, 17, 16, 16, 28, 29, 30, 28, 27, 27, 22, 21, 20, 19,
+ 19, 18, 17, 16, 15, 15, 28, 29, 30, 28, 27, 27, 22, 21, 20, 19, 19, 18,
+ 17, 16, 15, 15, 27, 28, 28, 27, 26, 26, 22, 20, 20, 18, 18, 17, 16, 15,
+ 14, 14, 26, 27, 28, 26, 26, 26, 21, 20, 19, 18, 18, 16, 16, 15, 14, 14,
+ 25, 26, 26, 26, 25, 25, 21, 20, 19, 17, 17, 16, 15, 15, 13, 13, 23, 25,
+ 25, 24, 24, 24, 20, 19, 18, 16, 16, 15, 14, 14, 13, 13, 23, 25, 25, 24,
+ 24, 24, 20, 19, 18, 16, 16, 15, 14, 14, 13, 13, 22, 23, 23, 23, 23, 23,
+ 19, 18, 17, 16, 16, 14, 14, 13, 12, 12, 21, 23, 23, 23, 22, 22, 19, 18,
+ 17, 15, 15, 14, 13, 13, 12, 12, 20, 22, 22, 22, 22, 22, 19, 18, 17, 15,
+ 15, 13, 13, 12, 12, 12, 19, 20, 21, 20, 20, 20, 18, 17, 16, 14, 14, 13,
+ 12, 12, 11, 11, 19, 20, 21, 20, 20, 20, 18, 17, 16, 14, 14, 13, 12, 12,
+ 11, 11, 18, 19, 19, 19, 19, 19, 17, 16, 15, 14, 14, 12, 12, 11, 11, 11,
+ 18, 19, 19, 19, 19, 19, 17, 16, 15, 14, 14, 12, 12, 11, 10, 10, 17, 18,
+ 18, 18, 18, 18, 16, 16, 15, 13, 13, 12, 12, 11, 10, 10, 16, 17, 17, 17,
+ 18, 18, 16, 15, 14, 13, 13, 12, 11, 11, 10, 10, 16, 17, 17, 17, 18, 18,
+ 16, 15, 14, 13, 13, 12, 11, 11, 10, 10, 15, 16, 16, 16, 17, 17, 15, 14,
+ 13, 12, 12, 11, 11, 10, 9, 9,
/* Size 4x16 */
- 33, 32, 23, 18, 32, 32, 24, 19, 32, 31, 25, 19, 32, 30, 24, 19, 32, 30,
- 24, 19, 32, 29, 24, 20, 30, 28, 21, 18, 29, 27, 19, 16, 28, 26, 18, 15,
- 26, 25, 17, 15, 25, 24, 16, 14, 23, 22, 15, 13, 20, 20, 14, 12, 19, 19,
- 14, 11, 18, 18, 13, 11, 17, 18, 13, 11,
- /* Size 16x4 */
33, 32, 32, 32, 32, 32, 30, 29, 28, 26, 25, 23, 20, 19, 18, 17, 32, 32,
31, 30, 30, 29, 28, 27, 26, 25, 24, 22, 20, 19, 18, 18, 23, 24, 25, 24,
24, 24, 21, 19, 18, 17, 16, 15, 14, 14, 13, 13, 18, 19, 19, 19, 19, 20,
18, 16, 15, 15, 14, 13, 12, 11, 11, 11,
+ /* Size 16x4 */
+ 33, 32, 23, 18, 32, 32, 24, 19, 32, 31, 25, 19, 32, 30, 24, 19, 32, 30,
+ 24, 19, 32, 29, 24, 20, 30, 28, 21, 18, 29, 27, 19, 16, 28, 26, 18, 15,
+ 26, 25, 17, 15, 25, 24, 16, 14, 23, 22, 15, 13, 20, 20, 14, 12, 19, 19,
+ 14, 11, 18, 18, 13, 11, 17, 18, 13, 11,
/* Size 8x32 */
- 32, 33, 32, 29, 26, 23, 19, 16, 33, 32, 32, 29, 27, 24, 20, 16, 33, 32,
- 32, 29, 27, 24, 20, 17, 33, 32, 32, 30, 28, 25, 20, 17, 33, 32, 31, 30,
- 28, 25, 21, 17, 33, 32, 31, 30, 28, 25, 21, 17, 33, 32, 30, 29, 27, 24,
- 21, 17, 32, 32, 30, 28, 27, 24, 20, 17, 32, 32, 30, 28, 26, 24, 21, 18,
- 32, 31, 29, 28, 26, 24, 21, 18, 32, 31, 29, 28, 26, 24, 21, 18, 31, 31,
- 28, 26, 24, 22, 19, 17, 30, 30, 28, 25, 23, 21, 19, 16, 30, 30, 28, 24,
- 22, 20, 18, 16, 28, 30, 27, 22, 20, 19, 17, 15, 28, 30, 27, 22, 20, 19,
- 17, 15, 27, 28, 26, 22, 20, 18, 16, 14, 26, 28, 26, 21, 19, 18, 16, 14,
- 25, 26, 25, 21, 19, 17, 15, 13, 23, 25, 24, 20, 18, 16, 14, 13, 23, 25,
- 24, 20, 18, 16, 14, 13, 22, 23, 23, 19, 17, 16, 14, 12, 21, 23, 22, 19,
- 17, 15, 13, 12, 20, 22, 22, 19, 17, 15, 13, 12, 19, 21, 20, 18, 16, 14,
- 12, 11, 19, 21, 20, 18, 16, 14, 12, 11, 18, 19, 19, 17, 15, 14, 12, 11,
- 18, 19, 19, 17, 15, 14, 12, 10, 17, 18, 18, 16, 15, 13, 12, 10, 16, 17,
- 18, 16, 14, 13, 11, 10, 16, 17, 18, 16, 14, 13, 11, 10, 15, 16, 17, 15,
- 13, 12, 11, 9,
- /* Size 32x8 */
32, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 31, 30, 30, 28, 28, 27, 26,
25, 23, 23, 22, 21, 20, 19, 19, 18, 18, 17, 16, 16, 15, 33, 32, 32, 32,
32, 32, 32, 32, 32, 31, 31, 31, 30, 30, 30, 30, 28, 28, 26, 25, 25, 23,
@@ -9736,7 +9721,23 @@
21, 20, 21, 21, 21, 19, 19, 18, 17, 17, 16, 16, 15, 14, 14, 14, 13, 13,
12, 12, 12, 12, 12, 11, 11, 11, 16, 16, 17, 17, 17, 17, 17, 17, 18, 18,
18, 17, 16, 16, 15, 15, 14, 14, 13, 13, 13, 12, 12, 12, 11, 11, 11, 10,
- 10, 10, 10, 9 },
+ 10, 10, 10, 9,
+ /* Size 32x8 */
+ 32, 33, 32, 29, 26, 23, 19, 16, 33, 32, 32, 29, 27, 24, 20, 16, 33, 32,
+ 32, 29, 27, 24, 20, 17, 33, 32, 32, 30, 28, 25, 20, 17, 33, 32, 31, 30,
+ 28, 25, 21, 17, 33, 32, 31, 30, 28, 25, 21, 17, 33, 32, 30, 29, 27, 24,
+ 21, 17, 32, 32, 30, 28, 27, 24, 20, 17, 32, 32, 30, 28, 26, 24, 21, 18,
+ 32, 31, 29, 28, 26, 24, 21, 18, 32, 31, 29, 28, 26, 24, 21, 18, 31, 31,
+ 28, 26, 24, 22, 19, 17, 30, 30, 28, 25, 23, 21, 19, 16, 30, 30, 28, 24,
+ 22, 20, 18, 16, 28, 30, 27, 22, 20, 19, 17, 15, 28, 30, 27, 22, 20, 19,
+ 17, 15, 27, 28, 26, 22, 20, 18, 16, 14, 26, 28, 26, 21, 19, 18, 16, 14,
+ 25, 26, 25, 21, 19, 17, 15, 13, 23, 25, 24, 20, 18, 16, 14, 13, 23, 25,
+ 24, 20, 18, 16, 14, 13, 22, 23, 23, 19, 17, 16, 14, 12, 21, 23, 22, 19,
+ 17, 15, 13, 12, 20, 22, 22, 19, 17, 15, 13, 12, 19, 21, 20, 18, 16, 14,
+ 12, 11, 19, 21, 20, 18, 16, 14, 12, 11, 18, 19, 19, 17, 15, 14, 12, 11,
+ 18, 19, 19, 17, 15, 14, 12, 10, 17, 18, 18, 16, 15, 13, 12, 10, 16, 17,
+ 18, 16, 14, 13, 11, 10, 16, 17, 18, 16, 14, 13, 11, 10, 15, 16, 17, 15,
+ 13, 12, 11, 9 },
{ /* Chroma */
/* Size 4x4 */
33, 25, 22, 20, 25, 21, 21, 20, 22, 21, 18, 17, 20, 20, 17, 14,
@@ -9820,21 +9821,12 @@
13, 13, 17, 18, 18, 19, 19, 19, 19, 19, 20, 20, 20, 19, 19, 18, 18, 18,
17, 17, 16, 16, 16, 15, 15, 15, 14, 14, 14, 14, 13, 13, 13, 13,
/* Size 4x8 */
- 33, 27, 22, 20, 32, 26, 23, 21, 26, 22, 23, 21, 23, 22, 20, 19, 22, 22,
- 18, 18, 22, 22, 17, 16, 21, 22, 17, 15, 19, 20, 16, 14,
- /* Size 8x4 */
33, 32, 26, 23, 22, 22, 21, 19, 27, 26, 22, 22, 22, 22, 22, 20, 22, 23,
23, 20, 18, 17, 17, 16, 20, 21, 21, 19, 18, 16, 15, 14,
+ /* Size 8x4 */
+ 33, 27, 22, 20, 32, 26, 23, 21, 26, 22, 23, 21, 23, 22, 20, 19, 22, 22,
+ 18, 18, 22, 22, 17, 16, 21, 22, 17, 15, 19, 20, 16, 14,
/* Size 8x16 */
- 32, 33, 28, 23, 21, 21, 20, 18, 33, 33, 27, 23, 22, 22, 20, 19, 34, 32,
- 26, 23, 23, 23, 21, 20, 31, 29, 24, 22, 22, 23, 22, 20, 29, 28, 23, 22,
- 22, 23, 22, 20, 28, 26, 22, 22, 22, 23, 22, 20, 24, 24, 22, 21, 20, 21,
- 20, 19, 21, 22, 21, 20, 19, 19, 19, 18, 21, 22, 22, 20, 19, 19, 18, 17,
- 21, 23, 22, 20, 19, 18, 17, 17, 21, 23, 22, 20, 19, 18, 17, 16, 20, 22,
- 22, 20, 18, 17, 16, 15, 20, 21, 22, 19, 18, 17, 16, 14, 19, 21, 21, 19,
- 18, 17, 15, 14, 19, 20, 21, 19, 18, 16, 15, 14, 18, 20, 20, 19, 17, 16,
- 15, 13,
- /* Size 16x8 */
32, 33, 34, 31, 29, 28, 24, 21, 21, 21, 21, 20, 20, 19, 19, 18, 33, 33,
32, 29, 28, 26, 24, 22, 22, 23, 23, 22, 21, 21, 20, 20, 28, 27, 26, 24,
23, 22, 22, 21, 22, 22, 22, 22, 22, 21, 21, 20, 23, 23, 23, 22, 22, 22,
@@ -9843,37 +9835,16 @@
18, 17, 17, 17, 16, 16, 20, 20, 21, 22, 22, 22, 20, 19, 18, 17, 17, 16,
16, 15, 15, 15, 18, 19, 20, 20, 20, 20, 19, 18, 17, 17, 16, 15, 14, 14,
14, 13,
+ /* Size 16x8 */
+ 32, 33, 28, 23, 21, 21, 20, 18, 33, 33, 27, 23, 22, 22, 20, 19, 34, 32,
+ 26, 23, 23, 23, 21, 20, 31, 29, 24, 22, 22, 23, 22, 20, 29, 28, 23, 22,
+ 22, 23, 22, 20, 28, 26, 22, 22, 22, 23, 22, 20, 24, 24, 22, 21, 20, 21,
+ 20, 19, 21, 22, 21, 20, 19, 19, 19, 18, 21, 22, 22, 20, 19, 19, 18, 17,
+ 21, 23, 22, 20, 19, 18, 17, 17, 21, 23, 22, 20, 19, 18, 17, 16, 20, 22,
+ 22, 20, 18, 17, 16, 15, 20, 21, 22, 19, 18, 17, 16, 14, 19, 21, 21, 19,
+ 18, 17, 15, 14, 19, 20, 21, 19, 18, 16, 15, 14, 18, 20, 20, 19, 17, 16,
+ 15, 13,
/* Size 16x32 */
- 32, 33, 33, 31, 28, 28, 23, 21, 21, 21, 21, 20, 20, 19, 18, 18, 33, 33,
- 33, 30, 27, 27, 23, 22, 22, 22, 22, 20, 20, 20, 19, 19, 33, 33, 33, 30,
- 27, 27, 23, 22, 22, 22, 22, 21, 20, 20, 19, 19, 33, 33, 32, 30, 26, 26,
- 23, 22, 22, 22, 22, 21, 21, 20, 19, 19, 34, 32, 32, 29, 26, 26, 23, 22,
- 23, 23, 23, 22, 21, 21, 20, 20, 34, 32, 32, 29, 26, 26, 23, 22, 23, 23,
- 23, 22, 21, 21, 20, 20, 31, 30, 29, 28, 24, 24, 22, 22, 22, 23, 23, 22,
- 22, 21, 20, 20, 31, 29, 28, 27, 24, 24, 22, 22, 22, 22, 22, 22, 22, 21,
- 20, 20, 29, 28, 28, 26, 23, 23, 22, 22, 22, 23, 23, 22, 22, 21, 20, 20,
- 28, 26, 26, 24, 22, 22, 22, 22, 22, 23, 23, 22, 22, 21, 20, 20, 28, 26,
- 26, 24, 22, 22, 22, 22, 22, 23, 23, 22, 22, 21, 20, 20, 25, 24, 24, 23,
- 22, 22, 21, 21, 21, 21, 21, 21, 20, 20, 20, 20, 24, 24, 24, 23, 22, 22,
- 21, 20, 20, 21, 21, 20, 20, 20, 19, 19, 23, 23, 23, 23, 22, 22, 20, 20,
- 20, 20, 20, 20, 20, 19, 19, 19, 21, 22, 22, 22, 21, 21, 20, 19, 19, 19,
- 19, 19, 19, 19, 18, 18, 21, 22, 22, 22, 21, 21, 20, 19, 19, 19, 19, 19,
- 19, 19, 18, 18, 21, 22, 22, 22, 22, 22, 20, 19, 19, 19, 19, 18, 18, 18,
- 17, 17, 21, 22, 22, 22, 22, 22, 20, 19, 19, 18, 18, 18, 18, 18, 17, 17,
- 21, 22, 23, 22, 22, 22, 20, 19, 19, 18, 18, 18, 17, 17, 17, 17, 21, 22,
- 23, 23, 22, 22, 20, 19, 19, 18, 18, 17, 17, 17, 16, 16, 21, 22, 23, 23,
- 22, 22, 20, 19, 19, 18, 18, 17, 17, 17, 16, 16, 20, 22, 22, 22, 22, 22,
- 20, 19, 18, 17, 17, 17, 16, 16, 16, 16, 20, 22, 22, 22, 22, 22, 20, 19,
- 18, 17, 17, 16, 16, 16, 15, 15, 20, 21, 22, 22, 22, 22, 20, 19, 18, 17,
- 17, 16, 16, 16, 15, 15, 20, 21, 21, 22, 22, 22, 19, 19, 18, 17, 17, 16,
- 16, 15, 14, 14, 20, 21, 21, 22, 22, 22, 19, 19, 18, 17, 17, 16, 16, 15,
- 14, 14, 19, 20, 21, 21, 21, 21, 19, 19, 18, 17, 17, 15, 15, 15, 14, 14,
- 19, 20, 20, 21, 21, 21, 19, 19, 18, 17, 17, 15, 15, 15, 14, 14, 19, 20,
- 20, 20, 21, 21, 19, 18, 18, 16, 16, 15, 15, 14, 14, 14, 18, 19, 20, 20,
- 20, 20, 19, 18, 17, 16, 16, 15, 15, 14, 13, 13, 18, 19, 20, 20, 20, 20,
- 19, 18, 17, 16, 16, 15, 15, 14, 13, 13, 17, 19, 19, 19, 20, 20, 18, 18,
- 17, 16, 16, 15, 14, 14, 13, 13,
- /* Size 32x16 */
32, 33, 33, 33, 34, 34, 31, 31, 29, 28, 28, 25, 24, 23, 21, 21, 21, 21,
21, 21, 21, 20, 20, 20, 20, 20, 19, 19, 19, 18, 18, 17, 33, 33, 33, 33,
32, 32, 30, 29, 28, 26, 26, 24, 24, 23, 22, 22, 22, 22, 22, 22, 22, 22,
@@ -9903,33 +9874,47 @@
16, 16, 15, 15, 14, 14, 14, 14, 14, 13, 13, 13, 18, 19, 19, 19, 20, 20,
20, 20, 20, 20, 20, 20, 19, 19, 18, 18, 17, 17, 17, 16, 16, 16, 15, 15,
14, 14, 14, 14, 14, 13, 13, 13,
+ /* Size 32x16 */
+ 32, 33, 33, 31, 28, 28, 23, 21, 21, 21, 21, 20, 20, 19, 18, 18, 33, 33,
+ 33, 30, 27, 27, 23, 22, 22, 22, 22, 20, 20, 20, 19, 19, 33, 33, 33, 30,
+ 27, 27, 23, 22, 22, 22, 22, 21, 20, 20, 19, 19, 33, 33, 32, 30, 26, 26,
+ 23, 22, 22, 22, 22, 21, 21, 20, 19, 19, 34, 32, 32, 29, 26, 26, 23, 22,
+ 23, 23, 23, 22, 21, 21, 20, 20, 34, 32, 32, 29, 26, 26, 23, 22, 23, 23,
+ 23, 22, 21, 21, 20, 20, 31, 30, 29, 28, 24, 24, 22, 22, 22, 23, 23, 22,
+ 22, 21, 20, 20, 31, 29, 28, 27, 24, 24, 22, 22, 22, 22, 22, 22, 22, 21,
+ 20, 20, 29, 28, 28, 26, 23, 23, 22, 22, 22, 23, 23, 22, 22, 21, 20, 20,
+ 28, 26, 26, 24, 22, 22, 22, 22, 22, 23, 23, 22, 22, 21, 20, 20, 28, 26,
+ 26, 24, 22, 22, 22, 22, 22, 23, 23, 22, 22, 21, 20, 20, 25, 24, 24, 23,
+ 22, 22, 21, 21, 21, 21, 21, 21, 20, 20, 20, 20, 24, 24, 24, 23, 22, 22,
+ 21, 20, 20, 21, 21, 20, 20, 20, 19, 19, 23, 23, 23, 23, 22, 22, 20, 20,
+ 20, 20, 20, 20, 20, 19, 19, 19, 21, 22, 22, 22, 21, 21, 20, 19, 19, 19,
+ 19, 19, 19, 19, 18, 18, 21, 22, 22, 22, 21, 21, 20, 19, 19, 19, 19, 19,
+ 19, 19, 18, 18, 21, 22, 22, 22, 22, 22, 20, 19, 19, 19, 19, 18, 18, 18,
+ 17, 17, 21, 22, 22, 22, 22, 22, 20, 19, 19, 18, 18, 18, 18, 18, 17, 17,
+ 21, 22, 23, 22, 22, 22, 20, 19, 19, 18, 18, 18, 17, 17, 17, 17, 21, 22,
+ 23, 23, 22, 22, 20, 19, 19, 18, 18, 17, 17, 17, 16, 16, 21, 22, 23, 23,
+ 22, 22, 20, 19, 19, 18, 18, 17, 17, 17, 16, 16, 20, 22, 22, 22, 22, 22,
+ 20, 19, 18, 17, 17, 17, 16, 16, 16, 16, 20, 22, 22, 22, 22, 22, 20, 19,
+ 18, 17, 17, 16, 16, 16, 15, 15, 20, 21, 22, 22, 22, 22, 20, 19, 18, 17,
+ 17, 16, 16, 16, 15, 15, 20, 21, 21, 22, 22, 22, 19, 19, 18, 17, 17, 16,
+ 16, 15, 14, 14, 20, 21, 21, 22, 22, 22, 19, 19, 18, 17, 17, 16, 16, 15,
+ 14, 14, 19, 20, 21, 21, 21, 21, 19, 19, 18, 17, 17, 15, 15, 15, 14, 14,
+ 19, 20, 20, 21, 21, 21, 19, 19, 18, 17, 17, 15, 15, 15, 14, 14, 19, 20,
+ 20, 20, 21, 21, 19, 18, 18, 16, 16, 15, 15, 14, 14, 14, 18, 19, 20, 20,
+ 20, 20, 19, 18, 17, 16, 16, 15, 15, 14, 13, 13, 18, 19, 20, 20, 20, 20,
+ 19, 18, 17, 16, 16, 15, 15, 14, 13, 13, 17, 19, 19, 19, 20, 20, 18, 18,
+ 17, 16, 16, 15, 14, 14, 13, 13,
/* Size 4x16 */
- 33, 28, 21, 19, 33, 27, 22, 20, 32, 26, 23, 21, 30, 24, 23, 21, 28, 23,
- 23, 21, 26, 22, 23, 21, 24, 22, 21, 20, 22, 21, 19, 19, 22, 22, 19, 18,
- 22, 22, 18, 17, 22, 22, 18, 17, 22, 22, 17, 16, 21, 22, 17, 15, 20, 21,
- 17, 15, 20, 21, 16, 14, 19, 20, 16, 14,
- /* Size 16x4 */
33, 33, 32, 30, 28, 26, 24, 22, 22, 22, 22, 22, 21, 20, 20, 19, 28, 27,
26, 24, 23, 22, 22, 21, 22, 22, 22, 22, 22, 21, 21, 20, 21, 22, 23, 23,
23, 23, 21, 19, 19, 18, 18, 17, 17, 17, 16, 16, 19, 20, 21, 21, 21, 21,
20, 19, 18, 17, 17, 16, 15, 15, 14, 14,
+ /* Size 16x4 */
+ 33, 28, 21, 19, 33, 27, 22, 20, 32, 26, 23, 21, 30, 24, 23, 21, 28, 23,
+ 23, 21, 26, 22, 23, 21, 24, 22, 21, 20, 22, 21, 19, 19, 22, 22, 19, 18,
+ 22, 22, 18, 17, 22, 22, 18, 17, 22, 22, 17, 16, 21, 22, 17, 15, 20, 21,
+ 17, 15, 20, 21, 16, 14, 19, 20, 16, 14,
/* Size 8x32 */
- 32, 33, 28, 23, 21, 21, 20, 18, 33, 33, 27, 23, 22, 22, 20, 19, 33, 33,
- 27, 23, 22, 22, 20, 19, 33, 32, 26, 23, 22, 22, 21, 19, 34, 32, 26, 23,
- 23, 23, 21, 20, 34, 32, 26, 23, 23, 23, 21, 20, 31, 29, 24, 22, 22, 23,
- 22, 20, 31, 28, 24, 22, 22, 22, 22, 20, 29, 28, 23, 22, 22, 23, 22, 20,
- 28, 26, 22, 22, 22, 23, 22, 20, 28, 26, 22, 22, 22, 23, 22, 20, 25, 24,
- 22, 21, 21, 21, 20, 20, 24, 24, 22, 21, 20, 21, 20, 19, 23, 23, 22, 20,
- 20, 20, 20, 19, 21, 22, 21, 20, 19, 19, 19, 18, 21, 22, 21, 20, 19, 19,
- 19, 18, 21, 22, 22, 20, 19, 19, 18, 17, 21, 22, 22, 20, 19, 18, 18, 17,
- 21, 23, 22, 20, 19, 18, 17, 17, 21, 23, 22, 20, 19, 18, 17, 16, 21, 23,
- 22, 20, 19, 18, 17, 16, 20, 22, 22, 20, 18, 17, 16, 16, 20, 22, 22, 20,
- 18, 17, 16, 15, 20, 22, 22, 20, 18, 17, 16, 15, 20, 21, 22, 19, 18, 17,
- 16, 14, 20, 21, 22, 19, 18, 17, 16, 14, 19, 21, 21, 19, 18, 17, 15, 14,
- 19, 20, 21, 19, 18, 17, 15, 14, 19, 20, 21, 19, 18, 16, 15, 14, 18, 20,
- 20, 19, 17, 16, 15, 13, 18, 20, 20, 19, 17, 16, 15, 13, 17, 19, 20, 18,
- 17, 16, 14, 13,
- /* Size 32x8 */
32, 33, 33, 33, 34, 34, 31, 31, 29, 28, 28, 25, 24, 23, 21, 21, 21, 21,
21, 21, 21, 20, 20, 20, 20, 20, 19, 19, 19, 18, 18, 17, 33, 33, 33, 32,
32, 32, 29, 28, 28, 26, 26, 24, 24, 23, 22, 22, 22, 22, 23, 23, 23, 22,
@@ -9944,7 +9929,23 @@
22, 22, 22, 22, 22, 20, 20, 20, 19, 19, 18, 18, 17, 17, 17, 16, 16, 16,
16, 16, 15, 15, 15, 15, 15, 14, 18, 19, 19, 19, 20, 20, 20, 20, 20, 20,
20, 20, 19, 19, 18, 18, 17, 17, 17, 16, 16, 16, 15, 15, 14, 14, 14, 14,
- 14, 13, 13, 13 },
+ 14, 13, 13, 13,
+ /* Size 32x8 */
+ 32, 33, 28, 23, 21, 21, 20, 18, 33, 33, 27, 23, 22, 22, 20, 19, 33, 33,
+ 27, 23, 22, 22, 20, 19, 33, 32, 26, 23, 22, 22, 21, 19, 34, 32, 26, 23,
+ 23, 23, 21, 20, 34, 32, 26, 23, 23, 23, 21, 20, 31, 29, 24, 22, 22, 23,
+ 22, 20, 31, 28, 24, 22, 22, 22, 22, 20, 29, 28, 23, 22, 22, 23, 22, 20,
+ 28, 26, 22, 22, 22, 23, 22, 20, 28, 26, 22, 22, 22, 23, 22, 20, 25, 24,
+ 22, 21, 21, 21, 20, 20, 24, 24, 22, 21, 20, 21, 20, 19, 23, 23, 22, 20,
+ 20, 20, 20, 19, 21, 22, 21, 20, 19, 19, 19, 18, 21, 22, 21, 20, 19, 19,
+ 19, 18, 21, 22, 22, 20, 19, 19, 18, 17, 21, 22, 22, 20, 19, 18, 18, 17,
+ 21, 23, 22, 20, 19, 18, 17, 17, 21, 23, 22, 20, 19, 18, 17, 16, 21, 23,
+ 22, 20, 19, 18, 17, 16, 20, 22, 22, 20, 18, 17, 16, 16, 20, 22, 22, 20,
+ 18, 17, 16, 15, 20, 22, 22, 20, 18, 17, 16, 15, 20, 21, 22, 19, 18, 17,
+ 16, 14, 20, 21, 22, 19, 18, 17, 16, 14, 19, 21, 21, 19, 18, 17, 15, 14,
+ 19, 20, 21, 19, 18, 17, 15, 14, 19, 20, 21, 19, 18, 16, 15, 14, 18, 20,
+ 20, 19, 17, 16, 15, 13, 18, 20, 20, 19, 17, 16, 15, 13, 17, 19, 20, 18,
+ 17, 16, 14, 13 },
},
{
{ /* Luma */
@@ -10030,21 +10031,12 @@
11, 11, 17, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19, 19, 18, 18, 17,
16, 16, 15, 15, 15, 14, 14, 13, 13, 13, 13, 12, 12, 12, 11, 11,
/* Size 4x8 */
- 32, 32, 28, 20, 32, 31, 28, 21, 32, 30, 27, 21, 30, 28, 23, 19, 29, 27,
- 21, 17, 26, 24, 19, 15, 22, 22, 17, 13, 20, 20, 16, 12,
- /* Size 8x4 */
32, 32, 32, 30, 29, 26, 22, 20, 32, 31, 30, 28, 27, 24, 22, 20, 28, 28,
27, 23, 21, 19, 17, 16, 20, 21, 21, 19, 17, 15, 13, 12,
+ /* Size 8x4 */
+ 32, 32, 28, 20, 32, 31, 28, 21, 32, 30, 27, 21, 30, 28, 23, 19, 29, 27,
+ 21, 17, 26, 24, 19, 15, 22, 22, 17, 13, 20, 20, 16, 12,
/* Size 8x16 */
- 32, 33, 32, 32, 28, 23, 22, 19, 33, 32, 32, 31, 29, 24, 23, 20, 33, 32,
- 32, 31, 29, 25, 23, 21, 33, 32, 31, 31, 29, 25, 23, 21, 32, 32, 30, 30,
- 28, 24, 23, 20, 32, 31, 29, 28, 27, 24, 23, 21, 32, 31, 29, 28, 26, 23,
- 22, 20, 30, 30, 28, 27, 24, 21, 20, 19, 28, 30, 28, 26, 21, 19, 18, 17,
- 27, 28, 26, 25, 21, 18, 18, 16, 26, 28, 26, 24, 20, 18, 17, 16, 23, 25,
- 24, 23, 19, 16, 16, 14, 22, 23, 23, 22, 18, 16, 15, 14, 21, 22, 22, 21,
- 18, 15, 14, 13, 19, 21, 20, 20, 17, 14, 14, 12, 18, 19, 19, 19, 16, 14,
- 13, 12,
- /* Size 16x8 */
32, 33, 33, 33, 32, 32, 32, 30, 28, 27, 26, 23, 22, 21, 19, 18, 33, 32,
32, 32, 32, 31, 31, 30, 30, 28, 28, 25, 23, 22, 21, 19, 32, 32, 32, 31,
30, 29, 29, 28, 28, 26, 26, 24, 23, 22, 20, 19, 32, 31, 31, 31, 30, 28,
@@ -10053,37 +10045,16 @@
18, 16, 16, 15, 14, 14, 22, 23, 23, 23, 23, 23, 22, 20, 18, 18, 17, 16,
15, 14, 14, 13, 19, 20, 21, 21, 20, 21, 20, 19, 17, 16, 16, 14, 14, 13,
12, 12,
+ /* Size 16x8 */
+ 32, 33, 32, 32, 28, 23, 22, 19, 33, 32, 32, 31, 29, 24, 23, 20, 33, 32,
+ 32, 31, 29, 25, 23, 21, 33, 32, 31, 31, 29, 25, 23, 21, 32, 32, 30, 30,
+ 28, 24, 23, 20, 32, 31, 29, 28, 27, 24, 23, 21, 32, 31, 29, 28, 26, 23,
+ 22, 20, 30, 30, 28, 27, 24, 21, 20, 19, 28, 30, 28, 26, 21, 19, 18, 17,
+ 27, 28, 26, 25, 21, 18, 18, 16, 26, 28, 26, 24, 20, 18, 17, 16, 23, 25,
+ 24, 23, 19, 16, 16, 14, 22, 23, 23, 22, 18, 16, 15, 14, 21, 22, 22, 21,
+ 18, 15, 14, 13, 19, 21, 20, 20, 17, 14, 14, 12, 18, 19, 19, 19, 16, 14,
+ 13, 12,
/* Size 16x32 */
- 32, 33, 33, 33, 32, 32, 32, 29, 28, 27, 23, 23, 22, 19, 19, 17, 33, 32,
- 32, 32, 32, 32, 31, 29, 29, 28, 24, 24, 22, 20, 20, 18, 33, 32, 32, 32,
- 32, 32, 31, 29, 29, 28, 24, 24, 23, 20, 20, 18, 33, 32, 32, 32, 32, 32,
- 31, 29, 29, 28, 24, 24, 23, 20, 20, 18, 33, 32, 32, 32, 32, 32, 31, 30,
- 29, 28, 25, 25, 23, 21, 21, 19, 33, 32, 32, 32, 32, 31, 31, 30, 30, 28,
- 25, 25, 23, 21, 21, 19, 33, 32, 32, 32, 31, 31, 31, 29, 29, 28, 25, 25,
- 23, 21, 21, 19, 32, 32, 32, 32, 31, 30, 30, 28, 28, 27, 24, 24, 23, 21,
- 21, 19, 32, 32, 32, 31, 30, 30, 30, 28, 28, 27, 24, 24, 23, 20, 20, 19,
- 32, 32, 32, 31, 30, 30, 29, 28, 28, 27, 24, 24, 23, 21, 21, 19, 32, 32,
- 31, 31, 29, 29, 28, 27, 27, 26, 24, 24, 23, 21, 21, 19, 32, 32, 31, 31,
- 29, 29, 28, 27, 27, 26, 24, 24, 23, 21, 21, 19, 32, 31, 31, 31, 29, 28,
- 28, 26, 26, 25, 23, 23, 22, 20, 20, 19, 30, 30, 30, 30, 28, 28, 27, 24,
- 24, 23, 21, 21, 20, 19, 19, 18, 30, 30, 30, 30, 28, 28, 27, 24, 24, 23,
- 21, 21, 20, 19, 19, 18, 29, 30, 30, 30, 28, 28, 26, 23, 23, 22, 20, 20,
- 19, 18, 18, 17, 28, 29, 30, 29, 28, 27, 26, 22, 21, 21, 19, 19, 18, 17,
- 17, 16, 28, 29, 30, 29, 28, 27, 26, 22, 21, 21, 19, 19, 18, 17, 17, 16,
- 27, 28, 28, 28, 26, 26, 25, 21, 21, 20, 18, 18, 18, 16, 16, 15, 26, 27,
- 28, 27, 26, 26, 24, 21, 20, 20, 18, 18, 17, 16, 16, 15, 26, 27, 28, 27,
- 26, 26, 24, 21, 20, 20, 18, 18, 17, 16, 16, 15, 24, 26, 26, 26, 24, 24,
- 23, 20, 20, 19, 17, 17, 16, 15, 15, 14, 23, 24, 25, 25, 24, 24, 23, 20,
- 19, 18, 16, 16, 16, 14, 14, 14, 23, 24, 25, 25, 24, 24, 23, 20, 19, 18,
- 16, 16, 16, 14, 14, 13, 22, 23, 23, 23, 23, 23, 22, 19, 18, 18, 16, 16,
- 15, 14, 14, 13, 21, 22, 23, 23, 22, 22, 21, 19, 18, 17, 15, 15, 15, 13,
- 13, 13, 21, 22, 22, 22, 22, 22, 21, 18, 18, 17, 15, 15, 14, 13, 13, 13,
- 19, 20, 21, 21, 21, 21, 20, 18, 17, 17, 14, 14, 14, 13, 13, 12, 19, 20,
- 21, 21, 20, 20, 20, 17, 17, 16, 14, 14, 14, 12, 12, 12, 19, 20, 20, 20,
- 20, 20, 19, 17, 17, 16, 14, 14, 13, 12, 12, 12, 18, 19, 19, 19, 19, 19,
- 19, 17, 16, 15, 14, 14, 13, 12, 12, 11, 18, 19, 19, 19, 19, 19, 19, 17,
- 16, 15, 14, 14, 13, 12, 12, 11,
- /* Size 32x16 */
32, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 30, 30, 29, 28, 28,
27, 26, 26, 24, 23, 23, 22, 21, 21, 19, 19, 19, 18, 18, 33, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 31, 30, 30, 30, 29, 29, 28, 27, 27, 26,
@@ -10113,33 +10084,47 @@
16, 15, 14, 14, 14, 13, 13, 13, 12, 12, 12, 12, 17, 18, 18, 18, 19, 19,
19, 19, 19, 19, 19, 19, 19, 18, 18, 17, 16, 16, 15, 15, 15, 14, 14, 13,
13, 13, 13, 12, 12, 12, 11, 11,
+ /* Size 32x16 */
+ 32, 33, 33, 33, 32, 32, 32, 29, 28, 27, 23, 23, 22, 19, 19, 17, 33, 32,
+ 32, 32, 32, 32, 31, 29, 29, 28, 24, 24, 22, 20, 20, 18, 33, 32, 32, 32,
+ 32, 32, 31, 29, 29, 28, 24, 24, 23, 20, 20, 18, 33, 32, 32, 32, 32, 32,
+ 31, 29, 29, 28, 24, 24, 23, 20, 20, 18, 33, 32, 32, 32, 32, 32, 31, 30,
+ 29, 28, 25, 25, 23, 21, 21, 19, 33, 32, 32, 32, 32, 31, 31, 30, 30, 28,
+ 25, 25, 23, 21, 21, 19, 33, 32, 32, 32, 31, 31, 31, 29, 29, 28, 25, 25,
+ 23, 21, 21, 19, 32, 32, 32, 32, 31, 30, 30, 28, 28, 27, 24, 24, 23, 21,
+ 21, 19, 32, 32, 32, 31, 30, 30, 30, 28, 28, 27, 24, 24, 23, 20, 20, 19,
+ 32, 32, 32, 31, 30, 30, 29, 28, 28, 27, 24, 24, 23, 21, 21, 19, 32, 32,
+ 31, 31, 29, 29, 28, 27, 27, 26, 24, 24, 23, 21, 21, 19, 32, 32, 31, 31,
+ 29, 29, 28, 27, 27, 26, 24, 24, 23, 21, 21, 19, 32, 31, 31, 31, 29, 28,
+ 28, 26, 26, 25, 23, 23, 22, 20, 20, 19, 30, 30, 30, 30, 28, 28, 27, 24,
+ 24, 23, 21, 21, 20, 19, 19, 18, 30, 30, 30, 30, 28, 28, 27, 24, 24, 23,
+ 21, 21, 20, 19, 19, 18, 29, 30, 30, 30, 28, 28, 26, 23, 23, 22, 20, 20,
+ 19, 18, 18, 17, 28, 29, 30, 29, 28, 27, 26, 22, 21, 21, 19, 19, 18, 17,
+ 17, 16, 28, 29, 30, 29, 28, 27, 26, 22, 21, 21, 19, 19, 18, 17, 17, 16,
+ 27, 28, 28, 28, 26, 26, 25, 21, 21, 20, 18, 18, 18, 16, 16, 15, 26, 27,
+ 28, 27, 26, 26, 24, 21, 20, 20, 18, 18, 17, 16, 16, 15, 26, 27, 28, 27,
+ 26, 26, 24, 21, 20, 20, 18, 18, 17, 16, 16, 15, 24, 26, 26, 26, 24, 24,
+ 23, 20, 20, 19, 17, 17, 16, 15, 15, 14, 23, 24, 25, 25, 24, 24, 23, 20,
+ 19, 18, 16, 16, 16, 14, 14, 14, 23, 24, 25, 25, 24, 24, 23, 20, 19, 18,
+ 16, 16, 16, 14, 14, 13, 22, 23, 23, 23, 23, 23, 22, 19, 18, 18, 16, 16,
+ 15, 14, 14, 13, 21, 22, 23, 23, 22, 22, 21, 19, 18, 17, 15, 15, 15, 13,
+ 13, 13, 21, 22, 22, 22, 22, 22, 21, 18, 18, 17, 15, 15, 14, 13, 13, 13,
+ 19, 20, 21, 21, 21, 21, 20, 18, 17, 17, 14, 14, 14, 13, 13, 12, 19, 20,
+ 21, 21, 20, 20, 20, 17, 17, 16, 14, 14, 14, 12, 12, 12, 19, 20, 20, 20,
+ 20, 20, 19, 17, 17, 16, 14, 14, 13, 12, 12, 12, 18, 19, 19, 19, 19, 19,
+ 19, 17, 16, 15, 14, 14, 13, 12, 12, 11, 18, 19, 19, 19, 19, 19, 19, 17,
+ 16, 15, 14, 14, 13, 12, 12, 11,
/* Size 4x16 */
- 33, 32, 27, 19, 32, 32, 28, 20, 32, 32, 28, 21, 32, 31, 28, 21, 32, 30,
- 27, 20, 32, 29, 26, 21, 31, 28, 25, 20, 30, 28, 23, 19, 29, 27, 21, 17,
- 28, 26, 20, 16, 27, 26, 20, 16, 24, 24, 18, 14, 23, 23, 18, 14, 22, 22,
- 17, 13, 20, 20, 16, 12, 19, 19, 15, 12,
- /* Size 16x4 */
33, 32, 32, 32, 32, 32, 31, 30, 29, 28, 27, 24, 23, 22, 20, 19, 32, 32,
32, 31, 30, 29, 28, 28, 27, 26, 26, 24, 23, 22, 20, 19, 27, 28, 28, 28,
27, 26, 25, 23, 21, 20, 20, 18, 18, 17, 16, 15, 19, 20, 21, 21, 20, 21,
20, 19, 17, 16, 16, 14, 14, 13, 12, 12,
+ /* Size 16x4 */
+ 33, 32, 27, 19, 32, 32, 28, 20, 32, 32, 28, 21, 32, 31, 28, 21, 32, 30,
+ 27, 20, 32, 29, 26, 21, 31, 28, 25, 20, 30, 28, 23, 19, 29, 27, 21, 17,
+ 28, 26, 20, 16, 27, 26, 20, 16, 24, 24, 18, 14, 23, 23, 18, 14, 22, 22,
+ 17, 13, 20, 20, 16, 12, 19, 19, 15, 12,
/* Size 8x32 */
- 32, 33, 32, 32, 28, 23, 22, 19, 33, 32, 32, 31, 29, 24, 22, 20, 33, 32,
- 32, 31, 29, 24, 23, 20, 33, 32, 32, 31, 29, 24, 23, 20, 33, 32, 32, 31,
- 29, 25, 23, 21, 33, 32, 32, 31, 30, 25, 23, 21, 33, 32, 31, 31, 29, 25,
- 23, 21, 32, 32, 31, 30, 28, 24, 23, 21, 32, 32, 30, 30, 28, 24, 23, 20,
- 32, 32, 30, 29, 28, 24, 23, 21, 32, 31, 29, 28, 27, 24, 23, 21, 32, 31,
- 29, 28, 27, 24, 23, 21, 32, 31, 29, 28, 26, 23, 22, 20, 30, 30, 28, 27,
- 24, 21, 20, 19, 30, 30, 28, 27, 24, 21, 20, 19, 29, 30, 28, 26, 23, 20,
- 19, 18, 28, 30, 28, 26, 21, 19, 18, 17, 28, 30, 28, 26, 21, 19, 18, 17,
- 27, 28, 26, 25, 21, 18, 18, 16, 26, 28, 26, 24, 20, 18, 17, 16, 26, 28,
- 26, 24, 20, 18, 17, 16, 24, 26, 24, 23, 20, 17, 16, 15, 23, 25, 24, 23,
- 19, 16, 16, 14, 23, 25, 24, 23, 19, 16, 16, 14, 22, 23, 23, 22, 18, 16,
- 15, 14, 21, 23, 22, 21, 18, 15, 15, 13, 21, 22, 22, 21, 18, 15, 14, 13,
- 19, 21, 21, 20, 17, 14, 14, 13, 19, 21, 20, 20, 17, 14, 14, 12, 19, 20,
- 20, 19, 17, 14, 13, 12, 18, 19, 19, 19, 16, 14, 13, 12, 18, 19, 19, 19,
- 16, 14, 13, 12,
- /* Size 32x8 */
32, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 30, 30, 29, 28, 28,
27, 26, 26, 24, 23, 23, 22, 21, 21, 19, 19, 19, 18, 18, 33, 32, 32, 32,
32, 32, 32, 32, 32, 32, 31, 31, 31, 30, 30, 30, 30, 30, 28, 28, 28, 26,
@@ -10154,7 +10139,23 @@
23, 23, 23, 23, 23, 23, 22, 20, 20, 19, 18, 18, 18, 17, 17, 16, 16, 16,
15, 15, 14, 14, 14, 13, 13, 13, 19, 20, 20, 20, 21, 21, 21, 21, 20, 21,
21, 21, 20, 19, 19, 18, 17, 17, 16, 16, 16, 15, 14, 14, 14, 13, 13, 13,
- 12, 12, 12, 12 },
+ 12, 12, 12, 12,
+ /* Size 32x8 */
+ 32, 33, 32, 32, 28, 23, 22, 19, 33, 32, 32, 31, 29, 24, 22, 20, 33, 32,
+ 32, 31, 29, 24, 23, 20, 33, 32, 32, 31, 29, 24, 23, 20, 33, 32, 32, 31,
+ 29, 25, 23, 21, 33, 32, 32, 31, 30, 25, 23, 21, 33, 32, 31, 31, 29, 25,
+ 23, 21, 32, 32, 31, 30, 28, 24, 23, 21, 32, 32, 30, 30, 28, 24, 23, 20,
+ 32, 32, 30, 29, 28, 24, 23, 21, 32, 31, 29, 28, 27, 24, 23, 21, 32, 31,
+ 29, 28, 27, 24, 23, 21, 32, 31, 29, 28, 26, 23, 22, 20, 30, 30, 28, 27,
+ 24, 21, 20, 19, 30, 30, 28, 27, 24, 21, 20, 19, 29, 30, 28, 26, 23, 20,
+ 19, 18, 28, 30, 28, 26, 21, 19, 18, 17, 28, 30, 28, 26, 21, 19, 18, 17,
+ 27, 28, 26, 25, 21, 18, 18, 16, 26, 28, 26, 24, 20, 18, 17, 16, 26, 28,
+ 26, 24, 20, 18, 17, 16, 24, 26, 24, 23, 20, 17, 16, 15, 23, 25, 24, 23,
+ 19, 16, 16, 14, 23, 25, 24, 23, 19, 16, 16, 14, 22, 23, 23, 22, 18, 16,
+ 15, 14, 21, 23, 22, 21, 18, 15, 15, 13, 21, 22, 22, 21, 18, 15, 14, 13,
+ 19, 21, 21, 20, 17, 14, 14, 13, 19, 21, 20, 20, 17, 14, 14, 12, 19, 20,
+ 20, 19, 17, 14, 13, 12, 18, 19, 19, 19, 16, 14, 13, 12, 18, 19, 19, 19,
+ 16, 14, 13, 12 },
{ /* Chroma */
/* Size 4x4 */
33, 27, 22, 21, 27, 22, 22, 22, 22, 22, 19, 18, 21, 22, 18, 16,
@@ -10238,21 +10239,12 @@
14, 14, 19, 19, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21, 20, 20, 19,
19, 19, 18, 18, 18, 17, 17, 16, 16, 16, 16, 15, 15, 15, 14, 14,
/* Size 4x8 */
- 33, 27, 22, 20, 33, 26, 22, 21, 28, 23, 22, 22, 24, 22, 20, 20, 22, 21,
- 19, 19, 22, 22, 19, 17, 21, 22, 19, 16, 20, 21, 18, 15,
- /* Size 8x4 */
33, 33, 28, 24, 22, 22, 21, 20, 27, 26, 23, 22, 21, 22, 22, 21, 22, 22,
22, 20, 19, 19, 19, 18, 20, 21, 22, 20, 19, 17, 16, 15,
+ /* Size 8x4 */
+ 33, 27, 22, 20, 33, 26, 22, 21, 28, 23, 22, 22, 24, 22, 20, 20, 22, 21,
+ 19, 19, 22, 22, 19, 17, 21, 22, 19, 16, 20, 21, 18, 15,
/* Size 8x16 */
- 32, 33, 29, 27, 21, 21, 20, 20, 33, 33, 28, 26, 22, 22, 21, 20, 34, 32,
- 27, 26, 22, 23, 22, 21, 33, 31, 27, 25, 22, 23, 22, 21, 31, 28, 25, 23,
- 22, 22, 22, 22, 28, 26, 23, 22, 22, 23, 22, 22, 26, 25, 22, 22, 21, 22,
- 22, 21, 24, 24, 22, 21, 20, 21, 20, 20, 21, 22, 21, 21, 19, 19, 19, 19,
- 21, 22, 22, 21, 19, 19, 19, 18, 21, 22, 22, 21, 19, 18, 18, 18, 21, 23,
- 23, 22, 19, 18, 17, 17, 20, 22, 22, 21, 19, 17, 17, 16, 20, 22, 22, 21,
- 19, 17, 17, 16, 20, 21, 22, 21, 19, 17, 16, 16, 19, 20, 21, 20, 19, 17,
- 16, 15,
- /* Size 16x8 */
32, 33, 34, 33, 31, 28, 26, 24, 21, 21, 21, 21, 20, 20, 20, 19, 33, 33,
32, 31, 28, 26, 25, 24, 22, 22, 22, 23, 22, 22, 21, 20, 29, 28, 27, 27,
25, 23, 22, 22, 21, 22, 22, 23, 22, 22, 22, 21, 27, 26, 26, 25, 23, 22,
@@ -10261,37 +10253,16 @@
18, 18, 17, 17, 17, 17, 20, 21, 22, 22, 22, 22, 22, 20, 19, 19, 18, 17,
17, 17, 16, 16, 20, 20, 21, 21, 22, 22, 21, 20, 19, 18, 18, 17, 16, 16,
16, 15,
+ /* Size 16x8 */
+ 32, 33, 29, 27, 21, 21, 20, 20, 33, 33, 28, 26, 22, 22, 21, 20, 34, 32,
+ 27, 26, 22, 23, 22, 21, 33, 31, 27, 25, 22, 23, 22, 21, 31, 28, 25, 23,
+ 22, 22, 22, 22, 28, 26, 23, 22, 22, 23, 22, 22, 26, 25, 22, 22, 21, 22,
+ 22, 21, 24, 24, 22, 21, 20, 21, 20, 20, 21, 22, 21, 21, 19, 19, 19, 19,
+ 21, 22, 22, 21, 19, 19, 19, 18, 21, 22, 22, 21, 19, 18, 18, 18, 21, 23,
+ 23, 22, 19, 18, 17, 17, 20, 22, 22, 21, 19, 17, 17, 16, 20, 22, 22, 21,
+ 19, 17, 17, 16, 20, 21, 22, 21, 19, 17, 16, 16, 19, 20, 21, 20, 19, 17,
+ 16, 15,
/* Size 16x32 */
- 32, 33, 33, 33, 29, 28, 27, 22, 21, 21, 21, 21, 20, 20, 20, 19, 33, 33,
- 33, 32, 28, 27, 26, 22, 22, 22, 21, 21, 21, 20, 20, 19, 33, 33, 33, 32,
- 28, 27, 26, 22, 22, 22, 22, 22, 21, 20, 20, 20, 33, 33, 33, 32, 28, 27,
- 26, 22, 22, 22, 22, 22, 21, 20, 20, 20, 34, 33, 32, 32, 27, 26, 26, 23,
- 22, 22, 23, 23, 22, 21, 21, 20, 34, 33, 32, 31, 27, 26, 25, 23, 22, 22,
- 23, 23, 22, 21, 21, 20, 33, 32, 31, 31, 27, 26, 25, 23, 22, 22, 23, 23,
- 22, 21, 21, 20, 31, 29, 29, 28, 25, 24, 24, 22, 22, 22, 23, 23, 22, 22,
- 22, 21, 31, 29, 28, 28, 25, 24, 23, 22, 22, 22, 22, 22, 22, 22, 22, 21,
- 30, 28, 28, 28, 24, 23, 23, 22, 22, 22, 23, 23, 22, 22, 22, 21, 28, 26,
- 26, 25, 23, 22, 22, 22, 22, 22, 23, 23, 22, 22, 22, 21, 28, 26, 26, 25,
- 23, 22, 22, 22, 22, 22, 23, 23, 22, 22, 22, 21, 26, 26, 25, 24, 22, 22,
- 22, 21, 21, 21, 22, 22, 22, 21, 21, 20, 24, 24, 24, 24, 22, 22, 21, 20,
- 20, 20, 21, 21, 20, 20, 20, 20, 24, 24, 24, 24, 22, 22, 21, 20, 20, 20,
- 21, 21, 20, 20, 20, 20, 23, 23, 23, 23, 22, 22, 21, 20, 20, 20, 20, 20,
- 20, 20, 20, 19, 21, 22, 22, 22, 21, 21, 21, 20, 19, 19, 19, 19, 19, 19,
- 19, 19, 21, 22, 22, 22, 21, 21, 21, 20, 19, 19, 19, 19, 19, 19, 19, 19,
- 21, 22, 22, 22, 22, 22, 21, 20, 19, 19, 19, 19, 19, 18, 18, 18, 21, 22,
- 22, 22, 22, 22, 21, 20, 19, 19, 18, 18, 18, 18, 18, 17, 21, 22, 22, 22,
- 22, 22, 21, 20, 19, 19, 18, 18, 18, 18, 18, 17, 21, 22, 23, 23, 22, 22,
- 22, 20, 19, 19, 18, 18, 18, 17, 17, 17, 21, 22, 23, 23, 23, 22, 22, 20,
- 19, 19, 18, 18, 17, 17, 17, 17, 21, 22, 23, 23, 22, 22, 22, 20, 19, 19,
- 18, 18, 17, 17, 17, 16, 20, 22, 22, 22, 22, 22, 21, 19, 19, 19, 17, 17,
- 17, 16, 16, 16, 20, 21, 22, 22, 22, 22, 21, 19, 19, 19, 17, 17, 17, 16,
- 16, 16, 20, 21, 22, 22, 22, 22, 21, 19, 19, 19, 17, 17, 17, 16, 16, 16,
- 20, 21, 21, 21, 22, 22, 21, 19, 19, 18, 17, 17, 16, 16, 16, 15, 20, 21,
- 21, 21, 22, 22, 21, 19, 19, 18, 17, 17, 16, 16, 16, 15, 19, 20, 21, 21,
- 21, 21, 21, 19, 19, 18, 17, 17, 16, 15, 15, 15, 19, 20, 20, 20, 21, 21,
- 20, 19, 19, 18, 17, 17, 16, 15, 15, 14, 19, 20, 20, 20, 21, 21, 20, 19,
- 19, 18, 17, 17, 16, 15, 15, 14,
- /* Size 32x16 */
32, 33, 33, 33, 34, 34, 33, 31, 31, 30, 28, 28, 26, 24, 24, 23, 21, 21,
21, 21, 21, 21, 21, 21, 20, 20, 20, 20, 20, 19, 19, 19, 33, 33, 33, 33,
33, 33, 32, 29, 29, 28, 26, 26, 26, 24, 24, 23, 22, 22, 22, 22, 22, 22,
@@ -10321,33 +10292,47 @@
18, 17, 17, 17, 16, 16, 16, 16, 16, 15, 15, 15, 19, 19, 20, 20, 20, 20,
20, 21, 21, 21, 21, 21, 20, 20, 20, 19, 19, 19, 18, 17, 17, 17, 17, 16,
16, 16, 16, 15, 15, 15, 14, 14,
+ /* Size 32x16 */
+ 32, 33, 33, 33, 29, 28, 27, 22, 21, 21, 21, 21, 20, 20, 20, 19, 33, 33,
+ 33, 32, 28, 27, 26, 22, 22, 22, 21, 21, 21, 20, 20, 19, 33, 33, 33, 32,
+ 28, 27, 26, 22, 22, 22, 22, 22, 21, 20, 20, 20, 33, 33, 33, 32, 28, 27,
+ 26, 22, 22, 22, 22, 22, 21, 20, 20, 20, 34, 33, 32, 32, 27, 26, 26, 23,
+ 22, 22, 23, 23, 22, 21, 21, 20, 34, 33, 32, 31, 27, 26, 25, 23, 22, 22,
+ 23, 23, 22, 21, 21, 20, 33, 32, 31, 31, 27, 26, 25, 23, 22, 22, 23, 23,
+ 22, 21, 21, 20, 31, 29, 29, 28, 25, 24, 24, 22, 22, 22, 23, 23, 22, 22,
+ 22, 21, 31, 29, 28, 28, 25, 24, 23, 22, 22, 22, 22, 22, 22, 22, 22, 21,
+ 30, 28, 28, 28, 24, 23, 23, 22, 22, 22, 23, 23, 22, 22, 22, 21, 28, 26,
+ 26, 25, 23, 22, 22, 22, 22, 22, 23, 23, 22, 22, 22, 21, 28, 26, 26, 25,
+ 23, 22, 22, 22, 22, 22, 23, 23, 22, 22, 22, 21, 26, 26, 25, 24, 22, 22,
+ 22, 21, 21, 21, 22, 22, 22, 21, 21, 20, 24, 24, 24, 24, 22, 22, 21, 20,
+ 20, 20, 21, 21, 20, 20, 20, 20, 24, 24, 24, 24, 22, 22, 21, 20, 20, 20,
+ 21, 21, 20, 20, 20, 20, 23, 23, 23, 23, 22, 22, 21, 20, 20, 20, 20, 20,
+ 20, 20, 20, 19, 21, 22, 22, 22, 21, 21, 21, 20, 19, 19, 19, 19, 19, 19,
+ 19, 19, 21, 22, 22, 22, 21, 21, 21, 20, 19, 19, 19, 19, 19, 19, 19, 19,
+ 21, 22, 22, 22, 22, 22, 21, 20, 19, 19, 19, 19, 19, 18, 18, 18, 21, 22,
+ 22, 22, 22, 22, 21, 20, 19, 19, 18, 18, 18, 18, 18, 17, 21, 22, 22, 22,
+ 22, 22, 21, 20, 19, 19, 18, 18, 18, 18, 18, 17, 21, 22, 23, 23, 22, 22,
+ 22, 20, 19, 19, 18, 18, 18, 17, 17, 17, 21, 22, 23, 23, 23, 22, 22, 20,
+ 19, 19, 18, 18, 17, 17, 17, 17, 21, 22, 23, 23, 22, 22, 22, 20, 19, 19,
+ 18, 18, 17, 17, 17, 16, 20, 22, 22, 22, 22, 22, 21, 19, 19, 19, 17, 17,
+ 17, 16, 16, 16, 20, 21, 22, 22, 22, 22, 21, 19, 19, 19, 17, 17, 17, 16,
+ 16, 16, 20, 21, 22, 22, 22, 22, 21, 19, 19, 19, 17, 17, 17, 16, 16, 16,
+ 20, 21, 21, 21, 22, 22, 21, 19, 19, 18, 17, 17, 16, 16, 16, 15, 20, 21,
+ 21, 21, 22, 22, 21, 19, 19, 18, 17, 17, 16, 16, 16, 15, 19, 20, 21, 21,
+ 21, 21, 21, 19, 19, 18, 17, 17, 16, 15, 15, 15, 19, 20, 20, 20, 21, 21,
+ 20, 19, 19, 18, 17, 17, 16, 15, 15, 14, 19, 20, 20, 20, 21, 21, 20, 19,
+ 19, 18, 17, 17, 16, 15, 15, 14,
/* Size 4x16 */
- 33, 28, 21, 20, 33, 27, 22, 20, 33, 26, 22, 21, 32, 26, 22, 21, 29, 24,
- 22, 22, 26, 22, 22, 22, 26, 22, 21, 21, 24, 22, 20, 20, 22, 21, 19, 19,
- 22, 22, 19, 18, 22, 22, 19, 18, 22, 22, 19, 17, 22, 22, 19, 16, 21, 22,
- 19, 16, 21, 22, 18, 16, 20, 21, 18, 15,
- /* Size 16x4 */
33, 33, 33, 32, 29, 26, 26, 24, 22, 22, 22, 22, 22, 21, 21, 20, 28, 27,
26, 26, 24, 22, 22, 22, 21, 22, 22, 22, 22, 22, 22, 21, 21, 22, 22, 22,
22, 22, 21, 20, 19, 19, 19, 19, 19, 19, 18, 18, 20, 20, 21, 21, 22, 22,
21, 20, 19, 18, 18, 17, 16, 16, 16, 15,
+ /* Size 16x4 */
+ 33, 28, 21, 20, 33, 27, 22, 20, 33, 26, 22, 21, 32, 26, 22, 21, 29, 24,
+ 22, 22, 26, 22, 22, 22, 26, 22, 21, 21, 24, 22, 20, 20, 22, 21, 19, 19,
+ 22, 22, 19, 18, 22, 22, 19, 18, 22, 22, 19, 17, 22, 22, 19, 16, 21, 22,
+ 19, 16, 21, 22, 18, 16, 20, 21, 18, 15,
/* Size 8x32 */
- 32, 33, 29, 27, 21, 21, 20, 20, 33, 33, 28, 26, 22, 21, 21, 20, 33, 33,
- 28, 26, 22, 22, 21, 20, 33, 33, 28, 26, 22, 22, 21, 20, 34, 32, 27, 26,
- 22, 23, 22, 21, 34, 32, 27, 25, 22, 23, 22, 21, 33, 31, 27, 25, 22, 23,
- 22, 21, 31, 29, 25, 24, 22, 23, 22, 22, 31, 28, 25, 23, 22, 22, 22, 22,
- 30, 28, 24, 23, 22, 23, 22, 22, 28, 26, 23, 22, 22, 23, 22, 22, 28, 26,
- 23, 22, 22, 23, 22, 22, 26, 25, 22, 22, 21, 22, 22, 21, 24, 24, 22, 21,
- 20, 21, 20, 20, 24, 24, 22, 21, 20, 21, 20, 20, 23, 23, 22, 21, 20, 20,
- 20, 20, 21, 22, 21, 21, 19, 19, 19, 19, 21, 22, 21, 21, 19, 19, 19, 19,
- 21, 22, 22, 21, 19, 19, 19, 18, 21, 22, 22, 21, 19, 18, 18, 18, 21, 22,
- 22, 21, 19, 18, 18, 18, 21, 23, 22, 22, 19, 18, 18, 17, 21, 23, 23, 22,
- 19, 18, 17, 17, 21, 23, 22, 22, 19, 18, 17, 17, 20, 22, 22, 21, 19, 17,
- 17, 16, 20, 22, 22, 21, 19, 17, 17, 16, 20, 22, 22, 21, 19, 17, 17, 16,
- 20, 21, 22, 21, 19, 17, 16, 16, 20, 21, 22, 21, 19, 17, 16, 16, 19, 21,
- 21, 21, 19, 17, 16, 15, 19, 20, 21, 20, 19, 17, 16, 15, 19, 20, 21, 20,
- 19, 17, 16, 15,
- /* Size 32x8 */
32, 33, 33, 33, 34, 34, 33, 31, 31, 30, 28, 28, 26, 24, 24, 23, 21, 21,
21, 21, 21, 21, 21, 21, 20, 20, 20, 20, 20, 19, 19, 19, 33, 33, 33, 33,
32, 32, 31, 29, 28, 28, 26, 26, 25, 24, 24, 23, 22, 22, 22, 22, 22, 23,
@@ -10362,7 +10347,23 @@
22, 22, 22, 22, 22, 22, 22, 20, 20, 20, 19, 19, 19, 18, 18, 18, 17, 17,
17, 17, 17, 16, 16, 16, 16, 16, 20, 20, 20, 20, 21, 21, 21, 22, 22, 22,
22, 22, 21, 20, 20, 20, 19, 19, 18, 18, 18, 17, 17, 17, 16, 16, 16, 16,
- 16, 15, 15, 15 },
+ 16, 15, 15, 15,
+ /* Size 32x8 */
+ 32, 33, 29, 27, 21, 21, 20, 20, 33, 33, 28, 26, 22, 21, 21, 20, 33, 33,
+ 28, 26, 22, 22, 21, 20, 33, 33, 28, 26, 22, 22, 21, 20, 34, 32, 27, 26,
+ 22, 23, 22, 21, 34, 32, 27, 25, 22, 23, 22, 21, 33, 31, 27, 25, 22, 23,
+ 22, 21, 31, 29, 25, 24, 22, 23, 22, 22, 31, 28, 25, 23, 22, 22, 22, 22,
+ 30, 28, 24, 23, 22, 23, 22, 22, 28, 26, 23, 22, 22, 23, 22, 22, 28, 26,
+ 23, 22, 22, 23, 22, 22, 26, 25, 22, 22, 21, 22, 22, 21, 24, 24, 22, 21,
+ 20, 21, 20, 20, 24, 24, 22, 21, 20, 21, 20, 20, 23, 23, 22, 21, 20, 20,
+ 20, 20, 21, 22, 21, 21, 19, 19, 19, 19, 21, 22, 21, 21, 19, 19, 19, 19,
+ 21, 22, 22, 21, 19, 19, 19, 18, 21, 22, 22, 21, 19, 18, 18, 18, 21, 22,
+ 22, 21, 19, 18, 18, 18, 21, 23, 22, 22, 19, 18, 18, 17, 21, 23, 23, 22,
+ 19, 18, 17, 17, 21, 23, 22, 22, 19, 18, 17, 17, 20, 22, 22, 21, 19, 17,
+ 17, 16, 20, 22, 22, 21, 19, 17, 17, 16, 20, 22, 22, 21, 19, 17, 17, 16,
+ 20, 21, 22, 21, 19, 17, 16, 16, 20, 21, 22, 21, 19, 17, 16, 16, 19, 21,
+ 21, 21, 19, 17, 16, 15, 19, 20, 21, 20, 19, 17, 16, 15, 19, 20, 21, 20,
+ 19, 17, 16, 15 },
},
{
{ /* Luma */
@@ -10448,21 +10449,12 @@
14, 14, 20, 20, 21, 21, 21, 22, 22, 22, 21, 21, 21, 21, 21, 21, 20, 19,
19, 19, 18, 18, 18, 17, 16, 16, 16, 15, 15, 15, 14, 14, 14, 13,
/* Size 4x8 */
- 33, 32, 29, 24, 32, 31, 30, 25, 32, 30, 28, 24, 32, 29, 27, 24, 30, 28,
- 24, 21, 28, 26, 21, 18, 24, 24, 19, 16, 22, 22, 18, 15,
- /* Size 8x4 */
33, 32, 32, 32, 30, 28, 24, 22, 32, 31, 30, 29, 28, 26, 24, 22, 29, 30,
28, 27, 24, 21, 19, 18, 24, 25, 24, 24, 21, 18, 16, 15,
+ /* Size 8x4 */
+ 33, 32, 29, 24, 32, 31, 30, 25, 32, 30, 28, 24, 32, 29, 27, 24, 30, 28,
+ 24, 21, 28, 26, 21, 18, 24, 24, 19, 16, 22, 22, 18, 15,
/* Size 8x16 */
- 32, 33, 33, 32, 29, 28, 23, 22, 33, 32, 32, 32, 29, 29, 24, 23, 33, 32,
- 32, 32, 30, 29, 25, 23, 33, 32, 32, 31, 30, 30, 25, 23, 33, 32, 31, 30,
- 29, 28, 24, 23, 32, 32, 31, 30, 28, 28, 24, 23, 32, 31, 30, 29, 28, 27,
- 24, 23, 32, 31, 30, 28, 26, 26, 23, 22, 30, 30, 29, 28, 25, 24, 21, 20,
- 29, 30, 28, 27, 23, 22, 20, 19, 28, 30, 28, 27, 22, 21, 19, 18, 26, 28,
- 26, 26, 21, 20, 18, 17, 25, 26, 26, 25, 21, 20, 17, 17, 23, 25, 24, 24,
- 20, 19, 16, 16, 22, 23, 23, 23, 19, 18, 16, 15, 21, 23, 23, 22, 19, 18,
- 15, 15,
- /* Size 16x8 */
32, 33, 33, 33, 33, 32, 32, 32, 30, 29, 28, 26, 25, 23, 22, 21, 33, 32,
32, 32, 32, 32, 31, 31, 30, 30, 30, 28, 26, 25, 23, 23, 33, 32, 32, 32,
31, 31, 30, 30, 29, 28, 28, 26, 26, 24, 23, 23, 32, 32, 32, 31, 30, 30,
@@ -10471,37 +10463,16 @@
21, 20, 20, 19, 18, 18, 23, 24, 25, 25, 24, 24, 24, 23, 21, 20, 19, 18,
17, 16, 16, 15, 22, 23, 23, 23, 23, 23, 23, 22, 20, 19, 18, 17, 17, 16,
15, 15,
+ /* Size 16x8 */
+ 32, 33, 33, 32, 29, 28, 23, 22, 33, 32, 32, 32, 29, 29, 24, 23, 33, 32,
+ 32, 32, 30, 29, 25, 23, 33, 32, 32, 31, 30, 30, 25, 23, 33, 32, 31, 30,
+ 29, 28, 24, 23, 32, 32, 31, 30, 28, 28, 24, 23, 32, 31, 30, 29, 28, 27,
+ 24, 23, 32, 31, 30, 28, 26, 26, 23, 22, 30, 30, 29, 28, 25, 24, 21, 20,
+ 29, 30, 28, 27, 23, 22, 20, 19, 28, 30, 28, 27, 22, 21, 19, 18, 26, 28,
+ 26, 26, 21, 20, 18, 17, 25, 26, 26, 25, 21, 20, 17, 17, 23, 25, 24, 24,
+ 20, 19, 16, 16, 22, 23, 23, 23, 19, 18, 16, 15, 21, 23, 23, 22, 19, 18,
+ 15, 15,
/* Size 16x32 */
- 32, 33, 33, 33, 33, 32, 32, 32, 29, 28, 28, 26, 23, 23, 22, 19, 33, 33,
- 32, 32, 32, 32, 32, 31, 29, 29, 29, 26, 24, 24, 22, 20, 33, 32, 32, 32,
- 32, 32, 32, 31, 29, 29, 29, 26, 24, 24, 23, 20, 33, 32, 32, 32, 32, 32,
- 32, 31, 29, 29, 29, 26, 24, 24, 23, 20, 33, 32, 32, 32, 32, 32, 32, 31,
- 30, 29, 29, 26, 25, 25, 23, 20, 33, 32, 32, 32, 32, 31, 31, 31, 30, 30,
- 30, 27, 25, 25, 23, 21, 33, 32, 32, 32, 32, 31, 31, 31, 30, 30, 30, 27,
- 25, 25, 23, 21, 33, 32, 32, 32, 32, 31, 31, 31, 30, 29, 29, 27, 25, 25,
- 23, 21, 33, 32, 32, 32, 31, 30, 30, 30, 29, 28, 28, 26, 24, 24, 23, 21,
- 32, 32, 32, 32, 31, 30, 30, 30, 28, 28, 28, 26, 24, 24, 23, 20, 32, 32,
- 32, 32, 31, 30, 30, 30, 28, 28, 28, 26, 24, 24, 23, 20, 32, 32, 32, 32,
- 31, 29, 29, 29, 28, 28, 28, 26, 24, 24, 23, 21, 32, 32, 31, 31, 30, 29,
- 29, 28, 28, 27, 27, 25, 24, 24, 23, 21, 32, 32, 31, 31, 30, 29, 29, 28,
- 28, 27, 27, 25, 24, 24, 23, 21, 32, 31, 31, 31, 30, 28, 28, 28, 26, 26,
- 26, 24, 23, 23, 22, 20, 30, 30, 30, 30, 29, 28, 28, 27, 25, 24, 24, 23,
- 21, 21, 20, 19, 30, 30, 30, 30, 29, 28, 28, 27, 25, 24, 24, 23, 21, 21,
- 20, 19, 30, 30, 30, 30, 29, 28, 28, 27, 24, 24, 24, 22, 21, 21, 20, 19,
- 29, 29, 30, 30, 28, 27, 27, 26, 23, 22, 22, 20, 20, 20, 19, 17, 28, 29,
- 30, 30, 28, 27, 27, 26, 22, 21, 21, 20, 19, 19, 18, 17, 28, 29, 30, 30,
- 28, 27, 27, 26, 22, 21, 21, 20, 19, 19, 18, 17, 27, 28, 28, 28, 28, 26,
- 26, 25, 22, 21, 21, 19, 18, 18, 18, 16, 26, 27, 28, 28, 26, 26, 26, 24,
- 21, 20, 20, 19, 18, 18, 17, 16, 26, 27, 28, 28, 26, 26, 26, 24, 21, 20,
- 20, 19, 18, 18, 17, 16, 25, 26, 26, 26, 26, 25, 25, 24, 21, 20, 20, 18,
- 17, 17, 17, 15, 23, 24, 25, 25, 24, 24, 24, 23, 20, 19, 19, 17, 16, 16,
- 16, 14, 23, 24, 25, 25, 24, 24, 24, 23, 20, 19, 19, 17, 16, 16, 16, 14,
- 23, 24, 24, 24, 24, 24, 24, 23, 20, 19, 19, 17, 16, 16, 15, 14, 22, 23,
- 23, 23, 23, 23, 23, 22, 19, 18, 18, 17, 16, 16, 15, 14, 21, 22, 23, 23,
- 23, 22, 22, 21, 19, 18, 18, 17, 15, 15, 15, 13, 21, 22, 23, 23, 23, 22,
- 22, 21, 19, 18, 18, 17, 15, 15, 15, 13, 20, 21, 22, 22, 21, 21, 21, 20,
- 18, 18, 18, 16, 15, 15, 14, 13,
- /* Size 32x16 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 30, 30, 30,
29, 28, 28, 27, 26, 26, 25, 23, 23, 23, 22, 21, 21, 20, 33, 33, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 30, 30, 30, 29, 29, 29, 28,
@@ -10531,33 +10502,47 @@
18, 18, 17, 17, 17, 16, 16, 15, 15, 15, 15, 14, 19, 20, 20, 20, 20, 21,
21, 21, 21, 20, 20, 21, 21, 21, 20, 19, 19, 19, 17, 17, 17, 16, 16, 16,
15, 14, 14, 14, 14, 13, 13, 13,
+ /* Size 32x16 */
+ 32, 33, 33, 33, 33, 32, 32, 32, 29, 28, 28, 26, 23, 23, 22, 19, 33, 33,
+ 32, 32, 32, 32, 32, 31, 29, 29, 29, 26, 24, 24, 22, 20, 33, 32, 32, 32,
+ 32, 32, 32, 31, 29, 29, 29, 26, 24, 24, 23, 20, 33, 32, 32, 32, 32, 32,
+ 32, 31, 29, 29, 29, 26, 24, 24, 23, 20, 33, 32, 32, 32, 32, 32, 32, 31,
+ 30, 29, 29, 26, 25, 25, 23, 20, 33, 32, 32, 32, 32, 31, 31, 31, 30, 30,
+ 30, 27, 25, 25, 23, 21, 33, 32, 32, 32, 32, 31, 31, 31, 30, 30, 30, 27,
+ 25, 25, 23, 21, 33, 32, 32, 32, 32, 31, 31, 31, 30, 29, 29, 27, 25, 25,
+ 23, 21, 33, 32, 32, 32, 31, 30, 30, 30, 29, 28, 28, 26, 24, 24, 23, 21,
+ 32, 32, 32, 32, 31, 30, 30, 30, 28, 28, 28, 26, 24, 24, 23, 20, 32, 32,
+ 32, 32, 31, 30, 30, 30, 28, 28, 28, 26, 24, 24, 23, 20, 32, 32, 32, 32,
+ 31, 29, 29, 29, 28, 28, 28, 26, 24, 24, 23, 21, 32, 32, 31, 31, 30, 29,
+ 29, 28, 28, 27, 27, 25, 24, 24, 23, 21, 32, 32, 31, 31, 30, 29, 29, 28,
+ 28, 27, 27, 25, 24, 24, 23, 21, 32, 31, 31, 31, 30, 28, 28, 28, 26, 26,
+ 26, 24, 23, 23, 22, 20, 30, 30, 30, 30, 29, 28, 28, 27, 25, 24, 24, 23,
+ 21, 21, 20, 19, 30, 30, 30, 30, 29, 28, 28, 27, 25, 24, 24, 23, 21, 21,
+ 20, 19, 30, 30, 30, 30, 29, 28, 28, 27, 24, 24, 24, 22, 21, 21, 20, 19,
+ 29, 29, 30, 30, 28, 27, 27, 26, 23, 22, 22, 20, 20, 20, 19, 17, 28, 29,
+ 30, 30, 28, 27, 27, 26, 22, 21, 21, 20, 19, 19, 18, 17, 28, 29, 30, 30,
+ 28, 27, 27, 26, 22, 21, 21, 20, 19, 19, 18, 17, 27, 28, 28, 28, 28, 26,
+ 26, 25, 22, 21, 21, 19, 18, 18, 18, 16, 26, 27, 28, 28, 26, 26, 26, 24,
+ 21, 20, 20, 19, 18, 18, 17, 16, 26, 27, 28, 28, 26, 26, 26, 24, 21, 20,
+ 20, 19, 18, 18, 17, 16, 25, 26, 26, 26, 26, 25, 25, 24, 21, 20, 20, 18,
+ 17, 17, 17, 15, 23, 24, 25, 25, 24, 24, 24, 23, 20, 19, 19, 17, 16, 16,
+ 16, 14, 23, 24, 25, 25, 24, 24, 24, 23, 20, 19, 19, 17, 16, 16, 16, 14,
+ 23, 24, 24, 24, 24, 24, 24, 23, 20, 19, 19, 17, 16, 16, 15, 14, 22, 23,
+ 23, 23, 23, 23, 23, 22, 19, 18, 18, 17, 16, 16, 15, 14, 21, 22, 23, 23,
+ 23, 22, 22, 21, 19, 18, 18, 17, 15, 15, 15, 13, 21, 22, 23, 23, 23, 22,
+ 22, 21, 19, 18, 18, 17, 15, 15, 15, 13, 20, 21, 22, 22, 21, 21, 21, 20,
+ 18, 18, 18, 16, 15, 15, 14, 13,
/* Size 4x16 */
- 33, 32, 28, 23, 32, 32, 29, 24, 32, 32, 29, 25, 32, 31, 30, 25, 32, 30,
- 28, 24, 32, 30, 28, 24, 32, 29, 27, 24, 31, 28, 26, 23, 30, 28, 24, 21,
- 29, 27, 22, 20, 29, 27, 21, 19, 27, 26, 20, 18, 26, 25, 20, 17, 24, 24,
- 19, 16, 23, 23, 18, 16, 22, 22, 18, 15,
- /* Size 16x4 */
33, 32, 32, 32, 32, 32, 32, 31, 30, 29, 29, 27, 26, 24, 23, 22, 32, 32,
32, 31, 30, 30, 29, 28, 28, 27, 27, 26, 25, 24, 23, 22, 28, 29, 29, 30,
28, 28, 27, 26, 24, 22, 21, 20, 20, 19, 18, 18, 23, 24, 25, 25, 24, 24,
24, 23, 21, 20, 19, 18, 17, 16, 16, 15,
+ /* Size 16x4 */
+ 33, 32, 28, 23, 32, 32, 29, 24, 32, 32, 29, 25, 32, 31, 30, 25, 32, 30,
+ 28, 24, 32, 30, 28, 24, 32, 29, 27, 24, 31, 28, 26, 23, 30, 28, 24, 21,
+ 29, 27, 22, 20, 29, 27, 21, 19, 27, 26, 20, 18, 26, 25, 20, 17, 24, 24,
+ 19, 16, 23, 23, 18, 16, 22, 22, 18, 15,
/* Size 8x32 */
- 32, 33, 33, 32, 29, 28, 23, 22, 33, 32, 32, 32, 29, 29, 24, 22, 33, 32,
- 32, 32, 29, 29, 24, 23, 33, 32, 32, 32, 29, 29, 24, 23, 33, 32, 32, 32,
- 30, 29, 25, 23, 33, 32, 32, 31, 30, 30, 25, 23, 33, 32, 32, 31, 30, 30,
- 25, 23, 33, 32, 32, 31, 30, 29, 25, 23, 33, 32, 31, 30, 29, 28, 24, 23,
- 32, 32, 31, 30, 28, 28, 24, 23, 32, 32, 31, 30, 28, 28, 24, 23, 32, 32,
- 31, 29, 28, 28, 24, 23, 32, 31, 30, 29, 28, 27, 24, 23, 32, 31, 30, 29,
- 28, 27, 24, 23, 32, 31, 30, 28, 26, 26, 23, 22, 30, 30, 29, 28, 25, 24,
- 21, 20, 30, 30, 29, 28, 25, 24, 21, 20, 30, 30, 29, 28, 24, 24, 21, 20,
- 29, 30, 28, 27, 23, 22, 20, 19, 28, 30, 28, 27, 22, 21, 19, 18, 28, 30,
- 28, 27, 22, 21, 19, 18, 27, 28, 28, 26, 22, 21, 18, 18, 26, 28, 26, 26,
- 21, 20, 18, 17, 26, 28, 26, 26, 21, 20, 18, 17, 25, 26, 26, 25, 21, 20,
- 17, 17, 23, 25, 24, 24, 20, 19, 16, 16, 23, 25, 24, 24, 20, 19, 16, 16,
- 23, 24, 24, 24, 20, 19, 16, 15, 22, 23, 23, 23, 19, 18, 16, 15, 21, 23,
- 23, 22, 19, 18, 15, 15, 21, 23, 23, 22, 19, 18, 15, 15, 20, 22, 21, 21,
- 18, 18, 15, 14,
- /* Size 32x8 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 30, 30, 30,
29, 28, 28, 27, 26, 26, 25, 23, 23, 23, 22, 21, 21, 20, 33, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 30, 30, 30, 30, 30, 30, 28,
@@ -10572,7 +10557,23 @@
25, 25, 24, 24, 24, 24, 24, 24, 23, 21, 21, 21, 20, 19, 19, 18, 18, 18,
17, 16, 16, 16, 16, 15, 15, 15, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23,
23, 23, 23, 23, 22, 20, 20, 20, 19, 18, 18, 18, 17, 17, 17, 16, 16, 15,
- 15, 15, 15, 14 },
+ 15, 15, 15, 14,
+ /* Size 32x8 */
+ 32, 33, 33, 32, 29, 28, 23, 22, 33, 32, 32, 32, 29, 29, 24, 22, 33, 32,
+ 32, 32, 29, 29, 24, 23, 33, 32, 32, 32, 29, 29, 24, 23, 33, 32, 32, 32,
+ 30, 29, 25, 23, 33, 32, 32, 31, 30, 30, 25, 23, 33, 32, 32, 31, 30, 30,
+ 25, 23, 33, 32, 32, 31, 30, 29, 25, 23, 33, 32, 31, 30, 29, 28, 24, 23,
+ 32, 32, 31, 30, 28, 28, 24, 23, 32, 32, 31, 30, 28, 28, 24, 23, 32, 32,
+ 31, 29, 28, 28, 24, 23, 32, 31, 30, 29, 28, 27, 24, 23, 32, 31, 30, 29,
+ 28, 27, 24, 23, 32, 31, 30, 28, 26, 26, 23, 22, 30, 30, 29, 28, 25, 24,
+ 21, 20, 30, 30, 29, 28, 25, 24, 21, 20, 30, 30, 29, 28, 24, 24, 21, 20,
+ 29, 30, 28, 27, 23, 22, 20, 19, 28, 30, 28, 27, 22, 21, 19, 18, 28, 30,
+ 28, 27, 22, 21, 19, 18, 27, 28, 28, 26, 22, 21, 18, 18, 26, 28, 26, 26,
+ 21, 20, 18, 17, 26, 28, 26, 26, 21, 20, 18, 17, 25, 26, 26, 25, 21, 20,
+ 17, 17, 23, 25, 24, 24, 20, 19, 16, 16, 23, 25, 24, 24, 20, 19, 16, 16,
+ 23, 24, 24, 24, 20, 19, 16, 15, 22, 23, 23, 23, 19, 18, 16, 15, 21, 23,
+ 23, 22, 19, 18, 15, 15, 21, 23, 23, 22, 19, 18, 15, 15, 20, 22, 21, 21,
+ 18, 18, 15, 14 },
{ /* Chroma */
/* Size 4x4 */
33, 28, 22, 22, 28, 23, 22, 23, 22, 22, 19, 19, 22, 23, 19, 17,
@@ -10656,21 +10657,12 @@
17, 16, 20, 20, 21, 21, 21, 22, 22, 22, 22, 22, 22, 22, 22, 22, 21, 20,
20, 20, 19, 19, 19, 18, 18, 18, 18, 17, 17, 17, 17, 16, 16, 16,
/* Size 4x8 */
- 33, 27, 22, 21, 33, 26, 22, 23, 29, 24, 22, 22, 26, 22, 22, 23, 24, 22,
- 20, 20, 22, 22, 19, 19, 22, 22, 19, 18, 21, 22, 19, 17,
- /* Size 8x4 */
33, 33, 29, 26, 24, 22, 22, 21, 27, 26, 24, 22, 22, 22, 22, 22, 22, 22,
22, 22, 20, 19, 19, 19, 21, 23, 22, 23, 20, 19, 18, 17,
+ /* Size 8x4 */
+ 33, 27, 22, 21, 33, 26, 22, 23, 29, 24, 22, 22, 26, 22, 22, 23, 24, 22,
+ 20, 20, 22, 22, 19, 19, 22, 22, 19, 18, 21, 22, 19, 17,
/* Size 8x16 */
- 32, 33, 31, 28, 23, 21, 21, 20, 33, 33, 30, 27, 23, 22, 22, 21, 33, 32,
- 30, 26, 23, 22, 22, 22, 34, 32, 29, 26, 23, 22, 23, 22, 31, 29, 28, 24,
- 22, 22, 23, 22, 31, 28, 27, 24, 22, 22, 22, 22, 28, 26, 24, 22, 22, 22,
- 23, 22, 26, 25, 24, 22, 21, 21, 22, 22, 24, 24, 23, 22, 21, 20, 21, 20,
- 22, 22, 22, 21, 20, 20, 19, 19, 21, 22, 22, 21, 20, 19, 19, 19, 21, 22,
- 22, 22, 20, 19, 18, 18, 21, 23, 22, 22, 20, 19, 18, 18, 21, 23, 23, 22,
- 20, 19, 18, 17, 20, 22, 22, 22, 20, 19, 17, 17, 20, 22, 22, 22, 20, 19,
- 17, 17,
- /* Size 16x8 */
32, 33, 33, 34, 31, 31, 28, 26, 24, 22, 21, 21, 21, 21, 20, 20, 33, 33,
32, 32, 29, 28, 26, 25, 24, 22, 22, 22, 23, 23, 22, 22, 31, 30, 30, 29,
28, 27, 24, 24, 23, 22, 22, 22, 22, 23, 22, 22, 28, 27, 26, 26, 24, 24,
@@ -10679,37 +10671,16 @@
19, 19, 19, 19, 19, 19, 21, 22, 22, 23, 23, 22, 23, 22, 21, 19, 19, 18,
18, 18, 17, 17, 20, 21, 22, 22, 22, 22, 22, 22, 20, 19, 19, 18, 18, 17,
17, 17,
+ /* Size 16x8 */
+ 32, 33, 31, 28, 23, 21, 21, 20, 33, 33, 30, 27, 23, 22, 22, 21, 33, 32,
+ 30, 26, 23, 22, 22, 22, 34, 32, 29, 26, 23, 22, 23, 22, 31, 29, 28, 24,
+ 22, 22, 23, 22, 31, 28, 27, 24, 22, 22, 22, 22, 28, 26, 24, 22, 22, 22,
+ 23, 22, 26, 25, 24, 22, 21, 21, 22, 22, 24, 24, 23, 22, 21, 20, 21, 20,
+ 22, 22, 22, 21, 20, 20, 19, 19, 21, 22, 22, 21, 20, 19, 19, 19, 21, 22,
+ 22, 22, 20, 19, 18, 18, 21, 23, 22, 22, 20, 19, 18, 18, 21, 23, 23, 22,
+ 20, 19, 18, 17, 20, 22, 22, 22, 20, 19, 17, 17, 20, 22, 22, 22, 20, 19,
+ 17, 17,
/* Size 16x32 */
- 32, 33, 33, 33, 31, 28, 28, 27, 23, 21, 21, 21, 21, 21, 20, 20, 33, 33,
- 33, 33, 31, 27, 27, 26, 23, 22, 22, 21, 21, 21, 21, 20, 33, 33, 33, 33,
- 30, 27, 27, 26, 23, 22, 22, 22, 22, 22, 21, 20, 33, 33, 33, 33, 30, 27,
- 27, 26, 23, 22, 22, 22, 22, 22, 21, 20, 33, 33, 32, 32, 30, 26, 26, 26,
- 23, 22, 22, 22, 22, 22, 22, 21, 34, 33, 32, 32, 29, 26, 26, 25, 23, 22,
- 22, 23, 23, 23, 22, 21, 34, 33, 32, 32, 29, 26, 26, 25, 23, 22, 22, 23,
- 23, 23, 22, 21, 33, 32, 31, 31, 29, 26, 26, 25, 23, 22, 22, 23, 23, 23,
- 22, 21, 31, 30, 29, 29, 28, 24, 24, 24, 22, 22, 22, 22, 23, 23, 22, 22,
- 31, 29, 28, 28, 27, 24, 24, 23, 22, 22, 22, 22, 22, 22, 22, 22, 31, 29,
- 28, 28, 27, 24, 24, 23, 22, 22, 22, 22, 22, 22, 22, 22, 29, 28, 27, 27,
- 25, 23, 23, 22, 22, 22, 22, 22, 23, 23, 22, 22, 28, 26, 26, 26, 24, 22,
- 22, 22, 22, 22, 22, 22, 23, 23, 22, 22, 28, 26, 26, 26, 24, 22, 22, 22,
- 22, 22, 22, 22, 23, 23, 22, 22, 26, 26, 25, 25, 24, 22, 22, 22, 21, 21,
- 21, 22, 22, 22, 22, 21, 24, 24, 24, 24, 23, 22, 22, 21, 21, 20, 20, 21,
- 21, 21, 20, 20, 24, 24, 24, 24, 23, 22, 22, 21, 21, 20, 20, 21, 21, 21,
- 20, 20, 24, 24, 24, 24, 23, 22, 22, 21, 20, 20, 20, 20, 20, 20, 20, 20,
- 22, 22, 22, 22, 22, 21, 21, 21, 20, 20, 20, 20, 19, 19, 19, 19, 21, 22,
- 22, 22, 22, 21, 21, 21, 20, 19, 19, 19, 19, 19, 19, 19, 21, 22, 22, 22,
- 22, 21, 21, 21, 20, 19, 19, 19, 19, 19, 19, 19, 21, 22, 22, 22, 22, 22,
- 22, 21, 20, 19, 19, 19, 19, 19, 19, 18, 21, 22, 22, 22, 22, 22, 22, 21,
- 20, 19, 19, 19, 18, 18, 18, 18, 21, 22, 22, 22, 22, 22, 22, 21, 20, 19,
- 19, 19, 18, 18, 18, 18, 21, 22, 23, 23, 22, 22, 22, 22, 20, 19, 19, 19,
- 18, 18, 18, 17, 21, 22, 23, 23, 23, 22, 22, 22, 20, 19, 19, 18, 18, 18,
- 17, 17, 21, 22, 23, 23, 23, 22, 22, 22, 20, 19, 19, 18, 18, 18, 17, 17,
- 21, 22, 23, 23, 23, 22, 22, 22, 20, 19, 19, 18, 18, 18, 17, 17, 20, 21,
- 22, 22, 22, 22, 22, 21, 20, 19, 19, 18, 17, 17, 17, 16, 20, 21, 22, 22,
- 22, 22, 22, 21, 20, 19, 19, 18, 17, 17, 17, 16, 20, 21, 22, 22, 22, 22,
- 22, 21, 20, 19, 19, 18, 17, 17, 17, 16, 20, 21, 22, 22, 22, 22, 22, 21,
- 20, 19, 19, 18, 17, 17, 17, 16,
- /* Size 32x16 */
32, 33, 33, 33, 33, 34, 34, 33, 31, 31, 31, 29, 28, 28, 26, 24, 24, 24,
22, 21, 21, 21, 21, 21, 21, 21, 21, 21, 20, 20, 20, 20, 33, 33, 33, 33,
33, 33, 33, 32, 30, 29, 29, 28, 26, 26, 26, 24, 24, 24, 22, 22, 22, 22,
@@ -10739,33 +10710,47 @@
19, 19, 18, 18, 18, 17, 17, 17, 17, 17, 17, 17, 20, 20, 20, 20, 21, 21,
21, 21, 22, 22, 22, 22, 22, 22, 21, 20, 20, 20, 19, 19, 19, 18, 18, 18,
17, 17, 17, 17, 16, 16, 16, 16,
+ /* Size 32x16 */
+ 32, 33, 33, 33, 31, 28, 28, 27, 23, 21, 21, 21, 21, 21, 20, 20, 33, 33,
+ 33, 33, 31, 27, 27, 26, 23, 22, 22, 21, 21, 21, 21, 20, 33, 33, 33, 33,
+ 30, 27, 27, 26, 23, 22, 22, 22, 22, 22, 21, 20, 33, 33, 33, 33, 30, 27,
+ 27, 26, 23, 22, 22, 22, 22, 22, 21, 20, 33, 33, 32, 32, 30, 26, 26, 26,
+ 23, 22, 22, 22, 22, 22, 22, 21, 34, 33, 32, 32, 29, 26, 26, 25, 23, 22,
+ 22, 23, 23, 23, 22, 21, 34, 33, 32, 32, 29, 26, 26, 25, 23, 22, 22, 23,
+ 23, 23, 22, 21, 33, 32, 31, 31, 29, 26, 26, 25, 23, 22, 22, 23, 23, 23,
+ 22, 21, 31, 30, 29, 29, 28, 24, 24, 24, 22, 22, 22, 22, 23, 23, 22, 22,
+ 31, 29, 28, 28, 27, 24, 24, 23, 22, 22, 22, 22, 22, 22, 22, 22, 31, 29,
+ 28, 28, 27, 24, 24, 23, 22, 22, 22, 22, 22, 22, 22, 22, 29, 28, 27, 27,
+ 25, 23, 23, 22, 22, 22, 22, 22, 23, 23, 22, 22, 28, 26, 26, 26, 24, 22,
+ 22, 22, 22, 22, 22, 22, 23, 23, 22, 22, 28, 26, 26, 26, 24, 22, 22, 22,
+ 22, 22, 22, 22, 23, 23, 22, 22, 26, 26, 25, 25, 24, 22, 22, 22, 21, 21,
+ 21, 22, 22, 22, 22, 21, 24, 24, 24, 24, 23, 22, 22, 21, 21, 20, 20, 21,
+ 21, 21, 20, 20, 24, 24, 24, 24, 23, 22, 22, 21, 21, 20, 20, 21, 21, 21,
+ 20, 20, 24, 24, 24, 24, 23, 22, 22, 21, 20, 20, 20, 20, 20, 20, 20, 20,
+ 22, 22, 22, 22, 22, 21, 21, 21, 20, 20, 20, 20, 19, 19, 19, 19, 21, 22,
+ 22, 22, 22, 21, 21, 21, 20, 19, 19, 19, 19, 19, 19, 19, 21, 22, 22, 22,
+ 22, 21, 21, 21, 20, 19, 19, 19, 19, 19, 19, 19, 21, 22, 22, 22, 22, 22,
+ 22, 21, 20, 19, 19, 19, 19, 19, 19, 18, 21, 22, 22, 22, 22, 22, 22, 21,
+ 20, 19, 19, 19, 18, 18, 18, 18, 21, 22, 22, 22, 22, 22, 22, 21, 20, 19,
+ 19, 19, 18, 18, 18, 18, 21, 22, 23, 23, 22, 22, 22, 22, 20, 19, 19, 19,
+ 18, 18, 18, 17, 21, 22, 23, 23, 23, 22, 22, 22, 20, 19, 19, 18, 18, 18,
+ 17, 17, 21, 22, 23, 23, 23, 22, 22, 22, 20, 19, 19, 18, 18, 18, 17, 17,
+ 21, 22, 23, 23, 23, 22, 22, 22, 20, 19, 19, 18, 18, 18, 17, 17, 20, 21,
+ 22, 22, 22, 22, 22, 21, 20, 19, 19, 18, 17, 17, 17, 16, 20, 21, 22, 22,
+ 22, 22, 22, 21, 20, 19, 19, 18, 17, 17, 17, 16, 20, 21, 22, 22, 22, 22,
+ 22, 21, 20, 19, 19, 18, 17, 17, 17, 16, 20, 21, 22, 22, 22, 22, 22, 21,
+ 20, 19, 19, 18, 17, 17, 17, 16,
/* Size 4x16 */
- 33, 28, 21, 21, 33, 27, 22, 22, 33, 26, 22, 22, 33, 26, 22, 23, 30, 24,
- 22, 23, 29, 24, 22, 22, 26, 22, 22, 23, 26, 22, 21, 22, 24, 22, 20, 21,
- 22, 21, 20, 19, 22, 21, 19, 19, 22, 22, 19, 18, 22, 22, 19, 18, 22, 22,
- 19, 18, 21, 22, 19, 17, 21, 22, 19, 17,
- /* Size 16x4 */
33, 33, 33, 33, 30, 29, 26, 26, 24, 22, 22, 22, 22, 22, 21, 21, 28, 27,
26, 26, 24, 24, 22, 22, 22, 21, 21, 22, 22, 22, 22, 22, 21, 22, 22, 22,
22, 22, 22, 21, 20, 20, 19, 19, 19, 19, 19, 19, 21, 22, 22, 23, 23, 22,
23, 22, 21, 19, 19, 18, 18, 18, 17, 17,
+ /* Size 16x4 */
+ 33, 28, 21, 21, 33, 27, 22, 22, 33, 26, 22, 22, 33, 26, 22, 23, 30, 24,
+ 22, 23, 29, 24, 22, 22, 26, 22, 22, 23, 26, 22, 21, 22, 24, 22, 20, 21,
+ 22, 21, 20, 19, 22, 21, 19, 19, 22, 22, 19, 18, 22, 22, 19, 18, 22, 22,
+ 19, 18, 21, 22, 19, 17, 21, 22, 19, 17,
/* Size 8x32 */
- 32, 33, 31, 28, 23, 21, 21, 20, 33, 33, 31, 27, 23, 22, 21, 21, 33, 33,
- 30, 27, 23, 22, 22, 21, 33, 33, 30, 27, 23, 22, 22, 21, 33, 32, 30, 26,
- 23, 22, 22, 22, 34, 32, 29, 26, 23, 22, 23, 22, 34, 32, 29, 26, 23, 22,
- 23, 22, 33, 31, 29, 26, 23, 22, 23, 22, 31, 29, 28, 24, 22, 22, 23, 22,
- 31, 28, 27, 24, 22, 22, 22, 22, 31, 28, 27, 24, 22, 22, 22, 22, 29, 27,
- 25, 23, 22, 22, 23, 22, 28, 26, 24, 22, 22, 22, 23, 22, 28, 26, 24, 22,
- 22, 22, 23, 22, 26, 25, 24, 22, 21, 21, 22, 22, 24, 24, 23, 22, 21, 20,
- 21, 20, 24, 24, 23, 22, 21, 20, 21, 20, 24, 24, 23, 22, 20, 20, 20, 20,
- 22, 22, 22, 21, 20, 20, 19, 19, 21, 22, 22, 21, 20, 19, 19, 19, 21, 22,
- 22, 21, 20, 19, 19, 19, 21, 22, 22, 22, 20, 19, 19, 19, 21, 22, 22, 22,
- 20, 19, 18, 18, 21, 22, 22, 22, 20, 19, 18, 18, 21, 23, 22, 22, 20, 19,
- 18, 18, 21, 23, 23, 22, 20, 19, 18, 17, 21, 23, 23, 22, 20, 19, 18, 17,
- 21, 23, 23, 22, 20, 19, 18, 17, 20, 22, 22, 22, 20, 19, 17, 17, 20, 22,
- 22, 22, 20, 19, 17, 17, 20, 22, 22, 22, 20, 19, 17, 17, 20, 22, 22, 22,
- 20, 19, 17, 17,
- /* Size 32x8 */
32, 33, 33, 33, 33, 34, 34, 33, 31, 31, 31, 29, 28, 28, 26, 24, 24, 24,
22, 21, 21, 21, 21, 21, 21, 21, 21, 21, 20, 20, 20, 20, 33, 33, 33, 33,
32, 32, 32, 31, 29, 28, 28, 27, 26, 26, 25, 24, 24, 24, 22, 22, 22, 22,
@@ -10780,7 +10765,23 @@
23, 23, 23, 22, 22, 23, 23, 23, 22, 21, 21, 20, 19, 19, 19, 19, 18, 18,
18, 18, 18, 18, 17, 17, 17, 17, 20, 21, 21, 21, 22, 22, 22, 22, 22, 22,
22, 22, 22, 22, 22, 20, 20, 20, 19, 19, 19, 19, 18, 18, 18, 17, 17, 17,
- 17, 17, 17, 17 },
+ 17, 17, 17, 17,
+ /* Size 32x8 */
+ 32, 33, 31, 28, 23, 21, 21, 20, 33, 33, 31, 27, 23, 22, 21, 21, 33, 33,
+ 30, 27, 23, 22, 22, 21, 33, 33, 30, 27, 23, 22, 22, 21, 33, 32, 30, 26,
+ 23, 22, 22, 22, 34, 32, 29, 26, 23, 22, 23, 22, 34, 32, 29, 26, 23, 22,
+ 23, 22, 33, 31, 29, 26, 23, 22, 23, 22, 31, 29, 28, 24, 22, 22, 23, 22,
+ 31, 28, 27, 24, 22, 22, 22, 22, 31, 28, 27, 24, 22, 22, 22, 22, 29, 27,
+ 25, 23, 22, 22, 23, 22, 28, 26, 24, 22, 22, 22, 23, 22, 28, 26, 24, 22,
+ 22, 22, 23, 22, 26, 25, 24, 22, 21, 21, 22, 22, 24, 24, 23, 22, 21, 20,
+ 21, 20, 24, 24, 23, 22, 21, 20, 21, 20, 24, 24, 23, 22, 20, 20, 20, 20,
+ 22, 22, 22, 21, 20, 20, 19, 19, 21, 22, 22, 21, 20, 19, 19, 19, 21, 22,
+ 22, 21, 20, 19, 19, 19, 21, 22, 22, 22, 20, 19, 19, 19, 21, 22, 22, 22,
+ 20, 19, 18, 18, 21, 22, 22, 22, 20, 19, 18, 18, 21, 23, 22, 22, 20, 19,
+ 18, 18, 21, 23, 23, 22, 20, 19, 18, 17, 21, 23, 23, 22, 20, 19, 18, 17,
+ 21, 23, 23, 22, 20, 19, 18, 17, 20, 22, 22, 22, 20, 19, 17, 17, 20, 22,
+ 22, 22, 20, 19, 17, 17, 20, 22, 22, 22, 20, 19, 17, 17, 20, 22, 22, 22,
+ 20, 19, 17, 17 },
},
{
{ /* Luma */
@@ -10866,21 +10867,12 @@
16, 16, 23, 24, 24, 24, 24, 25, 25, 25, 25, 25, 24, 24, 24, 24, 24, 24,
24, 23, 22, 22, 22, 20, 19, 19, 19, 18, 18, 18, 18, 17, 16, 16,
/* Size 4x8 */
- 33, 32, 30, 26, 32, 32, 30, 27, 32, 31, 30, 27, 32, 31, 28, 26, 31, 30,
- 27, 24, 30, 28, 25, 22, 28, 27, 23, 20, 26, 26, 22, 18,
- /* Size 8x4 */
33, 32, 32, 32, 31, 30, 28, 26, 32, 32, 31, 31, 30, 28, 27, 26, 30, 30,
30, 28, 27, 25, 23, 22, 26, 27, 27, 26, 24, 22, 20, 18,
+ /* Size 8x4 */
+ 33, 32, 30, 26, 32, 32, 30, 27, 32, 31, 30, 27, 32, 31, 28, 26, 31, 30,
+ 27, 24, 30, 28, 25, 22, 28, 27, 23, 20, 26, 26, 22, 18,
/* Size 8x16 */
- 32, 33, 33, 32, 32, 28, 28, 23, 33, 32, 32, 32, 32, 29, 29, 24, 33, 32,
- 32, 32, 32, 29, 29, 24, 33, 32, 32, 31, 31, 30, 30, 25, 33, 32, 32, 31,
- 31, 30, 30, 25, 32, 32, 32, 30, 30, 28, 28, 24, 32, 32, 32, 30, 30, 28,
- 28, 24, 32, 31, 31, 29, 29, 27, 27, 24, 32, 31, 31, 29, 29, 27, 27, 24,
- 30, 30, 30, 28, 28, 24, 24, 21, 30, 30, 30, 28, 28, 24, 24, 21, 28, 30,
- 30, 27, 27, 21, 21, 19, 28, 30, 30, 27, 27, 21, 21, 19, 26, 28, 28, 26,
- 26, 20, 20, 18, 26, 28, 28, 26, 26, 20, 20, 18, 23, 25, 25, 24, 24, 19,
- 19, 16,
- /* Size 16x8 */
32, 33, 33, 33, 33, 32, 32, 32, 32, 30, 30, 28, 28, 26, 26, 23, 33, 32,
32, 32, 32, 32, 32, 31, 31, 30, 30, 30, 30, 28, 28, 25, 33, 32, 32, 32,
32, 32, 32, 31, 31, 30, 30, 30, 30, 28, 28, 25, 32, 32, 32, 31, 31, 30,
@@ -10889,37 +10881,16 @@
24, 21, 21, 20, 20, 19, 28, 29, 29, 30, 30, 28, 28, 27, 27, 24, 24, 21,
21, 20, 20, 19, 23, 24, 24, 25, 25, 24, 24, 24, 24, 21, 21, 19, 19, 18,
18, 16,
+ /* Size 16x8 */
+ 32, 33, 33, 32, 32, 28, 28, 23, 33, 32, 32, 32, 32, 29, 29, 24, 33, 32,
+ 32, 32, 32, 29, 29, 24, 33, 32, 32, 31, 31, 30, 30, 25, 33, 32, 32, 31,
+ 31, 30, 30, 25, 32, 32, 32, 30, 30, 28, 28, 24, 32, 32, 32, 30, 30, 28,
+ 28, 24, 32, 31, 31, 29, 29, 27, 27, 24, 32, 31, 31, 29, 29, 27, 27, 24,
+ 30, 30, 30, 28, 28, 24, 24, 21, 30, 30, 30, 28, 28, 24, 24, 21, 28, 30,
+ 30, 27, 27, 21, 21, 19, 28, 30, 30, 27, 27, 21, 21, 19, 26, 28, 28, 26,
+ 26, 20, 20, 18, 26, 28, 28, 26, 26, 20, 20, 18, 23, 25, 25, 24, 24, 19,
+ 19, 16,
/* Size 16x32 */
- 32, 33, 33, 33, 33, 32, 32, 32, 32, 30, 28, 28, 28, 26, 23, 23, 33, 33,
- 33, 33, 33, 32, 32, 32, 32, 30, 29, 29, 29, 26, 24, 24, 33, 32, 32, 32,
- 32, 32, 32, 32, 32, 30, 29, 29, 29, 27, 24, 24, 33, 32, 32, 32, 32, 32,
- 32, 32, 32, 30, 29, 29, 29, 27, 24, 24, 33, 32, 32, 32, 32, 32, 32, 32,
- 32, 30, 29, 29, 29, 27, 24, 24, 33, 32, 32, 32, 32, 32, 32, 32, 32, 30,
- 29, 29, 29, 27, 25, 25, 33, 32, 32, 32, 32, 32, 31, 31, 31, 31, 30, 30,
- 30, 28, 25, 25, 33, 32, 32, 32, 32, 32, 31, 31, 31, 31, 30, 30, 30, 28,
- 25, 25, 33, 32, 32, 32, 32, 32, 31, 31, 31, 31, 30, 30, 30, 28, 25, 25,
- 33, 32, 32, 32, 32, 31, 31, 31, 31, 30, 29, 29, 29, 27, 25, 25, 32, 32,
- 32, 32, 32, 31, 30, 30, 30, 29, 28, 28, 28, 26, 24, 24, 32, 32, 32, 32,
- 32, 31, 30, 30, 30, 29, 28, 28, 28, 26, 24, 24, 32, 32, 32, 32, 32, 31,
- 30, 30, 30, 29, 28, 28, 28, 26, 24, 24, 32, 32, 32, 32, 32, 31, 30, 30,
- 30, 28, 28, 28, 28, 26, 24, 24, 32, 32, 31, 31, 31, 30, 29, 29, 29, 28,
- 27, 27, 27, 26, 24, 24, 32, 32, 31, 31, 31, 30, 29, 29, 29, 28, 27, 27,
- 27, 26, 24, 24, 32, 32, 31, 31, 31, 30, 29, 29, 29, 28, 27, 27, 27, 26,
- 24, 24, 31, 31, 31, 31, 31, 30, 28, 28, 28, 27, 26, 26, 26, 24, 23, 23,
- 30, 30, 30, 30, 30, 29, 28, 28, 28, 26, 24, 24, 24, 23, 21, 21, 30, 30,
- 30, 30, 30, 29, 28, 28, 28, 26, 24, 24, 24, 23, 21, 21, 30, 30, 30, 30,
- 30, 29, 28, 28, 28, 26, 24, 24, 24, 23, 21, 21, 29, 30, 30, 30, 30, 28,
- 28, 28, 28, 25, 23, 23, 23, 22, 20, 20, 28, 29, 30, 30, 30, 28, 27, 27,
- 27, 24, 21, 21, 21, 20, 19, 19, 28, 29, 30, 30, 30, 28, 27, 27, 27, 24,
- 21, 21, 21, 20, 19, 19, 28, 29, 30, 30, 30, 28, 27, 27, 27, 24, 21, 21,
- 21, 20, 19, 19, 28, 28, 28, 28, 28, 27, 26, 26, 26, 23, 21, 21, 21, 20,
- 18, 18, 26, 27, 28, 28, 28, 26, 26, 26, 26, 23, 20, 20, 20, 19, 18, 18,
- 26, 27, 28, 28, 28, 26, 26, 26, 26, 23, 20, 20, 20, 19, 18, 18, 26, 27,
- 28, 28, 28, 26, 26, 26, 26, 23, 20, 20, 20, 19, 18, 18, 25, 26, 26, 26,
- 26, 26, 24, 24, 24, 22, 20, 20, 20, 18, 17, 17, 23, 24, 25, 25, 25, 24,
- 24, 24, 24, 21, 19, 19, 19, 18, 16, 16, 23, 24, 25, 25, 25, 24, 24, 24,
- 24, 21, 19, 19, 19, 18, 16, 16,
- /* Size 32x16 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 31,
30, 30, 30, 29, 28, 28, 28, 28, 26, 26, 26, 25, 23, 23, 33, 33, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 30, 30, 30, 30,
@@ -10949,33 +10920,47 @@
21, 20, 19, 19, 19, 18, 18, 18, 18, 17, 16, 16, 23, 24, 24, 24, 24, 25,
25, 25, 25, 25, 24, 24, 24, 24, 24, 24, 24, 23, 21, 21, 21, 20, 19, 19,
19, 18, 18, 18, 18, 17, 16, 16,
+ /* Size 32x16 */
+ 32, 33, 33, 33, 33, 32, 32, 32, 32, 30, 28, 28, 28, 26, 23, 23, 33, 33,
+ 33, 33, 33, 32, 32, 32, 32, 30, 29, 29, 29, 26, 24, 24, 33, 32, 32, 32,
+ 32, 32, 32, 32, 32, 30, 29, 29, 29, 27, 24, 24, 33, 32, 32, 32, 32, 32,
+ 32, 32, 32, 30, 29, 29, 29, 27, 24, 24, 33, 32, 32, 32, 32, 32, 32, 32,
+ 32, 30, 29, 29, 29, 27, 24, 24, 33, 32, 32, 32, 32, 32, 32, 32, 32, 30,
+ 29, 29, 29, 27, 25, 25, 33, 32, 32, 32, 32, 32, 31, 31, 31, 31, 30, 30,
+ 30, 28, 25, 25, 33, 32, 32, 32, 32, 32, 31, 31, 31, 31, 30, 30, 30, 28,
+ 25, 25, 33, 32, 32, 32, 32, 32, 31, 31, 31, 31, 30, 30, 30, 28, 25, 25,
+ 33, 32, 32, 32, 32, 31, 31, 31, 31, 30, 29, 29, 29, 27, 25, 25, 32, 32,
+ 32, 32, 32, 31, 30, 30, 30, 29, 28, 28, 28, 26, 24, 24, 32, 32, 32, 32,
+ 32, 31, 30, 30, 30, 29, 28, 28, 28, 26, 24, 24, 32, 32, 32, 32, 32, 31,
+ 30, 30, 30, 29, 28, 28, 28, 26, 24, 24, 32, 32, 32, 32, 32, 31, 30, 30,
+ 30, 28, 28, 28, 28, 26, 24, 24, 32, 32, 31, 31, 31, 30, 29, 29, 29, 28,
+ 27, 27, 27, 26, 24, 24, 32, 32, 31, 31, 31, 30, 29, 29, 29, 28, 27, 27,
+ 27, 26, 24, 24, 32, 32, 31, 31, 31, 30, 29, 29, 29, 28, 27, 27, 27, 26,
+ 24, 24, 31, 31, 31, 31, 31, 30, 28, 28, 28, 27, 26, 26, 26, 24, 23, 23,
+ 30, 30, 30, 30, 30, 29, 28, 28, 28, 26, 24, 24, 24, 23, 21, 21, 30, 30,
+ 30, 30, 30, 29, 28, 28, 28, 26, 24, 24, 24, 23, 21, 21, 30, 30, 30, 30,
+ 30, 29, 28, 28, 28, 26, 24, 24, 24, 23, 21, 21, 29, 30, 30, 30, 30, 28,
+ 28, 28, 28, 25, 23, 23, 23, 22, 20, 20, 28, 29, 30, 30, 30, 28, 27, 27,
+ 27, 24, 21, 21, 21, 20, 19, 19, 28, 29, 30, 30, 30, 28, 27, 27, 27, 24,
+ 21, 21, 21, 20, 19, 19, 28, 29, 30, 30, 30, 28, 27, 27, 27, 24, 21, 21,
+ 21, 20, 19, 19, 28, 28, 28, 28, 28, 27, 26, 26, 26, 23, 21, 21, 21, 20,
+ 18, 18, 26, 27, 28, 28, 28, 26, 26, 26, 26, 23, 20, 20, 20, 19, 18, 18,
+ 26, 27, 28, 28, 28, 26, 26, 26, 26, 23, 20, 20, 20, 19, 18, 18, 26, 27,
+ 28, 28, 28, 26, 26, 26, 26, 23, 20, 20, 20, 19, 18, 18, 25, 26, 26, 26,
+ 26, 26, 24, 24, 24, 22, 20, 20, 20, 18, 17, 17, 23, 24, 25, 25, 25, 24,
+ 24, 24, 24, 21, 19, 19, 19, 18, 16, 16, 23, 24, 25, 25, 25, 24, 24, 24,
+ 24, 21, 19, 19, 19, 18, 16, 16,
/* Size 4x16 */
- 33, 32, 30, 26, 32, 32, 30, 27, 32, 32, 30, 27, 32, 32, 31, 28, 32, 32,
- 31, 28, 32, 31, 29, 26, 32, 31, 29, 26, 32, 30, 28, 26, 32, 30, 28, 26,
- 30, 29, 26, 23, 30, 29, 26, 23, 29, 28, 24, 20, 29, 28, 24, 20, 27, 26,
- 23, 19, 27, 26, 23, 19, 24, 24, 21, 18,
- /* Size 16x4 */
33, 32, 32, 32, 32, 32, 32, 32, 32, 30, 30, 29, 29, 27, 27, 24, 32, 32,
32, 32, 32, 31, 31, 30, 30, 29, 29, 28, 28, 26, 26, 24, 30, 30, 30, 31,
31, 29, 29, 28, 28, 26, 26, 24, 24, 23, 23, 21, 26, 27, 27, 28, 28, 26,
26, 26, 26, 23, 23, 20, 20, 19, 19, 18,
+ /* Size 16x4 */
+ 33, 32, 30, 26, 32, 32, 30, 27, 32, 32, 30, 27, 32, 32, 31, 28, 32, 32,
+ 31, 28, 32, 31, 29, 26, 32, 31, 29, 26, 32, 30, 28, 26, 32, 30, 28, 26,
+ 30, 29, 26, 23, 30, 29, 26, 23, 29, 28, 24, 20, 29, 28, 24, 20, 27, 26,
+ 23, 19, 27, 26, 23, 19, 24, 24, 21, 18,
/* Size 8x32 */
- 32, 33, 33, 32, 32, 28, 28, 23, 33, 33, 33, 32, 32, 29, 29, 24, 33, 32,
- 32, 32, 32, 29, 29, 24, 33, 32, 32, 32, 32, 29, 29, 24, 33, 32, 32, 32,
- 32, 29, 29, 24, 33, 32, 32, 32, 32, 29, 29, 25, 33, 32, 32, 31, 31, 30,
- 30, 25, 33, 32, 32, 31, 31, 30, 30, 25, 33, 32, 32, 31, 31, 30, 30, 25,
- 33, 32, 32, 31, 31, 29, 29, 25, 32, 32, 32, 30, 30, 28, 28, 24, 32, 32,
- 32, 30, 30, 28, 28, 24, 32, 32, 32, 30, 30, 28, 28, 24, 32, 32, 32, 30,
- 30, 28, 28, 24, 32, 31, 31, 29, 29, 27, 27, 24, 32, 31, 31, 29, 29, 27,
- 27, 24, 32, 31, 31, 29, 29, 27, 27, 24, 31, 31, 31, 28, 28, 26, 26, 23,
- 30, 30, 30, 28, 28, 24, 24, 21, 30, 30, 30, 28, 28, 24, 24, 21, 30, 30,
- 30, 28, 28, 24, 24, 21, 29, 30, 30, 28, 28, 23, 23, 20, 28, 30, 30, 27,
- 27, 21, 21, 19, 28, 30, 30, 27, 27, 21, 21, 19, 28, 30, 30, 27, 27, 21,
- 21, 19, 28, 28, 28, 26, 26, 21, 21, 18, 26, 28, 28, 26, 26, 20, 20, 18,
- 26, 28, 28, 26, 26, 20, 20, 18, 26, 28, 28, 26, 26, 20, 20, 18, 25, 26,
- 26, 24, 24, 20, 20, 17, 23, 25, 25, 24, 24, 19, 19, 16, 23, 25, 25, 24,
- 24, 19, 19, 16,
- /* Size 32x8 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 31,
30, 30, 30, 29, 28, 28, 28, 28, 26, 26, 26, 25, 23, 23, 33, 33, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 30, 30, 30, 30,
@@ -10990,7 +10975,23 @@
30, 30, 30, 29, 28, 28, 28, 28, 27, 27, 27, 26, 24, 24, 24, 23, 21, 21,
21, 21, 20, 20, 20, 20, 19, 19, 23, 24, 24, 24, 24, 25, 25, 25, 25, 25,
24, 24, 24, 24, 24, 24, 24, 23, 21, 21, 21, 20, 19, 19, 19, 18, 18, 18,
- 18, 17, 16, 16 },
+ 18, 17, 16, 16,
+ /* Size 32x8 */
+ 32, 33, 33, 32, 32, 28, 28, 23, 33, 33, 33, 32, 32, 29, 29, 24, 33, 32,
+ 32, 32, 32, 29, 29, 24, 33, 32, 32, 32, 32, 29, 29, 24, 33, 32, 32, 32,
+ 32, 29, 29, 24, 33, 32, 32, 32, 32, 29, 29, 25, 33, 32, 32, 31, 31, 30,
+ 30, 25, 33, 32, 32, 31, 31, 30, 30, 25, 33, 32, 32, 31, 31, 30, 30, 25,
+ 33, 32, 32, 31, 31, 29, 29, 25, 32, 32, 32, 30, 30, 28, 28, 24, 32, 32,
+ 32, 30, 30, 28, 28, 24, 32, 32, 32, 30, 30, 28, 28, 24, 32, 32, 32, 30,
+ 30, 28, 28, 24, 32, 31, 31, 29, 29, 27, 27, 24, 32, 31, 31, 29, 29, 27,
+ 27, 24, 32, 31, 31, 29, 29, 27, 27, 24, 31, 31, 31, 28, 28, 26, 26, 23,
+ 30, 30, 30, 28, 28, 24, 24, 21, 30, 30, 30, 28, 28, 24, 24, 21, 30, 30,
+ 30, 28, 28, 24, 24, 21, 29, 30, 30, 28, 28, 23, 23, 20, 28, 30, 30, 27,
+ 27, 21, 21, 19, 28, 30, 30, 27, 27, 21, 21, 19, 28, 30, 30, 27, 27, 21,
+ 21, 19, 28, 28, 28, 26, 26, 21, 21, 18, 26, 28, 28, 26, 26, 20, 20, 18,
+ 26, 28, 28, 26, 26, 20, 20, 18, 26, 28, 28, 26, 26, 20, 20, 18, 25, 26,
+ 26, 24, 24, 20, 20, 17, 23, 25, 25, 24, 24, 19, 19, 16, 23, 25, 25, 24,
+ 24, 19, 19, 16 },
{ /* Chroma */
/* Size 4x4 */
33, 30, 24, 22, 30, 26, 23, 22, 24, 23, 21, 21, 22, 22, 21, 19,
@@ -11074,21 +11075,12 @@
18, 18, 21, 21, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23,
23, 22, 21, 21, 21, 20, 19, 19, 19, 19, 19, 19, 19, 18, 18, 18,
/* Size 4x8 */
- 33, 30, 24, 21, 33, 29, 24, 22, 31, 28, 23, 22, 28, 25, 22, 22, 26, 23,
- 21, 21, 23, 22, 21, 20, 22, 22, 20, 19, 22, 22, 21, 19,
- /* Size 8x4 */
33, 33, 31, 28, 26, 23, 22, 22, 30, 29, 28, 25, 23, 22, 22, 22, 24, 24,
23, 22, 21, 21, 20, 21, 21, 22, 22, 22, 21, 20, 19, 19,
+ /* Size 8x4 */
+ 33, 30, 24, 21, 33, 29, 24, 22, 31, 28, 23, 22, 28, 25, 22, 22, 26, 23,
+ 21, 21, 23, 22, 21, 20, 22, 22, 20, 19, 22, 22, 21, 19,
/* Size 8x16 */
- 32, 33, 33, 28, 28, 21, 21, 21, 33, 33, 33, 27, 27, 22, 22, 22, 33, 33,
- 33, 27, 27, 22, 22, 22, 34, 32, 32, 26, 26, 22, 22, 23, 34, 32, 32, 26,
- 26, 22, 22, 23, 31, 28, 28, 24, 24, 22, 22, 22, 31, 28, 28, 24, 24, 22,
- 22, 22, 28, 26, 26, 22, 22, 22, 22, 23, 28, 26, 26, 22, 22, 22, 22, 23,
- 24, 24, 24, 22, 22, 20, 20, 21, 24, 24, 24, 22, 22, 20, 20, 21, 21, 22,
- 22, 21, 21, 19, 19, 19, 21, 22, 22, 21, 21, 19, 19, 19, 21, 22, 22, 22,
- 22, 19, 19, 18, 21, 22, 22, 22, 22, 19, 19, 18, 21, 23, 23, 22, 22, 19,
- 19, 18,
- /* Size 16x8 */
32, 33, 33, 34, 34, 31, 31, 28, 28, 24, 24, 21, 21, 21, 21, 21, 33, 33,
33, 32, 32, 28, 28, 26, 26, 24, 24, 22, 22, 22, 22, 23, 33, 33, 33, 32,
32, 28, 28, 26, 26, 24, 24, 22, 22, 22, 22, 23, 28, 27, 27, 26, 26, 24,
@@ -11097,37 +11089,16 @@
20, 19, 19, 19, 19, 19, 21, 22, 22, 22, 22, 22, 22, 22, 22, 20, 20, 19,
19, 19, 19, 19, 21, 22, 22, 23, 23, 22, 22, 23, 23, 21, 21, 19, 19, 18,
18, 18,
+ /* Size 16x8 */
+ 32, 33, 33, 28, 28, 21, 21, 21, 33, 33, 33, 27, 27, 22, 22, 22, 33, 33,
+ 33, 27, 27, 22, 22, 22, 34, 32, 32, 26, 26, 22, 22, 23, 34, 32, 32, 26,
+ 26, 22, 22, 23, 31, 28, 28, 24, 24, 22, 22, 22, 31, 28, 28, 24, 24, 22,
+ 22, 22, 28, 26, 26, 22, 22, 22, 22, 23, 28, 26, 26, 22, 22, 22, 22, 23,
+ 24, 24, 24, 22, 22, 20, 20, 21, 24, 24, 24, 22, 22, 20, 20, 21, 21, 22,
+ 22, 21, 21, 19, 19, 19, 21, 22, 22, 21, 21, 19, 19, 19, 21, 22, 22, 22,
+ 22, 19, 19, 18, 21, 22, 22, 22, 22, 19, 19, 18, 21, 23, 23, 22, 22, 19,
+ 19, 18,
/* Size 16x32 */
- 32, 33, 33, 33, 33, 31, 28, 28, 28, 24, 21, 21, 21, 21, 21, 21, 33, 33,
- 33, 33, 33, 30, 28, 28, 28, 24, 22, 22, 22, 21, 21, 21, 33, 33, 33, 33,
- 33, 30, 27, 27, 27, 24, 22, 22, 22, 22, 22, 22, 33, 33, 33, 33, 33, 30,
- 27, 27, 27, 24, 22, 22, 22, 22, 22, 22, 33, 33, 33, 33, 33, 30, 27, 27,
- 27, 24, 22, 22, 22, 22, 22, 22, 33, 33, 32, 32, 32, 29, 26, 26, 26, 24,
- 22, 22, 22, 22, 22, 22, 34, 33, 32, 32, 32, 29, 26, 26, 26, 24, 22, 22,
- 22, 23, 23, 23, 34, 33, 32, 32, 32, 29, 26, 26, 26, 24, 22, 22, 22, 23,
- 23, 23, 34, 33, 32, 32, 32, 29, 26, 26, 26, 24, 22, 22, 22, 23, 23, 23,
- 32, 31, 30, 30, 30, 28, 25, 25, 25, 23, 22, 22, 22, 22, 23, 23, 31, 30,
- 28, 28, 28, 26, 24, 24, 24, 23, 22, 22, 22, 22, 22, 22, 31, 30, 28, 28,
- 28, 26, 24, 24, 24, 23, 22, 22, 22, 22, 22, 22, 31, 30, 28, 28, 28, 26,
- 24, 24, 24, 23, 22, 22, 22, 22, 22, 22, 29, 28, 27, 27, 27, 25, 23, 23,
- 23, 22, 22, 22, 22, 22, 23, 23, 28, 27, 26, 26, 26, 24, 22, 22, 22, 22,
- 22, 22, 22, 22, 23, 23, 28, 27, 26, 26, 26, 24, 22, 22, 22, 22, 22, 22,
- 22, 22, 23, 23, 28, 27, 26, 26, 26, 24, 22, 22, 22, 22, 22, 22, 22, 22,
- 23, 23, 26, 26, 25, 25, 25, 23, 22, 22, 22, 21, 21, 21, 21, 21, 22, 22,
- 24, 24, 24, 24, 24, 23, 22, 22, 22, 21, 20, 20, 20, 20, 21, 21, 24, 24,
- 24, 24, 24, 23, 22, 22, 22, 21, 20, 20, 20, 20, 21, 21, 24, 24, 24, 24,
- 24, 23, 22, 22, 22, 21, 20, 20, 20, 20, 21, 21, 23, 23, 23, 23, 23, 22,
- 22, 22, 22, 21, 20, 20, 20, 20, 20, 20, 21, 21, 22, 22, 22, 22, 21, 21,
- 21, 20, 19, 19, 19, 19, 19, 19, 21, 21, 22, 22, 22, 22, 21, 21, 21, 20,
- 19, 19, 19, 19, 19, 19, 21, 21, 22, 22, 22, 22, 21, 21, 21, 20, 19, 19,
- 19, 19, 19, 19, 21, 22, 22, 22, 22, 22, 22, 22, 22, 20, 19, 19, 19, 19,
- 19, 19, 21, 22, 22, 22, 22, 22, 22, 22, 22, 20, 19, 19, 19, 19, 18, 18,
- 21, 22, 22, 22, 22, 22, 22, 22, 22, 20, 19, 19, 19, 19, 18, 18, 21, 22,
- 22, 22, 22, 22, 22, 22, 22, 20, 19, 19, 19, 19, 18, 18, 21, 22, 23, 23,
- 23, 22, 22, 22, 22, 21, 19, 19, 19, 19, 18, 18, 21, 22, 23, 23, 23, 23,
- 22, 22, 22, 21, 19, 19, 19, 18, 18, 18, 21, 22, 23, 23, 23, 23, 22, 22,
- 22, 21, 19, 19, 19, 18, 18, 18,
- /* Size 32x16 */
32, 33, 33, 33, 33, 33, 34, 34, 34, 32, 31, 31, 31, 29, 28, 28, 28, 26,
24, 24, 24, 23, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 33, 33, 33, 33,
33, 33, 33, 33, 33, 31, 30, 30, 30, 28, 27, 27, 27, 26, 24, 24, 24, 23,
@@ -11157,33 +11128,47 @@
21, 20, 19, 19, 19, 19, 18, 18, 18, 18, 18, 18, 21, 21, 22, 22, 22, 22,
23, 23, 23, 23, 22, 22, 22, 23, 23, 23, 23, 22, 21, 21, 21, 20, 19, 19,
19, 19, 18, 18, 18, 18, 18, 18,
+ /* Size 32x16 */
+ 32, 33, 33, 33, 33, 31, 28, 28, 28, 24, 21, 21, 21, 21, 21, 21, 33, 33,
+ 33, 33, 33, 30, 28, 28, 28, 24, 22, 22, 22, 21, 21, 21, 33, 33, 33, 33,
+ 33, 30, 27, 27, 27, 24, 22, 22, 22, 22, 22, 22, 33, 33, 33, 33, 33, 30,
+ 27, 27, 27, 24, 22, 22, 22, 22, 22, 22, 33, 33, 33, 33, 33, 30, 27, 27,
+ 27, 24, 22, 22, 22, 22, 22, 22, 33, 33, 32, 32, 32, 29, 26, 26, 26, 24,
+ 22, 22, 22, 22, 22, 22, 34, 33, 32, 32, 32, 29, 26, 26, 26, 24, 22, 22,
+ 22, 23, 23, 23, 34, 33, 32, 32, 32, 29, 26, 26, 26, 24, 22, 22, 22, 23,
+ 23, 23, 34, 33, 32, 32, 32, 29, 26, 26, 26, 24, 22, 22, 22, 23, 23, 23,
+ 32, 31, 30, 30, 30, 28, 25, 25, 25, 23, 22, 22, 22, 22, 23, 23, 31, 30,
+ 28, 28, 28, 26, 24, 24, 24, 23, 22, 22, 22, 22, 22, 22, 31, 30, 28, 28,
+ 28, 26, 24, 24, 24, 23, 22, 22, 22, 22, 22, 22, 31, 30, 28, 28, 28, 26,
+ 24, 24, 24, 23, 22, 22, 22, 22, 22, 22, 29, 28, 27, 27, 27, 25, 23, 23,
+ 23, 22, 22, 22, 22, 22, 23, 23, 28, 27, 26, 26, 26, 24, 22, 22, 22, 22,
+ 22, 22, 22, 22, 23, 23, 28, 27, 26, 26, 26, 24, 22, 22, 22, 22, 22, 22,
+ 22, 22, 23, 23, 28, 27, 26, 26, 26, 24, 22, 22, 22, 22, 22, 22, 22, 22,
+ 23, 23, 26, 26, 25, 25, 25, 23, 22, 22, 22, 21, 21, 21, 21, 21, 22, 22,
+ 24, 24, 24, 24, 24, 23, 22, 22, 22, 21, 20, 20, 20, 20, 21, 21, 24, 24,
+ 24, 24, 24, 23, 22, 22, 22, 21, 20, 20, 20, 20, 21, 21, 24, 24, 24, 24,
+ 24, 23, 22, 22, 22, 21, 20, 20, 20, 20, 21, 21, 23, 23, 23, 23, 23, 22,
+ 22, 22, 22, 21, 20, 20, 20, 20, 20, 20, 21, 21, 22, 22, 22, 22, 21, 21,
+ 21, 20, 19, 19, 19, 19, 19, 19, 21, 21, 22, 22, 22, 22, 21, 21, 21, 20,
+ 19, 19, 19, 19, 19, 19, 21, 21, 22, 22, 22, 22, 21, 21, 21, 20, 19, 19,
+ 19, 19, 19, 19, 21, 22, 22, 22, 22, 22, 22, 22, 22, 20, 19, 19, 19, 19,
+ 19, 19, 21, 22, 22, 22, 22, 22, 22, 22, 22, 20, 19, 19, 19, 19, 18, 18,
+ 21, 22, 22, 22, 22, 22, 22, 22, 22, 20, 19, 19, 19, 19, 18, 18, 21, 22,
+ 22, 22, 22, 22, 22, 22, 22, 20, 19, 19, 19, 19, 18, 18, 21, 22, 23, 23,
+ 23, 22, 22, 22, 22, 21, 19, 19, 19, 19, 18, 18, 21, 22, 23, 23, 23, 23,
+ 22, 22, 22, 21, 19, 19, 19, 18, 18, 18, 21, 22, 23, 23, 23, 23, 22, 22,
+ 22, 21, 19, 19, 19, 18, 18, 18,
/* Size 4x16 */
- 33, 31, 24, 21, 33, 30, 24, 22, 33, 30, 24, 22, 33, 29, 24, 23, 33, 29,
- 24, 23, 30, 26, 23, 22, 30, 26, 23, 22, 27, 24, 22, 22, 27, 24, 22, 22,
- 24, 23, 21, 20, 24, 23, 21, 20, 21, 22, 20, 19, 21, 22, 20, 19, 22, 22,
- 20, 19, 22, 22, 20, 19, 22, 23, 21, 18,
- /* Size 16x4 */
33, 33, 33, 33, 33, 30, 30, 27, 27, 24, 24, 21, 21, 22, 22, 22, 31, 30,
30, 29, 29, 26, 26, 24, 24, 23, 23, 22, 22, 22, 22, 23, 24, 24, 24, 24,
24, 23, 23, 22, 22, 21, 21, 20, 20, 20, 20, 21, 21, 22, 22, 23, 23, 22,
22, 22, 22, 20, 20, 19, 19, 19, 19, 18,
+ /* Size 16x4 */
+ 33, 31, 24, 21, 33, 30, 24, 22, 33, 30, 24, 22, 33, 29, 24, 23, 33, 29,
+ 24, 23, 30, 26, 23, 22, 30, 26, 23, 22, 27, 24, 22, 22, 27, 24, 22, 22,
+ 24, 23, 21, 20, 24, 23, 21, 20, 21, 22, 20, 19, 21, 22, 20, 19, 22, 22,
+ 20, 19, 22, 22, 20, 19, 22, 23, 21, 18,
/* Size 8x32 */
- 32, 33, 33, 28, 28, 21, 21, 21, 33, 33, 33, 28, 28, 22, 22, 21, 33, 33,
- 33, 27, 27, 22, 22, 22, 33, 33, 33, 27, 27, 22, 22, 22, 33, 33, 33, 27,
- 27, 22, 22, 22, 33, 32, 32, 26, 26, 22, 22, 22, 34, 32, 32, 26, 26, 22,
- 22, 23, 34, 32, 32, 26, 26, 22, 22, 23, 34, 32, 32, 26, 26, 22, 22, 23,
- 32, 30, 30, 25, 25, 22, 22, 23, 31, 28, 28, 24, 24, 22, 22, 22, 31, 28,
- 28, 24, 24, 22, 22, 22, 31, 28, 28, 24, 24, 22, 22, 22, 29, 27, 27, 23,
- 23, 22, 22, 23, 28, 26, 26, 22, 22, 22, 22, 23, 28, 26, 26, 22, 22, 22,
- 22, 23, 28, 26, 26, 22, 22, 22, 22, 23, 26, 25, 25, 22, 22, 21, 21, 22,
- 24, 24, 24, 22, 22, 20, 20, 21, 24, 24, 24, 22, 22, 20, 20, 21, 24, 24,
- 24, 22, 22, 20, 20, 21, 23, 23, 23, 22, 22, 20, 20, 20, 21, 22, 22, 21,
- 21, 19, 19, 19, 21, 22, 22, 21, 21, 19, 19, 19, 21, 22, 22, 21, 21, 19,
- 19, 19, 21, 22, 22, 22, 22, 19, 19, 19, 21, 22, 22, 22, 22, 19, 19, 18,
- 21, 22, 22, 22, 22, 19, 19, 18, 21, 22, 22, 22, 22, 19, 19, 18, 21, 23,
- 23, 22, 22, 19, 19, 18, 21, 23, 23, 22, 22, 19, 19, 18, 21, 23, 23, 22,
- 22, 19, 19, 18,
- /* Size 32x8 */
32, 33, 33, 33, 33, 33, 34, 34, 34, 32, 31, 31, 31, 29, 28, 28, 28, 26,
24, 24, 24, 23, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 33, 33, 33, 33,
33, 32, 32, 32, 32, 30, 28, 28, 28, 27, 26, 26, 26, 25, 24, 24, 24, 23,
@@ -11198,7 +11183,23 @@
22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 21, 20, 20, 20, 20, 19, 19,
19, 19, 19, 19, 19, 19, 19, 19, 21, 21, 22, 22, 22, 22, 23, 23, 23, 23,
22, 22, 22, 23, 23, 23, 23, 22, 21, 21, 21, 20, 19, 19, 19, 19, 18, 18,
- 18, 18, 18, 18 },
+ 18, 18, 18, 18,
+ /* Size 32x8 */
+ 32, 33, 33, 28, 28, 21, 21, 21, 33, 33, 33, 28, 28, 22, 22, 21, 33, 33,
+ 33, 27, 27, 22, 22, 22, 33, 33, 33, 27, 27, 22, 22, 22, 33, 33, 33, 27,
+ 27, 22, 22, 22, 33, 32, 32, 26, 26, 22, 22, 22, 34, 32, 32, 26, 26, 22,
+ 22, 23, 34, 32, 32, 26, 26, 22, 22, 23, 34, 32, 32, 26, 26, 22, 22, 23,
+ 32, 30, 30, 25, 25, 22, 22, 23, 31, 28, 28, 24, 24, 22, 22, 22, 31, 28,
+ 28, 24, 24, 22, 22, 22, 31, 28, 28, 24, 24, 22, 22, 22, 29, 27, 27, 23,
+ 23, 22, 22, 23, 28, 26, 26, 22, 22, 22, 22, 23, 28, 26, 26, 22, 22, 22,
+ 22, 23, 28, 26, 26, 22, 22, 22, 22, 23, 26, 25, 25, 22, 22, 21, 21, 22,
+ 24, 24, 24, 22, 22, 20, 20, 21, 24, 24, 24, 22, 22, 20, 20, 21, 24, 24,
+ 24, 22, 22, 20, 20, 21, 23, 23, 23, 22, 22, 20, 20, 20, 21, 22, 22, 21,
+ 21, 19, 19, 19, 21, 22, 22, 21, 21, 19, 19, 19, 21, 22, 22, 21, 21, 19,
+ 19, 19, 21, 22, 22, 22, 22, 19, 19, 19, 21, 22, 22, 22, 22, 19, 19, 18,
+ 21, 22, 22, 22, 22, 19, 19, 18, 21, 22, 22, 22, 22, 19, 19, 18, 21, 23,
+ 23, 22, 22, 19, 19, 18, 21, 23, 23, 22, 22, 19, 19, 18, 21, 23, 23, 22,
+ 22, 19, 19, 18 },
},
{
{ /* Luma */
@@ -11284,21 +11285,12 @@
21, 21, 28, 28, 28, 28, 28, 28, 28, 29, 29, 29, 29, 28, 28, 28, 28, 28,
27, 26, 26, 26, 26, 25, 24, 24, 24, 24, 23, 21, 21, 21, 21, 20,
/* Size 4x8 */
- 33, 33, 32, 29, 32, 32, 32, 29, 32, 32, 31, 30, 32, 32, 30, 28, 32, 31,
- 29, 27, 31, 31, 28, 26, 30, 30, 28, 24, 29, 30, 27, 21,
- /* Size 8x4 */
33, 32, 32, 32, 32, 31, 30, 29, 33, 32, 32, 32, 31, 31, 30, 30, 32, 32,
31, 30, 29, 28, 28, 27, 29, 29, 30, 28, 27, 26, 24, 21,
+ /* Size 8x4 */
+ 33, 33, 32, 29, 32, 32, 32, 29, 32, 32, 31, 30, 32, 32, 30, 28, 32, 31,
+ 29, 27, 31, 31, 28, 26, 30, 30, 28, 24, 29, 30, 27, 21,
/* Size 8x16 */
- 32, 33, 33, 33, 32, 32, 29, 28, 33, 32, 32, 32, 32, 32, 29, 29, 33, 32,
- 32, 32, 32, 32, 29, 29, 33, 32, 32, 32, 32, 32, 30, 29, 33, 32, 32, 32,
- 31, 31, 30, 30, 33, 32, 32, 32, 31, 31, 30, 30, 33, 32, 32, 31, 30, 30,
- 29, 28, 32, 32, 32, 31, 30, 30, 28, 28, 32, 32, 32, 31, 30, 30, 28, 28,
- 32, 32, 31, 30, 29, 29, 28, 27, 32, 32, 31, 30, 29, 29, 28, 27, 31, 31,
- 31, 29, 28, 28, 26, 25, 30, 30, 30, 29, 28, 28, 25, 24, 30, 30, 30, 29,
- 28, 28, 24, 23, 28, 29, 30, 28, 27, 27, 22, 21, 28, 29, 30, 28, 27, 27,
- 22, 21,
- /* Size 16x8 */
32, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 31, 30, 30, 28, 28, 33, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 30, 30, 29, 29, 33, 32, 32, 32,
32, 32, 32, 32, 32, 31, 31, 31, 30, 30, 30, 30, 33, 32, 32, 32, 32, 32,
@@ -11307,37 +11299,16 @@
29, 28, 28, 28, 27, 27, 29, 29, 29, 30, 30, 30, 29, 28, 28, 28, 28, 26,
25, 24, 22, 22, 28, 29, 29, 29, 30, 30, 28, 28, 28, 27, 27, 25, 24, 23,
21, 21,
+ /* Size 16x8 */
+ 32, 33, 33, 33, 32, 32, 29, 28, 33, 32, 32, 32, 32, 32, 29, 29, 33, 32,
+ 32, 32, 32, 32, 29, 29, 33, 32, 32, 32, 32, 32, 30, 29, 33, 32, 32, 32,
+ 31, 31, 30, 30, 33, 32, 32, 32, 31, 31, 30, 30, 33, 32, 32, 31, 30, 30,
+ 29, 28, 32, 32, 32, 31, 30, 30, 28, 28, 32, 32, 32, 31, 30, 30, 28, 28,
+ 32, 32, 31, 30, 29, 29, 28, 27, 32, 32, 31, 30, 29, 29, 28, 27, 31, 31,
+ 31, 29, 28, 28, 26, 25, 30, 30, 30, 29, 28, 28, 25, 24, 30, 30, 30, 29,
+ 28, 28, 24, 23, 28, 29, 30, 28, 27, 27, 22, 21, 28, 29, 30, 28, 27, 27,
+ 22, 21,
/* Size 16x32 */
- 32, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 31, 29, 28, 28, 28, 33, 33,
- 33, 33, 33, 33, 32, 32, 32, 32, 32, 31, 29, 29, 29, 29, 33, 33, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 31, 29, 29, 29, 29, 33, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 31, 29, 29, 29, 29, 33, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 31, 29, 29, 29, 29, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 31, 29, 29, 29, 29, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31,
- 30, 29, 29, 29, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 30, 29,
- 29, 29, 33, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 30, 30, 30, 30,
- 33, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 30, 30, 30, 30, 33, 32,
- 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 30, 30, 30, 30, 33, 32, 32, 32,
- 32, 32, 31, 31, 31, 31, 31, 30, 29, 29, 29, 29, 33, 32, 32, 32, 32, 32,
- 31, 31, 30, 30, 30, 30, 29, 28, 28, 28, 32, 32, 32, 32, 32, 32, 31, 30,
- 30, 30, 30, 29, 28, 28, 28, 28, 32, 32, 32, 32, 32, 32, 31, 30, 30, 30,
- 30, 29, 28, 28, 28, 28, 32, 32, 32, 32, 32, 32, 31, 30, 30, 30, 30, 29,
- 28, 28, 28, 28, 32, 32, 32, 32, 32, 32, 31, 30, 30, 30, 30, 29, 28, 28,
- 28, 28, 32, 32, 32, 31, 31, 31, 31, 30, 29, 29, 29, 28, 28, 27, 27, 27,
- 32, 32, 32, 31, 31, 31, 30, 29, 29, 29, 29, 28, 28, 27, 27, 27, 32, 32,
- 32, 31, 31, 31, 30, 29, 29, 29, 29, 28, 28, 27, 27, 27, 32, 32, 32, 31,
- 31, 31, 30, 29, 29, 29, 29, 28, 28, 27, 27, 27, 32, 31, 31, 31, 31, 31,
- 30, 29, 28, 28, 28, 28, 26, 26, 26, 26, 31, 31, 31, 31, 31, 31, 29, 28,
- 28, 28, 28, 27, 26, 25, 25, 25, 30, 30, 30, 30, 30, 30, 29, 28, 28, 28,
- 28, 26, 25, 24, 24, 24, 30, 30, 30, 30, 30, 30, 29, 28, 28, 28, 28, 26,
- 25, 24, 24, 24, 30, 30, 30, 30, 30, 30, 29, 28, 28, 28, 28, 26, 25, 24,
- 24, 24, 30, 30, 30, 30, 30, 30, 29, 28, 28, 28, 28, 26, 24, 23, 23, 23,
- 29, 29, 30, 30, 30, 30, 28, 28, 27, 27, 27, 25, 23, 22, 22, 22, 28, 29,
- 29, 30, 30, 30, 28, 28, 27, 27, 27, 24, 22, 21, 21, 21, 28, 29, 29, 30,
- 30, 30, 28, 28, 27, 27, 27, 24, 22, 21, 21, 21, 28, 29, 29, 30, 30, 30,
- 28, 28, 27, 27, 27, 24, 22, 21, 21, 21, 28, 28, 28, 28, 28, 28, 28, 27,
- 26, 26, 26, 24, 22, 21, 21, 21,
- /* Size 32x16 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32,
32, 32, 32, 32, 31, 30, 30, 30, 30, 29, 28, 28, 28, 28, 33, 33, 33, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31,
@@ -11367,33 +11338,47 @@
27, 26, 25, 24, 24, 24, 23, 22, 21, 21, 21, 21, 28, 29, 29, 29, 29, 29,
29, 29, 30, 30, 30, 29, 28, 28, 28, 28, 28, 27, 27, 27, 27, 26, 25, 24,
24, 24, 23, 22, 21, 21, 21, 21,
+ /* Size 32x16 */
+ 32, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 31, 29, 28, 28, 28, 33, 33,
+ 33, 33, 33, 33, 32, 32, 32, 32, 32, 31, 29, 29, 29, 29, 33, 33, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 31, 29, 29, 29, 29, 33, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 31, 29, 29, 29, 29, 33, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 31, 29, 29, 29, 29, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 31, 29, 29, 29, 29, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31,
+ 30, 29, 29, 29, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 30, 29,
+ 29, 29, 33, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 30, 30, 30, 30,
+ 33, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 30, 30, 30, 30, 33, 32,
+ 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 30, 30, 30, 30, 33, 32, 32, 32,
+ 32, 32, 31, 31, 31, 31, 31, 30, 29, 29, 29, 29, 33, 32, 32, 32, 32, 32,
+ 31, 31, 30, 30, 30, 30, 29, 28, 28, 28, 32, 32, 32, 32, 32, 32, 31, 30,
+ 30, 30, 30, 29, 28, 28, 28, 28, 32, 32, 32, 32, 32, 32, 31, 30, 30, 30,
+ 30, 29, 28, 28, 28, 28, 32, 32, 32, 32, 32, 32, 31, 30, 30, 30, 30, 29,
+ 28, 28, 28, 28, 32, 32, 32, 32, 32, 32, 31, 30, 30, 30, 30, 29, 28, 28,
+ 28, 28, 32, 32, 32, 31, 31, 31, 31, 30, 29, 29, 29, 28, 28, 27, 27, 27,
+ 32, 32, 32, 31, 31, 31, 30, 29, 29, 29, 29, 28, 28, 27, 27, 27, 32, 32,
+ 32, 31, 31, 31, 30, 29, 29, 29, 29, 28, 28, 27, 27, 27, 32, 32, 32, 31,
+ 31, 31, 30, 29, 29, 29, 29, 28, 28, 27, 27, 27, 32, 31, 31, 31, 31, 31,
+ 30, 29, 28, 28, 28, 28, 26, 26, 26, 26, 31, 31, 31, 31, 31, 31, 29, 28,
+ 28, 28, 28, 27, 26, 25, 25, 25, 30, 30, 30, 30, 30, 30, 29, 28, 28, 28,
+ 28, 26, 25, 24, 24, 24, 30, 30, 30, 30, 30, 30, 29, 28, 28, 28, 28, 26,
+ 25, 24, 24, 24, 30, 30, 30, 30, 30, 30, 29, 28, 28, 28, 28, 26, 25, 24,
+ 24, 24, 30, 30, 30, 30, 30, 30, 29, 28, 28, 28, 28, 26, 24, 23, 23, 23,
+ 29, 29, 30, 30, 30, 30, 28, 28, 27, 27, 27, 25, 23, 22, 22, 22, 28, 29,
+ 29, 30, 30, 30, 28, 28, 27, 27, 27, 24, 22, 21, 21, 21, 28, 29, 29, 30,
+ 30, 30, 28, 28, 27, 27, 27, 24, 22, 21, 21, 21, 28, 29, 29, 30, 30, 30,
+ 28, 28, 27, 27, 27, 24, 22, 21, 21, 21, 28, 28, 28, 28, 28, 28, 28, 27,
+ 26, 26, 26, 24, 22, 21, 21, 21,
/* Size 4x16 */
- 33, 33, 32, 28, 33, 32, 32, 29, 32, 32, 32, 29, 32, 32, 32, 29, 32, 32,
- 31, 30, 32, 32, 31, 30, 32, 32, 30, 28, 32, 32, 30, 28, 32, 32, 30, 28,
- 32, 31, 29, 27, 32, 31, 29, 27, 31, 31, 28, 25, 30, 30, 28, 24, 30, 30,
- 28, 23, 29, 30, 27, 21, 29, 30, 27, 21,
- /* Size 16x4 */
33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 30, 30, 29, 29, 33, 32,
32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 30, 30, 30, 30, 32, 32, 32, 32,
31, 31, 30, 30, 30, 29, 29, 28, 28, 28, 27, 27, 28, 29, 29, 29, 30, 30,
28, 28, 28, 27, 27, 25, 24, 23, 21, 21,
+ /* Size 16x4 */
+ 33, 33, 32, 28, 33, 32, 32, 29, 32, 32, 32, 29, 32, 32, 32, 29, 32, 32,
+ 31, 30, 32, 32, 31, 30, 32, 32, 30, 28, 32, 32, 30, 28, 32, 32, 30, 28,
+ 32, 31, 29, 27, 32, 31, 29, 27, 31, 31, 28, 25, 30, 30, 28, 24, 30, 30,
+ 28, 23, 29, 30, 27, 21, 29, 30, 27, 21,
/* Size 8x32 */
- 32, 33, 33, 33, 32, 32, 29, 28, 33, 33, 33, 32, 32, 32, 29, 29, 33, 32,
- 32, 32, 32, 32, 29, 29, 33, 32, 32, 32, 32, 32, 29, 29, 33, 32, 32, 32,
- 32, 32, 29, 29, 33, 32, 32, 32, 32, 32, 29, 29, 33, 32, 32, 32, 32, 32,
- 30, 29, 33, 32, 32, 32, 32, 32, 30, 29, 33, 32, 32, 32, 31, 31, 30, 30,
- 33, 32, 32, 32, 31, 31, 30, 30, 33, 32, 32, 32, 31, 31, 30, 30, 33, 32,
- 32, 31, 31, 31, 29, 29, 33, 32, 32, 31, 30, 30, 29, 28, 32, 32, 32, 31,
- 30, 30, 28, 28, 32, 32, 32, 31, 30, 30, 28, 28, 32, 32, 32, 31, 30, 30,
- 28, 28, 32, 32, 32, 31, 30, 30, 28, 28, 32, 32, 31, 31, 29, 29, 28, 27,
- 32, 32, 31, 30, 29, 29, 28, 27, 32, 32, 31, 30, 29, 29, 28, 27, 32, 32,
- 31, 30, 29, 29, 28, 27, 32, 31, 31, 30, 28, 28, 26, 26, 31, 31, 31, 29,
- 28, 28, 26, 25, 30, 30, 30, 29, 28, 28, 25, 24, 30, 30, 30, 29, 28, 28,
- 25, 24, 30, 30, 30, 29, 28, 28, 25, 24, 30, 30, 30, 29, 28, 28, 24, 23,
- 29, 30, 30, 28, 27, 27, 23, 22, 28, 29, 30, 28, 27, 27, 22, 21, 28, 29,
- 30, 28, 27, 27, 22, 21, 28, 29, 30, 28, 27, 27, 22, 21, 28, 28, 28, 28,
- 26, 26, 22, 21,
- /* Size 32x8 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32,
32, 32, 32, 32, 31, 30, 30, 30, 30, 29, 28, 28, 28, 28, 33, 33, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31,
@@ -11408,7 +11393,23 @@
30, 30, 30, 30, 30, 29, 29, 28, 28, 28, 28, 28, 28, 28, 28, 26, 26, 25,
25, 25, 24, 23, 22, 22, 22, 22, 28, 29, 29, 29, 29, 29, 29, 29, 30, 30,
30, 29, 28, 28, 28, 28, 28, 27, 27, 27, 27, 26, 25, 24, 24, 24, 23, 22,
- 21, 21, 21, 21 },
+ 21, 21, 21, 21,
+ /* Size 32x8 */
+ 32, 33, 33, 33, 32, 32, 29, 28, 33, 33, 33, 32, 32, 32, 29, 29, 33, 32,
+ 32, 32, 32, 32, 29, 29, 33, 32, 32, 32, 32, 32, 29, 29, 33, 32, 32, 32,
+ 32, 32, 29, 29, 33, 32, 32, 32, 32, 32, 29, 29, 33, 32, 32, 32, 32, 32,
+ 30, 29, 33, 32, 32, 32, 32, 32, 30, 29, 33, 32, 32, 32, 31, 31, 30, 30,
+ 33, 32, 32, 32, 31, 31, 30, 30, 33, 32, 32, 32, 31, 31, 30, 30, 33, 32,
+ 32, 31, 31, 31, 29, 29, 33, 32, 32, 31, 30, 30, 29, 28, 32, 32, 32, 31,
+ 30, 30, 28, 28, 32, 32, 32, 31, 30, 30, 28, 28, 32, 32, 32, 31, 30, 30,
+ 28, 28, 32, 32, 32, 31, 30, 30, 28, 28, 32, 32, 31, 31, 29, 29, 28, 27,
+ 32, 32, 31, 30, 29, 29, 28, 27, 32, 32, 31, 30, 29, 29, 28, 27, 32, 32,
+ 31, 30, 29, 29, 28, 27, 32, 31, 31, 30, 28, 28, 26, 26, 31, 31, 31, 29,
+ 28, 28, 26, 25, 30, 30, 30, 29, 28, 28, 25, 24, 30, 30, 30, 29, 28, 28,
+ 25, 24, 30, 30, 30, 29, 28, 28, 25, 24, 30, 30, 30, 29, 28, 28, 24, 23,
+ 29, 30, 30, 28, 27, 27, 23, 22, 28, 29, 30, 28, 27, 27, 22, 21, 28, 29,
+ 30, 28, 27, 27, 22, 21, 28, 29, 30, 28, 27, 27, 22, 21, 28, 28, 28, 28,
+ 26, 26, 22, 21 },
{ /* Chroma */
/* Size 4x4 */
33, 32, 27, 22, 32, 30, 25, 22, 27, 25, 22, 22, 22, 22, 22, 20,
@@ -11492,21 +11493,12 @@
19, 19, 21, 21, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22,
22, 22, 22, 22, 22, 21, 21, 20, 20, 20, 20, 20, 19, 19, 19, 19,
/* Size 4x8 */
- 33, 33, 28, 21, 33, 33, 27, 22, 33, 32, 26, 22, 30, 28, 24, 22, 28, 26,
- 22, 22, 26, 25, 22, 21, 24, 24, 22, 20, 21, 22, 21, 19,
- /* Size 8x4 */
33, 33, 33, 30, 28, 26, 24, 21, 33, 33, 32, 28, 26, 25, 24, 22, 28, 27,
26, 24, 22, 22, 22, 21, 21, 22, 22, 22, 22, 21, 20, 19,
+ /* Size 8x4 */
+ 33, 33, 28, 21, 33, 33, 27, 22, 33, 32, 26, 22, 30, 28, 24, 22, 28, 26,
+ 22, 22, 26, 25, 22, 21, 24, 24, 22, 20, 21, 22, 21, 19,
/* Size 8x16 */
- 32, 33, 33, 31, 28, 28, 23, 21, 33, 33, 33, 30, 27, 27, 23, 22, 33, 33,
- 33, 30, 27, 27, 23, 22, 33, 33, 32, 30, 26, 26, 23, 22, 34, 32, 32, 29,
- 26, 26, 23, 22, 34, 32, 32, 29, 26, 26, 23, 22, 31, 30, 29, 28, 24, 24,
- 22, 22, 31, 29, 28, 27, 24, 24, 22, 22, 29, 28, 28, 26, 23, 23, 22, 22,
- 28, 26, 26, 24, 22, 22, 22, 22, 28, 26, 26, 24, 22, 22, 22, 22, 25, 24,
- 24, 23, 22, 22, 21, 21, 24, 24, 24, 23, 22, 22, 21, 20, 23, 23, 23, 23,
- 22, 22, 20, 20, 21, 22, 22, 22, 21, 21, 20, 19, 21, 22, 22, 22, 21, 21,
- 20, 19,
- /* Size 16x8 */
32, 33, 33, 33, 34, 34, 31, 31, 29, 28, 28, 25, 24, 23, 21, 21, 33, 33,
33, 33, 32, 32, 30, 29, 28, 26, 26, 24, 24, 23, 22, 22, 33, 33, 33, 32,
32, 32, 29, 28, 28, 26, 26, 24, 24, 23, 22, 22, 31, 30, 30, 30, 29, 29,
@@ -11515,37 +11507,16 @@
22, 22, 22, 22, 21, 21, 23, 23, 23, 23, 23, 23, 22, 22, 22, 22, 22, 21,
21, 20, 20, 20, 21, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 21, 20, 20,
19, 19,
+ /* Size 16x8 */
+ 32, 33, 33, 31, 28, 28, 23, 21, 33, 33, 33, 30, 27, 27, 23, 22, 33, 33,
+ 33, 30, 27, 27, 23, 22, 33, 33, 32, 30, 26, 26, 23, 22, 34, 32, 32, 29,
+ 26, 26, 23, 22, 34, 32, 32, 29, 26, 26, 23, 22, 31, 30, 29, 28, 24, 24,
+ 22, 22, 31, 29, 28, 27, 24, 24, 22, 22, 29, 28, 28, 26, 23, 23, 22, 22,
+ 28, 26, 26, 24, 22, 22, 22, 22, 28, 26, 26, 24, 22, 22, 22, 22, 25, 24,
+ 24, 23, 22, 22, 21, 21, 24, 24, 24, 23, 22, 22, 21, 20, 23, 23, 23, 23,
+ 22, 22, 20, 20, 21, 22, 22, 22, 21, 21, 20, 19, 21, 22, 22, 22, 21, 21,
+ 20, 19,
/* Size 16x32 */
- 32, 33, 33, 33, 33, 33, 31, 29, 28, 28, 28, 26, 23, 21, 21, 21, 33, 33,
- 33, 33, 33, 33, 31, 28, 28, 28, 28, 25, 23, 21, 21, 21, 33, 33, 33, 33,
- 33, 33, 30, 28, 27, 27, 27, 25, 23, 22, 22, 22, 33, 33, 33, 33, 33, 33,
- 30, 28, 27, 27, 27, 25, 23, 22, 22, 22, 33, 33, 33, 33, 33, 33, 30, 28,
- 27, 27, 27, 25, 23, 22, 22, 22, 33, 33, 33, 33, 33, 33, 30, 28, 27, 27,
- 27, 25, 23, 22, 22, 22, 33, 33, 33, 32, 32, 32, 30, 28, 26, 26, 26, 25,
- 23, 22, 22, 22, 34, 33, 33, 32, 32, 32, 30, 27, 26, 26, 26, 24, 23, 22,
- 22, 22, 34, 33, 32, 32, 32, 32, 29, 27, 26, 26, 26, 24, 23, 22, 22, 22,
- 34, 33, 32, 32, 32, 32, 29, 27, 26, 26, 26, 24, 23, 22, 22, 22, 34, 33,
- 32, 32, 32, 32, 29, 27, 26, 26, 26, 24, 23, 22, 22, 22, 33, 32, 31, 31,
- 31, 31, 28, 26, 25, 25, 25, 24, 23, 22, 22, 22, 31, 30, 30, 29, 29, 29,
- 28, 26, 24, 24, 24, 23, 22, 22, 22, 22, 31, 30, 29, 28, 28, 28, 27, 25,
- 24, 24, 24, 23, 22, 22, 22, 22, 31, 30, 29, 28, 28, 28, 27, 25, 24, 24,
- 24, 23, 22, 22, 22, 22, 31, 30, 29, 28, 28, 28, 27, 25, 24, 24, 24, 23,
- 22, 22, 22, 22, 29, 28, 28, 28, 28, 28, 26, 24, 23, 23, 23, 23, 22, 22,
- 22, 22, 28, 28, 27, 26, 26, 26, 24, 23, 22, 22, 22, 22, 22, 22, 22, 22,
- 28, 27, 26, 26, 26, 26, 24, 23, 22, 22, 22, 22, 22, 22, 22, 22, 28, 27,
- 26, 26, 26, 26, 24, 23, 22, 22, 22, 22, 22, 22, 22, 22, 28, 27, 26, 26,
- 26, 26, 24, 23, 22, 22, 22, 22, 22, 22, 22, 22, 26, 26, 26, 25, 25, 25,
- 24, 22, 22, 22, 22, 21, 21, 21, 21, 21, 25, 25, 24, 24, 24, 24, 23, 22,
- 22, 22, 22, 21, 21, 21, 21, 21, 24, 24, 24, 24, 24, 24, 23, 22, 22, 22,
- 22, 21, 21, 20, 20, 20, 24, 24, 24, 24, 24, 24, 23, 22, 22, 22, 22, 21,
- 21, 20, 20, 20, 24, 24, 24, 24, 24, 24, 23, 22, 22, 22, 22, 21, 21, 20,
- 20, 20, 23, 23, 23, 23, 23, 23, 23, 22, 22, 22, 22, 21, 20, 20, 20, 20,
- 22, 22, 22, 22, 22, 22, 22, 22, 21, 21, 21, 21, 20, 20, 20, 20, 21, 21,
- 22, 22, 22, 22, 22, 21, 21, 21, 21, 20, 20, 19, 19, 19, 21, 21, 22, 22,
- 22, 22, 22, 21, 21, 21, 21, 20, 20, 19, 19, 19, 21, 21, 22, 22, 22, 22,
- 22, 21, 21, 21, 21, 20, 20, 19, 19, 19, 21, 21, 22, 22, 22, 22, 22, 22,
- 22, 22, 22, 21, 20, 19, 19, 19,
- /* Size 32x16 */
32, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34, 33, 31, 31, 31, 31, 29, 28,
28, 28, 28, 26, 25, 24, 24, 24, 23, 22, 21, 21, 21, 21, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 32, 30, 30, 30, 30, 28, 28, 27, 27, 27, 26,
@@ -11575,33 +11546,47 @@
22, 21, 21, 20, 20, 20, 20, 20, 19, 19, 19, 19, 21, 21, 22, 22, 22, 22,
22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 21, 21, 20,
20, 20, 20, 20, 19, 19, 19, 19,
+ /* Size 32x16 */
+ 32, 33, 33, 33, 33, 33, 31, 29, 28, 28, 28, 26, 23, 21, 21, 21, 33, 33,
+ 33, 33, 33, 33, 31, 28, 28, 28, 28, 25, 23, 21, 21, 21, 33, 33, 33, 33,
+ 33, 33, 30, 28, 27, 27, 27, 25, 23, 22, 22, 22, 33, 33, 33, 33, 33, 33,
+ 30, 28, 27, 27, 27, 25, 23, 22, 22, 22, 33, 33, 33, 33, 33, 33, 30, 28,
+ 27, 27, 27, 25, 23, 22, 22, 22, 33, 33, 33, 33, 33, 33, 30, 28, 27, 27,
+ 27, 25, 23, 22, 22, 22, 33, 33, 33, 32, 32, 32, 30, 28, 26, 26, 26, 25,
+ 23, 22, 22, 22, 34, 33, 33, 32, 32, 32, 30, 27, 26, 26, 26, 24, 23, 22,
+ 22, 22, 34, 33, 32, 32, 32, 32, 29, 27, 26, 26, 26, 24, 23, 22, 22, 22,
+ 34, 33, 32, 32, 32, 32, 29, 27, 26, 26, 26, 24, 23, 22, 22, 22, 34, 33,
+ 32, 32, 32, 32, 29, 27, 26, 26, 26, 24, 23, 22, 22, 22, 33, 32, 31, 31,
+ 31, 31, 28, 26, 25, 25, 25, 24, 23, 22, 22, 22, 31, 30, 30, 29, 29, 29,
+ 28, 26, 24, 24, 24, 23, 22, 22, 22, 22, 31, 30, 29, 28, 28, 28, 27, 25,
+ 24, 24, 24, 23, 22, 22, 22, 22, 31, 30, 29, 28, 28, 28, 27, 25, 24, 24,
+ 24, 23, 22, 22, 22, 22, 31, 30, 29, 28, 28, 28, 27, 25, 24, 24, 24, 23,
+ 22, 22, 22, 22, 29, 28, 28, 28, 28, 28, 26, 24, 23, 23, 23, 23, 22, 22,
+ 22, 22, 28, 28, 27, 26, 26, 26, 24, 23, 22, 22, 22, 22, 22, 22, 22, 22,
+ 28, 27, 26, 26, 26, 26, 24, 23, 22, 22, 22, 22, 22, 22, 22, 22, 28, 27,
+ 26, 26, 26, 26, 24, 23, 22, 22, 22, 22, 22, 22, 22, 22, 28, 27, 26, 26,
+ 26, 26, 24, 23, 22, 22, 22, 22, 22, 22, 22, 22, 26, 26, 26, 25, 25, 25,
+ 24, 22, 22, 22, 22, 21, 21, 21, 21, 21, 25, 25, 24, 24, 24, 24, 23, 22,
+ 22, 22, 22, 21, 21, 21, 21, 21, 24, 24, 24, 24, 24, 24, 23, 22, 22, 22,
+ 22, 21, 21, 20, 20, 20, 24, 24, 24, 24, 24, 24, 23, 22, 22, 22, 22, 21,
+ 21, 20, 20, 20, 24, 24, 24, 24, 24, 24, 23, 22, 22, 22, 22, 21, 21, 20,
+ 20, 20, 23, 23, 23, 23, 23, 23, 23, 22, 22, 22, 22, 21, 20, 20, 20, 20,
+ 22, 22, 22, 22, 22, 22, 22, 22, 21, 21, 21, 21, 20, 20, 20, 20, 21, 21,
+ 22, 22, 22, 22, 22, 21, 21, 21, 21, 20, 20, 19, 19, 19, 21, 21, 22, 22,
+ 22, 22, 22, 21, 21, 21, 21, 20, 20, 19, 19, 19, 21, 21, 22, 22, 22, 22,
+ 22, 21, 21, 21, 21, 20, 20, 19, 19, 19, 21, 21, 22, 22, 22, 22, 22, 22,
+ 22, 22, 22, 21, 20, 19, 19, 19,
/* Size 4x16 */
- 33, 33, 28, 21, 33, 33, 27, 22, 33, 33, 27, 22, 33, 32, 26, 22, 33, 32,
- 26, 22, 33, 32, 26, 22, 30, 29, 24, 22, 30, 28, 24, 22, 28, 28, 23, 22,
- 27, 26, 22, 22, 27, 26, 22, 22, 25, 24, 22, 21, 24, 24, 22, 20, 23, 23,
- 22, 20, 21, 22, 21, 19, 21, 22, 21, 19,
- /* Size 16x4 */
33, 33, 33, 33, 33, 33, 30, 30, 28, 27, 27, 25, 24, 23, 21, 21, 33, 33,
33, 32, 32, 32, 29, 28, 28, 26, 26, 24, 24, 23, 22, 22, 28, 27, 27, 26,
26, 26, 24, 24, 23, 22, 22, 22, 22, 22, 21, 21, 21, 22, 22, 22, 22, 22,
22, 22, 22, 22, 22, 21, 20, 20, 19, 19,
+ /* Size 16x4 */
+ 33, 33, 28, 21, 33, 33, 27, 22, 33, 33, 27, 22, 33, 32, 26, 22, 33, 32,
+ 26, 22, 33, 32, 26, 22, 30, 29, 24, 22, 30, 28, 24, 22, 28, 28, 23, 22,
+ 27, 26, 22, 22, 27, 26, 22, 22, 25, 24, 22, 21, 24, 24, 22, 20, 23, 23,
+ 22, 20, 21, 22, 21, 19, 21, 22, 21, 19,
/* Size 8x32 */
- 32, 33, 33, 31, 28, 28, 23, 21, 33, 33, 33, 31, 28, 28, 23, 21, 33, 33,
- 33, 30, 27, 27, 23, 22, 33, 33, 33, 30, 27, 27, 23, 22, 33, 33, 33, 30,
- 27, 27, 23, 22, 33, 33, 33, 30, 27, 27, 23, 22, 33, 33, 32, 30, 26, 26,
- 23, 22, 34, 33, 32, 30, 26, 26, 23, 22, 34, 32, 32, 29, 26, 26, 23, 22,
- 34, 32, 32, 29, 26, 26, 23, 22, 34, 32, 32, 29, 26, 26, 23, 22, 33, 31,
- 31, 28, 25, 25, 23, 22, 31, 30, 29, 28, 24, 24, 22, 22, 31, 29, 28, 27,
- 24, 24, 22, 22, 31, 29, 28, 27, 24, 24, 22, 22, 31, 29, 28, 27, 24, 24,
- 22, 22, 29, 28, 28, 26, 23, 23, 22, 22, 28, 27, 26, 24, 22, 22, 22, 22,
- 28, 26, 26, 24, 22, 22, 22, 22, 28, 26, 26, 24, 22, 22, 22, 22, 28, 26,
- 26, 24, 22, 22, 22, 22, 26, 26, 25, 24, 22, 22, 21, 21, 25, 24, 24, 23,
- 22, 22, 21, 21, 24, 24, 24, 23, 22, 22, 21, 20, 24, 24, 24, 23, 22, 22,
- 21, 20, 24, 24, 24, 23, 22, 22, 21, 20, 23, 23, 23, 23, 22, 22, 20, 20,
- 22, 22, 22, 22, 21, 21, 20, 20, 21, 22, 22, 22, 21, 21, 20, 19, 21, 22,
- 22, 22, 21, 21, 20, 19, 21, 22, 22, 22, 21, 21, 20, 19, 21, 22, 22, 22,
- 22, 22, 20, 19,
- /* Size 32x8 */
32, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34, 33, 31, 31, 31, 31, 29, 28,
28, 28, 28, 26, 25, 24, 24, 24, 23, 22, 21, 21, 21, 21, 33, 33, 33, 33,
33, 33, 33, 33, 32, 32, 32, 31, 30, 29, 29, 29, 28, 27, 26, 26, 26, 26,
@@ -11616,7 +11601,23 @@
23, 23, 23, 23, 23, 23, 22, 22, 22, 22, 22, 22, 22, 22, 22, 21, 21, 21,
21, 21, 20, 20, 20, 20, 20, 20, 21, 21, 22, 22, 22, 22, 22, 22, 22, 22,
22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 21, 21, 20, 20, 20, 20, 20,
- 19, 19, 19, 19 },
+ 19, 19, 19, 19,
+ /* Size 32x8 */
+ 32, 33, 33, 31, 28, 28, 23, 21, 33, 33, 33, 31, 28, 28, 23, 21, 33, 33,
+ 33, 30, 27, 27, 23, 22, 33, 33, 33, 30, 27, 27, 23, 22, 33, 33, 33, 30,
+ 27, 27, 23, 22, 33, 33, 33, 30, 27, 27, 23, 22, 33, 33, 32, 30, 26, 26,
+ 23, 22, 34, 33, 32, 30, 26, 26, 23, 22, 34, 32, 32, 29, 26, 26, 23, 22,
+ 34, 32, 32, 29, 26, 26, 23, 22, 34, 32, 32, 29, 26, 26, 23, 22, 33, 31,
+ 31, 28, 25, 25, 23, 22, 31, 30, 29, 28, 24, 24, 22, 22, 31, 29, 28, 27,
+ 24, 24, 22, 22, 31, 29, 28, 27, 24, 24, 22, 22, 31, 29, 28, 27, 24, 24,
+ 22, 22, 29, 28, 28, 26, 23, 23, 22, 22, 28, 27, 26, 24, 22, 22, 22, 22,
+ 28, 26, 26, 24, 22, 22, 22, 22, 28, 26, 26, 24, 22, 22, 22, 22, 28, 26,
+ 26, 24, 22, 22, 22, 22, 26, 26, 25, 24, 22, 22, 21, 21, 25, 24, 24, 23,
+ 22, 22, 21, 21, 24, 24, 24, 23, 22, 22, 21, 20, 24, 24, 24, 23, 22, 22,
+ 21, 20, 24, 24, 24, 23, 22, 22, 21, 20, 23, 23, 23, 23, 22, 22, 20, 20,
+ 22, 22, 22, 22, 21, 21, 20, 20, 21, 22, 22, 22, 21, 21, 20, 19, 21, 22,
+ 22, 22, 21, 21, 20, 19, 21, 22, 22, 22, 21, 21, 20, 19, 21, 22, 22, 22,
+ 22, 22, 20, 19 },
},
{
{ /* Luma */
@@ -11702,21 +11703,12 @@
26, 26, 30, 30, 30, 30, 30, 30, 30, 30, 30, 31, 31, 31, 31, 31, 30, 30,
29, 29, 29, 29, 29, 29, 28, 28, 28, 28, 28, 28, 27, 27, 26, 26,
/* Size 4x8 */
- 33, 33, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32,
- 31, 30, 32, 32, 30, 30, 32, 31, 30, 29, 31, 31, 29, 28,
- /* Size 8x4 */
33, 33, 32, 32, 32, 32, 32, 31, 33, 32, 32, 32, 32, 32, 31, 31, 32, 32,
32, 32, 31, 30, 30, 29, 32, 32, 32, 31, 30, 30, 29, 28,
+ /* Size 8x4 */
+ 33, 33, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 32, 32,
+ 31, 30, 32, 32, 30, 30, 32, 31, 30, 29, 31, 31, 29, 28,
/* Size 8x16 */
- 32, 33, 33, 33, 33, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 31, 33, 32,
- 32, 32, 32, 32, 32, 31, 33, 32, 32, 32, 32, 32, 32, 31, 33, 32, 32, 32,
- 32, 32, 32, 31, 33, 32, 32, 32, 32, 31, 31, 31, 33, 32, 32, 32, 32, 31,
- 31, 31, 33, 32, 32, 32, 32, 31, 31, 31, 33, 32, 32, 32, 31, 30, 30, 30,
- 32, 32, 32, 32, 31, 30, 30, 30, 32, 32, 32, 32, 31, 30, 30, 30, 32, 32,
- 32, 32, 31, 29, 29, 29, 32, 32, 31, 31, 30, 29, 29, 28, 32, 32, 31, 31,
- 30, 29, 29, 28, 32, 31, 31, 31, 30, 28, 28, 28, 30, 30, 30, 30, 29, 28,
- 28, 27,
- /* Size 16x8 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 30, 33, 33,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 30, 33, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 30, 33, 32, 32, 32, 32, 32,
@@ -11725,37 +11717,16 @@
30, 29, 29, 29, 28, 28, 32, 32, 32, 32, 32, 31, 31, 31, 30, 30, 30, 29,
29, 29, 28, 28, 32, 31, 31, 31, 31, 31, 31, 31, 30, 30, 30, 29, 28, 28,
28, 27,
+ /* Size 16x8 */
+ 32, 33, 33, 33, 33, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 31, 33, 32,
+ 32, 32, 32, 32, 32, 31, 33, 32, 32, 32, 32, 32, 32, 31, 33, 32, 32, 32,
+ 32, 32, 32, 31, 33, 32, 32, 32, 32, 31, 31, 31, 33, 32, 32, 32, 32, 31,
+ 31, 31, 33, 32, 32, 32, 32, 31, 31, 31, 33, 32, 32, 32, 31, 30, 30, 30,
+ 32, 32, 32, 32, 31, 30, 30, 30, 32, 32, 32, 32, 31, 30, 30, 30, 32, 32,
+ 32, 32, 31, 29, 29, 29, 32, 32, 31, 31, 30, 29, 29, 28, 32, 32, 31, 31,
+ 30, 29, 29, 28, 32, 31, 31, 31, 30, 28, 28, 28, 30, 30, 30, 30, 29, 28,
+ 28, 27,
/* Size 16x32 */
- 32, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 30, 33, 33,
- 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 31, 30, 33, 33, 33, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 30, 33, 33, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 31, 30, 33, 33, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 31, 30, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 31, 30, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 31, 30, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 31, 30, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 30,
- 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 33, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 31, 31, 33, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 31, 31, 33, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 31, 31, 31, 31, 31, 31, 33, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 31, 31, 31, 31, 31, 31, 33, 32, 32, 32, 32, 32, 32, 32, 32, 31,
- 31, 31, 31, 31, 31, 30, 33, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31,
- 31, 31, 30, 30, 33, 32, 32, 32, 32, 32, 32, 32, 31, 31, 30, 30, 30, 30,
- 30, 29, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 30, 30, 30, 30, 30, 29,
- 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 30, 30, 30, 30, 30, 29, 32, 32,
- 32, 32, 32, 32, 32, 31, 31, 31, 30, 30, 30, 30, 30, 29, 32, 32, 32, 32,
- 32, 32, 32, 31, 31, 31, 30, 30, 30, 30, 30, 29, 32, 32, 32, 32, 32, 32,
- 32, 31, 31, 30, 30, 30, 30, 30, 29, 29, 32, 32, 32, 32, 32, 32, 32, 31,
- 31, 30, 29, 29, 29, 29, 29, 28, 32, 32, 32, 32, 31, 31, 31, 31, 31, 30,
- 29, 29, 29, 29, 28, 28, 32, 32, 32, 32, 31, 31, 31, 31, 30, 30, 29, 29,
- 29, 29, 28, 28, 32, 32, 32, 32, 31, 31, 31, 31, 30, 30, 29, 29, 29, 29,
- 28, 28, 32, 32, 32, 32, 31, 31, 31, 31, 30, 30, 29, 29, 29, 29, 28, 28,
- 32, 32, 32, 31, 31, 31, 31, 31, 30, 30, 29, 29, 29, 29, 28, 28, 32, 31,
- 31, 31, 31, 31, 31, 31, 30, 29, 28, 28, 28, 28, 28, 27, 31, 31, 31, 31,
- 31, 31, 31, 30, 30, 29, 28, 28, 28, 28, 28, 27, 30, 30, 30, 30, 30, 30,
- 30, 30, 29, 28, 28, 28, 28, 28, 27, 26, 30, 30, 30, 30, 30, 30, 30, 30,
- 29, 28, 28, 28, 28, 28, 27, 26,
- /* Size 32x16 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 30, 30, 33, 33, 33, 33,
33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
@@ -11785,33 +11756,47 @@
30, 29, 29, 28, 28, 28, 28, 28, 28, 28, 27, 27, 30, 30, 30, 30, 30, 30,
30, 30, 30, 31, 31, 31, 31, 31, 30, 30, 29, 29, 29, 29, 29, 29, 28, 28,
28, 28, 28, 28, 27, 27, 26, 26,
+ /* Size 32x16 */
+ 32, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 30, 33, 33,
+ 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 31, 30, 33, 33, 33, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 30, 33, 33, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 31, 30, 33, 33, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 31, 30, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 31, 30, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 31, 30, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 31, 30, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 30,
+ 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 33, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 31, 31, 33, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 31, 31, 33, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 31, 31, 31, 31, 31, 31, 33, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 31, 31, 31, 31, 31, 31, 33, 32, 32, 32, 32, 32, 32, 32, 32, 31,
+ 31, 31, 31, 31, 31, 30, 33, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31,
+ 31, 31, 30, 30, 33, 32, 32, 32, 32, 32, 32, 32, 31, 31, 30, 30, 30, 30,
+ 30, 29, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 30, 30, 30, 30, 30, 29,
+ 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 30, 30, 30, 30, 30, 29, 32, 32,
+ 32, 32, 32, 32, 32, 31, 31, 31, 30, 30, 30, 30, 30, 29, 32, 32, 32, 32,
+ 32, 32, 32, 31, 31, 31, 30, 30, 30, 30, 30, 29, 32, 32, 32, 32, 32, 32,
+ 32, 31, 31, 30, 30, 30, 30, 30, 29, 29, 32, 32, 32, 32, 32, 32, 32, 31,
+ 31, 30, 29, 29, 29, 29, 29, 28, 32, 32, 32, 32, 31, 31, 31, 31, 31, 30,
+ 29, 29, 29, 29, 28, 28, 32, 32, 32, 32, 31, 31, 31, 31, 30, 30, 29, 29,
+ 29, 29, 28, 28, 32, 32, 32, 32, 31, 31, 31, 31, 30, 30, 29, 29, 29, 29,
+ 28, 28, 32, 32, 32, 32, 31, 31, 31, 31, 30, 30, 29, 29, 29, 29, 28, 28,
+ 32, 32, 32, 31, 31, 31, 31, 31, 30, 30, 29, 29, 29, 29, 28, 28, 32, 31,
+ 31, 31, 31, 31, 31, 31, 30, 29, 28, 28, 28, 28, 28, 27, 31, 31, 31, 31,
+ 31, 31, 31, 30, 30, 29, 28, 28, 28, 28, 28, 27, 30, 30, 30, 30, 30, 30,
+ 30, 30, 29, 28, 28, 28, 28, 28, 27, 26, 30, 30, 30, 30, 30, 30, 30, 30,
+ 29, 28, 28, 28, 28, 28, 27, 26,
/* Size 4x16 */
- 33, 33, 32, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32,
- 32, 32, 32, 32, 32, 31, 32, 32, 32, 31, 32, 32, 31, 31, 32, 32, 31, 30,
- 32, 32, 31, 30, 32, 32, 31, 30, 32, 32, 30, 29, 32, 31, 30, 29, 32, 31,
- 30, 29, 31, 31, 29, 28, 30, 30, 28, 28,
- /* Size 16x4 */
33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 30, 33, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 30, 32, 32, 32, 32,
32, 32, 32, 31, 31, 31, 31, 30, 30, 30, 29, 28, 32, 32, 32, 32, 32, 31,
31, 31, 30, 30, 30, 29, 29, 29, 28, 28,
+ /* Size 16x4 */
+ 33, 33, 32, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32,
+ 32, 32, 32, 32, 32, 31, 32, 32, 32, 31, 32, 32, 31, 31, 32, 32, 31, 30,
+ 32, 32, 31, 30, 32, 32, 31, 30, 32, 32, 30, 29, 32, 31, 30, 29, 32, 31,
+ 30, 29, 31, 31, 29, 28, 30, 30, 28, 28,
/* Size 8x32 */
- 32, 33, 33, 33, 33, 32, 32, 32, 33, 33, 33, 33, 32, 32, 32, 31, 33, 33,
- 32, 32, 32, 32, 32, 31, 33, 32, 32, 32, 32, 32, 32, 31, 33, 32, 32, 32,
- 32, 32, 32, 31, 33, 32, 32, 32, 32, 32, 32, 31, 33, 32, 32, 32, 32, 32,
- 32, 31, 33, 32, 32, 32, 32, 32, 32, 31, 33, 32, 32, 32, 32, 32, 32, 31,
- 33, 32, 32, 32, 32, 32, 32, 31, 33, 32, 32, 32, 32, 31, 31, 31, 33, 32,
- 32, 32, 32, 31, 31, 31, 33, 32, 32, 32, 32, 31, 31, 31, 33, 32, 32, 32,
- 32, 31, 31, 31, 33, 32, 32, 32, 32, 31, 31, 31, 33, 32, 32, 32, 31, 31,
- 31, 30, 33, 32, 32, 32, 31, 30, 30, 30, 32, 32, 32, 32, 31, 30, 30, 30,
- 32, 32, 32, 32, 31, 30, 30, 30, 32, 32, 32, 32, 31, 30, 30, 30, 32, 32,
- 32, 32, 31, 30, 30, 30, 32, 32, 32, 32, 31, 30, 30, 29, 32, 32, 32, 32,
- 31, 29, 29, 29, 32, 32, 31, 31, 31, 29, 29, 28, 32, 32, 31, 31, 30, 29,
- 29, 28, 32, 32, 31, 31, 30, 29, 29, 28, 32, 32, 31, 31, 30, 29, 29, 28,
- 32, 32, 31, 31, 30, 29, 29, 28, 32, 31, 31, 31, 30, 28, 28, 28, 31, 31,
- 31, 31, 30, 28, 28, 28, 30, 30, 30, 30, 29, 28, 28, 27, 30, 30, 30, 30,
- 29, 28, 28, 27,
- /* Size 32x8 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 30, 30, 33, 33, 33, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
@@ -11826,7 +11811,23 @@
32, 32, 32, 32, 31, 31, 31, 31, 31, 31, 30, 30, 30, 30, 30, 30, 29, 29,
29, 29, 29, 29, 28, 28, 28, 28, 32, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 30, 30, 30, 30, 30, 30, 29, 29, 28, 28, 28, 28, 28,
- 28, 28, 27, 27 },
+ 28, 28, 27, 27,
+ /* Size 32x8 */
+ 32, 33, 33, 33, 33, 32, 32, 32, 33, 33, 33, 33, 32, 32, 32, 31, 33, 33,
+ 32, 32, 32, 32, 32, 31, 33, 32, 32, 32, 32, 32, 32, 31, 33, 32, 32, 32,
+ 32, 32, 32, 31, 33, 32, 32, 32, 32, 32, 32, 31, 33, 32, 32, 32, 32, 32,
+ 32, 31, 33, 32, 32, 32, 32, 32, 32, 31, 33, 32, 32, 32, 32, 32, 32, 31,
+ 33, 32, 32, 32, 32, 32, 32, 31, 33, 32, 32, 32, 32, 31, 31, 31, 33, 32,
+ 32, 32, 32, 31, 31, 31, 33, 32, 32, 32, 32, 31, 31, 31, 33, 32, 32, 32,
+ 32, 31, 31, 31, 33, 32, 32, 32, 32, 31, 31, 31, 33, 32, 32, 32, 31, 31,
+ 31, 30, 33, 32, 32, 32, 31, 30, 30, 30, 32, 32, 32, 32, 31, 30, 30, 30,
+ 32, 32, 32, 32, 31, 30, 30, 30, 32, 32, 32, 32, 31, 30, 30, 30, 32, 32,
+ 32, 32, 31, 30, 30, 30, 32, 32, 32, 32, 31, 30, 30, 29, 32, 32, 32, 32,
+ 31, 29, 29, 29, 32, 32, 31, 31, 31, 29, 29, 28, 32, 32, 31, 31, 30, 29,
+ 29, 28, 32, 32, 31, 31, 30, 29, 29, 28, 32, 32, 31, 31, 30, 29, 29, 28,
+ 32, 32, 31, 31, 30, 29, 29, 28, 32, 31, 31, 31, 30, 28, 28, 28, 31, 31,
+ 31, 31, 30, 28, 28, 28, 30, 30, 30, 30, 29, 28, 28, 27, 30, 30, 30, 30,
+ 29, 28, 28, 27 },
{ /* Chroma */
/* Size 4x4 */
33, 33, 30, 27, 33, 32, 29, 26, 30, 29, 26, 24, 27, 26, 24, 22,
@@ -11910,21 +11911,12 @@
21, 21, 25, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
23, 23, 23, 23, 23, 23, 22, 22, 22, 22, 22, 22, 21, 21, 21, 21,
/* Size 4x8 */
- 33, 33, 29, 28, 33, 33, 28, 27, 33, 32, 28, 26, 33, 32, 28, 26, 30, 28,
- 26, 24, 29, 28, 24, 23, 27, 26, 23, 22, 25, 24, 23, 22,
- /* Size 8x4 */
33, 33, 33, 33, 30, 29, 27, 25, 33, 33, 32, 32, 28, 28, 26, 24, 29, 28,
28, 28, 26, 24, 23, 23, 28, 27, 26, 26, 24, 23, 22, 22,
+ /* Size 8x4 */
+ 33, 33, 29, 28, 33, 33, 28, 27, 33, 32, 28, 26, 33, 32, 28, 26, 30, 28,
+ 26, 24, 29, 28, 24, 23, 27, 26, 23, 22, 25, 24, 23, 22,
/* Size 8x16 */
- 32, 33, 33, 33, 31, 28, 28, 27, 33, 33, 33, 33, 31, 27, 27, 26, 33, 33,
- 33, 33, 30, 27, 27, 26, 33, 33, 33, 33, 30, 27, 27, 26, 33, 33, 32, 32,
- 30, 26, 26, 26, 34, 33, 32, 32, 29, 26, 26, 25, 34, 33, 32, 32, 29, 26,
- 26, 25, 33, 32, 31, 31, 29, 26, 26, 25, 31, 30, 29, 29, 28, 24, 24, 24,
- 31, 29, 28, 28, 27, 24, 24, 23, 31, 29, 28, 28, 27, 24, 24, 23, 29, 28,
- 27, 27, 25, 23, 23, 22, 28, 26, 26, 26, 24, 22, 22, 22, 28, 26, 26, 26,
- 24, 22, 22, 22, 26, 26, 25, 25, 24, 22, 22, 22, 24, 24, 24, 24, 23, 22,
- 22, 21,
- /* Size 16x8 */
32, 33, 33, 33, 33, 34, 34, 33, 31, 31, 31, 29, 28, 28, 26, 24, 33, 33,
33, 33, 33, 33, 33, 32, 30, 29, 29, 28, 26, 26, 26, 24, 33, 33, 33, 33,
32, 32, 32, 31, 29, 28, 28, 27, 26, 26, 25, 24, 33, 33, 33, 33, 32, 32,
@@ -11933,37 +11925,16 @@
24, 23, 22, 22, 22, 22, 28, 27, 27, 27, 26, 26, 26, 26, 24, 24, 24, 23,
22, 22, 22, 22, 27, 26, 26, 26, 26, 25, 25, 25, 24, 23, 23, 22, 22, 22,
22, 21,
+ /* Size 16x8 */
+ 32, 33, 33, 33, 31, 28, 28, 27, 33, 33, 33, 33, 31, 27, 27, 26, 33, 33,
+ 33, 33, 30, 27, 27, 26, 33, 33, 33, 33, 30, 27, 27, 26, 33, 33, 32, 32,
+ 30, 26, 26, 26, 34, 33, 32, 32, 29, 26, 26, 25, 34, 33, 32, 32, 29, 26,
+ 26, 25, 33, 32, 31, 31, 29, 26, 26, 25, 31, 30, 29, 29, 28, 24, 24, 24,
+ 31, 29, 28, 28, 27, 24, 24, 23, 31, 29, 28, 28, 27, 24, 24, 23, 29, 28,
+ 27, 27, 25, 23, 23, 22, 28, 26, 26, 26, 24, 22, 22, 22, 28, 26, 26, 26,
+ 24, 22, 22, 22, 26, 26, 25, 25, 24, 22, 22, 22, 24, 24, 24, 24, 23, 22,
+ 22, 21,
/* Size 16x32 */
- 32, 33, 33, 33, 33, 33, 33, 33, 31, 29, 28, 28, 28, 28, 27, 24, 33, 33,
- 33, 33, 33, 33, 33, 33, 31, 29, 28, 28, 28, 28, 26, 24, 33, 33, 33, 33,
- 33, 33, 33, 32, 31, 29, 27, 27, 27, 27, 26, 24, 33, 33, 33, 33, 33, 33,
- 33, 32, 30, 28, 27, 27, 27, 27, 26, 24, 33, 33, 33, 33, 33, 33, 33, 32,
- 30, 28, 27, 27, 27, 27, 26, 24, 33, 33, 33, 33, 33, 33, 33, 32, 30, 28,
- 27, 27, 27, 27, 26, 24, 33, 33, 33, 33, 33, 33, 33, 32, 30, 28, 27, 27,
- 27, 27, 26, 24, 33, 33, 33, 33, 33, 33, 33, 32, 30, 28, 27, 27, 27, 27,
- 26, 24, 33, 33, 33, 33, 32, 32, 32, 32, 30, 28, 26, 26, 26, 26, 26, 24,
- 34, 33, 33, 32, 32, 32, 32, 32, 30, 28, 26, 26, 26, 26, 26, 24, 34, 33,
- 33, 32, 32, 32, 32, 31, 29, 28, 26, 26, 26, 26, 25, 24, 34, 33, 33, 32,
- 32, 32, 32, 31, 29, 28, 26, 26, 26, 26, 25, 24, 34, 33, 33, 32, 32, 32,
- 32, 31, 29, 28, 26, 26, 26, 26, 25, 24, 34, 33, 33, 32, 32, 32, 32, 31,
- 29, 28, 26, 26, 26, 26, 25, 24, 33, 33, 32, 32, 31, 31, 31, 31, 29, 27,
- 26, 26, 26, 26, 25, 24, 32, 32, 31, 31, 30, 30, 30, 30, 28, 26, 25, 25,
- 25, 25, 24, 23, 31, 31, 30, 29, 29, 29, 29, 29, 28, 26, 24, 24, 24, 24,
- 24, 23, 31, 30, 29, 29, 28, 28, 28, 28, 27, 26, 24, 24, 24, 24, 23, 23,
- 31, 30, 29, 29, 28, 28, 28, 28, 27, 26, 24, 24, 24, 24, 23, 23, 31, 30,
- 29, 29, 28, 28, 28, 28, 27, 26, 24, 24, 24, 24, 23, 23, 31, 30, 29, 29,
- 28, 28, 28, 28, 27, 26, 24, 24, 24, 24, 23, 23, 30, 29, 28, 28, 28, 28,
- 28, 28, 26, 24, 23, 23, 23, 23, 23, 23, 29, 28, 28, 27, 27, 27, 27, 26,
- 25, 24, 23, 23, 23, 23, 22, 22, 28, 28, 27, 26, 26, 26, 26, 26, 24, 23,
- 22, 22, 22, 22, 22, 22, 28, 27, 26, 26, 26, 26, 26, 25, 24, 23, 22, 22,
- 22, 22, 22, 22, 28, 27, 26, 26, 26, 26, 26, 25, 24, 23, 22, 22, 22, 22,
- 22, 22, 28, 27, 26, 26, 26, 26, 26, 25, 24, 23, 22, 22, 22, 22, 22, 22,
- 28, 27, 26, 26, 26, 26, 26, 25, 24, 23, 22, 22, 22, 22, 22, 22, 26, 26,
- 26, 25, 25, 25, 25, 24, 24, 23, 22, 22, 22, 22, 22, 21, 26, 25, 25, 24,
- 24, 24, 24, 24, 23, 23, 22, 22, 22, 22, 22, 21, 24, 24, 24, 24, 24, 24,
- 24, 24, 23, 22, 22, 22, 22, 22, 21, 21, 24, 24, 24, 24, 24, 24, 24, 24,
- 23, 22, 22, 22, 22, 22, 21, 21,
- /* Size 32x16 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34, 34, 33, 32, 31, 31,
31, 31, 31, 30, 29, 28, 28, 28, 28, 28, 26, 26, 24, 24, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 31, 30, 30, 30, 30, 29,
@@ -11993,33 +11964,47 @@
23, 23, 22, 22, 22, 22, 22, 22, 22, 22, 21, 21, 24, 24, 24, 24, 24, 24,
24, 24, 24, 24, 24, 24, 24, 24, 24, 23, 23, 23, 23, 23, 23, 23, 22, 22,
22, 22, 22, 22, 21, 21, 21, 21,
+ /* Size 32x16 */
+ 32, 33, 33, 33, 33, 33, 33, 33, 31, 29, 28, 28, 28, 28, 27, 24, 33, 33,
+ 33, 33, 33, 33, 33, 33, 31, 29, 28, 28, 28, 28, 26, 24, 33, 33, 33, 33,
+ 33, 33, 33, 32, 31, 29, 27, 27, 27, 27, 26, 24, 33, 33, 33, 33, 33, 33,
+ 33, 32, 30, 28, 27, 27, 27, 27, 26, 24, 33, 33, 33, 33, 33, 33, 33, 32,
+ 30, 28, 27, 27, 27, 27, 26, 24, 33, 33, 33, 33, 33, 33, 33, 32, 30, 28,
+ 27, 27, 27, 27, 26, 24, 33, 33, 33, 33, 33, 33, 33, 32, 30, 28, 27, 27,
+ 27, 27, 26, 24, 33, 33, 33, 33, 33, 33, 33, 32, 30, 28, 27, 27, 27, 27,
+ 26, 24, 33, 33, 33, 33, 32, 32, 32, 32, 30, 28, 26, 26, 26, 26, 26, 24,
+ 34, 33, 33, 32, 32, 32, 32, 32, 30, 28, 26, 26, 26, 26, 26, 24, 34, 33,
+ 33, 32, 32, 32, 32, 31, 29, 28, 26, 26, 26, 26, 25, 24, 34, 33, 33, 32,
+ 32, 32, 32, 31, 29, 28, 26, 26, 26, 26, 25, 24, 34, 33, 33, 32, 32, 32,
+ 32, 31, 29, 28, 26, 26, 26, 26, 25, 24, 34, 33, 33, 32, 32, 32, 32, 31,
+ 29, 28, 26, 26, 26, 26, 25, 24, 33, 33, 32, 32, 31, 31, 31, 31, 29, 27,
+ 26, 26, 26, 26, 25, 24, 32, 32, 31, 31, 30, 30, 30, 30, 28, 26, 25, 25,
+ 25, 25, 24, 23, 31, 31, 30, 29, 29, 29, 29, 29, 28, 26, 24, 24, 24, 24,
+ 24, 23, 31, 30, 29, 29, 28, 28, 28, 28, 27, 26, 24, 24, 24, 24, 23, 23,
+ 31, 30, 29, 29, 28, 28, 28, 28, 27, 26, 24, 24, 24, 24, 23, 23, 31, 30,
+ 29, 29, 28, 28, 28, 28, 27, 26, 24, 24, 24, 24, 23, 23, 31, 30, 29, 29,
+ 28, 28, 28, 28, 27, 26, 24, 24, 24, 24, 23, 23, 30, 29, 28, 28, 28, 28,
+ 28, 28, 26, 24, 23, 23, 23, 23, 23, 23, 29, 28, 28, 27, 27, 27, 27, 26,
+ 25, 24, 23, 23, 23, 23, 22, 22, 28, 28, 27, 26, 26, 26, 26, 26, 24, 23,
+ 22, 22, 22, 22, 22, 22, 28, 27, 26, 26, 26, 26, 26, 25, 24, 23, 22, 22,
+ 22, 22, 22, 22, 28, 27, 26, 26, 26, 26, 26, 25, 24, 23, 22, 22, 22, 22,
+ 22, 22, 28, 27, 26, 26, 26, 26, 26, 25, 24, 23, 22, 22, 22, 22, 22, 22,
+ 28, 27, 26, 26, 26, 26, 26, 25, 24, 23, 22, 22, 22, 22, 22, 22, 26, 26,
+ 26, 25, 25, 25, 25, 24, 24, 23, 22, 22, 22, 22, 22, 21, 26, 25, 25, 24,
+ 24, 24, 24, 24, 23, 23, 22, 22, 22, 22, 22, 21, 24, 24, 24, 24, 24, 24,
+ 24, 24, 23, 22, 22, 22, 22, 22, 21, 21, 24, 24, 24, 24, 24, 24, 24, 24,
+ 23, 22, 22, 22, 22, 22, 21, 21,
/* Size 4x16 */
- 33, 33, 29, 28, 33, 33, 29, 27, 33, 33, 28, 27, 33, 33, 28, 27, 33, 32,
- 28, 26, 33, 32, 28, 26, 33, 32, 28, 26, 33, 31, 27, 26, 31, 29, 26, 24,
- 30, 28, 26, 24, 30, 28, 26, 24, 28, 27, 24, 23, 27, 26, 23, 22, 27, 26,
- 23, 22, 26, 25, 23, 22, 24, 24, 22, 22,
- /* Size 16x4 */
33, 33, 33, 33, 33, 33, 33, 33, 31, 30, 30, 28, 27, 27, 26, 24, 33, 33,
33, 33, 32, 32, 32, 31, 29, 28, 28, 27, 26, 26, 25, 24, 29, 29, 28, 28,
28, 28, 28, 27, 26, 26, 26, 24, 23, 23, 23, 22, 28, 27, 27, 27, 26, 26,
26, 26, 24, 24, 24, 23, 22, 22, 22, 22,
+ /* Size 16x4 */
+ 33, 33, 29, 28, 33, 33, 29, 27, 33, 33, 28, 27, 33, 33, 28, 27, 33, 32,
+ 28, 26, 33, 32, 28, 26, 33, 32, 28, 26, 33, 31, 27, 26, 31, 29, 26, 24,
+ 30, 28, 26, 24, 30, 28, 26, 24, 28, 27, 24, 23, 27, 26, 23, 22, 27, 26,
+ 23, 22, 26, 25, 23, 22, 24, 24, 22, 22,
/* Size 8x32 */
- 32, 33, 33, 33, 31, 28, 28, 27, 33, 33, 33, 33, 31, 28, 28, 26, 33, 33,
- 33, 33, 31, 27, 27, 26, 33, 33, 33, 33, 30, 27, 27, 26, 33, 33, 33, 33,
- 30, 27, 27, 26, 33, 33, 33, 33, 30, 27, 27, 26, 33, 33, 33, 33, 30, 27,
- 27, 26, 33, 33, 33, 33, 30, 27, 27, 26, 33, 33, 32, 32, 30, 26, 26, 26,
- 34, 33, 32, 32, 30, 26, 26, 26, 34, 33, 32, 32, 29, 26, 26, 25, 34, 33,
- 32, 32, 29, 26, 26, 25, 34, 33, 32, 32, 29, 26, 26, 25, 34, 33, 32, 32,
- 29, 26, 26, 25, 33, 32, 31, 31, 29, 26, 26, 25, 32, 31, 30, 30, 28, 25,
- 25, 24, 31, 30, 29, 29, 28, 24, 24, 24, 31, 29, 28, 28, 27, 24, 24, 23,
- 31, 29, 28, 28, 27, 24, 24, 23, 31, 29, 28, 28, 27, 24, 24, 23, 31, 29,
- 28, 28, 27, 24, 24, 23, 30, 28, 28, 28, 26, 23, 23, 23, 29, 28, 27, 27,
- 25, 23, 23, 22, 28, 27, 26, 26, 24, 22, 22, 22, 28, 26, 26, 26, 24, 22,
- 22, 22, 28, 26, 26, 26, 24, 22, 22, 22, 28, 26, 26, 26, 24, 22, 22, 22,
- 28, 26, 26, 26, 24, 22, 22, 22, 26, 26, 25, 25, 24, 22, 22, 22, 26, 25,
- 24, 24, 23, 22, 22, 22, 24, 24, 24, 24, 23, 22, 22, 21, 24, 24, 24, 24,
- 23, 22, 22, 21,
- /* Size 32x8 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34, 34, 33, 32, 31, 31,
31, 31, 31, 30, 29, 28, 28, 28, 28, 28, 26, 26, 24, 24, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 31, 30, 29, 29, 29, 29, 28,
@@ -12034,7 +12019,23 @@
27, 27, 26, 26, 26, 26, 26, 26, 26, 25, 24, 24, 24, 24, 24, 23, 23, 22,
22, 22, 22, 22, 22, 22, 22, 22, 27, 26, 26, 26, 26, 26, 26, 26, 26, 26,
25, 25, 25, 25, 25, 24, 24, 23, 23, 23, 23, 23, 22, 22, 22, 22, 22, 22,
- 22, 22, 21, 21 },
+ 22, 22, 21, 21,
+ /* Size 32x8 */
+ 32, 33, 33, 33, 31, 28, 28, 27, 33, 33, 33, 33, 31, 28, 28, 26, 33, 33,
+ 33, 33, 31, 27, 27, 26, 33, 33, 33, 33, 30, 27, 27, 26, 33, 33, 33, 33,
+ 30, 27, 27, 26, 33, 33, 33, 33, 30, 27, 27, 26, 33, 33, 33, 33, 30, 27,
+ 27, 26, 33, 33, 33, 33, 30, 27, 27, 26, 33, 33, 32, 32, 30, 26, 26, 26,
+ 34, 33, 32, 32, 30, 26, 26, 26, 34, 33, 32, 32, 29, 26, 26, 25, 34, 33,
+ 32, 32, 29, 26, 26, 25, 34, 33, 32, 32, 29, 26, 26, 25, 34, 33, 32, 32,
+ 29, 26, 26, 25, 33, 32, 31, 31, 29, 26, 26, 25, 32, 31, 30, 30, 28, 25,
+ 25, 24, 31, 30, 29, 29, 28, 24, 24, 24, 31, 29, 28, 28, 27, 24, 24, 23,
+ 31, 29, 28, 28, 27, 24, 24, 23, 31, 29, 28, 28, 27, 24, 24, 23, 31, 29,
+ 28, 28, 27, 24, 24, 23, 30, 28, 28, 28, 26, 23, 23, 23, 29, 28, 27, 27,
+ 25, 23, 23, 22, 28, 27, 26, 26, 24, 22, 22, 22, 28, 26, 26, 26, 24, 22,
+ 22, 22, 28, 26, 26, 26, 24, 22, 22, 22, 28, 26, 26, 26, 24, 22, 22, 22,
+ 28, 26, 26, 26, 24, 22, 22, 22, 26, 26, 25, 25, 24, 22, 22, 22, 26, 25,
+ 24, 24, 23, 22, 22, 22, 24, 24, 24, 24, 23, 22, 22, 21, 24, 24, 24, 24,
+ 23, 22, 22, 21 },
},
{
{ /* Luma */
@@ -12120,21 +12121,12 @@
31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
32, 32, 32, 32, 32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
/* Size 4x8 */
- 33, 33, 33, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32,
- 32, 32, 33, 32, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31,
- /* Size 8x4 */
33, 33, 33, 33, 33, 33, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31,
+ /* Size 8x4 */
+ 33, 33, 33, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32,
+ 32, 32, 33, 32, 32, 31, 32, 32, 32, 31, 32, 32, 32, 31,
/* Size 8x16 */
- 32, 33, 33, 33, 33, 33, 33, 32, 33, 33, 33, 33, 33, 33, 32, 32, 33, 33,
- 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32,
- 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32,
- 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32,
- 33, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32,
- 32, 32, 32, 32, 31, 31, 33, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32,
- 32, 32, 31, 30, 32, 32, 32, 32, 32, 32, 31, 30, 32, 32, 32, 32, 32, 32,
- 31, 30,
- /* Size 16x8 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 33, 33,
33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32,
@@ -12143,37 +12135,16 @@
32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31,
31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 30,
30, 30,
+ /* Size 16x8 */
+ 32, 33, 33, 33, 33, 33, 33, 32, 33, 33, 33, 33, 33, 33, 32, 32, 33, 33,
+ 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32,
+ 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32,
+ 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32,
+ 33, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32,
+ 32, 32, 32, 32, 31, 31, 33, 32, 32, 32, 32, 32, 31, 31, 32, 32, 32, 32,
+ 32, 32, 31, 30, 32, 32, 32, 32, 32, 32, 31, 30, 32, 32, 32, 32, 32, 32,
+ 31, 30,
/* Size 16x32 */
- 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 33, 33, 33, 33, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 31, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 31, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31,
- 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 33, 33,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 33, 33, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 33, 33, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 33, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 31, 31, 31, 31, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 31, 31, 31, 31, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31,
- 31, 31, 31, 30, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31,
- 30, 30, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 30, 30,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 30, 30, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 30, 30, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 30, 30, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 31, 31, 31, 30, 30, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 31, 31, 31, 30, 30,
- /* Size 32x16 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
@@ -12203,33 +12174,47 @@
32, 31, 31, 31, 31, 30, 30, 30, 30, 30, 30, 30, 32, 32, 32, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 31, 31, 31, 31, 31,
30, 30, 30, 30, 30, 30, 30, 30,
+ /* Size 32x16 */
+ 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 33, 33, 33, 33, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 31, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 31, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31,
+ 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 33, 33,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 33, 33, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 33, 33, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 33, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 31, 31, 31, 31, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 31, 31, 31, 31, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31,
+ 31, 31, 31, 30, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31,
+ 30, 30, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 30, 30,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 30, 30, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 30, 30, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 30, 30, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 31, 31, 31, 30, 30, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 31, 31, 31, 30, 30,
/* Size 4x16 */
- 33, 33, 33, 32, 33, 33, 33, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32,
- 32, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32, 32, 32,
- 33, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 31, 32, 32,
- 32, 31, 32, 32, 32, 31, 32, 32, 32, 31,
- /* Size 16x4 */
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 33, 33,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
32, 32, 32, 32, 32, 31, 31, 31, 31, 31,
+ /* Size 16x4 */
+ 33, 33, 33, 32, 33, 33, 33, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32,
+ 32, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32, 32, 32,
+ 33, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 31, 32, 32, 32, 31, 32, 32,
+ 32, 31, 32, 32, 32, 31, 32, 32, 32, 31,
/* Size 8x32 */
- 32, 33, 33, 33, 33, 33, 33, 32, 33, 33, 33, 33, 33, 33, 32, 32, 33, 33,
- 33, 33, 33, 33, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32,
- 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32,
- 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32,
- 33, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32,
- 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32,
- 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32,
- 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32,
- 33, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32,
- 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 31, 33, 32, 32, 32,
- 32, 32, 31, 31, 33, 32, 32, 32, 32, 32, 31, 31, 33, 32, 32, 32, 32, 32,
- 31, 31, 32, 32, 32, 32, 32, 32, 31, 30, 32, 32, 32, 32, 32, 32, 31, 30,
- 32, 32, 32, 32, 32, 32, 31, 30, 32, 32, 32, 32, 32, 32, 31, 30, 32, 32,
- 32, 32, 32, 32, 31, 30, 32, 32, 32, 32, 32, 32, 31, 30, 32, 32, 32, 32,
- 32, 32, 31, 30,
- /* Size 32x8 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33,
33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
@@ -12244,7 +12229,23 @@
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 30, 30, 30,
- 30, 30, 30, 30 },
+ 30, 30, 30, 30,
+ /* Size 32x8 */
+ 32, 33, 33, 33, 33, 33, 33, 32, 33, 33, 33, 33, 33, 33, 32, 32, 33, 33,
+ 33, 33, 33, 33, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32,
+ 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32,
+ 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32,
+ 33, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32,
+ 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32,
+ 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32,
+ 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32,
+ 33, 32, 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 32, 33, 32,
+ 32, 32, 32, 32, 32, 32, 33, 32, 32, 32, 32, 32, 32, 31, 33, 32, 32, 32,
+ 32, 32, 31, 31, 33, 32, 32, 32, 32, 32, 31, 31, 33, 32, 32, 32, 32, 32,
+ 31, 31, 32, 32, 32, 32, 32, 32, 31, 30, 32, 32, 32, 32, 32, 32, 31, 30,
+ 32, 32, 32, 32, 32, 32, 31, 30, 32, 32, 32, 32, 32, 32, 31, 30, 32, 32,
+ 32, 32, 32, 32, 31, 30, 32, 32, 32, 32, 32, 32, 31, 30, 32, 32, 32, 32,
+ 32, 32, 31, 30 },
{ /* Chroma */
/* Size 4x4 */
33, 33, 33, 30, 33, 33, 33, 29, 33, 33, 32, 29, 30, 29, 29, 26,
@@ -12328,21 +12329,12 @@
26, 26, 30, 30, 30, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 28, 28, 28,
28, 28, 28, 28, 28, 28, 28, 27, 26, 26, 26, 26, 26, 26, 26, 26,
/* Size 4x8 */
- 33, 33, 33, 30, 33, 33, 33, 29, 33, 33, 33, 29, 33, 32, 32, 28, 33, 32,
- 32, 28, 33, 31, 31, 28, 30, 28, 28, 26, 30, 28, 28, 26,
- /* Size 8x4 */
33, 33, 33, 33, 33, 33, 30, 30, 33, 33, 33, 32, 32, 31, 28, 28, 33, 33,
33, 32, 32, 31, 28, 28, 30, 29, 29, 28, 28, 28, 26, 26,
+ /* Size 8x4 */
+ 33, 33, 33, 30, 33, 33, 33, 29, 33, 33, 33, 29, 33, 32, 32, 28, 33, 32,
+ 32, 28, 33, 31, 31, 28, 30, 28, 28, 26, 30, 28, 28, 26,
/* Size 8x16 */
- 32, 33, 33, 33, 33, 33, 31, 29, 33, 33, 33, 33, 33, 33, 31, 28, 33, 33,
- 33, 33, 33, 33, 30, 28, 33, 33, 33, 33, 33, 33, 30, 28, 33, 33, 33, 33,
- 33, 33, 30, 28, 33, 33, 33, 33, 33, 33, 30, 28, 33, 33, 33, 32, 32, 32,
- 30, 28, 34, 33, 33, 32, 32, 32, 30, 27, 34, 33, 32, 32, 32, 32, 29, 27,
- 34, 33, 32, 32, 32, 32, 29, 27, 34, 33, 32, 32, 32, 32, 29, 27, 33, 32,
- 31, 31, 31, 31, 28, 26, 31, 30, 30, 29, 29, 29, 28, 26, 31, 30, 29, 28,
- 28, 28, 27, 25, 31, 30, 29, 28, 28, 28, 27, 25, 31, 30, 29, 28, 28, 28,
- 27, 25,
- /* Size 16x8 */
32, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34, 33, 31, 31, 31, 31, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 30, 30, 30, 30, 33, 33, 33, 33,
33, 33, 33, 33, 32, 32, 32, 31, 30, 29, 29, 29, 33, 33, 33, 33, 33, 33,
@@ -12351,37 +12343,16 @@
32, 31, 29, 28, 28, 28, 31, 31, 30, 30, 30, 30, 30, 30, 29, 29, 29, 28,
28, 27, 27, 27, 29, 28, 28, 28, 28, 28, 28, 27, 27, 27, 27, 26, 26, 25,
25, 25,
+ /* Size 16x8 */
+ 32, 33, 33, 33, 33, 33, 31, 29, 33, 33, 33, 33, 33, 33, 31, 28, 33, 33,
+ 33, 33, 33, 33, 30, 28, 33, 33, 33, 33, 33, 33, 30, 28, 33, 33, 33, 33,
+ 33, 33, 30, 28, 33, 33, 33, 33, 33, 33, 30, 28, 33, 33, 33, 32, 32, 32,
+ 30, 28, 34, 33, 33, 32, 32, 32, 30, 27, 34, 33, 32, 32, 32, 32, 29, 27,
+ 34, 33, 32, 32, 32, 32, 29, 27, 34, 33, 32, 32, 32, 32, 29, 27, 33, 32,
+ 31, 31, 31, 31, 28, 26, 31, 30, 30, 29, 29, 29, 28, 26, 31, 30, 29, 28,
+ 28, 28, 27, 25, 31, 30, 29, 28, 28, 28, 27, 25, 31, 30, 29, 28, 28, 28,
+ 27, 25,
/* Size 16x32 */
- 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 31, 30, 29, 28, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 31, 30, 29, 28, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 32, 31, 30, 28, 28, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 32, 31, 29, 28, 27, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 32, 30, 29, 28, 27, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 31, 30, 29, 28, 27, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 31,
- 30, 29, 28, 27, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 31, 30, 29,
- 28, 27, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 31, 30, 29, 28, 27,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 31, 30, 29, 28, 27, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 31, 30, 29, 28, 27, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 31, 30, 29, 28, 27, 33, 33, 33, 33, 33, 32,
- 32, 32, 32, 32, 32, 31, 30, 28, 28, 26, 33, 33, 33, 33, 33, 32, 32, 32,
- 32, 32, 32, 31, 30, 28, 28, 26, 34, 33, 33, 33, 33, 32, 32, 32, 32, 32,
- 32, 31, 30, 28, 27, 26, 34, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 31,
- 29, 28, 27, 26, 34, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 31, 29, 28,
- 27, 26, 34, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 31, 29, 28, 27, 26,
- 34, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 31, 29, 28, 27, 26, 34, 33,
- 33, 33, 32, 32, 32, 32, 32, 32, 32, 31, 29, 28, 27, 26, 34, 33, 33, 33,
- 32, 32, 32, 32, 32, 32, 32, 31, 29, 28, 27, 26, 33, 33, 33, 32, 32, 31,
- 31, 31, 31, 31, 31, 30, 29, 28, 27, 26, 33, 32, 32, 31, 31, 31, 31, 31,
- 31, 31, 31, 29, 28, 28, 26, 25, 32, 32, 31, 31, 30, 30, 30, 30, 30, 30,
- 30, 29, 28, 27, 26, 25, 31, 31, 30, 30, 30, 29, 29, 29, 29, 29, 29, 28,
- 28, 26, 26, 24, 31, 30, 30, 29, 29, 28, 28, 28, 28, 28, 28, 28, 27, 26,
- 25, 24, 31, 30, 30, 29, 29, 28, 28, 28, 28, 28, 28, 28, 27, 26, 25, 24,
- 31, 30, 30, 29, 29, 28, 28, 28, 28, 28, 28, 28, 27, 26, 25, 24, 31, 30,
- 30, 29, 29, 28, 28, 28, 28, 28, 28, 28, 27, 26, 25, 24, 31, 30, 30, 29,
- 29, 28, 28, 28, 28, 28, 28, 28, 27, 26, 25, 24, 31, 30, 30, 29, 29, 28,
- 28, 28, 28, 28, 28, 28, 27, 26, 25, 24, 30, 30, 29, 29, 28, 28, 28, 28,
- 28, 28, 28, 27, 26, 26, 24, 23,
- /* Size 32x16 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34,
34, 34, 34, 33, 33, 32, 31, 31, 31, 31, 31, 31, 31, 30, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
@@ -12411,33 +12382,47 @@
27, 27, 26, 26, 26, 25, 25, 25, 25, 25, 25, 24, 28, 28, 28, 27, 27, 27,
27, 27, 27, 27, 27, 27, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 25, 25,
24, 24, 24, 24, 24, 24, 24, 23,
+ /* Size 32x16 */
+ 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 31, 30, 29, 28, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 31, 30, 29, 28, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 32, 31, 30, 28, 28, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 32, 31, 29, 28, 27, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 32, 30, 29, 28, 27, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 31, 30, 29, 28, 27, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 31,
+ 30, 29, 28, 27, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 31, 30, 29,
+ 28, 27, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 31, 30, 29, 28, 27,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 31, 30, 29, 28, 27, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 31, 30, 29, 28, 27, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 31, 30, 29, 28, 27, 33, 33, 33, 33, 33, 32,
+ 32, 32, 32, 32, 32, 31, 30, 28, 28, 26, 33, 33, 33, 33, 33, 32, 32, 32,
+ 32, 32, 32, 31, 30, 28, 28, 26, 34, 33, 33, 33, 33, 32, 32, 32, 32, 32,
+ 32, 31, 30, 28, 27, 26, 34, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 31,
+ 29, 28, 27, 26, 34, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 31, 29, 28,
+ 27, 26, 34, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 31, 29, 28, 27, 26,
+ 34, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 31, 29, 28, 27, 26, 34, 33,
+ 33, 33, 32, 32, 32, 32, 32, 32, 32, 31, 29, 28, 27, 26, 34, 33, 33, 33,
+ 32, 32, 32, 32, 32, 32, 32, 31, 29, 28, 27, 26, 33, 33, 33, 32, 32, 31,
+ 31, 31, 31, 31, 31, 30, 29, 28, 27, 26, 33, 32, 32, 31, 31, 31, 31, 31,
+ 31, 31, 31, 29, 28, 28, 26, 25, 32, 32, 31, 31, 30, 30, 30, 30, 30, 30,
+ 30, 29, 28, 27, 26, 25, 31, 31, 30, 30, 30, 29, 29, 29, 29, 29, 29, 28,
+ 28, 26, 26, 24, 31, 30, 30, 29, 29, 28, 28, 28, 28, 28, 28, 28, 27, 26,
+ 25, 24, 31, 30, 30, 29, 29, 28, 28, 28, 28, 28, 28, 28, 27, 26, 25, 24,
+ 31, 30, 30, 29, 29, 28, 28, 28, 28, 28, 28, 28, 27, 26, 25, 24, 31, 30,
+ 30, 29, 29, 28, 28, 28, 28, 28, 28, 28, 27, 26, 25, 24, 31, 30, 30, 29,
+ 29, 28, 28, 28, 28, 28, 28, 28, 27, 26, 25, 24, 31, 30, 30, 29, 29, 28,
+ 28, 28, 28, 28, 28, 28, 27, 26, 25, 24, 30, 30, 29, 29, 28, 28, 28, 28,
+ 28, 28, 28, 27, 26, 26, 24, 23,
/* Size 4x16 */
- 33, 33, 33, 30, 33, 33, 33, 30, 33, 33, 33, 29, 33, 33, 33, 29, 33, 33,
- 33, 29, 33, 33, 33, 29, 33, 32, 32, 28, 33, 32, 32, 28, 33, 32, 32, 28,
- 33, 32, 32, 28, 33, 32, 32, 28, 32, 31, 31, 28, 31, 29, 29, 26, 30, 28,
- 28, 26, 30, 28, 28, 26, 30, 28, 28, 26,
- /* Size 16x4 */
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 31, 30, 30, 30, 33, 33,
33, 33, 33, 33, 32, 32, 32, 32, 32, 31, 29, 28, 28, 28, 33, 33, 33, 33,
33, 33, 32, 32, 32, 32, 32, 31, 29, 28, 28, 28, 30, 30, 29, 29, 29, 29,
28, 28, 28, 28, 28, 28, 26, 26, 26, 26,
+ /* Size 16x4 */
+ 33, 33, 33, 30, 33, 33, 33, 30, 33, 33, 33, 29, 33, 33, 33, 29, 33, 33,
+ 33, 29, 33, 33, 33, 29, 33, 32, 32, 28, 33, 32, 32, 28, 33, 32, 32, 28,
+ 33, 32, 32, 28, 33, 32, 32, 28, 32, 31, 31, 28, 31, 29, 29, 26, 30, 28,
+ 28, 26, 30, 28, 28, 26, 30, 28, 28, 26,
/* Size 8x32 */
- 32, 33, 33, 33, 33, 33, 31, 29, 33, 33, 33, 33, 33, 33, 31, 29, 33, 33,
- 33, 33, 33, 33, 31, 28, 33, 33, 33, 33, 33, 33, 31, 28, 33, 33, 33, 33,
- 33, 33, 30, 28, 33, 33, 33, 33, 33, 33, 30, 28, 33, 33, 33, 33, 33, 33,
- 30, 28, 33, 33, 33, 33, 33, 33, 30, 28, 33, 33, 33, 33, 33, 33, 30, 28,
- 33, 33, 33, 33, 33, 33, 30, 28, 33, 33, 33, 33, 33, 33, 30, 28, 33, 33,
- 33, 33, 33, 33, 30, 28, 33, 33, 33, 32, 32, 32, 30, 28, 33, 33, 33, 32,
- 32, 32, 30, 28, 34, 33, 33, 32, 32, 32, 30, 27, 34, 33, 32, 32, 32, 32,
- 29, 27, 34, 33, 32, 32, 32, 32, 29, 27, 34, 33, 32, 32, 32, 32, 29, 27,
- 34, 33, 32, 32, 32, 32, 29, 27, 34, 33, 32, 32, 32, 32, 29, 27, 34, 33,
- 32, 32, 32, 32, 29, 27, 33, 33, 32, 31, 31, 31, 29, 27, 33, 32, 31, 31,
- 31, 31, 28, 26, 32, 31, 30, 30, 30, 30, 28, 26, 31, 30, 30, 29, 29, 29,
- 28, 26, 31, 30, 29, 28, 28, 28, 27, 25, 31, 30, 29, 28, 28, 28, 27, 25,
- 31, 30, 29, 28, 28, 28, 27, 25, 31, 30, 29, 28, 28, 28, 27, 25, 31, 30,
- 29, 28, 28, 28, 27, 25, 31, 30, 29, 28, 28, 28, 27, 25, 30, 29, 28, 28,
- 28, 28, 26, 24,
- /* Size 32x8 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34,
34, 34, 34, 33, 33, 32, 31, 31, 31, 31, 31, 31, 31, 30, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
@@ -12452,7 +12437,23 @@
30, 30, 30, 30, 30, 30, 30, 30, 30, 29, 29, 29, 29, 29, 29, 29, 28, 28,
28, 27, 27, 27, 27, 27, 27, 26, 29, 29, 28, 28, 28, 28, 28, 28, 28, 28,
28, 28, 28, 28, 27, 27, 27, 27, 27, 27, 27, 27, 26, 26, 26, 25, 25, 25,
- 25, 25, 25, 24 },
+ 25, 25, 25, 24,
+ /* Size 32x8 */
+ 32, 33, 33, 33, 33, 33, 31, 29, 33, 33, 33, 33, 33, 33, 31, 29, 33, 33,
+ 33, 33, 33, 33, 31, 28, 33, 33, 33, 33, 33, 33, 31, 28, 33, 33, 33, 33,
+ 33, 33, 30, 28, 33, 33, 33, 33, 33, 33, 30, 28, 33, 33, 33, 33, 33, 33,
+ 30, 28, 33, 33, 33, 33, 33, 33, 30, 28, 33, 33, 33, 33, 33, 33, 30, 28,
+ 33, 33, 33, 33, 33, 33, 30, 28, 33, 33, 33, 33, 33, 33, 30, 28, 33, 33,
+ 33, 33, 33, 33, 30, 28, 33, 33, 33, 32, 32, 32, 30, 28, 33, 33, 33, 32,
+ 32, 32, 30, 28, 34, 33, 33, 32, 32, 32, 30, 27, 34, 33, 32, 32, 32, 32,
+ 29, 27, 34, 33, 32, 32, 32, 32, 29, 27, 34, 33, 32, 32, 32, 32, 29, 27,
+ 34, 33, 32, 32, 32, 32, 29, 27, 34, 33, 32, 32, 32, 32, 29, 27, 34, 33,
+ 32, 32, 32, 32, 29, 27, 33, 33, 32, 31, 31, 31, 29, 27, 33, 32, 31, 31,
+ 31, 31, 28, 26, 32, 31, 30, 30, 30, 30, 28, 26, 31, 30, 30, 29, 29, 29,
+ 28, 26, 31, 30, 29, 28, 28, 28, 27, 25, 31, 30, 29, 28, 28, 28, 27, 25,
+ 31, 30, 29, 28, 28, 28, 27, 25, 31, 30, 29, 28, 28, 28, 27, 25, 31, 30,
+ 29, 28, 28, 28, 27, 25, 31, 30, 29, 28, 28, 28, 27, 25, 30, 29, 28, 28,
+ 28, 28, 26, 24 },
},
{
{ /* Luma */
@@ -12538,22 +12539,13 @@
32, 32, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
/* Size 4x8 */
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32,
- 32, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32, 32, 32,
- /* Size 8x4 */
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33,
32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32,
+ /* Size 8x4 */
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32,
+ 32, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32, 32, 32,
/* Size 8x16 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 33, 33, 33, 32,
- 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32,
- 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32,
- 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33,
- 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32,
- 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32,
- 32, 32,
- /* Size 16x8 */
- 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 32, 32, 32, 32, 32,
@@ -12561,37 +12553,16 @@
32, 32, 32, 32, 32, 32, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32,
32, 32, 32, 32, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
32, 32,
- /* Size 16x32 */
+ /* Size 16x8 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32,
- 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33,
- 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33,
- 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 32, 32, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
- 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33,
- 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 32, 32, 32, 32, 32,
- 32, 32, 32, 32, 32, 32, 32, 32,
- /* Size 32x16 */
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 33, 33, 33, 32,
+ 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32,
+ 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32,
+ 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33,
+ 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32,
+ 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32,
+ 32, 32,
+ /* Size 16x32 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
@@ -12621,35 +12592,49 @@
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32,
+ /* Size 32x16 */
+ 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32,
+ 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33,
+ 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33,
+ 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33,
+ 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32, 32, 32, 32, 32,
/* Size 4x16 */
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 33, 32,
- 32, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32, 32, 32,
- 33, 32, 32, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32,
- 32, 32, 33, 32, 32, 32, 33, 32, 32, 32,
- /* Size 16x4 */
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ /* Size 16x4 */
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 33, 32,
+ 32, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32, 32, 32,
+ 33, 32, 32, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32, 32, 32, 33, 32,
+ 32, 32, 33, 32, 32, 32, 33, 32, 32, 32,
/* Size 8x32 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32,
- 32, 32, 33, 33, 33, 32, 32, 32, 32, 32, 33, 33, 33, 32, 32, 32, 32, 32,
- 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33,
- 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32,
- 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32,
- 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32,
- 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33,
- 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32,
- 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32,
- 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32,
- 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33,
- 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32,
- 32, 32, 32, 32,
- /* Size 32x8 */
- 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
@@ -12662,6 +12647,22 @@
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32,
+ /* Size 32x8 */
+ 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32,
+ 32, 32, 33, 33, 33, 32, 32, 32, 32, 32, 33, 33, 33, 32, 32, 32, 32, 32,
+ 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33,
+ 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32,
+ 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32,
+ 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32,
+ 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33,
+ 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32,
+ 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32,
+ 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32,
+ 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33,
+ 32, 32, 32, 32, 32, 32, 33, 33, 32, 32, 32, 32, 32, 32, 33, 33, 32, 32,
32, 32, 32, 32 },
{ /* Chroma */
/* Size 4x4 */
@@ -12746,21 +12747,12 @@
32, 32, 34, 34, 34, 34, 34, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32,
/* Size 4x8 */
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 34, 33, 32, 32,
- /* Size 8x4 */
33, 33, 33, 33, 33, 33, 33, 34, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 32, 32, 33, 33, 33, 33, 33, 33, 32, 32,
+ /* Size 8x4 */
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 34, 33, 32, 32,
/* Size 8x16 */
- 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 33, 33, 33, 33,
- 33, 32, 32, 32, 34, 33, 33, 33, 33, 32, 32, 32, 34, 33, 33, 33, 32, 32,
- 32, 32,
- /* Size 16x8 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 34, 34, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
@@ -12769,37 +12761,16 @@
33, 33, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32,
32, 32,
- /* Size 16x32 */
+ /* Size 16x8 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32,
- 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32,
- 32, 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32,
- 34, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 34, 33,
- 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 34, 34, 33, 33,
- 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 34, 34, 33, 33, 33, 33,
- 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 34, 34, 33, 33, 33, 33, 33, 33,
- 32, 32, 32, 32, 32, 32, 32, 32,
- /* Size 32x16 */
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 33, 33, 33, 33,
+ 33, 32, 32, 32, 34, 33, 33, 33, 33, 32, 32, 32, 34, 33, 33, 33, 32, 32,
+ 32, 32,
+ /* Size 16x32 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34, 34, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
@@ -12829,17 +12800,7 @@
33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32,
32, 32, 32, 32, 32, 32, 32, 32,
- /* Size 4x16 */
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 33, 33,
- 32, 32, 33, 33, 32, 32, 34, 33, 32, 32,
- /* Size 16x4 */
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 34, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 32, 32, 32, 32,
- /* Size 8x32 */
+ /* Size 32x16 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
@@ -12850,12 +12811,36 @@
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
- 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 33, 33, 33, 33, 33, 32,
- 32, 32, 33, 33, 33, 33, 33, 32, 32, 32, 33, 33, 33, 33, 33, 32, 32, 32,
- 34, 33, 33, 33, 33, 32, 32, 32, 34, 33, 33, 33, 33, 32, 32, 32, 34, 33,
- 33, 33, 32, 32, 32, 32, 34, 33, 33, 33, 32, 32, 32, 32, 34, 33, 33, 33,
- 32, 32, 32, 32,
- /* Size 32x8 */
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32,
+ 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32,
+ 32, 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32,
+ 34, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 34, 33,
+ 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 34, 34, 33, 33,
+ 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 34, 34, 33, 33, 33, 33,
+ 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 34, 34, 33, 33, 33, 33, 33, 33,
+ 32, 32, 32, 32, 32, 32, 32, 32,
+ /* Size 4x16 */
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 34, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 32, 32, 32, 32,
+ /* Size 16x4 */
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 33, 33,
+ 32, 32, 33, 33, 32, 32, 34, 33, 32, 32,
+ /* Size 8x32 */
32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34, 34, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
@@ -12870,6 +12855,22 @@
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32,
32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 32, 32,
+ 32, 32, 32, 32,
+ /* Size 32x8 */
+ 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
+ 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32, 32, 33, 33, 33, 33, 33, 32,
+ 32, 32, 33, 33, 33, 33, 33, 32, 32, 32, 33, 33, 33, 33, 33, 32, 32, 32,
+ 34, 33, 33, 33, 33, 32, 32, 32, 34, 33, 33, 33, 33, 32, 32, 32, 34, 33,
+ 33, 33, 32, 32, 32, 32, 34, 33, 33, 33, 32, 32, 32, 32, 34, 33, 33, 33,
32, 32, 32, 32 },
},
-};
+};
\ No newline at end of file
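
Note on the hunks above: within each quantizer-matrix entry, the data blocks are swapped between every pair of transposed sizes (4x16 with 16x4, 8x32 with 32x8, 16x32 with 32x16) while the /* Size ... */ labels stay put, so each label now fronts the block previously stored under its transpose. That is consistent with a row/column storage-convention fix rather than new coefficient values. A minimal sketch of the relationship, assuming each W x H block is the row-major transpose of its H x W counterpart (an assumption; the patch itself only moves data, and the element type below is illustrative):

    #include <stdint.h>

    /* Illustrative only: map a row-major w x h table onto its transpose. */
    static void transpose_table(const int16_t *src, int w, int h, int16_t *dst) {
      for (int r = 0; r < h; ++r)
        for (int c = 0; c < w; ++c)
          dst[c * h + r] = src[r * w + c];  /* (r, c) in src -> (c, r) in dst */
    }
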
diff --git a/av1/common/reconinter.c b/av1/common/reconinter.c
index 11fd257..602fab7 100644
--- a/av1/common/reconinter.c
+++ b/av1/common/reconinter.c
@@ -30,11 +30,11 @@
// This function will determine whether or not to create a warped
// prediction.
-int av1_allow_warp(const MB_MODE_INFO *const mbmi,
- const WarpTypesAllowed *const warp_types,
- const WarpedMotionParams *const gm_params,
- int build_for_obmc, const struct scale_factors *const sf,
- WarpedMotionParams *final_warp_params) {
+static int allow_warp(const MB_MODE_INFO *const mbmi,
+ const WarpTypesAllowed *const warp_types,
+ const WarpedMotionParams *const gm_params,
+ int build_for_obmc, const struct scale_factors *const sf,
+ WarpedMotionParams *final_warp_params) {
// Note: As per the spec, we must test the fixed point scales here, which are
// at a higher precision (1 << 14) than the xs and ys in subpel_params (that
// have 1 << 10 precision).
@@ -65,9 +65,9 @@
if (xd->cur_frame_force_integer_mv) return;
- if (av1_allow_warp(mi, warp_types, &xd->global_motion[mi->ref_frame[ref]], 0,
- inter_pred_params->scale_factors,
- &inter_pred_params->warp_params)) {
+ if (allow_warp(mi, warp_types, &xd->global_motion[mi->ref_frame[ref]], 0,
+ inter_pred_params->scale_factors,
+ &inter_pred_params->warp_params)) {
inter_pred_params->mode = WARP_PRED;
}
}
@@ -819,11 +819,11 @@
#if DISABLE_CHROMA_U8X8_OBMC
case BLOCK_4X4:
case BLOCK_8X4:
- case BLOCK_4X8: return 1; break;
+ case BLOCK_4X8: return 1;
#else
case BLOCK_4X4:
case BLOCK_8X4:
- case BLOCK_4X8: return dir == 0; break;
+ case BLOCK_4X8: return dir == 0;
#endif
default: return 0;
}
@@ -832,8 +832,6 @@
void av1_modify_neighbor_predictor_for_obmc(MB_MODE_INFO *mbmi) {
mbmi->ref_frame[1] = NONE_FRAME;
mbmi->interinter_comp.type = COMPOUND_AVERAGE;
-
- return;
}
struct obmc_inter_pred_ctxt {
diff --git a/av1/common/reconinter.h b/av1/common/reconinter.h
index da7b84a..0b93d3b 100644
--- a/av1/common/reconinter.h
+++ b/av1/common/reconinter.h
@@ -482,12 +482,6 @@
const uint8_t *inter_pred, int inter_stride,
const uint8_t *intra_pred, int intra_stride);
-int av1_allow_warp(const MB_MODE_INFO *const mbmi,
- const WarpTypesAllowed *const warp_types,
- const WarpedMotionParams *const gm_params,
- int build_for_obmc, const struct scale_factors *const sf,
- WarpedMotionParams *final_warp_params);
-
#ifdef __cplusplus
} // extern "C"
#endif
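
Taken together, the reconinter.c and reconinter.h hunks demote av1_allow_warp() to a file-local helper: the definition is renamed allow_warp and given internal linkage, its single in-file caller is updated, and the prototype is removed from the header, shrinking the library's exported symbol surface. A minimal sketch of the pattern with hypothetical names (not the real reconinter API):

    /* foo.c: the helper is referenced only in this translation unit, so
     * internal linkage keeps it out of the exported symbol table. */
    static int helper(int x) { return x + 1; }  /* hypothetical */

    int public_entry(int x) {  /* still declared in foo.h */
      return helper(x);
    }
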
diff --git a/av1/common/reconintra.c b/av1/common/reconintra.c
index d5f806e..3704b8a 100644
--- a/av1/common/reconintra.c
+++ b/av1/common/reconintra.c
@@ -1683,7 +1683,7 @@
#endif
CFL_CTX *const cfl = &xd->cfl;
CFL_PRED_TYPE pred_plane = get_cfl_pred_type(plane);
- if (cfl->dc_pred_is_cached[pred_plane] == 0) {
+ if (!cfl->dc_pred_is_cached[pred_plane]) {
av1_predict_intra_block(xd, seq_params->sb_size,
seq_params->enable_intra_edge_filter, pd->width,
pd->height, tx_size, mode, angle_delta,
@@ -1691,7 +1691,7 @@
dst, dst_stride, blk_col, blk_row, plane);
if (cfl->use_dc_pred_cache) {
cfl_store_dc_pred(xd, dst, pred_plane, tx_size_wide[tx_size]);
- cfl->dc_pred_is_cached[pred_plane] = 1;
+ cfl->dc_pred_is_cached[pred_plane] = true;
}
} else {
cfl_load_dc_pred(xd, dst, dst_stride, tx_size, pred_plane);
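
The reconintra.c hunks read and write the per-plane dc_pred_is_cached flags as booleans instead of 0/1 integers. The surrounding logic is a compute-once cache: run the intra prediction the first time through, store the DC prediction, and reload it on later calls. A minimal sketch of that pattern, assuming one bool flag per prediction plane (simplified names, not the real CFL_CTX layout):

    #include <stdbool.h>

    typedef struct {
      bool cached[2];   /* one flag per chroma prediction plane */
      int dc_value[2];  /* stand-in for the stored DC prediction */
    } DcPredCache;

    static int get_dc_pred(DcPredCache *c, int plane, int (*compute)(void)) {
      if (!c->cached[plane]) {
        c->dc_value[plane] = compute();  /* first use: compute and store */
        c->cached[plane] = true;
      }
      return c->dc_value[plane];  /* later uses: serve from the cache */
    }
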
diff --git a/av1/common/resize.c b/av1/common/resize.c
index 242930c..f4bfcd0 100644
--- a/av1/common/resize.c
+++ b/av1/common/resize.c
@@ -20,6 +20,7 @@
#include "config/aom_config.h"
#include "aom_dsp/aom_dsp_common.h"
+#include "aom_dsp/flow_estimation/corner_detect.h"
#include "aom_ports/mem.h"
#include "aom_scale/aom_scale.h"
#include "av1/common/common.h"
@@ -1369,7 +1370,7 @@
AV1_COMMON *cm, YV12_BUFFER_CONFIG *unscaled, YV12_BUFFER_CONFIG *scaled,
const InterpFilter filter, const int phase, const bool use_optimized_scaler,
const bool for_psnr, const int border_in_pixels,
- const bool alloc_y_buffer_8bit) {
+ const int num_pyramid_levels) {
// If scaling is performed for the sole purpose of calculating PSNR, then our
// target dimensions are superres upscaled width/height. Otherwise our target
// dimensions are coded width/height.
@@ -1389,7 +1390,7 @@
scaled, scaled_width, scaled_height, seq_params->subsampling_x,
seq_params->subsampling_y, seq_params->use_highbitdepth,
border_in_pixels, cm->features.byte_alignment, NULL, NULL, NULL,
- alloc_y_buffer_8bit, 0))
+ num_pyramid_levels, 0))
aom_internal_error(cm->error, AOM_CODEC_MEM_ERROR,
"Failed to allocate scaled buffer");
@@ -1421,9 +1422,8 @@
}
#endif
return scaled;
- } else {
- return unscaled;
}
+ return unscaled;
}
// Calculates the scaled dimension given the original dimension and the scale
@@ -1481,7 +1481,8 @@
// TODO(afergs): Look for in-place upscaling
// TODO(afergs): aom_ vs av1_ functions? Which can I use?
// Upscale decoded image.
-void av1_superres_upscale(AV1_COMMON *cm, BufferPool *const pool) {
+void av1_superres_upscale(AV1_COMMON *cm, BufferPool *const pool,
+ int num_pyramid_levels) {
const int num_planes = av1_num_planes(cm);
if (!av1_superres_scaled(cm)) return;
const SequenceHeader *const seq_params = cm->seq_params;
@@ -1496,7 +1497,7 @@
if (aom_alloc_frame_buffer(
&copy_buffer, aligned_width, cm->height, seq_params->subsampling_x,
seq_params->subsampling_y, seq_params->use_highbitdepth,
- AOM_BORDER_IN_PIXELS, byte_alignment, 0))
+ AOM_BORDER_IN_PIXELS, byte_alignment, 0, 0))
aom_internal_error(cm->error, AOM_CODEC_MEM_ERROR,
"Failed to allocate copy buffer for superres upscaling");
@@ -1528,7 +1529,8 @@
frame_to_show, cm->superres_upscaled_width,
cm->superres_upscaled_height, seq_params->subsampling_x,
seq_params->subsampling_y, seq_params->use_highbitdepth,
- AOM_BORDER_IN_PIXELS, byte_alignment, fb, cb, cb_priv, 0, 0)) {
+ AOM_BORDER_IN_PIXELS, byte_alignment, fb, cb, cb_priv,
+ num_pyramid_levels, 0)) {
unlock_buffer_pool(pool);
aom_internal_error(
cm->error, AOM_CODEC_MEM_ERROR,
@@ -1545,7 +1547,7 @@
frame_to_show, cm->superres_upscaled_width,
cm->superres_upscaled_height, seq_params->subsampling_x,
seq_params->subsampling_y, seq_params->use_highbitdepth,
- AOM_BORDER_IN_PIXELS, byte_alignment, 0))
+ AOM_BORDER_IN_PIXELS, byte_alignment, num_pyramid_levels, 0))
aom_internal_error(
cm->error, AOM_CODEC_MEM_ERROR,
"Failed to reallocate current frame buffer for superres upscaling");
diff --git a/av1/common/resize.h b/av1/common/resize.h
index 4e8ee0f..5927d8e 100644
--- a/av1/common/resize.h
+++ b/av1/common/resize.h
@@ -75,7 +75,7 @@
AV1_COMMON *cm, YV12_BUFFER_CONFIG *unscaled, YV12_BUFFER_CONFIG *scaled,
const InterpFilter filter, const int phase, const bool use_optimized_scaler,
const bool for_psnr, const int border_in_pixels,
- const bool alloc_y_buffer_8bit);
+ const int num_pyramid_levels);
void av1_resize_and_extend_frame_nonnormative(const YV12_BUFFER_CONFIG *src,
YV12_BUFFER_CONFIG *dst, int bd,
@@ -95,7 +95,8 @@
// denominator.
void av1_calculate_unscaled_superres_size(int *width, int *height, int denom);
-void av1_superres_upscale(AV1_COMMON *cm, BufferPool *const pool);
+void av1_superres_upscale(AV1_COMMON *cm, BufferPool *const pool,
+ int num_pyramid_levels);
// Returns 1 if a superres upscaled frame is scaled and 0 otherwise.
static INLINE int av1_superres_scaled(const AV1_COMMON *cm) {
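
The resize.c and resize.h hunks replace the bool alloc_y_buffer_8bit parameter with an int num_pyramid_levels and thread it through av1_superres_upscale() into the frame-buffer allocations, letting callers state how many image-pyramid levels to allocate with each frame; the new corner_detect.h include suggests the pyramid feeds the flow-estimation code. A hypothetical call site after the change, assuming 0 requests no pyramid:

    /* Illustrative only: a decode-side superres upscale that needs no
     * multi-scale pyramid would pass 0 for the new parameter. */
    av1_superres_upscale(cm, pool, /*num_pyramid_levels=*/0);
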
diff --git a/av1/common/scan.c b/av1/common/scan.c
index b86068d..0943579 100644
--- a/av1/common/scan.c
+++ b/av1/common/scan.c
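
As with the quantizer tables earlier in the patch, the scan.c hunks swap the stored contents between each transposed pair of tables (default_scan_4x8 with default_scan_8x4, and the mcol_ with the mrow_ variants of the same size), which again points to a row/column convention fix rather than new scan orders. Whatever the convention, every table must remain a permutation of the coefficient indices; a small self-contained check of that invariant, with an assumed helper name:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical sanity check (not part of the patch): true iff scan[]
     * visits each coefficient index 0..n-1 exactly once. */
    static bool scan_is_permutation(const int16_t *scan, int n) {
      bool seen[512] = { false };  /* largest table in this file is 16x32 = 512 */
      if (n < 1 || n > 512) return false;
      for (int i = 0; i < n; ++i) {
        if (scan[i] < 0 || scan[i] >= n || seen[scan[i]]) return false;
        seen[scan[i]] = true;
      }
      return true;
    }
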
@@ -14,112 +14,91 @@
#include "av1/common/common_data.h"
#include "av1/common/scan.h"
-DECLARE_ALIGNED(16, static const int16_t,
- default_scan_4x4[16]) = { 0, 1, 4, 8, 5, 2, 3, 6,
- 9, 12, 13, 10, 7, 11, 14, 15 };
-
-DECLARE_ALIGNED(16, static const int16_t, mcol_scan_4x4[16]) = {
- 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15,
+DECLARE_ALIGNED(16, static const int16_t, default_scan_4x4[16]) = {
+ 0, 4, 1, 2, 5, 8, 12, 9, 6, 3, 7, 10, 13, 14, 11, 15,
};
-DECLARE_ALIGNED(16, static const int16_t, mrow_scan_4x4[16]) = {
+DECLARE_ALIGNED(16, static const int16_t, mcol_scan_4x4[16]) = {
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
};
+DECLARE_ALIGNED(16, static const int16_t, mrow_scan_4x4[16]) = {
+ 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15,
+};
+
DECLARE_ALIGNED(16, static const int16_t, default_scan_4x8[32]) = {
- 0, 1, 4, 2, 5, 8, 3, 6, 9, 12, 7, 10, 13, 16, 11, 14,
- 17, 20, 15, 18, 21, 24, 19, 22, 25, 28, 23, 26, 29, 27, 30, 31,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, mcol_scan_4x8[32]) = {
- 0, 4, 8, 12, 16, 20, 24, 28, 1, 5, 9, 13, 17, 21, 25, 29,
- 2, 6, 10, 14, 18, 22, 26, 30, 3, 7, 11, 15, 19, 23, 27, 31,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, mrow_scan_4x8[32]) = {
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
- 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, default_scan_8x4[32]) = {
0, 8, 1, 16, 9, 2, 24, 17, 10, 3, 25, 18, 11, 4, 26, 19,
12, 5, 27, 20, 13, 6, 28, 21, 14, 7, 29, 22, 15, 30, 23, 31,
};
-DECLARE_ALIGNED(16, static const int16_t, mcol_scan_8x4[32]) = {
- 0, 8, 16, 24, 1, 9, 17, 25, 2, 10, 18, 26, 3, 11, 19, 27,
- 4, 12, 20, 28, 5, 13, 21, 29, 6, 14, 22, 30, 7, 15, 23, 31,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, mrow_scan_8x4[32]) = {
+DECLARE_ALIGNED(16, static const int16_t, mcol_scan_4x8[32]) = {
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
};
-DECLARE_ALIGNED(16, static const int16_t, default_scan_4x16[64]) = {
- 0, 1, 4, 2, 5, 8, 3, 6, 9, 12, 7, 10, 13, 16, 11, 14,
- 17, 20, 15, 18, 21, 24, 19, 22, 25, 28, 23, 26, 29, 32, 27, 30,
- 33, 36, 31, 34, 37, 40, 35, 38, 41, 44, 39, 42, 45, 48, 43, 46,
- 49, 52, 47, 50, 53, 56, 51, 54, 57, 60, 55, 58, 61, 59, 62, 63,
+DECLARE_ALIGNED(16, static const int16_t, mrow_scan_4x8[32]) = {
+ 0, 8, 16, 24, 1, 9, 17, 25, 2, 10, 18, 26, 3, 11, 19, 27,
+ 4, 12, 20, 28, 5, 13, 21, 29, 6, 14, 22, 30, 7, 15, 23, 31,
};
-DECLARE_ALIGNED(16, static const int16_t, default_scan_16x4[64]) = {
+DECLARE_ALIGNED(16, static const int16_t, default_scan_8x4[32]) = {
+ 0, 1, 4, 2, 5, 8, 3, 6, 9, 12, 7, 10, 13, 16, 11, 14,
+ 17, 20, 15, 18, 21, 24, 19, 22, 25, 28, 23, 26, 29, 27, 30, 31,
+};
+
+DECLARE_ALIGNED(16, static const int16_t, mcol_scan_8x4[32]) = {
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
+ 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
+};
+
+DECLARE_ALIGNED(16, static const int16_t, mrow_scan_8x4[32]) = {
+ 0, 4, 8, 12, 16, 20, 24, 28, 1, 5, 9, 13, 17, 21, 25, 29,
+ 2, 6, 10, 14, 18, 22, 26, 30, 3, 7, 11, 15, 19, 23, 27, 31,
+};
+
+DECLARE_ALIGNED(16, static const int16_t, default_scan_4x16[64]) = {
0, 16, 1, 32, 17, 2, 48, 33, 18, 3, 49, 34, 19, 4, 50, 35,
20, 5, 51, 36, 21, 6, 52, 37, 22, 7, 53, 38, 23, 8, 54, 39,
24, 9, 55, 40, 25, 10, 56, 41, 26, 11, 57, 42, 27, 12, 58, 43,
28, 13, 59, 44, 29, 14, 60, 45, 30, 15, 61, 46, 31, 62, 47, 63,
};
+DECLARE_ALIGNED(16, static const int16_t, default_scan_16x4[64]) = {
+ 0, 1, 4, 2, 5, 8, 3, 6, 9, 12, 7, 10, 13, 16, 11, 14,
+ 17, 20, 15, 18, 21, 24, 19, 22, 25, 28, 23, 26, 29, 32, 27, 30,
+ 33, 36, 31, 34, 37, 40, 35, 38, 41, 44, 39, 42, 45, 48, 43, 46,
+ 49, 52, 47, 50, 53, 56, 51, 54, 57, 60, 55, 58, 61, 59, 62, 63,
+};
+
DECLARE_ALIGNED(16, static const int16_t, mrow_scan_4x16[64]) = {
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
- 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
- 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
- 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, mrow_scan_16x4[64]) = {
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
- 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
- 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
- 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, mcol_scan_4x16[64]) = {
- 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60,
- 1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57, 61,
- 2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50, 54, 58, 62,
- 3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, mcol_scan_16x4[64]) = {
0, 16, 32, 48, 1, 17, 33, 49, 2, 18, 34, 50, 3, 19, 35, 51,
4, 20, 36, 52, 5, 21, 37, 53, 6, 22, 38, 54, 7, 23, 39, 55,
8, 24, 40, 56, 9, 25, 41, 57, 10, 26, 42, 58, 11, 27, 43, 59,
12, 28, 44, 60, 13, 29, 45, 61, 14, 30, 46, 62, 15, 31, 47, 63,
};
-DECLARE_ALIGNED(16, static const int16_t, default_scan_8x32[256]) = {
- 0, 1, 8, 2, 9, 16, 3, 10, 17, 24, 4, 11, 18, 25, 32,
- 5, 12, 19, 26, 33, 40, 6, 13, 20, 27, 34, 41, 48, 7, 14,
- 21, 28, 35, 42, 49, 56, 15, 22, 29, 36, 43, 50, 57, 64, 23,
- 30, 37, 44, 51, 58, 65, 72, 31, 38, 45, 52, 59, 66, 73, 80,
- 39, 46, 53, 60, 67, 74, 81, 88, 47, 54, 61, 68, 75, 82, 89,
- 96, 55, 62, 69, 76, 83, 90, 97, 104, 63, 70, 77, 84, 91, 98,
- 105, 112, 71, 78, 85, 92, 99, 106, 113, 120, 79, 86, 93, 100, 107,
- 114, 121, 128, 87, 94, 101, 108, 115, 122, 129, 136, 95, 102, 109, 116,
- 123, 130, 137, 144, 103, 110, 117, 124, 131, 138, 145, 152, 111, 118, 125,
- 132, 139, 146, 153, 160, 119, 126, 133, 140, 147, 154, 161, 168, 127, 134,
- 141, 148, 155, 162, 169, 176, 135, 142, 149, 156, 163, 170, 177, 184, 143,
- 150, 157, 164, 171, 178, 185, 192, 151, 158, 165, 172, 179, 186, 193, 200,
- 159, 166, 173, 180, 187, 194, 201, 208, 167, 174, 181, 188, 195, 202, 209,
- 216, 175, 182, 189, 196, 203, 210, 217, 224, 183, 190, 197, 204, 211, 218,
- 225, 232, 191, 198, 205, 212, 219, 226, 233, 240, 199, 206, 213, 220, 227,
- 234, 241, 248, 207, 214, 221, 228, 235, 242, 249, 215, 222, 229, 236, 243,
- 250, 223, 230, 237, 244, 251, 231, 238, 245, 252, 239, 246, 253, 247, 254,
- 255,
+DECLARE_ALIGNED(16, static const int16_t, mrow_scan_16x4[64]) = {
+ 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60,
+ 1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57, 61,
+ 2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50, 54, 58, 62,
+ 3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63,
};
-DECLARE_ALIGNED(16, static const int16_t, default_scan_32x8[256]) = {
+DECLARE_ALIGNED(16, static const int16_t, mcol_scan_4x16[64]) = {
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
+ 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
+ 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
+ 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
+};
+
+DECLARE_ALIGNED(16, static const int16_t, mcol_scan_16x4[64]) = {
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
+ 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
+ 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
+ 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
+};
+
+DECLARE_ALIGNED(16, static const int16_t, default_scan_8x32[256]) = {
0, 32, 1, 64, 33, 2, 96, 65, 34, 3, 128, 97, 66, 35, 4,
160, 129, 98, 67, 36, 5, 192, 161, 130, 99, 68, 37, 6, 224, 193,
162, 131, 100, 69, 38, 7, 225, 194, 163, 132, 101, 70, 39, 8, 226,
@@ -140,49 +119,47 @@
255,
};
-DECLARE_ALIGNED(16, static const int16_t, mrow_scan_8x32[256]) = {
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
- 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
- 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
- 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
- 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
- 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
- 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
- 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
- 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134,
- 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149,
- 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
- 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179,
- 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
- 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209,
- 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224,
- 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
- 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254,
+DECLARE_ALIGNED(16, static const int16_t, default_scan_32x8[256]) = {
+ 0, 1, 8, 2, 9, 16, 3, 10, 17, 24, 4, 11, 18, 25, 32,
+ 5, 12, 19, 26, 33, 40, 6, 13, 20, 27, 34, 41, 48, 7, 14,
+ 21, 28, 35, 42, 49, 56, 15, 22, 29, 36, 43, 50, 57, 64, 23,
+ 30, 37, 44, 51, 58, 65, 72, 31, 38, 45, 52, 59, 66, 73, 80,
+ 39, 46, 53, 60, 67, 74, 81, 88, 47, 54, 61, 68, 75, 82, 89,
+ 96, 55, 62, 69, 76, 83, 90, 97, 104, 63, 70, 77, 84, 91, 98,
+ 105, 112, 71, 78, 85, 92, 99, 106, 113, 120, 79, 86, 93, 100, 107,
+ 114, 121, 128, 87, 94, 101, 108, 115, 122, 129, 136, 95, 102, 109, 116,
+ 123, 130, 137, 144, 103, 110, 117, 124, 131, 138, 145, 152, 111, 118, 125,
+ 132, 139, 146, 153, 160, 119, 126, 133, 140, 147, 154, 161, 168, 127, 134,
+ 141, 148, 155, 162, 169, 176, 135, 142, 149, 156, 163, 170, 177, 184, 143,
+ 150, 157, 164, 171, 178, 185, 192, 151, 158, 165, 172, 179, 186, 193, 200,
+ 159, 166, 173, 180, 187, 194, 201, 208, 167, 174, 181, 188, 195, 202, 209,
+ 216, 175, 182, 189, 196, 203, 210, 217, 224, 183, 190, 197, 204, 211, 218,
+ 225, 232, 191, 198, 205, 212, 219, 226, 233, 240, 199, 206, 213, 220, 227,
+ 234, 241, 248, 207, 214, 221, 228, 235, 242, 249, 215, 222, 229, 236, 243,
+ 250, 223, 230, 237, 244, 251, 231, 238, 245, 252, 239, 246, 253, 247, 254,
255,
};
+DECLARE_ALIGNED(16, static const int16_t, mrow_scan_8x32[256]) = {
+ 0, 32, 64, 96, 128, 160, 192, 224, 1, 33, 65, 97, 129, 161, 193, 225,
+ 2, 34, 66, 98, 130, 162, 194, 226, 3, 35, 67, 99, 131, 163, 195, 227,
+ 4, 36, 68, 100, 132, 164, 196, 228, 5, 37, 69, 101, 133, 165, 197, 229,
+ 6, 38, 70, 102, 134, 166, 198, 230, 7, 39, 71, 103, 135, 167, 199, 231,
+ 8, 40, 72, 104, 136, 168, 200, 232, 9, 41, 73, 105, 137, 169, 201, 233,
+ 10, 42, 74, 106, 138, 170, 202, 234, 11, 43, 75, 107, 139, 171, 203, 235,
+ 12, 44, 76, 108, 140, 172, 204, 236, 13, 45, 77, 109, 141, 173, 205, 237,
+ 14, 46, 78, 110, 142, 174, 206, 238, 15, 47, 79, 111, 143, 175, 207, 239,
+ 16, 48, 80, 112, 144, 176, 208, 240, 17, 49, 81, 113, 145, 177, 209, 241,
+ 18, 50, 82, 114, 146, 178, 210, 242, 19, 51, 83, 115, 147, 179, 211, 243,
+ 20, 52, 84, 116, 148, 180, 212, 244, 21, 53, 85, 117, 149, 181, 213, 245,
+ 22, 54, 86, 118, 150, 182, 214, 246, 23, 55, 87, 119, 151, 183, 215, 247,
+ 24, 56, 88, 120, 152, 184, 216, 248, 25, 57, 89, 121, 153, 185, 217, 249,
+ 26, 58, 90, 122, 154, 186, 218, 250, 27, 59, 91, 123, 155, 187, 219, 251,
+ 28, 60, 92, 124, 156, 188, 220, 252, 29, 61, 93, 125, 157, 189, 221, 253,
+ 30, 62, 94, 126, 158, 190, 222, 254, 31, 63, 95, 127, 159, 191, 223, 255,
+};
+
DECLARE_ALIGNED(16, static const int16_t, mrow_scan_32x8[256]) = {
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
- 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
- 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
- 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
- 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
- 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
- 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
- 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
- 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134,
- 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149,
- 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
- 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179,
- 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
- 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209,
- 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224,
- 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
- 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254,
- 255,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, mcol_scan_8x32[256]) = {
0, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112,
120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232,
240, 248, 1, 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97,
@@ -203,47 +180,81 @@
255,
};
+DECLARE_ALIGNED(16, static const int16_t, mcol_scan_8x32[256]) = {
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
+ 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
+ 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
+ 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
+ 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
+ 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
+ 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
+ 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
+ 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134,
+ 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149,
+ 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
+ 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179,
+ 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
+ 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209,
+ 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224,
+ 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
+ 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254,
+ 255,
+};
+
DECLARE_ALIGNED(16, static const int16_t, mcol_scan_32x8[256]) = {
- 0, 32, 64, 96, 128, 160, 192, 224, 1, 33, 65, 97, 129, 161, 193, 225,
- 2, 34, 66, 98, 130, 162, 194, 226, 3, 35, 67, 99, 131, 163, 195, 227,
- 4, 36, 68, 100, 132, 164, 196, 228, 5, 37, 69, 101, 133, 165, 197, 229,
- 6, 38, 70, 102, 134, 166, 198, 230, 7, 39, 71, 103, 135, 167, 199, 231,
- 8, 40, 72, 104, 136, 168, 200, 232, 9, 41, 73, 105, 137, 169, 201, 233,
- 10, 42, 74, 106, 138, 170, 202, 234, 11, 43, 75, 107, 139, 171, 203, 235,
- 12, 44, 76, 108, 140, 172, 204, 236, 13, 45, 77, 109, 141, 173, 205, 237,
- 14, 46, 78, 110, 142, 174, 206, 238, 15, 47, 79, 111, 143, 175, 207, 239,
- 16, 48, 80, 112, 144, 176, 208, 240, 17, 49, 81, 113, 145, 177, 209, 241,
- 18, 50, 82, 114, 146, 178, 210, 242, 19, 51, 83, 115, 147, 179, 211, 243,
- 20, 52, 84, 116, 148, 180, 212, 244, 21, 53, 85, 117, 149, 181, 213, 245,
- 22, 54, 86, 118, 150, 182, 214, 246, 23, 55, 87, 119, 151, 183, 215, 247,
- 24, 56, 88, 120, 152, 184, 216, 248, 25, 57, 89, 121, 153, 185, 217, 249,
- 26, 58, 90, 122, 154, 186, 218, 250, 27, 59, 91, 123, 155, 187, 219, 251,
- 28, 60, 92, 124, 156, 188, 220, 252, 29, 61, 93, 125, 157, 189, 221, 253,
- 30, 62, 94, 126, 158, 190, 222, 254, 31, 63, 95, 127, 159, 191, 223, 255,
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
+ 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
+ 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
+ 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
+ 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
+ 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
+ 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
+ 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
+ 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134,
+ 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149,
+ 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
+ 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179,
+ 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
+ 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209,
+ 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224,
+ 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
+ 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254,
+ 255,
};
DECLARE_ALIGNED(16, static const int16_t, default_scan_8x8[64]) = {
- 0, 1, 8, 16, 9, 2, 3, 10, 17, 24, 32, 25, 18, 11, 4, 5,
- 12, 19, 26, 33, 40, 48, 41, 34, 27, 20, 13, 6, 7, 14, 21, 28,
- 35, 42, 49, 56, 57, 50, 43, 36, 29, 22, 15, 23, 30, 37, 44, 51,
- 58, 59, 52, 45, 38, 31, 39, 46, 53, 60, 61, 54, 47, 55, 62, 63
+ 0, 8, 1, 2, 9, 16, 24, 17, 10, 3, 4, 11, 18, 25, 32, 40,
+ 33, 26, 19, 12, 5, 6, 13, 20, 27, 34, 41, 48, 56, 49, 42, 35,
+ 28, 21, 14, 7, 15, 22, 29, 36, 43, 50, 57, 58, 51, 44, 37, 30,
+ 23, 31, 38, 45, 52, 59, 60, 53, 46, 39, 47, 54, 61, 62, 55, 63,
};
DECLARE_ALIGNED(16, static const int16_t, mcol_scan_8x8[64]) = {
- 0, 8, 16, 24, 32, 40, 48, 56, 1, 9, 17, 25, 33, 41, 49, 57,
- 2, 10, 18, 26, 34, 42, 50, 58, 3, 11, 19, 27, 35, 43, 51, 59,
- 4, 12, 20, 28, 36, 44, 52, 60, 5, 13, 21, 29, 37, 45, 53, 61,
- 6, 14, 22, 30, 38, 46, 54, 62, 7, 15, 23, 31, 39, 47, 55, 63,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, mrow_scan_8x8[64]) = {
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
};
+DECLARE_ALIGNED(16, static const int16_t, mrow_scan_8x8[64]) = {
+ 0, 8, 16, 24, 32, 40, 48, 56, 1, 9, 17, 25, 33, 41, 49, 57,
+ 2, 10, 18, 26, 34, 42, 50, 58, 3, 11, 19, 27, 35, 43, 51, 59,
+ 4, 12, 20, 28, 36, 44, 52, 60, 5, 13, 21, 29, 37, 45, 53, 61,
+ 6, 14, 22, 30, 38, 46, 54, 62, 7, 15, 23, 31, 39, 47, 55, 63,
+};
+
DECLARE_ALIGNED(16, static const int16_t, default_scan_8x16[128]) = {
+ 0, 16, 1, 32, 17, 2, 48, 33, 18, 3, 64, 49, 34, 19, 4, 80,
+ 65, 50, 35, 20, 5, 96, 81, 66, 51, 36, 21, 6, 112, 97, 82, 67,
+ 52, 37, 22, 7, 113, 98, 83, 68, 53, 38, 23, 8, 114, 99, 84, 69,
+ 54, 39, 24, 9, 115, 100, 85, 70, 55, 40, 25, 10, 116, 101, 86, 71,
+ 56, 41, 26, 11, 117, 102, 87, 72, 57, 42, 27, 12, 118, 103, 88, 73,
+ 58, 43, 28, 13, 119, 104, 89, 74, 59, 44, 29, 14, 120, 105, 90, 75,
+ 60, 45, 30, 15, 121, 106, 91, 76, 61, 46, 31, 122, 107, 92, 77, 62,
+ 47, 123, 108, 93, 78, 63, 124, 109, 94, 79, 125, 110, 95, 126, 111, 127,
+};
+
+DECLARE_ALIGNED(16, static const int16_t, default_scan_16x8[128]) = {
0, 1, 8, 2, 9, 16, 3, 10, 17, 24, 4, 11, 18, 25, 32,
5, 12, 19, 26, 33, 40, 6, 13, 20, 27, 34, 41, 48, 7, 14,
21, 28, 35, 42, 49, 56, 15, 22, 29, 36, 43, 50, 57, 64, 23,
@@ -255,29 +266,31 @@
117, 124, 111, 118, 125, 119, 126, 127,
};
-DECLARE_ALIGNED(16, static const int16_t, default_scan_16x8[128]) = {
- 0, 16, 1, 32, 17, 2, 48, 33, 18, 3, 64, 49, 34, 19, 4, 80,
- 65, 50, 35, 20, 5, 96, 81, 66, 51, 36, 21, 6, 112, 97, 82, 67,
- 52, 37, 22, 7, 113, 98, 83, 68, 53, 38, 23, 8, 114, 99, 84, 69,
- 54, 39, 24, 9, 115, 100, 85, 70, 55, 40, 25, 10, 116, 101, 86, 71,
- 56, 41, 26, 11, 117, 102, 87, 72, 57, 42, 27, 12, 118, 103, 88, 73,
- 58, 43, 28, 13, 119, 104, 89, 74, 59, 44, 29, 14, 120, 105, 90, 75,
- 60, 45, 30, 15, 121, 106, 91, 76, 61, 46, 31, 122, 107, 92, 77, 62,
- 47, 123, 108, 93, 78, 63, 124, 109, 94, 79, 125, 110, 95, 126, 111, 127,
-};
-
DECLARE_ALIGNED(16, static const int16_t, mcol_scan_8x16[128]) = {
- 0, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120,
- 1, 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 105, 113, 121,
- 2, 10, 18, 26, 34, 42, 50, 58, 66, 74, 82, 90, 98, 106, 114, 122,
- 3, 11, 19, 27, 35, 43, 51, 59, 67, 75, 83, 91, 99, 107, 115, 123,
- 4, 12, 20, 28, 36, 44, 52, 60, 68, 76, 84, 92, 100, 108, 116, 124,
- 5, 13, 21, 29, 37, 45, 53, 61, 69, 77, 85, 93, 101, 109, 117, 125,
- 6, 14, 22, 30, 38, 46, 54, 62, 70, 78, 86, 94, 102, 110, 118, 126,
- 7, 15, 23, 31, 39, 47, 55, 63, 71, 79, 87, 95, 103, 111, 119, 127,
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
+ 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
+ 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
+ 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
+ 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
+ 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
+ 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
+ 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
+ 120, 121, 122, 123, 124, 125, 126, 127,
};
DECLARE_ALIGNED(16, static const int16_t, mcol_scan_16x8[128]) = {
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
+ 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
+ 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
+ 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
+ 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
+ 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
+ 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
+ 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
+ 120, 121, 122, 123, 124, 125, 126, 127,
+};
+
+DECLARE_ALIGNED(16, static const int16_t, mrow_scan_8x16[128]) = {
0, 16, 32, 48, 64, 80, 96, 112, 1, 17, 33, 49, 65, 81, 97, 113,
2, 18, 34, 50, 66, 82, 98, 114, 3, 19, 35, 51, 67, 83, 99, 115,
4, 20, 36, 52, 68, 84, 100, 116, 5, 21, 37, 53, 69, 85, 101, 117,
@@ -288,69 +301,18 @@
14, 30, 46, 62, 78, 94, 110, 126, 15, 31, 47, 63, 79, 95, 111, 127,
};
-DECLARE_ALIGNED(16, static const int16_t, mrow_scan_8x16[128]) = {
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
- 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
- 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
- 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
- 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
- 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
- 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
- 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
- 120, 121, 122, 123, 124, 125, 126, 127,
-};
-
DECLARE_ALIGNED(16, static const int16_t, mrow_scan_16x8[128]) = {
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
- 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
- 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
- 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
- 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
- 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
- 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
- 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
- 120, 121, 122, 123, 124, 125, 126, 127,
+ 0, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120,
+ 1, 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 105, 113, 121,
+ 2, 10, 18, 26, 34, 42, 50, 58, 66, 74, 82, 90, 98, 106, 114, 122,
+ 3, 11, 19, 27, 35, 43, 51, 59, 67, 75, 83, 91, 99, 107, 115, 123,
+ 4, 12, 20, 28, 36, 44, 52, 60, 68, 76, 84, 92, 100, 108, 116, 124,
+ 5, 13, 21, 29, 37, 45, 53, 61, 69, 77, 85, 93, 101, 109, 117, 125,
+ 6, 14, 22, 30, 38, 46, 54, 62, 70, 78, 86, 94, 102, 110, 118, 126,
+ 7, 15, 23, 31, 39, 47, 55, 63, 71, 79, 87, 95, 103, 111, 119, 127,
};
DECLARE_ALIGNED(16, static const int16_t, default_scan_16x32[512]) = {
- 0, 1, 16, 2, 17, 32, 3, 18, 33, 48, 4, 19, 34, 49, 64,
- 5, 20, 35, 50, 65, 80, 6, 21, 36, 51, 66, 81, 96, 7, 22,
- 37, 52, 67, 82, 97, 112, 8, 23, 38, 53, 68, 83, 98, 113, 128,
- 9, 24, 39, 54, 69, 84, 99, 114, 129, 144, 10, 25, 40, 55, 70,
- 85, 100, 115, 130, 145, 160, 11, 26, 41, 56, 71, 86, 101, 116, 131,
- 146, 161, 176, 12, 27, 42, 57, 72, 87, 102, 117, 132, 147, 162, 177,
- 192, 13, 28, 43, 58, 73, 88, 103, 118, 133, 148, 163, 178, 193, 208,
- 14, 29, 44, 59, 74, 89, 104, 119, 134, 149, 164, 179, 194, 209, 224,
- 15, 30, 45, 60, 75, 90, 105, 120, 135, 150, 165, 180, 195, 210, 225,
- 240, 31, 46, 61, 76, 91, 106, 121, 136, 151, 166, 181, 196, 211, 226,
- 241, 256, 47, 62, 77, 92, 107, 122, 137, 152, 167, 182, 197, 212, 227,
- 242, 257, 272, 63, 78, 93, 108, 123, 138, 153, 168, 183, 198, 213, 228,
- 243, 258, 273, 288, 79, 94, 109, 124, 139, 154, 169, 184, 199, 214, 229,
- 244, 259, 274, 289, 304, 95, 110, 125, 140, 155, 170, 185, 200, 215, 230,
- 245, 260, 275, 290, 305, 320, 111, 126, 141, 156, 171, 186, 201, 216, 231,
- 246, 261, 276, 291, 306, 321, 336, 127, 142, 157, 172, 187, 202, 217, 232,
- 247, 262, 277, 292, 307, 322, 337, 352, 143, 158, 173, 188, 203, 218, 233,
- 248, 263, 278, 293, 308, 323, 338, 353, 368, 159, 174, 189, 204, 219, 234,
- 249, 264, 279, 294, 309, 324, 339, 354, 369, 384, 175, 190, 205, 220, 235,
- 250, 265, 280, 295, 310, 325, 340, 355, 370, 385, 400, 191, 206, 221, 236,
- 251, 266, 281, 296, 311, 326, 341, 356, 371, 386, 401, 416, 207, 222, 237,
- 252, 267, 282, 297, 312, 327, 342, 357, 372, 387, 402, 417, 432, 223, 238,
- 253, 268, 283, 298, 313, 328, 343, 358, 373, 388, 403, 418, 433, 448, 239,
- 254, 269, 284, 299, 314, 329, 344, 359, 374, 389, 404, 419, 434, 449, 464,
- 255, 270, 285, 300, 315, 330, 345, 360, 375, 390, 405, 420, 435, 450, 465,
- 480, 271, 286, 301, 316, 331, 346, 361, 376, 391, 406, 421, 436, 451, 466,
- 481, 496, 287, 302, 317, 332, 347, 362, 377, 392, 407, 422, 437, 452, 467,
- 482, 497, 303, 318, 333, 348, 363, 378, 393, 408, 423, 438, 453, 468, 483,
- 498, 319, 334, 349, 364, 379, 394, 409, 424, 439, 454, 469, 484, 499, 335,
- 350, 365, 380, 395, 410, 425, 440, 455, 470, 485, 500, 351, 366, 381, 396,
- 411, 426, 441, 456, 471, 486, 501, 367, 382, 397, 412, 427, 442, 457, 472,
- 487, 502, 383, 398, 413, 428, 443, 458, 473, 488, 503, 399, 414, 429, 444,
- 459, 474, 489, 504, 415, 430, 445, 460, 475, 490, 505, 431, 446, 461, 476,
- 491, 506, 447, 462, 477, 492, 507, 463, 478, 493, 508, 479, 494, 509, 495,
- 510, 511,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, default_scan_32x16[512]) = {
0, 32, 1, 64, 33, 2, 96, 65, 34, 3, 128, 97, 66, 35, 4,
160, 129, 98, 67, 36, 5, 192, 161, 130, 99, 68, 37, 6, 224, 193,
162, 131, 100, 69, 38, 7, 256, 225, 194, 163, 132, 101, 70, 39, 8,
@@ -388,7 +350,156 @@
479, 511,
};
+DECLARE_ALIGNED(16, static const int16_t, default_scan_32x16[512]) = {
+ 0, 1, 16, 2, 17, 32, 3, 18, 33, 48, 4, 19, 34, 49, 64,
+ 5, 20, 35, 50, 65, 80, 6, 21, 36, 51, 66, 81, 96, 7, 22,
+ 37, 52, 67, 82, 97, 112, 8, 23, 38, 53, 68, 83, 98, 113, 128,
+ 9, 24, 39, 54, 69, 84, 99, 114, 129, 144, 10, 25, 40, 55, 70,
+ 85, 100, 115, 130, 145, 160, 11, 26, 41, 56, 71, 86, 101, 116, 131,
+ 146, 161, 176, 12, 27, 42, 57, 72, 87, 102, 117, 132, 147, 162, 177,
+ 192, 13, 28, 43, 58, 73, 88, 103, 118, 133, 148, 163, 178, 193, 208,
+ 14, 29, 44, 59, 74, 89, 104, 119, 134, 149, 164, 179, 194, 209, 224,
+ 15, 30, 45, 60, 75, 90, 105, 120, 135, 150, 165, 180, 195, 210, 225,
+ 240, 31, 46, 61, 76, 91, 106, 121, 136, 151, 166, 181, 196, 211, 226,
+ 241, 256, 47, 62, 77, 92, 107, 122, 137, 152, 167, 182, 197, 212, 227,
+ 242, 257, 272, 63, 78, 93, 108, 123, 138, 153, 168, 183, 198, 213, 228,
+ 243, 258, 273, 288, 79, 94, 109, 124, 139, 154, 169, 184, 199, 214, 229,
+ 244, 259, 274, 289, 304, 95, 110, 125, 140, 155, 170, 185, 200, 215, 230,
+ 245, 260, 275, 290, 305, 320, 111, 126, 141, 156, 171, 186, 201, 216, 231,
+ 246, 261, 276, 291, 306, 321, 336, 127, 142, 157, 172, 187, 202, 217, 232,
+ 247, 262, 277, 292, 307, 322, 337, 352, 143, 158, 173, 188, 203, 218, 233,
+ 248, 263, 278, 293, 308, 323, 338, 353, 368, 159, 174, 189, 204, 219, 234,
+ 249, 264, 279, 294, 309, 324, 339, 354, 369, 384, 175, 190, 205, 220, 235,
+ 250, 265, 280, 295, 310, 325, 340, 355, 370, 385, 400, 191, 206, 221, 236,
+ 251, 266, 281, 296, 311, 326, 341, 356, 371, 386, 401, 416, 207, 222, 237,
+ 252, 267, 282, 297, 312, 327, 342, 357, 372, 387, 402, 417, 432, 223, 238,
+ 253, 268, 283, 298, 313, 328, 343, 358, 373, 388, 403, 418, 433, 448, 239,
+ 254, 269, 284, 299, 314, 329, 344, 359, 374, 389, 404, 419, 434, 449, 464,
+ 255, 270, 285, 300, 315, 330, 345, 360, 375, 390, 405, 420, 435, 450, 465,
+ 480, 271, 286, 301, 316, 331, 346, 361, 376, 391, 406, 421, 436, 451, 466,
+ 481, 496, 287, 302, 317, 332, 347, 362, 377, 392, 407, 422, 437, 452, 467,
+ 482, 497, 303, 318, 333, 348, 363, 378, 393, 408, 423, 438, 453, 468, 483,
+ 498, 319, 334, 349, 364, 379, 394, 409, 424, 439, 454, 469, 484, 499, 335,
+ 350, 365, 380, 395, 410, 425, 440, 455, 470, 485, 500, 351, 366, 381, 396,
+ 411, 426, 441, 456, 471, 486, 501, 367, 382, 397, 412, 427, 442, 457, 472,
+ 487, 502, 383, 398, 413, 428, 443, 458, 473, 488, 503, 399, 414, 429, 444,
+ 459, 474, 489, 504, 415, 430, 445, 460, 475, 490, 505, 431, 446, 461, 476,
+ 491, 506, 447, 462, 477, 492, 507, 463, 478, 493, 508, 479, 494, 509, 495,
+ 510, 511,
+};
+
DECLARE_ALIGNED(16, static const int16_t, mcol_scan_16x32[512]) = {
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
+ 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
+ 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
+ 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
+ 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
+ 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
+ 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
+ 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
+ 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134,
+ 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149,
+ 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
+ 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179,
+ 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
+ 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209,
+ 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224,
+ 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
+ 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254,
+ 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269,
+ 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284,
+ 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299,
+ 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314,
+ 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329,
+ 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344,
+ 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359,
+ 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374,
+ 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389,
+ 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404,
+ 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419,
+ 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434,
+ 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449,
+ 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464,
+ 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479,
+ 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494,
+ 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509,
+ 510, 511,
+};
+
+DECLARE_ALIGNED(16, static const int16_t, mcol_scan_32x16[512]) = {
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
+ 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
+ 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
+ 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
+ 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
+ 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
+ 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
+ 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
+ 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134,
+ 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149,
+ 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
+ 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179,
+ 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
+ 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209,
+ 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224,
+ 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
+ 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254,
+ 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269,
+ 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284,
+ 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299,
+ 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314,
+ 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329,
+ 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344,
+ 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359,
+ 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374,
+ 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389,
+ 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404,
+ 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419,
+ 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434,
+ 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449,
+ 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464,
+ 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479,
+ 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494,
+ 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509,
+ 510, 511,
+};
+
+DECLARE_ALIGNED(16, static const int16_t, mrow_scan_16x32[512]) = {
+ 0, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416, 448, 480,
+ 1, 33, 65, 97, 129, 161, 193, 225, 257, 289, 321, 353, 385, 417, 449, 481,
+ 2, 34, 66, 98, 130, 162, 194, 226, 258, 290, 322, 354, 386, 418, 450, 482,
+ 3, 35, 67, 99, 131, 163, 195, 227, 259, 291, 323, 355, 387, 419, 451, 483,
+ 4, 36, 68, 100, 132, 164, 196, 228, 260, 292, 324, 356, 388, 420, 452, 484,
+ 5, 37, 69, 101, 133, 165, 197, 229, 261, 293, 325, 357, 389, 421, 453, 485,
+ 6, 38, 70, 102, 134, 166, 198, 230, 262, 294, 326, 358, 390, 422, 454, 486,
+ 7, 39, 71, 103, 135, 167, 199, 231, 263, 295, 327, 359, 391, 423, 455, 487,
+ 8, 40, 72, 104, 136, 168, 200, 232, 264, 296, 328, 360, 392, 424, 456, 488,
+ 9, 41, 73, 105, 137, 169, 201, 233, 265, 297, 329, 361, 393, 425, 457, 489,
+ 10, 42, 74, 106, 138, 170, 202, 234, 266, 298, 330, 362, 394, 426, 458, 490,
+ 11, 43, 75, 107, 139, 171, 203, 235, 267, 299, 331, 363, 395, 427, 459, 491,
+ 12, 44, 76, 108, 140, 172, 204, 236, 268, 300, 332, 364, 396, 428, 460, 492,
+ 13, 45, 77, 109, 141, 173, 205, 237, 269, 301, 333, 365, 397, 429, 461, 493,
+ 14, 46, 78, 110, 142, 174, 206, 238, 270, 302, 334, 366, 398, 430, 462, 494,
+ 15, 47, 79, 111, 143, 175, 207, 239, 271, 303, 335, 367, 399, 431, 463, 495,
+ 16, 48, 80, 112, 144, 176, 208, 240, 272, 304, 336, 368, 400, 432, 464, 496,
+ 17, 49, 81, 113, 145, 177, 209, 241, 273, 305, 337, 369, 401, 433, 465, 497,
+ 18, 50, 82, 114, 146, 178, 210, 242, 274, 306, 338, 370, 402, 434, 466, 498,
+ 19, 51, 83, 115, 147, 179, 211, 243, 275, 307, 339, 371, 403, 435, 467, 499,
+ 20, 52, 84, 116, 148, 180, 212, 244, 276, 308, 340, 372, 404, 436, 468, 500,
+ 21, 53, 85, 117, 149, 181, 213, 245, 277, 309, 341, 373, 405, 437, 469, 501,
+ 22, 54, 86, 118, 150, 182, 214, 246, 278, 310, 342, 374, 406, 438, 470, 502,
+ 23, 55, 87, 119, 151, 183, 215, 247, 279, 311, 343, 375, 407, 439, 471, 503,
+ 24, 56, 88, 120, 152, 184, 216, 248, 280, 312, 344, 376, 408, 440, 472, 504,
+ 25, 57, 89, 121, 153, 185, 217, 249, 281, 313, 345, 377, 409, 441, 473, 505,
+ 26, 58, 90, 122, 154, 186, 218, 250, 282, 314, 346, 378, 410, 442, 474, 506,
+ 27, 59, 91, 123, 155, 187, 219, 251, 283, 315, 347, 379, 411, 443, 475, 507,
+ 28, 60, 92, 124, 156, 188, 220, 252, 284, 316, 348, 380, 412, 444, 476, 508,
+ 29, 61, 93, 125, 157, 189, 221, 253, 285, 317, 349, 381, 413, 445, 477, 509,
+ 30, 62, 94, 126, 158, 190, 222, 254, 286, 318, 350, 382, 414, 446, 478, 510,
+ 31, 63, 95, 127, 159, 191, 223, 255, 287, 319, 351, 383, 415, 447, 479, 511,
+};
+
+DECLARE_ALIGNED(16, static const int16_t, mrow_scan_32x16[512]) = {
0, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224,
240, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464,
480, 496, 1, 17, 33, 49, 65, 81, 97, 113, 129, 145, 161, 177, 193,
@@ -426,158 +537,28 @@
495, 511,
};
-DECLARE_ALIGNED(16, static const int16_t, mcol_scan_32x16[512]) = {
- 0, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416, 448, 480,
- 1, 33, 65, 97, 129, 161, 193, 225, 257, 289, 321, 353, 385, 417, 449, 481,
- 2, 34, 66, 98, 130, 162, 194, 226, 258, 290, 322, 354, 386, 418, 450, 482,
- 3, 35, 67, 99, 131, 163, 195, 227, 259, 291, 323, 355, 387, 419, 451, 483,
- 4, 36, 68, 100, 132, 164, 196, 228, 260, 292, 324, 356, 388, 420, 452, 484,
- 5, 37, 69, 101, 133, 165, 197, 229, 261, 293, 325, 357, 389, 421, 453, 485,
- 6, 38, 70, 102, 134, 166, 198, 230, 262, 294, 326, 358, 390, 422, 454, 486,
- 7, 39, 71, 103, 135, 167, 199, 231, 263, 295, 327, 359, 391, 423, 455, 487,
- 8, 40, 72, 104, 136, 168, 200, 232, 264, 296, 328, 360, 392, 424, 456, 488,
- 9, 41, 73, 105, 137, 169, 201, 233, 265, 297, 329, 361, 393, 425, 457, 489,
- 10, 42, 74, 106, 138, 170, 202, 234, 266, 298, 330, 362, 394, 426, 458, 490,
- 11, 43, 75, 107, 139, 171, 203, 235, 267, 299, 331, 363, 395, 427, 459, 491,
- 12, 44, 76, 108, 140, 172, 204, 236, 268, 300, 332, 364, 396, 428, 460, 492,
- 13, 45, 77, 109, 141, 173, 205, 237, 269, 301, 333, 365, 397, 429, 461, 493,
- 14, 46, 78, 110, 142, 174, 206, 238, 270, 302, 334, 366, 398, 430, 462, 494,
- 15, 47, 79, 111, 143, 175, 207, 239, 271, 303, 335, 367, 399, 431, 463, 495,
- 16, 48, 80, 112, 144, 176, 208, 240, 272, 304, 336, 368, 400, 432, 464, 496,
- 17, 49, 81, 113, 145, 177, 209, 241, 273, 305, 337, 369, 401, 433, 465, 497,
- 18, 50, 82, 114, 146, 178, 210, 242, 274, 306, 338, 370, 402, 434, 466, 498,
- 19, 51, 83, 115, 147, 179, 211, 243, 275, 307, 339, 371, 403, 435, 467, 499,
- 20, 52, 84, 116, 148, 180, 212, 244, 276, 308, 340, 372, 404, 436, 468, 500,
- 21, 53, 85, 117, 149, 181, 213, 245, 277, 309, 341, 373, 405, 437, 469, 501,
- 22, 54, 86, 118, 150, 182, 214, 246, 278, 310, 342, 374, 406, 438, 470, 502,
- 23, 55, 87, 119, 151, 183, 215, 247, 279, 311, 343, 375, 407, 439, 471, 503,
- 24, 56, 88, 120, 152, 184, 216, 248, 280, 312, 344, 376, 408, 440, 472, 504,
- 25, 57, 89, 121, 153, 185, 217, 249, 281, 313, 345, 377, 409, 441, 473, 505,
- 26, 58, 90, 122, 154, 186, 218, 250, 282, 314, 346, 378, 410, 442, 474, 506,
- 27, 59, 91, 123, 155, 187, 219, 251, 283, 315, 347, 379, 411, 443, 475, 507,
- 28, 60, 92, 124, 156, 188, 220, 252, 284, 316, 348, 380, 412, 444, 476, 508,
- 29, 61, 93, 125, 157, 189, 221, 253, 285, 317, 349, 381, 413, 445, 477, 509,
- 30, 62, 94, 126, 158, 190, 222, 254, 286, 318, 350, 382, 414, 446, 478, 510,
- 31, 63, 95, 127, 159, 191, 223, 255, 287, 319, 351, 383, 415, 447, 479, 511,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, mrow_scan_16x32[512]) = {
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
- 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
- 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
- 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
- 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
- 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
- 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
- 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
- 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134,
- 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149,
- 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
- 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179,
- 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
- 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209,
- 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224,
- 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
- 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254,
- 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269,
- 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284,
- 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299,
- 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314,
- 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329,
- 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344,
- 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359,
- 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374,
- 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389,
- 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404,
- 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419,
- 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434,
- 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449,
- 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464,
- 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479,
- 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494,
- 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509,
- 510, 511,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, mrow_scan_32x16[512]) = {
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
- 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
- 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
- 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
- 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
- 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
- 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
- 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
- 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134,
- 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149,
- 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
- 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179,
- 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
- 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209,
- 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224,
- 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
- 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254,
- 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269,
- 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284,
- 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299,
- 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314,
- 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329,
- 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344,
- 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359,
- 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374,
- 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389,
- 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404,
- 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419,
- 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434,
- 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449,
- 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464,
- 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479,
- 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494,
- 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509,
- 510, 511,
-};
-
DECLARE_ALIGNED(16, static const int16_t, default_scan_16x16[256]) = {
- 0, 1, 16, 32, 17, 2, 3, 18, 33, 48, 64, 49, 34, 19, 4,
- 5, 20, 35, 50, 65, 80, 96, 81, 66, 51, 36, 21, 6, 7, 22,
- 37, 52, 67, 82, 97, 112, 128, 113, 98, 83, 68, 53, 38, 23, 8,
- 9, 24, 39, 54, 69, 84, 99, 114, 129, 144, 160, 145, 130, 115, 100,
- 85, 70, 55, 40, 25, 10, 11, 26, 41, 56, 71, 86, 101, 116, 131,
- 146, 161, 176, 192, 177, 162, 147, 132, 117, 102, 87, 72, 57, 42, 27,
- 12, 13, 28, 43, 58, 73, 88, 103, 118, 133, 148, 163, 178, 193, 208,
- 224, 209, 194, 179, 164, 149, 134, 119, 104, 89, 74, 59, 44, 29, 14,
- 15, 30, 45, 60, 75, 90, 105, 120, 135, 150, 165, 180, 195, 210, 225,
- 240, 241, 226, 211, 196, 181, 166, 151, 136, 121, 106, 91, 76, 61, 46,
- 31, 47, 62, 77, 92, 107, 122, 137, 152, 167, 182, 197, 212, 227, 242,
- 243, 228, 213, 198, 183, 168, 153, 138, 123, 108, 93, 78, 63, 79, 94,
- 109, 124, 139, 154, 169, 184, 199, 214, 229, 244, 245, 230, 215, 200, 185,
- 170, 155, 140, 125, 110, 95, 111, 126, 141, 156, 171, 186, 201, 216, 231,
- 246, 247, 232, 217, 202, 187, 172, 157, 142, 127, 143, 158, 173, 188, 203,
- 218, 233, 248, 249, 234, 219, 204, 189, 174, 159, 175, 190, 205, 220, 235,
- 250, 251, 236, 221, 206, 191, 207, 222, 237, 252, 253, 238, 223, 239, 254,
- 255
+ 0, 16, 1, 2, 17, 32, 48, 33, 18, 3, 4, 19, 34, 49, 64,
+ 80, 65, 50, 35, 20, 5, 6, 21, 36, 51, 66, 81, 96, 112, 97,
+ 82, 67, 52, 37, 22, 7, 8, 23, 38, 53, 68, 83, 98, 113, 128,
+ 144, 129, 114, 99, 84, 69, 54, 39, 24, 9, 10, 25, 40, 55, 70,
+ 85, 100, 115, 130, 145, 160, 176, 161, 146, 131, 116, 101, 86, 71, 56,
+ 41, 26, 11, 12, 27, 42, 57, 72, 87, 102, 117, 132, 147, 162, 177,
+ 192, 208, 193, 178, 163, 148, 133, 118, 103, 88, 73, 58, 43, 28, 13,
+ 14, 29, 44, 59, 74, 89, 104, 119, 134, 149, 164, 179, 194, 209, 224,
+ 240, 225, 210, 195, 180, 165, 150, 135, 120, 105, 90, 75, 60, 45, 30,
+ 15, 31, 46, 61, 76, 91, 106, 121, 136, 151, 166, 181, 196, 211, 226,
+ 241, 242, 227, 212, 197, 182, 167, 152, 137, 122, 107, 92, 77, 62, 47,
+ 63, 78, 93, 108, 123, 138, 153, 168, 183, 198, 213, 228, 243, 244, 229,
+ 214, 199, 184, 169, 154, 139, 124, 109, 94, 79, 95, 110, 125, 140, 155,
+ 170, 185, 200, 215, 230, 245, 246, 231, 216, 201, 186, 171, 156, 141, 126,
+ 111, 127, 142, 157, 172, 187, 202, 217, 232, 247, 248, 233, 218, 203, 188,
+ 173, 158, 143, 159, 174, 189, 204, 219, 234, 249, 250, 235, 220, 205, 190,
+ 175, 191, 206, 221, 236, 251, 252, 237, 222, 207, 223, 238, 253, 254, 239,
+ 255,
};
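+/* Reviewer note: default_scan_* is the up-right diagonal (zig-zag) order.
+ * The replacement bodies for the square sizes read as the entry-by-entry
+ * transposes of the removed ones (e.g. the 16x16 order now opens
+ * 0, 16, 1, ... instead of 0, 1, 16, ...), consistent with a transposed
+ * coefficient layout. */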
DECLARE_ALIGNED(16, static const int16_t, mcol_scan_16x16[256]) = {
- 0, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240,
- 1, 17, 33, 49, 65, 81, 97, 113, 129, 145, 161, 177, 193, 209, 225, 241,
- 2, 18, 34, 50, 66, 82, 98, 114, 130, 146, 162, 178, 194, 210, 226, 242,
- 3, 19, 35, 51, 67, 83, 99, 115, 131, 147, 163, 179, 195, 211, 227, 243,
- 4, 20, 36, 52, 68, 84, 100, 116, 132, 148, 164, 180, 196, 212, 228, 244,
- 5, 21, 37, 53, 69, 85, 101, 117, 133, 149, 165, 181, 197, 213, 229, 245,
- 6, 22, 38, 54, 70, 86, 102, 118, 134, 150, 166, 182, 198, 214, 230, 246,
- 7, 23, 39, 55, 71, 87, 103, 119, 135, 151, 167, 183, 199, 215, 231, 247,
- 8, 24, 40, 56, 72, 88, 104, 120, 136, 152, 168, 184, 200, 216, 232, 248,
- 9, 25, 41, 57, 73, 89, 105, 121, 137, 153, 169, 185, 201, 217, 233, 249,
- 10, 26, 42, 58, 74, 90, 106, 122, 138, 154, 170, 186, 202, 218, 234, 250,
- 11, 27, 43, 59, 75, 91, 107, 123, 139, 155, 171, 187, 203, 219, 235, 251,
- 12, 28, 44, 60, 76, 92, 108, 124, 140, 156, 172, 188, 204, 220, 236, 252,
- 13, 29, 45, 61, 77, 93, 109, 125, 141, 157, 173, 189, 205, 221, 237, 253,
- 14, 30, 46, 62, 78, 94, 110, 126, 142, 158, 174, 190, 206, 222, 238, 254,
- 15, 31, 47, 63, 79, 95, 111, 127, 143, 159, 175, 191, 207, 223, 239, 255,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, mrow_scan_16x16[256]) = {
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
@@ -598,84 +579,26 @@
255,
};
-DECLARE_ALIGNED(16, static const int16_t, mcol_scan_32x32[1024]) = {
- 0, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416,
- 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800, 832, 864,
- 896, 928, 960, 992, 1, 33, 65, 97, 129, 161, 193, 225, 257, 289,
- 321, 353, 385, 417, 449, 481, 513, 545, 577, 609, 641, 673, 705, 737,
- 769, 801, 833, 865, 897, 929, 961, 993, 2, 34, 66, 98, 130, 162,
- 194, 226, 258, 290, 322, 354, 386, 418, 450, 482, 514, 546, 578, 610,
- 642, 674, 706, 738, 770, 802, 834, 866, 898, 930, 962, 994, 3, 35,
- 67, 99, 131, 163, 195, 227, 259, 291, 323, 355, 387, 419, 451, 483,
- 515, 547, 579, 611, 643, 675, 707, 739, 771, 803, 835, 867, 899, 931,
- 963, 995, 4, 36, 68, 100, 132, 164, 196, 228, 260, 292, 324, 356,
- 388, 420, 452, 484, 516, 548, 580, 612, 644, 676, 708, 740, 772, 804,
- 836, 868, 900, 932, 964, 996, 5, 37, 69, 101, 133, 165, 197, 229,
- 261, 293, 325, 357, 389, 421, 453, 485, 517, 549, 581, 613, 645, 677,
- 709, 741, 773, 805, 837, 869, 901, 933, 965, 997, 6, 38, 70, 102,
- 134, 166, 198, 230, 262, 294, 326, 358, 390, 422, 454, 486, 518, 550,
- 582, 614, 646, 678, 710, 742, 774, 806, 838, 870, 902, 934, 966, 998,
- 7, 39, 71, 103, 135, 167, 199, 231, 263, 295, 327, 359, 391, 423,
- 455, 487, 519, 551, 583, 615, 647, 679, 711, 743, 775, 807, 839, 871,
- 903, 935, 967, 999, 8, 40, 72, 104, 136, 168, 200, 232, 264, 296,
- 328, 360, 392, 424, 456, 488, 520, 552, 584, 616, 648, 680, 712, 744,
- 776, 808, 840, 872, 904, 936, 968, 1000, 9, 41, 73, 105, 137, 169,
- 201, 233, 265, 297, 329, 361, 393, 425, 457, 489, 521, 553, 585, 617,
- 649, 681, 713, 745, 777, 809, 841, 873, 905, 937, 969, 1001, 10, 42,
- 74, 106, 138, 170, 202, 234, 266, 298, 330, 362, 394, 426, 458, 490,
- 522, 554, 586, 618, 650, 682, 714, 746, 778, 810, 842, 874, 906, 938,
- 970, 1002, 11, 43, 75, 107, 139, 171, 203, 235, 267, 299, 331, 363,
- 395, 427, 459, 491, 523, 555, 587, 619, 651, 683, 715, 747, 779, 811,
- 843, 875, 907, 939, 971, 1003, 12, 44, 76, 108, 140, 172, 204, 236,
- 268, 300, 332, 364, 396, 428, 460, 492, 524, 556, 588, 620, 652, 684,
- 716, 748, 780, 812, 844, 876, 908, 940, 972, 1004, 13, 45, 77, 109,
- 141, 173, 205, 237, 269, 301, 333, 365, 397, 429, 461, 493, 525, 557,
- 589, 621, 653, 685, 717, 749, 781, 813, 845, 877, 909, 941, 973, 1005,
- 14, 46, 78, 110, 142, 174, 206, 238, 270, 302, 334, 366, 398, 430,
- 462, 494, 526, 558, 590, 622, 654, 686, 718, 750, 782, 814, 846, 878,
- 910, 942, 974, 1006, 15, 47, 79, 111, 143, 175, 207, 239, 271, 303,
- 335, 367, 399, 431, 463, 495, 527, 559, 591, 623, 655, 687, 719, 751,
- 783, 815, 847, 879, 911, 943, 975, 1007, 16, 48, 80, 112, 144, 176,
- 208, 240, 272, 304, 336, 368, 400, 432, 464, 496, 528, 560, 592, 624,
- 656, 688, 720, 752, 784, 816, 848, 880, 912, 944, 976, 1008, 17, 49,
- 81, 113, 145, 177, 209, 241, 273, 305, 337, 369, 401, 433, 465, 497,
- 529, 561, 593, 625, 657, 689, 721, 753, 785, 817, 849, 881, 913, 945,
- 977, 1009, 18, 50, 82, 114, 146, 178, 210, 242, 274, 306, 338, 370,
- 402, 434, 466, 498, 530, 562, 594, 626, 658, 690, 722, 754, 786, 818,
- 850, 882, 914, 946, 978, 1010, 19, 51, 83, 115, 147, 179, 211, 243,
- 275, 307, 339, 371, 403, 435, 467, 499, 531, 563, 595, 627, 659, 691,
- 723, 755, 787, 819, 851, 883, 915, 947, 979, 1011, 20, 52, 84, 116,
- 148, 180, 212, 244, 276, 308, 340, 372, 404, 436, 468, 500, 532, 564,
- 596, 628, 660, 692, 724, 756, 788, 820, 852, 884, 916, 948, 980, 1012,
- 21, 53, 85, 117, 149, 181, 213, 245, 277, 309, 341, 373, 405, 437,
- 469, 501, 533, 565, 597, 629, 661, 693, 725, 757, 789, 821, 853, 885,
- 917, 949, 981, 1013, 22, 54, 86, 118, 150, 182, 214, 246, 278, 310,
- 342, 374, 406, 438, 470, 502, 534, 566, 598, 630, 662, 694, 726, 758,
- 790, 822, 854, 886, 918, 950, 982, 1014, 23, 55, 87, 119, 151, 183,
- 215, 247, 279, 311, 343, 375, 407, 439, 471, 503, 535, 567, 599, 631,
- 663, 695, 727, 759, 791, 823, 855, 887, 919, 951, 983, 1015, 24, 56,
- 88, 120, 152, 184, 216, 248, 280, 312, 344, 376, 408, 440, 472, 504,
- 536, 568, 600, 632, 664, 696, 728, 760, 792, 824, 856, 888, 920, 952,
- 984, 1016, 25, 57, 89, 121, 153, 185, 217, 249, 281, 313, 345, 377,
- 409, 441, 473, 505, 537, 569, 601, 633, 665, 697, 729, 761, 793, 825,
- 857, 889, 921, 953, 985, 1017, 26, 58, 90, 122, 154, 186, 218, 250,
- 282, 314, 346, 378, 410, 442, 474, 506, 538, 570, 602, 634, 666, 698,
- 730, 762, 794, 826, 858, 890, 922, 954, 986, 1018, 27, 59, 91, 123,
- 155, 187, 219, 251, 283, 315, 347, 379, 411, 443, 475, 507, 539, 571,
- 603, 635, 667, 699, 731, 763, 795, 827, 859, 891, 923, 955, 987, 1019,
- 28, 60, 92, 124, 156, 188, 220, 252, 284, 316, 348, 380, 412, 444,
- 476, 508, 540, 572, 604, 636, 668, 700, 732, 764, 796, 828, 860, 892,
- 924, 956, 988, 1020, 29, 61, 93, 125, 157, 189, 221, 253, 285, 317,
- 349, 381, 413, 445, 477, 509, 541, 573, 605, 637, 669, 701, 733, 765,
- 797, 829, 861, 893, 925, 957, 989, 1021, 30, 62, 94, 126, 158, 190,
- 222, 254, 286, 318, 350, 382, 414, 446, 478, 510, 542, 574, 606, 638,
- 670, 702, 734, 766, 798, 830, 862, 894, 926, 958, 990, 1022, 31, 63,
- 95, 127, 159, 191, 223, 255, 287, 319, 351, 383, 415, 447, 479, 511,
- 543, 575, 607, 639, 671, 703, 735, 767, 799, 831, 863, 895, 927, 959,
- 991, 1023,
+DECLARE_ALIGNED(16, static const int16_t, mrow_scan_16x16[256]) = {
+ 0, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240,
+ 1, 17, 33, 49, 65, 81, 97, 113, 129, 145, 161, 177, 193, 209, 225, 241,
+ 2, 18, 34, 50, 66, 82, 98, 114, 130, 146, 162, 178, 194, 210, 226, 242,
+ 3, 19, 35, 51, 67, 83, 99, 115, 131, 147, 163, 179, 195, 211, 227, 243,
+ 4, 20, 36, 52, 68, 84, 100, 116, 132, 148, 164, 180, 196, 212, 228, 244,
+ 5, 21, 37, 53, 69, 85, 101, 117, 133, 149, 165, 181, 197, 213, 229, 245,
+ 6, 22, 38, 54, 70, 86, 102, 118, 134, 150, 166, 182, 198, 214, 230, 246,
+ 7, 23, 39, 55, 71, 87, 103, 119, 135, 151, 167, 183, 199, 215, 231, 247,
+ 8, 24, 40, 56, 72, 88, 104, 120, 136, 152, 168, 184, 200, 216, 232, 248,
+ 9, 25, 41, 57, 73, 89, 105, 121, 137, 153, 169, 185, 201, 217, 233, 249,
+ 10, 26, 42, 58, 74, 90, 106, 122, 138, 154, 170, 186, 202, 218, 234, 250,
+ 11, 27, 43, 59, 75, 91, 107, 123, 139, 155, 171, 187, 203, 219, 235, 251,
+ 12, 28, 44, 60, 76, 92, 108, 124, 140, 156, 172, 188, 204, 220, 236, 252,
+ 13, 29, 45, 61, 77, 93, 109, 125, 141, 157, 173, 189, 205, 221, 237, 253,
+ 14, 30, 46, 62, 78, 94, 110, 126, 142, 158, 174, 190, 206, 222, 238, 254,
+ 15, 31, 47, 63, 79, 95, 111, 127, 143, 159, 175, 191, 207, 223, 239, 255,
};
-DECLARE_ALIGNED(16, static const int16_t, mrow_scan_32x32[1024]) = {
+DECLARE_ALIGNED(16, static const int16_t, mcol_scan_32x32[1024]) = {
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
@@ -757,194 +680,250 @@
1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023,
};
-DECLARE_ALIGNED(16, static const int16_t, default_scan_32x32[1024]) = {
- 0, 1, 32, 64, 33, 2, 3, 34, 65, 96, 128, 97, 66,
- 35, 4, 5, 36, 67, 98, 129, 160, 192, 161, 130, 99, 68,
- 37, 6, 7, 38, 69, 100, 131, 162, 193, 224, 256, 225, 194,
- 163, 132, 101, 70, 39, 8, 9, 40, 71, 102, 133, 164, 195,
- 226, 257, 288, 320, 289, 258, 227, 196, 165, 134, 103, 72, 41,
- 10, 11, 42, 73, 104, 135, 166, 197, 228, 259, 290, 321, 352,
- 384, 353, 322, 291, 260, 229, 198, 167, 136, 105, 74, 43, 12,
- 13, 44, 75, 106, 137, 168, 199, 230, 261, 292, 323, 354, 385,
- 416, 448, 417, 386, 355, 324, 293, 262, 231, 200, 169, 138, 107,
- 76, 45, 14, 15, 46, 77, 108, 139, 170, 201, 232, 263, 294,
- 325, 356, 387, 418, 449, 480, 512, 481, 450, 419, 388, 357, 326,
- 295, 264, 233, 202, 171, 140, 109, 78, 47, 16, 17, 48, 79,
- 110, 141, 172, 203, 234, 265, 296, 327, 358, 389, 420, 451, 482,
- 513, 544, 576, 545, 514, 483, 452, 421, 390, 359, 328, 297, 266,
- 235, 204, 173, 142, 111, 80, 49, 18, 19, 50, 81, 112, 143,
- 174, 205, 236, 267, 298, 329, 360, 391, 422, 453, 484, 515, 546,
- 577, 608, 640, 609, 578, 547, 516, 485, 454, 423, 392, 361, 330,
- 299, 268, 237, 206, 175, 144, 113, 82, 51, 20, 21, 52, 83,
- 114, 145, 176, 207, 238, 269, 300, 331, 362, 393, 424, 455, 486,
- 517, 548, 579, 610, 641, 672, 704, 673, 642, 611, 580, 549, 518,
- 487, 456, 425, 394, 363, 332, 301, 270, 239, 208, 177, 146, 115,
- 84, 53, 22, 23, 54, 85, 116, 147, 178, 209, 240, 271, 302,
- 333, 364, 395, 426, 457, 488, 519, 550, 581, 612, 643, 674, 705,
- 736, 768, 737, 706, 675, 644, 613, 582, 551, 520, 489, 458, 427,
- 396, 365, 334, 303, 272, 241, 210, 179, 148, 117, 86, 55, 24,
- 25, 56, 87, 118, 149, 180, 211, 242, 273, 304, 335, 366, 397,
- 428, 459, 490, 521, 552, 583, 614, 645, 676, 707, 738, 769, 800,
- 832, 801, 770, 739, 708, 677, 646, 615, 584, 553, 522, 491, 460,
- 429, 398, 367, 336, 305, 274, 243, 212, 181, 150, 119, 88, 57,
- 26, 27, 58, 89, 120, 151, 182, 213, 244, 275, 306, 337, 368,
- 399, 430, 461, 492, 523, 554, 585, 616, 647, 678, 709, 740, 771,
- 802, 833, 864, 896, 865, 834, 803, 772, 741, 710, 679, 648, 617,
- 586, 555, 524, 493, 462, 431, 400, 369, 338, 307, 276, 245, 214,
- 183, 152, 121, 90, 59, 28, 29, 60, 91, 122, 153, 184, 215,
- 246, 277, 308, 339, 370, 401, 432, 463, 494, 525, 556, 587, 618,
- 649, 680, 711, 742, 773, 804, 835, 866, 897, 928, 960, 929, 898,
- 867, 836, 805, 774, 743, 712, 681, 650, 619, 588, 557, 526, 495,
- 464, 433, 402, 371, 340, 309, 278, 247, 216, 185, 154, 123, 92,
- 61, 30, 31, 62, 93, 124, 155, 186, 217, 248, 279, 310, 341,
- 372, 403, 434, 465, 496, 527, 558, 589, 620, 651, 682, 713, 744,
- 775, 806, 837, 868, 899, 930, 961, 992, 993, 962, 931, 900, 869,
- 838, 807, 776, 745, 714, 683, 652, 621, 590, 559, 528, 497, 466,
- 435, 404, 373, 342, 311, 280, 249, 218, 187, 156, 125, 94, 63,
- 95, 126, 157, 188, 219, 250, 281, 312, 343, 374, 405, 436, 467,
- 498, 529, 560, 591, 622, 653, 684, 715, 746, 777, 808, 839, 870,
- 901, 932, 963, 994, 995, 964, 933, 902, 871, 840, 809, 778, 747,
- 716, 685, 654, 623, 592, 561, 530, 499, 468, 437, 406, 375, 344,
- 313, 282, 251, 220, 189, 158, 127, 159, 190, 221, 252, 283, 314,
- 345, 376, 407, 438, 469, 500, 531, 562, 593, 624, 655, 686, 717,
- 748, 779, 810, 841, 872, 903, 934, 965, 996, 997, 966, 935, 904,
- 873, 842, 811, 780, 749, 718, 687, 656, 625, 594, 563, 532, 501,
- 470, 439, 408, 377, 346, 315, 284, 253, 222, 191, 223, 254, 285,
- 316, 347, 378, 409, 440, 471, 502, 533, 564, 595, 626, 657, 688,
- 719, 750, 781, 812, 843, 874, 905, 936, 967, 998, 999, 968, 937,
- 906, 875, 844, 813, 782, 751, 720, 689, 658, 627, 596, 565, 534,
- 503, 472, 441, 410, 379, 348, 317, 286, 255, 287, 318, 349, 380,
- 411, 442, 473, 504, 535, 566, 597, 628, 659, 690, 721, 752, 783,
- 814, 845, 876, 907, 938, 969, 1000, 1001, 970, 939, 908, 877, 846,
- 815, 784, 753, 722, 691, 660, 629, 598, 567, 536, 505, 474, 443,
- 412, 381, 350, 319, 351, 382, 413, 444, 475, 506, 537, 568, 599,
- 630, 661, 692, 723, 754, 785, 816, 847, 878, 909, 940, 971, 1002,
- 1003, 972, 941, 910, 879, 848, 817, 786, 755, 724, 693, 662, 631,
- 600, 569, 538, 507, 476, 445, 414, 383, 415, 446, 477, 508, 539,
- 570, 601, 632, 663, 694, 725, 756, 787, 818, 849, 880, 911, 942,
- 973, 1004, 1005, 974, 943, 912, 881, 850, 819, 788, 757, 726, 695,
- 664, 633, 602, 571, 540, 509, 478, 447, 479, 510, 541, 572, 603,
- 634, 665, 696, 727, 758, 789, 820, 851, 882, 913, 944, 975, 1006,
- 1007, 976, 945, 914, 883, 852, 821, 790, 759, 728, 697, 666, 635,
- 604, 573, 542, 511, 543, 574, 605, 636, 667, 698, 729, 760, 791,
- 822, 853, 884, 915, 946, 977, 1008, 1009, 978, 947, 916, 885, 854,
- 823, 792, 761, 730, 699, 668, 637, 606, 575, 607, 638, 669, 700,
- 731, 762, 793, 824, 855, 886, 917, 948, 979, 1010, 1011, 980, 949,
- 918, 887, 856, 825, 794, 763, 732, 701, 670, 639, 671, 702, 733,
- 764, 795, 826, 857, 888, 919, 950, 981, 1012, 1013, 982, 951, 920,
- 889, 858, 827, 796, 765, 734, 703, 735, 766, 797, 828, 859, 890,
- 921, 952, 983, 1014, 1015, 984, 953, 922, 891, 860, 829, 798, 767,
- 799, 830, 861, 892, 923, 954, 985, 1016, 1017, 986, 955, 924, 893,
- 862, 831, 863, 894, 925, 956, 987, 1018, 1019, 988, 957, 926, 895,
- 927, 958, 989, 1020, 1021, 990, 959, 991, 1022, 1023
+DECLARE_ALIGNED(16, static const int16_t, mrow_scan_32x32[1024]) = {
+ 0, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416,
+ 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800, 832, 864,
+ 896, 928, 960, 992, 1, 33, 65, 97, 129, 161, 193, 225, 257, 289,
+ 321, 353, 385, 417, 449, 481, 513, 545, 577, 609, 641, 673, 705, 737,
+ 769, 801, 833, 865, 897, 929, 961, 993, 2, 34, 66, 98, 130, 162,
+ 194, 226, 258, 290, 322, 354, 386, 418, 450, 482, 514, 546, 578, 610,
+ 642, 674, 706, 738, 770, 802, 834, 866, 898, 930, 962, 994, 3, 35,
+ 67, 99, 131, 163, 195, 227, 259, 291, 323, 355, 387, 419, 451, 483,
+ 515, 547, 579, 611, 643, 675, 707, 739, 771, 803, 835, 867, 899, 931,
+ 963, 995, 4, 36, 68, 100, 132, 164, 196, 228, 260, 292, 324, 356,
+ 388, 420, 452, 484, 516, 548, 580, 612, 644, 676, 708, 740, 772, 804,
+ 836, 868, 900, 932, 964, 996, 5, 37, 69, 101, 133, 165, 197, 229,
+ 261, 293, 325, 357, 389, 421, 453, 485, 517, 549, 581, 613, 645, 677,
+ 709, 741, 773, 805, 837, 869, 901, 933, 965, 997, 6, 38, 70, 102,
+ 134, 166, 198, 230, 262, 294, 326, 358, 390, 422, 454, 486, 518, 550,
+ 582, 614, 646, 678, 710, 742, 774, 806, 838, 870, 902, 934, 966, 998,
+ 7, 39, 71, 103, 135, 167, 199, 231, 263, 295, 327, 359, 391, 423,
+ 455, 487, 519, 551, 583, 615, 647, 679, 711, 743, 775, 807, 839, 871,
+ 903, 935, 967, 999, 8, 40, 72, 104, 136, 168, 200, 232, 264, 296,
+ 328, 360, 392, 424, 456, 488, 520, 552, 584, 616, 648, 680, 712, 744,
+ 776, 808, 840, 872, 904, 936, 968, 1000, 9, 41, 73, 105, 137, 169,
+ 201, 233, 265, 297, 329, 361, 393, 425, 457, 489, 521, 553, 585, 617,
+ 649, 681, 713, 745, 777, 809, 841, 873, 905, 937, 969, 1001, 10, 42,
+ 74, 106, 138, 170, 202, 234, 266, 298, 330, 362, 394, 426, 458, 490,
+ 522, 554, 586, 618, 650, 682, 714, 746, 778, 810, 842, 874, 906, 938,
+ 970, 1002, 11, 43, 75, 107, 139, 171, 203, 235, 267, 299, 331, 363,
+ 395, 427, 459, 491, 523, 555, 587, 619, 651, 683, 715, 747, 779, 811,
+ 843, 875, 907, 939, 971, 1003, 12, 44, 76, 108, 140, 172, 204, 236,
+ 268, 300, 332, 364, 396, 428, 460, 492, 524, 556, 588, 620, 652, 684,
+ 716, 748, 780, 812, 844, 876, 908, 940, 972, 1004, 13, 45, 77, 109,
+ 141, 173, 205, 237, 269, 301, 333, 365, 397, 429, 461, 493, 525, 557,
+ 589, 621, 653, 685, 717, 749, 781, 813, 845, 877, 909, 941, 973, 1005,
+ 14, 46, 78, 110, 142, 174, 206, 238, 270, 302, 334, 366, 398, 430,
+ 462, 494, 526, 558, 590, 622, 654, 686, 718, 750, 782, 814, 846, 878,
+ 910, 942, 974, 1006, 15, 47, 79, 111, 143, 175, 207, 239, 271, 303,
+ 335, 367, 399, 431, 463, 495, 527, 559, 591, 623, 655, 687, 719, 751,
+ 783, 815, 847, 879, 911, 943, 975, 1007, 16, 48, 80, 112, 144, 176,
+ 208, 240, 272, 304, 336, 368, 400, 432, 464, 496, 528, 560, 592, 624,
+ 656, 688, 720, 752, 784, 816, 848, 880, 912, 944, 976, 1008, 17, 49,
+ 81, 113, 145, 177, 209, 241, 273, 305, 337, 369, 401, 433, 465, 497,
+ 529, 561, 593, 625, 657, 689, 721, 753, 785, 817, 849, 881, 913, 945,
+ 977, 1009, 18, 50, 82, 114, 146, 178, 210, 242, 274, 306, 338, 370,
+ 402, 434, 466, 498, 530, 562, 594, 626, 658, 690, 722, 754, 786, 818,
+ 850, 882, 914, 946, 978, 1010, 19, 51, 83, 115, 147, 179, 211, 243,
+ 275, 307, 339, 371, 403, 435, 467, 499, 531, 563, 595, 627, 659, 691,
+ 723, 755, 787, 819, 851, 883, 915, 947, 979, 1011, 20, 52, 84, 116,
+ 148, 180, 212, 244, 276, 308, 340, 372, 404, 436, 468, 500, 532, 564,
+ 596, 628, 660, 692, 724, 756, 788, 820, 852, 884, 916, 948, 980, 1012,
+ 21, 53, 85, 117, 149, 181, 213, 245, 277, 309, 341, 373, 405, 437,
+ 469, 501, 533, 565, 597, 629, 661, 693, 725, 757, 789, 821, 853, 885,
+ 917, 949, 981, 1013, 22, 54, 86, 118, 150, 182, 214, 246, 278, 310,
+ 342, 374, 406, 438, 470, 502, 534, 566, 598, 630, 662, 694, 726, 758,
+ 790, 822, 854, 886, 918, 950, 982, 1014, 23, 55, 87, 119, 151, 183,
+ 215, 247, 279, 311, 343, 375, 407, 439, 471, 503, 535, 567, 599, 631,
+ 663, 695, 727, 759, 791, 823, 855, 887, 919, 951, 983, 1015, 24, 56,
+ 88, 120, 152, 184, 216, 248, 280, 312, 344, 376, 408, 440, 472, 504,
+ 536, 568, 600, 632, 664, 696, 728, 760, 792, 824, 856, 888, 920, 952,
+ 984, 1016, 25, 57, 89, 121, 153, 185, 217, 249, 281, 313, 345, 377,
+ 409, 441, 473, 505, 537, 569, 601, 633, 665, 697, 729, 761, 793, 825,
+ 857, 889, 921, 953, 985, 1017, 26, 58, 90, 122, 154, 186, 218, 250,
+ 282, 314, 346, 378, 410, 442, 474, 506, 538, 570, 602, 634, 666, 698,
+ 730, 762, 794, 826, 858, 890, 922, 954, 986, 1018, 27, 59, 91, 123,
+ 155, 187, 219, 251, 283, 315, 347, 379, 411, 443, 475, 507, 539, 571,
+ 603, 635, 667, 699, 731, 763, 795, 827, 859, 891, 923, 955, 987, 1019,
+ 28, 60, 92, 124, 156, 188, 220, 252, 284, 316, 348, 380, 412, 444,
+ 476, 508, 540, 572, 604, 636, 668, 700, 732, 764, 796, 828, 860, 892,
+ 924, 956, 988, 1020, 29, 61, 93, 125, 157, 189, 221, 253, 285, 317,
+ 349, 381, 413, 445, 477, 509, 541, 573, 605, 637, 669, 701, 733, 765,
+ 797, 829, 861, 893, 925, 957, 989, 1021, 30, 62, 94, 126, 158, 190,
+ 222, 254, 286, 318, 350, 382, 414, 446, 478, 510, 542, 574, 606, 638,
+ 670, 702, 734, 766, 798, 830, 862, 894, 926, 958, 990, 1022, 31, 63,
+ 95, 127, 159, 191, 223, 255, 287, 319, 351, 383, 415, 447, 479, 511,
+ 543, 575, 607, 639, 671, 703, 735, 767, 799, 831, 863, 895, 927, 959,
+ 991, 1023,
};
-DECLARE_ALIGNED(16, static const int16_t,
- av1_default_iscan_4x4[16]) = { 0, 1, 5, 6, 2, 4, 7, 12,
- 3, 8, 11, 13, 9, 10, 14, 15 };
+DECLARE_ALIGNED(16, static const int16_t, default_scan_32x32[1024]) = {
+ 0, 32, 1, 2, 33, 64, 96, 65, 34, 3, 4, 35, 66,
+ 97, 128, 160, 129, 98, 67, 36, 5, 6, 37, 68, 99, 130,
+ 161, 192, 224, 193, 162, 131, 100, 69, 38, 7, 8, 39, 70,
+ 101, 132, 163, 194, 225, 256, 288, 257, 226, 195, 164, 133, 102,
+ 71, 40, 9, 10, 41, 72, 103, 134, 165, 196, 227, 258, 289,
+ 320, 352, 321, 290, 259, 228, 197, 166, 135, 104, 73, 42, 11,
+ 12, 43, 74, 105, 136, 167, 198, 229, 260, 291, 322, 353, 384,
+ 416, 385, 354, 323, 292, 261, 230, 199, 168, 137, 106, 75, 44,
+ 13, 14, 45, 76, 107, 138, 169, 200, 231, 262, 293, 324, 355,
+ 386, 417, 448, 480, 449, 418, 387, 356, 325, 294, 263, 232, 201,
+ 170, 139, 108, 77, 46, 15, 16, 47, 78, 109, 140, 171, 202,
+ 233, 264, 295, 326, 357, 388, 419, 450, 481, 512, 544, 513, 482,
+ 451, 420, 389, 358, 327, 296, 265, 234, 203, 172, 141, 110, 79,
+ 48, 17, 18, 49, 80, 111, 142, 173, 204, 235, 266, 297, 328,
+ 359, 390, 421, 452, 483, 514, 545, 576, 608, 577, 546, 515, 484,
+ 453, 422, 391, 360, 329, 298, 267, 236, 205, 174, 143, 112, 81,
+ 50, 19, 20, 51, 82, 113, 144, 175, 206, 237, 268, 299, 330,
+ 361, 392, 423, 454, 485, 516, 547, 578, 609, 640, 672, 641, 610,
+ 579, 548, 517, 486, 455, 424, 393, 362, 331, 300, 269, 238, 207,
+ 176, 145, 114, 83, 52, 21, 22, 53, 84, 115, 146, 177, 208,
+ 239, 270, 301, 332, 363, 394, 425, 456, 487, 518, 549, 580, 611,
+ 642, 673, 704, 736, 705, 674, 643, 612, 581, 550, 519, 488, 457,
+ 426, 395, 364, 333, 302, 271, 240, 209, 178, 147, 116, 85, 54,
+ 23, 24, 55, 86, 117, 148, 179, 210, 241, 272, 303, 334, 365,
+ 396, 427, 458, 489, 520, 551, 582, 613, 644, 675, 706, 737, 768,
+ 800, 769, 738, 707, 676, 645, 614, 583, 552, 521, 490, 459, 428,
+ 397, 366, 335, 304, 273, 242, 211, 180, 149, 118, 87, 56, 25,
+ 26, 57, 88, 119, 150, 181, 212, 243, 274, 305, 336, 367, 398,
+ 429, 460, 491, 522, 553, 584, 615, 646, 677, 708, 739, 770, 801,
+ 832, 864, 833, 802, 771, 740, 709, 678, 647, 616, 585, 554, 523,
+ 492, 461, 430, 399, 368, 337, 306, 275, 244, 213, 182, 151, 120,
+ 89, 58, 27, 28, 59, 90, 121, 152, 183, 214, 245, 276, 307,
+ 338, 369, 400, 431, 462, 493, 524, 555, 586, 617, 648, 679, 710,
+ 741, 772, 803, 834, 865, 896, 928, 897, 866, 835, 804, 773, 742,
+ 711, 680, 649, 618, 587, 556, 525, 494, 463, 432, 401, 370, 339,
+ 308, 277, 246, 215, 184, 153, 122, 91, 60, 29, 30, 61, 92,
+ 123, 154, 185, 216, 247, 278, 309, 340, 371, 402, 433, 464, 495,
+ 526, 557, 588, 619, 650, 681, 712, 743, 774, 805, 836, 867, 898,
+ 929, 960, 992, 961, 930, 899, 868, 837, 806, 775, 744, 713, 682,
+ 651, 620, 589, 558, 527, 496, 465, 434, 403, 372, 341, 310, 279,
+ 248, 217, 186, 155, 124, 93, 62, 31, 63, 94, 125, 156, 187,
+ 218, 249, 280, 311, 342, 373, 404, 435, 466, 497, 528, 559, 590,
+ 621, 652, 683, 714, 745, 776, 807, 838, 869, 900, 931, 962, 993,
+ 994, 963, 932, 901, 870, 839, 808, 777, 746, 715, 684, 653, 622,
+ 591, 560, 529, 498, 467, 436, 405, 374, 343, 312, 281, 250, 219,
+ 188, 157, 126, 95, 127, 158, 189, 220, 251, 282, 313, 344, 375,
+ 406, 437, 468, 499, 530, 561, 592, 623, 654, 685, 716, 747, 778,
+ 809, 840, 871, 902, 933, 964, 995, 996, 965, 934, 903, 872, 841,
+ 810, 779, 748, 717, 686, 655, 624, 593, 562, 531, 500, 469, 438,
+ 407, 376, 345, 314, 283, 252, 221, 190, 159, 191, 222, 253, 284,
+ 315, 346, 377, 408, 439, 470, 501, 532, 563, 594, 625, 656, 687,
+ 718, 749, 780, 811, 842, 873, 904, 935, 966, 997, 998, 967, 936,
+ 905, 874, 843, 812, 781, 750, 719, 688, 657, 626, 595, 564, 533,
+ 502, 471, 440, 409, 378, 347, 316, 285, 254, 223, 255, 286, 317,
+ 348, 379, 410, 441, 472, 503, 534, 565, 596, 627, 658, 689, 720,
+ 751, 782, 813, 844, 875, 906, 937, 968, 999, 1000, 969, 938, 907,
+ 876, 845, 814, 783, 752, 721, 690, 659, 628, 597, 566, 535, 504,
+ 473, 442, 411, 380, 349, 318, 287, 319, 350, 381, 412, 443, 474,
+ 505, 536, 567, 598, 629, 660, 691, 722, 753, 784, 815, 846, 877,
+ 908, 939, 970, 1001, 1002, 971, 940, 909, 878, 847, 816, 785, 754,
+ 723, 692, 661, 630, 599, 568, 537, 506, 475, 444, 413, 382, 351,
+ 383, 414, 445, 476, 507, 538, 569, 600, 631, 662, 693, 724, 755,
+ 786, 817, 848, 879, 910, 941, 972, 1003, 1004, 973, 942, 911, 880,
+ 849, 818, 787, 756, 725, 694, 663, 632, 601, 570, 539, 508, 477,
+ 446, 415, 447, 478, 509, 540, 571, 602, 633, 664, 695, 726, 757,
+ 788, 819, 850, 881, 912, 943, 974, 1005, 1006, 975, 944, 913, 882,
+ 851, 820, 789, 758, 727, 696, 665, 634, 603, 572, 541, 510, 479,
+ 511, 542, 573, 604, 635, 666, 697, 728, 759, 790, 821, 852, 883,
+ 914, 945, 976, 1007, 1008, 977, 946, 915, 884, 853, 822, 791, 760,
+ 729, 698, 667, 636, 605, 574, 543, 575, 606, 637, 668, 699, 730,
+ 761, 792, 823, 854, 885, 916, 947, 978, 1009, 1010, 979, 948, 917,
+ 886, 855, 824, 793, 762, 731, 700, 669, 638, 607, 639, 670, 701,
+ 732, 763, 794, 825, 856, 887, 918, 949, 980, 1011, 1012, 981, 950,
+ 919, 888, 857, 826, 795, 764, 733, 702, 671, 703, 734, 765, 796,
+ 827, 858, 889, 920, 951, 982, 1013, 1014, 983, 952, 921, 890, 859,
+ 828, 797, 766, 735, 767, 798, 829, 860, 891, 922, 953, 984, 1015,
+ 1016, 985, 954, 923, 892, 861, 830, 799, 831, 862, 893, 924, 955,
+ 986, 1017, 1018, 987, 956, 925, 894, 863, 895, 926, 957, 988, 1019,
+ 1020, 989, 958, 927, 959, 990, 1021, 1022, 991, 1023,
+};
+
+DECLARE_ALIGNED(16, static const int16_t, av1_default_iscan_4x4[16]) = {
+ 0, 2, 3, 9, 1, 4, 8, 10, 5, 7, 11, 14, 6, 12, 13, 15,
+};
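+/* Reviewer note: the av1_*_iscan_* tables are the inverses of the scan
+ * tables, i.e. iscan[pos] is the scan index at which coefficient position
+ * pos is coded, so scan[iscan[pos]] == pos.  Inverting the 4x4 table above
+ * yields the scan order 0, 4, 1, 2, 5, 8, 12, 9, 6, 3, 7, 10, 13, 14,
+ * 11, 15. */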
DECLARE_ALIGNED(16, static const int16_t, av1_mcol_iscan_4x4[16]) = {
- 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_4x4[16]) = {
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
};
+DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_4x4[16]) = {
+ 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15,
+};
+
DECLARE_ALIGNED(16, static const int16_t, av1_default_iscan_4x8[32]) = {
- 0, 1, 3, 6, 2, 4, 7, 10, 5, 8, 11, 14, 9, 12, 15, 18,
- 13, 16, 19, 22, 17, 20, 23, 26, 21, 24, 27, 29, 25, 28, 30, 31,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, av1_mcol_iscan_4x8[32]) = {
- 0, 8, 16, 24, 1, 9, 17, 25, 2, 10, 18, 26, 3, 11, 19, 27,
- 4, 12, 20, 28, 5, 13, 21, 29, 6, 14, 22, 30, 7, 15, 23, 31,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_4x8[32]) = {
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
- 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, av1_default_iscan_8x4[32]) = {
0, 2, 5, 9, 13, 17, 21, 25, 1, 4, 8, 12, 16, 20, 24, 28,
3, 7, 11, 15, 19, 23, 27, 30, 6, 10, 14, 18, 22, 26, 29, 31,
};
-DECLARE_ALIGNED(16, static const int16_t, av1_mcol_iscan_8x4[32]) = {
- 0, 4, 8, 12, 16, 20, 24, 28, 1, 5, 9, 13, 17, 21, 25, 29,
- 2, 6, 10, 14, 18, 22, 26, 30, 3, 7, 11, 15, 19, 23, 27, 31,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_8x4[32]) = {
+DECLARE_ALIGNED(16, static const int16_t, av1_mcol_iscan_4x8[32]) = {
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
};
-DECLARE_ALIGNED(16, static const int16_t, av1_default_iscan_4x16[64]) = {
- 0, 1, 3, 6, 2, 4, 7, 10, 5, 8, 11, 14, 9, 12, 15, 18,
- 13, 16, 19, 22, 17, 20, 23, 26, 21, 24, 27, 30, 25, 28, 31, 34,
- 29, 32, 35, 38, 33, 36, 39, 42, 37, 40, 43, 46, 41, 44, 47, 50,
- 45, 48, 51, 54, 49, 52, 55, 58, 53, 56, 59, 61, 57, 60, 62, 63,
+DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_4x8[32]) = {
+ 0, 4, 8, 12, 16, 20, 24, 28, 1, 5, 9, 13, 17, 21, 25, 29,
+ 2, 6, 10, 14, 18, 22, 26, 30, 3, 7, 11, 15, 19, 23, 27, 31,
};
-DECLARE_ALIGNED(16, static const int16_t, av1_default_iscan_16x4[64]) = {
+DECLARE_ALIGNED(16, static const int16_t, av1_default_iscan_8x4[32]) = {
+ 0, 1, 3, 6, 2, 4, 7, 10, 5, 8, 11, 14, 9, 12, 15, 18,
+ 13, 16, 19, 22, 17, 20, 23, 26, 21, 24, 27, 29, 25, 28, 30, 31,
+};
+
+DECLARE_ALIGNED(16, static const int16_t, av1_mcol_iscan_8x4[32]) = {
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
+ 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
+};
+
+DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_8x4[32]) = {
+ 0, 8, 16, 24, 1, 9, 17, 25, 2, 10, 18, 26, 3, 11, 19, 27,
+ 4, 12, 20, 28, 5, 13, 21, 29, 6, 14, 22, 30, 7, 15, 23, 31,
+};
+
+DECLARE_ALIGNED(16, static const int16_t, av1_default_iscan_4x16[64]) = {
0, 2, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57,
1, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60,
3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 62,
6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50, 54, 58, 61, 63,
};
+DECLARE_ALIGNED(16, static const int16_t, av1_default_iscan_16x4[64]) = {
+ 0, 1, 3, 6, 2, 4, 7, 10, 5, 8, 11, 14, 9, 12, 15, 18,
+ 13, 16, 19, 22, 17, 20, 23, 26, 21, 24, 27, 30, 25, 28, 31, 34,
+ 29, 32, 35, 38, 33, 36, 39, 42, 37, 40, 43, 46, 41, 44, 47, 50,
+ 45, 48, 51, 54, 49, 52, 55, 58, 53, 56, 59, 61, 57, 60, 62, 63,
+};
+
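+/* Reviewer note: for the mrow_/mcol_ and the WxH/HxW rectangular pairs,
+ * this change exchanges the bodies within each pair rather than introducing
+ * new permutations (e.g. the removed av1_default_iscan_4x16 body reappears
+ * above as av1_default_iscan_16x4). */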
DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_4x16[64]) = {
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
- 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
- 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
- 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_16x4[64]) = {
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
- 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
- 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
- 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, av1_mcol_iscan_4x16[64]) = {
- 0, 16, 32, 48, 1, 17, 33, 49, 2, 18, 34, 50, 3, 19, 35, 51,
- 4, 20, 36, 52, 5, 21, 37, 53, 6, 22, 38, 54, 7, 23, 39, 55,
- 8, 24, 40, 56, 9, 25, 41, 57, 10, 26, 42, 58, 11, 27, 43, 59,
- 12, 28, 44, 60, 13, 29, 45, 61, 14, 30, 46, 62, 15, 31, 47, 63,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, av1_mcol_iscan_16x4[64]) = {
0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60,
1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57, 61,
2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50, 54, 58, 62,
3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63,
};
-DECLARE_ALIGNED(16, static const int16_t, av1_default_iscan_8x32[256]) = {
- 0, 1, 3, 6, 10, 15, 21, 28, 2, 4, 7, 11, 16, 22, 29,
- 36, 5, 8, 12, 17, 23, 30, 37, 44, 9, 13, 18, 24, 31, 38,
- 45, 52, 14, 19, 25, 32, 39, 46, 53, 60, 20, 26, 33, 40, 47,
- 54, 61, 68, 27, 34, 41, 48, 55, 62, 69, 76, 35, 42, 49, 56,
- 63, 70, 77, 84, 43, 50, 57, 64, 71, 78, 85, 92, 51, 58, 65,
- 72, 79, 86, 93, 100, 59, 66, 73, 80, 87, 94, 101, 108, 67, 74,
- 81, 88, 95, 102, 109, 116, 75, 82, 89, 96, 103, 110, 117, 124, 83,
- 90, 97, 104, 111, 118, 125, 132, 91, 98, 105, 112, 119, 126, 133, 140,
- 99, 106, 113, 120, 127, 134, 141, 148, 107, 114, 121, 128, 135, 142, 149,
- 156, 115, 122, 129, 136, 143, 150, 157, 164, 123, 130, 137, 144, 151, 158,
- 165, 172, 131, 138, 145, 152, 159, 166, 173, 180, 139, 146, 153, 160, 167,
- 174, 181, 188, 147, 154, 161, 168, 175, 182, 189, 196, 155, 162, 169, 176,
- 183, 190, 197, 204, 163, 170, 177, 184, 191, 198, 205, 212, 171, 178, 185,
- 192, 199, 206, 213, 220, 179, 186, 193, 200, 207, 214, 221, 228, 187, 194,
- 201, 208, 215, 222, 229, 235, 195, 202, 209, 216, 223, 230, 236, 241, 203,
- 210, 217, 224, 231, 237, 242, 246, 211, 218, 225, 232, 238, 243, 247, 250,
- 219, 226, 233, 239, 244, 248, 251, 253, 227, 234, 240, 245, 249, 252, 254,
- 255,
+DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_16x4[64]) = {
+ 0, 16, 32, 48, 1, 17, 33, 49, 2, 18, 34, 50, 3, 19, 35, 51,
+ 4, 20, 36, 52, 5, 21, 37, 53, 6, 22, 38, 54, 7, 23, 39, 55,
+ 8, 24, 40, 56, 9, 25, 41, 57, 10, 26, 42, 58, 11, 27, 43, 59,
+ 12, 28, 44, 60, 13, 29, 45, 61, 14, 30, 46, 62, 15, 31, 47, 63,
};
-DECLARE_ALIGNED(16, static const int16_t, av1_default_iscan_32x8[256]) = {
+DECLARE_ALIGNED(16, static const int16_t, av1_mcol_iscan_4x16[64]) = {
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
+ 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
+ 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
+ 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
+};
+
+DECLARE_ALIGNED(16, static const int16_t, av1_mcol_iscan_16x4[64]) = {
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
+ 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
+ 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
+ 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
+};
+
+DECLARE_ALIGNED(16, static const int16_t, av1_default_iscan_8x32[256]) = {
0, 2, 5, 9, 14, 20, 27, 35, 43, 51, 59, 67, 75, 83, 91,
99, 107, 115, 123, 131, 139, 147, 155, 163, 171, 179, 187, 195, 203, 211,
219, 227, 1, 4, 8, 13, 19, 26, 34, 42, 50, 58, 66, 74, 82,
@@ -965,68 +944,28 @@
255,
};
+DECLARE_ALIGNED(16, static const int16_t, av1_default_iscan_32x8[256]) = {
+ 0, 1, 3, 6, 10, 15, 21, 28, 2, 4, 7, 11, 16, 22, 29,
+ 36, 5, 8, 12, 17, 23, 30, 37, 44, 9, 13, 18, 24, 31, 38,
+ 45, 52, 14, 19, 25, 32, 39, 46, 53, 60, 20, 26, 33, 40, 47,
+ 54, 61, 68, 27, 34, 41, 48, 55, 62, 69, 76, 35, 42, 49, 56,
+ 63, 70, 77, 84, 43, 50, 57, 64, 71, 78, 85, 92, 51, 58, 65,
+ 72, 79, 86, 93, 100, 59, 66, 73, 80, 87, 94, 101, 108, 67, 74,
+ 81, 88, 95, 102, 109, 116, 75, 82, 89, 96, 103, 110, 117, 124, 83,
+ 90, 97, 104, 111, 118, 125, 132, 91, 98, 105, 112, 119, 126, 133, 140,
+ 99, 106, 113, 120, 127, 134, 141, 148, 107, 114, 121, 128, 135, 142, 149,
+ 156, 115, 122, 129, 136, 143, 150, 157, 164, 123, 130, 137, 144, 151, 158,
+ 165, 172, 131, 138, 145, 152, 159, 166, 173, 180, 139, 146, 153, 160, 167,
+ 174, 181, 188, 147, 154, 161, 168, 175, 182, 189, 196, 155, 162, 169, 176,
+ 183, 190, 197, 204, 163, 170, 177, 184, 191, 198, 205, 212, 171, 178, 185,
+ 192, 199, 206, 213, 220, 179, 186, 193, 200, 207, 214, 221, 228, 187, 194,
+ 201, 208, 215, 222, 229, 235, 195, 202, 209, 216, 223, 230, 236, 241, 203,
+ 210, 217, 224, 231, 237, 242, 246, 211, 218, 225, 232, 238, 243, 247, 250,
+ 219, 226, 233, 239, 244, 248, 251, 253, 227, 234, 240, 245, 249, 252, 254,
+ 255,
+};
+
DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_8x32[256]) = {
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
- 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
- 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
- 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
- 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
- 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
- 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
- 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
- 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134,
- 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149,
- 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
- 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179,
- 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
- 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209,
- 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224,
- 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
- 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254,
- 255,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_32x8[256]) = {
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
- 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
- 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
- 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
- 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
- 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
- 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
- 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
- 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134,
- 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149,
- 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
- 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179,
- 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
- 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209,
- 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224,
- 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
- 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254,
- 255,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, av1_mcol_iscan_8x32[256]) = {
- 0, 32, 64, 96, 128, 160, 192, 224, 1, 33, 65, 97, 129, 161, 193, 225,
- 2, 34, 66, 98, 130, 162, 194, 226, 3, 35, 67, 99, 131, 163, 195, 227,
- 4, 36, 68, 100, 132, 164, 196, 228, 5, 37, 69, 101, 133, 165, 197, 229,
- 6, 38, 70, 102, 134, 166, 198, 230, 7, 39, 71, 103, 135, 167, 199, 231,
- 8, 40, 72, 104, 136, 168, 200, 232, 9, 41, 73, 105, 137, 169, 201, 233,
- 10, 42, 74, 106, 138, 170, 202, 234, 11, 43, 75, 107, 139, 171, 203, 235,
- 12, 44, 76, 108, 140, 172, 204, 236, 13, 45, 77, 109, 141, 173, 205, 237,
- 14, 46, 78, 110, 142, 174, 206, 238, 15, 47, 79, 111, 143, 175, 207, 239,
- 16, 48, 80, 112, 144, 176, 208, 240, 17, 49, 81, 113, 145, 177, 209, 241,
- 18, 50, 82, 114, 146, 178, 210, 242, 19, 51, 83, 115, 147, 179, 211, 243,
- 20, 52, 84, 116, 148, 180, 212, 244, 21, 53, 85, 117, 149, 181, 213, 245,
- 22, 54, 86, 118, 150, 182, 214, 246, 23, 55, 87, 119, 151, 183, 215, 247,
- 24, 56, 88, 120, 152, 184, 216, 248, 25, 57, 89, 121, 153, 185, 217, 249,
- 26, 58, 90, 122, 154, 186, 218, 250, 27, 59, 91, 123, 155, 187, 219, 251,
- 28, 60, 92, 124, 156, 188, 220, 252, 29, 61, 93, 125, 157, 189, 221, 253,
- 30, 62, 94, 126, 158, 190, 222, 254, 31, 63, 95, 127, 159, 191, 223, 255,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, av1_mcol_iscan_32x8[256]) = {
0, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112,
120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232,
240, 248, 1, 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97,
@@ -1047,39 +986,89 @@
255,
};
-DECLARE_ALIGNED(16, static const int16_t, av1_mcol_iscan_8x8[64]) = {
- 0, 8, 16, 24, 32, 40, 48, 56, 1, 9, 17, 25, 33, 41, 49, 57,
- 2, 10, 18, 26, 34, 42, 50, 58, 3, 11, 19, 27, 35, 43, 51, 59,
- 4, 12, 20, 28, 36, 44, 52, 60, 5, 13, 21, 29, 37, 45, 53, 61,
- 6, 14, 22, 30, 38, 46, 54, 62, 7, 15, 23, 31, 39, 47, 55, 63,
+DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_32x8[256]) = {
+ 0, 32, 64, 96, 128, 160, 192, 224, 1, 33, 65, 97, 129, 161, 193, 225,
+ 2, 34, 66, 98, 130, 162, 194, 226, 3, 35, 67, 99, 131, 163, 195, 227,
+ 4, 36, 68, 100, 132, 164, 196, 228, 5, 37, 69, 101, 133, 165, 197, 229,
+ 6, 38, 70, 102, 134, 166, 198, 230, 7, 39, 71, 103, 135, 167, 199, 231,
+ 8, 40, 72, 104, 136, 168, 200, 232, 9, 41, 73, 105, 137, 169, 201, 233,
+ 10, 42, 74, 106, 138, 170, 202, 234, 11, 43, 75, 107, 139, 171, 203, 235,
+ 12, 44, 76, 108, 140, 172, 204, 236, 13, 45, 77, 109, 141, 173, 205, 237,
+ 14, 46, 78, 110, 142, 174, 206, 238, 15, 47, 79, 111, 143, 175, 207, 239,
+ 16, 48, 80, 112, 144, 176, 208, 240, 17, 49, 81, 113, 145, 177, 209, 241,
+ 18, 50, 82, 114, 146, 178, 210, 242, 19, 51, 83, 115, 147, 179, 211, 243,
+ 20, 52, 84, 116, 148, 180, 212, 244, 21, 53, 85, 117, 149, 181, 213, 245,
+ 22, 54, 86, 118, 150, 182, 214, 246, 23, 55, 87, 119, 151, 183, 215, 247,
+ 24, 56, 88, 120, 152, 184, 216, 248, 25, 57, 89, 121, 153, 185, 217, 249,
+ 26, 58, 90, 122, 154, 186, 218, 250, 27, 59, 91, 123, 155, 187, 219, 251,
+ 28, 60, 92, 124, 156, 188, 220, 252, 29, 61, 93, 125, 157, 189, 221, 253,
+ 30, 62, 94, 126, 158, 190, 222, 254, 31, 63, 95, 127, 159, 191, 223, 255,
};
-DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_8x8[64]) = {
+DECLARE_ALIGNED(16, static const int16_t, av1_mcol_iscan_8x32[256]) = {
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
+ 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
+ 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
+ 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
+ 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
+ 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
+ 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
+ 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
+ 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134,
+ 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149,
+ 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
+ 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179,
+ 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
+ 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209,
+ 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224,
+ 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
+ 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254,
+ 255,
+};
+
+DECLARE_ALIGNED(16, static const int16_t, av1_mcol_iscan_32x8[256]) = {
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
+ 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
+ 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
+ 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
+ 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
+ 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
+ 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
+ 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
+ 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134,
+ 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149,
+ 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
+ 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179,
+ 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
+ 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209,
+ 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224,
+ 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
+ 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254,
+ 255,
+};
+
+DECLARE_ALIGNED(16, static const int16_t, av1_mcol_iscan_8x8[64]) = {
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
};
+DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_8x8[64]) = {
+ 0, 8, 16, 24, 32, 40, 48, 56, 1, 9, 17, 25, 33, 41, 49, 57,
+ 2, 10, 18, 26, 34, 42, 50, 58, 3, 11, 19, 27, 35, 43, 51, 59,
+ 4, 12, 20, 28, 36, 44, 52, 60, 5, 13, 21, 29, 37, 45, 53, 61,
+ 6, 14, 22, 30, 38, 46, 54, 62, 7, 15, 23, 31, 39, 47, 55, 63,
+};
+
DECLARE_ALIGNED(16, static const int16_t, av1_default_iscan_8x8[64]) = {
- 0, 1, 5, 6, 14, 15, 27, 28, 2, 4, 7, 13, 16, 26, 29, 42,
- 3, 8, 12, 17, 25, 30, 41, 43, 9, 11, 18, 24, 31, 40, 44, 53,
- 10, 19, 23, 32, 39, 45, 52, 54, 20, 22, 33, 38, 46, 51, 55, 60,
- 21, 34, 37, 47, 50, 56, 59, 61, 35, 36, 48, 49, 57, 58, 62, 63
+ 0, 2, 3, 9, 10, 20, 21, 35, 1, 4, 8, 11, 19, 22, 34, 36,
+ 5, 7, 12, 18, 23, 33, 37, 48, 6, 13, 17, 24, 32, 38, 47, 49,
+ 14, 16, 25, 31, 39, 46, 50, 57, 15, 26, 30, 40, 45, 51, 56, 58,
+ 27, 29, 41, 44, 52, 55, 59, 62, 28, 42, 43, 53, 54, 60, 61, 63,
};
DECLARE_ALIGNED(16, static const int16_t, av1_default_iscan_8x16[128]) = {
- 0, 1, 3, 6, 10, 15, 21, 28, 2, 4, 7, 11, 16, 22, 29, 36,
- 5, 8, 12, 17, 23, 30, 37, 44, 9, 13, 18, 24, 31, 38, 45, 52,
- 14, 19, 25, 32, 39, 46, 53, 60, 20, 26, 33, 40, 47, 54, 61, 68,
- 27, 34, 41, 48, 55, 62, 69, 76, 35, 42, 49, 56, 63, 70, 77, 84,
- 43, 50, 57, 64, 71, 78, 85, 92, 51, 58, 65, 72, 79, 86, 93, 100,
- 59, 66, 73, 80, 87, 94, 101, 107, 67, 74, 81, 88, 95, 102, 108, 113,
- 75, 82, 89, 96, 103, 109, 114, 118, 83, 90, 97, 104, 110, 115, 119, 122,
- 91, 98, 105, 111, 116, 120, 123, 125, 99, 106, 112, 117, 121, 124, 126, 127,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, av1_default_iscan_16x8[128]) = {
0, 2, 5, 9, 14, 20, 27, 35, 43, 51, 59, 67, 75, 83, 91, 99,
1, 4, 8, 13, 19, 26, 34, 42, 50, 58, 66, 74, 82, 90, 98, 106,
3, 7, 12, 18, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 105, 112,
@@ -1090,18 +1079,42 @@
28, 36, 44, 52, 60, 68, 76, 84, 92, 100, 107, 113, 118, 122, 125, 127,
};
+DECLARE_ALIGNED(16, static const int16_t, av1_default_iscan_16x8[128]) = {
+ 0, 1, 3, 6, 10, 15, 21, 28, 2, 4, 7, 11, 16, 22, 29, 36,
+ 5, 8, 12, 17, 23, 30, 37, 44, 9, 13, 18, 24, 31, 38, 45, 52,
+ 14, 19, 25, 32, 39, 46, 53, 60, 20, 26, 33, 40, 47, 54, 61, 68,
+ 27, 34, 41, 48, 55, 62, 69, 76, 35, 42, 49, 56, 63, 70, 77, 84,
+ 43, 50, 57, 64, 71, 78, 85, 92, 51, 58, 65, 72, 79, 86, 93, 100,
+ 59, 66, 73, 80, 87, 94, 101, 107, 67, 74, 81, 88, 95, 102, 108, 113,
+ 75, 82, 89, 96, 103, 109, 114, 118, 83, 90, 97, 104, 110, 115, 119, 122,
+ 91, 98, 105, 111, 116, 120, 123, 125, 99, 106, 112, 117, 121, 124, 126, 127,
+};
+
DECLARE_ALIGNED(16, static const int16_t, av1_mcol_iscan_8x16[128]) = {
- 0, 16, 32, 48, 64, 80, 96, 112, 1, 17, 33, 49, 65, 81, 97, 113,
- 2, 18, 34, 50, 66, 82, 98, 114, 3, 19, 35, 51, 67, 83, 99, 115,
- 4, 20, 36, 52, 68, 84, 100, 116, 5, 21, 37, 53, 69, 85, 101, 117,
- 6, 22, 38, 54, 70, 86, 102, 118, 7, 23, 39, 55, 71, 87, 103, 119,
- 8, 24, 40, 56, 72, 88, 104, 120, 9, 25, 41, 57, 73, 89, 105, 121,
- 10, 26, 42, 58, 74, 90, 106, 122, 11, 27, 43, 59, 75, 91, 107, 123,
- 12, 28, 44, 60, 76, 92, 108, 124, 13, 29, 45, 61, 77, 93, 109, 125,
- 14, 30, 46, 62, 78, 94, 110, 126, 15, 31, 47, 63, 79, 95, 111, 127,
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
+ 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
+ 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
+ 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
+ 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
+ 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
+ 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
+ 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
+ 120, 121, 122, 123, 124, 125, 126, 127,
};
DECLARE_ALIGNED(16, static const int16_t, av1_mcol_iscan_16x8[128]) = {
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
+ 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
+ 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
+ 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
+ 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
+ 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
+ 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
+ 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
+ 120, 121, 122, 123, 124, 125, 126, 127,
+};
+
+DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_8x16[128]) = {
0, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120,
1, 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 105, 113, 121,
2, 10, 18, 26, 34, 42, 50, 58, 66, 74, 82, 90, 98, 106, 114, 122,
@@ -1112,69 +1125,18 @@
7, 15, 23, 31, 39, 47, 55, 63, 71, 79, 87, 95, 103, 111, 119, 127,
};
-DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_8x16[128]) = {
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
- 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
- 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
- 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
- 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
- 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
- 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
- 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
- 120, 121, 122, 123, 124, 125, 126, 127,
-};
-
DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_16x8[128]) = {
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
- 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
- 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
- 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
- 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
- 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
- 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
- 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
- 120, 121, 122, 123, 124, 125, 126, 127,
+ 0, 16, 32, 48, 64, 80, 96, 112, 1, 17, 33, 49, 65, 81, 97, 113,
+ 2, 18, 34, 50, 66, 82, 98, 114, 3, 19, 35, 51, 67, 83, 99, 115,
+ 4, 20, 36, 52, 68, 84, 100, 116, 5, 21, 37, 53, 69, 85, 101, 117,
+ 6, 22, 38, 54, 70, 86, 102, 118, 7, 23, 39, 55, 71, 87, 103, 119,
+ 8, 24, 40, 56, 72, 88, 104, 120, 9, 25, 41, 57, 73, 89, 105, 121,
+ 10, 26, 42, 58, 74, 90, 106, 122, 11, 27, 43, 59, 75, 91, 107, 123,
+ 12, 28, 44, 60, 76, 92, 108, 124, 13, 29, 45, 61, 77, 93, 109, 125,
+ 14, 30, 46, 62, 78, 94, 110, 126, 15, 31, 47, 63, 79, 95, 111, 127,
};
DECLARE_ALIGNED(16, static const int16_t, av1_default_iscan_16x32[512]) = {
- 0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105,
- 120, 2, 4, 7, 11, 16, 22, 29, 37, 46, 56, 67, 79, 92, 106,
- 121, 136, 5, 8, 12, 17, 23, 30, 38, 47, 57, 68, 80, 93, 107,
- 122, 137, 152, 9, 13, 18, 24, 31, 39, 48, 58, 69, 81, 94, 108,
- 123, 138, 153, 168, 14, 19, 25, 32, 40, 49, 59, 70, 82, 95, 109,
- 124, 139, 154, 169, 184, 20, 26, 33, 41, 50, 60, 71, 83, 96, 110,
- 125, 140, 155, 170, 185, 200, 27, 34, 42, 51, 61, 72, 84, 97, 111,
- 126, 141, 156, 171, 186, 201, 216, 35, 43, 52, 62, 73, 85, 98, 112,
- 127, 142, 157, 172, 187, 202, 217, 232, 44, 53, 63, 74, 86, 99, 113,
- 128, 143, 158, 173, 188, 203, 218, 233, 248, 54, 64, 75, 87, 100, 114,
- 129, 144, 159, 174, 189, 204, 219, 234, 249, 264, 65, 76, 88, 101, 115,
- 130, 145, 160, 175, 190, 205, 220, 235, 250, 265, 280, 77, 89, 102, 116,
- 131, 146, 161, 176, 191, 206, 221, 236, 251, 266, 281, 296, 90, 103, 117,
- 132, 147, 162, 177, 192, 207, 222, 237, 252, 267, 282, 297, 312, 104, 118,
- 133, 148, 163, 178, 193, 208, 223, 238, 253, 268, 283, 298, 313, 328, 119,
- 134, 149, 164, 179, 194, 209, 224, 239, 254, 269, 284, 299, 314, 329, 344,
- 135, 150, 165, 180, 195, 210, 225, 240, 255, 270, 285, 300, 315, 330, 345,
- 360, 151, 166, 181, 196, 211, 226, 241, 256, 271, 286, 301, 316, 331, 346,
- 361, 376, 167, 182, 197, 212, 227, 242, 257, 272, 287, 302, 317, 332, 347,
- 362, 377, 392, 183, 198, 213, 228, 243, 258, 273, 288, 303, 318, 333, 348,
- 363, 378, 393, 407, 199, 214, 229, 244, 259, 274, 289, 304, 319, 334, 349,
- 364, 379, 394, 408, 421, 215, 230, 245, 260, 275, 290, 305, 320, 335, 350,
- 365, 380, 395, 409, 422, 434, 231, 246, 261, 276, 291, 306, 321, 336, 351,
- 366, 381, 396, 410, 423, 435, 446, 247, 262, 277, 292, 307, 322, 337, 352,
- 367, 382, 397, 411, 424, 436, 447, 457, 263, 278, 293, 308, 323, 338, 353,
- 368, 383, 398, 412, 425, 437, 448, 458, 467, 279, 294, 309, 324, 339, 354,
- 369, 384, 399, 413, 426, 438, 449, 459, 468, 476, 295, 310, 325, 340, 355,
- 370, 385, 400, 414, 427, 439, 450, 460, 469, 477, 484, 311, 326, 341, 356,
- 371, 386, 401, 415, 428, 440, 451, 461, 470, 478, 485, 491, 327, 342, 357,
- 372, 387, 402, 416, 429, 441, 452, 462, 471, 479, 486, 492, 497, 343, 358,
- 373, 388, 403, 417, 430, 442, 453, 463, 472, 480, 487, 493, 498, 502, 359,
- 374, 389, 404, 418, 431, 443, 454, 464, 473, 481, 488, 494, 499, 503, 506,
- 375, 390, 405, 419, 432, 444, 455, 465, 474, 482, 489, 495, 500, 504, 507,
- 509, 391, 406, 420, 433, 445, 456, 466, 475, 483, 490, 496, 501, 505, 508,
- 510, 511,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, av1_default_iscan_32x16[512]) = {
0, 2, 5, 9, 14, 20, 27, 35, 44, 54, 65, 77, 90, 104, 119,
135, 151, 167, 183, 199, 215, 231, 247, 263, 279, 295, 311, 327, 343, 359,
375, 391, 1, 4, 8, 13, 19, 26, 34, 43, 53, 64, 76, 89, 103,
@@ -1212,42 +1174,121 @@
509, 511,
};
+DECLARE_ALIGNED(16, static const int16_t, av1_default_iscan_32x16[512]) = {
+ 0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105,
+ 120, 2, 4, 7, 11, 16, 22, 29, 37, 46, 56, 67, 79, 92, 106,
+ 121, 136, 5, 8, 12, 17, 23, 30, 38, 47, 57, 68, 80, 93, 107,
+ 122, 137, 152, 9, 13, 18, 24, 31, 39, 48, 58, 69, 81, 94, 108,
+ 123, 138, 153, 168, 14, 19, 25, 32, 40, 49, 59, 70, 82, 95, 109,
+ 124, 139, 154, 169, 184, 20, 26, 33, 41, 50, 60, 71, 83, 96, 110,
+ 125, 140, 155, 170, 185, 200, 27, 34, 42, 51, 61, 72, 84, 97, 111,
+ 126, 141, 156, 171, 186, 201, 216, 35, 43, 52, 62, 73, 85, 98, 112,
+ 127, 142, 157, 172, 187, 202, 217, 232, 44, 53, 63, 74, 86, 99, 113,
+ 128, 143, 158, 173, 188, 203, 218, 233, 248, 54, 64, 75, 87, 100, 114,
+ 129, 144, 159, 174, 189, 204, 219, 234, 249, 264, 65, 76, 88, 101, 115,
+ 130, 145, 160, 175, 190, 205, 220, 235, 250, 265, 280, 77, 89, 102, 116,
+ 131, 146, 161, 176, 191, 206, 221, 236, 251, 266, 281, 296, 90, 103, 117,
+ 132, 147, 162, 177, 192, 207, 222, 237, 252, 267, 282, 297, 312, 104, 118,
+ 133, 148, 163, 178, 193, 208, 223, 238, 253, 268, 283, 298, 313, 328, 119,
+ 134, 149, 164, 179, 194, 209, 224, 239, 254, 269, 284, 299, 314, 329, 344,
+ 135, 150, 165, 180, 195, 210, 225, 240, 255, 270, 285, 300, 315, 330, 345,
+ 360, 151, 166, 181, 196, 211, 226, 241, 256, 271, 286, 301, 316, 331, 346,
+ 361, 376, 167, 182, 197, 212, 227, 242, 257, 272, 287, 302, 317, 332, 347,
+ 362, 377, 392, 183, 198, 213, 228, 243, 258, 273, 288, 303, 318, 333, 348,
+ 363, 378, 393, 407, 199, 214, 229, 244, 259, 274, 289, 304, 319, 334, 349,
+ 364, 379, 394, 408, 421, 215, 230, 245, 260, 275, 290, 305, 320, 335, 350,
+ 365, 380, 395, 409, 422, 434, 231, 246, 261, 276, 291, 306, 321, 336, 351,
+ 366, 381, 396, 410, 423, 435, 446, 247, 262, 277, 292, 307, 322, 337, 352,
+ 367, 382, 397, 411, 424, 436, 447, 457, 263, 278, 293, 308, 323, 338, 353,
+ 368, 383, 398, 412, 425, 437, 448, 458, 467, 279, 294, 309, 324, 339, 354,
+ 369, 384, 399, 413, 426, 438, 449, 459, 468, 476, 295, 310, 325, 340, 355,
+ 370, 385, 400, 414, 427, 439, 450, 460, 469, 477, 484, 311, 326, 341, 356,
+ 371, 386, 401, 415, 428, 440, 451, 461, 470, 478, 485, 491, 327, 342, 357,
+ 372, 387, 402, 416, 429, 441, 452, 462, 471, 479, 486, 492, 497, 343, 358,
+ 373, 388, 403, 417, 430, 442, 453, 463, 472, 480, 487, 493, 498, 502, 359,
+ 374, 389, 404, 418, 431, 443, 454, 464, 473, 481, 488, 494, 499, 503, 506,
+ 375, 390, 405, 419, 432, 444, 455, 465, 474, 482, 489, 495, 500, 504, 507,
+ 509, 391, 406, 420, 433, 445, 456, 466, 475, 483, 490, 496, 501, 505, 508,
+ 510, 511,
+};
+
DECLARE_ALIGNED(16, static const int16_t, av1_mcol_iscan_16x32[512]) = {
- 0, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416, 448, 480,
- 1, 33, 65, 97, 129, 161, 193, 225, 257, 289, 321, 353, 385, 417, 449, 481,
- 2, 34, 66, 98, 130, 162, 194, 226, 258, 290, 322, 354, 386, 418, 450, 482,
- 3, 35, 67, 99, 131, 163, 195, 227, 259, 291, 323, 355, 387, 419, 451, 483,
- 4, 36, 68, 100, 132, 164, 196, 228, 260, 292, 324, 356, 388, 420, 452, 484,
- 5, 37, 69, 101, 133, 165, 197, 229, 261, 293, 325, 357, 389, 421, 453, 485,
- 6, 38, 70, 102, 134, 166, 198, 230, 262, 294, 326, 358, 390, 422, 454, 486,
- 7, 39, 71, 103, 135, 167, 199, 231, 263, 295, 327, 359, 391, 423, 455, 487,
- 8, 40, 72, 104, 136, 168, 200, 232, 264, 296, 328, 360, 392, 424, 456, 488,
- 9, 41, 73, 105, 137, 169, 201, 233, 265, 297, 329, 361, 393, 425, 457, 489,
- 10, 42, 74, 106, 138, 170, 202, 234, 266, 298, 330, 362, 394, 426, 458, 490,
- 11, 43, 75, 107, 139, 171, 203, 235, 267, 299, 331, 363, 395, 427, 459, 491,
- 12, 44, 76, 108, 140, 172, 204, 236, 268, 300, 332, 364, 396, 428, 460, 492,
- 13, 45, 77, 109, 141, 173, 205, 237, 269, 301, 333, 365, 397, 429, 461, 493,
- 14, 46, 78, 110, 142, 174, 206, 238, 270, 302, 334, 366, 398, 430, 462, 494,
- 15, 47, 79, 111, 143, 175, 207, 239, 271, 303, 335, 367, 399, 431, 463, 495,
- 16, 48, 80, 112, 144, 176, 208, 240, 272, 304, 336, 368, 400, 432, 464, 496,
- 17, 49, 81, 113, 145, 177, 209, 241, 273, 305, 337, 369, 401, 433, 465, 497,
- 18, 50, 82, 114, 146, 178, 210, 242, 274, 306, 338, 370, 402, 434, 466, 498,
- 19, 51, 83, 115, 147, 179, 211, 243, 275, 307, 339, 371, 403, 435, 467, 499,
- 20, 52, 84, 116, 148, 180, 212, 244, 276, 308, 340, 372, 404, 436, 468, 500,
- 21, 53, 85, 117, 149, 181, 213, 245, 277, 309, 341, 373, 405, 437, 469, 501,
- 22, 54, 86, 118, 150, 182, 214, 246, 278, 310, 342, 374, 406, 438, 470, 502,
- 23, 55, 87, 119, 151, 183, 215, 247, 279, 311, 343, 375, 407, 439, 471, 503,
- 24, 56, 88, 120, 152, 184, 216, 248, 280, 312, 344, 376, 408, 440, 472, 504,
- 25, 57, 89, 121, 153, 185, 217, 249, 281, 313, 345, 377, 409, 441, 473, 505,
- 26, 58, 90, 122, 154, 186, 218, 250, 282, 314, 346, 378, 410, 442, 474, 506,
- 27, 59, 91, 123, 155, 187, 219, 251, 283, 315, 347, 379, 411, 443, 475, 507,
- 28, 60, 92, 124, 156, 188, 220, 252, 284, 316, 348, 380, 412, 444, 476, 508,
- 29, 61, 93, 125, 157, 189, 221, 253, 285, 317, 349, 381, 413, 445, 477, 509,
- 30, 62, 94, 126, 158, 190, 222, 254, 286, 318, 350, 382, 414, 446, 478, 510,
- 31, 63, 95, 127, 159, 191, 223, 255, 287, 319, 351, 383, 415, 447, 479, 511,
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
+ 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
+ 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
+ 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
+ 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
+ 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
+ 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
+ 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
+ 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134,
+ 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149,
+ 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
+ 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179,
+ 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
+ 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209,
+ 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224,
+ 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
+ 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254,
+ 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269,
+ 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284,
+ 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299,
+ 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314,
+ 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329,
+ 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344,
+ 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359,
+ 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374,
+ 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389,
+ 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404,
+ 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419,
+ 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434,
+ 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449,
+ 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464,
+ 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479,
+ 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494,
+ 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509,
+ 510, 511,
};
DECLARE_ALIGNED(16, static const int16_t, av1_mcol_iscan_32x16[512]) = {
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
+ 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
+ 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
+ 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
+ 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
+ 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
+ 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
+ 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
+ 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134,
+ 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149,
+ 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
+ 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179,
+ 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
+ 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209,
+ 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224,
+ 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
+ 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254,
+ 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269,
+ 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284,
+ 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299,
+ 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314,
+ 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329,
+ 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344,
+ 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359,
+ 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374,
+ 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389,
+ 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404,
+ 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419,
+ 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434,
+ 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449,
+ 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464,
+ 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479,
+ 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494,
+ 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509,
+ 510, 511,
+};
+
+DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_16x32[512]) = {
0, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224,
240, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464,
480, 496, 1, 17, 33, 49, 65, 81, 97, 113, 129, 145, 161, 177, 193,
@@ -1285,102 +1326,42 @@
495, 511,
};
-DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_16x32[512]) = {
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
- 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
- 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
- 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
- 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
- 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
- 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
- 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
- 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134,
- 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149,
- 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
- 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179,
- 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
- 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209,
- 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224,
- 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
- 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254,
- 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269,
- 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284,
- 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299,
- 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314,
- 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329,
- 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344,
- 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359,
- 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374,
- 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389,
- 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404,
- 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419,
- 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434,
- 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449,
- 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464,
- 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479,
- 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494,
- 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509,
- 510, 511,
-};
-
DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_32x16[512]) = {
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
- 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
- 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
- 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
- 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
- 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
- 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
- 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
- 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134,
- 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149,
- 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
- 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179,
- 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
- 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209,
- 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224,
- 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
- 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254,
- 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269,
- 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284,
- 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299,
- 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314,
- 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329,
- 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344,
- 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359,
- 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374,
- 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389,
- 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404,
- 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419,
- 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434,
- 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449,
- 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464,
- 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479,
- 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494,
- 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509,
- 510, 511,
+ 0, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416, 448, 480,
+ 1, 33, 65, 97, 129, 161, 193, 225, 257, 289, 321, 353, 385, 417, 449, 481,
+ 2, 34, 66, 98, 130, 162, 194, 226, 258, 290, 322, 354, 386, 418, 450, 482,
+ 3, 35, 67, 99, 131, 163, 195, 227, 259, 291, 323, 355, 387, 419, 451, 483,
+ 4, 36, 68, 100, 132, 164, 196, 228, 260, 292, 324, 356, 388, 420, 452, 484,
+ 5, 37, 69, 101, 133, 165, 197, 229, 261, 293, 325, 357, 389, 421, 453, 485,
+ 6, 38, 70, 102, 134, 166, 198, 230, 262, 294, 326, 358, 390, 422, 454, 486,
+ 7, 39, 71, 103, 135, 167, 199, 231, 263, 295, 327, 359, 391, 423, 455, 487,
+ 8, 40, 72, 104, 136, 168, 200, 232, 264, 296, 328, 360, 392, 424, 456, 488,
+ 9, 41, 73, 105, 137, 169, 201, 233, 265, 297, 329, 361, 393, 425, 457, 489,
+ 10, 42, 74, 106, 138, 170, 202, 234, 266, 298, 330, 362, 394, 426, 458, 490,
+ 11, 43, 75, 107, 139, 171, 203, 235, 267, 299, 331, 363, 395, 427, 459, 491,
+ 12, 44, 76, 108, 140, 172, 204, 236, 268, 300, 332, 364, 396, 428, 460, 492,
+ 13, 45, 77, 109, 141, 173, 205, 237, 269, 301, 333, 365, 397, 429, 461, 493,
+ 14, 46, 78, 110, 142, 174, 206, 238, 270, 302, 334, 366, 398, 430, 462, 494,
+ 15, 47, 79, 111, 143, 175, 207, 239, 271, 303, 335, 367, 399, 431, 463, 495,
+ 16, 48, 80, 112, 144, 176, 208, 240, 272, 304, 336, 368, 400, 432, 464, 496,
+ 17, 49, 81, 113, 145, 177, 209, 241, 273, 305, 337, 369, 401, 433, 465, 497,
+ 18, 50, 82, 114, 146, 178, 210, 242, 274, 306, 338, 370, 402, 434, 466, 498,
+ 19, 51, 83, 115, 147, 179, 211, 243, 275, 307, 339, 371, 403, 435, 467, 499,
+ 20, 52, 84, 116, 148, 180, 212, 244, 276, 308, 340, 372, 404, 436, 468, 500,
+ 21, 53, 85, 117, 149, 181, 213, 245, 277, 309, 341, 373, 405, 437, 469, 501,
+ 22, 54, 86, 118, 150, 182, 214, 246, 278, 310, 342, 374, 406, 438, 470, 502,
+ 23, 55, 87, 119, 151, 183, 215, 247, 279, 311, 343, 375, 407, 439, 471, 503,
+ 24, 56, 88, 120, 152, 184, 216, 248, 280, 312, 344, 376, 408, 440, 472, 504,
+ 25, 57, 89, 121, 153, 185, 217, 249, 281, 313, 345, 377, 409, 441, 473, 505,
+ 26, 58, 90, 122, 154, 186, 218, 250, 282, 314, 346, 378, 410, 442, 474, 506,
+ 27, 59, 91, 123, 155, 187, 219, 251, 283, 315, 347, 379, 411, 443, 475, 507,
+ 28, 60, 92, 124, 156, 188, 220, 252, 284, 316, 348, 380, 412, 444, 476, 508,
+ 29, 61, 93, 125, 157, 189, 221, 253, 285, 317, 349, 381, 413, 445, 477, 509,
+ 30, 62, 94, 126, 158, 190, 222, 254, 286, 318, 350, 382, 414, 446, 478, 510,
+ 31, 63, 95, 127, 159, 191, 223, 255, 287, 319, 351, 383, 415, 447, 479, 511,
};
DECLARE_ALIGNED(16, static const int16_t, av1_mcol_iscan_16x16[256]) = {
- 0, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240,
- 1, 17, 33, 49, 65, 81, 97, 113, 129, 145, 161, 177, 193, 209, 225, 241,
- 2, 18, 34, 50, 66, 82, 98, 114, 130, 146, 162, 178, 194, 210, 226, 242,
- 3, 19, 35, 51, 67, 83, 99, 115, 131, 147, 163, 179, 195, 211, 227, 243,
- 4, 20, 36, 52, 68, 84, 100, 116, 132, 148, 164, 180, 196, 212, 228, 244,
- 5, 21, 37, 53, 69, 85, 101, 117, 133, 149, 165, 181, 197, 213, 229, 245,
- 6, 22, 38, 54, 70, 86, 102, 118, 134, 150, 166, 182, 198, 214, 230, 246,
- 7, 23, 39, 55, 71, 87, 103, 119, 135, 151, 167, 183, 199, 215, 231, 247,
- 8, 24, 40, 56, 72, 88, 104, 120, 136, 152, 168, 184, 200, 216, 232, 248,
- 9, 25, 41, 57, 73, 89, 105, 121, 137, 153, 169, 185, 201, 217, 233, 249,
- 10, 26, 42, 58, 74, 90, 106, 122, 138, 154, 170, 186, 202, 218, 234, 250,
- 11, 27, 43, 59, 75, 91, 107, 123, 139, 155, 171, 187, 203, 219, 235, 251,
- 12, 28, 44, 60, 76, 92, 108, 124, 140, 156, 172, 188, 204, 220, 236, 252,
- 13, 29, 45, 61, 77, 93, 109, 125, 141, 157, 173, 189, 205, 221, 237, 253,
- 14, 30, 46, 62, 78, 94, 110, 126, 142, 158, 174, 190, 206, 222, 238, 254,
- 15, 31, 47, 63, 79, 95, 111, 127, 143, 159, 175, 191, 207, 223, 239, 255,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_16x16[256]) = {
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
@@ -1401,105 +1382,47 @@
255,
};
+DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_16x16[256]) = {
+ 0, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240,
+ 1, 17, 33, 49, 65, 81, 97, 113, 129, 145, 161, 177, 193, 209, 225, 241,
+ 2, 18, 34, 50, 66, 82, 98, 114, 130, 146, 162, 178, 194, 210, 226, 242,
+ 3, 19, 35, 51, 67, 83, 99, 115, 131, 147, 163, 179, 195, 211, 227, 243,
+ 4, 20, 36, 52, 68, 84, 100, 116, 132, 148, 164, 180, 196, 212, 228, 244,
+ 5, 21, 37, 53, 69, 85, 101, 117, 133, 149, 165, 181, 197, 213, 229, 245,
+ 6, 22, 38, 54, 70, 86, 102, 118, 134, 150, 166, 182, 198, 214, 230, 246,
+ 7, 23, 39, 55, 71, 87, 103, 119, 135, 151, 167, 183, 199, 215, 231, 247,
+ 8, 24, 40, 56, 72, 88, 104, 120, 136, 152, 168, 184, 200, 216, 232, 248,
+ 9, 25, 41, 57, 73, 89, 105, 121, 137, 153, 169, 185, 201, 217, 233, 249,
+ 10, 26, 42, 58, 74, 90, 106, 122, 138, 154, 170, 186, 202, 218, 234, 250,
+ 11, 27, 43, 59, 75, 91, 107, 123, 139, 155, 171, 187, 203, 219, 235, 251,
+ 12, 28, 44, 60, 76, 92, 108, 124, 140, 156, 172, 188, 204, 220, 236, 252,
+ 13, 29, 45, 61, 77, 93, 109, 125, 141, 157, 173, 189, 205, 221, 237, 253,
+ 14, 30, 46, 62, 78, 94, 110, 126, 142, 158, 174, 190, 206, 222, 238, 254,
+ 15, 31, 47, 63, 79, 95, 111, 127, 143, 159, 175, 191, 207, 223, 239, 255,
+};
+
DECLARE_ALIGNED(16, static const int16_t, av1_default_iscan_16x16[256]) = {
- 0, 1, 5, 6, 14, 15, 27, 28, 44, 45, 65, 66, 90, 91, 119,
- 120, 2, 4, 7, 13, 16, 26, 29, 43, 46, 64, 67, 89, 92, 118,
- 121, 150, 3, 8, 12, 17, 25, 30, 42, 47, 63, 68, 88, 93, 117,
- 122, 149, 151, 9, 11, 18, 24, 31, 41, 48, 62, 69, 87, 94, 116,
- 123, 148, 152, 177, 10, 19, 23, 32, 40, 49, 61, 70, 86, 95, 115,
- 124, 147, 153, 176, 178, 20, 22, 33, 39, 50, 60, 71, 85, 96, 114,
- 125, 146, 154, 175, 179, 200, 21, 34, 38, 51, 59, 72, 84, 97, 113,
- 126, 145, 155, 174, 180, 199, 201, 35, 37, 52, 58, 73, 83, 98, 112,
- 127, 144, 156, 173, 181, 198, 202, 219, 36, 53, 57, 74, 82, 99, 111,
- 128, 143, 157, 172, 182, 197, 203, 218, 220, 54, 56, 75, 81, 100, 110,
- 129, 142, 158, 171, 183, 196, 204, 217, 221, 234, 55, 76, 80, 101, 109,
- 130, 141, 159, 170, 184, 195, 205, 216, 222, 233, 235, 77, 79, 102, 108,
- 131, 140, 160, 169, 185, 194, 206, 215, 223, 232, 236, 245, 78, 103, 107,
- 132, 139, 161, 168, 186, 193, 207, 214, 224, 231, 237, 244, 246, 104, 106,
- 133, 138, 162, 167, 187, 192, 208, 213, 225, 230, 238, 243, 247, 252, 105,
- 134, 137, 163, 166, 188, 191, 209, 212, 226, 229, 239, 242, 248, 251, 253,
- 135, 136, 164, 165, 189, 190, 210, 211, 227, 228, 240, 241, 249, 250, 254,
- 255
+ 0, 2, 3, 9, 10, 20, 21, 35, 36, 54, 55, 77, 78, 104, 105,
+ 135, 1, 4, 8, 11, 19, 22, 34, 37, 53, 56, 76, 79, 103, 106,
+ 134, 136, 5, 7, 12, 18, 23, 33, 38, 52, 57, 75, 80, 102, 107,
+ 133, 137, 164, 6, 13, 17, 24, 32, 39, 51, 58, 74, 81, 101, 108,
+ 132, 138, 163, 165, 14, 16, 25, 31, 40, 50, 59, 73, 82, 100, 109,
+ 131, 139, 162, 166, 189, 15, 26, 30, 41, 49, 60, 72, 83, 99, 110,
+ 130, 140, 161, 167, 188, 190, 27, 29, 42, 48, 61, 71, 84, 98, 111,
+ 129, 141, 160, 168, 187, 191, 210, 28, 43, 47, 62, 70, 85, 97, 112,
+ 128, 142, 159, 169, 186, 192, 209, 211, 44, 46, 63, 69, 86, 96, 113,
+ 127, 143, 158, 170, 185, 193, 208, 212, 227, 45, 64, 68, 87, 95, 114,
+ 126, 144, 157, 171, 184, 194, 207, 213, 226, 228, 65, 67, 88, 94, 115,
+ 125, 145, 156, 172, 183, 195, 206, 214, 225, 229, 240, 66, 89, 93, 116,
+ 124, 146, 155, 173, 182, 196, 205, 215, 224, 230, 239, 241, 90, 92, 117,
+ 123, 147, 154, 174, 181, 197, 204, 216, 223, 231, 238, 242, 249, 91, 118,
+ 122, 148, 153, 175, 180, 198, 203, 217, 222, 232, 237, 243, 248, 250, 119,
+ 121, 149, 152, 176, 179, 199, 202, 218, 221, 233, 236, 244, 247, 251, 254,
+ 120, 150, 151, 177, 178, 200, 201, 219, 220, 234, 235, 245, 246, 252, 253,
+ 255,
};
DECLARE_ALIGNED(16, static const int16_t, av1_mcol_iscan_32x32[1024]) = {
- 0, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416,
- 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800, 832, 864,
- 896, 928, 960, 992, 1, 33, 65, 97, 129, 161, 193, 225, 257, 289,
- 321, 353, 385, 417, 449, 481, 513, 545, 577, 609, 641, 673, 705, 737,
- 769, 801, 833, 865, 897, 929, 961, 993, 2, 34, 66, 98, 130, 162,
- 194, 226, 258, 290, 322, 354, 386, 418, 450, 482, 514, 546, 578, 610,
- 642, 674, 706, 738, 770, 802, 834, 866, 898, 930, 962, 994, 3, 35,
- 67, 99, 131, 163, 195, 227, 259, 291, 323, 355, 387, 419, 451, 483,
- 515, 547, 579, 611, 643, 675, 707, 739, 771, 803, 835, 867, 899, 931,
- 963, 995, 4, 36, 68, 100, 132, 164, 196, 228, 260, 292, 324, 356,
- 388, 420, 452, 484, 516, 548, 580, 612, 644, 676, 708, 740, 772, 804,
- 836, 868, 900, 932, 964, 996, 5, 37, 69, 101, 133, 165, 197, 229,
- 261, 293, 325, 357, 389, 421, 453, 485, 517, 549, 581, 613, 645, 677,
- 709, 741, 773, 805, 837, 869, 901, 933, 965, 997, 6, 38, 70, 102,
- 134, 166, 198, 230, 262, 294, 326, 358, 390, 422, 454, 486, 518, 550,
- 582, 614, 646, 678, 710, 742, 774, 806, 838, 870, 902, 934, 966, 998,
- 7, 39, 71, 103, 135, 167, 199, 231, 263, 295, 327, 359, 391, 423,
- 455, 487, 519, 551, 583, 615, 647, 679, 711, 743, 775, 807, 839, 871,
- 903, 935, 967, 999, 8, 40, 72, 104, 136, 168, 200, 232, 264, 296,
- 328, 360, 392, 424, 456, 488, 520, 552, 584, 616, 648, 680, 712, 744,
- 776, 808, 840, 872, 904, 936, 968, 1000, 9, 41, 73, 105, 137, 169,
- 201, 233, 265, 297, 329, 361, 393, 425, 457, 489, 521, 553, 585, 617,
- 649, 681, 713, 745, 777, 809, 841, 873, 905, 937, 969, 1001, 10, 42,
- 74, 106, 138, 170, 202, 234, 266, 298, 330, 362, 394, 426, 458, 490,
- 522, 554, 586, 618, 650, 682, 714, 746, 778, 810, 842, 874, 906, 938,
- 970, 1002, 11, 43, 75, 107, 139, 171, 203, 235, 267, 299, 331, 363,
- 395, 427, 459, 491, 523, 555, 587, 619, 651, 683, 715, 747, 779, 811,
- 843, 875, 907, 939, 971, 1003, 12, 44, 76, 108, 140, 172, 204, 236,
- 268, 300, 332, 364, 396, 428, 460, 492, 524, 556, 588, 620, 652, 684,
- 716, 748, 780, 812, 844, 876, 908, 940, 972, 1004, 13, 45, 77, 109,
- 141, 173, 205, 237, 269, 301, 333, 365, 397, 429, 461, 493, 525, 557,
- 589, 621, 653, 685, 717, 749, 781, 813, 845, 877, 909, 941, 973, 1005,
- 14, 46, 78, 110, 142, 174, 206, 238, 270, 302, 334, 366, 398, 430,
- 462, 494, 526, 558, 590, 622, 654, 686, 718, 750, 782, 814, 846, 878,
- 910, 942, 974, 1006, 15, 47, 79, 111, 143, 175, 207, 239, 271, 303,
- 335, 367, 399, 431, 463, 495, 527, 559, 591, 623, 655, 687, 719, 751,
- 783, 815, 847, 879, 911, 943, 975, 1007, 16, 48, 80, 112, 144, 176,
- 208, 240, 272, 304, 336, 368, 400, 432, 464, 496, 528, 560, 592, 624,
- 656, 688, 720, 752, 784, 816, 848, 880, 912, 944, 976, 1008, 17, 49,
- 81, 113, 145, 177, 209, 241, 273, 305, 337, 369, 401, 433, 465, 497,
- 529, 561, 593, 625, 657, 689, 721, 753, 785, 817, 849, 881, 913, 945,
- 977, 1009, 18, 50, 82, 114, 146, 178, 210, 242, 274, 306, 338, 370,
- 402, 434, 466, 498, 530, 562, 594, 626, 658, 690, 722, 754, 786, 818,
- 850, 882, 914, 946, 978, 1010, 19, 51, 83, 115, 147, 179, 211, 243,
- 275, 307, 339, 371, 403, 435, 467, 499, 531, 563, 595, 627, 659, 691,
- 723, 755, 787, 819, 851, 883, 915, 947, 979, 1011, 20, 52, 84, 116,
- 148, 180, 212, 244, 276, 308, 340, 372, 404, 436, 468, 500, 532, 564,
- 596, 628, 660, 692, 724, 756, 788, 820, 852, 884, 916, 948, 980, 1012,
- 21, 53, 85, 117, 149, 181, 213, 245, 277, 309, 341, 373, 405, 437,
- 469, 501, 533, 565, 597, 629, 661, 693, 725, 757, 789, 821, 853, 885,
- 917, 949, 981, 1013, 22, 54, 86, 118, 150, 182, 214, 246, 278, 310,
- 342, 374, 406, 438, 470, 502, 534, 566, 598, 630, 662, 694, 726, 758,
- 790, 822, 854, 886, 918, 950, 982, 1014, 23, 55, 87, 119, 151, 183,
- 215, 247, 279, 311, 343, 375, 407, 439, 471, 503, 535, 567, 599, 631,
- 663, 695, 727, 759, 791, 823, 855, 887, 919, 951, 983, 1015, 24, 56,
- 88, 120, 152, 184, 216, 248, 280, 312, 344, 376, 408, 440, 472, 504,
- 536, 568, 600, 632, 664, 696, 728, 760, 792, 824, 856, 888, 920, 952,
- 984, 1016, 25, 57, 89, 121, 153, 185, 217, 249, 281, 313, 345, 377,
- 409, 441, 473, 505, 537, 569, 601, 633, 665, 697, 729, 761, 793, 825,
- 857, 889, 921, 953, 985, 1017, 26, 58, 90, 122, 154, 186, 218, 250,
- 282, 314, 346, 378, 410, 442, 474, 506, 538, 570, 602, 634, 666, 698,
- 730, 762, 794, 826, 858, 890, 922, 954, 986, 1018, 27, 59, 91, 123,
- 155, 187, 219, 251, 283, 315, 347, 379, 411, 443, 475, 507, 539, 571,
- 603, 635, 667, 699, 731, 763, 795, 827, 859, 891, 923, 955, 987, 1019,
- 28, 60, 92, 124, 156, 188, 220, 252, 284, 316, 348, 380, 412, 444,
- 476, 508, 540, 572, 604, 636, 668, 700, 732, 764, 796, 828, 860, 892,
- 924, 956, 988, 1020, 29, 61, 93, 125, 157, 189, 221, 253, 285, 317,
- 349, 381, 413, 445, 477, 509, 541, 573, 605, 637, 669, 701, 733, 765,
- 797, 829, 861, 893, 925, 957, 989, 1021, 30, 62, 94, 126, 158, 190,
- 222, 254, 286, 318, 350, 382, 414, 446, 478, 510, 542, 574, 606, 638,
- 670, 702, 734, 766, 798, 830, 862, 894, 926, 958, 990, 1022, 31, 63,
- 95, 127, 159, 191, 223, 255, 287, 319, 351, 383, 415, 447, 479, 511,
- 543, 575, 607, 639, 671, 703, 735, 767, 799, 831, 863, 895, 927, 959,
- 991, 1023,
-};
-
-DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_32x32[1024]) = {
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
@@ -1581,86 +1504,163 @@
1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023,
};
+DECLARE_ALIGNED(16, static const int16_t, av1_mrow_iscan_32x32[1024]) = {
+ 0, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416,
+ 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800, 832, 864,
+ 896, 928, 960, 992, 1, 33, 65, 97, 129, 161, 193, 225, 257, 289,
+ 321, 353, 385, 417, 449, 481, 513, 545, 577, 609, 641, 673, 705, 737,
+ 769, 801, 833, 865, 897, 929, 961, 993, 2, 34, 66, 98, 130, 162,
+ 194, 226, 258, 290, 322, 354, 386, 418, 450, 482, 514, 546, 578, 610,
+ 642, 674, 706, 738, 770, 802, 834, 866, 898, 930, 962, 994, 3, 35,
+ 67, 99, 131, 163, 195, 227, 259, 291, 323, 355, 387, 419, 451, 483,
+ 515, 547, 579, 611, 643, 675, 707, 739, 771, 803, 835, 867, 899, 931,
+ 963, 995, 4, 36, 68, 100, 132, 164, 196, 228, 260, 292, 324, 356,
+ 388, 420, 452, 484, 516, 548, 580, 612, 644, 676, 708, 740, 772, 804,
+ 836, 868, 900, 932, 964, 996, 5, 37, 69, 101, 133, 165, 197, 229,
+ 261, 293, 325, 357, 389, 421, 453, 485, 517, 549, 581, 613, 645, 677,
+ 709, 741, 773, 805, 837, 869, 901, 933, 965, 997, 6, 38, 70, 102,
+ 134, 166, 198, 230, 262, 294, 326, 358, 390, 422, 454, 486, 518, 550,
+ 582, 614, 646, 678, 710, 742, 774, 806, 838, 870, 902, 934, 966, 998,
+ 7, 39, 71, 103, 135, 167, 199, 231, 263, 295, 327, 359, 391, 423,
+ 455, 487, 519, 551, 583, 615, 647, 679, 711, 743, 775, 807, 839, 871,
+ 903, 935, 967, 999, 8, 40, 72, 104, 136, 168, 200, 232, 264, 296,
+ 328, 360, 392, 424, 456, 488, 520, 552, 584, 616, 648, 680, 712, 744,
+ 776, 808, 840, 872, 904, 936, 968, 1000, 9, 41, 73, 105, 137, 169,
+ 201, 233, 265, 297, 329, 361, 393, 425, 457, 489, 521, 553, 585, 617,
+ 649, 681, 713, 745, 777, 809, 841, 873, 905, 937, 969, 1001, 10, 42,
+ 74, 106, 138, 170, 202, 234, 266, 298, 330, 362, 394, 426, 458, 490,
+ 522, 554, 586, 618, 650, 682, 714, 746, 778, 810, 842, 874, 906, 938,
+ 970, 1002, 11, 43, 75, 107, 139, 171, 203, 235, 267, 299, 331, 363,
+ 395, 427, 459, 491, 523, 555, 587, 619, 651, 683, 715, 747, 779, 811,
+ 843, 875, 907, 939, 971, 1003, 12, 44, 76, 108, 140, 172, 204, 236,
+ 268, 300, 332, 364, 396, 428, 460, 492, 524, 556, 588, 620, 652, 684,
+ 716, 748, 780, 812, 844, 876, 908, 940, 972, 1004, 13, 45, 77, 109,
+ 141, 173, 205, 237, 269, 301, 333, 365, 397, 429, 461, 493, 525, 557,
+ 589, 621, 653, 685, 717, 749, 781, 813, 845, 877, 909, 941, 973, 1005,
+ 14, 46, 78, 110, 142, 174, 206, 238, 270, 302, 334, 366, 398, 430,
+ 462, 494, 526, 558, 590, 622, 654, 686, 718, 750, 782, 814, 846, 878,
+ 910, 942, 974, 1006, 15, 47, 79, 111, 143, 175, 207, 239, 271, 303,
+ 335, 367, 399, 431, 463, 495, 527, 559, 591, 623, 655, 687, 719, 751,
+ 783, 815, 847, 879, 911, 943, 975, 1007, 16, 48, 80, 112, 144, 176,
+ 208, 240, 272, 304, 336, 368, 400, 432, 464, 496, 528, 560, 592, 624,
+ 656, 688, 720, 752, 784, 816, 848, 880, 912, 944, 976, 1008, 17, 49,
+ 81, 113, 145, 177, 209, 241, 273, 305, 337, 369, 401, 433, 465, 497,
+ 529, 561, 593, 625, 657, 689, 721, 753, 785, 817, 849, 881, 913, 945,
+ 977, 1009, 18, 50, 82, 114, 146, 178, 210, 242, 274, 306, 338, 370,
+ 402, 434, 466, 498, 530, 562, 594, 626, 658, 690, 722, 754, 786, 818,
+ 850, 882, 914, 946, 978, 1010, 19, 51, 83, 115, 147, 179, 211, 243,
+ 275, 307, 339, 371, 403, 435, 467, 499, 531, 563, 595, 627, 659, 691,
+ 723, 755, 787, 819, 851, 883, 915, 947, 979, 1011, 20, 52, 84, 116,
+ 148, 180, 212, 244, 276, 308, 340, 372, 404, 436, 468, 500, 532, 564,
+ 596, 628, 660, 692, 724, 756, 788, 820, 852, 884, 916, 948, 980, 1012,
+ 21, 53, 85, 117, 149, 181, 213, 245, 277, 309, 341, 373, 405, 437,
+ 469, 501, 533, 565, 597, 629, 661, 693, 725, 757, 789, 821, 853, 885,
+ 917, 949, 981, 1013, 22, 54, 86, 118, 150, 182, 214, 246, 278, 310,
+ 342, 374, 406, 438, 470, 502, 534, 566, 598, 630, 662, 694, 726, 758,
+ 790, 822, 854, 886, 918, 950, 982, 1014, 23, 55, 87, 119, 151, 183,
+ 215, 247, 279, 311, 343, 375, 407, 439, 471, 503, 535, 567, 599, 631,
+ 663, 695, 727, 759, 791, 823, 855, 887, 919, 951, 983, 1015, 24, 56,
+ 88, 120, 152, 184, 216, 248, 280, 312, 344, 376, 408, 440, 472, 504,
+ 536, 568, 600, 632, 664, 696, 728, 760, 792, 824, 856, 888, 920, 952,
+ 984, 1016, 25, 57, 89, 121, 153, 185, 217, 249, 281, 313, 345, 377,
+ 409, 441, 473, 505, 537, 569, 601, 633, 665, 697, 729, 761, 793, 825,
+ 857, 889, 921, 953, 985, 1017, 26, 58, 90, 122, 154, 186, 218, 250,
+ 282, 314, 346, 378, 410, 442, 474, 506, 538, 570, 602, 634, 666, 698,
+ 730, 762, 794, 826, 858, 890, 922, 954, 986, 1018, 27, 59, 91, 123,
+ 155, 187, 219, 251, 283, 315, 347, 379, 411, 443, 475, 507, 539, 571,
+ 603, 635, 667, 699, 731, 763, 795, 827, 859, 891, 923, 955, 987, 1019,
+ 28, 60, 92, 124, 156, 188, 220, 252, 284, 316, 348, 380, 412, 444,
+ 476, 508, 540, 572, 604, 636, 668, 700, 732, 764, 796, 828, 860, 892,
+ 924, 956, 988, 1020, 29, 61, 93, 125, 157, 189, 221, 253, 285, 317,
+ 349, 381, 413, 445, 477, 509, 541, 573, 605, 637, 669, 701, 733, 765,
+ 797, 829, 861, 893, 925, 957, 989, 1021, 30, 62, 94, 126, 158, 190,
+ 222, 254, 286, 318, 350, 382, 414, 446, 478, 510, 542, 574, 606, 638,
+ 670, 702, 734, 766, 798, 830, 862, 894, 926, 958, 990, 1022, 31, 63,
+ 95, 127, 159, 191, 223, 255, 287, 319, 351, 383, 415, 447, 479, 511,
+ 543, 575, 607, 639, 671, 703, 735, 767, 799, 831, 863, 895, 927, 959,
+ 991, 1023,
+};
+
DECLARE_ALIGNED(16, static const int16_t, av1_default_iscan_32x32[1024]) = {
- 0, 1, 5, 6, 14, 15, 27, 28, 44, 45, 65, 66, 90,
- 91, 119, 120, 152, 153, 189, 190, 230, 231, 275, 276, 324, 325,
- 377, 378, 434, 435, 495, 496, 2, 4, 7, 13, 16, 26, 29,
- 43, 46, 64, 67, 89, 92, 118, 121, 151, 154, 188, 191, 229,
- 232, 274, 277, 323, 326, 376, 379, 433, 436, 494, 497, 558, 3,
- 8, 12, 17, 25, 30, 42, 47, 63, 68, 88, 93, 117, 122,
- 150, 155, 187, 192, 228, 233, 273, 278, 322, 327, 375, 380, 432,
- 437, 493, 498, 557, 559, 9, 11, 18, 24, 31, 41, 48, 62,
- 69, 87, 94, 116, 123, 149, 156, 186, 193, 227, 234, 272, 279,
- 321, 328, 374, 381, 431, 438, 492, 499, 556, 560, 617, 10, 19,
- 23, 32, 40, 49, 61, 70, 86, 95, 115, 124, 148, 157, 185,
- 194, 226, 235, 271, 280, 320, 329, 373, 382, 430, 439, 491, 500,
- 555, 561, 616, 618, 20, 22, 33, 39, 50, 60, 71, 85, 96,
- 114, 125, 147, 158, 184, 195, 225, 236, 270, 281, 319, 330, 372,
- 383, 429, 440, 490, 501, 554, 562, 615, 619, 672, 21, 34, 38,
- 51, 59, 72, 84, 97, 113, 126, 146, 159, 183, 196, 224, 237,
- 269, 282, 318, 331, 371, 384, 428, 441, 489, 502, 553, 563, 614,
- 620, 671, 673, 35, 37, 52, 58, 73, 83, 98, 112, 127, 145,
- 160, 182, 197, 223, 238, 268, 283, 317, 332, 370, 385, 427, 442,
- 488, 503, 552, 564, 613, 621, 670, 674, 723, 36, 53, 57, 74,
- 82, 99, 111, 128, 144, 161, 181, 198, 222, 239, 267, 284, 316,
- 333, 369, 386, 426, 443, 487, 504, 551, 565, 612, 622, 669, 675,
- 722, 724, 54, 56, 75, 81, 100, 110, 129, 143, 162, 180, 199,
- 221, 240, 266, 285, 315, 334, 368, 387, 425, 444, 486, 505, 550,
- 566, 611, 623, 668, 676, 721, 725, 770, 55, 76, 80, 101, 109,
- 130, 142, 163, 179, 200, 220, 241, 265, 286, 314, 335, 367, 388,
- 424, 445, 485, 506, 549, 567, 610, 624, 667, 677, 720, 726, 769,
- 771, 77, 79, 102, 108, 131, 141, 164, 178, 201, 219, 242, 264,
- 287, 313, 336, 366, 389, 423, 446, 484, 507, 548, 568, 609, 625,
- 666, 678, 719, 727, 768, 772, 813, 78, 103, 107, 132, 140, 165,
- 177, 202, 218, 243, 263, 288, 312, 337, 365, 390, 422, 447, 483,
- 508, 547, 569, 608, 626, 665, 679, 718, 728, 767, 773, 812, 814,
- 104, 106, 133, 139, 166, 176, 203, 217, 244, 262, 289, 311, 338,
- 364, 391, 421, 448, 482, 509, 546, 570, 607, 627, 664, 680, 717,
- 729, 766, 774, 811, 815, 852, 105, 134, 138, 167, 175, 204, 216,
- 245, 261, 290, 310, 339, 363, 392, 420, 449, 481, 510, 545, 571,
- 606, 628, 663, 681, 716, 730, 765, 775, 810, 816, 851, 853, 135,
- 137, 168, 174, 205, 215, 246, 260, 291, 309, 340, 362, 393, 419,
- 450, 480, 511, 544, 572, 605, 629, 662, 682, 715, 731, 764, 776,
- 809, 817, 850, 854, 887, 136, 169, 173, 206, 214, 247, 259, 292,
- 308, 341, 361, 394, 418, 451, 479, 512, 543, 573, 604, 630, 661,
- 683, 714, 732, 763, 777, 808, 818, 849, 855, 886, 888, 170, 172,
- 207, 213, 248, 258, 293, 307, 342, 360, 395, 417, 452, 478, 513,
- 542, 574, 603, 631, 660, 684, 713, 733, 762, 778, 807, 819, 848,
- 856, 885, 889, 918, 171, 208, 212, 249, 257, 294, 306, 343, 359,
- 396, 416, 453, 477, 514, 541, 575, 602, 632, 659, 685, 712, 734,
- 761, 779, 806, 820, 847, 857, 884, 890, 917, 919, 209, 211, 250,
- 256, 295, 305, 344, 358, 397, 415, 454, 476, 515, 540, 576, 601,
- 633, 658, 686, 711, 735, 760, 780, 805, 821, 846, 858, 883, 891,
- 916, 920, 945, 210, 251, 255, 296, 304, 345, 357, 398, 414, 455,
- 475, 516, 539, 577, 600, 634, 657, 687, 710, 736, 759, 781, 804,
- 822, 845, 859, 882, 892, 915, 921, 944, 946, 252, 254, 297, 303,
- 346, 356, 399, 413, 456, 474, 517, 538, 578, 599, 635, 656, 688,
- 709, 737, 758, 782, 803, 823, 844, 860, 881, 893, 914, 922, 943,
- 947, 968, 253, 298, 302, 347, 355, 400, 412, 457, 473, 518, 537,
- 579, 598, 636, 655, 689, 708, 738, 757, 783, 802, 824, 843, 861,
- 880, 894, 913, 923, 942, 948, 967, 969, 299, 301, 348, 354, 401,
- 411, 458, 472, 519, 536, 580, 597, 637, 654, 690, 707, 739, 756,
- 784, 801, 825, 842, 862, 879, 895, 912, 924, 941, 949, 966, 970,
- 987, 300, 349, 353, 402, 410, 459, 471, 520, 535, 581, 596, 638,
- 653, 691, 706, 740, 755, 785, 800, 826, 841, 863, 878, 896, 911,
- 925, 940, 950, 965, 971, 986, 988, 350, 352, 403, 409, 460, 470,
- 521, 534, 582, 595, 639, 652, 692, 705, 741, 754, 786, 799, 827,
- 840, 864, 877, 897, 910, 926, 939, 951, 964, 972, 985, 989, 1002,
- 351, 404, 408, 461, 469, 522, 533, 583, 594, 640, 651, 693, 704,
- 742, 753, 787, 798, 828, 839, 865, 876, 898, 909, 927, 938, 952,
- 963, 973, 984, 990, 1001, 1003, 405, 407, 462, 468, 523, 532, 584,
- 593, 641, 650, 694, 703, 743, 752, 788, 797, 829, 838, 866, 875,
- 899, 908, 928, 937, 953, 962, 974, 983, 991, 1000, 1004, 1013, 406,
- 463, 467, 524, 531, 585, 592, 642, 649, 695, 702, 744, 751, 789,
- 796, 830, 837, 867, 874, 900, 907, 929, 936, 954, 961, 975, 982,
- 992, 999, 1005, 1012, 1014, 464, 466, 525, 530, 586, 591, 643, 648,
- 696, 701, 745, 750, 790, 795, 831, 836, 868, 873, 901, 906, 930,
- 935, 955, 960, 976, 981, 993, 998, 1006, 1011, 1015, 1020, 465, 526,
- 529, 587, 590, 644, 647, 697, 700, 746, 749, 791, 794, 832, 835,
- 869, 872, 902, 905, 931, 934, 956, 959, 977, 980, 994, 997, 1007,
- 1010, 1016, 1019, 1021, 527, 528, 588, 589, 645, 646, 698, 699, 747,
- 748, 792, 793, 833, 834, 870, 871, 903, 904, 932, 933, 957, 958,
- 978, 979, 995, 996, 1008, 1009, 1017, 1018, 1022, 1023
+ 0, 2, 3, 9, 10, 20, 21, 35, 36, 54, 55, 77, 78,
+ 104, 105, 135, 136, 170, 171, 209, 210, 252, 253, 299, 300, 350,
+ 351, 405, 406, 464, 465, 527, 1, 4, 8, 11, 19, 22, 34,
+ 37, 53, 56, 76, 79, 103, 106, 134, 137, 169, 172, 208, 211,
+ 251, 254, 298, 301, 349, 352, 404, 407, 463, 466, 526, 528, 5,
+ 7, 12, 18, 23, 33, 38, 52, 57, 75, 80, 102, 107, 133,
+ 138, 168, 173, 207, 212, 250, 255, 297, 302, 348, 353, 403, 408,
+ 462, 467, 525, 529, 588, 6, 13, 17, 24, 32, 39, 51, 58,
+ 74, 81, 101, 108, 132, 139, 167, 174, 206, 213, 249, 256, 296,
+ 303, 347, 354, 402, 409, 461, 468, 524, 530, 587, 589, 14, 16,
+ 25, 31, 40, 50, 59, 73, 82, 100, 109, 131, 140, 166, 175,
+ 205, 214, 248, 257, 295, 304, 346, 355, 401, 410, 460, 469, 523,
+ 531, 586, 590, 645, 15, 26, 30, 41, 49, 60, 72, 83, 99,
+ 110, 130, 141, 165, 176, 204, 215, 247, 258, 294, 305, 345, 356,
+ 400, 411, 459, 470, 522, 532, 585, 591, 644, 646, 27, 29, 42,
+ 48, 61, 71, 84, 98, 111, 129, 142, 164, 177, 203, 216, 246,
+ 259, 293, 306, 344, 357, 399, 412, 458, 471, 521, 533, 584, 592,
+ 643, 647, 698, 28, 43, 47, 62, 70, 85, 97, 112, 128, 143,
+ 163, 178, 202, 217, 245, 260, 292, 307, 343, 358, 398, 413, 457,
+ 472, 520, 534, 583, 593, 642, 648, 697, 699, 44, 46, 63, 69,
+ 86, 96, 113, 127, 144, 162, 179, 201, 218, 244, 261, 291, 308,
+ 342, 359, 397, 414, 456, 473, 519, 535, 582, 594, 641, 649, 696,
+ 700, 747, 45, 64, 68, 87, 95, 114, 126, 145, 161, 180, 200,
+ 219, 243, 262, 290, 309, 341, 360, 396, 415, 455, 474, 518, 536,
+ 581, 595, 640, 650, 695, 701, 746, 748, 65, 67, 88, 94, 115,
+ 125, 146, 160, 181, 199, 220, 242, 263, 289, 310, 340, 361, 395,
+ 416, 454, 475, 517, 537, 580, 596, 639, 651, 694, 702, 745, 749,
+ 792, 66, 89, 93, 116, 124, 147, 159, 182, 198, 221, 241, 264,
+ 288, 311, 339, 362, 394, 417, 453, 476, 516, 538, 579, 597, 638,
+ 652, 693, 703, 744, 750, 791, 793, 90, 92, 117, 123, 148, 158,
+ 183, 197, 222, 240, 265, 287, 312, 338, 363, 393, 418, 452, 477,
+ 515, 539, 578, 598, 637, 653, 692, 704, 743, 751, 790, 794, 833,
+ 91, 118, 122, 149, 157, 184, 196, 223, 239, 266, 286, 313, 337,
+ 364, 392, 419, 451, 478, 514, 540, 577, 599, 636, 654, 691, 705,
+ 742, 752, 789, 795, 832, 834, 119, 121, 150, 156, 185, 195, 224,
+ 238, 267, 285, 314, 336, 365, 391, 420, 450, 479, 513, 541, 576,
+ 600, 635, 655, 690, 706, 741, 753, 788, 796, 831, 835, 870, 120,
+ 151, 155, 186, 194, 225, 237, 268, 284, 315, 335, 366, 390, 421,
+ 449, 480, 512, 542, 575, 601, 634, 656, 689, 707, 740, 754, 787,
+ 797, 830, 836, 869, 871, 152, 154, 187, 193, 226, 236, 269, 283,
+ 316, 334, 367, 389, 422, 448, 481, 511, 543, 574, 602, 633, 657,
+ 688, 708, 739, 755, 786, 798, 829, 837, 868, 872, 903, 153, 188,
+ 192, 227, 235, 270, 282, 317, 333, 368, 388, 423, 447, 482, 510,
+ 544, 573, 603, 632, 658, 687, 709, 738, 756, 785, 799, 828, 838,
+ 867, 873, 902, 904, 189, 191, 228, 234, 271, 281, 318, 332, 369,
+ 387, 424, 446, 483, 509, 545, 572, 604, 631, 659, 686, 710, 737,
+ 757, 784, 800, 827, 839, 866, 874, 901, 905, 932, 190, 229, 233,
+ 272, 280, 319, 331, 370, 386, 425, 445, 484, 508, 546, 571, 605,
+ 630, 660, 685, 711, 736, 758, 783, 801, 826, 840, 865, 875, 900,
+ 906, 931, 933, 230, 232, 273, 279, 320, 330, 371, 385, 426, 444,
+ 485, 507, 547, 570, 606, 629, 661, 684, 712, 735, 759, 782, 802,
+ 825, 841, 864, 876, 899, 907, 930, 934, 957, 231, 274, 278, 321,
+ 329, 372, 384, 427, 443, 486, 506, 548, 569, 607, 628, 662, 683,
+ 713, 734, 760, 781, 803, 824, 842, 863, 877, 898, 908, 929, 935,
+ 956, 958, 275, 277, 322, 328, 373, 383, 428, 442, 487, 505, 549,
+ 568, 608, 627, 663, 682, 714, 733, 761, 780, 804, 823, 843, 862,
+ 878, 897, 909, 928, 936, 955, 959, 978, 276, 323, 327, 374, 382,
+ 429, 441, 488, 504, 550, 567, 609, 626, 664, 681, 715, 732, 762,
+ 779, 805, 822, 844, 861, 879, 896, 910, 927, 937, 954, 960, 977,
+ 979, 324, 326, 375, 381, 430, 440, 489, 503, 551, 566, 610, 625,
+ 665, 680, 716, 731, 763, 778, 806, 821, 845, 860, 880, 895, 911,
+ 926, 938, 953, 961, 976, 980, 995, 325, 376, 380, 431, 439, 490,
+ 502, 552, 565, 611, 624, 666, 679, 717, 730, 764, 777, 807, 820,
+ 846, 859, 881, 894, 912, 925, 939, 952, 962, 975, 981, 994, 996,
+ 377, 379, 432, 438, 491, 501, 553, 564, 612, 623, 667, 678, 718,
+ 729, 765, 776, 808, 819, 847, 858, 882, 893, 913, 924, 940, 951,
+ 963, 974, 982, 993, 997, 1008, 378, 433, 437, 492, 500, 554, 563,
+ 613, 622, 668, 677, 719, 728, 766, 775, 809, 818, 848, 857, 883,
+ 892, 914, 923, 941, 950, 964, 973, 983, 992, 998, 1007, 1009, 434,
+ 436, 493, 499, 555, 562, 614, 621, 669, 676, 720, 727, 767, 774,
+ 810, 817, 849, 856, 884, 891, 915, 922, 942, 949, 965, 972, 984,
+ 991, 999, 1006, 1010, 1017, 435, 494, 498, 556, 561, 615, 620, 670,
+ 675, 721, 726, 768, 773, 811, 816, 850, 855, 885, 890, 916, 921,
+ 943, 948, 966, 971, 985, 990, 1000, 1005, 1011, 1016, 1018, 495, 497,
+ 557, 560, 616, 619, 671, 674, 722, 725, 769, 772, 812, 815, 851,
+ 854, 886, 889, 917, 920, 944, 947, 967, 970, 986, 989, 1001, 1004,
+ 1012, 1015, 1019, 1022, 496, 558, 559, 617, 618, 672, 673, 723, 724,
+ 770, 771, 813, 814, 852, 853, 887, 888, 918, 919, 945, 946, 968,
+ 969, 987, 988, 1002, 1003, 1013, 1014, 1020, 1021, 1023,
};
const SCAN_ORDER av1_scan_orders[TX_SIZES_ALL][TX_TYPES] = {
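
The iscan tables being shuffled above are inverse scans: if scan[i] is the i-th
position visited, then iscan[scan[i]] == i, and av1_scan_orders pairs each scan
with its iscan. Read against the removed lines, this scan.c diff is one
transposition pass: every rectangular body is reattached to the name with the
dimensions and the mrow/mcol role swapped, the square mrow/mcol bodies trade
places, and the square default (zigzag) bodies are replaced by their own
transposes. Below is a minimal sketch of how such a body can be generated,
assuming a column scan over an 8x16 block stored row-major; the raster
convention is exactly what this change re-orients, so treat it as an
assumption rather than libaom's definition. With these parameters it prints
the body kept as context above, which the patch moves from
av1_mcol_iscan_16x8 to av1_mrow_iscan_8x16:

    #include <stdint.h>
    #include <stdio.h>

    #define ROWS 8
    #define COLS 16

    int main(void) {
      int16_t scan[ROWS * COLS];   /* scan index -> raster position */
      int16_t iscan[ROWS * COLS];  /* raster position -> scan index */
      int i = 0;
      /* Column scan: walk down each column in turn. */
      for (int c = 0; c < COLS; ++c)
        for (int r = 0; r < ROWS; ++r) scan[i++] = (int16_t)(r * COLS + c);
      /* An iscan table is simply the inverse permutation of its scan. */
      for (i = 0; i < ROWS * COLS; ++i) iscan[scan[i]] = (int16_t)i;
      for (i = 0; i < ROWS * COLS; ++i)
        printf("%d%s", iscan[i], (i + 1) % COLS ? ", " : ",\n");
      return 0;
    }
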
diff --git a/av1/common/txb_common.c b/av1/common/txb_common.c
index 4eef319..bf2bc36 100644
--- a/av1/common/txb_common.c
+++ b/av1/common/txb_common.c
@@ -12,90 +12,6 @@
#include "av1/common/av1_common_int.h"
#include "av1/common/txb_common.h"
-const int8_t av1_coeff_band_4x4[16] = { 0, 1, 2, 3, 4, 5, 6, 7,
- 8, 9, 10, 11, 12, 13, 14, 15 };
-
-const int8_t av1_coeff_band_8x8[64] = {
- 0, 1, 2, 2, 3, 3, 4, 4, 5, 6, 2, 2, 3, 3, 4, 4,
- 7, 7, 8, 8, 9, 9, 10, 10, 7, 7, 8, 8, 9, 9, 10, 10,
- 11, 11, 12, 12, 13, 13, 14, 14, 11, 11, 12, 12, 13, 13, 14, 14,
- 15, 15, 16, 16, 17, 17, 18, 18, 15, 15, 16, 16, 17, 17, 18, 18,
-};
-
-const int8_t av1_coeff_band_16x16[256] = {
- 0, 1, 4, 4, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 2, 3, 4,
- 4, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 5, 5, 6, 6, 7, 7,
- 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 5, 5, 6, 6, 7, 7, 7, 7, 8,
- 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12,
- 13, 13, 13, 13, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13,
- 13, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 10, 10,
- 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 15,
- 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 17, 14, 14, 14, 14, 15, 15, 15, 15,
- 16, 16, 16, 16, 17, 17, 17, 17, 14, 14, 14, 14, 15, 15, 15, 15, 16, 16, 16,
- 16, 17, 17, 17, 17, 14, 14, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 17, 17,
- 17, 17, 18, 18, 18, 18, 19, 19, 19, 19, 20, 20, 20, 20, 21, 21, 21, 21, 18,
- 18, 18, 18, 19, 19, 19, 19, 20, 20, 20, 20, 21, 21, 21, 21, 18, 18, 18, 18,
- 19, 19, 19, 19, 20, 20, 20, 20, 21, 21, 21, 21, 18, 18, 18, 18, 19, 19, 19,
- 19, 20, 20, 20, 20, 21, 21, 21, 21,
-};
-
-const int8_t av1_coeff_band_32x32[1024] = {
- 0, 1, 4, 4, 7, 7, 7, 7, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11,
- 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 2, 3, 4, 4, 7, 7,
- 7, 7, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 12,
- 12, 12, 12, 12, 12, 12, 12, 5, 5, 6, 6, 7, 7, 7, 7, 10, 10, 10, 10,
- 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12,
- 12, 5, 5, 6, 6, 7, 7, 7, 7, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11,
- 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 8, 8, 8, 8, 9,
- 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11,
- 12, 12, 12, 12, 12, 12, 12, 12, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10,
- 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12,
- 12, 12, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 11,
- 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 8, 8, 8, 8,
- 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11,
- 11, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14,
- 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16,
- 16, 16, 16, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14,
- 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 13, 13, 13,
- 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15,
- 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 13, 13, 13, 13, 13, 13, 13, 13, 14,
- 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16,
- 16, 16, 16, 16, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14,
- 14, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 13, 13,
- 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15,
- 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 13, 13, 13, 13, 13, 13, 13, 13,
- 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16,
- 16, 16, 16, 16, 16, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14,
- 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 17,
- 17, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19,
- 19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 17, 17, 17, 17, 17, 17, 17,
- 17, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19, 20, 20,
- 20, 20, 20, 20, 20, 20, 17, 17, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18,
- 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20,
- 17, 17, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19,
- 19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 17, 17, 17, 17, 17, 17,
- 17, 17, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19, 20,
- 20, 20, 20, 20, 20, 20, 20, 17, 17, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18,
- 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20,
- 20, 17, 17, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19,
- 19, 19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 17, 17, 17, 17, 17,
- 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19,
- 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22,
- 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 24,
- 24, 24, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22, 22, 22, 23,
- 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 24, 24, 24, 21, 21, 21, 21,
- 21, 21, 21, 21, 22, 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23,
- 23, 24, 24, 24, 24, 24, 24, 24, 24, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22,
- 22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24,
- 24, 24, 24, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22, 22, 22,
- 23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 24, 24, 24, 21, 21, 21,
- 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23,
- 23, 23, 24, 24, 24, 24, 24, 24, 24, 24, 21, 21, 21, 21, 21, 21, 21, 21, 22,
- 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24,
- 24, 24, 24, 24, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22, 22,
- 22, 23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 24, 24, 24,
-};
-
// The ctx offset table when TX is TX_CLASS_2D.
// TX col and row indices are clamped to 4
@@ -184,34 +100,54 @@
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
};
-const int8_t av1_nz_map_ctx_offset_8x4[32] = {
- 0, 16, 6, 6, 21, 21, 21, 21, 16, 16, 6, 21, 21, 21, 21, 21,
- 16, 16, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21,
+const int8_t av1_nz_map_ctx_offset_4x8[32] = {
+ 0, 11, 6, 6, 21, 21, 21, 21, 11, 11, 6, 21, 21, 21, 21, 21,
+ 11, 11, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21,
};
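
The comment kept above ("TX col and row indices are clamped to 4") is the key
to this group of tables: for TX_CLASS_2D every per-position offset is a lookup
into a 5x5 base table with both coordinates clamped to 4, so each
av1_nz_map_ctx_offset_* array is that lookup unrolled over the block. What
follows is a hypothetical reconstruction of the expansion, not libaom's actual
generator: the base5x5 values are reverse-read from the new 4x8 table just
above, and the traversal orientation is an assumption (re-orienting it is what
this patch does).

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative base table, chosen to match the new 4x8 values above;
     * not the spec's Coeff_Base_Ctx_Offset. */
    static const int8_t base5x5[5][5] = {
      { 0, 11, 6, 6, 21 },    { 11, 11, 6, 21, 21 },  { 11, 11, 21, 21, 21 },
      { 11, 11, 21, 21, 21 }, { 11, 11, 21, 21, 21 },
    };

    int main(void) {
      enum { DIM0 = 4, DIM1 = 8 };  /* assumed orientation of the 4x8 block */
      for (int i = 0; i < DIM0; ++i)
        for (int j = 0; j < DIM1; ++j) {
          /* Clamp both indices to 4 before the lookup. */
          const int ci = i < 4 ? i : 4, cj = j < 4 ? j : 4;
          printf("%d, ", base5x5[ci][cj]);
        }
      printf("\n");
      return 0;
    }

Run as-is, this prints the 32 entries of the new av1_nz_map_ctx_offset_4x8 in
order; the remaining tables in this group follow the same clamp-and-lookup
pattern with base values that differ by block shape (the new 32x16 table, for
instance, uses 16 where these use 11).
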
const int8_t av1_nz_map_ctx_offset_8x16[128] = {
- 0, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 6, 6, 21,
- 21, 21, 21, 21, 21, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
-};
-
-const int8_t av1_nz_map_ctx_offset_16x8[128] = {
- 0, 16, 6, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 6,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16,
+ 0, 11, 6, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 6,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
};
const int8_t av1_nz_map_ctx_offset_16x32[512] = {
- 0, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
- 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 6, 6, 21, 21, 21, 21,
+ 0, 11, 6, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 6, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+};
+
+const int8_t av1_nz_map_ctx_offset_32x16[512] = {
+ 0, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
+ 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 6, 6, 21, 21, 21, 21,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 6, 21, 21, 21, 21, 21, 21, 21, 21,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
@@ -239,41 +175,68 @@
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
};
-const int8_t av1_nz_map_ctx_offset_32x16[512] = {
- 0, 16, 6, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 6, 21, 21, 21,
+const int8_t av1_nz_map_ctx_offset_32x64[1024] = {
+ 0, 11, 6, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 6, 21, 21, 21,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21,
+ 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21,
+ 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21,
+ 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16,
+ 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11,
+ 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
};
-const int8_t av1_nz_map_ctx_offset_32x64[1024] = {
- 0, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
- 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
- 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
- 11, 11, 11, 11, 11, 11, 11, 6, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+const int8_t av1_nz_map_ctx_offset_64x32[1024] = {
+ 0, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
+ 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
+ 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
+ 16, 16, 16, 16, 16, 16, 16, 6, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
21, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
@@ -326,79 +289,39 @@
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
};
-const int8_t av1_nz_map_ctx_offset_64x32[1024] = {
- 0, 16, 6, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 6, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16,
- 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
-};
-
const int8_t av1_nz_map_ctx_offset_4x16[64] = {
- 0, 11, 11, 11, 11, 11, 11, 11, 6, 6, 21, 21, 6, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 0, 11, 6, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 11, 11, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
};
const int8_t av1_nz_map_ctx_offset_16x4[64] = {
- 0, 16, 6, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 16, 16, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 0, 16, 16, 16, 16, 16, 16, 16, 6, 6, 21, 21, 6, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
};
const int8_t av1_nz_map_ctx_offset_8x32[256] = {
- 0, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 6, 6, 21,
+ 0, 11, 6, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 6, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
+ 21, 21, 21, 21, 21, 21, 21, 21, 21,
+};
+
+const int8_t av1_nz_map_ctx_offset_32x8[256] = {
+ 0, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 6, 6, 21,
21, 21, 21, 21, 21, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
@@ -414,33 +337,16 @@
21, 21, 21, 21, 21, 21, 21, 21, 21,
};
-const int8_t av1_nz_map_ctx_offset_32x8[256] = {
- 0, 16, 6, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 6, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
- 21, 21, 21, 21, 21, 21, 21, 21, 21,
-};
-
const int8_t *av1_nz_map_ctx_offset[19] = {
av1_nz_map_ctx_offset_4x4, // TX_4x4
av1_nz_map_ctx_offset_8x8, // TX_8x8
av1_nz_map_ctx_offset_16x16, // TX_16x16
av1_nz_map_ctx_offset_32x32, // TX_32x32
- av1_nz_map_ctx_offset_32x32, // TX_32x32
- av1_nz_map_ctx_offset_4x16, // TX_4x8
- av1_nz_map_ctx_offset_8x4, // TX_8x4
- av1_nz_map_ctx_offset_8x32, // TX_8x16
- av1_nz_map_ctx_offset_16x8, // TX_16x8
+ av1_nz_map_ctx_offset_32x32, // TX_64x64
+ av1_nz_map_ctx_offset_4x8, // TX_4x8
+ av1_nz_map_ctx_offset_16x4, // TX_8x4
+ av1_nz_map_ctx_offset_8x16, // TX_8x16
+ av1_nz_map_ctx_offset_32x8, // TX_16x8
av1_nz_map_ctx_offset_16x32, // TX_16x32
av1_nz_map_ctx_offset_32x16, // TX_32x16
av1_nz_map_ctx_offset_32x64, // TX_32x64
@@ -449,8 +355,8 @@
av1_nz_map_ctx_offset_16x4, // TX_16x4
av1_nz_map_ctx_offset_8x32, // TX_8x32
av1_nz_map_ctx_offset_32x8, // TX_32x8
- av1_nz_map_ctx_offset_16x32, // TX_16x64
- av1_nz_map_ctx_offset_64x32, // TX_64x16
+ av1_nz_map_ctx_offset_32x64, // TX_16x64
+ av1_nz_map_ctx_offset_32x16, // TX_64x16
};
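
A minimal sketch of how the dispatch table above is consumed: the per-position
offset is simply added to the context derived from the neighboring coefficient
magnitudes, mirroring get_lower_levels_ctx_2d() in txb_common.h below (the
helper name here is hypothetical):

  static INLINE int nz_map_ctx_sketch(int mag_ctx, TX_SIZE tx_size,
                                      int coeff_idx) {
    // coeff_idx is in raster order over the transform block.
    return mag_ctx + av1_nz_map_ctx_offset[tx_size][coeff_idx];
  }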
const int16_t av1_eob_group_start[12] = { 0, 1, 2, 3, 5, 9,
diff --git a/av1/common/txb_common.h b/av1/common/txb_common.h
index 40fcffc..9628090 100644
--- a/av1/common/txb_common.h
+++ b/av1/common/txb_common.h
@@ -17,14 +17,6 @@
extern const int16_t av1_eob_group_start[12];
extern const int16_t av1_eob_offset_bits[12];
-extern const int8_t av1_coeff_band_4x4[16];
-
-extern const int8_t av1_coeff_band_8x8[64];
-
-extern const int8_t av1_coeff_band_16x16[256];
-
-extern const int8_t av1_coeff_band_32x32[1024];
-
extern const int8_t *av1_nz_map_ctx_offset[TX_SIZES_ALL];
typedef struct txb_ctx {
@@ -55,9 +47,9 @@
TX_CLASS_HORIZ, // H_FLIPADST
};
-static INLINE int get_txb_bwl(TX_SIZE tx_size) {
+static INLINE int get_txb_bhl(TX_SIZE tx_size) {
tx_size = av1_get_adjusted_tx_size(tx_size);
- return tx_size_wide_log2[tx_size];
+ return tx_size_high_log2[tx_size];
}
static INLINE int get_txb_wide(TX_SIZE tx_size) {
@@ -70,22 +62,22 @@
return tx_size_high[tx_size];
}
-static INLINE uint8_t *set_levels(uint8_t *const levels_buf, const int width) {
- return levels_buf + TX_PAD_TOP * (width + TX_PAD_HOR);
+static INLINE uint8_t *set_levels(uint8_t *const levels_buf, const int height) {
+ return levels_buf + TX_PAD_TOP * (height + TX_PAD_HOR);
}
-static INLINE int get_padded_idx(const int idx, const int bwl) {
- return idx + ((idx >> bwl) << TX_PAD_HOR_LOG2);
+static INLINE int get_padded_idx(const int idx, const int bhl) {
+ return idx + ((idx >> bhl) << TX_PAD_HOR_LOG2);
}
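
A worked example of the padded indexing, assuming libaom's TX_PAD_HOR_LOG2 of
2 (i.e. four padding entries appended to each line of levels): with bhl = 3
there are 8 levels per line, so idx = 10 has crossed one full line and

  get_padded_idx(10, 3) == 10 + ((10 >> 3) << 2) == 14

that is, every completed line of 8 levels skips 4 padding slots.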
static INLINE int get_br_ctx_2d(const uint8_t *const levels,
const int c, // raster order
- const int bwl) {
+ const int bhl) {
assert(c > 0);
- const int row = c >> bwl;
- const int col = c - (row << bwl);
- const int stride = (1 << bwl) + TX_PAD_HOR;
- const int pos = row * stride + col;
+ const int col = c >> bhl;
+ const int row = c - (col << bhl);
+ const int stride = (1 << bhl) + TX_PAD_HOR;
+ const int pos = col * stride + row;
int mag = AOMMIN(levels[pos + 1], MAX_BASE_BR_RANGE) +
AOMMIN(levels[pos + stride], MAX_BASE_BR_RANGE) +
AOMMIN(levels[pos + 1 + stride], MAX_BASE_BR_RANGE);
@@ -96,10 +88,10 @@
}
static AOM_FORCE_INLINE int get_br_ctx_eob(const int c, // raster order
- const int bwl,
+ const int bhl,
const TX_CLASS tx_class) {
- const int row = c >> bwl;
- const int col = c - (row << bwl);
+ const int col = c >> bhl;
+ const int row = c - (col << bhl);
if (c == 0) return 0;
if ((tx_class == TX_CLASS_2D && row < 2 && col < 2) ||
(tx_class == TX_CLASS_HORIZ && col == 0) ||
@@ -110,11 +102,11 @@
static AOM_FORCE_INLINE int get_br_ctx(const uint8_t *const levels,
const int c, // raster order
- const int bwl, const TX_CLASS tx_class) {
- const int row = c >> bwl;
- const int col = c - (row << bwl);
- const int stride = (1 << bwl) + TX_PAD_HOR;
- const int pos = row * stride + col;
+ const int bhl, const TX_CLASS tx_class) {
+ const int col = c >> bhl;
+ const int row = c - (col << bhl);
+ const int stride = (1 << bhl) + TX_PAD_HOR;
+ const int pos = col * stride + row;
int mag = levels[pos + 1];
mag += levels[pos + stride];
switch (tx_class) {
@@ -125,13 +117,13 @@
if ((row < 2) && (col < 2)) return mag + 7;
break;
case TX_CLASS_HORIZ:
- mag += levels[pos + 2];
+ mag += levels[pos + (stride << 1)];
mag = AOMMIN((mag + 1) >> 1, 6);
if (c == 0) return mag;
if (col == 0) return mag + 7;
break;
case TX_CLASS_VERT:
- mag += levels[pos + (stride << 1)];
+ mag += levels[pos + 2];
mag = AOMMIN((mag + 1) >> 1, 6);
if (c == 0) return mag;
if (row == 0) return mag + 7;
@@ -156,25 +148,25 @@
};
static AOM_FORCE_INLINE int get_nz_mag(const uint8_t *const levels,
- const int bwl, const TX_CLASS tx_class) {
+ const int bhl, const TX_CLASS tx_class) {
int mag;
  // Note: AOMMIN(level, 3) is useless for the decoder since level < 3.
- mag = clip_max3[levels[1]]; // { 0, 1 }
- mag += clip_max3[levels[(1 << bwl) + TX_PAD_HOR]]; // { 1, 0 }
+ mag = clip_max3[levels[(1 << bhl) + TX_PAD_HOR]]; // { 0, 1 }
+ mag += clip_max3[levels[1]]; // { 1, 0 }
if (tx_class == TX_CLASS_2D) {
- mag += clip_max3[levels[(1 << bwl) + TX_PAD_HOR + 1]]; // { 1, 1 }
- mag += clip_max3[levels[2]]; // { 0, 2 }
- mag += clip_max3[levels[(2 << bwl) + (2 << TX_PAD_HOR_LOG2)]]; // { 2, 0 }
+ mag += clip_max3[levels[(1 << bhl) + TX_PAD_HOR + 1]]; // { 1, 1 }
+ mag += clip_max3[levels[(2 << bhl) + (2 << TX_PAD_HOR_LOG2)]]; // { 0, 2 }
+ mag += clip_max3[levels[2]]; // { 2, 0 }
} else if (tx_class == TX_CLASS_VERT) {
- mag += clip_max3[levels[(2 << bwl) + (2 << TX_PAD_HOR_LOG2)]]; // { 2, 0 }
- mag += clip_max3[levels[(3 << bwl) + (3 << TX_PAD_HOR_LOG2)]]; // { 3, 0 }
- mag += clip_max3[levels[(4 << bwl) + (4 << TX_PAD_HOR_LOG2)]]; // { 4, 0 }
+ mag += clip_max3[levels[2]]; // { 2, 0 }
+ mag += clip_max3[levels[3]]; // { 3, 0 }
+ mag += clip_max3[levels[4]]; // { 4, 0 }
} else {
- mag += clip_max3[levels[2]]; // { 0, 2 }
- mag += clip_max3[levels[3]]; // { 0, 3 }
- mag += clip_max3[levels[4]]; // { 0, 4 }
+ mag += clip_max3[levels[(2 << bhl) + (2 << TX_PAD_HOR_LOG2)]]; // { 0, 2 }
+ mag += clip_max3[levels[(3 << bhl) + (3 << TX_PAD_HOR_LOG2)]]; // { 0, 3 }
+ mag += clip_max3[levels[(4 << bhl) + (4 << TX_PAD_HOR_LOG2)]]; // { 0, 4 }
}
return mag;
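
To make the { row, col } comments concrete: with the levels buffer now laid
out along the height-first (bhl) axis, the addressing used in get_br_ctx_2d()
above works out to

  stride = (1 << bhl) + TX_PAD_HOR;  // one padded line of levels
  pos    = col * stride + row;
  levels[pos + 1]       // the { 1, 0 } neighbor, one step along the same line
  levels[pos + stride]  // the { 0, 1 } neighbor, one line over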
@@ -197,7 +189,7 @@
static AOM_FORCE_INLINE int get_nz_map_ctx_from_stats(
const int stats,
const int coeff_idx, // raster order
- const int bwl, const TX_SIZE tx_size, const TX_CLASS tx_class) {
+ const int bhl, const TX_SIZE tx_size, const TX_CLASS tx_class) {
  // tx_class == 0 (TX_CLASS_2D)
if ((tx_class | coeff_idx) == 0) return 0;
int ctx = (stats + 1) >> 1;
@@ -218,12 +210,12 @@
return ctx + av1_nz_map_ctx_offset[tx_size][coeff_idx];
}
case TX_CLASS_HORIZ: {
- const int row = coeff_idx >> bwl;
- const int col = coeff_idx - (row << bwl);
+ const int col = coeff_idx >> bhl;
return ctx + nz_map_ctx_offset_1d[col];
}
case TX_CLASS_VERT: {
- const int row = coeff_idx >> bwl;
+ const int col = coeff_idx >> bhl;
+ const int row = coeff_idx - (col << bhl);
return ctx + nz_map_ctx_offset_1d[row];
}
default: break;
@@ -234,49 +226,49 @@
typedef aom_cdf_prob (*base_cdf_arr)[CDF_SIZE(4)];
typedef aom_cdf_prob (*br_cdf_arr)[CDF_SIZE(BR_CDF_SIZE)];
-static INLINE int get_lower_levels_ctx_eob(int bwl, int height, int scan_idx) {
+static INLINE int get_lower_levels_ctx_eob(int bhl, int width, int scan_idx) {
if (scan_idx == 0) return 0;
- if (scan_idx <= (height << bwl) / 8) return 1;
- if (scan_idx <= (height << bwl) / 4) return 2;
+ if (scan_idx <= (width << bhl) / 8) return 1;
+ if (scan_idx <= (width << bhl) / 4) return 2;
return 3;
}
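
A worked example of these end-of-block buckets: for a 16x16 transform, bhl is
4 and width is 16, so width << bhl is 256 and

  get_lower_levels_ctx_eob(4, 16, 0)  == 0
  get_lower_levels_ctx_eob(4, 16, 32) == 1  // 256 / 8 == 32
  get_lower_levels_ctx_eob(4, 16, 64) == 2  // 256 / 4 == 64
  get_lower_levels_ctx_eob(4, 16, 65) == 3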
static INLINE int get_lower_levels_ctx_2d(const uint8_t *levels, int coeff_idx,
- int bwl, TX_SIZE tx_size) {
+ int bhl, TX_SIZE tx_size) {
assert(coeff_idx > 0);
int mag;
  // Note: AOMMIN(level, 3) is useless for the decoder since level < 3.
- levels = levels + get_padded_idx(coeff_idx, bwl);
- mag = AOMMIN(levels[1], 3); // { 0, 1 }
- mag += AOMMIN(levels[(1 << bwl) + TX_PAD_HOR], 3); // { 1, 0 }
- mag += AOMMIN(levels[(1 << bwl) + TX_PAD_HOR + 1], 3); // { 1, 1 }
- mag += AOMMIN(levels[2], 3); // { 0, 2 }
- mag += AOMMIN(levels[(2 << bwl) + (2 << TX_PAD_HOR_LOG2)], 3); // { 2, 0 }
+ levels = levels + get_padded_idx(coeff_idx, bhl);
+ mag = AOMMIN(levels[(1 << bhl) + TX_PAD_HOR], 3); // { 0, 1 }
+ mag += AOMMIN(levels[1], 3); // { 1, 0 }
+ mag += AOMMIN(levels[(1 << bhl) + TX_PAD_HOR + 1], 3); // { 1, 1 }
+ mag += AOMMIN(levels[(2 << bhl) + (2 << TX_PAD_HOR_LOG2)], 3); // { 0, 2 }
+ mag += AOMMIN(levels[2], 3); // { 2, 0 }
const int ctx = AOMMIN((mag + 1) >> 1, 4);
return ctx + av1_nz_map_ctx_offset[tx_size][coeff_idx];
}
static AOM_FORCE_INLINE int get_lower_levels_ctx(const uint8_t *levels,
- int coeff_idx, int bwl,
+ int coeff_idx, int bhl,
TX_SIZE tx_size,
TX_CLASS tx_class) {
const int stats =
- get_nz_mag(levels + get_padded_idx(coeff_idx, bwl), bwl, tx_class);
- return get_nz_map_ctx_from_stats(stats, coeff_idx, bwl, tx_size, tx_class);
+ get_nz_mag(levels + get_padded_idx(coeff_idx, bhl), bhl, tx_class);
+ return get_nz_map_ctx_from_stats(stats, coeff_idx, bhl, tx_size, tx_class);
}
static INLINE int get_lower_levels_ctx_general(int is_last, int scan_idx,
- int bwl, int height,
+ int bhl, int width,
const uint8_t *levels,
int coeff_idx, TX_SIZE tx_size,
TX_CLASS tx_class) {
if (is_last) {
if (scan_idx == 0) return 0;
- if (scan_idx <= (height << bwl) >> 3) return 1;
- if (scan_idx <= (height << bwl) >> 2) return 2;
+ if (scan_idx <= (width << bhl) >> 3) return 1;
+ if (scan_idx <= (width << bhl) >> 2) return 2;
return 3;
}
- return get_lower_levels_ctx(levels, coeff_idx, bwl, tx_size, tx_class);
+ return get_lower_levels_ctx(levels, coeff_idx, bhl, tx_size, tx_class);
}
static INLINE void set_dc_sign(int *cul_level, int dc_val) {
diff --git a/av1/common/warped_motion.c b/av1/common/warped_motion.c
index 4e5966e..83f410e 100644
--- a/av1/common/warped_motion.c
+++ b/av1/common/warped_motion.c
@@ -27,7 +27,6 @@
// We need an extra 2 taps to fit this in, for a total of 8 taps.
/* clang-format off */
const int16_t av1_warped_filter[WARPEDPIXEL_PREC_SHIFTS * 3 + 1][8] = {
-#if WARPEDPIXEL_PREC_BITS == 6
// [-1, 0)
{ 0, 0, 127, 1, 0, 0, 0, 0 }, { 0, - 1, 127, 2, 0, 0, 0, 0 },
{ 1, - 3, 127, 4, - 1, 0, 0, 0 }, { 1, - 4, 126, 6, - 2, 1, 0, 0 },
@@ -131,63 +130,6 @@
{ 0, 0, 0, - 1, 4, 127, - 3, 1 }, { 0, 0, 0, 0, 2, 127, - 1, 0 },
// dummy (replicate row index 191)
{ 0, 0, 0, 0, 2, 127, - 1, 0 },
-
-#elif WARPEDPIXEL_PREC_BITS == 5
- // [-1, 0)
- {0, 0, 127, 1, 0, 0, 0, 0}, {1, -3, 127, 4, -1, 0, 0, 0},
- {1, -5, 126, 8, -3, 1, 0, 0}, {1, -7, 124, 13, -4, 1, 0, 0},
- {2, -9, 122, 18, -6, 1, 0, 0}, {2, -11, 120, 22, -7, 2, 0, 0},
- {3, -13, 117, 27, -8, 2, 0, 0}, {3, -14, 114, 32, -10, 3, 0, 0},
- {3, -15, 111, 37, -11, 3, 0, 0}, {3, -16, 108, 42, -12, 3, 0, 0},
- {4, -17, 104, 47, -13, 3, 0, 0}, {4, -17, 100, 52, -14, 3, 0, 0},
- {4, -18, 96, 58, -15, 3, 0, 0}, {4, -18, 91, 63, -16, 4, 0, 0},
- {4, -18, 87, 68, -17, 4, 0, 0}, {4, -18, 82, 73, -17, 4, 0, 0},
- {4, -18, 78, 78, -18, 4, 0, 0}, {4, -17, 73, 82, -18, 4, 0, 0},
- {4, -17, 68, 87, -18, 4, 0, 0}, {4, -16, 63, 91, -18, 4, 0, 0},
- {3, -15, 58, 96, -18, 4, 0, 0}, {3, -14, 52, 100, -17, 4, 0, 0},
- {3, -13, 47, 104, -17, 4, 0, 0}, {3, -12, 42, 108, -16, 3, 0, 0},
- {3, -11, 37, 111, -15, 3, 0, 0}, {3, -10, 32, 114, -14, 3, 0, 0},
- {2, -8, 27, 117, -13, 3, 0, 0}, {2, -7, 22, 120, -11, 2, 0, 0},
- {1, -6, 18, 122, -9, 2, 0, 0}, {1, -4, 13, 124, -7, 1, 0, 0},
- {1, -3, 8, 126, -5, 1, 0, 0}, {0, -1, 4, 127, -3, 1, 0, 0},
- // [0, 1)
- { 0, 0, 0, 127, 1, 0, 0, 0}, { 0, 1, -3, 127, 4, -2, 1, 0},
- { 0, 2, -6, 126, 8, -3, 1, 0}, {-1, 3, -8, 125, 13, -5, 2, -1},
- {-1, 4, -11, 123, 18, -7, 3, -1}, {-1, 4, -13, 121, 23, -8, 3, -1},
- {-1, 5, -15, 119, 27, -10, 4, -1}, {-2, 6, -17, 116, 33, -12, 5, -1},
- {-2, 6, -18, 113, 38, -13, 5, -1}, {-2, 7, -19, 110, 43, -15, 6, -2},
- {-2, 7, -20, 106, 49, -16, 6, -2}, {-2, 7, -21, 102, 54, -17, 7, -2},
- {-2, 8, -22, 98, 59, -18, 7, -2}, {-2, 8, -22, 94, 64, -19, 7, -2},
- {-2, 8, -22, 89, 69, -20, 8, -2}, {-2, 8, -21, 84, 74, -21, 8, -2},
- {-2, 8, -21, 79, 79, -21, 8, -2}, {-2, 8, -21, 74, 84, -21, 8, -2},
- {-2, 8, -20, 69, 89, -22, 8, -2}, {-2, 7, -19, 64, 94, -22, 8, -2},
- {-2, 7, -18, 59, 98, -22, 8, -2}, {-2, 7, -17, 54, 102, -21, 7, -2},
- {-2, 6, -16, 49, 106, -20, 7, -2}, {-2, 6, -15, 43, 110, -19, 7, -2},
- {-1, 5, -13, 38, 113, -18, 6, -2}, {-1, 5, -12, 33, 116, -17, 6, -2},
- {-1, 4, -10, 27, 119, -15, 5, -1}, {-1, 3, -8, 23, 121, -13, 4, -1},
- {-1, 3, -7, 18, 123, -11, 4, -1}, {-1, 2, -5, 13, 125, -8, 3, -1},
- { 0, 1, -3, 8, 126, -6, 2, 0}, { 0, 1, -2, 4, 127, -3, 1, 0},
- // [1, 2)
- {0, 0, 0, 1, 127, 0, 0, 0}, {0, 0, 1, -3, 127, 4, -1, 0},
- {0, 0, 1, -5, 126, 8, -3, 1}, {0, 0, 1, -7, 124, 13, -4, 1},
- {0, 0, 2, -9, 122, 18, -6, 1}, {0, 0, 2, -11, 120, 22, -7, 2},
- {0, 0, 3, -13, 117, 27, -8, 2}, {0, 0, 3, -14, 114, 32, -10, 3},
- {0, 0, 3, -15, 111, 37, -11, 3}, {0, 0, 3, -16, 108, 42, -12, 3},
- {0, 0, 4, -17, 104, 47, -13, 3}, {0, 0, 4, -17, 100, 52, -14, 3},
- {0, 0, 4, -18, 96, 58, -15, 3}, {0, 0, 4, -18, 91, 63, -16, 4},
- {0, 0, 4, -18, 87, 68, -17, 4}, {0, 0, 4, -18, 82, 73, -17, 4},
- {0, 0, 4, -18, 78, 78, -18, 4}, {0, 0, 4, -17, 73, 82, -18, 4},
- {0, 0, 4, -17, 68, 87, -18, 4}, {0, 0, 4, -16, 63, 91, -18, 4},
- {0, 0, 3, -15, 58, 96, -18, 4}, {0, 0, 3, -14, 52, 100, -17, 4},
- {0, 0, 3, -13, 47, 104, -17, 4}, {0, 0, 3, -12, 42, 108, -16, 3},
- {0, 0, 3, -11, 37, 111, -15, 3}, {0, 0, 3, -10, 32, 114, -14, 3},
- {0, 0, 2, -8, 27, 117, -13, 3}, {0, 0, 2, -7, 22, 120, -11, 2},
- {0, 0, 1, -6, 18, 122, -9, 2}, {0, 0, 1, -4, 13, 124, -7, 1},
- {0, 0, 1, -3, 8, 126, -5, 1}, {0, 0, 0, -1, 4, 127, -3, 1},
- // dummy (replicate row index 95)
- {0, 0, 0, -1, 4, 127, -3, 1},
-
-#endif // WARPEDPIXEL_PREC_BITS == 6
};
/* clang-format on */
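
One invariant of the remaining table that a quick self-check can exercise:
every 8-tap row sums to 128, the unit gain at 7-bit filter precision (e.g.
0 - 1 + 127 + 2 + 0 + 0 + 0 + 0 == 128 in the second row). A sketch, not part
of the patch:

  for (int i = 0; i < WARPEDPIXEL_PREC_SHIFTS * 3 + 1; ++i) {
    int sum = 0;
    for (int t = 0; t < 8; ++t) sum += av1_warped_filter[i][t];
    assert(sum == 128);  // unit DC gain
  }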
diff --git a/av1/common/x86/av1_inv_txfm_avx2.c b/av1/common/x86/av1_inv_txfm_avx2.c
index 7993707..0afd42b 100644
--- a/av1/common/x86/av1_inv_txfm_avx2.c
+++ b/av1/common/x86/av1_inv_txfm_avx2.c
@@ -1641,9 +1641,9 @@
const int txfm_size_col = tx_size_wide[tx_size];
const int txfm_size_row = tx_size_high[tx_size];
const int buf_size_w_div16 = txfm_size_col >> 4;
- const int buf_size_nonzero_w_div16 = (eobx + 16) >> 4;
+ const int buf_size_nonzero_w = ((eobx + 16) >> 4) << 4;
const int buf_size_nonzero_h_div16 = (eoby + 16) >> 4;
- const int input_stride = AOMMIN(32, txfm_size_col);
+ const int input_stride = AOMMIN(32, txfm_size_row);
const int rect_type = get_rect_tx_log_ratio(txfm_size_col, txfm_size_row);
const int fun_idx_x = lowbd_txfm_all_1d_zeros_idx[eobx];
@@ -1660,16 +1660,10 @@
const __m256i scale0 = _mm256_set1_epi16(1 << (15 + shift[0]));
for (int i = 0; i < buf_size_nonzero_h_div16; i++) {
__m256i buf0[64];
- const int32_t *input_row = input + (i << 4) * input_stride;
- for (int j = 0; j < buf_size_nonzero_w_div16; ++j) {
- __m256i *buf0_cur = buf0 + j * 16;
- const int32_t *input_cur = input_row + j * 16;
- load_buffer_32bit_to_16bit_w16_avx2(input_cur, input_stride, buf0_cur,
- 16);
- transpose_16bit_16x16_avx2(buf0_cur, buf0_cur);
- }
+ load_buffer_32bit_to_16bit_w16_avx2(input + 16 * i, input_stride, buf0,
+ buf_size_nonzero_w);
if (rect_type == 1 || rect_type == -1) {
- round_shift_avx2(buf0, buf0, input_stride); // rect special code
+ round_shift_avx2(buf0, buf0, buf_size_nonzero_w); // rect special code
}
row_txfm(buf0, buf0);
for (int j = 0; j < txfm_size_col; ++j) {
@@ -1778,15 +1772,20 @@
const int txh_idx = get_txh_idx(tx_size);
const int txfm_size_col = tx_size_wide[tx_size];
const int txfm_size_row = tx_size_high[tx_size];
- const int input_stride = AOMMIN(32, txfm_size_col);
+ const int col_max = AOMMIN(32, txfm_size_col);
const int row_max = AOMMIN(32, txfm_size_row);
+ const int input_stride = row_max;
const int rect_type = get_rect_tx_log_ratio(txfm_size_col, txfm_size_row);
__m256i buf[32];
- for (int i = 0; i < input_stride; i += 16) {
- iidentity_row_16xn_avx2(buf, input + i, input_stride, shift[0], row_max,
- txw_idx, rect_type);
- iidentity_col_16xn_avx2(output + i, stride, buf, shift[1], row_max,
- txh_idx);
+
+ for (int i = 0; i < (col_max >> 4); ++i) {
+ for (int j = 0; j < (row_max >> 4); j++) {
+ iidentity_row_16xn_avx2(buf, input + j * 16 + i * 16 * input_stride,
+ row_max, shift[0], 16, txw_idx, rect_type);
+ transpose_16bit_16x16_avx2(buf, buf);
+ iidentity_col_16xn_avx2(output + i * 16 + j * 16 * stride, stride, buf,
+ shift[1], 16, txh_idx);
+ }
}
}
@@ -1800,9 +1799,10 @@
const int txh_idx = get_txh_idx(tx_size);
const int txfm_size_col = tx_size_wide[tx_size];
const int txfm_size_row = tx_size_high[tx_size];
- const int txfm_size_col_notzero = AOMMIN(32, txfm_size_col);
- const int input_stride = txfm_size_col_notzero;
+ const int txfm_size_row_notzero = AOMMIN(32, txfm_size_row);
+ const int input_stride = txfm_size_row_notzero;
const int buf_size_w_div16 = (eobx + 16) >> 4;
+ const int buf_size_h_div16 = (eoby + 16) >> 4;
const int rect_type = get_rect_tx_log_ratio(txfm_size_col, txfm_size_row);
const int fun_idx_y = lowbd_txfm_all_1d_zeros_idx[eoby];
@@ -1815,8 +1815,13 @@
get_flip_cfg(tx_type, &ud_flip, &lr_flip);
for (int i = 0; i < buf_size_w_div16; i++) {
__m256i buf0[64];
- iidentity_row_16xn_avx2(buf0, input + (i << 4), input_stride, shift[0],
- eoby + 1, txw_idx, rect_type);
+ for (int j = 0; j < buf_size_h_div16; j++) {
+ __m256i *buf0_cur = buf0 + j * 16;
+ const int32_t *input_cur = input + i * 16 * input_stride + j * 16;
+ iidentity_row_16xn_avx2(buf0_cur, input_cur, input_stride, shift[0], 16,
+ txw_idx, rect_type);
+ transpose_16bit_16x16_avx2(buf0_cur, buf0_cur);
+ }
col_txfm(buf0, buf0);
__m256i mshift = _mm256_set1_epi16(1 << (15 + shift[1]));
int k = ud_flip ? (txfm_size_row - 1) : 0;
@@ -1841,7 +1846,8 @@
const int txfm_size_row = tx_size_high[tx_size];
const int buf_size_w_div16 = txfm_size_col >> 4;
const int buf_size_h_div16 = (eoby + 16) >> 4;
- const int input_stride = AOMMIN(32, txfm_size_col);
+ const int buf_size_nonzero_w = ((eobx + 8) >> 3) << 3;
+ const int input_stride = AOMMIN(32, txfm_size_row);
const int rect_type = get_rect_tx_log_ratio(txfm_size_col, txfm_size_row);
const int fun_idx_x = lowbd_txfm_all_1d_zeros_idx[eobx];
@@ -1854,15 +1860,10 @@
get_flip_cfg(tx_type, &ud_flip, &lr_flip);
for (int i = 0; i < buf_size_h_div16; i++) {
__m256i buf0[64];
- const int32_t *input_row = input + i * input_stride * 16;
- for (int j = 0; j < AOMMIN(4, buf_size_w_div16); ++j) {
- __m256i *buf0_cur = buf0 + j * 16;
- load_buffer_32bit_to_16bit_w16_avx2(input_row + j * 16, input_stride,
- buf0_cur, 16);
- transpose_16bit_16x16_avx2(buf0_cur, buf0_cur);
- }
+ load_buffer_32bit_to_16bit_w16_avx2(input + i * 16, input_stride, buf0,
+ buf_size_nonzero_w);
if (rect_type == 1 || rect_type == -1) {
- round_shift_avx2(buf0, buf0, input_stride); // rect special code
+ round_shift_avx2(buf0, buf0, buf_size_nonzero_w); // rect special code
}
row_txfm(buf0, buf0);
round_shift_16bit_w16_avx2(buf0, txfm_size_col, shift[0]);
@@ -1886,6 +1887,285 @@
}
}
+static const transform_1d_ssse3 lowbd_txfm_all_1d_zeros_8x8_arr[2][2] = {
+ { av1_idct8_low1_ssse3, av1_idct8_sse2 },
+ { av1_iadst8_low1_ssse3, av1_iadst8_sse2 }
+};
+
+static INLINE void load_buffer_avx2(const int32_t *in, int stride,
+ __m128i *out) {
+ const __m256i a = _mm256_load_si256((const __m256i *)in);
+ const __m256i b = _mm256_load_si256((const __m256i *)(in + stride * 1));
+ const __m256i c = _mm256_load_si256((const __m256i *)(in + stride * 2));
+ const __m256i d = _mm256_load_si256((const __m256i *)(in + stride * 3));
+ const __m256i e = _mm256_load_si256((const __m256i *)(in + stride * 4));
+ const __m256i f = _mm256_load_si256((const __m256i *)(in + stride * 5));
+ const __m256i g = _mm256_load_si256((const __m256i *)(in + stride * 6));
+ const __m256i h = _mm256_load_si256((const __m256i *)(in + stride * 7));
+
+ // a0 a1 a2 a3 b0 b1 b2 b3 a4 a5 a6 a7 b4 b5 b6 b7
+ const __m256i ab_16bit = _mm256_packs_epi32(a, b);
+ // c0 c1 c2 c3 d0 d1 d2 d3 c4 c5 c6 c7 d4 d5 d6 d7
+ const __m256i cd_16bit = _mm256_packs_epi32(c, d);
+ // e0 e1 e2 e3 f0 f1 f2 f3 e4 e5 e6 e7 f4 f5 f6 f7
+ const __m256i ef_16bit = _mm256_packs_epi32(e, f);
+ // g0 g1 g2 g3 h0 h1 h2 h3 g4 g5 g6 g7 h4 h5 h6 h7
+ const __m256i gh_16bit = _mm256_packs_epi32(g, h);
+
+ // a0 a1 a2 a3 a4 a5 a6 a7 b0 b1 b2 b3 b4 b5 b6 b7
+ const __m256i ab = _mm256_permute4x64_epi64(ab_16bit, 0xd8);
+ // c0 c1 c2 c3 c4 c5 c6 c7 d0 d1 d2 d3 d4 d5 d6 d7
+ const __m256i cd = _mm256_permute4x64_epi64(cd_16bit, 0xd8);
+ // e0 e1 e2 e3 e4 e5 e6 e7 f0 f1 f2 f3 f4 f5 f6 f7
+ const __m256i ef = _mm256_permute4x64_epi64(ef_16bit, 0xd8);
+ // g0 g1 g2 g3 g4 g5 g6 g7 h0 h1 h2 h3 h4 h5 h6 h7
+ const __m256i gh = _mm256_permute4x64_epi64(gh_16bit, 0xd8);
+
+ out[0] = _mm256_castsi256_si128(ab);
+ out[1] = _mm256_extractf128_si256(ab, 1);
+ out[2] = _mm256_castsi256_si128(cd);
+ out[3] = _mm256_extractf128_si256(cd, 1);
+ out[4] = _mm256_castsi256_si128(ef);
+ out[5] = _mm256_extractf128_si256(ef, 1);
+ out[6] = _mm256_castsi256_si128(gh);
+ out[7] = _mm256_extractf128_si256(gh, 1);
+}
+
+static INLINE void round_and_transpose_avx2(const __m128i *const in,
+ __m128i *const out, int bit,
+ int *lr_flip) {
+ __m256i buf_temp[4];
+ const __m256i scale = _mm256_set1_epi16(1 << (15 + bit));
+ int j = *lr_flip ? 7 : 0;
+ const int step = *lr_flip ? -1 : 1;
+
+ // 70 71 72 73 74 75 76 77 | 30 31 32 33 34 35 36 37
+ buf_temp[0] = _mm256_inserti128_si256(_mm256_castsi128_si256(in[j]),
+ in[j + 4 * step], 1);
+ j += step;
+ // 60 61 62 63 64 65 66 67 | 20 21 22 23 24 25 26 27
+ buf_temp[1] = _mm256_inserti128_si256(_mm256_castsi128_si256(in[j]),
+ in[j + 4 * step], 1);
+ j += step;
+ // 50 51 52 53 54 55 56 57 | 10 11 12 13 14 15 16 17
+ buf_temp[2] = _mm256_inserti128_si256(_mm256_castsi128_si256(in[j]),
+ in[j + 4 * step], 1);
+ j += step;
+ // 40 41 42 43 44 45 46 47 | 00 01 02 03 04 05 06 07
+ buf_temp[3] = _mm256_inserti128_si256(_mm256_castsi128_si256(in[j]),
+ in[j + 4 * step], 1);
+
+ // 70 71 72 73 74 75 76 77 | 30 31 32 33 34 35 36 37
+ buf_temp[0] = _mm256_mulhrs_epi16(buf_temp[0], scale);
+ // 60 61 62 63 64 65 66 67 | 20 21 22 23 24 25 26 27
+ buf_temp[1] = _mm256_mulhrs_epi16(buf_temp[1], scale);
+ // 50 51 52 53 54 55 56 57 | 10 11 12 13 14 15 16 17
+ buf_temp[2] = _mm256_mulhrs_epi16(buf_temp[2], scale);
+ // 40 41 42 43 44 45 46 47 | 00 01 02 03 04 05 06 07
+ buf_temp[3] = _mm256_mulhrs_epi16(buf_temp[3], scale);
+
+ // 70 60 71 61 72 62 73 63 | 30 20 31 21 32 22 33 23
+ const __m256i unpcklo0 = _mm256_unpacklo_epi16(buf_temp[0], buf_temp[1]);
+ // 74 64 75 65 76 66 77 67 | 34 24 35 25 36 26 37 27
+ const __m256i unpckhi0 = _mm256_unpackhi_epi16(buf_temp[0], buf_temp[1]);
+ // 50 40 51 41 52 42 53 43 | 10 00 11 01 12 02 13 03
+ const __m256i unpcklo1 = _mm256_unpacklo_epi16(buf_temp[2], buf_temp[3]);
+ // 54 44 55 45 56 46 57 47 | 14 04 15 05 16 06 17 07
+ const __m256i unpckhi1 = _mm256_unpackhi_epi16(buf_temp[2], buf_temp[3]);
+
+ // 70 60 50 40 71 61 51 41 | 30 20 10 00 31 21 11 01
+ const __m256i unpcklo00 = _mm256_unpacklo_epi32(unpcklo0, unpcklo1);
+ // 72 62 52 42 73 63 53 43 | 32 22 12 02 33 23 13 03
+ const __m256i unpckhi00 = _mm256_unpackhi_epi32(unpcklo0, unpcklo1);
+ // 74 64 54 44 75 65 55 45 | 34 24 14 04 35 25 15 05
+ const __m256i unpcklo01 = _mm256_unpacklo_epi32(unpckhi0, unpckhi1);
+ // 76 66 56 46 77 67 57 47 | 36 26 16 06 37 27 17 07
+ const __m256i unpckhi01 = _mm256_unpackhi_epi32(unpckhi0, unpckhi1);
+
+ // 70 60 50 40 30 20 10 00 | 71 61 51 41 31 21 11 01
+ const __m256i reg_00 = _mm256_permute4x64_epi64(unpcklo00, 0xd8);
+ // 72 62 52 42 32 22 12 02 | 73 63 53 43 33 23 13 03
+ const __m256i reg_01 = _mm256_permute4x64_epi64(unpckhi00, 0xd8);
+ // 74 64 54 44 34 24 14 04 | 75 65 55 45 35 25 15 05
+ const __m256i reg_10 = _mm256_permute4x64_epi64(unpcklo01, 0xd8);
+ // 76 66 56 46 36 26 16 06 | 77 67 57 47 37 27 17 07
+ const __m256i reg_11 = _mm256_permute4x64_epi64(unpckhi01, 0xd8);
+
+ // 70 60 50 40 30 20 10 00
+ out[0] = _mm256_castsi256_si128(reg_00);
+ // 71 61 51 41 31 21 11 01
+ out[1] = _mm256_extracti128_si256(reg_00, 1);
+ // 72 62 52 42 32 22 12 02
+ out[2] = _mm256_castsi256_si128(reg_01);
+ // 73 63 53 43 33 23 13 03
+ out[3] = _mm256_extracti128_si256(reg_01, 1);
+ // 74 64 54 44 34 24 14 04
+ out[4] = _mm256_castsi256_si128(reg_10);
+ // 75 65 55 45 35 25 15 05
+ out[5] = _mm256_extracti128_si256(reg_10, 1);
+ // 76 66 56 46 36 26 16 06
+ out[6] = _mm256_castsi256_si128(reg_11);
+ // 77 67 57 47 37 27 17 07
+ out[7] = _mm256_extracti128_si256(reg_11, 1);
+}
+
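
The scale constant 1 << (15 + bit) above folds the rounding right-shift into
_mm256_mulhrs_epi16, which computes (a * b + (1 << 14)) >> 15 per 16-bit lane;
with b = 1 << (15 + bit) and bit < 0 this equals (a + (1 << (-bit - 1))) >> -bit.
A scalar reference of that identity (hypothetical helper):

  static INLINE int16_t round_shift_scalar_ref(int16_t a, int bit) {
    assert(bit < 0);
    return (int16_t)((a + (1 << (-bit - 1))) >> -bit);
  }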
+static INLINE void round_shift_lowbd_write_buffer_avx2(__m128i *in, int bit,
+ uint8_t *output,
+ int stride, int flipud) {
+ __m256i in_256[4], v_256[4];
+ int j = flipud ? 7 : 0;
+ const int step = flipud ? -1 : 1;
+ const __m256i scale = _mm256_set1_epi16(1 << (15 + bit));
+ const __m256i zero = _mm256_setzero_si256();
+ // in[0], in[1]
+ in_256[0] =
+ _mm256_inserti128_si256(_mm256_castsi128_si256(in[j]), in[j + step], 1);
+ j += 2 * step;
+ // in[2], in[3]
+ in_256[1] =
+ _mm256_inserti128_si256(_mm256_castsi128_si256(in[j]), in[j + step], 1);
+ j += 2 * step;
+ // in[4], in[5]
+ in_256[2] =
+ _mm256_inserti128_si256(_mm256_castsi128_si256(in[j]), in[j + step], 1);
+ j += 2 * step;
+ // in[6], in[7]
+ in_256[3] =
+ _mm256_inserti128_si256(_mm256_castsi128_si256(in[j]), in[j + step], 1);
+
+ // i00 i01 i02 i03 i04 i05 i06 i07 i10 i11 i12 i13 i14 i15 i16 i17
+ in_256[0] = _mm256_mulhrs_epi16(in_256[0], scale);
+ // i20 i21 i22 i23 i24 i25 i26 i27 i30 i31 i32 i33 i34 i35 i36 i37
+ in_256[1] = _mm256_mulhrs_epi16(in_256[1], scale);
+ // i40 i41 i42 i43 i44 i45 i46 i47 i50 i51 i52 i53 i54 i55 i56 i57
+ in_256[2] = _mm256_mulhrs_epi16(in_256[2], scale);
+ // i60 i61 i62 i63 i64 i65 i66 i67 i70 i71 i72 i73 i74 i75 i76 i77
+ in_256[3] = _mm256_mulhrs_epi16(in_256[3], scale);
+
+ const __m128i v0 = _mm_loadl_epi64((__m128i const *)(output));
+ const __m128i v1 = _mm_loadl_epi64((__m128i const *)(output + stride));
+ const __m128i v2 = _mm_loadl_epi64((__m128i const *)(output + 2 * stride));
+ const __m128i v3 = _mm_loadl_epi64((__m128i const *)(output + 3 * stride));
+ const __m128i v4 = _mm_loadl_epi64((__m128i const *)(output + 4 * stride));
+ const __m128i v5 = _mm_loadl_epi64((__m128i const *)(output + 5 * stride));
+ const __m128i v6 = _mm_loadl_epi64((__m128i const *)(output + 6 * stride));
+ const __m128i v7 = _mm_loadl_epi64((__m128i const *)(output + 7 * stride));
+
+ v_256[0] = _mm256_inserti128_si256(_mm256_castsi128_si256(v0), v1, 1);
+ v_256[1] = _mm256_inserti128_si256(_mm256_castsi128_si256(v2), v3, 1);
+ v_256[2] = _mm256_inserti128_si256(_mm256_castsi128_si256(v4), v5, 1);
+ v_256[3] = _mm256_inserti128_si256(_mm256_castsi128_si256(v6), v7, 1);
+
+ const __m256i unpcklo0 = _mm256_unpacklo_epi8(v_256[0], zero);
+ const __m256i unpcklo1 = _mm256_unpacklo_epi8(v_256[1], zero);
+ const __m256i unpcklo2 = _mm256_unpacklo_epi8(v_256[2], zero);
+ const __m256i unpcklo3 = _mm256_unpacklo_epi8(v_256[3], zero);
+ // 00 01 10 11
+ const __m256i x0 = _mm256_adds_epi16(in_256[0], unpcklo0);
+ // 20 21 30 31
+ const __m256i x1 = _mm256_adds_epi16(in_256[1], unpcklo1);
+ // 40 41 50 51
+ const __m256i x2 = _mm256_adds_epi16(in_256[2], unpcklo2);
+ // 60 61 70 71
+ const __m256i x3 = _mm256_adds_epi16(in_256[3], unpcklo3);
+
+ // 00 01 20 21 10 11 30 31
+ const __m256i res_0123 = _mm256_packus_epi16(x0, x1);
+ // 40 41 60 61 50 51 70 71
+ const __m256i res_4567 = _mm256_packus_epi16(x2, x3);
+
+ // 00 01 20 21
+ const __m128i res_02 = _mm256_castsi256_si128(res_0123);
+ // 10 11 30 31
+ const __m128i res_13 = _mm256_extracti128_si256(res_0123, 1);
+ // 40 41 60 61
+ const __m128i res_46 = _mm256_castsi256_si128(res_4567);
+ // 50 51 70 71
+ const __m128i res_57 = _mm256_extracti128_si256(res_4567, 1);
+
+ // 00 01
+ _mm_storel_epi64((__m128i *)(output), res_02);
+ // 10 11
+ _mm_storel_epi64((__m128i *)(output + stride), res_13);
+ // 20 21
+ _mm_storel_epi64((__m128i *)(output + 2 * stride),
+ _mm_unpackhi_epi64(res_02, res_02));
+ // 30 31
+ _mm_storel_epi64((__m128i *)(output + 3 * stride),
+ _mm_unpackhi_epi64(res_13, res_13));
+ // 40 41
+ _mm_storel_epi64((__m128i *)(output + 4 * stride), res_46);
+ // 50 51
+ _mm_storel_epi64((__m128i *)(output + 5 * stride), res_57);
+ // 60 61
+ _mm_storel_epi64((__m128i *)(output + 6 * stride),
+ _mm_unpackhi_epi64(res_46, res_46));
+ // 70 71
+ _mm_storel_epi64((__m128i *)(output + 7 * stride),
+ _mm_unpackhi_epi64(res_57, res_57));
+}
+
+// The AVX2 implementation has an advantage when multiple operations are
+// combined.
+static INLINE void lowbd_inv_txfm2d_8x8_no_identity_avx2(
+ const int32_t *input, uint8_t *output, int stride, TX_TYPE tx_type,
+ TX_SIZE tx_size, int eob) {
+ __m128i buf1[8];
+ const int input_stride = 8;
+ const int8_t *shift = av1_inv_txfm_shift_ls[tx_size];
+ assert(hitx_1d_tab[tx_type] < 2);
+ assert(vitx_1d_tab[tx_type] < 2);
+ const transform_1d_ssse3 row_txfm =
+ lowbd_txfm_all_1d_zeros_8x8_arr[hitx_1d_tab[tx_type]][eob != 1];
+ const transform_1d_ssse3 col_txfm =
+ lowbd_txfm_all_1d_zeros_8x8_arr[vitx_1d_tab[tx_type]][eob != 1];
+
+ assert(col_txfm != NULL);
+ assert(row_txfm != NULL);
+ int ud_flip, lr_flip;
+ get_flip_cfg(tx_type, &ud_flip, &lr_flip);
+
+ __m128i buf0[8];
+ __m128i *buf0_cur = buf0;
+ load_buffer_avx2(input, input_stride, buf0_cur);
+ row_txfm(buf0, buf0);
+
+ assert(shift[0] < 0);
+ __m128i *_buf1 = buf1;
+ round_and_transpose_avx2(buf0, _buf1, shift[0], &lr_flip);
+ assert(shift[1] < 0);
+ col_txfm(buf1, buf1);
+ round_shift_lowbd_write_buffer_avx2(buf1, shift[1], output, stride, ud_flip);
+}
+
+// AVX2 implementation of the 8x8 inverse transform. It was observed that
+// coding an AVX2 path for tx_types with an identity transform in either
+// direction has no advantage.
+static void lowbd_inv_txfm2d_add_8x8_avx2(const int32_t *input, uint8_t *output,
+ int stride, TX_TYPE tx_type,
+ TX_SIZE tx_size, int eob) {
+ switch (tx_type) {
+ case IDTX:
+ av1_lowbd_inv_txfm2d_add_idtx_ssse3(input, output, stride, tx_size);
+
+ case V_DCT:
+ case V_ADST:
+ case V_FLIPADST:
+ av1_lowbd_inv_txfm2d_add_h_identity_ssse3(input, output, stride, tx_type,
+ tx_size, eob);
+ break;
+ case H_DCT:
+ case H_ADST:
+ case H_FLIPADST:
+ av1_lowbd_inv_txfm2d_add_v_identity_ssse3(input, output, stride, tx_type,
+ tx_size, eob);
+ break;
+ default:
+ lowbd_inv_txfm2d_8x8_no_identity_avx2(input, output, stride, tx_type,
+ tx_size, eob);
+ }
+}
+
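
For illustration, invoking the dispatcher above on a fully coded 8x8 DCT_DCT
block could look like this (variable names hypothetical; an eob of 64 means
all 64 coefficients may be nonzero):

  lowbd_inv_txfm2d_add_8x8_avx2(coeffs, dst, dst_stride, DCT_DCT, TX_8X8, 64);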
 // for 32x32, 32x64, 64x32, 64x64, 16x32, 32x16, 64x16, 16x64
static INLINE void lowbd_inv_txfm2d_add_universe_avx2(
const int32_t *input, uint8_t *output, int stride, TX_TYPE tx_type,
@@ -1931,7 +2211,6 @@
int eob) {
switch (tx_size) {
case TX_4X4:
- case TX_8X8:
case TX_4X8:
case TX_8X4:
case TX_8X16:
@@ -1943,6 +2222,10 @@
av1_lowbd_inv_txfm2d_add_ssse3(input, output, stride, tx_type, tx_size,
eob);
break;
+ case TX_8X8:
+ lowbd_inv_txfm2d_add_8x8_avx2(input, output, stride, tx_type, tx_size,
+ eob);
+ break;
case TX_16X16:
case TX_32X32:
case TX_64X64:
diff --git a/av1/common/x86/av1_inv_txfm_ssse3.c b/av1/common/x86/av1_inv_txfm_ssse3.c
index 738cc98..79a6064 100644
--- a/av1/common/x86/av1_inv_txfm_ssse3.c
+++ b/av1/common/x86/av1_inv_txfm_ssse3.c
@@ -76,7 +76,7 @@
btf_16_adds_subs_out_sse2(output[1], output[2], x[1], x[2]);
}
-static void idct8_low1_ssse3(const __m128i *input, __m128i *output) {
+void av1_idct8_low1_ssse3(const __m128i *input, __m128i *output) {
const int32_t *cospi = cospi_arr(INV_COS_BIT);
// stage 1
@@ -99,7 +99,7 @@
output[4] = x[0];
}
-static void idct8_sse2(const __m128i *input, __m128i *output) {
+void av1_idct8_sse2(const __m128i *input, __m128i *output) {
const int8_t cos_bit = INV_COS_BIT;
const int32_t *cospi = cospi_arr(INV_COS_BIT);
const __m128i __rounding = _mm_set1_epi32(1 << (INV_COS_BIT - 1));
@@ -1698,7 +1698,7 @@
}
}
-static void iadst8_low1_ssse3(const __m128i *input, __m128i *output) {
+void av1_iadst8_low1_ssse3(const __m128i *input, __m128i *output) {
const int8_t cos_bit = INV_COS_BIT;
const int32_t *cospi = cospi_arr(INV_COS_BIT);
const __m128i __zero = _mm_setzero_si128();
@@ -1744,7 +1744,7 @@
output[7] = _mm_subs_epi16(__zero, x[1]);
}
-static void iadst8_sse2(const __m128i *input, __m128i *output) {
+void av1_iadst8_sse2(const __m128i *input, __m128i *output) {
const int8_t cos_bit = INV_COS_BIT;
const int32_t *cospi = cospi_arr(INV_COS_BIT);
const __m128i __zero = _mm_setzero_si128();
@@ -2269,7 +2269,7 @@
static const transform_1d_ssse3
lowbd_txfm_all_1d_w8_arr[TX_SIZES][ITX_TYPES_1D] = {
{ idct4_sse2, iadst4_sse2, iidentity4_ssse3 },
- { idct8_sse2, iadst8_sse2, iidentity8_sse2 },
+ { av1_idct8_sse2, av1_iadst8_sse2, iidentity8_sse2 },
{ idct16_sse2, iadst16_sse2, iidentity16_ssse3 },
{ idct32_sse2, NULL, NULL },
{ idct64_low32_ssse3, NULL, NULL },
@@ -2284,8 +2284,8 @@
{ iadst4_sse2, iadst4_sse2, NULL, NULL },
{ iidentity4_ssse3, iidentity4_ssse3, NULL, NULL },
},
- { { idct8_low1_ssse3, idct8_sse2, NULL, NULL },
- { iadst8_low1_ssse3, iadst8_sse2, NULL, NULL },
+ { { av1_idct8_low1_ssse3, av1_idct8_sse2, NULL, NULL },
+ { av1_iadst8_low1_ssse3, av1_iadst8_sse2, NULL, NULL },
{ iidentity8_sse2, iidentity8_sse2, NULL, NULL } },
{
{ idct16_low1_ssse3, idct16_low8_ssse3, idct16_sse2, NULL },
@@ -2382,24 +2382,27 @@
}
}
-static INLINE void lowbd_inv_txfm2d_add_idtx_ssse3(const int32_t *input,
- uint8_t *output, int stride,
- TX_SIZE tx_size) {
+void av1_lowbd_inv_txfm2d_add_idtx_ssse3(const int32_t *input, uint8_t *output,
+ int stride, TX_SIZE tx_size) {
const int8_t *shift = av1_inv_txfm_shift_ls[tx_size];
const int txw_idx = get_txw_idx(tx_size);
const int txh_idx = get_txh_idx(tx_size);
const int txfm_size_col = tx_size_wide[tx_size];
const int txfm_size_row = tx_size_high[tx_size];
- const int input_stride = AOMMIN(32, txfm_size_col);
+ const int col_max = AOMMIN(32, txfm_size_col);
const int row_max = AOMMIN(32, txfm_size_row);
+ const int input_stride = row_max;
const int rect_type = get_rect_tx_log_ratio(txfm_size_col, txfm_size_row);
- __m128i buf[32];
- for (int i = 0; i < (input_stride >> 3); ++i) {
- iidentity_row_8xn_ssse3(buf, input + 8 * i, input_stride, shift[0], row_max,
- txw_idx, rect_type);
- iidentity_col_8xn_ssse3(output + 8 * i, stride, buf, shift[1], row_max,
- txh_idx);
+ for (int i = 0; i < (col_max >> 3); ++i) {
+ for (int j = 0; j < (row_max >> 3); j++) {
+ __m128i buf[8];
+ iidentity_row_8xn_ssse3(buf, input + j * 8 + i * 8 * input_stride,
+ row_max, shift[0], 8, txw_idx, rect_type);
+ transpose_16bit_8x8(buf, buf);
+ iidentity_col_8xn_ssse3(output + i * 8 + j * 8 * stride, stride, buf,
+ shift[1], 8, txh_idx);
+ }
}
}
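
A worked example of the new 8x8 tiling: for a 32x16 IDTX block, col_max is 32,
row_max is 16 and input_stride is 16, so the tile at (i, j), with i in 0..3
and j in 0..1, is read from

  input + j * 8 + i * 8 * 16       // height-first source layout

transposed in registers, and written to

  output + i * 8 + j * 8 * stride  // row-major destination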
@@ -2424,8 +2427,7 @@
int ud_flip, lr_flip;
get_flip_cfg(tx_type, &ud_flip, &lr_flip);
- load_buffer_32bit_to_16bit_w4(input, txfm_size_col, buf, txfm_size_row);
- transpose_16bit_4x4(buf, buf);
+ load_buffer_32bit_to_16bit_w4(input, txfm_size_row, buf, txfm_size_col);
row_txfm(buf, buf);
if (lr_flip) {
__m128i temp[4];
@@ -2481,9 +2483,9 @@
const int txfm_size_col = tx_size_wide[tx_size];
const int txfm_size_row = tx_size_high[tx_size];
const int buf_size_w_div8 = txfm_size_col >> 3;
- const int buf_size_nonzero_w_div8 = (eobx + 8) >> 3;
+ const int buf_size_nonzero_w = ((eobx + 8) >> 3) << 3;
const int buf_size_nonzero_h_div8 = (eoby + 8) >> 3;
- const int input_stride = AOMMIN(32, txfm_size_col);
+ const int input_stride = AOMMIN(32, txfm_size_row);
const int rect_type = get_rect_tx_log_ratio(txfm_size_col, txfm_size_row);
const int fun_idx_x = lowbd_txfm_all_1d_zeros_idx[eobx];
@@ -2499,14 +2501,10 @@
get_flip_cfg(tx_type, &ud_flip, &lr_flip);
for (int i = 0; i < buf_size_nonzero_h_div8; i++) {
__m128i buf0[64];
- const int32_t *input_row = input + i * input_stride * 8;
- for (int j = 0; j < buf_size_nonzero_w_div8; ++j) {
- __m128i *buf0_cur = buf0 + j * 8;
- load_buffer_32bit_to_16bit(input_row + j * 8, input_stride, buf0_cur, 8);
- transpose_16bit_8x8(buf0_cur, buf0_cur);
- }
+ load_buffer_32bit_to_16bit(input + 8 * i, input_stride, buf0,
+ buf_size_nonzero_w);
if (rect_type == 1 || rect_type == -1) {
- round_shift_ssse3(buf0, buf0, input_stride); // rect special code
+ round_shift_ssse3(buf0, buf0, buf_size_nonzero_w); // rect special code
}
row_txfm(buf0, buf0);
round_shift_16bit_ssse3(buf0, txfm_size_col, shift[0]);
@@ -2540,9 +2538,10 @@
}
}
-static INLINE void lowbd_inv_txfm2d_add_h_identity_ssse3(
- const int32_t *input, uint8_t *output, int stride, TX_TYPE tx_type,
- TX_SIZE tx_size, int eob) {
+void av1_lowbd_inv_txfm2d_add_h_identity_ssse3(const int32_t *input,
+ uint8_t *output, int stride,
+ TX_TYPE tx_type, TX_SIZE tx_size,
+ int eob) {
const int8_t *shift = av1_inv_txfm_shift_ls[tx_size];
int eobx, eoby;
get_eobx_eoby_scan_h_identity(&eobx, &eoby, tx_size, eob);
@@ -2551,7 +2550,8 @@
const int txfm_size_col = tx_size_wide[tx_size];
const int txfm_size_row = tx_size_high[tx_size];
const int buf_size_w_div8 = (eobx + 8) >> 3;
- const int input_stride = AOMMIN(32, txfm_size_col);
+ const int buf_size_h_div8 = (eoby + 8) >> 3;
+ const int input_stride = AOMMIN(32, txfm_size_row);
const int rect_type = get_rect_tx_log_ratio(txfm_size_col, txfm_size_row);
const int fun_idx = lowbd_txfm_all_1d_zeros_idx[eoby];
@@ -2565,8 +2565,13 @@
get_flip_cfg(tx_type, &ud_flip, &lr_flip);
for (int i = 0; i < buf_size_w_div8; i++) {
__m128i buf0[64];
- iidentity_row_8xn_ssse3(buf0, input + 8 * i, input_stride, shift[0],
- eoby + 1, txw_idx, rect_type);
+ for (int j = 0; j < buf_size_h_div8; j++) {
+ __m128i *buf0_cur = buf0 + j * 8;
+ const int32_t *input_cur = input + i * 8 * input_stride + j * 8;
+ iidentity_row_8xn_ssse3(buf0_cur, input_cur, input_stride, shift[0], 8,
+ txw_idx, rect_type);
+ transpose_16bit_8x8(buf0_cur, buf0_cur);
+ }
col_txfm(buf0, buf0);
__m128i mshift = _mm_set1_epi16(1 << (15 + shift[1]));
int k = ud_flip ? (txfm_size_row - 1) : 0;
@@ -2582,9 +2587,10 @@
}
}
-static INLINE void lowbd_inv_txfm2d_add_v_identity_ssse3(
- const int32_t *input, uint8_t *output, int stride, TX_TYPE tx_type,
- TX_SIZE tx_size, int eob) {
+void av1_lowbd_inv_txfm2d_add_v_identity_ssse3(const int32_t *input,
+ uint8_t *output, int stride,
+ TX_TYPE tx_type, TX_SIZE tx_size,
+ int eob) {
__m128i buf1[64];
int eobx, eoby;
get_eobx_eoby_scan_v_identity(&eobx, &eoby, tx_size, eob);
@@ -2594,8 +2600,9 @@
const int txfm_size_col = tx_size_wide[tx_size];
const int txfm_size_row = tx_size_high[tx_size];
const int buf_size_w_div8 = txfm_size_col >> 3;
+ const int buf_size_nonzero_w = ((eobx + 8) >> 3) << 3;
const int buf_size_h_div8 = (eoby + 8) >> 3;
- const int input_stride = AOMMIN(32, txfm_size_col);
+ const int input_stride = AOMMIN(32, txfm_size_row);
const int rect_type = get_rect_tx_log_ratio(txfm_size_col, txfm_size_row);
const int fun_idx = lowbd_txfm_all_1d_zeros_idx[eobx];
@@ -2607,14 +2614,10 @@
get_flip_cfg(tx_type, &ud_flip, &lr_flip);
for (int i = 0; i < buf_size_h_div8; i++) {
__m128i buf0[64];
- const int32_t *input_row = input + i * input_stride * 8;
- for (int j = 0; j < AOMMIN(4, buf_size_w_div8); ++j) {
- __m128i *buf0_cur = buf0 + j * 8;
- load_buffer_32bit_to_16bit(input_row + j * 8, input_stride, buf0_cur, 8);
- transpose_16bit_8x8(buf0_cur, buf0_cur);
- }
+ load_buffer_32bit_to_16bit(input + i * 8, input_stride, buf0,
+ buf_size_nonzero_w);
if (rect_type == 1 || rect_type == -1) {
- round_shift_ssse3(buf0, buf0, input_stride); // rect special code
+ round_shift_ssse3(buf0, buf0, buf_size_nonzero_w); // rect special code
}
row_txfm(buf0, buf0);
round_shift_16bit_ssse3(buf0, txfm_size_col, shift[0]);
@@ -2648,19 +2651,19 @@
tx_size, eob);
break;
case IDTX:
- lowbd_inv_txfm2d_add_idtx_ssse3(input, output, stride, tx_size);
+ av1_lowbd_inv_txfm2d_add_idtx_ssse3(input, output, stride, tx_size);
break;
case V_DCT:
case V_ADST:
case V_FLIPADST:
- lowbd_inv_txfm2d_add_h_identity_ssse3(input, output, stride, tx_type,
- tx_size, eob);
+ av1_lowbd_inv_txfm2d_add_h_identity_ssse3(input, output, stride, tx_type,
+ tx_size, eob);
break;
case H_DCT:
case H_ADST:
case H_FLIPADST:
- lowbd_inv_txfm2d_add_v_identity_ssse3(input, output, stride, tx_type,
- tx_size, eob);
+ av1_lowbd_inv_txfm2d_add_v_identity_ssse3(input, output, stride, tx_type,
+ tx_size, eob);
break;
default:
lowbd_inv_txfm2d_add_no_identity_ssse3(input, output, stride, tx_type,
@@ -2690,8 +2693,7 @@
int ud_flip, lr_flip;
get_flip_cfg(tx_type, &ud_flip, &lr_flip);
- load_buffer_32bit_to_16bit_w4(input, txfm_size_col, buf, txfm_size_row);
- transpose_16bit_4x8(buf, buf);
+ load_buffer_32bit_to_16bit(input, txfm_size_row, buf, txfm_size_col);
round_shift_ssse3(buf, buf, txfm_size_col); // rect special code
row_txfm(buf, buf);
// round_shift_16bit_ssse3(buf, txfm_size_col, shift[0]);// shift[0] is 0
@@ -2728,8 +2730,7 @@
int ud_flip, lr_flip;
get_flip_cfg(tx_type, &ud_flip, &lr_flip);
- load_buffer_32bit_to_16bit(input, txfm_size_col, buf, txfm_size_row);
- transpose_16bit_8x4(buf, buf);
+ load_buffer_32bit_to_16bit_w4(input, txfm_size_row, buf, txfm_size_col);
round_shift_ssse3(buf, buf, txfm_size_col); // rect special code
row_txfm(buf, buf);
// round_shift_16bit_ssse3(buf, txfm_size_col, shift[0]); // shift[0] is 0
@@ -2769,11 +2770,10 @@
const int row_one_loop = 8;
for (int i = 0; i < 2; ++i) {
- const int32_t *input_cur = input + i * txfm_size_col * row_one_loop;
+ const int32_t *input_cur = input + i * row_one_loop;
__m128i *buf_cur = buf + i * row_one_loop;
- load_buffer_32bit_to_16bit_w4(input_cur, txfm_size_col, buf_cur,
- row_one_loop);
- transpose_16bit_4x8(buf_cur, buf_cur);
+ load_buffer_32bit_to_16bit(input_cur, txfm_size_row, buf_cur,
+ txfm_size_col);
if (row_txfm == iidentity4_ssse3) {
const __m128i scale = pair_set_epi16(NewSqrt2, 3 << (NewSqrt2Bits - 1));
const __m128i ones = _mm_set1_epi16(1);
@@ -2826,13 +2826,7 @@
int ud_flip, lr_flip;
get_flip_cfg(tx_type, &ud_flip, &lr_flip);
const int row_one_loop = 8;
- for (int i = 0; i < buf_size_w_div8; ++i) {
- const int32_t *input_cur = input + i * row_one_loop;
- __m128i *buf_cur = buf + i * row_one_loop;
- load_buffer_32bit_to_16bit(input_cur, txfm_size_col, buf_cur,
- txfm_size_row);
- transpose_16bit_8x4(buf_cur, buf_cur);
- }
+ load_buffer_32bit_to_16bit_w4(input, txfm_size_row, buf, txfm_size_col);
if (row_txfm == iidentity16_ssse3) {
const __m128i scale = pair_set_epi16(2 * NewSqrt2, 3 << (NewSqrt2Bits - 1));
const __m128i ones = _mm_set1_epi16(1);
diff --git a/av1/common/x86/av1_inv_txfm_ssse3.h b/av1/common/x86/av1_inv_txfm_ssse3.h
index b85bc9d..1873d01 100644
--- a/av1/common/x86/av1_inv_txfm_ssse3.h
+++ b/av1/common/x86/av1_inv_txfm_ssse3.h
@@ -19,7 +19,6 @@
#include "aom/aom_integer.h"
#include "aom_dsp/x86/transpose_sse2.h"
-#include "aom_dsp/x86/txfm_common_sse2.h"
#ifdef __cplusplus
extern "C" {
@@ -215,7 +214,7 @@
eob -= 1;
const int txfm_size_row = tx_size_high[tx_size];
const int eoby_max = AOMMIN(32, txfm_size_row) - 1;
- *eobx = eob / (eoby_max + 1);
+ *eobx = eob_fill[eob / (eoby_max + 1)];
*eoby = (eob >= eoby_max) ? eoby_max : eob_fill[eob];
}
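The eobx fix routes the quotient through eob_fill, which (as its existing use for eoby on the next line suggests) rounds a coefficient index up to the end of its zero-fill group. Previously a partially filled group could undercount the nonzero width:

    // Sketch, assuming eob_fill rounds up within a group (e.g. 3 -> 7):
    int eobx_old = eob / (eoby_max + 1);            // may truncate the group
    int eobx_new = eob_fill[eob / (eoby_max + 1)];  // covers the whole group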
@@ -224,6 +223,23 @@
void av1_lowbd_inv_txfm2d_add_ssse3(const int32_t *input, uint8_t *output,
int stride, TX_TYPE tx_type,
TX_SIZE tx_size, int eob);
+
+void av1_lowbd_inv_txfm2d_add_idtx_ssse3(const int32_t *input, uint8_t *output,
+ int stride, TX_SIZE tx_size);
+
+void av1_lowbd_inv_txfm2d_add_h_identity_ssse3(const int32_t *input,
+ uint8_t *output, int stride,
+ TX_TYPE tx_type, TX_SIZE tx_size,
+ int eob);
+void av1_lowbd_inv_txfm2d_add_v_identity_ssse3(const int32_t *input,
+ uint8_t *output, int stride,
+ TX_TYPE tx_type, TX_SIZE tx_size,
+ int eob);
+
+void av1_iadst8_low1_ssse3(const __m128i *input, __m128i *output);
+
+void av1_idct8_low1_ssse3(const __m128i *input, __m128i *output);
+
#ifdef __cplusplus
} // extern "C"
#endif
diff --git a/av1/common/x86/av1_txfm_sse2.h b/av1/common/x86/av1_txfm_sse2.h
index b67bf54..129721c 100644
--- a/av1/common/x86/av1_txfm_sse2.h
+++ b/av1/common/x86/av1_txfm_sse2.h
@@ -307,6 +307,10 @@
typedef void (*transform_1d_sse2)(const __m128i *input, __m128i *output,
int8_t cos_bit);
+void av1_iadst8_sse2(const __m128i *input, __m128i *output);
+
+void av1_idct8_sse2(const __m128i *input, __m128i *output);
+
typedef struct {
transform_1d_sse2 col, row; // vertical and horizontal
} transform_2d_sse2;
diff --git a/av1/common/x86/av1_txfm_sse4.c b/av1/common/x86/av1_txfm_sse4.c
index 65ccd19..1894efd 100644
--- a/av1/common/x86/av1_txfm_sse4.c
+++ b/av1/common/x86/av1_txfm_sse4.c
@@ -14,6 +14,7 @@
#include "av1/common/av1_txfm.h"
#include "av1/common/x86/av1_txfm_sse4.h"
+// This function assumes `arr` is 16-byte aligned.
void av1_round_shift_array_sse4_1(int32_t *arr, int size, int bit) {
__m128i *const vec = (__m128i *)arr;
const int vec_size = size >> 2;
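The new comment documents a real constraint: casting arr to __m128i * and dereferencing it compiles to aligned 16-byte accesses, so an unaligned buffer would fault. A conforming call might look like this (a sketch; aom_memalign/aom_free are libaom's allocators):

    #include "aom_mem/aom_mem.h"

    void av1_round_shift_array_sse4_1(int32_t *arr, int size, int bit);

    static void round_shift_example(void) {
      int32_t *arr = (int32_t *)aom_memalign(16, 64 * sizeof(*arr));
      if (!arr) return;
      // ... fill arr with transform output ...
      av1_round_shift_array_sse4_1(arr, 64, 2);  // consumed 4 lanes at a time
      aom_free(arr);
    }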
diff --git a/av1/common/x86/av1_txfm_sse4.h b/av1/common/x86/av1_txfm_sse4.h
index 6cad821..387dfd6 100644
--- a/av1/common/x86/av1_txfm_sse4.h
+++ b/av1/common/x86/av1_txfm_sse4.h
@@ -25,7 +25,7 @@
return _mm_srai_epi32(tmp, bit);
}
-static INLINE void av1_round_shift_array_32_sse4_1(__m128i *input,
+static INLINE void av1_round_shift_array_32_sse4_1(const __m128i *input,
__m128i *output,
const int size,
const int bit) {
@@ -42,7 +42,7 @@
}
}
-static INLINE void av1_round_shift_rect_array_32_sse4_1(__m128i *input,
+static INLINE void av1_round_shift_rect_array_32_sse4_1(const __m128i *input,
__m128i *output,
const int size,
const int bit,
diff --git a/av1/common/x86/convolve_avx2.c b/av1/common/x86/convolve_avx2.c
index 30de982..3862bbe 100644
--- a/av1/common/x86/convolve_avx2.c
+++ b/av1/common/x86/convolve_avx2.c
@@ -714,32 +714,32 @@
(__m128i *)(&src_ptr[i * src_stride + src_stride]))),
0x20);
// row0 0..7 row1 0..7
- const __m256i s_16l = _mm256_unpacklo_epi8(data, v_zero);
+ const __m256i s_16lo = _mm256_unpacklo_epi8(data, v_zero);
// row0 8..F row1 8..F
- const __m256i s_16h = _mm256_unpackhi_epi8(data, v_zero);
+ const __m256i s_16hi = _mm256_unpackhi_epi8(data, v_zero);
// row0 00 00 01 01 .. 03 03 row1 00 00 01 01 .. 03 03
- const __m256i s_ll = _mm256_unpacklo_epi16(s_16l, s_16l);
+ const __m256i s_lolo = _mm256_unpacklo_epi16(s_16lo, s_16lo);
// row0 04 04 .. 07 07 row1 04 04 .. 07 07
- const __m256i s_lh = _mm256_unpackhi_epi16(s_16l, s_16l);
+ const __m256i s_lohi = _mm256_unpackhi_epi16(s_16lo, s_16lo);
// row0 08 08 09 09 .. 0B 0B row1 08 08 09 09 .. 0B 0B
- const __m256i s_hl = _mm256_unpacklo_epi16(s_16h, s_16h);
+ const __m256i s_hilo = _mm256_unpacklo_epi16(s_16hi, s_16hi);
// row0 0C 0C .. 0F 0F row1 0C 0C .. 0F 0F
- const __m256i s_hh = _mm256_unpackhi_epi16(s_16h, s_16h);
+ const __m256i s_hihi = _mm256_unpackhi_epi16(s_16hi, s_16hi);
// 00 01 01 02 02 03 03 04 10 11 11 12 12 13 13 14
- s[0] = _mm256_alignr_epi8(s_lh, s_ll, 2);
+ s[0] = _mm256_alignr_epi8(s_lohi, s_lolo, 2);
// 02 03 03 04 04 05 05 06 12 13 13 14 14 15 15 16
- s[1] = _mm256_alignr_epi8(s_lh, s_ll, 10);
+ s[1] = _mm256_alignr_epi8(s_lohi, s_lolo, 10);
// 04 05 05 06 06 07 07 08 14 15 15 16 16 17 17 18
- s[2] = _mm256_alignr_epi8(s_hl, s_lh, 2);
+ s[2] = _mm256_alignr_epi8(s_hilo, s_lohi, 2);
// 06 07 07 08 08 09 09 0A 16 17 17 18 18 19 19 1A
- s[3] = _mm256_alignr_epi8(s_hl, s_lh, 10);
+ s[3] = _mm256_alignr_epi8(s_hilo, s_lohi, 10);
// 08 09 09 0A 0A 0B 0B 0C 18 19 19 1A 1A 1B 1B 1C
- s[4] = _mm256_alignr_epi8(s_hh, s_hl, 2);
+ s[4] = _mm256_alignr_epi8(s_hihi, s_hilo, 2);
// 0A 0B 0B 0C 0C 0D 0D 0E 1A 1B 1B 1C 1C 1D 1D 1E
- s[5] = _mm256_alignr_epi8(s_hh, s_hl, 10);
+ s[5] = _mm256_alignr_epi8(s_hihi, s_hilo, 10);
const __m256i res_lo = convolve_12taps(s, coeffs);
@@ -784,26 +784,26 @@
(__m128i *)(&src_ptr[i * src_stride + j + 4]))),
0x20);
// row0 0..7 4..B
- const __m256i s_16l = _mm256_unpacklo_epi8(data, v_zero);
+ const __m256i s_16lo = _mm256_unpacklo_epi8(data, v_zero);
// row0 8..F C..13
- const __m256i s_16h = _mm256_unpackhi_epi8(data, v_zero);
+ const __m256i s_16hi = _mm256_unpackhi_epi8(data, v_zero);
// row0 00 00 01 01 .. 03 03 04 04 05 05 .. 07 07
- const __m256i s_ll = _mm256_unpacklo_epi16(s_16l, s_16l);
+ const __m256i s_lolo = _mm256_unpacklo_epi16(s_16lo, s_16lo);
// row0 04 04 .. 07 07 08 08 .. 0B 0B
- const __m256i s_lh = _mm256_unpackhi_epi16(s_16l, s_16l);
+ const __m256i s_lohi = _mm256_unpackhi_epi16(s_16lo, s_16lo);
// row0 08 08 09 09 .. 0B 0B 0C 0C 0D 0D .. 0F 0F
- const __m256i s_hl = _mm256_unpacklo_epi16(s_16h, s_16h);
+ const __m256i s_hilo = _mm256_unpacklo_epi16(s_16hi, s_16hi);
// row0 0C 0C 0D 0D .. 0F 0F 10 10 11 11 .. 13 13
- const __m256i s_hh = _mm256_unpackhi_epi16(s_16h, s_16h);
+ const __m256i s_hihi = _mm256_unpackhi_epi16(s_16hi, s_16hi);
- s[0] = _mm256_alignr_epi8(s_lh, s_ll, 2);
- s[1] = _mm256_alignr_epi8(s_lh, s_ll, 10);
- s[2] = _mm256_alignr_epi8(s_hl, s_lh, 2);
- s[3] = _mm256_alignr_epi8(s_hl, s_lh, 10);
- s[4] = _mm256_alignr_epi8(s_hh, s_hl, 2);
- s[5] = _mm256_alignr_epi8(s_hh, s_hl, 10);
+ s[0] = _mm256_alignr_epi8(s_lohi, s_lolo, 2);
+ s[1] = _mm256_alignr_epi8(s_lohi, s_lolo, 10);
+ s[2] = _mm256_alignr_epi8(s_hilo, s_lohi, 2);
+ s[3] = _mm256_alignr_epi8(s_hilo, s_lohi, 10);
+ s[4] = _mm256_alignr_epi8(s_hihi, s_hilo, 2);
+ s[5] = _mm256_alignr_epi8(s_hihi, s_hilo, 10);
const __m256i res_lo = convolve_12taps(s, coeffs);
diff --git a/av1/common/x86/highbd_inv_txfm_avx2.c b/av1/common/x86/highbd_inv_txfm_avx2.c
index 0798c6d..cbfe561 100644
--- a/av1/common/x86/highbd_inv_txfm_avx2.c
+++ b/av1/common/x86/highbd_inv_txfm_avx2.c
@@ -231,11 +231,10 @@
out[7] = _mm256_permute2f128_si256(x0, x1, 0x31);
}
-static void load_buffer_32x32(const int32_t *coeff, __m256i *in,
- int input_stiride, int size) {
- int i;
- for (i = 0; i < size; ++i) {
- in[i] = _mm256_loadu_si256((const __m256i *)(coeff + i * input_stiride));
+static INLINE void load_buffer_32bit_input(const int32_t *in, int stride,
+ __m256i *out, int out_size) {
+ for (int i = 0; i < out_size; ++i) {
+ out[i] = _mm256_loadu_si256((const __m256i *)(in + i * stride));
}
}
@@ -4119,9 +4118,9 @@
const int txfm_size_col = tx_size_wide[tx_size];
const int txfm_size_row = tx_size_high[tx_size];
const int buf_size_w_div8 = txfm_size_col >> 3;
- const int buf_size_nonzero_w_div8 = (eobx + 8) >> 3;
+ const int buf_size_nonzero_w = (eobx + 8) >> 3 << 3;
const int buf_size_nonzero_h_div8 = (eoby + 8) >> 3;
- const int input_stride = AOMMIN(32, txfm_size_col);
+ const int input_stride = AOMMIN(32, txfm_size_row);
const int rect_type = get_rect_tx_log_ratio(txfm_size_col, txfm_size_row);
const int fun_idx_x = lowbd_txfm_all_1d_zeros_idx[eobx];
const int fun_idx_y = lowbd_txfm_all_1d_zeros_idx[eoby];
@@ -4138,16 +4137,11 @@
// 1st stage: column transform
for (int i = 0; i < buf_size_nonzero_h_div8; i++) {
__m256i buf0[64];
- const int32_t *input_row = input + i * input_stride * 8;
- for (int j = 0; j < buf_size_nonzero_w_div8; ++j) {
- __m256i *buf0_cur = buf0 + j * 8;
- load_buffer_32x32(input_row + j * 8, buf0_cur, input_stride, 8);
-
- transpose_8x8_avx2(&buf0_cur[0], &buf0_cur[0]);
- }
+ load_buffer_32bit_input(input + i * 8, input_stride, buf0,
+ buf_size_nonzero_w);
if (rect_type == 1 || rect_type == -1) {
- round_shift_rect_array_32_avx2(buf0, buf0, buf_size_nonzero_w_div8 << 3,
- 0, NewInvSqrt2);
+ round_shift_rect_array_32_avx2(buf0, buf0, buf_size_nonzero_w, 0,
+ NewInvSqrt2);
}
row_txfm(buf0, buf0, INV_COS_BIT, 0, bd, -shift[0]);
diff --git a/av1/common/x86/highbd_inv_txfm_sse4.c b/av1/common/x86/highbd_inv_txfm_sse4.c
index de3af3a..4ff6a90 100644
--- a/av1/common/x86/highbd_inv_txfm_sse4.c
+++ b/av1/common/x86/highbd_inv_txfm_sse4.c
@@ -161,8 +161,6 @@
op[3] = _mm_srai_epi32(op[3], UNIT_QUANT_SHIFT);
for (int i = 0; i < 2; ++i) {
- transpose_32bit_4x4(op, op);
-
__m128i a1 = op[0];
__m128i c1 = op[1];
__m128i d1 = op[2];
@@ -180,6 +178,9 @@
op[1] = b1;
op[2] = c1;
op[3] = d1;
+ if (i == 0) {
+ transpose_32bit_4x4(op, op);
+ }
}
// Convert to int16_t. The C code checks that we are in range.
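This moves the 4x4 inverse WHT onto the same one-transpose-between-passes pattern used elsewhere in the patch: instead of transposing at the top of both iterations, the butterflies run first and a single transpose separates the two 1-D passes. In outline:

    // Sketch of the new pass structure:
    for (int i = 0; i < 2; ++i) {
      // ... 1-D butterflies on op[0..3] ...
      if (i == 0) transpose_32bit_4x4(op, op);  // the only transpose
    }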
@@ -468,15 +469,10 @@
// Stage 0
// Stage 1
// Stage 2
- v0 = _mm_unpacklo_epi32(in[0], in[1]);
- v1 = _mm_unpackhi_epi32(in[0], in[1]);
- v2 = _mm_unpacklo_epi32(in[2], in[3]);
- v3 = _mm_unpackhi_epi32(in[2], in[3]);
-
- u0 = _mm_unpacklo_epi64(v0, v2);
- u1 = _mm_unpackhi_epi64(v0, v2);
- u2 = _mm_unpacklo_epi64(v1, v3);
- u3 = _mm_unpackhi_epi64(v1, v3);
+ u0 = in[0];
+ u1 = in[1];
+ u2 = in[2];
+ u3 = in[3];
x = _mm_mullo_epi32(u0, cospi32);
y = _mm_mullo_epi32(u2, cospi32);
@@ -529,19 +525,13 @@
__m128i s0, s1, s2, s3, s4, s5, s6, s7;
__m128i x0, x1, x2, x3;
__m128i u0, u1, u2, u3;
- __m128i v0, v1, v2, v3;
__m128i u0_low, u1_low, u2_low, u3_low;
__m128i u0_high, u1_high, u2_high, u3_high;
- v0 = _mm_unpacklo_epi32(in[0], in[1]);
- v1 = _mm_unpackhi_epi32(in[0], in[1]);
- v2 = _mm_unpacklo_epi32(in[2], in[3]);
- v3 = _mm_unpackhi_epi32(in[2], in[3]);
-
- x0 = _mm_unpacklo_epi64(v0, v2);
- x1 = _mm_unpackhi_epi64(v0, v2);
- x2 = _mm_unpacklo_epi64(v1, v3);
- x3 = _mm_unpackhi_epi64(v1, v3);
+ x0 = in[0];
+ x1 = in[1];
+ x2 = in[2];
+ x3 = in[3];
s0 = _mm_mullo_epi32(x0, sinpi1);
s1 = _mm_mullo_epi32(x0, sinpi2);
@@ -697,7 +687,6 @@
static void iidentity4_sse4_1(__m128i *in, __m128i *out, int bit, int do_cols,
int bd, int out_shift) {
(void)bit;
- __m128i v[4];
__m128i zero = _mm_setzero_si128();
__m128i fact = _mm_set1_epi32(NewSqrt2);
__m128i offset = _mm_set1_epi32(1 << (NewSqrt2Bits - 1));
@@ -728,17 +717,6 @@
round_shift_4x4(out, out_shift);
highbd_clamp_epi32_sse4_1(out, out, &clamp_lo, &clamp_hi, 4);
}
-
- // Transpose for 4x4
- v[0] = _mm_unpacklo_epi32(out[0], out[1]);
- v[1] = _mm_unpackhi_epi32(out[0], out[1]);
- v[2] = _mm_unpacklo_epi32(out[2], out[3]);
- v[3] = _mm_unpackhi_epi32(out[2], out[3]);
-
- out[0] = _mm_unpacklo_epi64(v[0], v[2]);
- out[1] = _mm_unpackhi_epi64(v[0], v[2]);
- out[2] = _mm_unpacklo_epi64(v[1], v[3]);
- out[3] = _mm_unpackhi_epi64(v[1], v[3]);
}
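With iidentity4_sse4_1 (and the 4-point idct/iadst hunks above) no longer transposing their own output, the 2-D driver below inserts exactly one transpose_32bit_4x4 between the row and column passes. Each case in the switch now follows this shape, where row_txfm/col_txfm stand in for the per-type kernels:

    // Sketch of the restructured 4x4 flow:
    load_buffer_4x4(input, in);
    row_txfm(in, in, INV_COS_BIT, 0, bd, 0);  // no internal transpose
    transpose_32bit_4x4(in, in);              // the only transpose
    col_txfm(in, in, INV_COS_BIT, 1, bd, 0);
    write_buffer_4x4(in, output, stride, lr_flip, ud_flip, -shift[1], bd);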
void av1_inv_txfm2d_add_4x4_sse4_1(const int32_t *input, uint16_t *output,
int stride, TX_TYPE tx_type, int bd) {
@@ -749,96 +727,112 @@
case DCT_DCT:
load_buffer_4x4(input, in);
idct4x4_sse4_1(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_32bit_4x4(in, in);
idct4x4_sse4_1(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 0, -shift[1], bd);
break;
case ADST_DCT:
load_buffer_4x4(input, in);
idct4x4_sse4_1(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_32bit_4x4(in, in);
iadst4x4_sse4_1(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 0, -shift[1], bd);
break;
case DCT_ADST:
load_buffer_4x4(input, in);
iadst4x4_sse4_1(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_32bit_4x4(in, in);
idct4x4_sse4_1(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 0, -shift[1], bd);
break;
case ADST_ADST:
load_buffer_4x4(input, in);
iadst4x4_sse4_1(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_32bit_4x4(in, in);
iadst4x4_sse4_1(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 0, -shift[1], bd);
break;
case FLIPADST_DCT:
load_buffer_4x4(input, in);
idct4x4_sse4_1(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_32bit_4x4(in, in);
iadst4x4_sse4_1(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 1, -shift[1], bd);
break;
case DCT_FLIPADST:
load_buffer_4x4(input, in);
iadst4x4_sse4_1(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_32bit_4x4(in, in);
idct4x4_sse4_1(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 1, 0, -shift[1], bd);
break;
case FLIPADST_FLIPADST:
load_buffer_4x4(input, in);
iadst4x4_sse4_1(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_32bit_4x4(in, in);
iadst4x4_sse4_1(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 1, 1, -shift[1], bd);
break;
case ADST_FLIPADST:
load_buffer_4x4(input, in);
iadst4x4_sse4_1(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_32bit_4x4(in, in);
iadst4x4_sse4_1(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 1, 0, -shift[1], bd);
break;
case FLIPADST_ADST:
load_buffer_4x4(input, in);
iadst4x4_sse4_1(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_32bit_4x4(in, in);
iadst4x4_sse4_1(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 1, -shift[1], bd);
break;
case IDTX:
load_buffer_4x4(input, in);
iidentity4_sse4_1(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_32bit_4x4(in, in);
iidentity4_sse4_1(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 0, -shift[1], bd);
break;
case V_DCT:
load_buffer_4x4(input, in);
iidentity4_sse4_1(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_32bit_4x4(in, in);
idct4x4_sse4_1(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 0, -shift[1], bd);
break;
case H_DCT:
load_buffer_4x4(input, in);
idct4x4_sse4_1(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_32bit_4x4(in, in);
iidentity4_sse4_1(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 0, -shift[1], bd);
break;
case V_ADST:
load_buffer_4x4(input, in);
iidentity4_sse4_1(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_32bit_4x4(in, in);
iadst4x4_sse4_1(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 0, -shift[1], bd);
break;
case H_ADST:
load_buffer_4x4(input, in);
iadst4x4_sse4_1(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_32bit_4x4(in, in);
iidentity4_sse4_1(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 0, -shift[1], bd);
break;
case V_FLIPADST:
load_buffer_4x4(input, in);
iidentity4_sse4_1(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_32bit_4x4(in, in);
iadst4x4_sse4_1(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 0, 1, -shift[1], bd);
break;
case H_FLIPADST:
load_buffer_4x4(input, in);
iadst4x4_sse4_1(in, in, INV_COS_BIT, 0, bd, 0);
+ transpose_32bit_4x4(in, in);
iidentity4_sse4_1(in, in, INV_COS_BIT, 1, bd, 0);
write_buffer_4x4(in, output, stride, 1, 0, -shift[1], bd);
break;
@@ -1408,75 +1402,66 @@
switch (tx_type) {
case DCT_DCT:
load_buffer_8x8(input, in);
- transpose_8x8(in, out);
- idct8x8_sse4_1(out, in, INV_COS_BIT, 0, bd, -shift[0]);
- transpose_8x8(in, out);
- idct8x8_sse4_1(out, in, INV_COS_BIT, 1, bd, 0);
- write_buffer_8x8(in, output, stride, 0, 0, -shift[1], bd);
+ idct8x8_sse4_1(in, out, INV_COS_BIT, 0, bd, -shift[0]);
+ transpose_8x8(out, in);
+ idct8x8_sse4_1(in, out, INV_COS_BIT, 1, bd, 0);
+ write_buffer_8x8(out, output, stride, 0, 0, -shift[1], bd);
break;
case DCT_ADST:
load_buffer_8x8(input, in);
- transpose_8x8(in, out);
- iadst8x8_sse4_1(out, in, INV_COS_BIT, 0, bd, -shift[0]);
- transpose_8x8(in, out);
- idct8x8_sse4_1(out, in, INV_COS_BIT, 1, bd, 0);
- write_buffer_8x8(in, output, stride, 0, 0, -shift[1], bd);
+ iadst8x8_sse4_1(in, out, INV_COS_BIT, 0, bd, -shift[0]);
+ transpose_8x8(out, in);
+ idct8x8_sse4_1(in, out, INV_COS_BIT, 1, bd, 0);
+ write_buffer_8x8(out, output, stride, 0, 0, -shift[1], bd);
break;
case ADST_DCT:
load_buffer_8x8(input, in);
- transpose_8x8(in, out);
- idct8x8_sse4_1(out, in, INV_COS_BIT, 0, bd, -shift[0]);
- transpose_8x8(in, out);
- iadst8x8_sse4_1(out, in, INV_COS_BIT, 1, bd, 0);
- write_buffer_8x8(in, output, stride, 0, 0, -shift[1], bd);
+ idct8x8_sse4_1(in, out, INV_COS_BIT, 0, bd, -shift[0]);
+ transpose_8x8(out, in);
+ iadst8x8_sse4_1(in, out, INV_COS_BIT, 1, bd, 0);
+ write_buffer_8x8(out, output, stride, 0, 0, -shift[1], bd);
break;
case ADST_ADST:
load_buffer_8x8(input, in);
- transpose_8x8(in, out);
- iadst8x8_sse4_1(out, in, INV_COS_BIT, 0, bd, -shift[0]);
- transpose_8x8(in, out);
- iadst8x8_sse4_1(out, in, INV_COS_BIT, 1, bd, 0);
- write_buffer_8x8(in, output, stride, 0, 0, -shift[1], bd);
+ iadst8x8_sse4_1(in, out, INV_COS_BIT, 0, bd, -shift[0]);
+ transpose_8x8(out, in);
+ iadst8x8_sse4_1(in, out, INV_COS_BIT, 1, bd, 0);
+ write_buffer_8x8(out, output, stride, 0, 0, -shift[1], bd);
break;
case FLIPADST_DCT:
load_buffer_8x8(input, in);
- transpose_8x8(in, out);
- idct8x8_sse4_1(out, in, INV_COS_BIT, 0, bd, -shift[0]);
- transpose_8x8(in, out);
- iadst8x8_sse4_1(out, in, INV_COS_BIT, 1, bd, 0);
- write_buffer_8x8(in, output, stride, 0, 1, -shift[1], bd);
+ idct8x8_sse4_1(in, out, INV_COS_BIT, 0, bd, -shift[0]);
+ transpose_8x8(out, in);
+ iadst8x8_sse4_1(in, out, INV_COS_BIT, 1, bd, 0);
+ write_buffer_8x8(out, output, stride, 0, 1, -shift[1], bd);
break;
case DCT_FLIPADST:
load_buffer_8x8(input, in);
- transpose_8x8(in, out);
- iadst8x8_sse4_1(out, in, INV_COS_BIT, 0, bd, -shift[0]);
- transpose_8x8(in, out);
- idct8x8_sse4_1(out, in, INV_COS_BIT, 1, bd, 0);
- write_buffer_8x8(in, output, stride, 1, 0, -shift[1], bd);
+ iadst8x8_sse4_1(in, out, INV_COS_BIT, 0, bd, -shift[0]);
+ transpose_8x8(out, in);
+ idct8x8_sse4_1(in, out, INV_COS_BIT, 1, bd, 0);
+ write_buffer_8x8(out, output, stride, 1, 0, -shift[1], bd);
break;
case ADST_FLIPADST:
load_buffer_8x8(input, in);
- transpose_8x8(in, out);
- iadst8x8_sse4_1(out, in, INV_COS_BIT, 0, bd, -shift[0]);
- transpose_8x8(in, out);
- iadst8x8_sse4_1(out, in, INV_COS_BIT, 1, bd, 0);
- write_buffer_8x8(in, output, stride, 1, 0, -shift[1], bd);
+ iadst8x8_sse4_1(in, out, INV_COS_BIT, 0, bd, -shift[0]);
+ transpose_8x8(out, in);
+ iadst8x8_sse4_1(in, out, INV_COS_BIT, 1, bd, 0);
+ write_buffer_8x8(out, output, stride, 1, 0, -shift[1], bd);
break;
case FLIPADST_FLIPADST:
load_buffer_8x8(input, in);
- transpose_8x8(in, out);
- iadst8x8_sse4_1(out, in, INV_COS_BIT, 0, bd, -shift[0]);
- transpose_8x8(in, out);
- iadst8x8_sse4_1(out, in, INV_COS_BIT, 1, bd, 0);
- write_buffer_8x8(in, output, stride, 1, 1, -shift[1], bd);
+ iadst8x8_sse4_1(in, out, INV_COS_BIT, 0, bd, -shift[0]);
+ transpose_8x8(out, in);
+ iadst8x8_sse4_1(in, out, INV_COS_BIT, 1, bd, 0);
+ write_buffer_8x8(out, output, stride, 1, 1, -shift[1], bd);
break;
case FLIPADST_ADST:
load_buffer_8x8(input, in);
- transpose_8x8(in, out);
- iadst8x8_sse4_1(out, in, INV_COS_BIT, 0, bd, -shift[0]);
- transpose_8x8(in, out);
- iadst8x8_sse4_1(out, in, INV_COS_BIT, 1, bd, 0);
- write_buffer_8x8(in, output, stride, 0, 1, -shift[1], bd);
+ iadst8x8_sse4_1(in, out, INV_COS_BIT, 0, bd, -shift[0]);
+ transpose_8x8(out, in);
+ iadst8x8_sse4_1(in, out, INV_COS_BIT, 1, bd, 0);
+ write_buffer_8x8(out, output, stride, 0, 1, -shift[1], bd);
break;
default: assert(0);
}
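The 8x8 cases get the same treatment via buffer ping-pong: the old code transposed before each pass (two transpose_8x8 calls per block), while the rewritten flow keeps one:

    // Rewritten 8x8 dataflow (DCT_DCT shown; the other pairs match in shape):
    load_buffer_8x8(input, in);
    idct8x8_sse4_1(in, out, INV_COS_BIT, 0, bd, -shift[0]);  // rows -> out
    transpose_8x8(out, in);                                  // one transpose
    idct8x8_sse4_1(in, out, INV_COS_BIT, 1, bd, 0);          // cols -> out
    write_buffer_8x8(out, output, stride, 0, 0, -shift[1], bd);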
@@ -5251,9 +5236,11 @@
const int txh_idx = get_txh_idx(tx_size);
const int txfm_size_col = tx_size_wide[tx_size];
const int txfm_size_row = tx_size_high[tx_size];
- const int input_stride = AOMMIN(32, txfm_size_col);
- const int buf_size_w_div4 = input_stride >> 2;
+ const int buf_size_w = AOMMIN(32, txfm_size_col);
+ const int buf_size_w_div4 = buf_size_w >> 2;
const int buf_size_h_div8 = (eoby + 8) >> 3;
+ const int row_max = AOMMIN(32, txfm_size_row);
+ const int input_stride = row_max;
const int rect_type = get_rect_tx_log_ratio(txfm_size_col, txfm_size_row);
const int fun_idx = lowbd_txfm_all_1d_zeros_idx[eoby];
const transform_1d_sse4_1 row_txfm =
@@ -5265,13 +5252,9 @@
for (int i = 0; i < (buf_size_h_div8 << 1); ++i) {
__m128i buf0[16];
- const int32_t *input_row = input + i * input_stride * 4;
- for (int j = 0; j < buf_size_w_div4; ++j) {
- __m128i *buf0_cur = buf0 + j * 4;
- load_buffer_32bit_input(input_row + j * 4, input_stride, buf0_cur, 4);
- }
+ load_buffer_32bit_input(input + i * 4, input_stride, buf0, buf_size_w);
if (rect_type == 1 || rect_type == -1) {
- av1_round_shift_rect_array_32_sse4_1(buf0, buf0, input_stride, 0,
+ av1_round_shift_rect_array_32_sse4_1(buf0, buf0, buf_size_w, 0,
NewInvSqrt2);
}
row_txfm(buf0, buf0, INV_COS_BIT, 0, bd, -shift[0]);
@@ -5279,10 +5262,13 @@
__m128i *_buf1 = buf1 + i * 4;
for (int j = 0; j < buf_size_w_div4; ++j) {
- _buf1[j * txfm_size_row + 0] = buf0[j * 4 + 0];
- _buf1[j * txfm_size_row + 1] = buf0[j * 4 + 1];
- _buf1[j * txfm_size_row + 2] = buf0[j * 4 + 2];
- _buf1[j * txfm_size_row + 3] = buf0[j * 4 + 3];
+ __m128i *buf0_cur = buf0 + j * 4;
+ TRANSPOSE_4X4(buf0_cur[0], buf0_cur[1], buf0_cur[2], buf0_cur[3],
+ buf0_cur[0], buf0_cur[1], buf0_cur[2], buf0_cur[3]);
+ _buf1[j * txfm_size_row + 0] = buf0_cur[0];
+ _buf1[j * txfm_size_row + 1] = buf0_cur[1];
+ _buf1[j * txfm_size_row + 2] = buf0_cur[2];
+ _buf1[j * txfm_size_row + 3] = buf0_cur[3];
}
}
for (int i = 0; i < buf_size_w_div4; i++) {
@@ -5313,10 +5299,11 @@
const int txh_idx = get_txh_idx(tx_size);
const int txfm_size_col = tx_size_wide[tx_size];
const int txfm_size_row = tx_size_high[tx_size];
- const int input_stride = AOMMIN(32, txfm_size_col);
- const int buf_size_w_div8 = input_stride >> 2;
+ const int buf_size_w_div4 = AOMMIN(32, txfm_size_col) >> 2;
const int row_max = AOMMIN(32, txfm_size_row);
+ const int input_stride = row_max;
const int buf_size_nonzero_w_div8 = (eobx + 8) >> 3;
+ const int buf_size_nonzero_w = buf_size_nonzero_w_div8 << 3;
const int rect_type = get_rect_tx_log_ratio(txfm_size_col, txfm_size_row);
const int fun_idx = lowbd_txfm_all_1d_zeros_idx[eobx];
const transform_1d_sse4_1 row_txfm =
@@ -5328,32 +5315,26 @@
for (int i = 0; i < (row_max >> 2); ++i) {
__m128i buf0[16];
- const int32_t *input_row = input + i * input_stride * 4;
- for (int j = 0; j < (buf_size_nonzero_w_div8 << 1); ++j) {
- __m128i *buf0_cur = buf0 + j * 4;
- load_buffer_32bit_input(input_row + j * 4, input_stride, buf0_cur, 4);
-
- TRANSPOSE_4X4(buf0_cur[0], buf0_cur[1], buf0_cur[2], buf0_cur[3],
- buf0_cur[0], buf0_cur[1], buf0_cur[2], buf0_cur[3]);
- }
+ load_buffer_32bit_input(input + i * 4, input_stride, buf0,
+ buf_size_nonzero_w);
if (rect_type == 1 || rect_type == -1) {
- av1_round_shift_rect_array_32_sse4_1(
- buf0, buf0, (buf_size_nonzero_w_div8 << 3), 0, NewInvSqrt2);
+ av1_round_shift_rect_array_32_sse4_1(buf0, buf0, buf_size_nonzero_w, 0,
+ NewInvSqrt2);
}
row_txfm(buf0, buf0, INV_COS_BIT, 0, bd, -shift[0]);
__m128i *_buf1 = buf1 + i * 4;
if (lr_flip) {
- for (int j = 0; j < buf_size_w_div8; ++j) {
+ for (int j = 0; j < buf_size_w_div4; ++j) {
TRANSPOSE_4X4(buf0[4 * j + 3], buf0[4 * j + 2], buf0[4 * j + 1],
buf0[4 * j],
- _buf1[txfm_size_row * (buf_size_w_div8 - 1 - j) + 0],
- _buf1[txfm_size_row * (buf_size_w_div8 - 1 - j) + 1],
- _buf1[txfm_size_row * (buf_size_w_div8 - 1 - j) + 2],
- _buf1[txfm_size_row * (buf_size_w_div8 - 1 - j) + 3]);
+ _buf1[txfm_size_row * (buf_size_w_div4 - 1 - j) + 0],
+ _buf1[txfm_size_row * (buf_size_w_div4 - 1 - j) + 1],
+ _buf1[txfm_size_row * (buf_size_w_div4 - 1 - j) + 2],
+ _buf1[txfm_size_row * (buf_size_w_div4 - 1 - j) + 3]);
}
} else {
- for (int j = 0; j < buf_size_w_div8; ++j) {
+ for (int j = 0; j < buf_size_w_div4; ++j) {
TRANSPOSE_4X4(
buf0[j * 4 + 0], buf0[j * 4 + 1], buf0[j * 4 + 2], buf0[j * 4 + 3],
_buf1[j * txfm_size_row + 0], _buf1[j * txfm_size_row + 1],
@@ -5361,7 +5342,7 @@
}
}
}
- for (int i = 0; i < buf_size_w_div8; i++) {
+ for (int i = 0; i < buf_size_w_div4; i++) {
col_txfm(buf1 + i * txfm_size_row, buf1 + i * txfm_size_row, INV_COS_BIT, 1,
bd, 0);
@@ -5390,8 +5371,10 @@
const int txh_idx = get_txh_idx(tx_size);
const int txfm_size_col = tx_size_wide[tx_size];
const int txfm_size_row = tx_size_high[tx_size];
- const int input_stride = AOMMIN(32, txfm_size_col);
const int row_max = AOMMIN(32, txfm_size_row);
+ const int input_stride = row_max;
+ const int buf_size_w = AOMMIN(32, txfm_size_col);
+ const int buf_size_w_div4 = buf_size_w >> 2;
const int rect_type = get_rect_tx_log_ratio(txfm_size_col, txfm_size_row);
const transform_1d_sse4_1 row_txfm =
highbd_txfm_all_1d_zeros_w8_arr[txw_idx][hitx_1d_tab[tx_type]][0];
@@ -5400,26 +5383,25 @@
for (int i = 0; i < (row_max >> 2); ++i) {
__m128i buf0[32];
- const int32_t *input_row = input + i * input_stride * 4;
- for (int j = 0; j < (input_stride >> 2); ++j) {
- __m128i *buf0_cur = buf0 + j * 4;
- load_buffer_32bit_input(input_row + j * 4, input_stride, buf0_cur, 4);
- }
+ load_buffer_32bit_input(input + i * 4, input_stride, buf0, buf_size_w);
if (rect_type == 1 || rect_type == -1) {
- av1_round_shift_rect_array_32_sse4_1(buf0, buf0, input_stride, 0,
+ av1_round_shift_rect_array_32_sse4_1(buf0, buf0, buf_size_w, 0,
NewInvSqrt2);
}
row_txfm(buf0, buf0, INV_COS_BIT, 0, bd, -shift[0]);
__m128i *_buf1 = buf1 + i * 4;
- for (int j = 0; j < (input_stride >> 2); ++j) {
- _buf1[j * txfm_size_row + 0] = buf0[j * 4 + 0];
- _buf1[j * txfm_size_row + 1] = buf0[j * 4 + 1];
- _buf1[j * txfm_size_row + 2] = buf0[j * 4 + 2];
- _buf1[j * txfm_size_row + 3] = buf0[j * 4 + 3];
+ for (int j = 0; j < buf_size_w_div4; ++j) {
+ __m128i *buf0_cur = buf0 + j * 4;
+ TRANSPOSE_4X4(buf0_cur[0], buf0_cur[1], buf0_cur[2], buf0_cur[3],
+ buf0_cur[0], buf0_cur[1], buf0_cur[2], buf0_cur[3]);
+ _buf1[j * txfm_size_row + 0] = buf0_cur[0];
+ _buf1[j * txfm_size_row + 1] = buf0_cur[1];
+ _buf1[j * txfm_size_row + 2] = buf0_cur[2];
+ _buf1[j * txfm_size_row + 3] = buf0_cur[3];
}
}
- for (int i = 0; i < (input_stride >> 2); i++) {
+ for (int i = 0; i < buf_size_w_div4; i++) {
col_txfm(buf1 + i * txfm_size_row, buf1 + i * txfm_size_row, INV_COS_BIT, 1,
bd, 0);
@@ -5450,10 +5432,10 @@
const int txh_idx = get_txh_idx(tx_size);
const int txfm_size_col = tx_size_wide[tx_size];
const int txfm_size_row = tx_size_high[tx_size];
- const int buf_size_w_div8 = txfm_size_col >> 2;
- const int buf_size_nonzero_w_div8 = (eobx + 8) >> 3;
+ const int buf_size_w_div4 = txfm_size_col >> 2;
+ const int buf_size_nonzero_w = (eobx + 8) >> 3 << 3;
const int buf_size_nonzero_h_div8 = (eoby + 8) >> 3;
- const int input_stride = AOMMIN(32, txfm_size_col);
+ const int input_stride = AOMMIN(32, txfm_size_row);
const int rect_type = get_rect_tx_log_ratio(txfm_size_col, txfm_size_row);
const int fun_idx_x = lowbd_txfm_all_1d_zeros_idx[eobx];
@@ -5471,32 +5453,26 @@
// 1st stage: column transform
for (int i = 0; i < buf_size_nonzero_h_div8 << 1; i++) {
__m128i buf0[64];
- const int32_t *input_row = input + i * input_stride * 4;
- for (int j = 0; j < buf_size_nonzero_w_div8 << 1; ++j) {
- __m128i *buf0_cur = buf0 + j * 4;
- load_buffer_32bit_input(input_row + j * 4, input_stride, buf0_cur, 4);
-
- TRANSPOSE_4X4(buf0_cur[0], buf0_cur[1], buf0_cur[2], buf0_cur[3],
- buf0_cur[0], buf0_cur[1], buf0_cur[2], buf0_cur[3]);
- }
+ load_buffer_32bit_input(input + i * 4, input_stride, buf0,
+ buf_size_nonzero_w);
if (rect_type == 1 || rect_type == -1) {
- av1_round_shift_rect_array_32_sse4_1(
- buf0, buf0, buf_size_nonzero_w_div8 << 3, 0, NewInvSqrt2);
+ av1_round_shift_rect_array_32_sse4_1(buf0, buf0, buf_size_nonzero_w, 0,
+ NewInvSqrt2);
}
row_txfm(buf0, buf0, INV_COS_BIT, 0, bd, -shift[0]);
__m128i *_buf1 = buf1 + i * 4;
if (lr_flip) {
- for (int j = 0; j < buf_size_w_div8; ++j) {
+ for (int j = 0; j < buf_size_w_div4; ++j) {
TRANSPOSE_4X4(buf0[4 * j + 3], buf0[4 * j + 2], buf0[4 * j + 1],
buf0[4 * j],
- _buf1[txfm_size_row * (buf_size_w_div8 - 1 - j) + 0],
- _buf1[txfm_size_row * (buf_size_w_div8 - 1 - j) + 1],
- _buf1[txfm_size_row * (buf_size_w_div8 - 1 - j) + 2],
- _buf1[txfm_size_row * (buf_size_w_div8 - 1 - j) + 3]);
+ _buf1[txfm_size_row * (buf_size_w_div4 - 1 - j) + 0],
+ _buf1[txfm_size_row * (buf_size_w_div4 - 1 - j) + 1],
+ _buf1[txfm_size_row * (buf_size_w_div4 - 1 - j) + 2],
+ _buf1[txfm_size_row * (buf_size_w_div4 - 1 - j) + 3]);
}
} else {
- for (int j = 0; j < buf_size_w_div8; ++j) {
+ for (int j = 0; j < buf_size_w_div4; ++j) {
TRANSPOSE_4X4(
buf0[j * 4 + 0], buf0[j * 4 + 1], buf0[j * 4 + 2], buf0[j * 4 + 3],
_buf1[j * txfm_size_row + 0], _buf1[j * txfm_size_row + 1],
@@ -5505,7 +5481,7 @@
}
}
// 2nd stage: column transform
- for (int i = 0; i < buf_size_w_div8; i++) {
+ for (int i = 0; i < buf_size_w_div4; i++) {
col_txfm(buf1 + i * txfm_size_row, buf1 + i * txfm_size_row, INV_COS_BIT, 1,
bd, 0);
@@ -5539,7 +5515,7 @@
highbd_txfm_all_1d_zeros_w8_arr[txw_idx][hitx_1d_tab[tx_type]][0];
const transform_1d_sse4_1 col_txfm =
highbd_txfm_all_1d_zeros_w8_arr[txh_idx][vitx_1d_tab[tx_type]][1];
- const int input_stride = AOMMIN(32, txfm_size_col);
+ const int input_stride = AOMMIN(32, txfm_size_row);
assert(col_txfm != NULL);
assert(row_txfm != NULL);
@@ -5548,9 +5524,8 @@
// 1st stage: column transform
__m128i buf0[8];
- const int32_t *input_row = input;
- __m128i *buf0_cur = buf0;
- load_buffer_32bit_input(input_row, input_stride, buf0_cur, txfm_size_row);
+ load_buffer_32bit_input(input, input_stride, buf0, txfm_size_col);
+ load_buffer_32bit_input(input + 4, input_stride, buf0 + 4, txfm_size_col);
av1_round_shift_rect_array_32_sse4_1(buf0, buf0, txfm_size_row, 0,
NewInvSqrt2);
row_txfm(buf0, buf0, INV_COS_BIT, 0, bd, -shift[0]);
@@ -5606,12 +5581,7 @@
const int32_t *input_row = input;
load_buffer_32bit_input(input_row, 4, buf0, txfm_size_col);
- TRANSPOSE_4X4(buf0[0], buf0[2], buf0[4], buf0[6], buf1[0], buf1[1], buf1[2],
- buf1[3]);
- TRANSPOSE_4X4(buf0[1], buf0[3], buf0[5], buf0[7], buf1[4], buf1[5], buf1[6],
- buf1[7]);
-
- av1_round_shift_rect_array_32_sse4_1(buf1, buf0, txfm_size_col, 0,
+ av1_round_shift_rect_array_32_sse4_1(buf0, buf0, txfm_size_col, 0,
NewInvSqrt2);
row_txfm(buf0, buf0, INV_COS_BIT, 0, bd, -shift[0]);
@@ -5625,8 +5595,9 @@
// 2nd stage: column transform
for (int i = 0; i < 2; i++) {
- col_txfm(buf1_ptr + i * txfm_size_row, buf1_ptr + i * txfm_size_row,
- INV_COS_BIT, 1, bd, 0);
+ __m128i *buf1_cur = buf1_ptr + i * txfm_size_row;
+ transpose_32bit_4x4(buf1_cur, buf1_cur);
+ col_txfm(buf1_cur, buf1_cur, INV_COS_BIT, 1, bd, 0);
}
av1_round_shift_array_32_sse4_1(buf1_ptr, buf1_ptr, txfm_size_col, -shift[1]);
// write to buffer
@@ -5650,7 +5621,7 @@
highbd_txfm_all_1d_zeros_w8_arr[txw_idx][hitx_1d_tab[tx_type]][0];
const transform_1d_sse4_1 col_txfm =
highbd_txfm_all_1d_zeros_w8_arr[txh_idx][vitx_1d_tab[tx_type]][2];
- const int input_stride = AOMMIN(32, txfm_size_col);
+ const int input_stride = AOMMIN(32, txfm_size_row);
assert(col_txfm != NULL);
assert(row_txfm != NULL);
@@ -5659,11 +5630,11 @@
// 1st stage: column transform
__m128i buf0[16];
- const int32_t *input_row = input;
- __m128i *buf0_cur = buf0;
- load_buffer_32bit_input(input_row, input_stride, buf0_cur, txfm_size_row);
for (int i = 0; i < (txfm_size_row >> 2); i++) {
- row_txfm(buf0 + (i << 2), buf0 + (i << 2), INV_COS_BIT, 0, bd, -shift[0]);
+ const int32_t *input_row = input + i * 4;
+ __m128i *buf0_cur = buf0 + i * 4;
+ load_buffer_32bit_input(input_row, input_stride, buf0_cur, txfm_size_col);
+ row_txfm(buf0_cur, buf0_cur, INV_COS_BIT, 0, bd, -shift[0]);
}
if (lr_flip) {
@@ -5717,11 +5688,7 @@
const int32_t *input_row = input;
load_buffer_32bit_input(input_row, 4, buf0, txfm_size_col);
- for (int j = 0; j < buf_size_w_div8; j++) {
- TRANSPOSE_4X4(buf0[j], buf0[j + 4], buf0[j + 8], buf0[j + 12], buf1[4 * j],
- buf1[4 * j + 1], buf1[4 * j + 2], buf1[4 * j + 3]);
- }
- row_txfm(buf1, buf0, INV_COS_BIT, 0, bd, -shift[0]);
+ row_txfm(buf0, buf0, INV_COS_BIT, 0, bd, -shift[0]);
__m128i *buf1_ptr;
if (lr_flip) {
@@ -5733,8 +5700,9 @@
// 2nd stage: column transform
for (int i = 0; i < buf_size_w_div8; i++) {
- col_txfm(buf1_ptr + i * txfm_size_row, buf1_ptr + i * txfm_size_row,
- INV_COS_BIT, 1, bd, 0);
+ __m128i *buf1_cur = buf1_ptr + i * txfm_size_row;
+ transpose_32bit_4x4(buf1_cur, buf1_cur);
+ col_txfm(buf1_cur, buf1_cur, INV_COS_BIT, 1, bd, 0);
}
av1_round_shift_array_32_sse4_1(buf1_ptr, buf1_ptr, txfm_size_col, -shift[1]);
diff --git a/av1/common/x86/warp_plane_sse4.c b/av1/common/x86/warp_plane_sse4.c
index e35b557..4c05555 100644
--- a/av1/common/x86/warp_plane_sse4.c
+++ b/av1/common/x86/warp_plane_sse4.c
@@ -33,7 +33,6 @@
/* clang-format off */
DECLARE_ALIGNED(8, const int8_t,
av1_filter_8bit[WARPEDPIXEL_PREC_SHIFTS * 3 + 1][8]) = {
-#if WARPEDPIXEL_PREC_BITS == 6
// [-1, 0)
{ 0, 127, 0, 0, 0, 1, 0, 0}, { 0, 127, 0, 0, -1, 2, 0, 0},
{ 1, 127, -1, 0, -3, 4, 0, 0}, { 1, 126, -2, 0, -4, 6, 1, 0},
@@ -135,62 +134,6 @@
{ 0, 0, 4, -3, 0, -1, 127, 1}, { 0, 0, 2, -1, 0, 0, 127, 0},
// dummy (replicate row index 191)
{ 0, 0, 2, -1, 0, 0, 127, 0},
-
-#else
- // [-1, 0)
- { 0, 127, 0, 0, 0, 1, 0, 0}, { 1, 127, -1, 0, -3, 4, 0, 0},
- { 1, 126, -3, 0, -5, 8, 1, 0}, { 1, 124, -4, 0, -7, 13, 1, 0},
- { 2, 122, -6, 0, -9, 18, 1, 0}, { 2, 120, -7, 0, -11, 22, 2, 0},
- { 3, 117, -8, 0, -13, 27, 2, 0}, { 3, 114, -10, 0, -14, 32, 3, 0},
- { 3, 111, -11, 0, -15, 37, 3, 0}, { 3, 108, -12, 0, -16, 42, 3, 0},
- { 4, 104, -13, 0, -17, 47, 3, 0}, { 4, 100, -14, 0, -17, 52, 3, 0},
- { 4, 96, -15, 0, -18, 58, 3, 0}, { 4, 91, -16, 0, -18, 63, 4, 0},
- { 4, 87, -17, 0, -18, 68, 4, 0}, { 4, 82, -17, 0, -18, 73, 4, 0},
- { 4, 78, -18, 0, -18, 78, 4, 0}, { 4, 73, -18, 0, -17, 82, 4, 0},
- { 4, 68, -18, 0, -17, 87, 4, 0}, { 4, 63, -18, 0, -16, 91, 4, 0},
- { 3, 58, -18, 0, -15, 96, 4, 0}, { 3, 52, -17, 0, -14, 100, 4, 0},
- { 3, 47, -17, 0, -13, 104, 4, 0}, { 3, 42, -16, 0, -12, 108, 3, 0},
- { 3, 37, -15, 0, -11, 111, 3, 0}, { 3, 32, -14, 0, -10, 114, 3, 0},
- { 2, 27, -13, 0, -8, 117, 3, 0}, { 2, 22, -11, 0, -7, 120, 2, 0},
- { 1, 18, -9, 0, -6, 122, 2, 0}, { 1, 13, -7, 0, -4, 124, 1, 0},
- { 1, 8, -5, 0, -3, 126, 1, 0}, { 0, 4, -3, 0, -1, 127, 1, 0},
- // [0, 1)
- { 0, 0, 1, 0, 0, 127, 0, 0}, { 0, -3, 4, 1, 1, 127, -2, 0},
- { 0, -6, 8, 1, 2, 126, -3, 0}, {-1, -8, 13, 2, 3, 125, -5, -1},
- {-1, -11, 18, 3, 4, 123, -7, -1}, {-1, -13, 23, 3, 4, 121, -8, -1},
- {-1, -15, 27, 4, 5, 119, -10, -1}, {-2, -17, 33, 5, 6, 116, -12, -1},
- {-2, -18, 38, 5, 6, 113, -13, -1}, {-2, -19, 43, 6, 7, 110, -15, -2},
- {-2, -20, 49, 6, 7, 106, -16, -2}, {-2, -21, 54, 7, 7, 102, -17, -2},
- {-2, -22, 59, 7, 8, 98, -18, -2}, {-2, -22, 64, 7, 8, 94, -19, -2},
- {-2, -22, 69, 8, 8, 89, -20, -2}, {-2, -21, 74, 8, 8, 84, -21, -2},
- {-2, -21, 79, 8, 8, 79, -21, -2}, {-2, -21, 84, 8, 8, 74, -21, -2},
- {-2, -20, 89, 8, 8, 69, -22, -2}, {-2, -19, 94, 8, 7, 64, -22, -2},
- {-2, -18, 98, 8, 7, 59, -22, -2}, {-2, -17, 102, 7, 7, 54, -21, -2},
- {-2, -16, 106, 7, 6, 49, -20, -2}, {-2, -15, 110, 7, 6, 43, -19, -2},
- {-1, -13, 113, 6, 5, 38, -18, -2}, {-1, -12, 116, 6, 5, 33, -17, -2},
- {-1, -10, 119, 5, 4, 27, -15, -1}, {-1, -8, 121, 4, 3, 23, -13, -1},
- {-1, -7, 123, 4, 3, 18, -11, -1}, {-1, -5, 125, 3, 2, 13, -8, -1},
- { 0, -3, 126, 2, 1, 8, -6, 0}, { 0, -2, 127, 1, 1, 4, -3, 0},
- // [1, 2)
- { 0, 0, 127, 0, 0, 1, 0, 0}, { 0, 1, 127, -1, 0, -3, 4, 0},
- { 0, 1, 126, -3, 0, -5, 8, 1}, { 0, 1, 124, -4, 0, -7, 13, 1},
- { 0, 2, 122, -6, 0, -9, 18, 1}, { 0, 2, 120, -7, 0, -11, 22, 2},
- { 0, 3, 117, -8, 0, -13, 27, 2}, { 0, 3, 114, -10, 0, -14, 32, 3},
- { 0, 3, 111, -11, 0, -15, 37, 3}, { 0, 3, 108, -12, 0, -16, 42, 3},
- { 0, 4, 104, -13, 0, -17, 47, 3}, { 0, 4, 100, -14, 0, -17, 52, 3},
- { 0, 4, 96, -15, 0, -18, 58, 3}, { 0, 4, 91, -16, 0, -18, 63, 4},
- { 0, 4, 87, -17, 0, -18, 68, 4}, { 0, 4, 82, -17, 0, -18, 73, 4},
- { 0, 4, 78, -18, 0, -18, 78, 4}, { 0, 4, 73, -18, 0, -17, 82, 4},
- { 0, 4, 68, -18, 0, -17, 87, 4}, { 0, 4, 63, -18, 0, -16, 91, 4},
- { 0, 3, 58, -18, 0, -15, 96, 4}, { 0, 3, 52, -17, 0, -14, 100, 4},
- { 0, 3, 47, -17, 0, -13, 104, 4}, { 0, 3, 42, -16, 0, -12, 108, 3},
- { 0, 3, 37, -15, 0, -11, 111, 3}, { 0, 3, 32, -14, 0, -10, 114, 3},
- { 0, 2, 27, -13, 0, -8, 117, 3}, { 0, 2, 22, -11, 0, -7, 120, 2},
- { 0, 1, 18, -9, 0, -6, 122, 2}, { 0, 1, 13, -7, 0, -4, 124, 1},
- { 0, 1, 8, -5, 0, -3, 126, 1}, { 0, 0, 4, -3, 0, -1, 127, 1},
- // dummy (replicate row index 95)
- { 0, 0, 4, -3, 0, -1, 127, 1},
-#endif // WARPEDPIXEL_PREC_BITS == 6
};
/* clang-format on */
diff --git a/av1/decoder/decodeframe.c b/av1/decoder/decodeframe.c
index 53275ea..5b76de8 100644
--- a/av1/decoder/decodeframe.c
+++ b/av1/decoder/decodeframe.c
@@ -10,6 +10,7 @@
*/
#include <assert.h>
+#include <stdbool.h>
#include <stddef.h>
#include "config/aom_config.h"
@@ -4325,10 +4326,9 @@
trans_dec_factor;
}
- if (params->wmtype <= AFFINE) {
- int good_shear_params = av1_get_shear_params(params);
- if (!good_shear_params) return 0;
- }
+ assert(params->wmtype <= AFFINE);
+ int good_shear_params = av1_get_shear_params(params);
+ if (!good_shear_params) return 0;
return 1;
}
@@ -4434,7 +4434,7 @@
lock_buffer_pool(cm->buffer_pool);
reset_ref_frame_map(cm);
assert(cm->cur_frame->ref_count == 1);
- for (i = 0; i < FRAME_BUFFERS; ++i) {
+ for (i = 0; i < cm->buffer_pool->num_frame_bufs; ++i) {
// Reset all unreferenced frame buffers. We can also reset cm->cur_frame
// because we are the sole owner of cm->cur_frame.
if (frame_bufs[i].ref_count > 0 && &frame_bufs[i] != cm->cur_frame) {
@@ -5128,7 +5128,7 @@
if (!av1_superres_scaled(cm)) return;
assert(!cm->features.all_lossless);
- av1_superres_upscale(cm, pool);
+ av1_superres_upscale(cm, pool, 0);
}
uint32_t av1_decode_frame_headers_and_setup(AV1Decoder *pbi,
@@ -5218,7 +5218,7 @@
if (cm->rst_info[0].frame_restoration_type != RESTORE_NONE ||
cm->rst_info[1].frame_restoration_type != RESTORE_NONE ||
cm->rst_info[2].frame_restoration_type != RESTORE_NONE) {
- av1_alloc_restoration_buffers(cm);
+ av1_alloc_restoration_buffers(cm, /*is_sgr_enabled =*/true);
}
const int use_highbd = cm->seq_params->use_highbitdepth;
diff --git a/av1/decoder/decodetxb.c b/av1/decoder/decodetxb.c
index 0ec1487..dd5aa62 100644
--- a/av1/decoder/decodetxb.c
+++ b/av1/decoder/decodetxb.c
@@ -61,17 +61,17 @@
static INLINE void read_coeffs_reverse_2d(aom_reader *r, TX_SIZE tx_size,
int start_si, int end_si,
- const int16_t *scan, int bwl,
+ const int16_t *scan, int bhl,
uint8_t *levels,
base_cdf_arr base_cdf,
br_cdf_arr br_cdf) {
for (int c = end_si; c >= start_si; --c) {
const int pos = scan[c];
- const int coeff_ctx = get_lower_levels_ctx_2d(levels, pos, bwl, tx_size);
+ const int coeff_ctx = get_lower_levels_ctx_2d(levels, pos, bhl, tx_size);
const int nsymbs = 4;
int level = aom_read_symbol(r, base_cdf[coeff_ctx], nsymbs, ACCT_STR);
if (level > NUM_BASE_LEVELS) {
- const int br_ctx = get_br_ctx_2d(levels, pos, bwl);
+ const int br_ctx = get_br_ctx_2d(levels, pos, bhl);
aom_cdf_prob *cdf = br_cdf[br_ctx];
for (int idx = 0; idx < COEFF_BASE_RANGE; idx += BR_CDF_SIZE - 1) {
const int k = aom_read_symbol(r, cdf, BR_CDF_SIZE, ACCT_STR);
@@ -79,23 +79,23 @@
if (k < BR_CDF_SIZE - 1) break;
}
}
- levels[get_padded_idx(pos, bwl)] = level;
+ levels[get_padded_idx(pos, bhl)] = level;
}
}
static INLINE void read_coeffs_reverse(aom_reader *r, TX_SIZE tx_size,
TX_CLASS tx_class, int start_si,
- int end_si, const int16_t *scan, int bwl,
+ int end_si, const int16_t *scan, int bhl,
uint8_t *levels, base_cdf_arr base_cdf,
br_cdf_arr br_cdf) {
for (int c = end_si; c >= start_si; --c) {
const int pos = scan[c];
const int coeff_ctx =
- get_lower_levels_ctx(levels, pos, bwl, tx_size, tx_class);
+ get_lower_levels_ctx(levels, pos, bhl, tx_size, tx_class);
const int nsymbs = 4;
int level = aom_read_symbol(r, base_cdf[coeff_ctx], nsymbs, ACCT_STR);
if (level > NUM_BASE_LEVELS) {
- const int br_ctx = get_br_ctx(levels, pos, bwl, tx_class);
+ const int br_ctx = get_br_ctx(levels, pos, bhl, tx_class);
aom_cdf_prob *cdf = br_cdf[br_ctx];
for (int idx = 0; idx < COEFF_BASE_RANGE; idx += BR_CDF_SIZE - 1) {
const int k = aom_read_symbol(r, cdf, BR_CDF_SIZE, ACCT_STR);
@@ -103,7 +103,7 @@
if (k < BR_CDF_SIZE - 1) break;
}
}
- levels[get_padded_idx(pos, bwl)] = level;
+ levels[get_padded_idx(pos, bhl)] = level;
}
}
@@ -123,13 +123,13 @@
const int16_t *const dequant = pd->seg_dequant_QTX[mbmi->segment_id];
tran_low_t *const tcoeffs = dcb->dqcoeff_block[plane] + dcb->cb_offset[plane];
const int shift = av1_get_tx_scale(tx_size);
- const int bwl = get_txb_bwl(tx_size);
+ const int bhl = get_txb_bhl(tx_size);
const int width = get_txb_wide(tx_size);
const int height = get_txb_high(tx_size);
int cul_level = 0;
int dc_val = 0;
uint8_t levels_buf[TX_PAD_2D];
- uint8_t *const levels = set_levels(levels_buf, width);
+ uint8_t *const levels = set_levels(levels_buf, height);
const int all_zero = aom_read_symbol(
r, ec_ctx->txb_skip_cdf[txs_ctx][txb_ctx->txb_skip_ctx], 2, ACCT_STR);
eob_info *eob_data = dcb->eob_data[plane] + dcb->txb_offset[plane];
@@ -238,7 +238,7 @@
if (*eob > 1) {
memset(levels_buf, 0,
sizeof(*levels_buf) *
- ((width + TX_PAD_HOR) * (height + TX_PAD_VER) + TX_PAD_END));
+ ((height + TX_PAD_HOR) * (width + TX_PAD_VER) + TX_PAD_END));
}
{
@@ -246,13 +246,13 @@
// TODO(angiebird): Put this into a function
const int c = *eob - 1;
const int pos = scan[c];
- const int coeff_ctx = get_lower_levels_ctx_eob(bwl, height, c);
+ const int coeff_ctx = get_lower_levels_ctx_eob(bhl, width, c);
const int nsymbs = 3;
aom_cdf_prob *cdf =
ec_ctx->coeff_base_eob_cdf[txs_ctx][plane_type][coeff_ctx];
int level = aom_read_symbol(r, cdf, nsymbs, ACCT_STR) + 1;
if (level > NUM_BASE_LEVELS) {
- const int br_ctx = get_br_ctx_eob(pos, bwl, tx_class);
+ const int br_ctx = get_br_ctx_eob(pos, bhl, tx_class);
cdf = ec_ctx->coeff_br_cdf[AOMMIN(txs_ctx, TX_32X32)][plane_type][br_ctx];
for (int idx = 0; idx < COEFF_BASE_RANGE; idx += BR_CDF_SIZE - 1) {
const int k = aom_read_symbol(r, cdf, BR_CDF_SIZE, ACCT_STR);
@@ -260,19 +260,19 @@
if (k < BR_CDF_SIZE - 1) break;
}
}
- levels[get_padded_idx(pos, bwl)] = level;
+ levels[get_padded_idx(pos, bhl)] = level;
}
if (*eob > 1) {
base_cdf_arr base_cdf = ec_ctx->coeff_base_cdf[txs_ctx][plane_type];
br_cdf_arr br_cdf =
ec_ctx->coeff_br_cdf[AOMMIN(txs_ctx, TX_32X32)][plane_type];
if (tx_class == TX_CLASS_2D) {
- read_coeffs_reverse_2d(r, tx_size, 1, *eob - 1 - 1, scan, bwl, levels,
+ read_coeffs_reverse_2d(r, tx_size, 1, *eob - 1 - 1, scan, bhl, levels,
base_cdf, br_cdf);
- read_coeffs_reverse(r, tx_size, tx_class, 0, 0, scan, bwl, levels,
+ read_coeffs_reverse(r, tx_size, tx_class, 0, 0, scan, bhl, levels,
base_cdf, br_cdf);
} else {
- read_coeffs_reverse(r, tx_size, tx_class, 0, *eob - 1 - 1, scan, bwl,
+ read_coeffs_reverse(r, tx_size, tx_class, 0, *eob - 1 - 1, scan, bhl,
levels, base_cdf, br_cdf);
}
}
@@ -280,7 +280,7 @@
for (int c = 0; c < *eob; ++c) {
const int pos = scan[c];
uint8_t sign;
- tran_low_t level = levels[get_padded_idx(pos, bwl)];
+ tran_low_t level = levels[get_padded_idx(pos, bhl)];
if (level) {
*max_scan_line = AOMMAX(*max_scan_line, pos);
if (c == 0) {
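The bwl -> bhl switch mirrors the transposed coefficient layout on the decoder side: the padded levels buffer is now laid out with runs of transform-height elements (set_levels(levels_buf, height)), and the memset size swaps which dimension carries TX_PAD_HOR versus TX_PAD_VER. A scalar sketch of the padded indexing, assuming get_padded_idx keeps its existing form:

    // Sketch: TX_PAD_HOR pad bytes follow each run of (1 << bhl) levels.
    static int padded_idx_model(int pos, int bhl) {
      return pos + (pos >> bhl) * TX_PAD_HOR;
    }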
diff --git a/av1/decoder/obu.c b/av1/decoder/obu.c
index d589f00..b687cf9 100644
--- a/av1/decoder/obu.c
+++ b/av1/decoder/obu.c
@@ -396,7 +396,7 @@
cm->seq_params->subsampling_y,
(cm->seq_params->use_highbitdepth &&
(cm->seq_params->bit_depth > AOM_BITS_8)),
- 0, cm->features.byte_alignment, 0))
+ 0, cm->features.byte_alignment, 0, 0))
aom_internal_error(&pbi->error, AOM_CODEC_MEM_ERROR,
"Failed to allocate the tile list output buffer");
}
diff --git a/av1/encoder/allintra_vis.c b/av1/encoder/allintra_vis.c
index cfc3270..236b296 100644
--- a/av1/encoder/allintra_vis.c
+++ b/av1/encoder/allintra_vis.c
@@ -9,6 +9,8 @@
* PATENTS file, you can obtain it at www.aomedia.org/license/patent.
*/
+#include <assert.h>
+
#include "config/aom_config.h"
#if CONFIG_TFLITE
@@ -35,11 +37,29 @@
// "compute_num_ai_workers()".
cpi->weber_bsize = BLOCK_8X8;
- if (cpi->mb_weber_stats) return;
+ if (cpi->oxcf.enable_rate_guide_deltaq) {
+ if (cpi->mb_weber_stats && cpi->prep_rate_estimates &&
+ cpi->ext_rate_distribution)
+ return;
+ } else {
+ if (cpi->mb_weber_stats) return;
+ }
CHECK_MEM_ERROR(cm, cpi->mb_weber_stats,
aom_calloc(cpi->frame_info.mi_rows * cpi->frame_info.mi_cols,
sizeof(*cpi->mb_weber_stats)));
+
+ if (cpi->oxcf.enable_rate_guide_deltaq) {
+ CHECK_MEM_ERROR(
+ cm, cpi->prep_rate_estimates,
+ aom_calloc(cpi->frame_info.mi_rows * cpi->frame_info.mi_cols,
+ sizeof(*cpi->prep_rate_estimates)));
+
+ CHECK_MEM_ERROR(
+ cm, cpi->ext_rate_distribution,
+ aom_calloc(cpi->frame_info.mi_rows * cpi->frame_info.mi_cols,
+ sizeof(*cpi->ext_rate_distribution)));
+ }
}
static int64_t get_satd(AV1_COMP *const cpi, BLOCK_SIZE bsize, int mi_row,
@@ -197,6 +217,20 @@
return sb_wiener_var;
}
+static int rate_estimator(const tran_low_t *qcoeff, int eob, TX_SIZE tx_size) {
+ const SCAN_ORDER *const scan_order = &av1_scan_orders[tx_size][DCT_DCT];
+
+ assert((1 << num_pels_log2_lookup[txsize_to_bsize[tx_size]]) >= eob);
+ int rate_cost = 1;
+
+ for (int idx = 0; idx < eob; ++idx) {
+ int abs_level = abs(qcoeff[scan_order->scan[idx]]);
+ rate_cost += (int)(log1p(abs_level) / log(2.0)) + 1 + (abs_level > 0);
+ }
+
+ return (rate_cost << AV1_PROB_COST_SHIFT);
+}
+
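rate_estimator is a coarse entropy proxy: every scanned coefficient costs floor(log2(1 + |level|)) + 1 "bits", nonzero ones cost one more, and the total is shifted into the encoder's fixed-point rate units by AV1_PROB_COST_SHIFT. For example:

    // Worked example: levels in scan order {4, 0, -2}, eob = 3.
    //   |4|: floor(log2(5)) + 1 + 1 = 4
    //   |0|: floor(log2(1)) + 1 + 0 = 1
    //   |2|: floor(log2(3)) + 1 + 1 = 3
    // rate_cost = 1 + 4 + 1 + 3 = 9  ->  9 << AV1_PROB_COST_SHIFT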
void av1_calc_mb_wiener_var_row(AV1_COMP *const cpi, MACROBLOCK *x,
MACROBLOCKD *xd, const int mi_row,
int16_t *src_diff, tran_low_t *coeff,
@@ -216,22 +250,36 @@
const int coeff_count = block_size * block_size;
const int mb_step = mi_size_wide[bsize];
const BitDepthInfo bd_info = get_bit_depth_info(xd);
- const AV1EncRowMultiThreadInfo *const enc_row_mt = &cpi->mt_info.enc_row_mt;
- // We allocate cpi->tile_data (of size 1) when we call this function in
- // multithreaded mode, so cpi->tile_data may be a null pointer when we call
- // this function in single-threaded mode.
- AV1EncRowMultiThreadSync *const row_mt_sync =
- cpi->tile_data ? &cpi->tile_data[0].row_mt_sync : NULL;
+ const AV1EncAllIntraMultiThreadInfo *const intra_mt = &cpi->mt_info.intra_mt;
+ AV1EncRowMultiThreadSync *const intra_row_mt_sync =
+ &cpi->ppi->intra_row_mt_sync;
const int mi_cols = cm->mi_params.mi_cols;
const int mt_thread_id = mi_row / mb_step;
// TODO(chengchen): test different unit step size
const int mt_unit_step = mi_size_wide[BLOCK_64X64];
const int mt_unit_cols = (mi_cols + (mt_unit_step >> 1)) / mt_unit_step;
int mt_unit_col = 0;
+ const int is_high_bitdepth = is_cur_buf_hbd(xd);
+
+ // We use a scratch buffer to store the prediction.
+ // The stride is the max block size (128).
+ uint8_t *pred_buffer;
+ const int dst_buffer_stride = 128;
+ const int buf_width = 128;
+ const int buf_height = 128;
+ const size_t buf_size = (buf_width * buf_height * sizeof(*pred_buffer))
+ << is_high_bitdepth;
+ CHECK_MEM_ERROR(cm, pred_buffer, aom_memalign(32, buf_size));
+ uint8_t *dst_buffer = pred_buffer;
+ if (is_high_bitdepth) {
+ uint16_t *pred_buffer_16 = (uint16_t *)pred_buffer;
+ dst_buffer = CONVERT_TO_BYTEPTR(pred_buffer_16);
+ }
for (int mi_col = 0; mi_col < mi_cols; mi_col += mb_step) {
if (mi_col % mt_unit_step == 0) {
- enc_row_mt->sync_read_ptr(row_mt_sync, mt_thread_id, mt_unit_col);
+ intra_mt->intra_sync_read_ptr(intra_row_mt_sync, mt_thread_id,
+ mt_unit_col);
}
PREDICTION_MODE best_mode = DC_PRED;
@@ -241,24 +289,32 @@
set_mode_info_offsets(&cpi->common.mi_params, &cpi->mbmi_ext_info, x, xd,
mi_row, mi_col);
set_mi_row_col(xd, &xd->tile, mi_row, mi_height, mi_col, mi_width,
- cm->mi_params.mi_rows, cm->mi_params.mi_cols);
+ AOMMIN(mi_row + mi_height, cm->mi_params.mi_rows),
+ AOMMIN(mi_col + mi_width, cm->mi_params.mi_cols));
set_plane_n4(xd, mi_size_wide[bsize], mi_size_high[bsize],
av1_num_planes(cm));
xd->mi[0]->bsize = bsize;
xd->mi[0]->motion_mode = SIMPLE_TRANSLATION;
- av1_setup_dst_planes(xd->plane, bsize, &cm->cur_frame->buf, mi_row, mi_col,
- 0, av1_num_planes(cm));
- int dst_buffer_stride = xd->plane[0].dst.stride;
- uint8_t *dst_buffer = xd->plane[0].dst.buf;
+ // Set above and left mbmi to NULL as they are not available in the
+ // preprocessing stage.
+ // They are used to determine intra edge filter types in intra prediction.
+ if (xd->up_available) {
+ xd->above_mbmi = NULL;
+ }
+ if (xd->left_available) {
+ xd->left_mbmi = NULL;
+ }
uint8_t *mb_buffer =
buffer + mi_row * MI_SIZE * buf_stride + mi_col * MI_SIZE;
for (PREDICTION_MODE mode = INTRA_MODE_START; mode < INTRA_MODE_END;
++mode) {
- av1_predict_intra_block(xd, cm->seq_params->sb_size,
- cm->seq_params->enable_intra_edge_filter,
- block_size, block_size, tx_size, mode, 0, 0,
- FILTER_INTRA_MODES, dst_buffer, dst_buffer_stride,
- dst_buffer, dst_buffer_stride, 0, 0, 0);
+ // TODO(chengchen): Here we use src instead of the reconstructed frame as
+ // the intra predictor to make the single- and multi-threaded versions
+ // match. Ideally we want to use the reconstructed frame.
+ av1_predict_intra_block(
+ xd, cm->seq_params->sb_size, cm->seq_params->enable_intra_edge_filter,
+ block_size, block_size, tx_size, mode, 0, 0, FILTER_INTRA_MODES,
+ mb_buffer, buf_stride, dst_buffer, dst_buffer_stride, 0, 0, 0);
av1_subtract_block(bd_info, block_size, block_size, src_diff, block_size,
mb_buffer, buf_stride, dst_buffer, dst_buffer_stride);
av1_quick_txfm(0, tx_size, bd_info, src_diff, block_size, coeff);
@@ -272,7 +328,7 @@
av1_predict_intra_block(
xd, cm->seq_params->sb_size, cm->seq_params->enable_intra_edge_filter,
block_size, block_size, tx_size, best_mode, 0, 0, FILTER_INTRA_MODES,
- dst_buffer, dst_buffer_stride, dst_buffer, dst_buffer_stride, 0, 0, 0);
+ mb_buffer, buf_stride, dst_buffer, dst_buffer_stride, 0, 0, 0);
av1_subtract_block(bd_info, block_size, block_size, src_diff, block_size,
mb_buffer, buf_stride, dst_buffer, dst_buffer_stride);
av1_quick_txfm(0, tx_size, bd_info, src_diff, block_size, coeff);
@@ -295,6 +351,13 @@
av1_quantize_fp_facade(coeff, pix_num, p, qcoeff, dqcoeff, &eob, scan_order,
&quant_param);
#endif // CONFIG_AV1_HIGHBITDEPTH
+
+ if (cpi->oxcf.enable_rate_guide_deltaq) {
+ const int rate_cost = rate_estimator(qcoeff, eob, tx_size);
+ cpi->prep_rate_estimates[(mi_row / mb_step) * cpi->frame_info.mi_cols +
+ (mi_col / mb_step)] = rate_cost;
+ }
+
av1_inverse_transform_block(xd, dqcoeff, 0, DCT_DCT, tx_size, dst_buffer,
dst_buffer_stride, eob, 0);
WeberStats *weber_stats =
@@ -364,13 +427,14 @@
if ((mi_col + mb_step) % mt_unit_step == 0 ||
(mi_col + mb_step) >= mi_cols) {
- enc_row_mt->sync_write_ptr(row_mt_sync, mt_thread_id, mt_unit_col,
- mt_unit_cols);
+ intra_mt->intra_sync_write_ptr(intra_row_mt_sync, mt_thread_id,
+ mt_unit_col, mt_unit_cols);
++mt_unit_col;
}
}
// Set the pointer to null since mbmi is only allocated inside this function.
xd->mi = NULL;
+ aom_free(pred_buffer);
}
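Condensed, the scratch-buffer lifetime introduced above is: allocate one 128x128 prediction buffer per row pass (doubled in bytes for high bit depth), wrap it with CONVERT_TO_BYTEPTR when the pixels are 16-bit, predict into it instead of the frame's reconstruction planes, and free it on exit:

    // Sketch of the pattern (fragment; cm/xd as in the function above):
    const int is_hbd = is_cur_buf_hbd(xd);
    uint8_t *pred;
    const size_t sz = (128 * 128 * sizeof(*pred)) << is_hbd;
    CHECK_MEM_ERROR(cm, pred, aom_memalign(32, sz));
    uint8_t *dst = is_hbd ? CONVERT_TO_BYTEPTR((uint16_t *)pred) : pred;
    // ... av1_predict_intra_block(..., dst, /*stride=*/128, ...) ...
    aom_free(pred);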
static void calc_mb_wiener_var(AV1_COMP *const cpi, double *sum_rec_distortion,
@@ -440,6 +504,57 @@
}
}
+static void ext_rate_guided_quantization(AV1_COMP *cpi) {
+ // Calculation uses 8x8.
+ const int mb_step = mi_size_wide[cpi->weber_bsize];
+ // Accumulate to 16x16, step size is in the unit of mi.
+ const int block_step = 4;
+
+ const char *filename = cpi->oxcf.rate_distribution_info;
+ FILE *pfile = fopen(filename, "r");
+ if (pfile == NULL) {
+ assert(pfile != NULL);
+ return;
+ }
+
+ double ext_rate_sum = 0.0;
+ for (int row = 0; row < cpi->frame_info.mi_rows; row += block_step) {
+ for (int col = 0; col < cpi->frame_info.mi_cols; col += block_step) {
+ float val;
+ const int fields_converted = fscanf(pfile, "%f", &val);
+ if (fields_converted != 1) {
+ assert(fields_converted == 1);
+ fclose(pfile);
+ return;
+ }
+ ext_rate_sum += val;
+ cpi->ext_rate_distribution[(row / mb_step) * cpi->frame_info.mi_cols +
+ (col / mb_step)] = val;
+ }
+ }
+ fclose(pfile);
+
+ int uniform_rate_sum = 0;
+ for (int row = 0; row < cpi->frame_info.mi_rows; row += block_step) {
+ for (int col = 0; col < cpi->frame_info.mi_cols; col += block_step) {
+ int rate_sum = 0;
+ for (int r = 0; r < block_step; r += mb_step) {
+ for (int c = 0; c < block_step; c += mb_step) {
+ const int mi_row = row + r;
+ const int mi_col = col + c;
+ rate_sum += cpi->prep_rate_estimates[(mi_row / mb_step) *
+ cpi->frame_info.mi_cols +
+ (mi_col / mb_step)];
+ }
+ }
+ uniform_rate_sum += rate_sum;
+ }
+ }
+
+ const double scale = uniform_rate_sum / ext_rate_sum;
+ cpi->ext_rate_scale = scale;
+}
+
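ext_rate_guided_quantization reads one float per 16x16 block (block_step is 4 mi units) in raster order from the file named by rate_distribution_info, then sets ext_rate_scale = sum(prep_rate_estimates) / sum(external rates) so the external numbers are comparable to the encoder's own uniform-Q estimates. A hypothetical generator for such a file (write_rate_info, rates, rows16, and cols16 are illustrative names, not libaom API):

    #include <stdio.h>

    static int write_rate_info(const char *path, const float *rates,
                               int rows16, int cols16) {
      FILE *f = fopen(path, "w");
      if (!f) return -1;
      for (int r = 0; r < rows16; ++r) {
        for (int c = 0; c < cols16; ++c)
          fprintf(f, "%f ", rates[r * cols16 + c]);
        fprintf(f, "\n");
      }
      return fclose(f);
    }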
void av1_set_mb_wiener_variance(AV1_COMP *cpi) {
AV1_COMMON *const cm = &cpi->common;
const SequenceHeader *const seq_params = cm->seq_params;
@@ -447,7 +562,7 @@
&cm->cur_frame->buf, cm->width, cm->height, seq_params->subsampling_x,
seq_params->subsampling_y, seq_params->use_highbitdepth,
cpi->oxcf.border_in_pixels, cm->features.byte_alignment, NULL, NULL,
- NULL, cpi->oxcf.tool_cfg.enable_global_motion, 0))
+ NULL, cpi->image_pyramid_levels, 0))
aom_internal_error(cm->error, AOM_CODEC_MEM_ERROR,
"Failed to allocate frame buffer");
cpi->norm_wiener_variance = 0;
@@ -468,15 +583,16 @@
MultiThreadInfo *const mt_info = &cpi->mt_info;
const int num_workers =
AOMMIN(mt_info->num_mod_workers[MOD_AI], mt_info->num_workers);
- AV1EncRowMultiThreadInfo *const enc_row_mt = &mt_info->enc_row_mt;
- enc_row_mt->sync_read_ptr = av1_row_mt_sync_read_dummy;
- enc_row_mt->sync_write_ptr = av1_row_mt_sync_write_dummy;
+ AV1EncAllIntraMultiThreadInfo *const intra_mt = &mt_info->intra_mt;
+ intra_mt->intra_sync_read_ptr = av1_row_mt_sync_read_dummy;
+ intra_mt->intra_sync_write_ptr = av1_row_mt_sync_write_dummy;
// Calculate differential contrast for each block for the entire image.
- // TODO(aomedia:3376): Remove " && 0" when there are no data races in
- // av1_calc_mb_wiener_var_mt(). See also bug aomedia:3380.
- if (num_workers > 1 && 0) {
- enc_row_mt->sync_read_ptr = av1_row_mt_sync_read;
- enc_row_mt->sync_write_ptr = av1_row_mt_sync_write;
+ // TODO(chengchen): properly accumulate the distortion and rate in
+ // av1_calc_mb_wiener_var_mt(). Until then, call calc_mb_wiener_var() if
+ // auto_intra_tools_off is true.
+ if (num_workers > 1 && !cpi->oxcf.intra_mode_cfg.auto_intra_tools_off) {
+ intra_mt->intra_sync_read_ptr = av1_row_mt_sync_read;
+ intra_mt->intra_sync_write_ptr = av1_row_mt_sync_write;
av1_calc_mb_wiener_var_mt(cpi, num_workers, &sum_rec_distortion,
&sum_est_rate);
} else {
@@ -486,6 +602,9 @@
// Determine whether to turn off several intra coding tools.
automatic_intra_tools_off(cpi, sum_rec_distortion, sum_est_rate);
+ // Read external rate distribution and use it to guide delta quantization
+ if (cpi->oxcf.enable_rate_guide_deltaq) ext_rate_guided_quantization(cpi);
+
const BLOCK_SIZE norm_block_size = cm->seq_params->sb_size;
cpi->norm_wiener_variance = estimate_wiener_var_norm(cpi, norm_block_size);
const int norm_step = mi_size_wide[norm_block_size];
@@ -530,8 +649,67 @@
aom_free_frame_buffer(&cm->cur_frame->buf);
}
+static int get_rate_guided_quantizer(AV1_COMP *const cpi, BLOCK_SIZE bsize,
+ int mi_row, int mi_col) {
+ // Calculation uses 8x8.
+ const int mb_step = mi_size_wide[cpi->weber_bsize];
+ // Accumulate to 16x16
+ const int block_step = mi_size_wide[BLOCK_16X16];
+ double sb_rate_hific = 0.0;
+ double sb_rate_uniform = 0.0;
+ for (int row = mi_row; row < mi_row + mi_size_wide[bsize];
+ row += block_step) {
+ for (int col = mi_col; col < mi_col + mi_size_high[bsize];
+ col += block_step) {
+ sb_rate_hific +=
+ cpi->ext_rate_distribution[(row / mb_step) * cpi->frame_info.mi_cols +
+ (col / mb_step)];
+
+ for (int r = 0; r < block_step; r += mb_step) {
+ for (int c = 0; c < block_step; c += mb_step) {
+ const int this_row = row + r;
+ const int this_col = col + c;
+ sb_rate_uniform +=
+ cpi->prep_rate_estimates[(this_row / mb_step) *
+ cpi->frame_info.mi_cols +
+ (this_col / mb_step)];
+ }
+ }
+ }
+ }
+ sb_rate_hific *= cpi->ext_rate_scale;
+
+ const double weight = 1.0;
+ const double rate_diff =
+ weight * (sb_rate_hific - sb_rate_uniform) / sb_rate_uniform;
+ double scale = pow(2, rate_diff);
+
+ scale = scale * scale;
+ double min_max_scale = AOMMAX(1.0, get_max_scale(cpi, bsize, mi_row, mi_col));
+ scale = 1.0 / AOMMIN(1.0 / scale, min_max_scale);
+
+ AV1_COMMON *const cm = &cpi->common;
+ const int base_qindex = cm->quant_params.base_qindex;
+ int offset =
+ av1_get_deltaq_offset(cm->seq_params->bit_depth, base_qindex, scale);
+ const DeltaQInfo *const delta_q_info = &cm->delta_q_info;
+ const int max_offset = delta_q_info->delta_q_res * 10;
+ offset = AOMMIN(offset, max_offset - 1);
+ offset = AOMMAX(offset, -max_offset + 1);
+ int qindex = cm->quant_params.base_qindex + offset;
+ qindex = AOMMIN(qindex, MAXQ);
+ qindex = AOMMAX(qindex, MINQ);
+ if (base_qindex > MINQ) qindex = AOMMAX(qindex, MINQ + 1);
+
+ return qindex;
+}
+
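Note: get_rate_guided_quantizer() turns the relative rate request into a
quantizer-step ratio and then a clamped qindex offset. A worked example with
hypothetical per-superblock rates (av1_get_deltaq_offset() is the existing
libaom helper; only the surrounding arithmetic is sketched here):

  #include <math.h>
  #include <stdio.h>

  int main(void) {
    const double sb_rate_hific = 1200.0;    /* scaled external estimate */
    const double sb_rate_uniform = 1000.0;  /* encoder's own estimate */
    const double rate_diff =
        (sb_rate_hific - sb_rate_uniform) / sb_rate_uniform;
    double scale = pow(2, rate_diff);
    scale = scale * scale;  /* i.e. 4^rate_diff, ~1.32 for a +20% request */
    printf("scale = %f\n", scale);
    /* av1_get_deltaq_offset() then converts this quantizer-step ratio into
     * a qindex offset, clamped to +/-(delta_q_res * 10 - 1) as above. */
    return 0;
  }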
int av1_get_sbq_perceptual_ai(AV1_COMP *const cpi, BLOCK_SIZE bsize, int mi_row,
int mi_col) {
+ if (cpi->oxcf.enable_rate_guide_deltaq) {
+ return get_rate_guided_quantizer(cpi, bsize, mi_row, mi_col);
+ }
+
AV1_COMMON *const cm = &cpi->common;
const int base_qindex = cm->quant_params.base_qindex;
int sb_wiener_var = get_var_perceptual_ai(cpi, bsize, mi_row, mi_col);
diff --git a/av1/encoder/aq_cyclicrefresh.c b/av1/encoder/aq_cyclicrefresh.c
index 616d52f..be51ba1 100644
--- a/av1/encoder/aq_cyclicrefresh.c
+++ b/av1/encoder/aq_cyclicrefresh.c
@@ -313,6 +313,7 @@
if (cr->sb_index >= sbs_in_frame) cr->sb_index = 0;
assert(cr->sb_index < sbs_in_frame);
i = cr->sb_index;
+ cr->last_sb_index = cr->sb_index;
cr->target_num_seg_blocks = 0;
do {
int sum_map = 0;
@@ -330,13 +331,22 @@
if (cr->use_block_sad_scene_det && cpi->rc.frames_since_key > 30 &&
cr->counter_encode_maxq_scene_change > 30 &&
cpi->src_sad_blk_64x64 != NULL &&
- cpi->svc.number_temporal_layers == 1 &&
cpi->svc.spatial_layer_id == cpi->svc.number_spatial_layers - 1) {
sb_sad = cpi->src_sad_blk_64x64[sb_col_index + sb_cols * sb_row_index];
int scale = (cm->width * cm->height < 640 * 360) ? 6 : 8;
int scale_low = 2;
thresh_sad = (scale * 64 * 64);
thresh_sad_low = (scale_low * 64 * 64);
+ // For temporal layers: the base temporal layer (temporal_layer_id = 0)
+ // has larger frame separation (2 or 4 frames apart), so use larger sad
+ // thresholds to compensate for larger frame sad. The larger thresholds
+ // also increase the amount of refresh, which is needed for the base
+ // temporal layer.
+ if (cpi->svc.number_temporal_layers > 1 &&
+ cpi->svc.temporal_layer_id == 0) {
+ thresh_sad <<= 4;
+ thresh_sad_low <<= 2;
+ }
}
// cr_map only needed at 8x8 blocks.
for (y = 0; y < ymis; y += 2) {
@@ -384,18 +394,23 @@
const PRIMARY_RATE_CONTROL *const p_rc = &cpi->ppi->p_rc;
const AV1_COMMON *const cm = &cpi->common;
CYCLIC_REFRESH *const cr = cpi->cyclic_refresh;
- int num4x4bl = cm->mi_params.MBs << 4;
- int target_refresh = 0;
- double weight_segment_target = 0;
- double weight_segment = 0;
- int qp_thresh = AOMMIN(20, rc->best_quality << 1);
- if (cpi->oxcf.tune_cfg.content == AOM_CONTENT_SCREEN)
- qp_thresh = AOMMIN(35, rc->best_quality << 1);
- int qp_max_thresh = 118 * MAXQ >> 7;
+ SVC *const svc = &cpi->svc;
+ const int qp_thresh = AOMMAX(16, rc->best_quality + 4);
+ const int qp_max_thresh = 118 * MAXQ >> 7;
const int scene_change_detected = is_scene_change_detected(cpi);
+ const int is_screen_content =
+ (cpi->oxcf.tune_cfg.content == AOM_CONTENT_SCREEN);
+
+ // A scene change or key frame marks the start of a cyclic refresh cycle.
+ const int frames_since_scene_change =
+ (cpi->ppi->use_svc || !is_screen_content)
+ ? cpi->rc.frames_since_key
+ : AOMMIN(cpi->rc.frames_since_key,
+ cr->counter_encode_maxq_scene_change);
// Cases to reset the cyclic refresh adjustment parameters.
- if (frame_is_intra_only(cm) || scene_change_detected) {
+ if (frame_is_intra_only(cm) || scene_change_detected ||
+ cpi->ppi->rtc_ref.bias_recovery_frame) {
// Reset adaptive elements for intra only frames and scene changes.
cr->percent_refresh_adjustment = 5;
cr->rate_ratio_qdelta_adjustment = 0.25;
@@ -414,20 +429,22 @@
// should we enable cyclic refresh on this frame.
cr->apply_cyclic_refresh = 1;
if (frame_is_intra_only(cm) || is_lossless_requested(&cpi->oxcf.rc_cfg) ||
- scene_change_detected || cpi->svc.temporal_layer_id > 0 ||
+ scene_change_detected || svc->temporal_layer_id > 0 ||
+ svc->prev_number_spatial_layers != svc->number_spatial_layers ||
p_rc->avg_frame_qindex[INTER_FRAME] < qp_thresh ||
- (cpi->svc.number_spatial_layers > 1 &&
- cpi->svc.layer_context[cpi->svc.temporal_layer_id].is_key_frame) ||
- (rc->frames_since_key > 20 &&
+ (svc->number_spatial_layers > 1 &&
+ svc->layer_context[svc->temporal_layer_id].is_key_frame) ||
+ (frames_since_scene_change > 20 &&
p_rc->avg_frame_qindex[INTER_FRAME] > qp_max_thresh) ||
(rc->avg_frame_low_motion && rc->avg_frame_low_motion < 30 &&
- rc->frames_since_key > 40)) {
+ frames_since_scene_change > 40) ||
+ cpi->ppi->rtc_ref.bias_recovery_frame) {
cr->apply_cyclic_refresh = 0;
return;
}
// Increase the amount of refresh for #temporal_layers > 2
- if (cpi->svc.number_temporal_layers > 2)
+ if (svc->number_temporal_layers > 2)
cr->percent_refresh = 15;
else
cr->percent_refresh = 10 + cr->percent_refresh_adjustment;
@@ -442,24 +459,46 @@
cr->motion_thresh = 32;
cr->rate_boost_fac =
(cpi->oxcf.tune_cfg.content == AOM_CONTENT_SCREEN) ? 10 : 15;
- // Use larger delta-qp (increase rate_ratio_qdelta) for first few (~4)
- // periods of the refresh cycle, after a key frame.
- // Account for larger interval on base layer for temporal layers.
- if (cr->percent_refresh > 0 &&
- rc->frames_since_key <
- (4 * cpi->svc.number_temporal_layers) * (100 / cr->percent_refresh)) {
- cr->rate_ratio_qdelta = 3.0 + cr->rate_ratio_qdelta_adjustment;
+
+ // Use larger delta-qp (increase rate_ratio_qdelta) for first few
+ // refresh cycles after a key frame (svc) or scene change (non svc).
+ // For non svc screen content, after a scene change gradually reduce
+  // this boost and suppress it further if either of the previous two
+ // frames overshot.
+ if (cr->percent_refresh > 0) {
+ if (cpi->ppi->use_svc || !is_screen_content) {
+ if (frames_since_scene_change <
+ ((4 * svc->number_temporal_layers) * (100 / cr->percent_refresh))) {
+ cr->rate_ratio_qdelta = 3.0 + cr->rate_ratio_qdelta_adjustment;
+ } else {
+ cr->rate_ratio_qdelta = 2.25 + cr->rate_ratio_qdelta_adjustment;
+ }
+ } else {
+ double distance_from_sc_factor =
+ AOMMIN(0.75, (int)(frames_since_scene_change / 10) * 0.1);
+ cr->rate_ratio_qdelta =
+ 3.0 + cr->rate_ratio_qdelta_adjustment - distance_from_sc_factor;
+ if ((frames_since_scene_change < 10) &&
+ ((cpi->rc.rc_1_frame < 0) || (cpi->rc.rc_2_frame < 0))) {
+ cr->rate_ratio_qdelta -= 0.25;
+ }
+ }
} else {
cr->rate_ratio_qdelta = 2.25 + cr->rate_ratio_qdelta_adjustment;
}
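Note: the non-svc screen-content branch above decays the boost in 0.1 steps per
10 frames and bottoms out at 2.25 plus the adjustment, matching the steady-state
value on the other path. A small sketch of that decay (hypothetical helper, not
part of the patch):

  /* Decay of the qdelta boost after a scene change; the integer division
   * steps the factor down 0.1 every 10 frames, capped at 0.75. */
  double boost_after_scene_change(int frames_since_scene_change,
                                  double adjustment) {
    double factor = (frames_since_scene_change / 10) * 0.1;
    if (factor > 0.75) factor = 0.75;
    return 3.0 + adjustment - factor;  /* decays from 3.0 toward 2.25 */
  }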
// Adjust some parameters for low resolutions.
if (cm->width * cm->height <= 352 * 288) {
- if (rc->avg_frame_bandwidth < 3000) {
- cr->motion_thresh = 16;
+ if (cpi->svc.number_temporal_layers > 1) {
+ cr->motion_thresh = 32;
cr->rate_boost_fac = 13;
} else {
- cr->max_qdelta_perc = 50;
- cr->rate_ratio_qdelta = AOMMAX(cr->rate_ratio_qdelta, 2.0);
+ if (rc->avg_frame_bandwidth < 3000) {
+ cr->motion_thresh = 16;
+ cr->rate_boost_fac = 13;
+ } else {
+ cr->max_qdelta_perc = 50;
+ cr->rate_ratio_qdelta = AOMMAX(cr->rate_ratio_qdelta, 2.0);
+ }
}
}
if (cpi->oxcf.rc_cfg.mode == AOM_VBR) {
@@ -474,25 +513,10 @@
cr->rate_ratio_qdelta = 1.0;
}
}
- // Weight for segment prior to encoding: take the average of the target
- // number for the frame to be encoded and the actual from the previous frame.
- // Use the target if its less. To be used for setting the base qp for the
- // frame in av1_rc_regulate_q.
- target_refresh =
- cr->percent_refresh * cm->mi_params.mi_rows * cm->mi_params.mi_cols / 100;
- weight_segment_target = (double)(target_refresh) / num4x4bl;
- weight_segment = (double)((target_refresh + cr->actual_num_seg1_blocks +
- cr->actual_num_seg2_blocks) >>
- 1) /
- num4x4bl;
- if (weight_segment_target < 7 * weight_segment / 8)
- weight_segment = weight_segment_target;
- cr->weight_segment = weight_segment;
if (rc->rtc_external_ratectrl) {
cr->actual_num_seg1_blocks = cr->percent_refresh * cm->mi_params.mi_rows *
cm->mi_params.mi_cols / 100;
cr->actual_num_seg2_blocks = 0;
- cr->weight_segment = (double)(cr->actual_num_seg1_blocks) / num4x4bl;
}
}
@@ -508,9 +532,13 @@
const int layer_depth = AOMMIN(gf_group->layer_depth[cpi->gf_frame_index], 6);
const FRAME_TYPE frame_type = cm->current_frame.frame_type;
+ // Set resolution_change flag: for svc only set it when the
+ // number of spatial layers has not changed.
const int resolution_change =
- cm->prev_frame && (cm->width != cm->prev_frame->width ||
- cm->height != cm->prev_frame->height);
+ cm->prev_frame &&
+ (cm->width != cm->prev_frame->width ||
+ cm->height != cm->prev_frame->height) &&
+ cpi->svc.prev_number_spatial_layers == cpi->svc.number_spatial_layers;
if (resolution_change) av1_cyclic_refresh_reset_resize(cpi);
if (!cr->apply_cyclic_refresh) {
@@ -518,9 +546,13 @@
unsigned char *const seg_map = cpi->enc_seg.map;
memset(seg_map, 0, cm->mi_params.mi_rows * cm->mi_params.mi_cols);
av1_disable_segmentation(&cm->seg);
- if (cm->current_frame.frame_type == KEY_FRAME || scene_change_detected) {
+ if (frame_is_intra_only(cm) || scene_change_detected ||
+ cpi->ppi->rtc_ref.bias_recovery_frame) {
cr->sb_index = 0;
+ cr->last_sb_index = 0;
cr->counter_encode_maxq_scene_change = 0;
+ cr->actual_num_seg1_blocks = 0;
+ cr->actual_num_seg2_blocks = 0;
}
return;
} else {
@@ -600,6 +632,7 @@
CYCLIC_REFRESH *const cr = cpi->cyclic_refresh;
memset(cr->map, 0, cm->mi_params.mi_rows * cm->mi_params.mi_cols);
cr->sb_index = 0;
+ cr->last_sb_index = 0;
cpi->refresh_frame.golden_frame = true;
cr->apply_cyclic_refresh = 0;
cr->counter_encode_maxq_scene_change = 0;
@@ -610,6 +643,7 @@
int av1_cyclic_refresh_disable_lf_cdef(AV1_COMP *const cpi) {
CYCLIC_REFRESH *const cr = cpi->cyclic_refresh;
 // TODO(marpan): Tune these conditions, add QP dependence.
+ if (cpi->sf.rt_sf.skip_lf_screen > 1 && !cpi->rc.high_source_sad) return 1;
if (cpi->rc.frames_since_key > 30 && cr->percent_refresh > 0 &&
cr->counter_encode_maxq_scene_change > 300 / cr->percent_refresh &&
cpi->rc.frame_source_sad < 1000)
diff --git a/av1/encoder/aq_cyclicrefresh.h b/av1/encoder/aq_cyclicrefresh.h
index 3353c5a..10974f0 100644
--- a/av1/encoder/aq_cyclicrefresh.h
+++ b/av1/encoder/aq_cyclicrefresh.h
@@ -54,6 +54,10 @@
*/
int sb_index;
/*!
+ * Superblock index of the cyclic refresh scan from the last frame.
+ */
+ int last_sb_index;
+ /*!
* Controls how long block will need to wait to be refreshed again, in
* excess of the cycle time, i.e., in the case of all zero motion, block
* will be refreshed every (100/percent_refresh + time_for_refresh) frames.
@@ -113,7 +117,6 @@
/*!\cond */
int qindex_delta[3];
- double weight_segment;
int apply_cyclic_refresh;
int skip_over4x4;
int counter_encode_maxq_scene_change;
@@ -226,7 +229,7 @@
/*!\brief Initialize counters used for cyclic refresh.
*
- * Initializes cyclic refresh counters cnt_zeromv, actual_num_seg1_blocks and
+ * Initializes cyclic refresh counters actual_num_seg1_blocks and
* actual_num_seg2_blocks.
*
* \ingroup cyclic_refresh
@@ -235,14 +238,14 @@
*
* \param[in] x Pointer to MACROBLOCK structure
*
- * \remark Update the \c x->cnt_zeromv, the \c x->actual_num_seg1_blocks and
- * the \c x->actual_num_seg1_blocks.
+ * \remark Update the \c x->actual_num_seg1_blocks and the
+ * \c x->actual_num_seg2_blocks.
*/
void av1_init_cyclic_refresh_counters(MACROBLOCK *const x);
/*!\brief Accumulate cyclic refresh counters.
*
- * Accumulates cyclic refresh counters cnt_zeromv, actual_num_seg1_blocks and
+ * Accumulates cyclic refresh counters actual_num_seg1_blocks and
 * actual_num_seg2_blocks from MACROBLOCK structure to CYCLIC_REFRESH structure.
*
* \ingroup cyclic_refresh
@@ -252,9 +255,8 @@
* \param[in] cyclic_refresh Pointer to CYCLIC_REFRESH structure
* \param[in] x Pointer to MACROBLOCK structure
*
- * \remark Update the \c cyclic_refresh->cnt_zeromv, the \c
- * cyclic_refresh->actual_num_seg1_blocks and the \c
- * cyclic_refresh->actual_num_seg1_blocks.
+ * \remark Update the \c cyclic_refresh->actual_num_seg1_blocks and the
+ * \c cyclic_refresh->actual_num_seg2_blocks.
*/
void av1_accumulate_cyclic_refresh_counters(
CYCLIC_REFRESH *const cyclic_refresh, const MACROBLOCK *const x);
diff --git a/av1/encoder/aq_variance.c b/av1/encoder/aq_variance.c
index d53d2c9..086928a 100644
--- a/av1/encoder/aq_variance.c
+++ b/av1/encoder/aq_variance.c
@@ -118,18 +118,16 @@
for (i = 0; i < bh; i += 4) {
for (j = 0; j < bw; j += 4) {
if (is_cur_buf_hbd(xd)) {
- var +=
- log(1.0 + cpi->ppi->fn_ptr[BLOCK_4X4].vf(
- x->plane[0].src.buf + i * x->plane[0].src.stride + j,
- x->plane[0].src.stride,
- CONVERT_TO_BYTEPTR(av1_highbd_all_zeros), 0, &sse) /
- 16.0);
+ var += log1p(cpi->ppi->fn_ptr[BLOCK_4X4].vf(
+ x->plane[0].src.buf + i * x->plane[0].src.stride + j,
+ x->plane[0].src.stride,
+ CONVERT_TO_BYTEPTR(av1_highbd_all_zeros), 0, &sse) /
+ 16.0);
} else {
- var +=
- log(1.0 + cpi->ppi->fn_ptr[BLOCK_4X4].vf(
- x->plane[0].src.buf + i * x->plane[0].src.stride + j,
- x->plane[0].src.stride, av1_all_zeros, 0, &sse) /
- 16.0);
+ var += log1p(cpi->ppi->fn_ptr[BLOCK_4X4].vf(
+ x->plane[0].src.buf + i * x->plane[0].src.stride + j,
+ x->plane[0].src.stride, av1_all_zeros, 0, &sse) /
+ 16.0);
}
}
}
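Note: the log(1.0 + x) -> log1p(x) rewrites are behavior-preserving and slightly
more accurate for small x, since log1p() avoids the rounding in forming 1.0 + x.
A standalone illustration:

  #include <math.h>
  #include <stdio.h>

  int main(void) {
    const double x = 1e-17;
    /* 1.0 + x rounds to exactly 1.0 in double, so log() returns 0. */
    printf("log(1+x) = %g\n", log(1.0 + x));
    /* log1p() keeps the low-order bits: result is ~1e-17. */
    printf("log1p(x) = %g\n", log1p(x));
    return 0;
  }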
@@ -184,9 +182,9 @@
return (unsigned int)((uint64_t)var * 256) >> num_pels_log2_lookup[bs];
}
-double av1_log_block_wavelet_energy(MACROBLOCK *x, BLOCK_SIZE bs) {
+static double log_block_wavelet_energy(MACROBLOCK *x, BLOCK_SIZE bs) {
unsigned int haar_sad = haar_ac_energy(x, bs);
- return log(haar_sad + 1.0);
+ return log1p(haar_sad);
}
int av1_block_wavelet_energy_level(const AV1_COMP *cpi, MACROBLOCK *x,
@@ -195,7 +193,7 @@
energy_midpoint = (is_stat_consumption_stage_twopass(cpi))
? cpi->twopass_frame.frame_avg_haar_energy
: DEFAULT_E_MIDPOINT;
- energy = av1_log_block_wavelet_energy(x, bs) - energy_midpoint;
+ energy = log_block_wavelet_energy(x, bs) - energy_midpoint;
return clamp((int)round(energy), ENERGY_MIN, ENERGY_MAX);
}
diff --git a/av1/encoder/arm/crc32/hash_crc32.c b/av1/encoder/arm/crc32/hash_crc32.c
index dd8685d..771496c 100644
--- a/av1/encoder/arm/crc32/hash_crc32.c
+++ b/av1/encoder/arm/crc32/hash_crc32.c
@@ -13,6 +13,8 @@
#include <stddef.h>
#include <arm_acle.h>
+#include "config/aom_config.h"
+
#define CRC_LOOP(op, crc, type, buf, len) \
while ((len) >= sizeof(type)) { \
(crc) = op((crc), *(type *)(buf)); \
@@ -37,7 +39,7 @@
const uint8_t *buf = p;
uint32_t crc = 0xFFFFFFFF;
-#if !defined(__aarch64__)
+#if !AOM_ARCH_AARCH64
// Align input to 8-byte boundary (only necessary for 32-bit builds.)
while (len && ((uintptr_t)buf & 7)) {
crc = __crc32cb(crc, *buf++);
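Note: replacing defined(__aarch64__) with AOM_ARCH_AARCH64 moves the
architecture test into the generated config header; presumably this keeps
compilers that spell the target differently (for example MSVC's arm64 target,
which defines _M_ARM64 rather than __aarch64__) on the 64-bit path. The guard
pattern, assuming the header defines the macro to 0 or 1:

  #include "config/aom_config.h"  /* generated; defines AOM_ARCH_AARCH64 */

  #if AOM_ARCH_AARCH64
  /* 64-bit Arm path: 8-byte CRC ops, across-vector reductions. */
  #else
  /* 32-bit Arm fallback: align the buffer, use narrower operations. */
  #endif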
diff --git a/av1/encoder/arm/neon/av1_error_neon.c b/av1/encoder/arm/neon/av1_error_neon.c
index 124c1fd..7d24c7d 100644
--- a/av1/encoder/arm/neon/av1_error_neon.c
+++ b/av1/encoder/arm/neon/av1_error_neon.c
@@ -11,6 +11,8 @@
#include <arm_neon.h>
#include <assert.h>
+#include "config/aom_config.h"
+
#include "aom_dsp/aom_dsp_common.h"
#include "aom_dsp/arm/mem_neon.h"
@@ -48,7 +50,7 @@
block_size -= 8;
} while (block_size != 0);
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
*ssz = vaddvq_s64(sqcoeff);
return vaddvq_s64(error);
#else
diff --git a/av1/encoder/arm/neon/av1_fwd_txfm2d_neon.c b/av1/encoder/arm/neon/av1_fwd_txfm2d_neon.c
index 8a282b3..ee8b115 100644
--- a/av1/encoder/arm/neon/av1_fwd_txfm2d_neon.c
+++ b/av1/encoder/arm/neon/av1_fwd_txfm2d_neon.c
@@ -24,7 +24,7 @@
static INLINE void transpose_16bit_4x4(const int16x8_t *const in,
int16x8_t *const out) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
const int16x8_t a0 = vzip1q_s16(in[0], in[1]);
const int16x8_t a1 = vzip1q_s16(in[2], in[3]);
#else
@@ -45,7 +45,7 @@
static INLINE void transpose_16bit_4x8(const int16x8_t *const in,
int16x8_t *const out) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
const int16x8_t a0 = vzip1q_s16(in[0], in[1]);
const int16x8_t a1 = vzip1q_s16(in[2], in[3]);
const int16x8_t a2 = vzip1q_s16(in[4], in[5]);
@@ -67,7 +67,7 @@
const int32x4x2_t b13 =
vzipq_s32(vreinterpretq_s32_s16(a2), vreinterpretq_s32_s16(a3));
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
out[0] = vreinterpretq_s16_s64(vzip1q_s64(vreinterpretq_s64_s32(b02.val[0]),
vreinterpretq_s64_s32(b13.val[0])));
out[1] = vreinterpretq_s16_s64(vzip2q_s64(vreinterpretq_s64_s32(b02.val[0]),
@@ -100,7 +100,7 @@
const int32x4_t zeros = vdupq_n_s32(0);
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
out[0] = vreinterpretq_s16_s64(vzip1q_s64(vreinterpretq_s64_s32(b01.val[0]),
vreinterpretq_s64_s32(zeros)));
out[1] = vreinterpretq_s16_s64(vzip2q_s64(vreinterpretq_s64_s32(b01.val[0]),
@@ -149,7 +149,7 @@
const int32x4x2_t b37 = vzipq_s32(vreinterpretq_s32_s16(a26.val[1]),
vreinterpretq_s32_s16(a37.val[1]));
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
out[0] = vreinterpretq_s16_s64(vzip1q_s64(vreinterpretq_s64_s32(b04.val[0]),
vreinterpretq_s64_s32(b15.val[0])));
out[1] = vreinterpretq_s16_s64(vzip2q_s64(vreinterpretq_s64_s32(b04.val[0]),
@@ -254,6 +254,16 @@
vst1q_s32((b + 4), vmovl_s16(vget_high_s16(a)));
}
+static INLINE void store_output_32bit_w8(int32_t *const out,
+ const int32x4_t *const in1,
+ const int32x4_t *const in2,
+ const int stride, const int out_size) {
+ for (int i = 0; i < out_size; ++i) {
+ vst1q_s32(out + stride * i, in1[i]);
+ vst1q_s32(out + stride * i + 4, in2[i]);
+ }
+}
+
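Note: store_output_32bit_w8() writes each 8-wide row straight to its strided
destination, which is what lets the transpose_32_4x4x2() register shuffles be
deleted further down in this file. A hedged caller sketch (names hypothetical):

  #include <arm_neon.h>
  #include <stdint.h>

  /* After the row transform leaves row i split across bufA[i] (lanes 0..3)
   * and bufB[i] (lanes 4..7), the rows go out with plain strided stores. */
  static void store_rows_w8(int32_t *output, const int32x4_t *bufA,
                            const int32x4_t *bufB, int stride, int rows) {
    for (int i = 0; i < rows; ++i) {
      vst1q_s32(output + stride * i, bufA[i]);
      vst1q_s32(output + stride * i + 4, bufB[i]);
    }
  }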
static INLINE void store_rect_16bit_to_32bit_w4(
const int16x8_t a, int32_t *const b, const int16x4_t *v_newsqrt2,
const int32x4_t *v_newsqrt2bits) {
@@ -2329,8 +2339,7 @@
row_txfm(buf, buf, cos_bit_row, NULL);
round_shift_16bit_vector(buf0, height, &v_shift2);
- transpose_16bit_4x4(buf, buf);
- store_buffer_16bit_to_32bit_w4(buf, output, width, height);
+ store_buffer_16bit_to_32bit_w4(buf, output, height, width);
}
void av1_lowbd_fwd_txfm2d_4x8_neon(const int16_t *input, int32_t *output,
@@ -2371,8 +2380,7 @@
}
row_txfm(buf, buf, cos_bit_row, NULL);
round_shift_16bit_vector(buf0, height, &v_shift2);
- transpose_16bit_8x4(buf, buf);
- store_rect_buffer_16bit_to_32bit_w4(buf, output, width, height);
+ store_rect_buffer_16bit_to_32bit_w8(buf, output, height, width);
}
void av1_lowbd_fwd_txfm2d_4x16_neon(const int16_t *input, int32_t *output,
@@ -2415,8 +2423,7 @@
}
row_txfm(buf, buf, cos_bit_row, NULL);
round_shift_16bit_vector(buf0, height, &v_shift2);
- transpose_16bit_8x4(buf, buf);
- store_buffer_16bit_to_32bit_w4(buf, output + 8 * width * i, width, 8);
+ store_buffer_16bit_to_32bit_w8(buf, output + 8 * i, height, width);
}
}
@@ -2456,8 +2463,7 @@
}
row_txfm(buf, buf, cos_bit_row, NULL);
round_shift_16bit_vector(buf0, height, &v_shift2);
- transpose_16bit_8x8(buf, buf);
- store_rect_buffer_16bit_to_32bit_w8(buf, output, width, height);
+ store_rect_buffer_16bit_to_32bit_w4(buf, output, height, width);
}
void av1_lowbd_fwd_txfm2d_8x8_neon(const int16_t *input, int32_t *output,
@@ -2496,8 +2502,7 @@
}
row_txfm(buf, buf, cos_bit_row, NULL);
round_shift_16bit_vector(buf0, height, &v_shift2);
- transpose_16bit_8x8(buf, buf);
- store_buffer_16bit_to_32bit_w8(buf, output, width, height);
+ store_buffer_16bit_to_32bit_w8(buf, output, height, width);
}
void av1_lowbd_fwd_txfm2d_8x16_neon(const int16_t *input, int32_t *output,
@@ -2540,8 +2545,7 @@
}
row_txfm(buf, buf, cos_bit_row, NULL);
round_shift_16bit_vector(buf0, height, &v_shift2);
- transpose_16bit_8x8(buf, buf);
- store_rect_buffer_16bit_to_32bit_w8(buf, output + 8 * width * i, width, 8);
+ store_rect_buffer_16bit_to_32bit_w8(buf, output + 8 * i, height, 8);
}
}
@@ -2587,8 +2591,7 @@
}
row_txfm(buf, buf, cos_bit_row, NULL);
round_shift_16bit_vector(buf0, height, &v_shift2);
- transpose_16bit_8x8(buf, buf);
- store_buffer_16bit_to_32bit_w8(buf, output + 8 * width * i, width, 8);
+ store_buffer_16bit_to_32bit_w8(buf, output + 8 * i, height, width);
}
}
@@ -2632,10 +2635,7 @@
}
row_txfm(buf, buf, cos_bit_row, NULL);
round_shift_16bit_vector(buf0, height, &v_shift2);
- transpose_16bit_4x8(buf, buf);
- store_buffer_16bit_to_32bit_w8(buf, output, width, height);
- transpose_16bit_4x8(buf + 8, buf + 8);
- store_buffer_16bit_to_32bit_w8(buf + 8, output + 8, width, height);
+ store_buffer_16bit_to_32bit_w4(buf, output, height, width);
}
void av1_lowbd_fwd_txfm2d_16x8_neon(const int16_t *input, int32_t *output,
@@ -2678,10 +2678,7 @@
}
row_txfm(buf, buf, cos_bit_row, NULL);
round_shift_16bit_vector(buf0, height, &v_shift2);
- transpose_16bit_8x8(buf, buf);
- store_rect_buffer_16bit_to_32bit_w8(buf, output, width, height);
- transpose_16bit_8x8(buf + 8, buf + 8);
- store_rect_buffer_16bit_to_32bit_w8(buf + 8, output + 8, width, height);
+ store_rect_buffer_16bit_to_32bit_w8(buf, output, height, width);
}
void av1_lowbd_fwd_txfm2d_16x16_neon(const int16_t *input, int32_t *output,
@@ -2727,11 +2724,7 @@
}
row_txfm(buf, buf, cos_bit_row, NULL);
round_shift_16bit_vector(buf0, height, &v_shift2);
- transpose_16bit_8x8(buf, buf);
- store_buffer_16bit_to_32bit_w8(buf, output + 8 * width * i, width, 8);
- transpose_16bit_8x8(buf + 8, buf + 8);
- store_buffer_16bit_to_32bit_w8(buf + 8, output + 8 * width * i + 8, width,
- 8);
+ store_buffer_16bit_to_32bit_w8(buf, output + 8 * i, height, width);
}
}
@@ -2781,12 +2774,7 @@
}
row_txfm(buf, buf, cos_bit_row, NULL);
round_shift_16bit_vector(buf0, height, &v_shift2);
- transpose_16bit_8x8(buf, buf);
- store_rect_buffer_16bit_to_32bit_w8(buf, output + 8 * width * i, width,
- 8);
- transpose_16bit_8x8(buf + 8, buf + 8);
- store_rect_buffer_16bit_to_32bit_w8(buf + 8, output + 8 * width * i + 8,
- width, 8);
+ store_rect_buffer_16bit_to_32bit_w8(buf, output + 8 * i, height, width);
}
} else {
av1_fwd_txfm2d_16x32_c(input, output, stride, tx_type, bd);
@@ -2836,18 +2824,7 @@
}
row_txfm(buf, buf, cos_bit_row, NULL);
round_shift_16bit_vector(buf, width, &v_shift2);
- transpose_16bit_8x8(buf, buf);
- store_buffer_16bit_to_32bit_w8(buf, output + 8 * width * i, width,
- height);
- transpose_16bit_8x8(buf + 8, buf + 8);
- store_buffer_16bit_to_32bit_w8(buf + 8, output + 8 * width * i + 8, width,
- height);
- transpose_16bit_8x8(buf + 16, buf + 16);
- store_buffer_16bit_to_32bit_w8(buf + 16, output + 8 * width * i + 16,
- width, height);
- transpose_16bit_8x8(buf + 24, buf + 24);
- store_buffer_16bit_to_32bit_w8(buf + 24, output + 8 * width * i + 24,
- width, height);
+ store_buffer_16bit_to_32bit_w8(buf, output + 8 * i, height, width);
}
} else {
av1_fwd_txfm2d_32x16_c(input, output, stride, tx_type, bd);
@@ -2898,18 +2875,7 @@
}
row_txfm(buf, buf, cos_bit_row, NULL);
round_shift_16bit_vector(buf, width, &v_shift2);
- transpose_16bit_8x8(buf, buf);
- store_rect_buffer_16bit_to_32bit_w8(buf, output + 8 * width * i, width,
- 8);
- transpose_16bit_8x8(buf + 8, buf + 8);
- store_rect_buffer_16bit_to_32bit_w8(buf + 8, output + 8 * width * i + 8,
- width, 8);
- transpose_16bit_8x8(buf + 16, buf + 16);
- store_rect_buffer_16bit_to_32bit_w8(buf + 16, output + 8 * width * i + 16,
- width, 8);
- transpose_16bit_8x8(buf + 24, buf + 24);
- store_rect_buffer_16bit_to_32bit_w8(buf + 24, output + 8 * width * i + 24,
- width, 8);
+ store_rect_buffer_16bit_to_32bit_w8(buf, output + 8 * i, height, width);
}
} else {
av1_fwd_txfm2d_32x16_c(input, output, stride, tx_type, bd);
@@ -2959,17 +2925,7 @@
}
row_txfm(buf, buf, cos_bit_row, NULL);
round_shift_16bit(buf, width, shift[2]);
- transpose_16bit_8x8(buf, buf);
- store_buffer_16bit_to_32bit_w8(buf, output + 8 * width * i, width, 8);
- transpose_16bit_8x8(buf + 8, buf + 8);
- store_buffer_16bit_to_32bit_w8(buf + 8, output + 8 * width * i + 8, width,
- 8);
- transpose_16bit_8x8(buf + 16, buf + 16);
- store_buffer_16bit_to_32bit_w8(buf + 16, output + 8 * width * i + 16,
- width, 8);
- transpose_16bit_8x8(buf + 24, buf + 24);
- store_buffer_16bit_to_32bit_w8(buf + 24, output + 8 * width * i + 24,
- width, 8);
+ store_buffer_16bit_to_32bit_w8(buf, output + 8 * i, height, width);
}
} else {
av1_fwd_txfm2d_32x32_c(input, output, stride, tx_type, bd);
@@ -3009,13 +2965,10 @@
int16x8_t *buf = buf1 + width * i;
row_txfm(buf, buf, cos_bit_row, NULL);
round_shift_16bit(buf, width, shift[2]);
- int32_t *output8 = output + 8 * 32 * i;
- for (int j = 0; j < 4; ++j) {
- int16x8_t *buf8 = buf + 8 * j;
- transpose_16bit_8x8(buf8, buf8);
- store_buffer_16bit_to_32bit_w8(buf8, output8 + 8 * j, 32, 8);
- }
+ store_buffer_16bit_to_32bit_w8(buf, output + 8 * i, 16, 32);
}
+ // Zero out the bottom 16x32 area.
+ memset(output + 16 * 32, 0, 16 * 32 * sizeof(*output));
}
void av1_lowbd_fwd_txfm2d_16x64_neon(const int16_t *input, int32_t *output,
@@ -3051,15 +3004,8 @@
int16x8_t *buf = buf1 + width * i;
row_txfm(buf, buf, cos_bit_row, NULL);
round_shift_16bit(buf, width, shift[2]);
- int32_t *output8 = output + 8 * width * i;
- for (int j = 0; j < width_div8; ++j) {
- int16x8_t *buf8 = buf + 8 * j;
- transpose_16bit_8x8(buf8, buf8);
- store_buffer_16bit_to_32bit_w8(buf8, output8 + 8 * j, width, 8);
- }
+ store_buffer_16bit_to_32bit_w8(buf, output + 8 * i, 32, 16);
}
- // Zero out the bottom 16x32 area.
- memset(output + 16 * 32, 0, 16 * 32 * sizeof(*output));
}
#define TRANSPOSE_4X4_L32(x0, x1, x2, x3, y0, y1, y2, y3) \
@@ -3074,17 +3020,6 @@
y3 = y23.val[1]; \
} while (0)
-static INLINE void transpose_32_4x4x2(int stride, const int32x4_t *inputA,
- const int32x4_t *inputB,
- int32x4_t *output) {
- TRANSPOSE_4X4_L32(inputA[0], inputA[2], inputA[1], inputA[3],
- output[0 * stride], output[1 * stride], output[2 * stride],
- output[3 * stride]);
- TRANSPOSE_4X4_L32(inputB[0], inputB[2], inputB[1], inputB[3],
- output[4 * stride], output[5 * stride], output[6 * stride],
- output[7 * stride]);
-}
-
static void av1_fdct32_new_neon(int32x4_t *input, int32x4_t *output,
int cos_bit, const int stride,
const int8_t *stage_range) {
@@ -4259,11 +4194,7 @@
av1_round_shift_array_32_neon(bufA, bufA, 32);
av1_round_shift_array_32_neon(bufB, bufB, 32);
- int32_t *output8 = output + 8 * 32 * i;
- for (int j = 0; j < width_div8; ++j) {
- int32x4_t *out = (int32x4_t *)(output8 + 4 * j);
- transpose_32_4x4x2(8, bufA + 4 * j, bufB + 4 * j, out);
- }
+ store_output_32bit_w8(output + i * 8, bufA, bufB, 32, 32);
}
}
static void av1_lowbd_fwd_txfm2d_64x32_neon(const int16_t *input,
@@ -4306,11 +4237,7 @@
av1_round_shift_rect_array_32_neon(bufA, bufA, 32);
av1_round_shift_rect_array_32_neon(bufB, bufB, 32);
- int32_t *output8 = output + 8 * 32 * i;
- for (int j = 0; j < width_div8; ++j) {
- int32x4_t *out = (int32x4_t *)(output8 + 4 * j);
- transpose_32_4x4x2(8, bufA + 4 * j, bufB + 4 * j, out);
- }
+ store_output_32bit_w8(output + i * 8, bufA, bufB, 32, 32);
}
}
@@ -4356,11 +4283,7 @@
av1_round_shift_rect_array_32_neon(bufA, bufA, 32);
av1_round_shift_rect_array_32_neon(bufB, bufB, 32);
- int32_t *output8 = output + 8 * 32 * i;
- for (int j = 0; j < (32 / 4); ++j) {
- int32x4_t *out = (int32x4_t *)(output8 + 4 * j);
- transpose_32_4x4x2(8, bufA + 4 * j, bufB + 4 * j, out);
- }
+ store_output_32bit_w8(output + i * 8, bufA, bufB, 32, 32);
}
}
diff --git a/av1/encoder/arm/neon/av1_highbd_quantize_neon.c b/av1/encoder/arm/neon/av1_highbd_quantize_neon.c
index 197eae0..11d3def 100644
--- a/av1/encoder/arm/neon/av1_highbd_quantize_neon.c
+++ b/av1/encoder/arm/neon/av1_highbd_quantize_neon.c
@@ -11,6 +11,8 @@
#include <arm_neon.h>
+#include "config/aom_config.h"
+
#include "aom_dsp/arm/mem_neon.h"
#include "av1/common/quant_common.h"
@@ -65,7 +67,7 @@
}
static INLINE uint16_t get_max_eob(int16x8_t v_eobmax) {
-#ifdef __aarch64__
+#if AOM_ARCH_AARCH64
return (uint16_t)vmaxvq_s16(v_eobmax);
#else
const int16x4_t v_eobmax_3210 =
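Note: get_max_eob() reduces the eight packed eob candidates to their maximum.
On AArch64 a single across-vector instruction does it; on 32-bit Arm the
fallback folds pairwise. A sketch of both paths (assumes config/aom_config.h
is included, as elsewhere in this patch):

  #include <arm_neon.h>
  #include <stdint.h>

  static int16_t max_lane_s16(int16x8_t v) {
  #if AOM_ARCH_AARCH64
    return vmaxvq_s16(v);  /* across-vector max, one instruction */
  #else
    int16x4_t m = vmax_s16(vget_low_s16(v), vget_high_s16(v));
    m = vpmax_s16(m, m);  /* fold 4 -> 2 */
    m = vpmax_s16(m, m);  /* fold 2 -> 1 */
    return vget_lane_s16(m, 0);
  #endif
  }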
diff --git a/av1/encoder/arm/neon/av1_k_means_neon.c b/av1/encoder/arm/neon/av1_k_means_neon.c
new file mode 100644
index 0000000..d13cc65
--- /dev/null
+++ b/av1/encoder/arm/neon/av1_k_means_neon.c
@@ -0,0 +1,115 @@
+/*
+ * Copyright (c) 2023, Alliance for Open Media. All Rights Reserved.
+ *
+ * Use of this source code is governed by a BSD-style license
+ * that can be found in the LICENSE file in the root of the source
+ * tree. An additional intellectual property rights grant can be found
+ * in the file PATENTS. All contributing project authors may
+ * be found in the AUTHORS file in the root of the source tree.
+ */
+
+#include <arm_neon.h>
+
+#include "aom_dsp/arm/sum_neon.h"
+#include "config/aom_config.h"
+#include "config/aom_dsp_rtcd.h"
+
+static int32x4_t k_means_multiply_add_neon(const int16x8_t a) {
+ const int32x4_t l = vmull_s16(vget_low_s16(a), vget_low_s16(a));
+ const int32x4_t h = vmull_s16(vget_high_s16(a), vget_high_s16(a));
+#if AOM_ARCH_AARCH64
+ return vpaddq_s32(l, h);
+#else
+ const int32x2_t dl = vpadd_s32(vget_low_s32(l), vget_high_s32(l));
+ const int32x2_t dh = vpadd_s32(vget_low_s32(h), vget_high_s32(h));
+ return vcombine_s32(dl, dh);
+#endif
+}
+
+void av1_calc_indices_dim1_neon(const int16_t *data, const int16_t *centroids,
+ uint8_t *indices, int64_t *total_dist, int n,
+ int k) {
+ int64x2_t sum = vdupq_n_s64(0);
+ int16x8_t cents[PALETTE_MAX_SIZE];
+ for (int j = 0; j < k; ++j) {
+ cents[j] = vdupq_n_s16(centroids[j]);
+ }
+
+ for (int i = 0; i < n; i += 8) {
+ const int16x8_t in = vld1q_s16(data);
+ uint16x8_t ind = vdupq_n_u16(0);
+ // Compute the distance to the first centroid.
+ int16x8_t dist_min = vabdq_s16(in, cents[0]);
+
+ for (int j = 1; j < k; ++j) {
+ // Compute the distance to the centroid.
+ const int16x8_t dist = vabdq_s16(in, cents[j]);
+ // Compare to the minimal one.
+ const uint16x8_t cmp = vcgtq_s16(dist_min, dist);
+ dist_min = vminq_s16(dist_min, dist);
+ const uint16x8_t ind1 = vdupq_n_u16(j);
+ ind = vbslq_u16(cmp, ind1, ind);
+ }
+ if (total_dist) {
+ // Square, convert to 32 bit and add together.
+ const int32x4_t l =
+ vmull_s16(vget_low_s16(dist_min), vget_low_s16(dist_min));
+ const int32x4_t sum32_tmp =
+ vmlal_s16(l, vget_high_s16(dist_min), vget_high_s16(dist_min));
+ // Pairwise sum, convert to 64 bit and add to sum.
+ sum = vpadalq_s32(sum, sum32_tmp);
+ }
+ vst1_u8(indices, vmovn_u16(ind));
+ indices += 8;
+ data += 8;
+ }
+ if (total_dist) {
+ *total_dist = horizontal_add_s64x2(sum);
+ }
+}
+
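Note: a scalar reference for what the dim-1 kernel above computes may help when
reading the vector code: per sample, the index of the nearest centroid (ties
keep the lower index, matching the strict vcgtq_s16 compare), plus an optional
total squared distance. Sketch only; the NEON path additionally assumes n is a
multiple of 8:

  #include <stdint.h>
  #include <stdlib.h>

  static void calc_indices_dim1_ref(const int16_t *data,
                                    const int16_t *centroids,
                                    uint8_t *indices, int64_t *total_dist,
                                    int n, int k) {
    int64_t sum = 0;
    for (int i = 0; i < n; ++i) {
      int best = 0;
      int best_d = abs(data[i] - centroids[0]);
      for (int j = 1; j < k; ++j) {
        const int d = abs(data[i] - centroids[j]);
        if (d < best_d) {  /* strict compare keeps the lower index on ties */
          best_d = d;
          best = j;
        }
      }
      indices[i] = (uint8_t)best;
      sum += (int64_t)best_d * best_d;
    }
    if (total_dist) *total_dist = sum;
  }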
+void av1_calc_indices_dim2_neon(const int16_t *data, const int16_t *centroids,
+ uint8_t *indices, int64_t *total_dist, int n,
+ int k) {
+ int64x2_t sum = vdupq_n_s64(0);
+ uint32x4_t ind[2];
+ int16x8_t cents[PALETTE_MAX_SIZE];
+ for (int j = 0; j < k; ++j) {
+ const int16_t cx = centroids[2 * j], cy = centroids[2 * j + 1];
+ const int16_t cxcy[8] = { cx, cy, cx, cy, cx, cy, cx, cy };
+ cents[j] = vld1q_s16(cxcy);
+ }
+
+ for (int i = 0; i < n; i += 8) {
+ for (int l = 0; l < 2; ++l) {
+ const int16x8_t in = vld1q_s16(data);
+ ind[l] = vdupq_n_u32(0);
+ // Compute the distance to the first centroid.
+ int16x8_t d1 = vsubq_s16(in, cents[0]);
+ int32x4_t dist_min = k_means_multiply_add_neon(d1);
+
+ for (int j = 1; j < k; ++j) {
+ // Compute the distance to the centroid.
+ d1 = vsubq_s16(in, cents[j]);
+ const int32x4_t dist = k_means_multiply_add_neon(d1);
+ // Compare to the minimal one.
+ const uint32x4_t cmp = vcgtq_s32(dist_min, dist);
+ dist_min = vminq_s32(dist_min, dist);
+ const uint32x4_t ind1 = vdupq_n_u32(j);
+ ind[l] = vbslq_u32(cmp, ind1, ind[l]);
+ }
+ if (total_dist) {
+ // Pairwise sum, convert to 64 bit and add to sum.
+ sum = vpadalq_s32(sum, dist_min);
+ }
+ data += 8;
+ }
+ // Cast to 8 bit and store.
+ vst1_u8(indices,
+ vmovn_u16(vcombine_u16(vmovn_u32(ind[0]), vmovn_u32(ind[1]))));
+ indices += 8;
+ }
+ if (total_dist) {
+ *total_dist = horizontal_add_s64x2(sum);
+ }
+}
diff --git a/av1/encoder/arm/neon/av1_temporal_denoiser_neon.c b/av1/encoder/arm/neon/av1_temporal_denoiser_neon.c
index 3528105..18cd0ce 100644
--- a/av1/encoder/arm/neon/av1_temporal_denoiser_neon.c
+++ b/av1/encoder/arm/neon/av1_temporal_denoiser_neon.c
@@ -24,7 +24,7 @@
// Compute the sum of all pixel differences of this MB.
static INLINE int horizontal_add_s8x16(const int8x16_t v_sum_diff_total) {
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
return vaddlvq_s8(v_sum_diff_total);
#else
const int16x8_t fe_dc_ba_98_76_54_32_10 = vpaddlq_s8(v_sum_diff_total);
diff --git a/av1/encoder/arm/neon/encodetxb_neon.c b/av1/encoder/arm/neon/encodetxb_neon.c
index 9bb822a..ee93608 100644
--- a/av1/encoder/arm/neon/encodetxb_neon.c
+++ b/av1/encoder/arm/neon/encodetxb_neon.c
@@ -13,31 +13,33 @@
#include <assert.h>
#include <math.h>
+#include "config/aom_config.h"
+
#include "aom_dsp/arm/mem_neon.h"
#include "av1/common/txb_common.h"
#include "av1/encoder/encodetxb.h"
void av1_txb_init_levels_neon(const tran_low_t *const coeff, const int width,
const int height, uint8_t *const levels) {
- const int stride = width + TX_PAD_HOR;
+ const int stride = height + TX_PAD_HOR;
memset(levels - TX_PAD_TOP * stride, 0,
sizeof(*levels) * TX_PAD_TOP * stride);
- memset(levels + stride * height, 0,
+ memset(levels + stride * width, 0,
sizeof(*levels) * (TX_PAD_BOTTOM * stride + TX_PAD_END));
const int32x4_t zeros = vdupq_n_s32(0);
int i = 0;
uint8_t *ls = levels;
const tran_low_t *cf = coeff;
- if (width == 4) {
+ if (height == 4) {
do {
const int32x4_t coeffA = vld1q_s32(cf);
- const int32x4_t coeffB = vld1q_s32(cf + width);
+ const int32x4_t coeffB = vld1q_s32(cf + height);
const int16x8_t coeffAB =
vcombine_s16(vqmovn_s32(coeffA), vqmovn_s32(coeffB));
const int16x8_t absAB = vqabsq_s16(coeffAB);
const int8x8_t absABs = vqmovn_s16(absAB);
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
const int8x16_t absAB8 =
vcombine_s8(absABs, vreinterpret_s8_s32(vget_low_s32(zeros)));
const uint8x16_t lsAB =
@@ -50,10 +52,10 @@
#endif
vst1q_u8(ls, lsAB);
ls += (stride << 1);
- cf += (width << 1);
+ cf += (height << 1);
i += 2;
- } while (i < height);
- } else if (width == 8) {
+ } while (i < width);
+ } else if (height == 8) {
do {
const int32x4_t coeffA = vld1q_s32(cf);
const int32x4_t coeffB = vld1q_s32(cf + 4);
@@ -64,9 +66,9 @@
vqmovn_s16(absAB), vreinterpret_s8_s32(vget_low_s32(zeros))));
vst1q_u8(ls, absAB8);
ls += stride;
- cf += width;
+ cf += height;
i += 1;
- } while (i < height);
+ } while (i < width);
} else {
do {
int j = 0;
@@ -86,18 +88,18 @@
vst1q_u8((ls + j), absABCD);
j += 16;
cf += 16;
- } while (j < width);
- *(int32_t *)(ls + width) = 0;
+ } while (j < height);
+ *(int32_t *)(ls + height) = 0;
ls += stride;
i += 1;
- } while (i < height);
+ } while (i < width);
}
}
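Note: the width/height swaps throughout this file track a coefficient-layout
change: coefficients now arrive in transposed order, so each padded line of
levels[] spans the transform height and there is one line per column. An
indexing sketch under that reading (TX_PAD_HOR assumed to be 4, as in
txb_common.h):

  #include <stddef.h>

  #define TX_PAD_HOR 4  /* assumption for illustration */

  /* Element (row, col) of the coefficient block maps to the column-major
   * position col * stride + row in levels[]. */
  static size_t levels_index(int row, int col, int tx_height) {
    const size_t stride = (size_t)tx_height + TX_PAD_HOR;
    return (size_t)col * stride + (size_t)row;
  }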
// get_4_nz_map_contexts_2d coefficients:
static const DECLARE_ALIGNED(16, uint8_t, c_4_po_2d[2][16]) = {
{ 0, 1, 6, 6, 1, 6, 6, 21, 6, 6, 21, 21, 6, 21, 21, 21 },
- { 0, 11, 11, 11, 11, 11, 11, 11, 6, 6, 21, 21, 6, 21, 21, 21 }
+ { 0, 16, 16, 16, 16, 16, 16, 16, 6, 6, 21, 21, 6, 21, 21, 21 }
};
// get_4_nz_map_contexts_hor coefficients:
@@ -108,7 +110,7 @@
/* clang-format on */
// get_4_nz_map_contexts_hor coefficients:
-static const DECLARE_ALIGNED(16, uint8_t, c_4_po_ver[16]) = {
+static const DECLARE_ALIGNED(16, uint8_t, c_4_po_hor[16]) = {
SIG_COEF_CONTEXTS_2D + 0, SIG_COEF_CONTEXTS_2D + 0,
SIG_COEF_CONTEXTS_2D + 0, SIG_COEF_CONTEXTS_2D + 0,
SIG_COEF_CONTEXTS_2D + 5, SIG_COEF_CONTEXTS_2D + 5,
@@ -120,25 +122,25 @@
};
// get_8_coeff_contexts_2d coefficients:
-// if (height == 8)
+// if (width == 8)
static const DECLARE_ALIGNED(16, uint8_t, c_8_po_2d_8[2][16]) = {
{ 0, 1, 6, 6, 21, 21, 21, 21, 1, 6, 6, 21, 21, 21, 21, 21 },
{ 6, 6, 21, 21, 21, 21, 21, 21, 6, 21, 21, 21, 21, 21, 21, 21 }
};
-// if (height < 8)
+// if (width < 8)
static const DECLARE_ALIGNED(16, uint8_t, c_8_po_2d_l[2][16]) = {
- { 0, 16, 6, 6, 21, 21, 21, 21, 16, 16, 6, 21, 21, 21, 21, 21 },
- { 16, 16, 21, 21, 21, 21, 21, 21, 16, 16, 21, 21, 21, 21, 21, 21 }
+ { 0, 11, 6, 6, 21, 21, 21, 21, 11, 11, 6, 21, 21, 21, 21, 21 },
+ { 11, 11, 21, 21, 21, 21, 21, 21, 11, 11, 21, 21, 21, 21, 21, 21 }
};
-// if (height > 8)
+// if (width > 8)
static const DECLARE_ALIGNED(16, uint8_t, c_8_po_2d_g[2][16]) = {
- { 0, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11 },
+ { 0, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 },
{ 6, 6, 21, 21, 21, 21, 21, 21, 6, 21, 21, 21, 21, 21, 21, 21 }
};
// get_8_coeff_contexts_ver coefficients:
-static const DECLARE_ALIGNED(16, uint8_t, c_8_po_hor[16]) = {
+static const DECLARE_ALIGNED(16, uint8_t, c_8_po_ver[16]) = {
SIG_COEF_CONTEXTS_2D + 0, SIG_COEF_CONTEXTS_2D + 5,
SIG_COEF_CONTEXTS_2D + 10, SIG_COEF_CONTEXTS_2D + 10,
SIG_COEF_CONTEXTS_2D + 10, SIG_COEF_CONTEXTS_2D + 10,
@@ -158,22 +160,22 @@
{ 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21 }
};
-// real_width > real_height
+// real_width < real_height
static const DECLARE_ALIGNED(16, uint8_t, c_16_po_2d_g[3][16]) = {
- { 0, 16, 6, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21 },
- { 16, 16, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21 },
- { 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21 }
+ { 0, 11, 6, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21 },
+ { 11, 11, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21 },
+ { 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21 }
};
-// real_width < real_height
+// real_width > real_height
static const DECLARE_ALIGNED(16, uint8_t, c_16_po_2d_l[3][16]) = {
- { 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11 },
+ { 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 },
{ 6, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21 },
{ 6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21 }
};
// get_16n_coeff_contexts_ver coefficients:
-static const DECLARE_ALIGNED(16, uint8_t, c_16_po_hor[16]) = {
+static const DECLARE_ALIGNED(16, uint8_t, c_16_po_ver[16]) = {
SIG_COEF_CONTEXTS_2D + 0, SIG_COEF_CONTEXTS_2D + 5,
SIG_COEF_CONTEXTS_2D + 10, SIG_COEF_CONTEXTS_2D + 10,
SIG_COEF_CONTEXTS_2D + 10, SIG_COEF_CONTEXTS_2D + 10,
@@ -188,7 +190,7 @@
static INLINE uint8x16_t load_8bit_4x4_to_1_reg(const uint8_t *const src,
const int byte_stride) {
-#ifdef __aarch64__
+#if AOM_ARCH_AARCH64
uint32x4_t v_data = vld1q_u32((uint32_t *)src);
v_data = vld1q_lane_u32((uint32_t *)(src + 1 * byte_stride), v_data, 1);
v_data = vld1q_lane_u32((uint32_t *)(src + 2 * byte_stride), v_data, 2);
@@ -202,7 +204,7 @@
static INLINE uint8x16_t load_8bit_8x2_to_1_reg(const uint8_t *const src,
const int byte_stride) {
-#ifdef __aarch64__
+#if AOM_ARCH_AARCH64
uint64x2_t v_data = vld1q_u64((uint64_t *)src);
v_data = vld1q_lane_u64((uint64_t *)(src + 1 * byte_stride), v_data, 1);
@@ -273,22 +275,22 @@
}
static INLINE void get_4_nz_map_contexts_2d(const uint8_t *levels,
- const int height,
+ const int width,
const ptrdiff_t *const offsets,
uint8_t *const coeff_contexts) {
const int stride = 4 + TX_PAD_HOR;
const uint8x16_t pos_to_offset_large = vdupq_n_u8(21);
uint8x16_t pos_to_offset =
- vld1q_u8((height == 4) ? c_4_po_2d[0] : c_4_po_2d[1]);
+ vld1q_u8((width == 4) ? c_4_po_2d[0] : c_4_po_2d[1]);
uint8x16_t count;
uint8x16_t level[5];
uint8_t *cc = coeff_contexts;
- assert(!(height % 4));
+ assert(!(width % 4));
- int row = height;
+ int col = width;
do {
load_levels_4x4x5(levels, stride, offsets, level);
count = get_coeff_contexts_kernel(level);
@@ -297,14 +299,14 @@
pos_to_offset = pos_to_offset_large;
levels += 4 * stride;
cc += 16;
- row -= 4;
- } while (row);
+ col -= 4;
+ } while (col);
coeff_contexts[0] = 0;
}
-static INLINE void get_4_nz_map_contexts_hor(const uint8_t *levels,
- const int height,
+static INLINE void get_4_nz_map_contexts_ver(const uint8_t *levels,
+ const int width,
const ptrdiff_t *const offsets,
uint8_t *coeff_contexts) {
const int stride = 4 + TX_PAD_HOR;
@@ -315,9 +317,9 @@
uint8x16_t count;
uint8x16_t level[5];
- assert(!(height % 4));
+ assert(!(width % 4));
- int row = height;
+ int col = width;
do {
load_levels_4x4x5(levels, stride, offsets, level);
count = get_coeff_contexts_kernel(level);
@@ -325,25 +327,25 @@
vst1q_u8(coeff_contexts, count);
levels += 4 * stride;
coeff_contexts += 16;
- row -= 4;
- } while (row);
+ col -= 4;
+ } while (col);
}
-static INLINE void get_4_nz_map_contexts_ver(const uint8_t *levels,
- const int height,
+static INLINE void get_4_nz_map_contexts_hor(const uint8_t *levels,
+ const int width,
const ptrdiff_t *const offsets,
uint8_t *coeff_contexts) {
const int stride = 4 + TX_PAD_HOR;
const uint8x16_t pos_to_offset_large = vdupq_n_u8(SIG_COEF_CONTEXTS_2D + 10);
- uint8x16_t pos_to_offset = vld1q_u8(c_4_po_ver);
+ uint8x16_t pos_to_offset = vld1q_u8(c_4_po_hor);
uint8x16_t count;
uint8x16_t level[5];
- assert(!(height % 4));
+ assert(!(width % 4));
- int row = height;
+ int col = width;
do {
load_levels_4x4x5(levels, stride, offsets, level);
count = get_coeff_contexts_kernel(level);
@@ -352,12 +354,12 @@
pos_to_offset = pos_to_offset_large;
levels += 4 * stride;
coeff_contexts += 16;
- row -= 4;
- } while (row);
+ col -= 4;
+ } while (col);
}
static INLINE void get_8_coeff_contexts_2d(const uint8_t *levels,
- const int height,
+ const int width,
const ptrdiff_t *const offsets,
uint8_t *coeff_contexts) {
const int stride = 8 + TX_PAD_HOR;
@@ -366,12 +368,12 @@
uint8x16_t level[5];
uint8x16_t pos_to_offset[3];
- assert(!(height % 2));
+ assert(!(width % 2));
- if (height == 8) {
+ if (width == 8) {
pos_to_offset[0] = vld1q_u8(c_8_po_2d_8[0]);
pos_to_offset[1] = vld1q_u8(c_8_po_2d_8[1]);
- } else if (height < 8) {
+ } else if (width < 8) {
pos_to_offset[0] = vld1q_u8(c_8_po_2d_l[0]);
pos_to_offset[1] = vld1q_u8(c_8_po_2d_l[1]);
} else {
@@ -380,7 +382,7 @@
}
pos_to_offset[2] = vdupq_n_u8(21);
- int row = height;
+ int col = width;
do {
load_levels_8x2x5(levels, stride, offsets, level);
count = get_coeff_contexts_kernel(level);
@@ -390,26 +392,26 @@
pos_to_offset[1] = pos_to_offset[2];
levels += 2 * stride;
cc += 16;
- row -= 2;
- } while (row);
+ col -= 2;
+ } while (col);
coeff_contexts[0] = 0;
}
-static INLINE void get_8_coeff_contexts_hor(const uint8_t *levels,
- const int height,
+static INLINE void get_8_coeff_contexts_ver(const uint8_t *levels,
+ const int width,
const ptrdiff_t *const offsets,
uint8_t *coeff_contexts) {
const int stride = 8 + TX_PAD_HOR;
- const uint8x16_t pos_to_offset = vld1q_u8(c_8_po_hor);
+ const uint8x16_t pos_to_offset = vld1q_u8(c_8_po_ver);
uint8x16_t count;
uint8x16_t level[5];
- assert(!(height % 2));
+ assert(!(width % 2));
- int row = height;
+ int col = width;
do {
load_levels_8x2x5(levels, stride, offsets, level);
count = get_coeff_contexts_kernel(level);
@@ -417,12 +419,12 @@
vst1q_u8(coeff_contexts, count);
levels += 2 * stride;
coeff_contexts += 16;
- row -= 2;
- } while (row);
+ col -= 2;
+ } while (col);
}
-static INLINE void get_8_coeff_contexts_ver(const uint8_t *levels,
- const int height,
+static INLINE void get_8_coeff_contexts_hor(const uint8_t *levels,
+ const int width,
const ptrdiff_t *const offsets,
uint8_t *coeff_contexts) {
const int stride = 8 + TX_PAD_HOR;
@@ -434,9 +436,9 @@
uint8x16_t count;
uint8x16_t level[5];
- assert(!(height % 2));
+ assert(!(width % 2));
- int row = height;
+ int col = width;
do {
load_levels_8x2x5(levels, stride, offsets, level);
count = get_coeff_contexts_kernel(level);
@@ -445,8 +447,8 @@
pos_to_offset = pos_to_offset_large;
levels += 2 * stride;
coeff_contexts += 16;
- row -= 2;
- } while (row);
+ col -= 2;
+ } while (col);
}
static INLINE void get_16n_coeff_contexts_2d(const uint8_t *levels,
@@ -455,15 +457,15 @@
const int width, const int height,
const ptrdiff_t *const offsets,
uint8_t *coeff_contexts) {
- const int stride = width + TX_PAD_HOR;
+ const int stride = height + TX_PAD_HOR;
uint8_t *cc = coeff_contexts;
- int row = height;
+ int col = width;
uint8x16_t pos_to_offset[5];
uint8x16_t pos_to_offset_large[3];
uint8x16_t count;
uint8x16_t level[5];
- assert(!(width % 16));
+ assert(!(height % 16));
pos_to_offset_large[2] = vdupq_n_u8(21);
if (real_width == real_height) {
@@ -473,22 +475,22 @@
pos_to_offset[3] = vld1q_u8(c_16_po_2d_e[3]);
pos_to_offset[4] = pos_to_offset_large[0] = pos_to_offset_large[1] =
pos_to_offset_large[2];
- } else if (real_width > real_height) {
+ } else if (real_width < real_height) {
pos_to_offset[0] = vld1q_u8(c_16_po_2d_g[0]);
pos_to_offset[1] = vld1q_u8(c_16_po_2d_g[1]);
pos_to_offset[2] = pos_to_offset[3] = pos_to_offset[4] =
vld1q_u8(c_16_po_2d_g[2]);
pos_to_offset_large[0] = pos_to_offset_large[1] = pos_to_offset_large[2];
- } else { // real_width < real_height
+ } else { // real_width > real_height
pos_to_offset[0] = pos_to_offset[1] = vld1q_u8(c_16_po_2d_l[0]);
pos_to_offset[2] = vld1q_u8(c_16_po_2d_l[1]);
pos_to_offset[3] = vld1q_u8(c_16_po_2d_l[2]);
pos_to_offset[4] = pos_to_offset_large[2];
- pos_to_offset_large[0] = pos_to_offset_large[1] = vdupq_n_u8(11);
+ pos_to_offset_large[0] = pos_to_offset_large[1] = vdupq_n_u8(16);
}
do {
- int w = width;
+ int h = height;
do {
load_levels_16x1x5(levels, stride, offsets, level);
@@ -497,9 +499,9 @@
vst1q_u8(cc, count);
levels += 16;
cc += 16;
- w -= 16;
+ h -= 16;
pos_to_offset[0] = pos_to_offset_large[0];
- } while (w);
+ } while (h);
pos_to_offset[0] = pos_to_offset[1];
pos_to_offset[1] = pos_to_offset[2];
@@ -508,29 +510,29 @@
pos_to_offset_large[0] = pos_to_offset_large[1];
pos_to_offset_large[1] = pos_to_offset_large[2];
levels += TX_PAD_HOR;
- } while (--row);
+ } while (--col);
coeff_contexts[0] = 0;
}
-static INLINE void get_16n_coeff_contexts_hor(const uint8_t *levels,
+static INLINE void get_16n_coeff_contexts_ver(const uint8_t *levels,
const int width, const int height,
const ptrdiff_t *const offsets,
uint8_t *coeff_contexts) {
- const int stride = width + TX_PAD_HOR;
+ const int stride = height + TX_PAD_HOR;
const uint8x16_t pos_to_offset_large = vdupq_n_u8(SIG_COEF_CONTEXTS_2D + 10);
uint8x16_t count;
uint8x16_t level[5];
- assert(!(width % 16));
+ assert(!(height % 16));
- int row = height;
+ int col = width;
do {
- uint8x16_t pos_to_offset = vld1q_u8(c_16_po_hor);
+ uint8x16_t pos_to_offset = vld1q_u8(c_16_po_ver);
- int w = width;
+ int h = height;
do {
load_levels_16x1x5(levels, stride, offsets, level);
count = get_coeff_contexts_kernel(level);
@@ -539,32 +541,32 @@
pos_to_offset = pos_to_offset_large;
levels += 16;
coeff_contexts += 16;
- w -= 16;
- } while (w);
+ h -= 16;
+ } while (h);
levels += TX_PAD_HOR;
- } while (--row);
+ } while (--col);
}
-static INLINE void get_16n_coeff_contexts_ver(const uint8_t *levels,
+static INLINE void get_16n_coeff_contexts_hor(const uint8_t *levels,
const int width, const int height,
const ptrdiff_t *const offsets,
uint8_t *coeff_contexts) {
- const int stride = width + TX_PAD_HOR;
+ const int stride = height + TX_PAD_HOR;
uint8x16_t pos_to_offset[3];
uint8x16_t count;
uint8x16_t level[5];
- assert(!(width % 16));
+ assert(!(height % 16));
pos_to_offset[0] = vdupq_n_u8(SIG_COEF_CONTEXTS_2D + 0);
pos_to_offset[1] = vdupq_n_u8(SIG_COEF_CONTEXTS_2D + 5);
pos_to_offset[2] = vdupq_n_u8(SIG_COEF_CONTEXTS_2D + 10);
- int row = height;
+ int col = width;
do {
- int w = width;
+ int h = height;
do {
load_levels_16x1x5(levels, stride, offsets, level);
count = get_coeff_contexts_kernel(level);
@@ -572,13 +574,13 @@
vst1q_u8(coeff_contexts, count);
levels += 16;
coeff_contexts += 16;
- w -= 16;
- } while (w);
+ h -= 16;
+ } while (h);
pos_to_offset[0] = pos_to_offset[1];
pos_to_offset[1] = pos_to_offset[2];
levels += TX_PAD_HOR;
- } while (--row);
+ } while (--col);
}
// Note: levels[] must be in the range [0, 127], inclusive.
@@ -599,7 +601,7 @@
const int real_height = tx_size_high[tx_size];
const int width = get_txb_wide(tx_size);
const int height = get_txb_high(tx_size);
- const int stride = width + TX_PAD_HOR;
+ const int stride = height + TX_PAD_HOR;
ptrdiff_t offsets[3];
/* coeff_contexts must be 16 byte aligned. */
@@ -610,43 +612,43 @@
offsets[1] = 1 * stride + 1;
offsets[2] = 2 * stride + 0;
- if (width == 4) {
- get_4_nz_map_contexts_2d(levels, height, offsets, coefficients);
- } else if (width == 8) {
- get_8_coeff_contexts_2d(levels, height, offsets, coefficients);
+ if (height == 4) {
+ get_4_nz_map_contexts_2d(levels, width, offsets, coefficients);
+ } else if (height == 8) {
+ get_8_coeff_contexts_2d(levels, width, offsets, coefficients);
} else {
get_16n_coeff_contexts_2d(levels, real_width, real_height, width, height,
offsets, coefficients);
}
} else if (tx_class == TX_CLASS_HORIZ) {
- offsets[0] = 2;
- offsets[1] = 3;
- offsets[2] = 4;
- if (width == 4) {
- get_4_nz_map_contexts_hor(levels, height, offsets, coefficients);
- } else if (width == 8) {
- get_8_coeff_contexts_hor(levels, height, offsets, coefficients);
+ offsets[0] = 2 * stride;
+ offsets[1] = 3 * stride;
+ offsets[2] = 4 * stride;
+ if (height == 4) {
+ get_4_nz_map_contexts_hor(levels, width, offsets, coefficients);
+ } else if (height == 8) {
+ get_8_coeff_contexts_hor(levels, width, offsets, coefficients);
} else {
get_16n_coeff_contexts_hor(levels, width, height, offsets, coefficients);
}
} else { // TX_CLASS_VERT
- offsets[0] = 2 * stride;
- offsets[1] = 3 * stride;
- offsets[2] = 4 * stride;
- if (width == 4) {
- get_4_nz_map_contexts_ver(levels, height, offsets, coefficients);
- } else if (width == 8) {
- get_8_coeff_contexts_ver(levels, height, offsets, coefficients);
+ offsets[0] = 2;
+ offsets[1] = 3;
+ offsets[2] = 4;
+ if (height == 4) {
+ get_4_nz_map_contexts_ver(levels, width, offsets, coefficients);
+ } else if (height == 8) {
+ get_8_coeff_contexts_ver(levels, width, offsets, coefficients);
} else {
get_16n_coeff_contexts_ver(levels, width, height, offsets, coefficients);
}
}
- const int bwl = get_txb_bwl(tx_size);
+ const int bhl = get_txb_bhl(tx_size);
const int pos = scan[last_idx];
- if (last_idx <= (height << bwl) / 8)
+ if (last_idx <= (width << bhl) / 8)
coeff_contexts[pos] = 1;
- else if (last_idx <= (height << bwl) / 4)
+ else if (last_idx <= (width << bhl) / 4)
coeff_contexts[pos] = 2;
else
coeff_contexts[pos] = 3;
diff --git a/av1/encoder/arm/neon/highbd_fwd_txfm_neon.c b/av1/encoder/arm/neon/highbd_fwd_txfm_neon.c
index 273712a..15d375a 100644
--- a/av1/encoder/arm/neon/highbd_fwd_txfm_neon.c
+++ b/av1/encoder/arm/neon/highbd_fwd_txfm_neon.c
@@ -19,6 +19,14 @@
#include "config/av1_rtcd.h"
#include "config/aom_config.h"
+static INLINE void store_output_w4(int32_t *const out,
+ const int32x4_t *const in, const int stride,
+ const int out_size) {
+ for (int i = 0; i < out_size; ++i) {
+ vst1q_s32(out + i * stride, in[i]);
+ }
+}
+
static INLINE int32x4_t half_btf_neon(const int32_t *w0, const int32x4_t *n0,
const int32_t *w1, const int32x4_t *n1,
const int32x4_t v_bit) {
@@ -39,7 +47,7 @@
return x;
}
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
#define TRANSPOSE_4X4(x0, x1, x2, x3, y0, y1, y2, y3) \
do { \
int32x4x2_t swap_low = vtrnq_s32(x0, x1); \
@@ -71,7 +79,11 @@
y3 = vextq_s32(swap_low.val[1], \
vextq_s32(swap_high.val[1], swap_high.val[1], 2), 2); \
} while (0)
-#endif // (__aarch64__)
+#endif // AOM_ARCH_AARCH64
+
+static INLINE void transpose_4x4(const int32x4_t *in, int32x4_t *out) {
+ TRANSPOSE_4X4(in[0], in[1], in[2], in[3], out[0], out[1], out[2], out[3]);
+}
static INLINE void transpose_8x8(const int32x4_t *in, int32x4_t *out) {
TRANSPOSE_4X4(in[0], in[2], in[4], in[6], out[0], out[2], out[4], out[6]);
@@ -215,7 +227,10 @@
u3 = vrshlq_s32(v2, v_bit);
- TRANSPOSE_4X4(u0, u1, u2, u3, out[0], out[1], out[2], out[3]);
+ out[0] = u0;
+ out[1] = u1;
+ out[2] = u2;
+ out[3] = u3;
}
static INLINE void write_buffer_4x4(int32x4_t *res, int32_t *output) {
@@ -237,7 +252,6 @@
int32x4_t t;
int32x4_t s0, s1, s2, s3, s7;
int32x4_t x0, x1, x2, x3;
- int32x4_t u0, u1, u2, u3;
int idx = 0 * num_col;
s0 = vmulq_s32(in[idx], sinpi1);
@@ -261,12 +275,10 @@
s3 = vaddq_s32(t, x3);
const int32x4_t v_bit = vdupq_n_s32(-bit);
- u0 = vrshlq_s32(s0, v_bit);
- u1 = vrshlq_s32(s1, v_bit);
- u2 = vrshlq_s32(s2, v_bit);
- u3 = vrshlq_s32(s3, v_bit);
-
- TRANSPOSE_4X4(u0, u1, u2, u3, out[0], out[1], out[2], out[3]);
+ out[0] = vrshlq_s32(s0, v_bit);
+ out[1] = vrshlq_s32(s1, v_bit);
+ out[2] = vrshlq_s32(s2, v_bit);
+ out[3] = vrshlq_s32(s3, v_bit);
}
static void idtx4x4_neon(int32x4_t *in, int32x4_t *out, int bit, int col_num) {
(void)bit;
@@ -278,8 +290,6 @@
a_low = vmulq_s32(in[i * col_num], fact);
out[i] = vrshrq_n_s32(a_low, NewSqrt2Bits);
}
-
- TRANSPOSE_4X4(out[0], out[1], out[2], out[3], out[0], out[1], out[2], out[3]);
}
void av1_fwd_txfm2d_4x4_neon(const int16_t *input, int32_t *coeff,
int input_stride, TX_TYPE tx_type, int bd) {
@@ -292,96 +302,112 @@
case DCT_DCT:
load_buffer_4x4(input, in, input_stride, 0, 0, &v_shift0);
fdct4x4_neon(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_4x4(in, in);
fdct4x4_neon(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case ADST_DCT:
load_buffer_4x4(input, in, input_stride, 0, 0, &v_shift0);
fadst4x4_neon(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_4x4(in, in);
fdct4x4_neon(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case DCT_ADST:
load_buffer_4x4(input, in, input_stride, 0, 0, &v_shift0);
fdct4x4_neon(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_4x4(in, in);
fadst4x4_neon(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case ADST_ADST:
load_buffer_4x4(input, in, input_stride, 0, 0, &v_shift0);
fadst4x4_neon(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_4x4(in, in);
fadst4x4_neon(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case FLIPADST_DCT:
load_buffer_4x4(input, in, input_stride, 1, 0, &v_shift0);
fadst4x4_neon(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_4x4(in, in);
fdct4x4_neon(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case DCT_FLIPADST:
load_buffer_4x4(input, in, input_stride, 0, 1, &v_shift0);
fdct4x4_neon(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_4x4(in, in);
fadst4x4_neon(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case FLIPADST_FLIPADST:
load_buffer_4x4(input, in, input_stride, 1, 1, &v_shift0);
fadst4x4_neon(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_4x4(in, in);
fadst4x4_neon(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case ADST_FLIPADST:
load_buffer_4x4(input, in, input_stride, 0, 1, &v_shift0);
fadst4x4_neon(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_4x4(in, in);
fadst4x4_neon(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case FLIPADST_ADST:
load_buffer_4x4(input, in, input_stride, 1, 0, &v_shift0);
fadst4x4_neon(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_4x4(in, in);
fadst4x4_neon(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case IDTX:
load_buffer_4x4(input, in, input_stride, 0, 0, &v_shift0);
idtx4x4_neon(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_4x4(in, in);
idtx4x4_neon(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case V_DCT:
load_buffer_4x4(input, in, input_stride, 0, 0, &v_shift0);
fdct4x4_neon(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_4x4(in, in);
idtx4x4_neon(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case H_DCT:
load_buffer_4x4(input, in, input_stride, 0, 0, &v_shift0);
idtx4x4_neon(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
+ transpose_4x4(in, in);
fdct4x4_neon(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case V_ADST:
load_buffer_4x4(input, in, input_stride, 0, 0, &v_shift0);
fadst4x4_neon(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_4x4(in, in);
idtx4x4_neon(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case H_ADST:
load_buffer_4x4(input, in, input_stride, 0, 0, &v_shift0);
idtx4x4_neon(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
+ transpose_4x4(in, in);
fadst4x4_neon(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case V_FLIPADST:
load_buffer_4x4(input, in, input_stride, 1, 0, &v_shift0);
fadst4x4_neon(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
+ transpose_4x4(in, in);
idtx4x4_neon(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case H_FLIPADST:
load_buffer_4x4(input, in, input_stride, 0, 1, &v_shift0);
idtx4x4_neon(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
+ transpose_4x4(in, in);
fadst4x4_neon(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
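
Throughout this hunk the transpose moves out of the 1D kernels (the TRANSPOSE_4X4 deletions above) and becomes an explicit transpose_4x4 between the column and row passes; no transpose follows the row pass, so coefficients land in the column-major layout adopted across this patch. A hedged sketch of what a 4x4 int32 NEON transpose of this shape can look like (the actual transpose_4x4 helper may differ):

    #include <arm_neon.h>

    static inline void transpose_4x4_s32(const int32x4_t in[4], int32x4_t out[4]) {
      // vtrnq interleaves pairs of rows; recombining the halves finishes the
      // transpose, so out[c] holds column c of the input.
      const int32x4x2_t p01 = vtrnq_s32(in[0], in[1]);
      const int32x4x2_t p23 = vtrnq_s32(in[2], in[3]);
      out[0] = vcombine_s32(vget_low_s32(p01.val[0]), vget_low_s32(p23.val[0]));
      out[1] = vcombine_s32(vget_low_s32(p01.val[1]), vget_low_s32(p23.val[1]));
      out[2] = vcombine_s32(vget_high_s32(p01.val[0]), vget_high_s32(p23.val[0]));
      out[3] = vcombine_s32(vget_high_s32(p01.val[1]), vget_high_s32(p23.val[1]));
    }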
@@ -827,8 +853,7 @@
col_txfm_8x8_rounding(out, &v_shift1);
transpose_8x8(out, in);
fdct8x8_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case ADST_DCT:
load_buffer_8x8(input, in, stride, 0, 0, shift[0]);
@@ -836,8 +861,7 @@
col_txfm_8x8_rounding(out, &v_shift1);
transpose_8x8(out, in);
fdct8x8_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case DCT_ADST:
load_buffer_8x8(input, in, stride, 0, 0, shift[0]);
@@ -845,8 +869,7 @@
col_txfm_8x8_rounding(out, &v_shift1);
transpose_8x8(out, in);
fadst8x8_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case ADST_ADST:
load_buffer_8x8(input, in, stride, 0, 0, shift[0]);
@@ -854,8 +877,7 @@
col_txfm_8x8_rounding(out, &v_shift1);
transpose_8x8(out, in);
fadst8x8_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case FLIPADST_DCT:
load_buffer_8x8(input, in, stride, 1, 0, shift[0]);
@@ -863,8 +885,7 @@
col_txfm_8x8_rounding(out, &v_shift1);
transpose_8x8(out, in);
fdct8x8_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case DCT_FLIPADST:
load_buffer_8x8(input, in, stride, 0, 1, shift[0]);
@@ -872,8 +893,7 @@
col_txfm_8x8_rounding(out, &v_shift1);
transpose_8x8(out, in);
fadst8x8_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case FLIPADST_FLIPADST:
load_buffer_8x8(input, in, stride, 1, 1, shift[0]);
@@ -881,8 +901,7 @@
col_txfm_8x8_rounding(out, &v_shift1);
transpose_8x8(out, in);
fadst8x8_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case ADST_FLIPADST:
load_buffer_8x8(input, in, stride, 0, 1, shift[0]);
@@ -890,8 +909,7 @@
col_txfm_8x8_rounding(out, &v_shift1);
transpose_8x8(out, in);
fadst8x8_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case FLIPADST_ADST:
load_buffer_8x8(input, in, stride, 1, 0, shift[0]);
@@ -899,8 +917,7 @@
col_txfm_8x8_rounding(out, &v_shift1);
transpose_8x8(out, in);
fadst8x8_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case IDTX:
load_buffer_8x8(input, in, stride, 0, 0, shift[0]);
@@ -908,8 +925,7 @@
col_txfm_8x8_rounding(out, &v_shift1);
transpose_8x8(out, in);
idtx8x8_neon(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case V_DCT:
load_buffer_8x8(input, in, stride, 0, 0, shift[0]);
@@ -917,8 +933,7 @@
col_txfm_8x8_rounding(out, &v_shift1);
transpose_8x8(out, in);
idtx8x8_neon(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case H_DCT:
load_buffer_8x8(input, in, stride, 0, 0, shift[0]);
@@ -926,8 +941,7 @@
col_txfm_8x8_rounding(out, &v_shift1);
transpose_8x8(out, in);
fdct8x8_neon(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case V_ADST:
load_buffer_8x8(input, in, stride, 0, 0, shift[0]);
@@ -935,8 +949,7 @@
col_txfm_8x8_rounding(out, &v_shift1);
transpose_8x8(out, in);
idtx8x8_neon(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case H_ADST:
load_buffer_8x8(input, in, stride, 0, 0, shift[0]);
@@ -944,8 +957,7 @@
col_txfm_8x8_rounding(out, &v_shift1);
transpose_8x8(out, in);
fadst8x8_neon(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case V_FLIPADST:
load_buffer_8x8(input, in, stride, 1, 0, shift[0]);
@@ -953,8 +965,7 @@
col_txfm_8x8_rounding(out, &v_shift1);
transpose_8x8(out, in);
idtx8x8_neon(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case H_FLIPADST:
load_buffer_8x8(input, in, stride, 0, 1, shift[0]);
@@ -962,8 +973,7 @@
col_txfm_8x8_rounding(out, &v_shift1);
transpose_8x8(out, in);
fadst8x8_neon(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
default: assert(0);
}
@@ -1628,8 +1638,7 @@
col_txfm_16x16_rounding(out, &v_shift);
transpose_16x16(out, in);
fdct16x16_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case ADST_DCT:
load_buffer_16x16(input, in, stride, 0, 0, shift[0]);
@@ -1637,8 +1646,7 @@
col_txfm_16x16_rounding(out, &v_shift);
transpose_16x16(out, in);
fdct16x16_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case DCT_ADST:
load_buffer_16x16(input, in, stride, 0, 0, shift[0]);
@@ -1646,8 +1654,7 @@
col_txfm_16x16_rounding(out, &v_shift);
transpose_16x16(out, in);
fadst16x16_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case ADST_ADST:
load_buffer_16x16(input, in, stride, 0, 0, shift[0]);
@@ -1655,8 +1662,7 @@
col_txfm_16x16_rounding(out, &v_shift);
transpose_16x16(out, in);
fadst16x16_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case FLIPADST_DCT:
load_buffer_16x16(input, in, stride, 1, 0, shift[0]);
@@ -1664,8 +1670,7 @@
col_txfm_16x16_rounding(out, &v_shift);
transpose_16x16(out, in);
fdct16x16_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case DCT_FLIPADST:
load_buffer_16x16(input, in, stride, 0, 1, shift[0]);
@@ -1673,8 +1678,7 @@
col_txfm_16x16_rounding(out, &v_shift);
transpose_16x16(out, in);
fadst16x16_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case FLIPADST_FLIPADST:
load_buffer_16x16(input, in, stride, 1, 1, shift[0]);
@@ -1682,8 +1686,7 @@
col_txfm_16x16_rounding(out, &v_shift);
transpose_16x16(out, in);
fadst16x16_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case ADST_FLIPADST:
load_buffer_16x16(input, in, stride, 0, 1, shift[0]);
@@ -1691,8 +1694,7 @@
col_txfm_16x16_rounding(out, &v_shift);
transpose_16x16(out, in);
fadst16x16_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case FLIPADST_ADST:
load_buffer_16x16(input, in, stride, 1, 0, shift[0]);
@@ -1700,8 +1702,7 @@
col_txfm_16x16_rounding(out, &v_shift);
transpose_16x16(out, in);
fadst16x16_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case IDTX:
load_buffer_16x16(input, in, stride, 0, 0, shift[0]);
@@ -1709,8 +1710,7 @@
col_txfm_16x16_rounding(out, &v_shift);
transpose_16x16(out, in);
idtx16x16_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case V_DCT:
load_buffer_16x16(input, in, stride, 0, 0, shift[0]);
@@ -1718,8 +1718,7 @@
col_txfm_16x16_rounding(out, &v_shift);
transpose_16x16(out, in);
idtx16x16_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case H_DCT:
load_buffer_16x16(input, in, stride, 0, 0, shift[0]);
@@ -1727,8 +1726,7 @@
col_txfm_16x16_rounding(out, &v_shift);
transpose_16x16(out, in);
fdct16x16_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case V_ADST:
load_buffer_16x16(input, in, stride, 0, 0, shift[0]);
@@ -1736,8 +1734,7 @@
col_txfm_16x16_rounding(out, &v_shift);
transpose_16x16(out, in);
idtx16x16_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case H_ADST:
load_buffer_16x16(input, in, stride, 0, 0, shift[0]);
@@ -1745,8 +1742,7 @@
col_txfm_16x16_rounding(out, &v_shift);
transpose_16x16(out, in);
fadst16x16_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case V_FLIPADST:
load_buffer_16x16(input, in, stride, 1, 0, shift[0]);
@@ -1754,8 +1750,7 @@
col_txfm_16x16_rounding(out, &v_shift);
transpose_16x16(out, in);
idtx16x16_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case H_FLIPADST:
load_buffer_16x16(input, in, stride, 0, 1, shift[0]);
@@ -1763,8 +1758,7 @@
col_txfm_16x16_rounding(out, &v_shift);
transpose_16x16(out, in);
fadst16x16_neon(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
default: assert(0);
}
@@ -2356,37 +2350,30 @@
cospi = cospi_arr(cos_bit);
for (col = 0; col < col_num; col++) {
// stage 0;
- int32_t stage_idx = 0;
int j;
for (j = 0; j < 4; ++j) {
buf0[j] = input[j * col_num + col];
}
// stage 1
- stage_idx++;
buf1[0] = buf0[3];
buf1[1] = buf0[0];
buf1[2] = buf0[1];
buf1[3] = buf0[2];
// stage 2
- stage_idx++;
-
btf_32_neon_type0(cospi[8], cospi[56], buf1[0], buf1[1], buf0[0], buf0[1],
v_cos_bit);
btf_32_neon_type0(cospi[40], cospi[24], buf1[2], buf1[3], buf0[2], buf0[3],
v_cos_bit);
// stage 3
- stage_idx++;
buf1[0] = vaddq_s32(buf0[0], buf0[2]);
buf1[2] = vsubq_s32(buf0[0], buf0[2]);
buf1[1] = vaddq_s32(buf0[1], buf0[3]);
buf1[3] = vsubq_s32(buf0[1], buf0[3]);
// stage 4
- stage_idx++;
-
cospi = cospi_arr(cos_bit);
buf0[0] = buf1[0];
buf0[1] = buf1[1];
@@ -2395,7 +2382,6 @@
v_cos_bit);
// stage 5
- stage_idx++;
buf1[0] = buf0[0];
buf1[1] = vnegq_s32(buf0[2]);
buf1[2] = buf0[3];
@@ -3375,9 +3361,9 @@
}
for (int i = 0; i < 2; i++) {
- transpose_8x8(out + i * 16, in);
- av1_round_shift_rect_array_32_neon(in, in, 16, -shift[2], NewSqrt2);
- write_buffer_16x8(in, coeff + i * 8, 16);
+ av1_round_shift_rect_array_32_neon(out + i * 16, in, 16, -shift[2],
+ NewSqrt2);
+ write_buffer_8x8(in, coeff + i * 64);
}
}
@@ -3403,9 +3389,8 @@
for (int i = 0; i < 2; i++) {
row_txfm(out + i * 16, out, bit, 2);
- transpose_8x8(out, in);
- av1_round_shift_rect_array_32_neon(in, in, 16, -shift[2], NewSqrt2);
- write_buffer_8x8(in, coeff + i * 64);
+ av1_round_shift_rect_array_32_neon(out, out, 16, -shift[2], NewSqrt2);
+ write_buffer_16x8(out, coeff + i * 8, 16);
}
}
@@ -3456,7 +3441,9 @@
// row transform
for (int i = 0; i < txfm_size_col; i++) {
- row_txfm(in + i, outcoeff128 + i * txfm_size_col, bitrow, txfm_size_col);
+ int32x4_t tmp[4];
+ row_txfm(in + i, tmp, bitrow, txfm_size_row >> 2);
+ store_output_w4(coeff + i * 4, tmp, txfm_size_row, txfm_size_col);
}
}
#endif
@@ -3483,16 +3470,16 @@
const int32x4_t v_shift0 = vdupq_n_s32(shift[0]);
load_buffer_16x4(input, in, stride, ud_flip, lr_flip, &v_shift0);
- for (int i = 0; i < txfm_size_row; i++) {
- col_txfm(in + i * txfm_size_row, outcoeff128 + i * txfm_size_row, bitcol,
- 1);
+ for (int i = 0; i < (txfm_size_col >> 2); i++) {
+ int32x4_t *cur_in = &in[i * txfm_size_row];
+ col_txfm(cur_in, cur_in, bitcol, 1);
+ transpose_4x4(cur_in, cur_in);
}
const int32x4_t v_shift1 = vdupq_n_s32(shift[1]);
- col_txfm_8x8_rounding(outcoeff128, &v_shift1);
+ col_txfm_8x8_rounding(in, &v_shift1);
// row transform
- row_txfm(outcoeff128, in, bitrow, 1);
- transpose_8nx8n(in, outcoeff128, txfm_size_row, txfm_size_col);
+ row_txfm(in, outcoeff128, bitrow, 1);
}
void av1_fwd_txfm2d_16x32_neon(const int16_t *input, int32_t *coeff, int stride,
@@ -3524,9 +3511,7 @@
// row transform
row_txfm(outcoef128, in, bitrow, 8);
- transpose_8nx8n(in, outcoef128, 32, 16);
- av1_round_shift_rect_array_32_neon(outcoef128, outcoef128, 128, -shift[2],
- NewSqrt2);
+ av1_round_shift_rect_array_32_neon(in, outcoef128, 128, -shift[2], NewSqrt2);
}
void av1_fwd_txfm2d_32x64_neon(const int16_t *input, int32_t *coeff, int stride,
@@ -3562,9 +3547,10 @@
for (int i = 0; i < num_row; i++) {
av1_fdct32_new_neon((outcoef128 + i), (in + i), bitrow, num_row);
}
- transpose_8nx8n(in, outcoef128, txfm_size_row, txfm_size_col);
- av1_round_shift_rect_array_32_neon(outcoef128, outcoef128, 512, -shift[2],
- NewSqrt2);
+ for (int i = 0; i < txfm_size_col; i++) {
+ av1_round_shift_rect_array_32_neon(in + i * 16, outcoef128 + i * 8, 8,
+ -shift[2], NewSqrt2);
+ }
}
void av1_fwd_txfm2d_64x32_neon(const int16_t *input, int32_t *coeff, int stride,
@@ -3609,9 +3595,7 @@
for (int i = 0; i < num_row; i++) {
av1_fdct64_new_neon((outcoef128 + i), (in + i), bitrow, num_row, num_row);
}
- transpose_8nx8n(in, outcoef128, txfm_size_row, txfm_size_col >> 1);
- av1_round_shift_rect_array_32_neon(outcoef128, outcoef128, 512 >> 1,
- -shift[2], NewSqrt2);
+ av1_round_shift_rect_array_32_neon(in, outcoef128, 512, -shift[2], NewSqrt2);
(void)bd;
}
@@ -3639,9 +3623,7 @@
for (int i = 0; i < 4; i++) {
row_txfm((outcoef128 + i), (in + i), bitrow, 4);
}
- transpose_8nx8n(in, outcoef128, 16, 32);
- av1_round_shift_rect_array_32_neon(outcoef128, outcoef128, 128, -shift[2],
- NewSqrt2);
+ av1_round_shift_rect_array_32_neon(in, outcoef128, 128, -shift[2], NewSqrt2);
(void)bd;
}
@@ -3677,9 +3659,8 @@
// row transform
for (int i = 0; i < txfm_size_col; i += 2) {
- row_txfm((outcoef128 + i), (in + i), bitrow, txfm_size_col);
+ row_txfm((outcoef128 + i), (outcoef128 + i), bitrow, txfm_size_col);
}
- transpose_8nx8n(in, outcoef128, txfm_size_row, txfm_size_col);
(void)bd;
}
@@ -3711,9 +3692,8 @@
// row transform
for (int i = 0; i < num_col; i++) {
- row_txfm((outcoef128 + i), (in + i), bitrow, num_col);
+ row_txfm((outcoef128 + i), (outcoef128 + i), bitrow, num_col);
}
- transpose_8nx8n(in, outcoef128, txfm_size_row, txfm_size_col);
(void)bd;
}
#endif
@@ -3721,7 +3701,6 @@
void av1_fwd_txfm2d_4x8_neon(const int16_t *input, int32_t *coeff, int stride,
TX_TYPE tx_type, int bd) {
int32x4_t in[8];
- int32x4_t *outcoeff128 = (int32x4_t *)coeff;
const int8_t *shift = av1_fwd_txfm_shift_ls[TX_4X8];
const int txw_idx = get_txw_idx(TX_4X8);
const int txh_idx = get_txh_idx(TX_4X8);
@@ -3739,13 +3718,15 @@
col_txfm(in, in, bitcol, 1);
int32x4_t v_shift1 = vdupq_n_s32(shift[1]);
col_txfm_4x8_rounding(in, &v_shift1);
- transpose_8nx8n(in, outcoeff128, txfm_size_col, txfm_size_row);
for (int i = 0; i < 2; i++) {
- row_txfm(outcoeff128 + i, in + i * txfm_size_col, bitrow, 2);
+ int32x4_t *cur_in = &in[i * 4];
+ transpose_4x4(cur_in, cur_in);
+ row_txfm(cur_in, cur_in, bitrow, 1);
+ av1_round_shift_rect_array_32_neon(cur_in, cur_in, txfm_size_col, -shift[2],
+ NewSqrt2);
+ store_output_w4(coeff + i * 4, cur_in, txfm_size_row, 4);
}
- av1_round_shift_rect_array_32_neon(in, outcoeff128, txfm_size_row, -shift[2],
- NewSqrt2);
(void)bd;
}
@@ -3768,16 +3749,17 @@
int32x4_t v_shift0 = vdupq_n_s32(shift[0]);
load_buffer_8x4(input, in, stride, ud_flip, lr_flip, &v_shift0);
for (int i = 0; i < 2; i++) {
- col_txfm(in + i * txfm_size_row, in + i * txfm_size_row, bitcol, 1);
+ int32x4_t *cur_in = &in[i * txfm_size_row];
+ col_txfm(cur_in, cur_in, bitcol, 1);
+ transpose_4x4(cur_in, cur_in);
}
int32x4_t v_shift1 = vdupq_n_s32(shift[1]);
col_txfm_4x8_rounding(in, &v_shift1);
// row transform
row_txfm(in, outcoeff128, bitrow, 1);
- av1_round_shift_rect_array_32_neon(outcoeff128, in, txfm_size_col, -shift[2],
- NewSqrt2);
- transpose_8nx8n(in, outcoeff128, txfm_size_row, txfm_size_col);
+ av1_round_shift_rect_array_32_neon(outcoeff128, outcoeff128, txfm_size_col,
+ -shift[2], NewSqrt2);
(void)bd;
}
@@ -3820,9 +3802,7 @@
col_txfm_16x16_rounding(outcoeff128 + 192, &v_shift);
transpose_8nx8n(outcoeff128, in, txfm_size_col, 32);
- fdct16x16_neon(in, in, bitrow, 8);
- transpose_8nx8n(in, outcoeff128, 32, txfm_size_col);
- memset(coeff + txfm_size_col * 32, 0, txfm_size_col * 32 * sizeof(*coeff));
+ fdct16x16_neon(in, outcoeff128, bitrow, 8);
(void)bd;
}
@@ -3861,9 +3841,9 @@
transpose_8nx8n(outcoeff128, in, txfm_size_col, txfm_size_row);
for (int i = 0; i < 4; i++) {
- av1_fdct64_new_neon(in + i, in + i, bitrow, 4, 4);
+ av1_fdct64_new_neon(in + i, outcoeff128 + i, bitrow, 4, 4);
}
- transpose_8nx8n(in, outcoeff128, txfm_size_row, 32);
+ memset(coeff + txfm_size_row * 32, 0, txfm_size_row * 32 * sizeof(*coeff));
(void)bd;
}
#endif
@@ -3906,9 +3886,9 @@
static INLINE TxfmFuncNEON fwd_txfm_type_to_func(TXFM_TYPE txfm_type) {
switch (txfm_type) {
- case TXFM_TYPE_DCT32: return fdct32_new_neon; break;
- case TXFM_TYPE_DCT64: return fdct64_new_neon; break;
- case TXFM_TYPE_IDENTITY32: return idtx32x32_neon; break;
+ case TXFM_TYPE_DCT32: return fdct32_new_neon;
+ case TXFM_TYPE_DCT64: return fdct64_new_neon;
+ case TXFM_TYPE_IDENTITY32: return idtx32x32_neon;
default: assert(0);
}
return NULL;
@@ -3994,8 +3974,7 @@
}
txfm2d_size_128 = (col_num >> 1) * (txfm_size >> 1);
- av1_round_shift_array_32_neon(out_128, buf_128, txfm2d_size_128, -shift[2]);
- transpose_8nx8n(buf_128, out_128, 32, 32);
+ av1_round_shift_array_32_neon(out_128, out_128, txfm2d_size_128, -shift[2]);
}
static INLINE void fwd_txfm2d_neon(const int16_t *input, int32_t *output,
@@ -4024,8 +4003,7 @@
av1_round_shift_array_32_neon(buf_128, out_128, txfm2d_size_128, -shift[1]);
transpose_32(txfm_size, out_128, buf_128);
txfm_func_row(buf_128, out_128, cos_bit_row, stage_range_row);
- av1_round_shift_array_32_neon(out_128, buf_128, txfm2d_size_128, -shift[2]);
- transpose_32(txfm_size, buf_128, out_128);
+ av1_round_shift_array_32_neon(out_128, out_128, txfm2d_size_128, -shift[2]);
}
void av1_fwd_txfm2d_32x32_neon(const int16_t *input, int32_t *output,
diff --git a/av1/encoder/arm/neon/hybrid_fwd_txfm_neon.c b/av1/encoder/arm/neon/hybrid_fwd_txfm_neon.c
index 0ad1131..6cf835a 100644
--- a/av1/encoder/arm/neon/hybrid_fwd_txfm_neon.c
+++ b/av1/encoder/arm/neon/hybrid_fwd_txfm_neon.c
@@ -66,18 +66,8 @@
a1 = vsub_s16(a1, c1);
d1 = vadd_s16(d1, b1);
- x[0] = vcombine_s16(a1, c1);
- x[1] = vcombine_s16(d1, b1);
-
- transpose4x4(x, s);
-
- vst1q_s32(&output[0], vshll_n_s16(s[0], UNIT_QUANT_SHIFT));
- vst1q_s32(&output[4], vshll_n_s16(s[1], UNIT_QUANT_SHIFT));
- vst1q_s32(&output[8], vshll_n_s16(s[2], UNIT_QUANT_SHIFT));
- vst1q_s32(&output[12], vshll_n_s16(s[3], UNIT_QUANT_SHIFT));
-}
-
-void av1_highbd_fwht4x4_neon(const int16_t *input, tran_low_t *output,
- int stride) {
- av1_fwht4x4_neon(input, output, stride);
+ vst1q_s32(&output[0], vshll_n_s16(a1, UNIT_QUANT_SHIFT));
+ vst1q_s32(&output[4], vshll_n_s16(c1, UNIT_QUANT_SHIFT));
+ vst1q_s32(&output[8], vshll_n_s16(d1, UNIT_QUANT_SHIFT));
+ vst1q_s32(&output[12], vshll_n_s16(b1, UNIT_QUANT_SHIFT));
}
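
The fwht4x4 epilogue above drops the 16-bit transpose and widens each result vector straight into 32-bit coefficients; vshll_n_s16 widens and shifts left in a single instruction. A minimal sketch of that store idiom (assuming UNIT_QUANT_SHIFT, which is 2 in libaom):

    static inline void store_widened(int32_t *out, int16x4_t v) {
      vst1q_s32(out, vshll_n_s16(v, 2));  // (int32_t)v[i] << 2
    }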
diff --git a/av1/encoder/arm/neon/ml_neon.c b/av1/encoder/arm/neon/ml_neon.c
index fcff3a9..be6ddfd 100644
--- a/av1/encoder/arm/neon/ml_neon.c
+++ b/av1/encoder/arm/neon/ml_neon.c
@@ -13,6 +13,7 @@
#include <assert.h>
#include <arm_neon.h>
+#include "config/aom_config.h"
#include "config/av1_rtcd.h"
#include "av1/encoder/ml.h"
@@ -46,7 +47,7 @@
vadd = vmlaq_f32(vadd, inputs_h, weights_h);
vadd = vmlaq_f32(vadd, inputs_l, weights_l);
}
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
total += vaddvq_f32(vadd);
#else
float32x2_t vadd_lo = vadd_f32(vget_low_f32(vadd), vget_high_f32(vadd));
@@ -80,7 +81,7 @@
j -= 8;
}
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
total += vaddvq_f32(vadd);
#else
@@ -98,7 +99,7 @@
const float *layer_bias,
float *const output_nodes) {
float total = *layer_bias;
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
const float32x4_t v_inputs = vld1q_f32(inputs);
const float32x4_t v_weights = vld1q_f32(weights);
const float32x4_t vadd = vmulq_f32(v_inputs, v_weights);
@@ -126,7 +127,7 @@
vadd = vmlaq_f32(vadd, v_inputs, v_weights);
}
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
total += vaddvq_f32(vadd);
#else
float32x2_t vadd_lo = vadd_f32(vget_low_f32(vadd), vget_high_f32(vadd));
@@ -159,7 +160,7 @@
}
}
for (int i = 0; i < 2; i++)
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
mul0[i] = vpaddq_f32(mul0[i], mul1[i]);
const float32x4_t hh = vpaddq_f32(mul0[0], mul0[1]);
#else
@@ -197,7 +198,7 @@
}
}
for (int i = 0; i < 4; i++)
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
mul0[i] = vpaddq_f32(mul0[i], mul1[i]);
const float32x4_t hh0 = vpaddq_f32(mul0[0], mul0[1]);
const float32x4_t hh1 = vpaddq_f32(mul0[2], mul0[3]);
@@ -239,7 +240,7 @@
add[i] = vmlaq_f32(add[i], inputs_h, weight_h);
}
}
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
const float32x4_t hadd_h = vpaddq_f32(add[2], add[3]);
const float32x4_t hadd_l = vpaddq_f32(add[0], add[1]);
const float32x4_t haddhadd = vpaddq_f32(hadd_l, hadd_h);
diff --git a/av1/encoder/arm/neon/picksrt_neon.c b/av1/encoder/arm/neon/picksrt_neon.c
index a1e7765..1346d6b 100644
--- a/av1/encoder/arm/neon/picksrt_neon.c
+++ b/av1/encoder/arm/neon/picksrt_neon.c
@@ -141,10 +141,10 @@
}
sum64 = vpaddlq_u32(err0);
}
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
err += vaddvq_u64(sum64);
#else
err += vget_lane_u64(vadd_u64(vget_low_u64(sum64), vget_high_u64(sum64)), 0);
-#endif // __aarch64__
+#endif // AOM_ARCH_AARCH64
return err;
}
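
The repeated __aarch64__ to AOM_ARCH_AARCH64 changes in these files swap the compiler-defined macro for the aom_config.h one, which is always defined to 0 or 1 (hence the new config/aom_config.h includes). The idiom being guarded is the across-vector reduction that only AArch64 has as a single instruction:

    #include <arm_neon.h>
    #include "config/aom_config.h"

    static inline uint64_t horizontal_add_u64x2(uint64x2_t v) {
    #if AOM_ARCH_AARCH64
      return vaddvq_u64(v);  // single across-vector add on AArch64
    #else
      // AArch32 fallback: add the two halves, then extract lane 0.
      return vget_lane_u64(vadd_u64(vget_low_u64(v), vget_high_u64(v)), 0);
    #endif
    }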
diff --git a/av1/encoder/arm/neon/quantize_neon.c b/av1/encoder/arm/neon/quantize_neon.c
index dbfbeef..c3b57ce 100644
--- a/av1/encoder/arm/neon/quantize_neon.c
+++ b/av1/encoder/arm/neon/quantize_neon.c
@@ -14,6 +14,8 @@
#include <assert.h>
#include <math.h>
+#include "config/aom_config.h"
+
#include "aom_dsp/arm/mem_neon.h"
#include "aom_dsp/arm/sum_neon.h"
#include "aom_mem/aom_mem.h"
@@ -26,7 +28,7 @@
#include "av1/encoder/rd.h"
static INLINE uint16_t get_max_eob(int16x8_t v_eobmax) {
-#ifdef __aarch64__
+#if AOM_ARCH_AARCH64
return (uint16_t)vmaxvq_s16(v_eobmax);
#else
const int16x4_t v_eobmax_3210 =
diff --git a/av1/encoder/arm/neon/rdopt_neon.c b/av1/encoder/arm/neon/rdopt_neon.c
index 25df6b4..7d3bd4c 100644
--- a/av1/encoder/arm/neon/rdopt_neon.c
+++ b/av1/encoder/arm/neon/rdopt_neon.c
@@ -14,6 +14,7 @@
#include <arm_neon.h>
#include "av1/encoder/rdopt.h"
+#include "config/aom_config.h"
#include "config/av1_rtcd.h"
// Process horizontal and vertical correlations in a 4x4 block of pixels.
@@ -97,7 +98,7 @@
v_x_sum = vpadalq_s32(v_x_sum, x_sum_32);
v_x2_sum = vpadalq_s32(v_x2_sum, x2_sum_32);
}
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
xy_sum = vaddvq_s64(v_xy_sum);
xz_sum = vaddvq_s64(v_xz_sum);
x2_sum = vaddvq_s64(v_x2_sum);
@@ -160,7 +161,7 @@
v_y2_sum = vmlal_s16(v_y2_sum, v_y_hi, v_y_hi);
const int32x4_t v_y_sum_a = vpadalq_s16(v_y_sum, v_y);
const int64x2_t v_xy_sum2 = vpaddlq_s32(v_xy_sum_a);
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
const int64x2_t v_y2_sum_a = vpaddlq_s32(v_y2_sum);
xy_sum += vaddvq_s64(v_xy_sum2);
const int32_t y = vaddvq_s32(v_y_sum_a);
@@ -278,7 +279,7 @@
v_x_sum_a = vpadalq_s16(v_x_sum_a, v_y);
v_x_sum_a = vpadalq_s16(v_x_sum_a, v_w);
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
xy_sum += vaddvq_s64(vpaddlq_s32(v_xy_sum_a));
xz_sum += vaddvq_s64(vpaddlq_s32(v_xz_sum_a));
x_sum += vaddvq_s32(v_x_sum_a);
@@ -398,7 +399,7 @@
v_x2_firstrow = vmlal_s16(v_x2_firstrow, v_diff_lo, v_diff_lo);
v_x2_firstrow = vmlal_s16(v_x2_firstrow, v_diff_hi, v_diff_hi);
}
-#if defined(__aarch64__)
+#if AOM_ARCH_AARCH64
x_firstrow += vaddvq_s32(v_x_firstrow);
x2_firstrow += vaddvq_s32(v_x2_firstrow);
#else
diff --git a/av1/encoder/arm/neon/reconinter_enc_neon.c b/av1/encoder/arm/neon/reconinter_enc_neon.c
new file mode 100644
index 0000000..e5975b0
--- /dev/null
+++ b/av1/encoder/arm/neon/reconinter_enc_neon.c
@@ -0,0 +1,140 @@
+/*
+ * Copyright (c) 2023, Alliance for Open Media. All rights reserved
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#include <arm_neon.h>
+#include <assert.h>
+
+#include "config/aom_config.h"
+#include "config/aom_dsp_rtcd.h"
+
+#include "aom_dsp/arm/mem_neon.h"
+
+#include "av1/encoder/reconinter_enc.h"
+
+void aom_upsampled_pred_neon(MACROBLOCKD *xd, const AV1_COMMON *const cm,
+ int mi_row, int mi_col, const MV *const mv,
+ uint8_t *comp_pred, int width, int height,
+ int subpel_x_q3, int subpel_y_q3,
+ const uint8_t *ref, int ref_stride,
+ int subpel_search) {
+ // expect xd == NULL only in tests
+ if (xd != NULL) {
+ const MB_MODE_INFO *mi = xd->mi[0];
+ const int ref_num = 0;
+ const int is_intrabc = is_intrabc_block(mi);
+ const struct scale_factors *const sf =
+ is_intrabc ? &cm->sf_identity : xd->block_ref_scale_factors[ref_num];
+ const int is_scaled = av1_is_scaled(sf);
+
+ if (is_scaled) {
+ int plane = 0;
+ const int mi_x = mi_col * MI_SIZE;
+ const int mi_y = mi_row * MI_SIZE;
+ const struct macroblockd_plane *const pd = &xd->plane[plane];
+ const struct buf_2d *const dst_buf = &pd->dst;
+ const struct buf_2d *const pre_buf =
+ is_intrabc ? dst_buf : &pd->pre[ref_num];
+
+ InterPredParams inter_pred_params;
+ inter_pred_params.conv_params = get_conv_params(0, plane, xd->bd);
+ const int_interpfilters filters =
+ av1_broadcast_interp_filter(EIGHTTAP_REGULAR);
+ av1_init_inter_params(
+ &inter_pred_params, width, height, mi_y >> pd->subsampling_y,
+ mi_x >> pd->subsampling_x, pd->subsampling_x, pd->subsampling_y,
+ xd->bd, is_cur_buf_hbd(xd), is_intrabc, sf, pre_buf, filters);
+ av1_enc_build_one_inter_predictor(comp_pred, width, mv,
+ &inter_pred_params);
+ return;
+ }
+ }
+
+ const InterpFilterParams *filter_params = av1_get_filter(subpel_search);
+
+ if (!subpel_x_q3 && !subpel_y_q3) {
+ if (width > 8) {
+ assert(width % 16 == 0);
+ int i = height;
+ do {
+ int j = 0;
+ do {
+ uint8x16_t r = vld1q_u8(ref + j);
+ vst1q_u8(comp_pred + j, r);
+ j += 16;
+ } while (j < width);
+ ref += ref_stride;
+ comp_pred += width;
+ } while (--i != 0);
+ } else if (width == 8) {
+ int i = height;
+ do {
+ uint8x8_t r = vld1_u8(ref);
+ vst1_u8(comp_pred, r);
+ ref += ref_stride;
+ comp_pred += width;
+ } while (--i != 0);
+ } else {
+ assert(width == 4);
+ int i = height / 2;
+ do {
+ uint8x8_t r = load_unaligned_u8(ref, ref_stride);
+ vst1_u8(comp_pred, r);
+ ref += 2 * ref_stride;
+ comp_pred += 2 * width;
+ } while (--i != 0);
+ }
+ } else if (!subpel_y_q3) {
+ const int16_t *const filter_x =
+ av1_get_interp_filter_subpel_kernel(filter_params, subpel_x_q3 << 1);
+ aom_convolve8_horiz_neon(ref, ref_stride, comp_pred, width, filter_x, 16,
+ NULL, -1, width, height);
+ } else if (!subpel_x_q3) {
+ const int16_t *const filter_y =
+ av1_get_interp_filter_subpel_kernel(filter_params, subpel_y_q3 << 1);
+ aom_convolve8_vert_neon(ref, ref_stride, comp_pred, width, NULL, -1,
+ filter_y, 16, width, height);
+ } else {
+ DECLARE_ALIGNED(16, uint8_t,
+ im_block[((MAX_SB_SIZE * 2 + 16) + 16) * MAX_SB_SIZE]);
+
+ const int16_t *const filter_x =
+ av1_get_interp_filter_subpel_kernel(filter_params, subpel_x_q3 << 1);
+ const int16_t *const filter_y =
+ av1_get_interp_filter_subpel_kernel(filter_params, subpel_y_q3 << 1);
+
+ const int im_stride = MAX_SB_SIZE;
+ const int im_height = (((height - 1) * 8 + subpel_y_q3) >> 3) + SUBPEL_TAPS;
+
+ const int ref_vert_offset = ref_stride * ((SUBPEL_TAPS >> 1) - 1);
+ const int im_vert_offset = im_stride * ((filter_params->taps >> 1) - 1);
+
+ assert(im_height <= (MAX_SB_SIZE * 2 + 16) + 16);
+ aom_convolve8_horiz_neon(ref - ref_vert_offset, ref_stride, im_block,
+ MAX_SB_SIZE, filter_x, 16, NULL, -1, width,
+ im_height);
+ aom_convolve8_vert_neon(im_block + im_vert_offset, MAX_SB_SIZE, comp_pred,
+ width, NULL, -1, filter_y, 16, width, height);
+ }
+}
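
A hedged note on the two-pass subpel path above: the horizontal pass starts (SUBPEL_TAPS / 2 - 1) rows above the block so the vertical pass has the taps it needs, and the intermediate height rounds the last fractional row up before adding the tap context:

    // SUBPEL_TAPS is 8 in libaom.
    static inline int upsampled_im_height(int height, int subpel_y_q3) {
      return (((height - 1) * 8 + subpel_y_q3) >> 3) + 8;
    }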
+
+void aom_comp_avg_upsampled_pred_neon(MACROBLOCKD *xd,
+ const AV1_COMMON *const cm, int mi_row,
+ int mi_col, const MV *const mv,
+ uint8_t *comp_pred, const uint8_t *pred,
+ int width, int height, int subpel_x_q3,
+ int subpel_y_q3, const uint8_t *ref,
+ int ref_stride, int subpel_search) {
+ aom_upsampled_pred_neon(xd, cm, mi_row, mi_col, mv, comp_pred, width, height,
+ subpel_x_q3, subpel_y_q3, ref, ref_stride,
+ subpel_search);
+
+ aom_comp_avg_pred_neon(comp_pred, pred, width, height, comp_pred, width);
+}
diff --git a/av1/encoder/arm/neon/temporal_filter_neon.c b/av1/encoder/arm/neon/temporal_filter_neon.c
index cae44f9..163768b 100644
--- a/av1/encoder/arm/neon/temporal_filter_neon.c
+++ b/av1/encoder/arm/neon/temporal_filter_neon.c
@@ -11,16 +11,18 @@
#include <arm_neon.h>
+#include "config/aom_config.h"
#include "config/av1_rtcd.h"
#include "av1/encoder/encoder.h"
#include "av1/encoder/temporal_filter.h"
+#include "aom_dsp/mathutils.h"
#include "aom_dsp/arm/mem_neon.h"
#include "aom_dsp/arm/sum_neon.h"
// For the squared error buffer, add padding for 4 samples.
#define SSE_STRIDE (BW + 4)
-#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
+#if AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD)
// clang-format off
@@ -57,16 +59,18 @@
} while (i < block_height);
}
-static INLINE uint8x16_t load_and_pad(uint8_t *src, const uint32_t col,
+static INLINE uint8x16_t load_and_pad(const uint8_t *src, const uint32_t col,
const uint32_t block_width) {
uint8x8_t s = vld1_u8(src);
if (col == 0) {
- s[0] = s[2];
- s[1] = s[2];
+ const uint8_t lane2 = vget_lane_u8(s, 2);
+ s = vset_lane_u8(lane2, s, 0);
+ s = vset_lane_u8(lane2, s, 1);
} else if (col >= block_width - 4) {
- s[6] = s[5];
- s[7] = s[5];
+ const uint8_t lane5 = vget_lane_u8(s, 5);
+ s = vset_lane_u8(lane5, s, 6);
+ s = vset_lane_u8(lane5, s, 7);
}
return vcombine_u8(s, s);
}
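
The load_and_pad rewrites here (and in the u16 variant below) replace the s[0] = s[2] subscript form, a GCC/Clang vector extension, with the portable lane intrinsics, which take compile-time lane indices. The idiom in isolation:

    static inline uint8x8_t pad_left_edge(uint8x8_t s) {
      // Duplicate lane 2 into lanes 0 and 1, as the col == 0 branch does.
      const uint8_t lane2 = vget_lane_u8(s, 2);
      s = vset_lane_u8(lane2, s, 0);
      return vset_lane_u8(lane2, s, 1);
    }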
@@ -74,10 +78,10 @@
static void apply_temporal_filter(
const uint8_t *frame, const unsigned int stride, const uint32_t block_width,
const uint32_t block_height, const int *subblock_mses,
- unsigned int *accumulator, uint16_t *count, uint8_t *frame_abs_diff,
- uint32_t *luma_sse_sum, const double inv_num_ref_pixels,
+ unsigned int *accumulator, uint16_t *count, const uint8_t *frame_abs_diff,
+ const uint32_t *luma_sse_sum, const double inv_num_ref_pixels,
const double decay_factor, const double inv_factor,
- const double weight_factor, double *d_factor) {
+ const double weight_factor, const double *d_factor, int tf_wgt_calc_lvl) {
assert(((block_width == 16) || (block_width == 32)) &&
((block_height == 16) || (block_height == 32)));
@@ -87,11 +91,11 @@
// Traverse 4 columns at a time - first and last two columns need padding.
for (uint32_t col = 0; col < block_width; col += 4) {
uint8x16_t vsrc[5][2];
- uint8_t *src = frame_abs_diff + col;
+ const uint8_t *src = frame_abs_diff + col;
// Load, pad (for first and last two columns) and mask 3 rows from the top.
for (int i = 2; i < 5; i++) {
- uint8x16_t s = load_and_pad(src, col, block_width);
+ const uint8x16_t s = load_and_pad(src, col, block_width);
vsrc[i][0] = vandq_u8(s, vmask.val[0]);
vsrc[i][1] = vandq_u8(s, vmask.val[1]);
src += SSE_STRIDE;
@@ -142,29 +146,54 @@
}
// Perform filtering.
- for (unsigned int i = 0, k = 0; i < block_height; i++) {
- for (unsigned int j = 0; j < block_width; j++, k++) {
- const int pixel_value = frame[i * stride + j];
- uint32_t diff_sse = acc_5x5_neon[i][j] + luma_sse_sum[i * BW + j];
+ if (tf_wgt_calc_lvl == 0) {
+ for (unsigned int i = 0, k = 0; i < block_height; i++) {
+ for (unsigned int j = 0; j < block_width; j++, k++) {
+ const int pixel_value = frame[i * stride + j];
+ const uint32_t diff_sse = acc_5x5_neon[i][j] + luma_sse_sum[i * BW + j];
- const double window_error = diff_sse * inv_num_ref_pixels;
- const int subblock_idx =
- (i >= block_height / 2) * 2 + (j >= block_width / 2);
- const double block_error = (double)subblock_mses[subblock_idx];
- const double combined_error =
- weight_factor * window_error + block_error * inv_factor;
- // Compute filter weight.
- double scaled_error =
- combined_error * d_factor[subblock_idx] * decay_factor;
- scaled_error = AOMMIN(scaled_error, 7);
- const int weight = (int)(exp(-scaled_error) * TF_WEIGHT_SCALE);
- accumulator[k] += weight * pixel_value;
- count[k] += weight;
+ const double window_error = diff_sse * inv_num_ref_pixels;
+ const int subblock_idx =
+ (i >= block_height / 2) * 2 + (j >= block_width / 2);
+ const double block_error = (double)subblock_mses[subblock_idx];
+ const double combined_error =
+ weight_factor * window_error + block_error * inv_factor;
+ // Compute filter weight.
+ double scaled_error =
+ combined_error * d_factor[subblock_idx] * decay_factor;
+ scaled_error = AOMMIN(scaled_error, 7);
+ const int weight = (int)(exp(-scaled_error) * TF_WEIGHT_SCALE);
+ accumulator[k] += weight * pixel_value;
+ count[k] += weight;
+ }
+ }
+ } else {
+ for (unsigned int i = 0, k = 0; i < block_height; i++) {
+ for (unsigned int j = 0; j < block_width; j++, k++) {
+ const int pixel_value = frame[i * stride + j];
+ const uint32_t diff_sse = acc_5x5_neon[i][j] + luma_sse_sum[i * BW + j];
+
+ const double window_error = diff_sse * inv_num_ref_pixels;
+ const int subblock_idx =
+ (i >= block_height / 2) * 2 + (j >= block_width / 2);
+ const double block_error = (double)subblock_mses[subblock_idx];
+ const double combined_error =
+ weight_factor * window_error + block_error * inv_factor;
+ // Compute filter weight.
+ double scaled_error =
+ combined_error * d_factor[subblock_idx] * decay_factor;
+ scaled_error = AOMMIN(scaled_error, 7);
+ const float fweight =
+ approx_exp((float)-scaled_error) * TF_WEIGHT_SCALE;
+ const int weight = iroundpf(fweight);
+ accumulator[k] += weight * pixel_value;
+ count[k] += weight;
+ }
}
}
}
-#else // !(defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD))
+#else // !(AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD))
// When using vld1q_u16_x4 compilers may insert an alignment hint of 256 bits.
DECLARE_ALIGNED(32, static const uint16_t, kSlidingWindowMask[]) = {
@@ -205,16 +234,18 @@
} while (i < block_height);
}
-static INLINE uint16x8_t load_and_pad(uint16_t *src, const uint32_t col,
+static INLINE uint16x8_t load_and_pad(const uint16_t *src, const uint32_t col,
const uint32_t block_width) {
uint16x8_t s = vld1q_u16(src);
if (col == 0) {
- s[0] = s[2];
- s[1] = s[2];
+ const uint16_t lane2 = vgetq_lane_u16(s, 2);
+ s = vsetq_lane_u16(lane2, s, 0);
+ s = vsetq_lane_u16(lane2, s, 1);
} else if (col >= block_width - 4) {
- s[6] = s[5];
- s[7] = s[5];
+ const uint16_t lane5 = vgetq_lane_u16(s, 5);
+ s = vsetq_lane_u16(lane5, s, 6);
+ s = vsetq_lane_u16(lane5, s, 7);
}
return s;
}
@@ -222,10 +253,10 @@
static void apply_temporal_filter(
const uint8_t *frame, const unsigned int stride, const uint32_t block_width,
const uint32_t block_height, const int *subblock_mses,
- unsigned int *accumulator, uint16_t *count, uint16_t *frame_sse,
- uint32_t *luma_sse_sum, const double inv_num_ref_pixels,
+ unsigned int *accumulator, uint16_t *count, const uint16_t *frame_sse,
+ const uint32_t *luma_sse_sum, const double inv_num_ref_pixels,
const double decay_factor, const double inv_factor,
- const double weight_factor, double *d_factor) {
+ const double weight_factor, const double *d_factor, int tf_wgt_calc_lvl) {
assert(((block_width == 16) || (block_width == 32)) &&
((block_height == 16) || (block_height == 32)));
@@ -235,7 +266,7 @@
// Traverse 4 columns at a time - first and last two columns need padding.
for (uint32_t col = 0; col < block_width; col += 4) {
uint16x8_t vsrc[5];
- uint16_t *src = frame_sse + col;
+ const uint16_t *src = frame_sse + col;
// Load and pad (for first and last two columns) 3 rows from the top.
for (int i = 2; i < 5; i++) {
@@ -273,36 +304,62 @@
}
// Perform filtering.
- for (unsigned int i = 0, k = 0; i < block_height; i++) {
- for (unsigned int j = 0; j < block_width; j++, k++) {
- const int pixel_value = frame[i * stride + j];
- uint32_t diff_sse = acc_5x5_neon[i][j] + luma_sse_sum[i * BW + j];
+ if (tf_wgt_calc_lvl == 0) {
+ for (unsigned int i = 0, k = 0; i < block_height; i++) {
+ for (unsigned int j = 0; j < block_width; j++, k++) {
+ const int pixel_value = frame[i * stride + j];
+ const uint32_t diff_sse = acc_5x5_neon[i][j] + luma_sse_sum[i * BW + j];
- const double window_error = diff_sse * inv_num_ref_pixels;
- const int subblock_idx =
- (i >= block_height / 2) * 2 + (j >= block_width / 2);
- const double block_error = (double)subblock_mses[subblock_idx];
- const double combined_error =
- weight_factor * window_error + block_error * inv_factor;
- // Compute filter weight.
- double scaled_error =
- combined_error * d_factor[subblock_idx] * decay_factor;
- scaled_error = AOMMIN(scaled_error, 7);
- const int weight = (int)(exp(-scaled_error) * TF_WEIGHT_SCALE);
- accumulator[k] += weight * pixel_value;
- count[k] += weight;
+ const double window_error = diff_sse * inv_num_ref_pixels;
+ const int subblock_idx =
+ (i >= block_height / 2) * 2 + (j >= block_width / 2);
+ const double block_error = (double)subblock_mses[subblock_idx];
+ const double combined_error =
+ weight_factor * window_error + block_error * inv_factor;
+ // Compute filter weight.
+ double scaled_error =
+ combined_error * d_factor[subblock_idx] * decay_factor;
+ scaled_error = AOMMIN(scaled_error, 7);
+ const int weight = (int)(exp(-scaled_error) * TF_WEIGHT_SCALE);
+ accumulator[k] += weight * pixel_value;
+ count[k] += weight;
+ }
+ }
+ } else {
+ for (unsigned int i = 0, k = 0; i < block_height; i++) {
+ for (unsigned int j = 0; j < block_width; j++, k++) {
+ const int pixel_value = frame[i * stride + j];
+ const uint32_t diff_sse = acc_5x5_neon[i][j] + luma_sse_sum[i * BW + j];
+
+ const double window_error = diff_sse * inv_num_ref_pixels;
+ const int subblock_idx =
+ (i >= block_height / 2) * 2 + (j >= block_width / 2);
+ const double block_error = (double)subblock_mses[subblock_idx];
+ const double combined_error =
+ weight_factor * window_error + block_error * inv_factor;
+ // Compute filter weight.
+ double scaled_error =
+ combined_error * d_factor[subblock_idx] * decay_factor;
+ scaled_error = AOMMIN(scaled_error, 7);
+ const float fweight =
+ approx_exp((float)-scaled_error) * TF_WEIGHT_SCALE;
+ const int weight = iroundpf(fweight);
+ accumulator[k] += weight * pixel_value;
+ count[k] += weight;
+ }
}
}
}
-#endif // defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
+#endif // AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD)
void av1_apply_temporal_filter_neon(
const YV12_BUFFER_CONFIG *frame_to_filter, const MACROBLOCKD *mbd,
const BLOCK_SIZE block_size, const int mb_row, const int mb_col,
const int num_planes, const double *noise_levels, const MV *subblock_mvs,
const int *subblock_mses, const int q_factor, const int filter_strength,
- const uint8_t *pred, uint32_t *accum, uint16_t *count) {
+ int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum,
+ uint16_t *count) {
const int is_high_bitdepth = frame_to_filter->flags & YV12_FLAG_HIGHBITDEPTH;
assert(block_size == BLOCK_32X32 && "Only support 32x32 block with Neon!");
assert(TF_WINDOW_LENGTH == 5 && "Only support window length 5 with Neon!");
@@ -336,11 +393,11 @@
double s_decay = pow((double)filter_strength / TF_STRENGTH_THRESHOLD, 2);
s_decay = CLIP(s_decay, 1e-5, 1);
double d_factor[4] = { 0 };
-#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
+#if AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD)
uint8_t frame_abs_diff[SSE_STRIDE * BH] = { 0 };
-#else // !(defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD))
+#else // !(AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD))
uint16_t frame_sse[SSE_STRIDE * BH] = { 0 };
-#endif // defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
+#endif // AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD)
uint32_t luma_sse_sum[BW * BH] = { 0 };
for (int subblock_idx = 0; subblock_idx < 4; subblock_idx++) {
@@ -379,7 +436,7 @@
// search is only done on Y-plane, so the information from Y-plane
// will be more accurate. The luma sse sum is reused in both chroma
// planes.
-#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
+#if AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD)
if (plane == AOM_PLANE_U) {
for (unsigned int i = 0; i < plane_h; i++) {
for (unsigned int j = 0; j < plane_w; j++) {
@@ -403,8 +460,8 @@
subblock_mses, accum + plane_offset,
count + plane_offset, frame_abs_diff, luma_sse_sum,
inv_num_ref_pixels, decay_factor, inv_factor,
- weight_factor, d_factor);
-#else // !(defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD))
+ weight_factor, d_factor, tf_wgt_calc_lvl);
+#else // !(AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD))
if (plane == AOM_PLANE_U) {
for (unsigned int i = 0; i < plane_h; i++) {
for (unsigned int j = 0; j < plane_w; j++) {
@@ -422,11 +479,12 @@
get_squared_error(ref, frame_stride, pred + plane_offset, plane_w, plane_w,
plane_h, frame_sse, SSE_STRIDE);
- apply_temporal_filter(
- pred + plane_offset, plane_w, plane_w, plane_h, subblock_mses,
- accum + plane_offset, count + plane_offset, frame_sse, luma_sse_sum,
- inv_num_ref_pixels, decay_factor, inv_factor, weight_factor, d_factor);
-#endif // defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
+ apply_temporal_filter(pred + plane_offset, plane_w, plane_w, plane_h,
+ subblock_mses, accum + plane_offset,
+ count + plane_offset, frame_sse, luma_sse_sum,
+ inv_num_ref_pixels, decay_factor, inv_factor,
+ weight_factor, d_factor, tf_wgt_calc_lvl);
+#endif // AOM_ARCH_AARCH64 && defined(__ARM_FEATURE_DOTPROD)
plane_offset += plane_h * plane_w;
}
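
The new tf_wgt_calc_lvl parameter threaded through above selects between the exact double-precision weight and a faster float approximation; both clamp scaled_error to 7 first. The two paths side by side (approx_exp comes from aom_dsp/mathutils.h; iroundpf is libaom's round-float-to-int helper; TF_WEIGHT_SCALE from the temporal filter):

    #include <math.h>

    static inline int tf_weight(double scaled_error, int tf_wgt_calc_lvl) {
      if (tf_wgt_calc_lvl == 0)
        return (int)(exp(-scaled_error) * TF_WEIGHT_SCALE);
      // Level 1: float approximation of exp() plus round-to-nearest int.
      return iroundpf(approx_exp((float)-scaled_error) * TF_WEIGHT_SCALE);
    }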
diff --git a/av1/encoder/av1_fwd_txfm2d.c b/av1/encoder/av1_fwd_txfm2d.c
index bcb829d..12a9535 100644
--- a/av1/encoder/av1_fwd_txfm2d.c
+++ b/av1/encoder/av1_fwd_txfm2d.c
@@ -105,19 +105,24 @@
}
}
+ DECLARE_ALIGNED(16, int32_t, row_buffer[MAX_TX_SIZE]);
+
// Rows
for (r = 0; r < txfm_size_row; ++r) {
- txfm_func_row(buf + r * txfm_size_col, output + r * txfm_size_col,
- cos_bit_row, stage_range_row);
- av1_round_shift_array(output + r * txfm_size_col, txfm_size_col, -shift[2]);
+ txfm_func_row(buf + r * txfm_size_col, row_buffer, cos_bit_row,
+ stage_range_row);
+ av1_round_shift_array(row_buffer, txfm_size_col, -shift[2]);
if (abs(rect_type) == 1) {
// Multiply everything by Sqrt2 if the transform is rectangular and the
// size difference is a factor of 2.
for (c = 0; c < txfm_size_col; ++c) {
- output[r * txfm_size_col + c] = round_shift(
- (int64_t)output[r * txfm_size_col + c] * NewSqrt2, NewSqrt2Bits);
+ row_buffer[c] =
+ round_shift((int64_t)row_buffer[c] * NewSqrt2, NewSqrt2Bits);
}
}
+ for (c = 0; c < txfm_size_col; ++c) {
+ output[c * txfm_size_row + r] = row_buffer[c];
+ }
}
}
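
The C reference change above mirrors the NEON one: row results are staged in row_buffer and then written transposed, output[c * txfm_size_row + r], so the coefficient buffer comes out column-major and no separate output transpose pass is needed. The store step in isolation:

    static inline void store_row_transposed(int32_t *output, const int32_t *row,
                                            int r, int rows, int cols) {
      for (int c = 0; c < cols; ++c) output[c * rows + r] = row[c];
    }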
@@ -241,14 +246,14 @@
fwd_txfm2d_c(input, output, stride, &cfg, txfm_buf, bd);
// Zero out top-right 32x32 area.
- for (int row = 0; row < 32; ++row) {
- memset(output + row * 64 + 32, 0, 32 * sizeof(*output));
+ for (int col = 0; col < 32; ++col) {
+ memset(output + col * 64 + 32, 0, 32 * sizeof(*output));
}
// Zero out the bottom 64x32 area.
memset(output + 32 * 64, 0, 32 * 64 * sizeof(*output));
// Re-pack non-zero coeffs in the first 32x32 indices.
- for (int row = 1; row < 32; ++row) {
- memcpy(output + row * 32, output + row * 64, 32 * sizeof(*output));
+ for (int col = 1; col < 32; ++col) {
+ memcpy(output + col * 32, output + col * 64, 32 * sizeof(*output));
}
}
@@ -258,9 +263,14 @@
TXFM_2D_FLIP_CFG cfg;
av1_get_fwd_txfm_cfg(tx_type, TX_32X64, &cfg);
fwd_txfm2d_c(input, output, stride, &cfg, txfm_buf, bd);
- // Zero out the bottom 32x32 area.
- memset(output + 32 * 32, 0, 32 * 32 * sizeof(*output));
- // Note: no repacking needed here.
+ // Zero out right 32x32 area.
+ for (int col = 0; col < 32; ++col) {
+ memset(output + col * 64 + 32, 0, 32 * sizeof(*output));
+ }
+ // Re-pack non-zero coeffs in the first 32x32 indices.
+ for (int col = 1; col < 32; ++col) {
+ memcpy(output + col * 32, output + col * 64, 32 * sizeof(*output));
+ }
}
void av1_fwd_txfm2d_64x32_c(const int16_t *input, int32_t *output, int stride,
@@ -269,15 +279,9 @@
TXFM_2D_FLIP_CFG cfg;
av1_get_fwd_txfm_cfg(tx_type, TX_64X32, &cfg);
fwd_txfm2d_c(input, output, stride, &cfg, txfm_buf, bd);
-
- // Zero out right 32x32 area.
- for (int row = 0; row < 32; ++row) {
- memset(output + row * 64 + 32, 0, 32 * sizeof(*output));
- }
- // Re-pack non-zero coeffs in the first 32x32 indices.
- for (int row = 1; row < 32; ++row) {
- memcpy(output + row * 32, output + row * 64, 32 * sizeof(*output));
- }
+ // Zero out the bottom 32x32 area.
+ memset(output + 32 * 32, 0, 32 * 32 * sizeof(*output));
+ // Note: no repacking needed here.
}
void av1_fwd_txfm2d_16x64_c(const int16_t *input, int32_t *output, int stride,
@@ -286,17 +290,6 @@
TXFM_2D_FLIP_CFG cfg;
av1_get_fwd_txfm_cfg(tx_type, TX_16X64, &cfg);
fwd_txfm2d_c(input, output, stride, &cfg, txfm_buf, bd);
- // Zero out the bottom 16x32 area.
- memset(output + 16 * 32, 0, 16 * 32 * sizeof(*output));
- // Note: no repacking needed here.
-}
-
-void av1_fwd_txfm2d_64x16_c(const int16_t *input, int32_t *output, int stride,
- TX_TYPE tx_type, int bd) {
- int32_t txfm_buf[64 * 16];
- TXFM_2D_FLIP_CFG cfg;
- av1_get_fwd_txfm_cfg(tx_type, TX_64X16, &cfg);
- fwd_txfm2d_c(input, output, stride, &cfg, txfm_buf, bd);
// Zero out right 32x16 area.
for (int row = 0; row < 16; ++row) {
memset(output + row * 64 + 32, 0, 32 * sizeof(*output));
@@ -307,6 +300,17 @@
}
}
+void av1_fwd_txfm2d_64x16_c(const int16_t *input, int32_t *output, int stride,
+ TX_TYPE tx_type, int bd) {
+ int32_t txfm_buf[64 * 16];
+ TXFM_2D_FLIP_CFG cfg;
+ av1_get_fwd_txfm_cfg(tx_type, TX_64X16, &cfg);
+ fwd_txfm2d_c(input, output, stride, &cfg, txfm_buf, bd);
+ // Zero out the bottom 16x32 area.
+ memset(output + 16 * 32, 0, 16 * 32 * sizeof(*output));
+ // Note: no repacking needed here.
+}
+
static const int8_t fwd_shift_4x4[3] = { 2, 0, 0 };
static const int8_t fwd_shift_8x8[3] = { 2, -1, 0 };
static const int8_t fwd_shift_16x16[3] = { 2, -2, 0 };
@@ -369,16 +373,6 @@
static const int8_t fidtx16_range_mult2[1] = { 3 };
static const int8_t fidtx32_range_mult2[1] = { 4 };
-#if 0
-const int8_t fwd_idtx_range_row[MAX_TXWH_IDX /*txw_idx*/]
- [MAX_TXWH_IDX /*txh_idx*/] = { { 2, 4, 5, 0, 0 },
- { 3, 4, 5, 6, 0 },
- { 4, 5, 6, 7, 8 },
- { 0, 5, 6, 7, 8 },
- { 0, 0, 7, 8,
- 9 } };
-#endif
-
static const int8_t *fwd_txfm_range_mult2_list[TXFM_TYPES] = {
fdct4_range_mult2, fdct8_range_mult2, fdct16_range_mult2,
fdct32_range_mult2, fdct64_range_mult2, fadst4_range_mult2,
@@ -390,22 +384,20 @@
av1_zero(cfg->stage_range_col);
av1_zero(cfg->stage_range_row);
- const int8_t *range_mult2_col = fwd_txfm_range_mult2_list[cfg->txfm_type_col];
- if (cfg->txfm_type_col != TXFM_TYPE_INVALID) {
- int stage_num_col = cfg->stage_num_col;
- for (int i = 0; i < stage_num_col; ++i)
- cfg->stage_range_col[i] = (range_mult2_col[i] + 1) >> 1;
- }
+ const int8_t *const range_mult2_col =
+ fwd_txfm_range_mult2_list[cfg->txfm_type_col];
+ const int stage_num_col = cfg->stage_num_col;
+ // i < MAX_TXFM_STAGE_NUM will quiet -Wstringop-overflow.
+ for (int i = 0; i < stage_num_col && i < MAX_TXFM_STAGE_NUM; ++i)
+ cfg->stage_range_col[i] = (range_mult2_col[i] + 1) >> 1;
- if (cfg->txfm_type_row != TXFM_TYPE_INVALID) {
- int stage_num_row = cfg->stage_num_row;
- const int8_t *range_mult2_row =
- fwd_txfm_range_mult2_list[cfg->txfm_type_row];
- for (int i = 0; i < stage_num_row; ++i) {
- cfg->stage_range_row[i] =
- (range_mult2_col[cfg->stage_num_col - 1] + range_mult2_row[i] + 1) >>
- 1;
- }
+ const int8_t *const range_mult2_row =
+ fwd_txfm_range_mult2_list[cfg->txfm_type_row];
+ const int stage_num_row = cfg->stage_num_row;
+ // i < MAX_TXFM_STAGE_NUM will quiet -Wstringop-overflow.
+ for (int i = 0; i < stage_num_row && i < MAX_TXFM_STAGE_NUM; ++i) {
+ cfg->stage_range_row[i] =
+ (range_mult2_col[stage_num_col - 1] + range_mult2_row[i] + 1) >> 1;
}
}
@@ -422,7 +414,9 @@
cfg->cos_bit_col = av1_fwd_cos_bit_col[txw_idx][txh_idx];
cfg->cos_bit_row = av1_fwd_cos_bit_row[txw_idx][txh_idx];
cfg->txfm_type_col = av1_txfm_type_ls[txh_idx][tx_type_1d_col];
+ assert(cfg->txfm_type_col != TXFM_TYPE_INVALID);
cfg->txfm_type_row = av1_txfm_type_ls[txw_idx][tx_type_1d_row];
+ assert(cfg->txfm_type_row != TXFM_TYPE_INVALID);
cfg->stage_num_col = av1_txfm_stage_num_list[cfg->txfm_type_col];
cfg->stage_num_row = av1_txfm_stage_num_list[cfg->txfm_type_row];
set_fwd_txfm_non_scale_range(cfg);
diff --git a/av1/encoder/av1_quantize.c b/av1/encoder/av1_quantize.c
index 97652cf..1aad473 100644
--- a/av1/encoder/av1_quantize.c
+++ b/av1/encoder/av1_quantize.c
@@ -673,15 +673,38 @@
}
}
+static INLINE bool deltaq_params_have_changed(
+ const DeltaQuantParams *prev_deltaq_params,
+ const CommonQuantParams *quant_params) {
+ return (prev_deltaq_params->y_dc_delta_q != quant_params->y_dc_delta_q ||
+ prev_deltaq_params->u_dc_delta_q != quant_params->u_dc_delta_q ||
+ prev_deltaq_params->v_dc_delta_q != quant_params->v_dc_delta_q ||
+ prev_deltaq_params->u_ac_delta_q != quant_params->u_ac_delta_q ||
+ prev_deltaq_params->v_ac_delta_q != quant_params->v_ac_delta_q);
+}
+
void av1_init_quantizer(EncQuantDequantParams *const enc_quant_dequant_params,
const CommonQuantParams *quant_params,
aom_bit_depth_t bit_depth) {
+ DeltaQuantParams *const prev_deltaq_params =
+ &enc_quant_dequant_params->prev_deltaq_params;
+
+ // Re-initialize the quantizer only if any of the dc/ac deltaq parameters
+ // change.
+ if (!deltaq_params_have_changed(prev_deltaq_params, quant_params)) return;
QUANTS *const quants = &enc_quant_dequant_params->quants;
Dequants *const dequants = &enc_quant_dequant_params->dequants;
av1_build_quantizer(bit_depth, quant_params->y_dc_delta_q,
quant_params->u_dc_delta_q, quant_params->u_ac_delta_q,
quant_params->v_dc_delta_q, quant_params->v_ac_delta_q,
quants, dequants);
+
+ // Record the state of deltaq parameters.
+ prev_deltaq_params->y_dc_delta_q = quant_params->y_dc_delta_q;
+ prev_deltaq_params->u_dc_delta_q = quant_params->u_dc_delta_q;
+ prev_deltaq_params->v_dc_delta_q = quant_params->v_dc_delta_q;
+ prev_deltaq_params->u_ac_delta_q = quant_params->u_ac_delta_q;
+ prev_deltaq_params->v_ac_delta_q = quant_params->v_ac_delta_q;
}
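
The early return added above turns av1_init_quantizer into a cached operation: the full quantizer tables are rebuilt only when one of the five deltaq values differs from the recorded state. The pattern, reduced to a hypothetical two-field struct:

    typedef struct { int a, b; } Params;

    static void maybe_rebuild(Params *prev, const Params *cur) {
      if (prev->a == cur->a && prev->b == cur->b) return;  // nothing changed
      /* ... expensive rebuild using *cur ... */
      *prev = *cur;  // record state for the next call
    }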
void av1_set_q_index(const EncQuantDequantParams *enc_quant_dequant_params,
diff --git a/av1/encoder/av1_quantize.h b/av1/encoder/av1_quantize.h
index 701e4cf..0409733 100644
--- a/av1/encoder/av1_quantize.h
+++ b/av1/encoder/av1_quantize.h
@@ -81,11 +81,24 @@
v_dequant_QTX[QINDEX_RANGE][8]); // 8: SIMD width
} Dequants;
+// The DeltaQuantParams structure holds the dc/ac deltaq parameters.
+typedef struct {
+ int y_dc_delta_q;
+ int u_dc_delta_q;
+ int u_ac_delta_q;
+ int v_dc_delta_q;
+ int v_ac_delta_q;
+} DeltaQuantParams;
+
typedef struct {
// Quantization parameters for internal quantizer setup.
QUANTS quants;
// Dequantization parameters for internal quantizer setup.
Dequants dequants;
+ // Deltaq parameters to track the state of the dc/ac deltaq parameters in
+ // cm->quant_params. It is used to decide whether the quantizer tables need
+ // to be re-initialized.
+ DeltaQuantParams prev_deltaq_params;
} EncQuantDequantParams;
struct AV1_COMP;
diff --git a/av1/encoder/av1_temporal_denoiser.c b/av1/encoder/av1_temporal_denoiser.c
index 87ae763..3012df6 100644
--- a/av1/encoder/av1_temporal_denoiser.c
+++ b/av1/encoder/av1_temporal_denoiser.c
@@ -489,7 +489,7 @@
&denoiser->running_avg_y[fb_idx], cm->width, cm->height,
cm->seq_params->subsampling_x, cm->seq_params->subsampling_y,
cm->seq_params->use_highbitdepth, AOM_BORDER_IN_PIXELS,
- cm->features.byte_alignment, 0);
+ cm->features.byte_alignment, 0, 0);
if (fail) {
av1_denoiser_free(denoiser);
return 1;
@@ -577,7 +577,7 @@
fail = aom_alloc_frame_buffer(
&denoiser->running_avg_y[i + denoiser->num_ref_frames * layer],
denoise_width, denoise_height, ssx, ssy, use_highbitdepth, border,
- legacy_byte_alignment, 0);
+ legacy_byte_alignment, 0, 0);
if (fail) {
av1_denoiser_free(denoiser);
return 1;
@@ -589,7 +589,7 @@
fail = aom_alloc_frame_buffer(
&denoiser->mc_running_avg_y[layer], denoise_width, denoise_height, ssx,
- ssy, use_highbitdepth, border, legacy_byte_alignment, 0);
+ ssy, use_highbitdepth, border, legacy_byte_alignment, 0, 0);
if (fail) {
av1_denoiser_free(denoiser);
return 1;
@@ -600,7 +600,7 @@
// layer.
fail = aom_alloc_frame_buffer(&denoiser->last_source, width, height, ssx, ssy,
use_highbitdepth, border, legacy_byte_alignment,
- 0);
+ 0, 0);
if (fail) {
av1_denoiser_free(denoiser);
return 1;
diff --git a/av1/encoder/bitstream.c b/av1/encoder/bitstream.c
index 4f85307..39aa027 100644
--- a/av1/encoder/bitstream.c
+++ b/av1/encoder/bitstream.c
@@ -2795,7 +2795,7 @@
}
// Check whether all references are distinct frames.
- const RefCntBuffer *seen_bufs[FRAME_BUFFERS] = { NULL };
+ const RefCntBuffer *seen_bufs[INTER_REFS_PER_FRAME] = { NULL };
int num_refs = 0;
for (int ref_frame = LAST_FRAME; ref_frame <= ALTREF_FRAME; ++ref_frame) {
const RefCntBuffer *const buf = get_ref_frame_buf(cm, ref_frame);
diff --git a/av1/encoder/block.h b/av1/encoder/block.h
index 4185798..360b9d4 100644
--- a/av1/encoder/block.h
+++ b/av1/encoder/block.h
@@ -42,6 +42,35 @@
/*! Maximum value taken by transform type probabilities */
#define MAX_TX_TYPE_PROB 1024
+
+//! Compute color sensitivity index for given plane
+#define COLOR_SENS_IDX(plane) ((plane)-1)
+
+//! Enable timer statistics of mode search in non-rd
+#define COLLECT_NONRD_PICK_MODE_STAT 0
+
+/*!\cond */
+#if COLLECT_NONRD_PICK_MODE_STAT
+#include "aom_ports/aom_timer.h"
+
+typedef struct _mode_search_stat_nonrd {
+ int32_t num_blocks[BLOCK_SIZES];
+ int64_t total_block_times[BLOCK_SIZES];
+ int32_t num_searches[BLOCK_SIZES][MB_MODE_COUNT];
+ int32_t num_nonskipped_searches[BLOCK_SIZES][MB_MODE_COUNT];
+ int64_t search_times[BLOCK_SIZES][MB_MODE_COUNT];
+ int64_t nonskipped_search_times[BLOCK_SIZES][MB_MODE_COUNT];
+ int64_t ms_time[BLOCK_SIZES][MB_MODE_COUNT];
+ int64_t ifs_time[BLOCK_SIZES][MB_MODE_COUNT];
+ int64_t model_rd_time[BLOCK_SIZES][MB_MODE_COUNT];
+ int64_t txfm_time[BLOCK_SIZES][MB_MODE_COUNT];
+ struct aom_usec_timer timer1;
+ struct aom_usec_timer timer2;
+ struct aom_usec_timer bsize_timer;
+} mode_search_stat_nonrd;
+#endif // COLLECT_NONRD_PICK_MODE_STAT
+/*!\endcond */
+
/*! \brief Superblock level encoder info
*
* SuperblockEnc stores superblock level information used by the encoder for
@@ -1286,11 +1315,13 @@
* Used in REALTIME coding mode to enhance the visual quality at the boundary
* of moving color objects.
*/
- uint8_t color_sensitivity_sb[2];
+ uint8_t color_sensitivity_sb[MAX_MB_PLANE - 1];
//! Color sensitivity flag for the superblock for golden reference.
- uint8_t color_sensitivity_sb_g[2];
+ uint8_t color_sensitivity_sb_g[MAX_MB_PLANE - 1];
+ //! Color sensitivity flag for the superblock for altref reference.
+ uint8_t color_sensitivity_sb_alt[MAX_MB_PLANE - 1];
//! Color sensitivity flag for the coding block.
- uint8_t color_sensitivity[2];
+ uint8_t color_sensitivity[MAX_MB_PLANE - 1];
/**@}*/
/*****************************************************************************
@@ -1326,6 +1357,15 @@
/*! \brief A hash to make sure av1_set_offsets is called */
SetOffsetsLoc last_set_offsets_loc;
#endif // NDEBUG
+
+#if COLLECT_NONRD_PICK_MODE_STAT
+ mode_search_stat_nonrd ms_stat_nonrd;
+#endif // COLLECT_NONRD_PICK_MODE_STAT
+
+  /*!\brief Number of pixels in the current thread that choose palette mode
+   * in the fast encoding stage for screen content tool determination.
+ */
+ int palette_pixels;
} MACROBLOCK;
#undef SINGLE_REF_MODES
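
A short usage sketch of the new indexing convention: the color sensitivity arrays hold one flag per chroma plane, and COLOR_SENS_IDX maps the AOM plane ids (AOM_PLANE_U == 1, AOM_PLANE_V == 2) onto indices 0 and 1:

    // Sketch only: mark both chroma planes of the current block as color
    // sensitive for a MACROBLOCK *x.
    for (int plane = AOM_PLANE_U; plane <= AOM_PLANE_V; ++plane) {
      x->color_sensitivity[COLOR_SENS_IDX(plane)] = 1;
    }
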
diff --git a/av1/encoder/cnn.c b/av1/encoder/cnn.c
index 639922f..28e1f71 100644
--- a/av1/encoder/cnn.c
+++ b/av1/encoder/cnn.c
@@ -1193,40 +1193,3 @@
aom_free(input_);
return success;
}
-
-// Assume output already has proper allocation
-// Assume input image buffers all have same resolution and strides
-bool av1_cnn_predict_img(uint8_t **dgd, int width, int height, int stride,
- const CNN_CONFIG *cnn_config,
- const CNN_THREAD_DATA *thread_data, float **output,
- int out_stride) {
- int out_width = 0, out_height = 0, out_channels = 0;
- av1_find_cnn_output_size(width, height, cnn_config, &out_width, &out_height,
- &out_channels);
- const int output_chs[1] = { out_channels };
- const int output_strides[1] = { out_stride };
- CNN_MULTI_OUT output_struct = { .output_channels = output_chs,
- .output_strides = output_strides,
- .output_buffer = output };
- return av1_cnn_predict_img_multi_out(dgd, width, height, stride, cnn_config,
- thread_data, &output_struct);
-}
-
-// Assume output already has proper allocation
-// Assume input image buffers all have same resolution and strides
-bool av1_cnn_predict_img_highbd(uint16_t **dgd, int width, int height,
- int stride, const CNN_CONFIG *cnn_config,
- const CNN_THREAD_DATA *thread_data,
- int bit_depth, float **output, int out_stride) {
- int out_width = 0, out_height = 0, out_channels = 0;
- av1_find_cnn_output_size(width, height, cnn_config, &out_width, &out_height,
- &out_channels);
- const int output_chs[1] = { out_channels };
- const int output_strides[1] = { out_stride };
- CNN_MULTI_OUT output_struct = { .output_channels = output_chs,
- .output_strides = output_strides,
- .output_buffer = output };
- return av1_cnn_predict_img_multi_out_highbd(dgd, width, height, stride,
- cnn_config, thread_data,
- bit_depth, &output_struct);
-}
diff --git a/av1/encoder/cnn.h b/av1/encoder/cnn.h
index 1a6c03a..df6401f 100644
--- a/av1/encoder/cnn.h
+++ b/av1/encoder/cnn.h
@@ -9,8 +9,8 @@
* PATENTS file, you can obtain it at www.aomedia.org/license/patent.
*/
-#ifndef AOM_AV1_COMMON_CNN_H_
-#define AOM_AV1_COMMON_CNN_H_
+#ifndef AOM_AV1_ENCODER_CNN_H_
+#define AOM_AV1_ENCODER_CNN_H_
#ifdef __cplusplus
extern "C" {
@@ -184,20 +184,8 @@
const CNN_CONFIG *cnn_config,
const CNN_THREAD_DATA *thread_data,
int bit_depth, CNN_MULTI_OUT *output);
-
-// Prediction functions from set of input image buffers. This function only
-// supports a single output.
-bool av1_cnn_predict_img(uint8_t **dgd, int width, int height, int stride,
- const CNN_CONFIG *cnn_config,
- const CNN_THREAD_DATA *thread_data, float **output,
- int out_stride);
-bool av1_cnn_predict_img_highbd(uint16_t **dgd, int width, int height,
- int stride, const CNN_CONFIG *cnn_config,
- const CNN_THREAD_DATA *thread_data,
- int bit_depth, float **output, int out_stride);
-
#ifdef __cplusplus
} // extern "C"
#endif
-#endif // AOM_AV1_COMMON_CNN_H_
+#endif // AOM_AV1_ENCODER_CNN_H_
diff --git a/av1/encoder/compound_type.c b/av1/encoder/compound_type.c
index 39c505d..1992f23 100644
--- a/av1/encoder/compound_type.c
+++ b/av1/encoder/compound_type.c
@@ -1023,12 +1023,15 @@
const BLOCK_SIZE bsize,
int64_t ref_skip_rd, int mode_rate) {
int eval_txfm = 1;
+ const int txfm_rd_gate_level =
+ get_txfm_rd_gate_level(cpi->sf.inter_sf.txfm_rd_gate_level, bsize,
+ TX_SEARCH_DEFAULT, /*eval_motion_mode=*/0);
// Check if the mode is good enough based on skip rd
- if (cpi->sf.inter_sf.txfm_rd_gate_level) {
+ if (txfm_rd_gate_level) {
int64_t sse_y = compute_sse_plane(x, xd, PLANE_TYPE_Y, bsize);
int64_t skip_rd = RDCOST(x->rdmult, mode_rate, (sse_y << 4));
- eval_txfm = check_txfm_eval(x, bsize, ref_skip_rd, skip_rd,
- cpi->sf.inter_sf.txfm_rd_gate_level, 1);
+ eval_txfm =
+ check_txfm_eval(x, bsize, ref_skip_rd, skip_rd, txfm_rd_gate_level, 1);
}
return eval_txfm;
}
@@ -1104,9 +1107,12 @@
// Check if the mode is good enough based on skip rd
// TODO(nithya): Handle wedge_newmv_search if extending for lower speed
// setting
- if (cpi->sf.inter_sf.txfm_rd_gate_level) {
+ const int txfm_rd_gate_level =
+ get_txfm_rd_gate_level(cpi->sf.inter_sf.txfm_rd_gate_level, bsize,
+ TX_SEARCH_DEFAULT, /*eval_motion_mode=*/0);
+ if (txfm_rd_gate_level) {
int eval_txfm = check_txfm_eval(x, bsize, ref_skip_rd, skip_rd_cur,
- cpi->sf.inter_sf.txfm_rd_gate_level, 1);
+ txfm_rd_gate_level, 1);
if (!eval_txfm) {
*comp_model_rd_cur = INT64_MAX;
return INT64_MAX;
@@ -1300,9 +1306,18 @@
int64_t mode_rd = RDCOST(x->rdmult, rs2 + rd_stats->rate, 0);
if (mode_rd >= ref_best_rd) continue;
+  // Derive the flags to enable/disable the MV refinement process.
+ const int enable_fast_compound_mode_search =
+ cpi->sf.inter_sf.enable_fast_compound_mode_search;
+ const bool skip_mv_refinement_for_avg_distwtd =
+ enable_fast_compound_mode_search == 3 ||
+ (enable_fast_compound_mode_search == 2 && (this_mode != NEW_NEWMV));
+ const bool skip_mv_refinement_for_diffwtd =
+ (!enable_fast_compound_mode_search && cur_type == COMPOUND_DIFFWTD);
+
// Case COMPOUND_AVERAGE and COMPOUND_DISTWTD
if (cur_type < COMPOUND_WEDGE) {
- if (cpi->sf.inter_sf.enable_fast_compound_mode_search == 2) {
+ if (skip_mv_refinement_for_avg_distwtd) {
int rate_sum;
uint8_t tmp_skip_txfm_sb;
int64_t dist_sum, tmp_skip_sse_sb;
@@ -1514,8 +1529,7 @@
mbmi->mv[1] = tmp_mv[1];
tmp_rate_mv = best_rate_mv;
rs2 = best_rs2;
- } else if (!cpi->sf.inter_sf.enable_fast_compound_mode_search &&
- cur_type == COMPOUND_DIFFWTD) {
+ } else if (skip_mv_refinement_for_diffwtd) {
int_mv tmp_mv[2];
int best_mask_index = 0;
rs2 += get_interinter_compound_mask_rate(&x->mode_costs, mbmi);
@@ -1597,20 +1611,24 @@
mbmi->mv[1] = tmp_mv[1];
} else {
// Handle masked compound types
- // Factors to control gating of compound type selection based on best
- // approximate rd so far
- const int max_comp_type_rd_threshold_mul =
- comp_type_rd_threshold_mul[cpi->sf.inter_sf
- .prune_comp_type_by_comp_avg];
- const int max_comp_type_rd_threshold_div =
- comp_type_rd_threshold_div[cpi->sf.inter_sf
- .prune_comp_type_by_comp_avg];
- // Evaluate COMPOUND_WEDGE / COMPOUND_DIFFWTD if approximated cost is
- // within threshold
- int64_t approx_rd = ((*rd / max_comp_type_rd_threshold_div) *
- max_comp_type_rd_threshold_mul);
+ bool eval_masked_comp_type = true;
+ if (*rd != INT64_MAX) {
+ // Factors to control gating of compound type selection based on best
+ // approximate rd so far
+ const int max_comp_type_rd_threshold_mul =
+ comp_type_rd_threshold_mul[cpi->sf.inter_sf
+ .prune_comp_type_by_comp_avg];
+ const int max_comp_type_rd_threshold_div =
+ comp_type_rd_threshold_div[cpi->sf.inter_sf
+ .prune_comp_type_by_comp_avg];
+ // Evaluate COMPOUND_WEDGE / COMPOUND_DIFFWTD if approximated cost is
+ // within threshold
+ const int64_t approx_rd = ((*rd / max_comp_type_rd_threshold_div) *
+ max_comp_type_rd_threshold_mul);
+ if (approx_rd >= ref_best_rd) eval_masked_comp_type = false;
+ }
- if (approx_rd < ref_best_rd) {
+ if (eval_masked_comp_type) {
const int64_t tmp_rd_thresh = AOMMIN(*rd, rd_thresh);
best_rd_cur = masked_compound_type_rd(
cpi, x, cur_mv, bsize, this_mode, &rs2, *rate_mv, orig_dst,
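
The new eval_masked_comp_type guard matters because *rd holds the INT64_MAX sentinel until some compound mode has produced a real rd cost, and scaling that sentinel can overflow. A minimal illustration (the 8 and 11 here are illustrative, not the actual threshold-table contents):

    #include <stdint.h>
    // Signed overflow is undefined behavior in C: with rd == INT64_MAX and
    // any mul > div, the old expression could wrap instead of gating.
    const int64_t rd = INT64_MAX;
    const int64_t approx_rd = (rd / 8) * 11;  // exceeds INT64_MAX
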
diff --git a/av1/encoder/context_tree.c b/av1/encoder/context_tree.c
index f328745..2bd2d7f 100644
--- a/av1/encoder/context_tree.c
+++ b/av1/encoder/context_tree.c
@@ -12,6 +12,7 @@
#include "av1/encoder/context_tree.h"
#include "av1/encoder/encoder.h"
#include "av1/encoder/rd.h"
+#include <assert.h>
void av1_copy_tree_context(PICK_MODE_CONTEXT *dst_ctx,
PICK_MODE_CONTEXT *src_ctx) {
@@ -150,36 +151,11 @@
}
PC_TREE *av1_alloc_pc_tree_node(BLOCK_SIZE bsize) {
- PC_TREE *pc_tree = NULL;
- struct aom_internal_error_info error;
-
- AOM_CHECK_MEM_ERROR(&error, pc_tree, aom_calloc(1, sizeof(*pc_tree)));
+ PC_TREE *pc_tree = aom_calloc(1, sizeof(*pc_tree));
+ if (pc_tree == NULL) return NULL;
pc_tree->partitioning = PARTITION_NONE;
pc_tree->block_size = bsize;
- pc_tree->index = 0;
-
- pc_tree->none = NULL;
- for (int i = 0; i < 2; ++i) {
- pc_tree->horizontal[i] = NULL;
- pc_tree->vertical[i] = NULL;
- }
-
-#if !CONFIG_REALTIME_ONLY
- for (int i = 0; i < 3; ++i) {
- pc_tree->horizontala[i] = NULL;
- pc_tree->horizontalb[i] = NULL;
- pc_tree->verticala[i] = NULL;
- pc_tree->verticalb[i] = NULL;
- }
- for (int i = 0; i < 4; ++i) {
- pc_tree->horizontal4[i] = NULL;
- pc_tree->vertical4[i] = NULL;
- }
-#endif
- for (int i = 0; i < 4; ++i) {
- pc_tree->split[i] = NULL;
- }
return pc_tree;
}
@@ -191,9 +167,45 @@
} while (0)
void av1_free_pc_tree_recursive(PC_TREE *pc_tree, int num_planes, int keep_best,
- int keep_none) {
+ int keep_none,
+ PARTITION_SEARCH_TYPE partition_search_type) {
if (pc_tree == NULL) return;
+ // Avoid freeing of extended partitions as they are not supported when
+ // partition_search_type is VAR_BASED_PARTITION.
+ if (partition_search_type == VAR_BASED_PARTITION && !keep_best &&
+ !keep_none) {
+ FREE_PMC_NODE(pc_tree->none);
+
+ for (int i = 0; i < 2; ++i) {
+ FREE_PMC_NODE(pc_tree->horizontal[i]);
+ FREE_PMC_NODE(pc_tree->vertical[i]);
+ }
+
+#if !defined(NDEBUG) && !CONFIG_REALTIME_ONLY
+ for (int i = 0; i < 3; ++i) {
+ assert(pc_tree->horizontala[i] == NULL);
+ assert(pc_tree->horizontalb[i] == NULL);
+ assert(pc_tree->verticala[i] == NULL);
+ assert(pc_tree->verticalb[i] == NULL);
+ }
+ for (int i = 0; i < 4; ++i) {
+ assert(pc_tree->horizontal4[i] == NULL);
+ assert(pc_tree->vertical4[i] == NULL);
+ }
+#endif
+
+ for (int i = 0; i < 4; ++i) {
+ if (pc_tree->split[i] != NULL) {
+ av1_free_pc_tree_recursive(pc_tree->split[i], num_planes, 0, 0,
+ partition_search_type);
+ pc_tree->split[i] = NULL;
+ }
+ }
+ aom_free(pc_tree);
+ return;
+ }
+
const PARTITION_TYPE partition = pc_tree->partitioning;
if (!keep_none && (!keep_best || (partition != PARTITION_NONE)))
@@ -226,7 +238,8 @@
if (!keep_best || (partition != PARTITION_SPLIT)) {
for (int i = 0; i < 4; ++i) {
if (pc_tree->split[i] != NULL) {
- av1_free_pc_tree_recursive(pc_tree->split[i], num_planes, 0, 0);
+ av1_free_pc_tree_recursive(pc_tree->split[i], num_planes, 0, 0,
+ partition_search_type);
pc_tree->split[i] = NULL;
}
}
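
A sketch of the caller contract that follows from these changes: av1_alloc_pc_tree_node() now reports allocation failure by returning NULL rather than aborting internally, and every free passes the active partition search type (the error handling shown is illustrative):

    PC_TREE *pc_root = av1_alloc_pc_tree_node(sb_size);
    if (pc_root == NULL) {
      aom_internal_error(cm->error, AOM_CODEC_MEM_ERROR,
                         "Failed to allocate PC_TREE");
    }
    // ... partition search ...
    av1_free_pc_tree_recursive(pc_root, av1_num_planes(cm), /*keep_best=*/0,
                               /*keep_none=*/0,
                               cpi->sf.part_sf.partition_search_type);
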
diff --git a/av1/encoder/context_tree.h b/av1/encoder/context_tree.h
index 413535d..78f2076 100644
--- a/av1/encoder/context_tree.h
+++ b/av1/encoder/context_tree.h
@@ -16,6 +16,7 @@
#include "av1/common/blockd.h"
#include "av1/encoder/block.h"
+#include "av1/encoder/speed_features.h"
#ifdef __cplusplus
extern "C" {
@@ -107,7 +108,8 @@
PC_TREE *av1_alloc_pc_tree_node(BLOCK_SIZE bsize);
void av1_free_pc_tree_recursive(PC_TREE *tree, int num_planes, int keep_best,
- int keep_none);
+ int keep_none,
+ PARTITION_SEARCH_TYPE partition_search_type);
PICK_MODE_CONTEXT *av1_alloc_pmc(const struct AV1_COMP *const cpi,
BLOCK_SIZE bsize,
diff --git a/av1/encoder/dwt.c b/av1/encoder/dwt.c
index 5dfbcb6..2fab99d 100644
--- a/av1/encoder/dwt.c
+++ b/av1/encoder/dwt.c
@@ -114,7 +114,7 @@
dyadic_analyze_53_uint8_input(4, 8, 8, input, stride, output, 8, 2, hbd);
}
-int av1_haar_ac_sad(const tran_low_t *output, int bw, int bh, int stride) {
+static int haar_ac_sad(const tran_low_t *output, int bw, int bh, int stride) {
int acsad = 0;
for (int r = 0; r < bh; ++r)
@@ -124,35 +124,12 @@
return acsad;
}
-uint64_t av1_dct_ac_sad(tran_low_t *output, int bw, int bh, int stride) {
- uint64_t acsad = 0;
-
- for (int r = 0; r < bh; ++r)
- for (int c = 0; c < bw; ++c) {
- if (r > 0 || c > 0) acsad += abs(output[r * stride + c]);
- }
-
- return acsad;
-}
-
-uint32_t av1_variance(uint8_t *input, int bw, int bh, int stride) {
- int sum = 0;
- uint32_t sse = 0;
-
- for (int r = 0; r < bh; ++r)
- for (int c = 0; c < bw; ++c) {
- sum += input[r * stride + c];
- sse += input[r * stride + c] * input[r * stride + c];
- }
- return sse - (uint32_t)(((int64_t)sum * sum) / (bw * bh));
-}
-
static int haar_ac_sad_8x8_uint8_input(const uint8_t *input, int stride,
int hbd) {
tran_low_t output[64];
av1_fdwt8x8_uint8_input_c(input, output, stride, hbd);
- return av1_haar_ac_sad(output, 8, 8, 8);
+ return haar_ac_sad(output, 8, 8, 8);
}
int64_t av1_haar_ac_sad_mxn_uint8_input(const uint8_t *input, int stride,
diff --git a/av1/encoder/encode_strategy.c b/av1/encoder/encode_strategy.c
index f4c1ba3..90279b0 100644
--- a/av1/encoder/encode_strategy.c
+++ b/av1/encoder/encode_strategy.c
@@ -717,7 +717,7 @@
// to av1_encode() except that tpl is not performed.
static int denoise_and_encode(AV1_COMP *const cpi, uint8_t *const dest,
EncodeFrameInput *const frame_input,
- EncodeFrameParams *const frame_params,
+ const EncodeFrameParams *const frame_params,
EncodeFrameResults *const frame_results) {
#if CONFIG_COLLECT_COMPONENT_TIMING
if (cpi->oxcf.pass == 2) start_timing(cpi, denoise_and_encode_time);
@@ -744,9 +744,10 @@
!frame_params->show_existing_frame &&
!is_lossless_requested(&oxcf->rc_cfg);
if (allow_kf_filtering) {
- const double y_noise_level = av1_estimate_noise_from_single_plane(
- frame_input->source, 0, cm->seq_params->bit_depth,
- NOISE_ESTIMATION_EDGE_THRESHOLD);
+ double y_noise_level = 0.0;
+ av1_estimate_noise_level(
+ frame_input->source, &y_noise_level, AOM_PLANE_Y, AOM_PLANE_Y,
+ cm->seq_params->bit_depth, NOISE_ESTIMATION_EDGE_THRESHOLD);
apply_filtering = y_noise_level > 0;
} else {
apply_filtering = 0;
@@ -786,6 +787,8 @@
tf_buf, &frame_diff, q_index, cm->seq_params->bit_depth);
if (show_existing_alt_ref) {
cpi->common.showable_frame |= 1;
+ } else {
+ cpi->common.showable_frame = 0;
}
}
if (gf_group->frame_type[cpi->gf_frame_index] != KEY_FRAME) {
@@ -801,7 +804,7 @@
oxcf->frm_dim_cfg.height, cm->seq_params->subsampling_x,
cm->seq_params->subsampling_y, cm->seq_params->use_highbitdepth,
cpi->oxcf.border_in_pixels, cm->features.byte_alignment, NULL, NULL,
- NULL, cpi->oxcf.tool_cfg.enable_global_motion, 0);
+ NULL, cpi->image_pyramid_levels, 0);
if (ret)
aom_internal_error(cm->error, AOM_CODEC_MEM_ERROR,
"Failed to allocate tf_buf_second_arf");
@@ -860,10 +863,7 @@
if (gf_group->size > MAX_LENGTH_TPL_FRAME_STATS) {
allow_tpl = 0;
}
- if (frame_params->frame_type == KEY_FRAME) {
- // TODO(angiebird): handle disable_filtered_key_tpl properly
- allow_tpl = allow_tpl && !cpi->sf.tpl_sf.disable_filtered_key_tpl;
- } else {
+ if (frame_params->frame_type != KEY_FRAME) {
// In rare case, it's possible to have non ARF/GF update_type here.
// We should set allow_tpl to zero in the situation
allow_tpl =
@@ -908,8 +908,7 @@
if (apply_filtering && is_psnr_calc_enabled(cpi)) {
cpi->source = av1_realloc_and_scale_if_required(
cm, source_buffer, &cpi->scaled_source, cm->features.interp_filter, 0,
- false, true, cpi->oxcf.border_in_pixels,
- cpi->oxcf.tool_cfg.enable_global_motion);
+ false, true, cpi->oxcf.border_in_pixels, cpi->image_pyramid_levels);
cpi->unscaled_source = source_buffer;
}
#if CONFIG_COLLECT_COMPONENT_TIMING
@@ -996,18 +995,30 @@
#if !CONFIG_REALTIME_ONLY
if (cpi->use_ducky_encode &&
cpi->ducky_encode_info.frame_info.gop_mode == DUCKY_ENCODE_GOP_MODE_RCL) {
- int valid_rf_idx = 0;
for (int rf = LAST_FRAME; rf < REF_FRAMES; ++rf) {
if (cpi->ppi->gf_group.ref_frame_list[gf_index][rf] != INVALID_IDX) {
remapped_ref_idx[rf - LAST_FRAME] =
cpi->ppi->gf_group.ref_frame_list[gf_index][rf];
+ }
+ }
+
+ int valid_rf_idx = 0;
+ static const int ref_frame_type_order[REF_FRAMES - LAST_FRAME] = {
+ GOLDEN_FRAME, ALTREF_FRAME, LAST_FRAME, BWDREF_FRAME,
+ ALTREF2_FRAME, LAST2_FRAME, LAST3_FRAME
+ };
+ for (int i = 0; i < REF_FRAMES - LAST_FRAME; i++) {
+ int rf = ref_frame_type_order[i];
+ if (remapped_ref_idx[rf - LAST_FRAME] != INVALID_IDX) {
valid_rf_idx = remapped_ref_idx[rf - LAST_FRAME];
+ break;
}
}
for (int i = 0; i < REF_FRAMES; ++i) {
- if (remapped_ref_idx[i] == INVALID_IDX)
+ if (remapped_ref_idx[i] == INVALID_IDX) {
remapped_ref_idx[i] = valid_rf_idx;
+ }
}
return;
@@ -1351,6 +1362,35 @@
}
frame_params.show_existing_frame &= allow_show_existing(cpi, *frame_flags);
+ // Special handling to reset 'show_existing_frame' in case of dropped
+ // frames.
+ if (oxcf->rc_cfg.drop_frames_water_mark &&
+ (gf_group->update_type[cpi->gf_frame_index] == OVERLAY_UPDATE ||
+ gf_group->update_type[cpi->gf_frame_index] == INTNL_OVERLAY_UPDATE)) {
+ // During the encode of an OVERLAY_UPDATE/INTNL_OVERLAY_UPDATE frame, loop
+ // over the gf group to check if the corresponding
+ // ARF_UPDATE/INTNL_ARF_UPDATE frame was dropped.
+ int cur_disp_idx = gf_group->display_idx[cpi->gf_frame_index];
+ for (int idx = 0; idx < cpi->gf_frame_index; idx++) {
+ if (cur_disp_idx == gf_group->display_idx[idx]) {
+ assert(IMPLIES(
+ gf_group->update_type[cpi->gf_frame_index] == OVERLAY_UPDATE,
+ gf_group->update_type[idx] == ARF_UPDATE));
+ assert(IMPLIES(gf_group->update_type[cpi->gf_frame_index] ==
+ INTNL_OVERLAY_UPDATE,
+ gf_group->update_type[idx] == INTNL_ARF_UPDATE));
+ // Reset show_existing_frame and set cpi->is_dropped_frame to true if
+ // the frame was dropped during its first encode.
+ if (gf_group->is_frame_dropped[idx]) {
+ frame_params.show_existing_frame = 0;
+ assert(!cpi->is_dropped_frame);
+ cpi->is_dropped_frame = true;
+ }
+ break;
+ }
+ }
+ }
+
// Reset show_existing_alt_ref decision to 0 after it is used.
if (gf_group->update_type[cpi->gf_frame_index] == OVERLAY_UPDATE) {
cpi->ppi->show_existing_alt_ref = 0;
@@ -1387,7 +1427,8 @@
// Source may be changed if temporal filtered later.
frame_input.source = &source->img;
- if (cpi->ppi->use_svc && last_source != NULL)
+ if ((cpi->ppi->use_svc || cpi->rc.prev_frame_is_dropped) &&
+ last_source != NULL)
av1_svc_set_last_source(cpi, &frame_input, &last_source->img);
else
frame_input.last_source = last_source != NULL ? &last_source->img : NULL;
@@ -1680,13 +1721,15 @@
is_frame_droppable(&cpi->ppi->rtc_ref, &ext_flags->refresh_frame);
}
- // For SVC: keep track of the (unscaled) source corresponding to the
- // refresh of LAST reference (base temporal layer- TL0). Copy only for the
+ // For SVC, or when frame-dropper is enabled:
+ // keep track of the (unscaled) source corresponding to the refresh of LAST
+ // reference (base temporal layer - TL0). Copy only for the
// top spatial enhancement layer so all spatial layers of the next
// superframe have last_source to be aligned with previous TL0 superframe.
// Avoid cases where resolution changes for unscaled source (top spatial
- // layer).
- if (cpi->ppi->use_svc &&
+  // layer). Only needs to be done for frames that are encoded (size > 0).
+ if (*size > 0 &&
+ (cpi->ppi->use_svc || cpi->oxcf.rc_cfg.drop_frames_water_mark > 0) &&
cpi->svc.spatial_layer_id == cpi->svc.number_spatial_layers - 1 &&
cpi->svc.temporal_layer_id == 0 &&
cpi->unscaled_source->y_width == cpi->svc.source_last_TL0.y_width &&
diff --git a/av1/encoder/encodeframe.c b/av1/encoder/encodeframe.c
index 6700669..50f046d 100644
--- a/av1/encoder/encodeframe.c
+++ b/av1/encoder/encodeframe.c
@@ -81,7 +81,7 @@
// purposes of activity masking.
// Eventually this should be replaced by custom no-reference routines,
// which will be faster.
-const uint8_t AV1_VAR_OFFS[MAX_SB_SIZE] = {
+static const uint8_t AV1_VAR_OFFS[MAX_SB_SIZE] = {
128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,
128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,
128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,
@@ -93,6 +93,7 @@
128, 128, 128, 128, 128, 128, 128, 128
};
+#if CONFIG_AV1_HIGHBITDEPTH
static const uint16_t AV1_HIGH_VAR_OFFS_8[MAX_SB_SIZE] = {
128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,
128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,
@@ -145,8 +146,31 @@
128 * 16, 128 * 16, 128 * 16, 128 * 16, 128 * 16, 128 * 16, 128 * 16,
128 * 16, 128 * 16
};
+#endif // CONFIG_AV1_HIGHBITDEPTH
/*!\endcond */
+// For the given bit depth, returns a constant array used to assist the
+// calculation of source block variance, which will then be used to decide
+// adaptive quantizers.
+static const uint8_t *get_var_offs(int use_hbd, int bd) {
+#if CONFIG_AV1_HIGHBITDEPTH
+ if (use_hbd) {
+ assert(bd == 8 || bd == 10 || bd == 12);
+ const int off_index = (bd - 8) >> 1;
+ static const uint16_t *high_var_offs[3] = { AV1_HIGH_VAR_OFFS_8,
+ AV1_HIGH_VAR_OFFS_10,
+ AV1_HIGH_VAR_OFFS_12 };
+ return CONVERT_TO_BYTEPTR(high_var_offs[off_index]);
+ }
+#else
+ (void)use_hbd;
+ (void)bd;
+ assert(!use_hbd);
+#endif
+ assert(bd == 8);
+ return AV1_VAR_OFFS;
+}
+
void av1_init_rtc_counters(MACROBLOCK *const x) {
av1_init_cyclic_refresh_counters(x);
x->cnt_zeromv = 0;
@@ -167,21 +191,9 @@
const int subsampling_y = xd->plane[plane].subsampling_y;
const BLOCK_SIZE plane_bsize =
get_plane_block_size(bsize, subsampling_x, subsampling_y);
- unsigned int var, sse;
- if (use_hbd) {
- const int bd = xd->bd;
- assert(bd == 8 || bd == 10 || bd == 12);
- const int off_index = (bd - 8) >> 1;
- static const uint16_t *high_var_offs[3] = { AV1_HIGH_VAR_OFFS_8,
- AV1_HIGH_VAR_OFFS_10,
- AV1_HIGH_VAR_OFFS_12 };
- var = cpi->ppi->fn_ptr[plane_bsize].vf(
- ref->buf, ref->stride, CONVERT_TO_BYTEPTR(high_var_offs[off_index]), 0,
- &sse);
- } else {
- var = cpi->ppi->fn_ptr[plane_bsize].vf(ref->buf, ref->stride, AV1_VAR_OFFS,
- 0, &sse);
- }
+ unsigned int sse;
+ const unsigned int var = cpi->ppi->fn_ptr[plane_bsize].vf(
+ ref->buf, ref->stride, get_var_offs(use_hbd, xd->bd), 0, &sse);
return ROUND_POWER_OF_TWO(var, num_pels_log2_lookup[plane_bsize]);
}
@@ -247,7 +259,7 @@
const int sb_row = mi_row >> cm->seq_params->mib_size_log2;
const int sb_col = mi_col >> cm->seq_params->mib_size_log2;
const int sb_cols =
- CEIL_POWER_OF_TWO(cm->mi_params.mi_cols, MAX_MIB_SIZE_LOG2);
+ CEIL_POWER_OF_TWO(cm->mi_params.mi_cols, cm->seq_params->mib_size_log2);
const int sb_index = sb_row * sb_cols + sb_col;
current_qindex =
cpi->ducky_encode_info.frame_info.superblock_encode_qindex[sb_index];
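
This bound fix matters when the sequence uses 64x64 superblocks, since MAX_MIB_SIZE_LOG2 is the 128x128 value and the old expression under-counted superblock columns, mis-addressing superblock_encode_qindex. A worked example in mi units (each mi unit is 4x4 pixels):

    // CEIL_POWER_OF_TWO(n, k) computes ceil(n / 2^k). A 64x64 superblock
    // spans 16 mi units, so mib_size_log2 == 4, while MAX_MIB_SIZE_LOG2 == 5
    // assumes 128x128. For a 1280-pixel-wide frame (mi_cols == 320):
    //   ceil(320 / 32) == 10  // stale bound: too few superblock columns
    //   ceil(320 / 16) == 20  // matches the actual 1280 / 64 columns
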
@@ -599,7 +611,7 @@
}
// TODO(jingning): revisit this function.
- if (cpi->oxcf.algo_cfg.enable_tpl_model && 0) {
+ if (cpi->oxcf.algo_cfg.enable_tpl_model && (0)) {
adjust_rdmult_tpl_model(cpi, x, mi_row, mi_col);
}
}
@@ -778,7 +790,8 @@
PC_TREE *const pc_root = av1_alloc_pc_tree_node(sb_size);
av1_rd_use_partition(cpi, td, tile_data, mi, tp, mi_row, mi_col, sb_size,
&dummy_rate, &dummy_dist, 1, pc_root);
- av1_free_pc_tree_recursive(pc_root, num_planes, 0, 0);
+ av1_free_pc_tree_recursive(pc_root, num_planes, 0, 0,
+ sf->part_sf.partition_search_type);
#if CONFIG_COLLECT_COMPONENT_TIMING
end_timing(cpi, rd_use_partition_time);
#endif
@@ -793,7 +806,8 @@
PC_TREE *const pc_root = av1_alloc_pc_tree_node(sb_size);
av1_rd_use_partition(cpi, td, tile_data, mi, tp, mi_row, mi_col, sb_size,
&dummy_rate, &dummy_dist, 1, pc_root);
- av1_free_pc_tree_recursive(pc_root, num_planes, 0, 0);
+ av1_free_pc_tree_recursive(pc_root, num_planes, 0, 0,
+ sf->part_sf.partition_search_type);
} else {
SB_FIRST_PASS_STATS *sb_org_stats = NULL;
@@ -1132,12 +1146,10 @@
av1_set_cost_upd_freq(cpi, td, tile_info, mi_row, mi_col);
// Reset color coding related parameters
- x->color_sensitivity_sb[0] = 0;
- x->color_sensitivity_sb[1] = 0;
- x->color_sensitivity_sb_g[0] = 0;
- x->color_sensitivity_sb_g[1] = 0;
- x->color_sensitivity[0] = 0;
- x->color_sensitivity[1] = 0;
+ av1_zero(x->color_sensitivity_sb);
+ av1_zero(x->color_sensitivity_sb_g);
+ av1_zero(x->color_sensitivity_sb_alt);
+ av1_zero(x->color_sensitivity);
x->content_state_sb.source_sad_nonrd = kMedSad;
x->content_state_sb.source_sad_rd = kMedSad;
x->content_state_sb.lighting_change = 0;
@@ -1419,9 +1431,11 @@
cpi->td.mb.e_mbd.tile_ctx = &this_tile->tctx;
cpi->td.mb.tile_pb_ctx = &this_tile->tctx;
av1_init_rtc_counters(&cpi->td.mb);
+ cpi->td.mb.palette_pixels = 0;
av1_encode_tile(cpi, &cpi->td, tile_row, tile_col);
if (!frame_is_intra_only(&cpi->common))
av1_accumulate_rtc_counters(cpi, &cpi->td.mb);
+ cpi->palette_pixel_num += cpi->td.mb.palette_pixels;
cpi->intrabc_used |= cpi->td.intrabc_used;
cpi->deltaq_used |= cpi->td.deltaq_used;
}
@@ -1857,11 +1871,12 @@
// base_qindex
cm->delta_q_info.delta_q_present_flag &= quant_params->base_qindex > 0;
cm->delta_q_info.delta_lf_present_flag &= quant_params->base_qindex > 0;
- } else {
+ } else if (cpi->cyclic_refresh->apply_cyclic_refresh ||
+ cpi->svc.number_temporal_layers == 1) {
cpi->cyclic_refresh->actual_num_seg1_blocks = 0;
cpi->cyclic_refresh->actual_num_seg2_blocks = 0;
- cpi->rc.cnt_zeromv = 0;
}
+ cpi->rc.cnt_zeromv = 0;
av1_frame_init_quantizer(cpi);
init_encode_frame_mb_context(cpi);
@@ -1946,7 +1961,8 @@
? av1_alloc_pc_tree_node(cm->seq_params->sb_size)
: NULL;
encode_tiles(cpi);
- av1_free_pc_tree_recursive(td->rt_pc_root, av1_num_planes(cm), 0, 0);
+ av1_free_pc_tree_recursive(td->rt_pc_root, av1_num_planes(cm), 0, 0,
+ cpi->sf.part_sf.partition_search_type);
}
}
@@ -2215,7 +2231,6 @@
AV1_COMMON *const cm = &cpi->common;
CurrentFrame *const current_frame = &cm->current_frame;
FeatureFlags *const features = &cm->features;
- const int num_planes = av1_num_planes(cm);
RD_COUNTS *const rdc = &cpi->td.rd_counts;
const AV1EncoderConfig *const oxcf = &cpi->oxcf;
// Indicates whether or not to use a default reduced set for ext-tx
@@ -2244,13 +2259,27 @@
cpi->ref_frame_flags);
av1_setup_frame_sign_bias(cm);
+ // If global motion is enabled, then every buffer which is used as either
+ // a source or a ref frame should have an image pyramid allocated.
+ // Check here so that issues can be caught early in debug mode
+#if !defined(NDEBUG) && !CONFIG_REALTIME_ONLY
+ if (cpi->image_pyramid_levels > 0) {
+ assert(cpi->source->y_pyramid);
+ for (int ref_frame = LAST_FRAME; ref_frame <= ALTREF_FRAME; ++ref_frame) {
+ const RefCntBuffer *const buf = get_ref_frame_buf(cm, ref_frame);
+ if (buf != NULL) {
+ assert(buf->buf.y_pyramid);
+ }
+ }
+ }
+#endif // !defined(NDEBUG) && !CONFIG_REALTIME_ONLY
+
#if CONFIG_MISMATCH_DEBUG
- mismatch_reset_frame(num_planes);
-#else
- (void)num_planes;
+ mismatch_reset_frame(av1_num_planes(cm));
#endif
rdc->newmv_or_intra_blocks = 0;
+ cpi->palette_pixel_num = 0;
if (cpi->sf.hl_sf.frame_parameter_update ||
cpi->sf.rt_sf.use_comp_ref_nonrd) {
diff --git a/av1/encoder/encodeframe_utils.c b/av1/encoder/encodeframe_utils.c
index c478ef6..29d7fe4 100644
--- a/av1/encoder/encodeframe_utils.c
+++ b/av1/encoder/encodeframe_utils.c
@@ -31,8 +31,19 @@
const int num_brows = (mi_size_high[bsize] + num_mi_h - 1) / num_mi_h;
int row, col;
double num_of_mi = 0.0;
- double geom_mean_of_scale = 0.0;
+ double geom_mean_of_scale = 1.0;
+ // To avoid overflow of 'geom_mean_of_scale', bsize_base must be at least
+ // BLOCK_8X8.
+ //
+ // For bsize=BLOCK_128X128 and bsize_base=BLOCK_8X8, the loop below would
+ // iterate 256 times. Considering the maximum value of
+ // cpi->ssim_rdmult_scaling_factors (see av1_set_mb_ssim_rdmult_scaling()),
+ // geom_mean_of_scale can go up to 4.8323^256, which is within DBL_MAX
+ // (maximum value a double data type can hold). If bsize_base is modified to
+ // BLOCK_4X4 (minimum possible block size), geom_mean_of_scale can go up
+ // to 4.8323^1024 and exceed DBL_MAX, resulting in data overflow.
+ assert(bsize_base >= BLOCK_8X8);
assert(cpi->oxcf.tune_cfg.tuning == AOM_TUNE_SSIM);
for (row = mi_row / num_mi_w;
@@ -41,17 +52,36 @@
col < num_cols && col < mi_col / num_mi_h + num_bcols; ++col) {
const int index = row * num_cols + col;
assert(cpi->ssim_rdmult_scaling_factors[index] != 0.0);
- geom_mean_of_scale += log(cpi->ssim_rdmult_scaling_factors[index]);
+ geom_mean_of_scale *= cpi->ssim_rdmult_scaling_factors[index];
num_of_mi += 1.0;
}
}
- geom_mean_of_scale = exp(geom_mean_of_scale / num_of_mi);
+ geom_mean_of_scale = pow(geom_mean_of_scale, (1.0 / num_of_mi));
*rdmult = (int)((double)(*rdmult) * geom_mean_of_scale + 0.5);
*rdmult = AOMMAX(*rdmult, 0);
av1_set_error_per_bit(errorperbit, *rdmult);
}
+#if CONFIG_SALIENCY_MAP
+void av1_set_saliency_map_vmaf_rdmult(const AV1_COMP *const cpi,
+ int *errorperbit, const BLOCK_SIZE bsize,
+ const int mi_row, const int mi_col,
+ int *const rdmult) {
+ const AV1_COMMON *const cm = &cpi->common;
+ const int num_mi_w = mi_size_wide[bsize];
+ const int num_mi_h = mi_size_high[bsize];
+ const int num_cols = (cm->mi_params.mi_cols + num_mi_w - 1) / num_mi_w;
+
+ *rdmult =
+ (int)(*rdmult * cpi->sm_scaling_factor[(mi_row / num_mi_h) * num_cols +
+ (mi_col / num_mi_w)]);
+
+ *rdmult = AOMMAX(*rdmult, 0);
+ av1_set_error_per_bit(errorperbit, *rdmult);
+}
+#endif
+
// TODO(angiebird): Move these function to tpl_model.c
#if !CONFIG_REALTIME_ONLY
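
Both forms compute the same geometric mean of the per-block scaling factors, exp((1/n) * sum(log x_i)) == pow(prod(x_i), 1/n); the product form trades n log() calls for a single pow(), and the comment added above bounds the product below DBL_MAX so the multiplication cannot overflow. A standalone check of the equivalence:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
      const double factors[4] = { 0.9, 1.1, 1.3, 0.8 };
      double sum_log = 0.0, prod = 1.0;
      for (int i = 0; i < 4; ++i) {
        sum_log += log(factors[i]);
        prod *= factors[i];
      }
      // The two forms agree up to floating-point rounding.
      printf("exp form: %.15f\n", exp(sum_log / 4.0));
      printf("pow form: %.15f\n", pow(prod, 1.0 / 4.0));
      return 0;
    }
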
// Return the end column for the current superblock, in unit of TPL blocks.
@@ -193,6 +223,9 @@
for (dir = 0; dir < 2; ++dir) {
const int ctx = av1_get_pred_context_switchable_interp(xd, dir);
InterpFilter filter = av1_extract_interp_filter(mbmi->interp_filters, dir);
+
+ // Only allow the 3 valid SWITCHABLE_FILTERS.
+ assert(filter < SWITCHABLE_FILTERS);
++counts->switchable_interp[ctx][filter];
}
}
@@ -306,8 +339,7 @@
}
// Count zero motion vector.
- if (!dry_run && cpi->oxcf.q_cfg.aq_mode == CYCLIC_REFRESH_AQ &&
- !frame_is_intra_only(cm)) {
+ if (!dry_run && !frame_is_intra_only(cm)) {
const MV mv = mi->mv[0].as_mv;
if (is_inter_block(mi) && mi->ref_frame[0] == LAST_FRAME &&
abs(mv.row) < 8 && abs(mv.col) < 8) {
@@ -369,9 +401,12 @@
}
#endif
if (!frame_is_intra_only(cm)) {
- if (cm->features.interp_filter == SWITCHABLE &&
- mi_addr->motion_mode != WARPED_CAUSAL &&
- !is_nontrans_global_motion(xd, xd->mi[0])) {
+ if (is_inter_block(mi) && cm->features.interp_filter == SWITCHABLE) {
+ // When the frame interp filter is SWITCHABLE, several cases that always
+ // use the default type (EIGHTTAP_REGULAR) are described in
+ // av1_is_interp_needed(). Here, we should keep the counts for all
+ // applicable blocks, so the frame filter resetting decision in
+ // fix_interp_filter() is made correctly.
update_filter_type_count(td->counts, xd, mi_addr);
}
}
diff --git a/av1/encoder/encodeframe_utils.h b/av1/encoder/encodeframe_utils.h
index 29350d7..24a36c5 100644
--- a/av1/encoder/encodeframe_utils.h
+++ b/av1/encoder/encodeframe_utils.h
@@ -368,6 +368,13 @@
const BLOCK_SIZE bsize, const int mi_row,
const int mi_col, int *const rdmult);
+#if CONFIG_SALIENCY_MAP
+void av1_set_saliency_map_vmaf_rdmult(const AV1_COMP *const cpi,
+ int *errorperbit, const BLOCK_SIZE bsize,
+ const int mi_row, const int mi_col,
+ int *const rdmult);
+#endif
+
void av1_update_state(const AV1_COMP *const cpi, ThreadData *td,
const PICK_MODE_CONTEXT *const ctx, int mi_row,
int mi_col, BLOCK_SIZE bsize, RUN_TYPE dry_run);
diff --git a/av1/encoder/encodemb.c b/av1/encoder/encodemb.c
index 8dee801..78efa0c 100644
--- a/av1/encoder/encodemb.c
+++ b/av1/encoder/encodemb.c
@@ -403,10 +403,7 @@
l = &args->tl[blk_row];
TX_TYPE tx_type = DCT_DCT;
- const int blk_skip_idx =
- (cpi->sf.rt_sf.use_nonrd_pick_mode && is_inter_block(mbmi))
- ? blk_row * bw / 4 + blk_col / 2
- : blk_row * bw + blk_col;
+ const int blk_skip_idx = blk_row * bw + blk_col;
if (!is_blk_skip(x->txfm_search_info.blk_skip, plane, blk_skip_idx) &&
!mbmi->skip_mode) {
tx_type = av1_get_tx_type(xd, pd->plane_type, blk_row, blk_col, tx_size,
@@ -556,6 +553,13 @@
// 4x4=0, 8x8=2, 16x16=4, 32x32=6, 64x64=8
// transform size varies per plane, look it up in a common way.
const TX_SIZE tx_size = av1_get_tx_size(plane, xd);
+ const BLOCK_SIZE tx_bsize = txsize_to_bsize[tx_size];
+ // Call visit() directly with zero offsets if the current block size is the
+ // same as the transform block size.
+ if (plane_bsize == tx_bsize) {
+ visit(plane, 0, 0, 0, plane_bsize, tx_size, arg);
+ return;
+ }
const uint8_t txw_unit = tx_size_wide_unit[tx_size];
const uint8_t txh_unit = tx_size_high_unit[tx_size];
const int step = txw_unit * txh_unit;
@@ -588,6 +592,8 @@
}
}
}
+ // Check if visit() is invoked at least once.
+ assert(i >= 1);
}
typedef struct encode_block_pass1_args {
diff --git a/av1/encoder/encoder.c b/av1/encoder/encoder.c
index 0b12ff0..d5d7dcc 100644
--- a/av1/encoder/encoder.c
+++ b/av1/encoder/encoder.c
@@ -27,6 +27,7 @@
#include "aom_dsp/noise_util.h"
#include "aom_dsp/noise_model.h"
#endif
+#include "aom_dsp/flow_estimation/corner_detect.h"
#include "aom_dsp/psnr.h"
#if CONFIG_INTERNAL_STATS
#include "aom_dsp/ssim.h"
@@ -75,6 +76,9 @@
#include "av1/encoder/rc_utils.h"
#include "av1/encoder/rd.h"
#include "av1/encoder/rdopt.h"
+#if CONFIG_SALIENCY_MAP
+#include "av1/encoder/saliency_map.h"
+#endif
#include "av1/encoder/segmentation.h"
#include "av1/encoder/speed_features.h"
#include "av1/encoder/superres_scale.h"
@@ -125,6 +129,14 @@
*hr = 1;
*hs = 2;
break;
+ case AOME_TWOTHREE:
+ *hr = 2;
+ *hs = 3;
+ break;
+ case AOME_ONETHREE:
+ *hr = 1;
+ *hs = 3;
+ break;
default:
*hr = 1;
*hs = 1;
@@ -363,9 +375,7 @@
static void set_bitstream_level_tier(AV1_PRIMARY *const ppi, int width,
int height, double init_framerate) {
SequenceHeader *const seq_params = &ppi->seq_params;
-#if CONFIG_CWG_C013
const AV1LevelParams *const level_params = &ppi->level_params;
-#endif
// TODO(any): This is a placeholder function that only addresses dimensions
// and max display sample rates.
// Need to add checks for max bit rate, max decoded luma sample rate, header
@@ -435,7 +445,15 @@
#endif
for (int i = 0; i < MAX_NUM_OPERATING_POINTS; ++i) {
- seq_params->seq_level_idx[i] = level;
+ assert(is_valid_seq_level_idx(level_params->target_seq_level_idx[i]) ||
+ level_params->target_seq_level_idx[i] == SEQ_LEVEL_KEEP_STATS);
+ // If a higher target level is specified, it is then used rather than the
+ // inferred one from resolution and framerate.
+ seq_params->seq_level_idx[i] =
+ level_params->target_seq_level_idx[i] < SEQ_LEVELS &&
+ level_params->target_seq_level_idx[i] > level
+ ? level_params->target_seq_level_idx[i]
+ : level;
// Set the maximum parameters for bitrate and buffer size for this profile,
// level, and tier
seq_params->op_params[i].bitrate = av1_max_level_bitrate(
@@ -650,6 +668,9 @@
resize_pending_params->width = 0;
resize_pending_params->height = 0;
+ // Setup identity scale factor
+ av1_setup_scale_factors_for_frame(&cm->sf_identity, 1, 1, 1, 1);
+
init_buffer_indices(&cpi->force_intpel_info, cm->remapped_ref_idx);
av1_noise_estimate_init(&cpi->noise_estimate, cm->width, cm->height);
@@ -799,7 +820,11 @@
if (has_no_stats_stage(cpi) && (rc_cfg->mode == AOM_Q)) {
p_rc->baseline_gf_interval = FIXED_GF_INTERVAL;
- } else {
+ } else if (!is_one_pass_rt_params(cpi) ||
+ cm->current_frame.frame_number == 0) {
+ // For rtc mode: logic for setting the baseline_gf_interval is done
+ // in av1_get_one_pass_rt_params(), and it should not be reset here in
+ // change_config(), unless after init_config (first frame).
p_rc->baseline_gf_interval = (MIN_GF_INTERVAL + MAX_GF_INTERVAL) / 2;
}
@@ -859,6 +884,14 @@
rc->worst_quality = rc_cfg->worst_allowed_q;
rc->best_quality = rc_cfg->best_allowed_q;
+ // If lossless has been requested make sure average Q accumulators are reset.
+ if (is_lossless_requested(&cpi->oxcf.rc_cfg)) {
+ int i;
+ for (i = 0; i < FRAME_TYPES; ++i) {
+ p_rc->avg_frame_qindex[i] = 0;
+ }
+ }
+
features->interp_filter =
oxcf->tile_cfg.enable_large_scale_tile ? EIGHTTAP_REGULAR : SWITCHABLE;
features->switchable_motion_mode = is_switchable_motion_mode_allowed(
@@ -906,6 +939,18 @@
if (lap_lag_in_frames != -1) {
cpi->oxcf.gf_cfg.lag_in_frames = lap_lag_in_frames;
}
+
+#if CONFIG_REALTIME_ONLY
+ assert(!oxcf->tool_cfg.enable_global_motion);
+ cpi->image_pyramid_levels = 0;
+#else
+ if (oxcf->tool_cfg.enable_global_motion) {
+ cpi->image_pyramid_levels =
+ global_motion_pyr_levels[default_global_motion_method];
+ } else {
+ cpi->image_pyramid_levels = 0;
+ }
+#endif // CONFIG_REALTIME_ONLY
}
static INLINE void init_frame_info(FRAME_INFO *frame_info,
@@ -928,11 +973,10 @@
frame_index_set->show_frame_count = 0;
}
-static INLINE void update_frame_index_set(FRAME_INDEX_SET *frame_index_set,
- int is_show_frame) {
- if (is_show_frame) {
- frame_index_set->show_frame_count++;
- }
+static INLINE void update_counters_for_show_frame(AV1_COMP *const cpi) {
+ assert(cpi->common.show_frame);
+ cpi->frame_index_set.show_frame_count++;
+ cpi->common.current_frame.frame_number++;
}
AV1_PRIMARY *av1_create_primary_compressor(
@@ -1366,6 +1410,8 @@
init_frame_index_set(&cpi->frame_index_set);
cm->current_frame.frame_number = 0;
+ cpi->rc.frame_number_encoded = 0;
+ cpi->rc.prev_frame_is_dropped = 0;
cm->current_frame_id = -1;
cpi->tile_data = NULL;
cpi->last_show_frame_buf = NULL;
@@ -1446,6 +1492,7 @@
cpi->mb_weber_stats = NULL;
cpi->mb_delta_q = NULL;
+ cpi->palette_pixel_num = 0;
{
const int bsize = BLOCK_16X16;
@@ -1499,15 +1546,41 @@
}
#endif
+#if CONFIG_SALIENCY_MAP
+ {
+ CHECK_MEM_ERROR(cm, cpi->saliency_map,
+ (uint8_t *)aom_calloc(cm->height * cm->width,
+ sizeof(*cpi->saliency_map)));
+ // Buffer initialization based on MIN_MIB_SIZE_LOG2 to ensure that
+    // the cpi->sm_scaling_factor buffer is allocated large enough, since the
+    // actual superblock size to be used is not yet known.
+ const int min_mi_w_sb = (1 << MIN_MIB_SIZE_LOG2);
+ const int min_mi_h_sb = (1 << MIN_MIB_SIZE_LOG2);
+ const int max_sb_cols =
+ (cm->mi_params.mi_cols + min_mi_w_sb - 1) / min_mi_w_sb;
+ const int max_sb_rows =
+ (cm->mi_params.mi_rows + min_mi_h_sb - 1) / min_mi_h_sb;
+ CHECK_MEM_ERROR(cm, cpi->sm_scaling_factor,
+ (double *)aom_calloc(max_sb_rows * max_sb_cols,
+ sizeof(*cpi->sm_scaling_factor)));
+ }
+#endif
+
#if CONFIG_COLLECT_PARTITION_STATS
av1_zero(cpi->partition_stats);
#endif // CONFIG_COLLECT_PARTITION_STATS
- /* av1_init_quantizer() is first called here. Add check in
- * av1_frame_init_quantizer() so that av1_init_quantizer is only
- * called later when needed. This will avoid unnecessary calls of
- * av1_init_quantizer() for every frame.
- */
+ // Initialize the members of DeltaQuantParams with INT_MAX to ensure that
+ // the quantizer tables are correctly initialized using the default deltaq
+ // parameters when av1_init_quantizer is called for the first time.
+ DeltaQuantParams *const prev_deltaq_params =
+ &cpi->enc_quant_dequant_params.prev_deltaq_params;
+ prev_deltaq_params->y_dc_delta_q = INT_MAX;
+ prev_deltaq_params->u_dc_delta_q = INT_MAX;
+ prev_deltaq_params->v_dc_delta_q = INT_MAX;
+ prev_deltaq_params->u_ac_delta_q = INT_MAX;
+ prev_deltaq_params->v_ac_delta_q = INT_MAX;
+
av1_init_quantizer(&cpi->enc_quant_dequant_params, &cm->quant_params,
cm->seq_params->bit_depth);
av1_qm_init(&cm->quant_params, av1_num_planes(cm));
@@ -1550,40 +1623,6 @@
}
}
-// Deallocate allocated thread_data.
-static AOM_INLINE void free_thread_data(AV1_PRIMARY *ppi) {
- PrimaryMultiThreadInfo *const p_mt_info = &ppi->p_mt_info;
- for (int t = 1; t < p_mt_info->num_workers; ++t) {
- EncWorkerData *const thread_data = &p_mt_info->tile_thr_data[t];
- thread_data->td = thread_data->original_td;
- aom_free(thread_data->td->tctx);
- aom_free(thread_data->td->palette_buffer);
- aom_free(thread_data->td->tmp_conv_dst);
- release_compound_type_rd_buffers(&thread_data->td->comp_rd_buffer);
- for (int j = 0; j < 2; ++j) {
- aom_free(thread_data->td->tmp_pred_bufs[j]);
- }
- aom_free(thread_data->td->pixel_gradient_info);
- aom_free(thread_data->td->src_var_info_of_4x4_sub_blocks);
- release_obmc_buffers(&thread_data->td->obmc_buffer);
- aom_free(thread_data->td->vt64x64);
-
- for (int x = 0; x < 2; x++) {
- for (int y = 0; y < 2; y++) {
- aom_free(thread_data->td->hash_value_buffer[x][y]);
- thread_data->td->hash_value_buffer[x][y] = NULL;
- }
- }
- aom_free(thread_data->td->counts);
- av1_free_pmc(thread_data->td->firstpass_ctx,
- ppi->seq_params.monochrome ? 1 : MAX_MB_PLANE);
- thread_data->td->firstpass_ctx = NULL;
- av1_free_shared_coeff_buffer(&thread_data->td->shared_coeff_buf);
- av1_free_sms_tree(thread_data->td);
- aom_free(thread_data->td);
- }
-}
-
void av1_remove_primary_compressor(AV1_PRIMARY *ppi) {
if (!ppi) return;
#if !CONFIG_REALTIME_ONLY
@@ -1648,7 +1687,12 @@
av1_denoiser_free(&(cpi->denoiser));
#endif
- aom_free(cm->error);
+ if (cm->error) {
+ // Help detect use after free of the error detail string.
+ memset(cm->error->detail, 'A', sizeof(cm->error->detail) - 1);
+ cm->error->detail[sizeof(cm->error->detail) - 1] = '\0';
+ aom_free(cm->error);
+ }
aom_free(cpi->td.tctx);
MultiThreadInfo *const mt_info = &cpi->mt_info;
#if CONFIG_MULTITHREAD
@@ -2040,7 +2084,7 @@
}
#ifndef NDEBUG
BufferPool *const pool = cm->buffer_pool;
- for (i = 0; i < FRAME_BUFFERS; ++i) {
+ for (i = 0; i < pool->num_frame_bufs; ++i) {
assert(pool->frame_bufs[i].ref_count == 0);
}
#endif
@@ -2153,7 +2197,6 @@
}
#endif
}
-
if (is_stat_consumption_stage(cpi)) {
av1_set_target_rate(cpi, cm->width, cm->height);
}
@@ -2183,7 +2226,7 @@
&cm->cur_frame->buf, cm->width, cm->height, seq_params->subsampling_x,
seq_params->subsampling_y, seq_params->use_highbitdepth,
cpi->oxcf.border_in_pixels, cm->features.byte_alignment, NULL, NULL,
- NULL, cpi->oxcf.tool_cfg.enable_global_motion, 0))
+ NULL, cpi->image_pyramid_levels, 0))
aom_internal_error(cm->error, AOM_CODEC_MEM_ERROR,
"Failed to allocate frame buffer");
@@ -2199,7 +2242,8 @@
for (int i = 0; i < num_planes; ++i)
cm->rst_info[i].frame_restoration_type = RESTORE_NONE;
- av1_alloc_restoration_buffers(cm);
+ const bool is_sgr_enabled = !cpi->sf.lpf_sf.disable_sgr_filter;
+ av1_alloc_restoration_buffers(cm, is_sgr_enabled);
// Store the allocated restoration buffers in MT object.
if (cpi->ppi->p_mt_info.num_workers > 1) {
av1_init_lr_mt_buffers(cpi);
@@ -2278,7 +2322,8 @@
cpi->sf.lpf_sf.cdef_pick_method, cpi->td.mb.rdmult,
cpi->sf.rt_sf.skip_cdef_sb, cpi->oxcf.tool_cfg.cdef_control,
use_screen_content_model,
- cpi->ppi->rtc_ref.non_reference_frame);
+ cpi->ppi->rtc_ref.non_reference_frame,
+ cpi->rc.rtc_external_ratectrl);
// Apply the filter
if ((skip_apply_postproc_filters & SKIP_APPLY_CDEF) == 0) {
@@ -2472,14 +2517,15 @@
av1_set_size_dependent_vars(cpi, &q, &bottom_index, &top_index);
av1_set_mv_search_params(cpi);
- if (cm->current_frame.frame_number == 0 && cpi->ppi->use_svc &&
+ if (cm->current_frame.frame_number == 0 &&
+ (cpi->ppi->use_svc || cpi->oxcf.rc_cfg.drop_frames_water_mark > 0) &&
cpi->svc.temporal_layer_id == 0) {
const SequenceHeader *seq_params = cm->seq_params;
if (aom_alloc_frame_buffer(
&cpi->svc.source_last_TL0, cpi->oxcf.frm_dim_cfg.width,
cpi->oxcf.frm_dim_cfg.height, seq_params->subsampling_x,
seq_params->subsampling_y, seq_params->use_highbitdepth,
- cpi->oxcf.border_in_pixels, cm->features.byte_alignment, 0)) {
+ cpi->oxcf.border_in_pixels, cm->features.byte_alignment, 0, 0)) {
aom_internal_error(cm->error, AOM_CODEC_MEM_ERROR,
"Failed to allocate buffer for source_last_TL0");
}
@@ -2528,8 +2574,7 @@
cpi->source = av1_realloc_and_scale_if_required(
cm, unscaled, &cpi->scaled_source, filter_scaler, phase_scaler, true,
- false, cpi->oxcf.border_in_pixels,
- cpi->oxcf.tool_cfg.enable_global_motion);
+ false, cpi->oxcf.border_in_pixels, cpi->image_pyramid_levels);
if (frame_is_intra_only(cm) || resize_pending != 0) {
memset(cpi->consec_zero_mv, 0,
((cm->mi_params.mi_rows * cm->mi_params.mi_cols) >> 2) *
@@ -2540,7 +2585,7 @@
cpi->last_source = av1_realloc_and_scale_if_required(
cm, cpi->unscaled_last_source, &cpi->scaled_last_source, filter_scaler,
phase_scaler, true, false, cpi->oxcf.border_in_pixels,
- cpi->oxcf.tool_cfg.enable_global_motion);
+ cpi->image_pyramid_levels);
}
if (cpi->sf.rt_sf.use_temporal_noise_estimate) {
@@ -2595,9 +2640,8 @@
av1_set_quantizer(cm, q_cfg->qm_minlevel, q_cfg->qm_maxlevel, q,
q_cfg->enable_chroma_deltaq, q_cfg->enable_hdr_deltaq);
av1_set_speed_features_qindex_dependent(cpi, cpi->oxcf.speed);
- if ((q_cfg->deltaq_mode != NO_DELTA_Q) || q_cfg->enable_chroma_deltaq)
- av1_init_quantizer(&cpi->enc_quant_dequant_params, &cm->quant_params,
- cm->seq_params->bit_depth);
+ av1_init_quantizer(&cpi->enc_quant_dequant_params, &cm->quant_params,
+ cm->seq_params->bit_depth);
av1_set_variance_partition_thresholds(cpi, q, 0);
av1_setup_frame(cpi);
@@ -2610,9 +2654,8 @@
av1_set_quantizer(cm, q_cfg->qm_minlevel, q_cfg->qm_maxlevel, q,
q_cfg->enable_chroma_deltaq, q_cfg->enable_hdr_deltaq);
av1_set_speed_features_qindex_dependent(cpi, cpi->oxcf.speed);
- if (q_cfg->deltaq_mode != NO_DELTA_Q || q_cfg->enable_chroma_deltaq)
- av1_init_quantizer(&cpi->enc_quant_dequant_params, &cm->quant_params,
- cm->seq_params->bit_depth);
+ av1_init_quantizer(&cpi->enc_quant_dequant_params, &cm->quant_params,
+ cm->seq_params->bit_depth);
av1_set_variance_partition_thresholds(cpi, q, 0);
if (frame_is_intra_only(cm) || cm->features.error_resilient_mode ||
cm->features.primary_ref_frame == PRIMARY_REF_NONE)
@@ -2651,7 +2694,7 @@
&cpi->orig_source, cpi->oxcf.frm_dim_cfg.width,
cpi->oxcf.frm_dim_cfg.height, seq_params->subsampling_x,
seq_params->subsampling_y, seq_params->use_highbitdepth,
- cpi->oxcf.border_in_pixels, cm->features.byte_alignment, 0))
+ cpi->oxcf.border_in_pixels, cm->features.byte_alignment, 0, 0))
aom_internal_error(cm->error, AOM_CODEC_MEM_ERROR,
"Failed to allocate scaled buffer");
}
@@ -2725,7 +2768,6 @@
cpi->sf.interp_sf.adaptive_interp_filter_search)
cpi->interp_search_flags.interp_filter_search_mask =
av1_setup_interp_filter_search_mask(cpi);
- cpi->source->buf_8bit_valid = 0;
av1_setup_frame_size(cpi);
@@ -2800,8 +2842,7 @@
}
cpi->source = av1_realloc_and_scale_if_required(
cm, cpi->unscaled_source, &cpi->scaled_source, EIGHTTAP_REGULAR, 0,
- false, false, cpi->oxcf.border_in_pixels,
- cpi->oxcf.tool_cfg.enable_global_motion);
+ false, false, cpi->oxcf.border_in_pixels, cpi->image_pyramid_levels);
#if CONFIG_TUNE_BUTTERAUGLI
if (oxcf->tune_cfg.tuning == AOM_TUNE_BUTTERAUGLI) {
@@ -2821,7 +2862,7 @@
cpi->last_source = av1_realloc_and_scale_if_required(
cm, cpi->unscaled_last_source, &cpi->scaled_last_source,
EIGHTTAP_REGULAR, 0, false, false, cpi->oxcf.border_in_pixels,
- cpi->oxcf.tool_cfg.enable_global_motion);
+ cpi->image_pyramid_levels);
}
int scale_references = 0;
@@ -2900,10 +2941,8 @@
av1_set_quantizer(cm, q_cfg->qm_minlevel, q_cfg->qm_maxlevel, q,
q_cfg->enable_chroma_deltaq, q_cfg->enable_hdr_deltaq);
av1_set_speed_features_qindex_dependent(cpi, oxcf->speed);
-
- if (q_cfg->deltaq_mode != NO_DELTA_Q || q_cfg->enable_chroma_deltaq)
- av1_init_quantizer(&cpi->enc_quant_dequant_params, &cm->quant_params,
- cm->seq_params->bit_depth);
+ av1_init_quantizer(&cpi->enc_quant_dequant_params, &cm->quant_params,
+ cm->seq_params->bit_depth);
av1_set_variance_partition_thresholds(cpi, q, 0);
@@ -3080,15 +3119,21 @@
film_grain_params->scaling_points_y[0][0] = 128;
film_grain_params->scaling_points_y[0][1] = 100;
- film_grain_params->num_cb_points = 1;
- film_grain_params->scaling_points_cb[0][0] = 128;
- film_grain_params->scaling_points_cb[0][1] = 100;
+ if (!cm->seq_params->monochrome) {
+ film_grain_params->num_cb_points = 1;
+ film_grain_params->scaling_points_cb[0][0] = 128;
+ film_grain_params->scaling_points_cb[0][1] = 100;
- film_grain_params->num_cr_points = 1;
- film_grain_params->scaling_points_cr[0][0] = 128;
- film_grain_params->scaling_points_cr[0][1] = 100;
+ film_grain_params->num_cr_points = 1;
+ film_grain_params->scaling_points_cr[0][0] = 128;
+ film_grain_params->scaling_points_cr[0][1] = 100;
+ } else {
+ film_grain_params->num_cb_points = 0;
+ film_grain_params->num_cr_points = 0;
+ }
film_grain_params->chroma_scaling_from_luma = 0;
+
film_grain_params->scaling_shift = 1;
film_grain_params->ar_coeff_lag = 0;
film_grain_params->ar_coeff_shift = 1;
@@ -3423,7 +3468,8 @@
// after 8 frames since last update if frame_source_sad > 0.
if (frame_is_intra_only(cm) || is_frame_resize_pending(cpi) ||
rc->high_source_sad || rc->frames_since_key < 30 ||
- cpi->cyclic_refresh->counter_encode_maxq_scene_change < 30 ||
+ (cpi->oxcf.q_cfg.aq_mode == CYCLIC_REFRESH_AQ &&
+ cpi->cyclic_refresh->counter_encode_maxq_scene_change < 30) ||
(cpi->frames_since_last_update > 8 && cpi->rc.frame_source_sad > 0))
return 0;
else
@@ -3494,7 +3540,7 @@
src, stride, hbd, num_8x8_rows, num_8x8_cols);
cpi->twopass_frame.frame_avg_haar_energy =
- log(((double)frame_avg_wavelet_energy / num_mbs) + 1.0);
+ log1p((double)frame_avg_wavelet_energy / num_mbs);
}
#endif
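
log1p(x) evaluates log(1 + x) without first rounding 1 + x in double precision, which matters when the average wavelet energy per macroblock is tiny. A standalone demonstration:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
      const double x = 1e-17;
      // 1.0 + x rounds to exactly 1.0 (double epsilon is ~2.2e-16), so the
      // naive form loses x entirely; log1p keeps it.
      printf("log(1 + x) = %.17g\n", log(1.0 + x));  // prints 0
      printf("log1p(x)   = %.17g\n", log1p(x));      // prints ~1e-17
      return 0;
    }
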
@@ -3632,8 +3678,7 @@
}
#endif // !CONFIG_REALTIME_ONLY
- ++current_frame->frame_number;
- update_frame_index_set(&cpi->frame_index_set, cm->show_frame);
+ update_counters_for_show_frame(cpi);
return AOM_CODEC_OK;
}
@@ -3685,15 +3730,32 @@
}
// For 1 pass CBR, check if we are dropping this frame.
- // Never drop on key frame.
+ // Never drop on key frame, or for frame whose base layer is key.
if (has_no_stats_stage(cpi) && oxcf->rc_cfg.mode == AOM_CBR &&
- current_frame->frame_type != KEY_FRAME) {
+ current_frame->frame_type != KEY_FRAME &&
+ !(cpi->ppi->use_svc &&
+ cpi->svc.layer_context[cpi->svc.temporal_layer_id].is_key_frame)) {
+ FRAME_UPDATE_TYPE update_type =
+ cpi->ppi->gf_group.update_type[cpi->gf_frame_index];
+ (void)update_type;
+ assert(
+ IMPLIES(cpi->is_dropped_frame, (update_type == OVERLAY_UPDATE ||
+ update_type == INTNL_OVERLAY_UPDATE)));
if (av1_rc_drop_frame(cpi)) {
+ cpi->is_dropped_frame = true;
+ }
+ if (cpi->is_dropped_frame) {
av1_setup_frame_size(cpi);
av1_set_mv_search_params(cpi);
av1_rc_postencode_update_drop_frame(cpi);
release_scaled_references(cpi);
- cpi->is_dropped_frame = true;
+ cpi->ppi->gf_group.is_frame_dropped[cpi->gf_frame_index] = true;
+ // A dropped frame might not be shown but it always takes a slot in the gf
+ // group. Therefore, even when it is not shown, we still need to update
+ // the relevant frame counters.
+ if (cm->show_frame) {
+ update_counters_for_show_frame(cpi);
+ }
return AOM_CODEC_OK;
}
}
@@ -3701,11 +3763,26 @@
if (oxcf->tune_cfg.tuning == AOM_TUNE_SSIM) {
av1_set_mb_ssim_rdmult_scaling(cpi);
}
-
+#if CONFIG_SALIENCY_MAP
+ else if (oxcf->tune_cfg.tuning == AOM_TUNE_VMAF_SALIENCY_MAP &&
+ !(cpi->source->flags & YV12_FLAG_HIGHBITDEPTH)) {
+ if (av1_set_saliency_map(cpi) == 0) {
+ return AOM_CODEC_MEM_ERROR;
+ }
+#if !CONFIG_REALTIME_ONLY
+ double motion_ratio = av1_setup_motion_ratio(cpi);
+#else
+ double motion_ratio = 1.0;
+#endif
+ if (av1_setup_sm_rdmult_scaling_factor(cpi, motion_ratio) == 0) {
+ return AOM_CODEC_MEM_ERROR;
+ }
+ }
+#endif
#if CONFIG_TUNE_VMAF
- if (oxcf->tune_cfg.tuning == AOM_TUNE_VMAF_WITHOUT_PREPROCESSING ||
- oxcf->tune_cfg.tuning == AOM_TUNE_VMAF_MAX_GAIN ||
- oxcf->tune_cfg.tuning == AOM_TUNE_VMAF_NEG_MAX_GAIN) {
+ else if (oxcf->tune_cfg.tuning == AOM_TUNE_VMAF_WITHOUT_PREPROCESSING ||
+ oxcf->tune_cfg.tuning == AOM_TUNE_VMAF_MAX_GAIN ||
+ oxcf->tune_cfg.tuning == AOM_TUNE_VMAF_NEG_MAX_GAIN) {
av1_set_mb_vmaf_rdmult_scaling(cpi);
}
#endif
@@ -3787,6 +3864,15 @@
features->disable_cdf_update = 1;
}
+#if !CONFIG_REALTIME_ONLY
+ if (cpi->oxcf.tool_cfg.enable_global_motion && !frame_is_intra_only(cm)) {
+ // Flush any stale global motion information, which may be left over
+ // from a previous frame
+ aom_invalidate_pyramid(cpi->source->y_pyramid);
+ av1_invalidate_corner_list(cpi->source->corners);
+ }
+#endif // !CONFIG_REALTIME_ONLY
+
int largest_tile_id = 0;
if (av1_superres_in_recode_allowed(cpi)) {
if (encode_with_and_without_superres(cpi, size, dest, &largest_tile_id) !=
@@ -3875,18 +3961,17 @@
cpi->frames_since_last_update = 1;
}
+ if (cpi->svc.spatial_layer_id == cpi->svc.number_spatial_layers - 1)
+ cpi->svc.prev_number_spatial_layers = cpi->svc.number_spatial_layers;
+
// Clear the one shot update flags for segmentation map and mode/ref loop
// filter deltas.
cm->seg.update_map = 0;
cm->seg.update_data = 0;
cm->lf.mode_ref_delta_update = 0;
- // A droppable frame might not be shown but it always
- // takes a space in the gf group. Therefore, even when
- // it is not shown, we still need update the count down.
if (cm->show_frame) {
- update_frame_index_set(&cpi->frame_index_set, cm->show_frame);
- ++current_frame->frame_number;
+ update_counters_for_show_frame(cpi);
}
#if CONFIG_COLLECT_COMPONENT_TIMING
@@ -4038,10 +4123,10 @@
// No noise synthesis if source is very clean.
// Uses a low edge threshold to focus on smooth areas.
// Increase output noise setting a little compared to measured value.
- cpi->oxcf.noise_level =
- (float)(av1_estimate_noise_from_single_plane(
- sd, 0, cm->seq_params->bit_depth, 16) -
- 0.1);
+ double y_noise_level = 0.0;
+ av1_estimate_noise_level(sd, &y_noise_level, AOM_PLANE_Y, AOM_PLANE_Y,
+ cm->seq_params->bit_depth, 16);
+ cpi->oxcf.noise_level = (float)(y_noise_level - 0.1);
cpi->oxcf.noise_level = (float)AOMMAX(0.0, cpi->oxcf.noise_level);
if (cpi->oxcf.noise_level > 0.0) {
cpi->oxcf.noise_level += (float)0.5;
@@ -4057,7 +4142,8 @@
#endif // CONFIG_DENOISE
if (av1_lookahead_push(cpi->ppi->lookahead, sd, time_stamp, end_time,
- use_highbitdepth, frame_flags)) {
+ use_highbitdepth, cpi->image_pyramid_levels,
+ frame_flags)) {
aom_internal_error(cm->error, AOM_CODEC_ERROR,
"av1_lookahead_push() failed");
res = -1;
@@ -4509,6 +4595,13 @@
}
#endif
+#if CONFIG_OUTPUT_FRAME_SIZE
+ FILE *f = fopen("frame_sizes.csv", "a");
+ fprintf(f, "%d,", 8 * (int)cpi_data->frame_size);
+ fprintf(f, "%d\n", cm->quant_params.base_qindex);
+ fclose(f);
+#endif // CONFIG_OUTPUT_FRAME_SIZE
+
if (!is_stat_generation_stage(cpi) && !cpi->is_dropped_frame) {
// Before calling refresh_reference_frames(), copy ppi->ref_frame_map_copy
// to cm->ref_frame_map for frame_parallel_level 2 frame in a parallel
@@ -4564,6 +4657,11 @@
av1_pop_third_pass_info(cpi->third_pass_ctx);
}
+ if (ppi->rtc_ref.set_ref_frame_config) {
+ av1_svc_update_buffer_slot_refreshed(cpi);
+ av1_svc_set_reference_was_previous(cpi);
+ }
+
if (ppi->use_svc) av1_save_layer_context(cpi);
// Note *size = 0 indicates a dropped frame for which psnr is not calculated
@@ -4701,6 +4799,9 @@
}
#endif
+  // Reset the flag to 0 after encoding.
+ cpi->rc.use_external_qp_one_pass = 0;
+
if (result == -1) {
cm->error->setjmp = 0;
// Returning -1 indicates no frame encoded; more input is required
@@ -4749,7 +4850,7 @@
RefCntBuffer *buf = get_ref_frame_buf(cm, ref_frame);
cpi->scaled_ref_buf[ref_frame - 1] = buf;
- for (int i = 0; i < FRAME_BUFFERS; ++i) {
+ for (int i = 0; i < cm->buffer_pool->num_frame_bufs; ++i) {
if (&cm->buffer_pool->frame_bufs[i] == buf) {
*ref_buffers_used_map |= (1 << i);
}
@@ -4764,7 +4865,7 @@
// corresponding to frames in a parallel encode set.
void av1_increment_scaled_ref_counts_fpmt(BufferPool *buffer_pool,
int ref_buffers_used_map) {
- for (int i = 0; i < FRAME_BUFFERS; ++i) {
+ for (int i = 0; i < buffer_pool->num_frame_bufs; ++i) {
if (ref_buffers_used_map & (1 << i)) {
++buffer_pool->frame_bufs[i].ref_count;
}
@@ -4787,7 +4888,7 @@
// corresponding to frames in a parallel encode set.
void av1_decrement_ref_counts_fpmt(BufferPool *buffer_pool,
int ref_buffers_used_map) {
- for (int i = 0; i < FRAME_BUFFERS; ++i) {
+ for (int i = 0; i < buffer_pool->num_frame_bufs; ++i) {
if (ref_buffers_used_map & (1 << i)) {
--buffer_pool->frame_bufs[i].ref_count;
}
@@ -5070,7 +5171,8 @@
AOM_SCALING_MODE vert_mode) {
int hr = 0, hs = 0, vr = 0, vs = 0;
- if (horiz_mode > AOME_ONETWO || vert_mode > AOME_ONETWO) return -1;
+ // Checks for invalid AOM_SCALING_MODE values.
+ if (horiz_mode > AOME_ONETHREE || vert_mode > AOME_ONETHREE) return -1;
Scale2Ratio(horiz_mode, &hr, &hs);
Scale2Ratio(vert_mode, &vr, &vs);
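
A usage sketch of the extended scaling modes through the public encoder control, consistent with the new AOME_TWOTHREE/AOME_ONETHREE cases in Scale2Ratio and the relaxed bound above (codec setup and error handling elided):

    #include "aom/aomcx.h"

    // Request that subsequent frames be encoded at 2/3 of the configured
    // width and height.
    aom_scaling_mode_t scale_mode = { AOME_TWOTHREE, AOME_TWOTHREE };
    aom_codec_control(&codec, AOME_SET_SCALEMODE, &scale_mode);
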
diff --git a/av1/encoder/encoder.h b/av1/encoder/encoder.h
index d13f08f..2965f9b 100644
--- a/av1/encoder/encoder.h
+++ b/av1/encoder/encoder.h
@@ -1071,6 +1071,15 @@
// CONFIG_PARTITION_SEARCH_ORDER.
const char *partition_info_path;
+ // The flag that indicates whether we use an external rate distribution to
+ // guide adaptive quantization. It requires --deltaq-mode=3. The rate
+ // distribution map file name is stored in |rate_distribution_info|.
+ unsigned int enable_rate_guide_deltaq;
+
+ // The input file of rate distribution information used in all intra mode
+ // to determine delta quantization.
+ const char *rate_distribution_info;
+
// Exit the encoder when it fails to encode to a given level.
int strict_level_conformance;
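A hedged sketch of wiring up the two new fields documented above through the
string-option interface. The option names come from this patch; the deltaq
mode value and the file path are illustrative:

#include "aom/aom_codec.h"

static aom_codec_err_t enable_rate_guide(aom_codec_ctx_t *codec) {
  aom_codec_err_t err = aom_codec_set_option(codec, "deltaq-mode", "3");
  if (err != AOM_CODEC_OK) return err;
  err = aom_codec_set_option(codec, "enable-rate-guide-deltaq", "1");
  if (err != AOM_CODEC_OK) return err;
  return aom_codec_set_option(codec, "rate-distribution-info", "rate_map.txt");
}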
@@ -1544,6 +1553,36 @@
} AV1EncRowMultiThreadInfo;
/*!
+ * \brief Encoder data related to multi-threading for allintra deltaq-mode=3
+ */
+typedef struct {
+#if CONFIG_MULTITHREAD
+ /*!
+ * Mutex lock used while dispatching jobs.
+ */
+ pthread_mutex_t *mutex_;
+ /*!
+ * Condition variable used to dispatch loopfilter jobs.
+ */
+ pthread_cond_t *cond_;
+#endif
+
+ /**
+ * \name Row synchronization related function pointers for all intra mode
+ */
+ /**@{*/
+ /*!
+ * Reader.
+ */
+ void (*intra_sync_read_ptr)(AV1EncRowMultiThreadSync *const, int, int);
+ /*!
+ * Writer.
+ */
+ void (*intra_sync_write_ptr)(AV1EncRowMultiThreadSync *const, int, int, int);
+ /**@}*/
+} AV1EncAllIntraMultiThreadInfo;
+
+/*!
* \brief Max number of recodes used to track the frame probabilities.
*/
#define NUM_RECODES_PER_FRAME 10
@@ -1619,6 +1658,11 @@
* Number of primary workers created for multi-threading.
*/
int p_num_workers;
+
+ /*!
+ * Tracks the number of workers in encode stage multi-threading.
+ */
+ int prev_num_enc_workers;
} PrimaryMultiThreadInfo;
/*!
@@ -1663,6 +1707,12 @@
AV1EncRowMultiThreadInfo enc_row_mt;
/*!
+ * Encoder multi-threading data for allintra mode in the preprocessing stage
+ * when --deltaq-mode=3.
+ */
+ AV1EncAllIntraMultiThreadInfo intra_mt;
+
+ /*!
* Tpl row multi-threading data.
*/
AV1TplRowMultiThreadInfo tpl_row_mt;
@@ -1950,11 +2000,6 @@
YV12_BUFFER_CONFIG *ref_buf[REF_FRAMES];
/*!
- * Pointer to the source frame buffer.
- */
- unsigned char *src_buffer;
-
- /*!
* Holds the number of valid reference frames in past and future directions
* w.r.t. the current frame. num_ref_frames[i] stores the total number of
* valid reference frames in 'i' direction.
@@ -1976,18 +2021,6 @@
int segment_map_w; /*!< segment map width */
int segment_map_h; /*!< segment map height */
/**@}*/
-
- /*!
- * Holds the total number of corner points detected in the source frame.
- */
- int num_src_corners;
-
- /*!
- * Holds the x and y co-ordinates of the corner points detected in the source
- * frame. src_corners[i] holds the x co-ordinate and src_corners[i+1] holds
- * the y co-ordinate of the ith corner point detected.
- */
- int src_corners[2 * MAX_CORNERS];
} GlobalMotionInfo;
/*!
@@ -2405,6 +2438,23 @@
int non_reference_frame;
int ref_frame_comp[3];
int gld_idx_1layer;
+ /*!
+ * Frame number of the last frame that refreshed the buffer slot.
+ */
+ unsigned int buffer_time_index[REF_FRAMES];
+ /*!
+ * Spatial layer id of the last frame that refreshed the buffer slot.
+ */
+ unsigned char buffer_spatial_layer[REF_FRAMES];
+ /*!
+ * Flag to indicate whether the closest reference was the previous frame.
+ */
+ bool reference_was_previous_frame;
+ /*!
+ * Flag to indicate this frame is based on longer term reference only,
+ * for recovery from past loss, and it should be biased for improved coding.
+ */
+ bool bias_recovery_frame;
} RTC_REF;
/*!\endcond */
@@ -2751,6 +2801,12 @@
* Struct for the reference structure for RTC.
*/
RTC_REF rtc_ref;
+
+ /*!
+ * Struct for all intra mode row multi threading in the preprocess stage
+ * when --deltaq-mode=3.
+ */
+ AV1EncRowMultiThreadSync intra_row_mt_sync;
} AV1_PRIMARY;
/*!
@@ -3382,6 +3438,23 @@
WeberStats *mb_weber_stats;
/*!
+ * Buffer to store rate cost estimates for each macro block (8x8) in the
+ * preprocessing stage used in allintra mode.
+ */
+ int *prep_rate_estimates;
+
+ /*!
+ * Buffer to store rate cost estimates for each 16x16 block read
+ * from an external file, used in allintra mode.
+ */
+ double *ext_rate_distribution;
+
+ /*!
+ * The scale that equals sum_rate_uniform_quantizer / sum_ext_rate.
+ */
+ double ext_rate_scale;
+
+ /*!
* Buffer to store MB variance after Wiener filter.
*/
BLOCK_SIZE weber_bsize;
@@ -3462,6 +3535,30 @@
* Block level thresholds to force zeromv-skip at partition level.
*/
unsigned int zeromv_skip_thresh_exit_part[BLOCK_SIZES_ALL];
+
+ /*!
+ * Number of downsampling pyramid levels to allocate for each frame.
+ * This is currently only used for global motion.
+ */
+ int image_pyramid_levels;
+
+#if CONFIG_SALIENCY_MAP
+ /*!
+ * Pixel level saliency map for each frame.
+ */
+ uint8_t *saliency_map;
+
+ /*!
+ * Superblock level rdmult scaling factor driven by saliency map.
+ */
+ double *sm_scaling_factor;
+#endif
+
+ /*!
+ * Number of pixels that choose palette mode for luma in the
+ * fast encoding pass in av1_determine_sc_tools_with_encoding().
+ */
+ int palette_pixel_num;
} AV1_COMP;
/*!
@@ -3599,11 +3696,11 @@
* \ingroup high_level_algo
* This function receives the raw frame data from input.
*
- * \param[in] cpi Top-level encoder structure
- * \param[in] frame_flags Flags to decide how to encoding the frame
- * \param[in] sd Contain raw frame data
- * \param[in] time_stamp Time stamp of the frame
- * \param[in] end_time_stamp End time stamp
+ * \param[in] cpi Top-level encoder structure
+ * \param[in] frame_flags Flags to decide how to encode the frame
+ * \param[in,out] sd Contains raw frame data
+ * \param[in] time_stamp Time stamp of the frame
+ * \param[in] end_time_stamp End time stamp
*
* \return Returns a value to indicate if the frame data is received
* successfully.
@@ -4177,7 +4274,9 @@
}
if (use_loopfilter) return SKIP_APPLY_LOOPFILTER;
- return 0; // All post-processing stages disabled.
+ // If we reach here, all post-processing stages are disabled, so none need to
+ // be skipped.
+ return 0;
}
static INLINE void set_postproc_filter_default_params(AV1_COMMON *cm) {
diff --git a/av1/encoder/encoder_alloc.h b/av1/encoder/encoder_alloc.h
index f4c345f..7dd81bd 100644
--- a/av1/encoder/encoder_alloc.h
+++ b/av1/encoder/encoder_alloc.h
@@ -213,6 +213,11 @@
aom_free_frame_buffer(&cpi->butteraugli_info.resized_source);
#endif
+#if CONFIG_SALIENCY_MAP
+ aom_free(cpi->saliency_map);
+ aom_free(cpi->sm_scaling_factor);
+#endif
+
release_obmc_buffers(&cpi->td.mb.obmc_buffer);
if (cpi->td.mb.mv_costs) {
@@ -291,6 +296,7 @@
#endif
if (cpi->film_grain_table) {
aom_film_grain_table_free(cpi->film_grain_table);
+ aom_free(cpi->film_grain_table);
cpi->film_grain_table = NULL;
}
@@ -311,6 +317,14 @@
aom_free(cpi->mb_weber_stats);
cpi->mb_weber_stats = NULL;
+ if (cpi->oxcf.enable_rate_guide_deltaq) {
+ aom_free(cpi->prep_rate_estimates);
+ cpi->prep_rate_estimates = NULL;
+
+ aom_free(cpi->ext_rate_distribution);
+ cpi->ext_rate_distribution = NULL;
+ }
+
aom_free(cpi->mb_delta_q);
cpi->mb_delta_q = NULL;
}
@@ -379,7 +393,7 @@
cm->seq_params->subsampling_x, cm->seq_params->subsampling_y,
cm->seq_params->use_highbitdepth, AOM_BORDER_IN_PIXELS,
cm->features.byte_alignment, NULL, NULL, NULL,
- cpi->oxcf.tool_cfg.enable_global_motion, 0))
+ cpi->image_pyramid_levels, 0))
aom_internal_error(cm->error, AOM_CODEC_MEM_ERROR,
"Failed to reallocate scaled source buffer");
assert(cpi->scaled_source.y_crop_width == scaled_width);
@@ -390,6 +404,40 @@
return &cpi->scaled_source;
}
+// Deallocate allocated thread_data.
+static AOM_INLINE void free_thread_data(AV1_PRIMARY *ppi) {
+ PrimaryMultiThreadInfo *const p_mt_info = &ppi->p_mt_info;
+ for (int t = 1; t < p_mt_info->num_workers; ++t) {
+ EncWorkerData *const thread_data = &p_mt_info->tile_thr_data[t];
+ thread_data->td = thread_data->original_td;
+ aom_free(thread_data->td->tctx);
+ aom_free(thread_data->td->palette_buffer);
+ aom_free(thread_data->td->tmp_conv_dst);
+ release_compound_type_rd_buffers(&thread_data->td->comp_rd_buffer);
+ for (int j = 0; j < 2; ++j) {
+ aom_free(thread_data->td->tmp_pred_bufs[j]);
+ }
+ aom_free(thread_data->td->pixel_gradient_info);
+ aom_free(thread_data->td->src_var_info_of_4x4_sub_blocks);
+ release_obmc_buffers(&thread_data->td->obmc_buffer);
+ aom_free(thread_data->td->vt64x64);
+
+ for (int x = 0; x < 2; x++) {
+ for (int y = 0; y < 2; y++) {
+ aom_free(thread_data->td->hash_value_buffer[x][y]);
+ thread_data->td->hash_value_buffer[x][y] = NULL;
+ }
+ }
+ aom_free(thread_data->td->counts);
+ av1_free_pmc(thread_data->td->firstpass_ctx,
+ ppi->seq_params.monochrome ? 1 : MAX_MB_PLANE);
+ thread_data->td->firstpass_ctx = NULL;
+ av1_free_shared_coeff_buffer(&thread_data->td->shared_coeff_buf);
+ av1_free_sms_tree(thread_data->td);
+ aom_free(thread_data->td);
+ }
+}
+
#ifdef __cplusplus
} // extern "C"
#endif
diff --git a/av1/encoder/encoder_utils.c b/av1/encoder/encoder_utils.c
index ad99ec6..bc136b1 100644
--- a/av1/encoder/encoder_utils.c
+++ b/av1/encoder/encoder_utils.c
@@ -701,7 +701,8 @@
RefCntBuffer *ref_fb = get_ref_frame_buf(cm, ref_frame);
if (aom_yv12_realloc_with_new_border(
&ref_fb->buf, AOM_BORDER_IN_PIXELS,
- cm->features.byte_alignment, num_planes) != 0) {
+ cm->features.byte_alignment, cpi->image_pyramid_levels,
+ num_planes) != 0) {
aom_internal_error(cm->error, AOM_CODEC_MEM_ERROR,
"Failed to allocate frame buffer");
}
@@ -802,10 +803,21 @@
? BLOCK_128X128
: BLOCK_64X64;
} else if (oxcf->mode == REALTIME) {
- if (oxcf->tune_cfg.content == AOM_CONTENT_SCREEN)
- return AOMMIN(width, height) >= 720 ? BLOCK_128X128 : BLOCK_64X64;
- else
+ if (oxcf->tune_cfg.content == AOM_CONTENT_SCREEN) {
+ const TileConfig *const tile_cfg = &oxcf->tile_cfg;
+ const int num_tiles =
+ (1 << tile_cfg->tile_columns) * (1 << tile_cfg->tile_rows);
+ // For multi-thread encode: if the number of (128x128) superblocks
+ // per tile is low, use 64X64 superblocks.
+ if (oxcf->row_mt == 1 && oxcf->max_threads >= 4 &&
+ oxcf->max_threads >= num_tiles && AOMMIN(width, height) > 720 &&
+ (width * height) / (128 * 128 * num_tiles) <= 38)
+ return BLOCK_64X64;
+ else
+ return AOMMIN(width, height) >= 720 ? BLOCK_128X128 : BLOCK_64X64;
+ } else {
return AOMMIN(width, height) > 720 ? BLOCK_128X128 : BLOCK_64X64;
+ }
}
// TODO(any): Possibly could improve this with a heuristic.
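A worked instance of the tile-based heuristic above, under illustrative
settings: a 1920x1080 screen-content encode with row_mt, 4 threads, and 2x2
tiles (tile_columns = tile_rows = 1 in log2 units):

#include <stdio.h>

int main(void) {
  const int width = 1920, height = 1080;
  const int tile_columns = 1, tile_rows = 1;  /* log2, as in oxcf */
  const int num_tiles = (1 << tile_columns) * (1 << tile_rows);  /* 4 */
  const int sbs_per_tile = (width * height) / (128 * 128 * num_tiles);
  /* 2073600 / 65536 = 31 <= 38, so 64x64 superblocks are selected. */
  printf("128x128 SBs per tile: %d -> %s\n", sbs_per_tile,
         sbs_per_tile <= 38 ? "BLOCK_64X64" : "BLOCK_128X128");
  return 0;
}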
@@ -825,6 +837,16 @@
if (!is_480p_or_lesser && is_1080p_or_lesser && oxcf->mode == GOOD &&
oxcf->row_mt == 1 && oxcf->max_threads > 1 && oxcf->speed >= 5)
return BLOCK_64X64;
+
+ // For allintra encode, since the maximum partition size is set to 32X32 for
+ // speed>=6, superblock size is set to 64X64 instead of 128X128. This
+ // improves the multithread performance due to reduction in top right delay
+ // and thread sync wastage. Currently, this setting is selectively enabled
+ // only for speed>=9 and resolutions less than 4k since cost update
+ // frequency is set to INTERNAL_COST_UPD_OFF in these cases.
+ const int is_4k_or_larger = AOMMIN(width, height) >= 2160;
+ if (oxcf->mode == ALLINTRA && oxcf->speed >= 9 && !is_4k_or_larger)
+ return BLOCK_64X64;
}
return BLOCK_128X128;
}
@@ -948,7 +970,13 @@
if (pass != 1) return;
const double psnr_diff = psnr[1].psnr[0] - psnr[0].psnr[0];
- const int is_sc_encoding_much_better = psnr_diff > STRICT_PSNR_DIFF_THRESH;
+ // Calculate the fraction of pixels choosing palette mode in mode decision.
+ const double palette_ratio =
+ (double)cpi->palette_pixel_num / (double)(cm->height * cm->width);
+ const int psnr_diff_is_large = (psnr_diff > STRICT_PSNR_DIFF_THRESH);
+ const int ratio_is_large =
+ ((palette_ratio >= 0.0001) && ((psnr_diff / palette_ratio) > 4));
+ const int is_sc_encoding_much_better = (psnr_diff_is_large || ratio_is_large);
if (is_sc_encoding_much_better) {
// Use screen content tools, if we get coding gain.
features->allow_screen_content_tools = 1;
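A worked instance of the relaxed decision above, with illustrative numbers:
even when the raw PSNR gap stays below STRICT_PSNR_DIFF_THRESH, a frame
where palette mode covers a meaningful share of pixels can still enable the
screen-content tools:

#include <stdio.h>

int main(void) {
  const int width = 1280, height = 720;
  const int palette_pixel_num = 92160;  /* pixels that chose palette mode */
  const double psnr_diff = 0.5;         /* small absolute PSNR gain */
  const double palette_ratio =
      (double)palette_pixel_num / (double)(width * height);  /* 0.1 */
  const int ratio_is_large =
      palette_ratio >= 0.0001 && (psnr_diff / palette_ratio) > 4;  /* 5 > 4 */
  printf("palette_ratio=%.4f -> %s\n", palette_ratio,
         ratio_is_large ? "enable SC tools" : "keep");
  return 0;
}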
@@ -1029,13 +1057,12 @@
cpi->source = av1_realloc_and_scale_if_required(
cm, cpi->unscaled_source, &cpi->scaled_source, cm->features.interp_filter,
- 0, false, false, cpi->oxcf.border_in_pixels,
- cpi->oxcf.tool_cfg.enable_global_motion);
+ 0, false, false, cpi->oxcf.border_in_pixels, cpi->image_pyramid_levels);
if (cpi->unscaled_last_source != NULL) {
cpi->last_source = av1_realloc_and_scale_if_required(
cm, cpi->unscaled_last_source, &cpi->scaled_last_source,
cm->features.interp_filter, 0, false, false, cpi->oxcf.border_in_pixels,
- cpi->oxcf.tool_cfg.enable_global_motion);
+ cpi->image_pyramid_levels);
}
av1_setup_frame(cpi);
@@ -1061,9 +1088,8 @@
q_for_screen_content_quick_run,
q_cfg->enable_chroma_deltaq, q_cfg->enable_hdr_deltaq);
av1_set_speed_features_qindex_dependent(cpi, oxcf->speed);
- if (q_cfg->deltaq_mode != NO_DELTA_Q || q_cfg->enable_chroma_deltaq)
- av1_init_quantizer(&cpi->enc_quant_dequant_params, &cm->quant_params,
- cm->seq_params->bit_depth);
+ av1_init_quantizer(&cpi->enc_quant_dequant_params, &cm->quant_params,
+ cm->seq_params->bit_depth);
av1_set_variance_partition_thresholds(cpi, q_for_screen_content_quick_run,
0);
@@ -1101,7 +1127,7 @@
// Only one filter is used. So set the filter at frame level
for (int i = 0; i < SWITCHABLE_FILTERS; ++i) {
if (count[i]) {
- if (i == EIGHTTAP_REGULAR) *interp_filter = i;
+ *interp_filter = i;
break;
}
}
@@ -1151,7 +1177,8 @@
}
}
- fix_interp_filter(&cm->features.interp_filter, cpi->td.counts);
+ if (!frame_is_intra_only(cm))
+ fix_interp_filter(&cm->features.interp_filter, cpi->td.counts);
}
int av1_is_integer_mv(const YV12_BUFFER_CONFIG *cur_picture,
@@ -1307,10 +1334,22 @@
// Curve fitting with an exponential model on all 16x16 blocks from the
// midres dataset.
var = 67.035434 * (1 - exp(-0.0021489 * var)) + 17.492222;
+
+ // As per the above computation, var will be in the range of
+ // [17.492222, 84.527656], assuming the data type is of infinite
+ // precision. The following assert conservatively checks if var is in the
+ // range of [17.0, 85.0] to avoid any issues due to the precision of the
+ // relevant data type.
+ assert(var > 17.0 && var < 85.0);
cpi->ssim_rdmult_scaling_factors[index] = var;
log_sum += log(var);
}
}
+
+ // As log_sum holds the geometric mean, it will be in the range
+ // [17.492222, 84.527656]. Hence, in the below loop, the value of
+ // cpi->ssim_rdmult_scaling_factors[index] would be in the range
+ // [0.2069, 4.8323].
log_sum = exp(log_sum / (double)(num_rows * num_cols));
for (int row = 0; row < num_rows; ++row) {
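The ranges quoted in the new comments follow directly from the model; a
standalone check (illustrative, not library code):

#include <math.h>
#include <stdio.h>

int main(void) {
  /* The exponential fit is bounded below at var = 0 and above as var grows. */
  const double lo = 67.035434 * (1 - exp(-0.0021489 * 0.0)) + 17.492222;
  const double hi = 67.035434 + 17.492222; /* limit of the fit */
  printf("model range: [%.6f, %.6f]\n", lo, hi); /* [17.492222, 84.527656] */
  /* Dividing by the geometric mean of values in that range keeps every
   * scaling factor within [17.492222/84.527656, 84.527656/17.492222]. */
  printf("factor bounds: [%.4f, %.4f]\n", 17.492222 / 84.527656,
         84.527656 / 17.492222); /* [0.2069, 4.8323] */
  return 0;
}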
diff --git a/av1/encoder/encodetxb.c b/av1/encoder/encodetxb.c
index 4ea4f4c..602a6c4 100644
--- a/av1/encoder/encodetxb.c
+++ b/av1/encoder/encodetxb.c
@@ -220,32 +220,32 @@
}
static INLINE int get_nz_map_ctx(const uint8_t *const levels,
- const int coeff_idx, const int bwl,
- const int height, const int scan_idx,
+ const int coeff_idx, const int bhl,
+ const int width, const int scan_idx,
const int is_eob, const TX_SIZE tx_size,
const TX_CLASS tx_class) {
if (is_eob) {
if (scan_idx == 0) return 0;
- if (scan_idx <= (height << bwl) / 8) return 1;
- if (scan_idx <= (height << bwl) / 4) return 2;
+ if (scan_idx <= (width << bhl) / 8) return 1;
+ if (scan_idx <= (width << bhl) / 4) return 2;
return 3;
}
const int stats =
- get_nz_mag(levels + get_padded_idx(coeff_idx, bwl), bwl, tx_class);
- return get_nz_map_ctx_from_stats(stats, coeff_idx, bwl, tx_size, tx_class);
+ get_nz_mag(levels + get_padded_idx(coeff_idx, bhl), bhl, tx_class);
+ return get_nz_map_ctx_from_stats(stats, coeff_idx, bhl, tx_size, tx_class);
}
void av1_txb_init_levels_c(const tran_low_t *const coeff, const int width,
const int height, uint8_t *const levels) {
- const int stride = width + TX_PAD_HOR;
+ const int stride = height + TX_PAD_HOR;
uint8_t *ls = levels;
- memset(levels + stride * height, 0,
+ memset(levels + stride * width, 0,
sizeof(*levels) * (TX_PAD_BOTTOM * stride + TX_PAD_END));
- for (int i = 0; i < height; i++) {
- for (int j = 0; j < width; j++) {
- *ls++ = (uint8_t)clamp(abs(coeff[i * width + j]), 0, INT8_MAX);
+ for (int i = 0; i < width; i++) {
+ for (int j = 0; j < height; j++) {
+ *ls++ = (uint8_t)clamp(abs(coeff[i * height + j]), 0, INT8_MAX);
}
for (int j = 0; j < TX_PAD_HOR; j++) {
*ls++ = 0;
@@ -257,11 +257,11 @@
const int16_t *const scan, const uint16_t eob,
const TX_SIZE tx_size, const TX_CLASS tx_class,
int8_t *const coeff_contexts) {
- const int bwl = get_txb_bwl(tx_size);
- const int height = get_txb_high(tx_size);
+ const int bhl = get_txb_bhl(tx_size);
+ const int width = get_txb_wide(tx_size);
for (int i = 0; i < eob; ++i) {
const int pos = scan[i];
- coeff_contexts[pos] = get_nz_map_ctx(levels, pos, bwl, height, i,
+ coeff_contexts[pos] = get_nz_map_ctx(levels, pos, bhl, width, i,
i == eob - 1, tx_size, tx_class);
}
}
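The bwl-to-bhl renames in this file all stem from one layout change: the
levels scratch buffer is now stored transposed, so its padded stride derives
from the transform height instead of the width. A hedged standalone
illustration (PAD_HOR stands in for the library's TX_PAD_HOR constant):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAD_HOR 4 /* illustrative padding, mirroring TX_PAD_HOR */

static void init_levels_transposed(const int32_t *coeff, int width, int height,
                                   uint8_t *levels) {
  const int stride = height + PAD_HOR; /* was width + TX_PAD_HOR */
  for (int i = 0; i < width; i++) {
    for (int j = 0; j < height; j++) {
      const int v = abs((int)coeff[i * height + j]);
      levels[i * stride + j] = (uint8_t)(v > INT8_MAX ? INT8_MAX : v);
    }
    memset(levels + i * stride + height, 0, PAD_HOR); /* row padding */
  }
}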
@@ -344,7 +344,7 @@
const int width = get_txb_wide(tx_size);
const int height = get_txb_high(tx_size);
uint8_t levels_buf[TX_PAD_2D];
- uint8_t *const levels = set_levels(levels_buf, width);
+ uint8_t *const levels = set_levels(levels_buf, height);
const tran_low_t *tcoeff_txb =
cb_coef_buff->tcoeff[plane] + x->mbmi_ext_frame->cb_offset[plane_type];
const tran_low_t *tcoeff = tcoeff_txb + BLOCK_OFFSET(block);
@@ -354,7 +354,7 @@
DECLARE_ALIGNED(16, int8_t, coeff_contexts[MAX_TX_SQUARE]);
av1_get_nz_map_contexts(levels, scan, eob, tx_size, tx_class, coeff_contexts);
- const int bwl = get_txb_bwl(tx_size);
+ const int bhl = get_txb_bhl(tx_size);
for (int c = eob - 1; c >= 0; --c) {
const int pos = scan[c];
const int coeff_ctx = coeff_contexts[pos];
@@ -373,7 +373,7 @@
if (level > NUM_BASE_LEVELS) {
// level is above 1.
const int base_range = level - 1 - NUM_BASE_LEVELS;
- const int br_ctx = get_br_ctx(levels, pos, bwl, tx_class);
+ const int br_ctx = get_br_ctx(levels, pos, bhl, tx_class);
aom_cdf_prob *cdf =
ec_ctx->coeff_br_cdf[AOMMIN(txs_ctx, TX_32X32)][plane_type][br_ctx];
for (int idx = 0; idx < COEFF_BASE_RANGE; idx += BR_CDF_SIZE - 1) {
@@ -571,7 +571,7 @@
get_txb_ctx(plane_bsize, tx_size, plane,
pd->above_entropy_context + blk_col,
pd->left_entropy_context + blk_row, &txb_ctx);
- const int bwl = get_txb_bwl(tx_size);
+ const int bhl = get_txb_bhl(tx_size);
const int width = get_txb_wide(tx_size);
const int height = get_txb_high(tx_size);
const uint8_t allow_update_cdf = args->allow_update_cdf;
@@ -607,7 +607,7 @@
memcpy(tcoeff, qcoeff, sizeof(*tcoeff) * seg_eob);
uint8_t levels_buf[TX_PAD_2D];
- uint8_t *const levels = set_levels(levels_buf, width);
+ uint8_t *const levels = set_levels(levels_buf, height);
av1_txb_init_levels(tcoeff, width, height, levels);
update_tx_type_count(cpi, cm, xd, blk_row, blk_col, plane, tx_size,
td->counts, allow_update_cdf);
@@ -663,7 +663,7 @@
}
if (level > NUM_BASE_LEVELS) {
const int base_range = level - 1 - NUM_BASE_LEVELS;
- const int br_ctx = get_br_ctx(levels, pos, bwl, tx_class);
+ const int br_ctx = get_br_ctx(levels, pos, bhl, tx_class);
for (int idx = 0; idx < COEFF_BASE_RANGE; idx += BR_CDF_SIZE - 1) {
const int k = AOMMIN(base_range - idx, BR_CDF_SIZE - 1);
if (allow_update_cdf) {
@@ -735,7 +735,7 @@
pd->left_entropy_context + blk_row, &txb_ctx);
#if CONFIG_ENTROPY_STATS
const TX_SIZE txsize_ctx = get_txsize_entropy_ctx(tx_size);
- const int bwl = get_txb_bwl(tx_size);
+ const int bhl = get_txb_bhl(tx_size);
const int width = get_txb_wide(tx_size);
const int height = get_txb_high(tx_size);
int cdf_idx = cm->coef_cdf_category;
@@ -764,7 +764,7 @@
#if CONFIG_ENTROPY_STATS
uint8_t levels_buf[TX_PAD_2D];
- uint8_t *const levels = set_levels(levels_buf, width);
+ uint8_t *const levels = set_levels(levels_buf, height);
av1_txb_init_levels(tcoeff, width, height, levels);
update_tx_type_count(cpi, cm, xd, blk_row, blk_col, plane, tx_size,
td->counts, 0 /*allow_update_cdf*/);
@@ -810,7 +810,7 @@
}
if (level > NUM_BASE_LEVELS) {
const int base_range = level - 1 - NUM_BASE_LEVELS;
- const int br_ctx = get_br_ctx(levels, pos, bwl, tx_class);
+ const int br_ctx = get_br_ctx(levels, pos, bhl, tx_class);
for (int idx = 0; idx < COEFF_BASE_RANGE; idx += BR_CDF_SIZE - 1) {
const int k = AOMMIN(base_range - idx, BR_CDF_SIZE - 1);
for (int lps = 0; lps < BR_CDF_SIZE - 1; lps++) {
diff --git a/av1/encoder/ethread.c b/av1/encoder/ethread.c
index 4127c9a..2a00999 100644
--- a/av1/encoder/ethread.c
+++ b/av1/encoder/ethread.c
@@ -100,7 +100,6 @@
(void)row_mt_sync;
(void)r;
(void)c;
- return;
}
void av1_row_mt_sync_write_dummy(AV1EncRowMultiThreadSync *row_mt_sync, int r,
@@ -109,7 +108,6 @@
(void)r;
(void)c;
(void)cols;
- return;
}
void av1_row_mt_sync_read(AV1EncRowMultiThreadSync *row_mt_sync, int r, int c) {
@@ -586,7 +584,7 @@
launch_loop_filter_rows(cm, thread_data, enc_row_mt, mib_size_log2);
}
av1_free_pc_tree_recursive(thread_data->td->rt_pc_root, av1_num_planes(cm), 0,
- 0);
+ 0, cpi->sf.part_sf.partition_search_type);
return 1;
}
@@ -619,7 +617,7 @@
}
av1_free_pc_tree_recursive(thread_data->td->rt_pc_root, av1_num_planes(cm), 0,
- 0);
+ 0, cpi->sf.part_sf.partition_search_type);
return 1;
}
@@ -777,6 +775,7 @@
int num_workers = p_mt_info->num_workers;
int num_enc_workers = av1_get_num_mod_workers_for_alloc(p_mt_info, MOD_ENC);
+ assert(num_enc_workers <= num_workers);
for (int i = num_workers - 1; i >= 0; i--) {
EncWorkerData *const thread_data = &p_mt_info->tile_thr_data[i];
@@ -886,6 +885,10 @@
}
}
}
+
+ // Record the number of workers in encode stage multi-threading for which
+ // allocation is done.
+ p_mt_info->prev_num_enc_workers = num_enc_workers;
}
void av1_create_workers(AV1_PRIMARY *ppi, int num_workers) {
@@ -1261,7 +1264,7 @@
static AOM_INLINE void sync_enc_workers(MultiThreadInfo *const mt_info,
AV1_COMMON *const cm, int num_workers) {
const AVxWorkerInterface *const winterface = aom_get_worker_interface();
- int had_error = 0;
+ int had_error = mt_info->workers[0].had_error;
// Encoding ends.
for (int i = num_workers - 1; i > 0; i--) {
@@ -1284,6 +1287,7 @@
// Accumulate rtc counters.
if (!frame_is_intra_only(&cpi->common))
av1_accumulate_rtc_counters(cpi, &thread_data->td->mb);
+ cpi->palette_pixel_num += thread_data->td->mb.palette_pixels;
if (thread_data->td != &cpi->td) {
// Keep these conditional expressions in sync with the corresponding ones
// in prepare_enc_workers().
@@ -1381,6 +1385,8 @@
// Reset rtc counters.
av1_init_rtc_counters(&thread_data->td->mb);
+ thread_data->td->mb.palette_pixels = 0;
+
if (thread_data->td->counts != &cpi->counts) {
memcpy(thread_data->td->counts, &cpi->counts, sizeof(cpi->counts));
}
@@ -1827,7 +1833,6 @@
(void)tpl_mt_sync;
(void)r;
(void)c;
- return;
}
void av1_tpl_row_mt_sync_write_dummy(AV1TplRowMultiThreadSync *tpl_mt_sync,
@@ -1836,7 +1841,6 @@
(void)r;
(void)c;
(void)cols;
- return;
}
void av1_tpl_row_mt_sync_read(AV1TplRowMultiThreadSync *tpl_row_mt_sync, int r,
@@ -2007,6 +2011,7 @@
}
}
+#if CONFIG_BITRATE_ACCURACY
// Accumulate transform stats after tpl.
static void tpl_accumulate_txfm_stats(ThreadData *main_td,
const MultiThreadInfo *mt_info,
@@ -2022,6 +2027,7 @@
}
}
}
+#endif // CONFIG_BITRATE_ACCURACY
// Implements multi-threading for tpl.
void av1_mc_flow_dispenser_mt(AV1_COMP *cpi) {
@@ -2047,7 +2053,9 @@
prepare_tpl_workers(cpi, tpl_worker_hook, num_workers);
launch_workers(&cpi->mt_info, num_workers);
sync_enc_workers(&cpi->mt_info, cm, num_workers);
+#if CONFIG_BITRATE_ACCURACY
tpl_accumulate_txfm_stats(&cpi->td, &cpi->mt_info, num_workers);
+#endif // CONFIG_BITRATE_ACCURACY
}
// Deallocate memory for temporal filter multi-thread synchronization.
@@ -2223,7 +2231,7 @@
static AOM_INLINE void init_gm_thread_data(
const GlobalMotionInfo *gm_info, GlobalMotionThreadData *thread_data) {
for (int m = 0; m < RANSAC_NUM_MOTIONS; m++) {
- MotionModel motion_params = thread_data->params_by_motion[m];
+ MotionModel motion_params = thread_data->motion_models[m];
av1_zero(motion_params.params);
motion_params.num_inliers = 0;
}
@@ -2251,7 +2259,6 @@
while (1) {
int ref_buf_idx = -1;
- int ref_frame_idx = -1;
#if CONFIG_MULTITHREAD
pthread_mutex_lock(gm_mt_mutex_);
@@ -2265,11 +2272,6 @@
switch_direction(cpi, &ref_buf_idx, &cur_dir);
}
- // 'ref_frame_idx' holds the index of the current reference frame type in
- // gm_info->reference_frames. job_info->next_frame_to_process will be
- // incremented in get_next_gm_job() and hence subtracting by 1.
- ref_frame_idx = job_info->next_frame_to_process[cur_dir] - 1;
-
#if CONFIG_MULTITHREAD
pthread_mutex_unlock(gm_mt_mutex_);
#endif
@@ -2280,23 +2282,18 @@
// Compute global motion for the given ref_buf_idx.
av1_compute_gm_for_valid_ref_frames(
- cpi, gm_info->ref_buf, ref_buf_idx, gm_info->num_src_corners,
- gm_info->src_corners, gm_info->src_buffer,
- gm_thread_data->params_by_motion, gm_thread_data->segment_map,
- gm_info->segment_map_w, gm_info->segment_map_h);
+ cpi, gm_info->ref_buf, ref_buf_idx, gm_thread_data->motion_models,
+ gm_thread_data->segment_map, gm_info->segment_map_w,
+ gm_info->segment_map_h);
#if CONFIG_MULTITHREAD
pthread_mutex_lock(gm_mt_mutex_);
#endif
- assert(ref_frame_idx != -1);
// If global motion w.r.t. current ref frame is
// INVALID/TRANSLATION/IDENTITY, skip the evaluation of global motion w.r.t
- // the remaining ref frames in that direction. The below exit is disabled
- // when ref frame distance w.r.t. current frame is zero. E.g.:
- // source_alt_ref_frame w.r.t. ARF frames.
+ // the remaining ref frames in that direction.
if (cpi->sf.gm_sf.prune_ref_frame_for_gm_search &&
- gm_info->reference_frames[cur_dir][ref_frame_idx].distance != 0 &&
- cpi->common.global_motion[ref_buf_idx].wmtype != ROTZOOM)
+ cpi->common.global_motion[ref_buf_idx].wmtype <= TRANSLATION)
job_info->early_exit[cur_dir] = 1;
#if CONFIG_MULTITHREAD
@@ -2361,7 +2358,7 @@
aom_free(thread_data->segment_map);
for (int m = 0; m < RANSAC_NUM_MOTIONS; m++)
- aom_free(thread_data->params_by_motion[m].inliers);
+ aom_free(thread_data->motion_models[m].inliers);
}
aom_free(gm_sync_data->thread_data);
}
@@ -2390,8 +2387,8 @@
for (int m = 0; m < RANSAC_NUM_MOTIONS; m++) {
CHECK_MEM_ERROR(
- cm, thread_data->params_by_motion[m].inliers,
- aom_malloc(sizeof(*thread_data->params_by_motion[m].inliers) * 2 *
+ cm, thread_data->motion_models[m].inliers,
+ aom_malloc(sizeof(*thread_data->motion_models[m].inliers) * 2 *
MAX_CORNERS));
}
}
@@ -2420,64 +2417,16 @@
}
#endif // !CONFIG_REALTIME_ONLY
-// Allocate memory for row synchronization
-static void wiener_var_sync_mem_alloc(
- AV1EncRowMultiThreadSync *const row_mt_sync, AV1_COMMON *const cm,
- const int rows) {
-#if CONFIG_MULTITHREAD
- int i;
-
- CHECK_MEM_ERROR(cm, row_mt_sync->mutex_,
- aom_malloc(sizeof(*row_mt_sync->mutex_) * rows));
- if (row_mt_sync->mutex_) {
- for (i = 0; i < rows; ++i) {
- pthread_mutex_init(&row_mt_sync->mutex_[i], NULL);
- }
+static AOM_INLINE int get_next_job_allintra(
+ AV1EncRowMultiThreadSync *const row_mt_sync, const int mi_row_end,
+ int *current_mi_row, int mib_size) {
+ if (row_mt_sync->next_mi_row < mi_row_end) {
+ *current_mi_row = row_mt_sync->next_mi_row;
+ row_mt_sync->num_threads_working++;
+ row_mt_sync->next_mi_row += mib_size;
+ return 1;
}
-
- CHECK_MEM_ERROR(cm, row_mt_sync->cond_,
- aom_malloc(sizeof(*row_mt_sync->cond_) * rows));
- if (row_mt_sync->cond_) {
- for (i = 0; i < rows; ++i) {
- pthread_cond_init(&row_mt_sync->cond_[i], NULL);
- }
- }
-#endif // CONFIG_MULTITHREAD
-
- CHECK_MEM_ERROR(cm, row_mt_sync->num_finished_cols,
- aom_malloc(sizeof(*row_mt_sync->num_finished_cols) * rows));
-
- row_mt_sync->rows = rows;
- // Set up nsync.
- row_mt_sync->sync_range = 1;
-}
-
-// Deallocate row based multi-threading synchronization related mutex and data
-static void wiener_var_sync_mem_dealloc(AV1EncRowMultiThreadSync *row_mt_sync) {
- if (row_mt_sync != NULL) {
-#if CONFIG_MULTITHREAD
- int i;
-
- if (row_mt_sync->mutex_ != NULL) {
- for (i = 0; i < row_mt_sync->rows; ++i) {
- pthread_mutex_destroy(&row_mt_sync->mutex_[i]);
- }
- aom_free(row_mt_sync->mutex_);
- }
- if (row_mt_sync->cond_ != NULL) {
- for (i = 0; i < row_mt_sync->rows; ++i) {
- pthread_cond_destroy(&row_mt_sync->cond_[i]);
- }
- aom_free(row_mt_sync->cond_);
- }
-#endif // CONFIG_MULTITHREAD
- aom_free(row_mt_sync->num_finished_cols);
-
- // clear the structure as the source of this call may be dynamic change
- // in tiles in which case this call will be followed by an _alloc()
- // which may fail.
- av1_zero(*row_mt_sync);
- }
+ return 0;
}
static AOM_INLINE void prepare_wiener_var_workers(AV1_COMP *const cpi,
@@ -2518,7 +2467,8 @@
MACROBLOCKD *xd = &x->e_mbd;
const BLOCK_SIZE bsize = cpi->weber_bsize;
const int mb_step = mi_size_wide[bsize];
- AV1EncRowMultiThreadSync *const row_mt_sync = &cpi->tile_data[0].row_mt_sync;
+ AV1EncRowMultiThreadSync *const intra_row_mt_sync =
+ &cpi->ppi->intra_row_mt_sync;
AV1EncRowMultiThreadInfo *const enc_row_mt = &cpi->mt_info.enc_row_mt;
(void)enc_row_mt;
#if CONFIG_MULTITHREAD
@@ -2536,7 +2486,9 @@
#if CONFIG_MULTITHREAD
pthread_mutex_lock(enc_row_mt_mutex_);
#endif
- has_jobs = get_next_job(&cpi->tile_data[0], ¤t_mi_row, mb_step);
+ has_jobs =
+ get_next_job_allintra(intra_row_mt_sync, cpi->common.mi_params.mi_rows,
+ ¤t_mi_row, mb_step);
#if CONFIG_MULTITHREAD
pthread_mutex_unlock(enc_row_mt_mutex_);
#endif
@@ -2548,7 +2500,7 @@
#if CONFIG_MULTITHREAD
pthread_mutex_lock(enc_row_mt_mutex_);
#endif
- row_mt_sync->num_threads_working--;
+ intra_row_mt_sync->num_threads_working--;
#if CONFIG_MULTITHREAD
pthread_mutex_unlock(enc_row_mt_mutex_);
#endif
@@ -2569,31 +2521,24 @@
(void)sum_est_rate;
AV1_COMMON *const cm = &cpi->common;
MultiThreadInfo *const mt_info = &cpi->mt_info;
- const int tile_cols = 1;
- const int tile_rows = 1;
- if (cpi->tile_data != NULL) aom_free(cpi->tile_data);
- CHECK_MEM_ERROR(
- cm, cpi->tile_data,
- aom_memalign(32, tile_cols * tile_rows * sizeof(*cpi->tile_data)));
- cpi->allocated_tiles = tile_cols * tile_rows;
- cpi->tile_data->tile_info.mi_row_end = cm->mi_params.mi_rows;
- AV1EncRowMultiThreadSync *const row_mt_sync = &cpi->tile_data[0].row_mt_sync;
+ AV1EncRowMultiThreadSync *const intra_row_mt_sync =
+ &cpi->ppi->intra_row_mt_sync;
// TODO(chengchen): the memory usage could be improved.
const int mi_rows = cm->mi_params.mi_rows;
- wiener_var_sync_mem_alloc(row_mt_sync, cm, mi_rows);
+ row_mt_sync_mem_alloc(intra_row_mt_sync, cm, mi_rows);
- row_mt_sync->intrabc_extra_top_right_sb_delay = 0;
- row_mt_sync->num_threads_working = num_workers;
- row_mt_sync->next_mi_row = 0;
- memset(row_mt_sync->num_finished_cols, -1,
- sizeof(*row_mt_sync->num_finished_cols) * num_workers);
+ intra_row_mt_sync->intrabc_extra_top_right_sb_delay = 0;
+ intra_row_mt_sync->num_threads_working = num_workers;
+ intra_row_mt_sync->next_mi_row = 0;
+ memset(intra_row_mt_sync->num_finished_cols, -1,
+ sizeof(*intra_row_mt_sync->num_finished_cols) * mi_rows);
prepare_wiener_var_workers(cpi, cal_mb_wiener_var_hook, num_workers);
launch_workers(mt_info, num_workers);
sync_enc_workers(mt_info, cm, num_workers);
- wiener_var_sync_mem_dealloc(row_mt_sync);
+ row_mt_sync_mem_dealloc(intra_row_mt_sync);
}
// Compare and order tiles based on absolute sum of tx coeffs.
@@ -3073,6 +3018,9 @@
// Computes num_workers for all intra multi-threading.
static AOM_INLINE int compute_num_ai_workers(AV1_COMP *cpi) {
if (cpi->oxcf.max_threads <= 1) return 1;
+ // The multi-threading implementation of deltaq-mode = 3 in allintra
+ // mode is based on row multi-threading.
+ if (!cpi->oxcf.row_mt) return 1;
cpi->weber_bsize = BLOCK_8X8;
const BLOCK_SIZE bsize = cpi->weber_bsize;
const int mb_step = mi_size_wide[bsize];
diff --git a/av1/encoder/firstpass.c b/av1/encoder/firstpass.c
index 3dee644..1fad149 100644
--- a/av1/encoder/firstpass.c
+++ b/av1/encoder/firstpass.c
@@ -93,6 +93,8 @@
section->intra_error = 0.0;
section->frame_avg_wavelet_energy = 0.0;
section->coded_error = 0.0;
+ section->log_intra_error = 0.0;
+ section->log_coded_error = 0.0;
section->sr_coded_error = 0.0;
section->pcnt_inter = 0.0;
section->pcnt_motion = 0.0;
@@ -121,6 +123,8 @@
section->frame += frame->frame;
section->weight += frame->weight;
section->intra_error += frame->intra_error;
+ section->log_intra_error += log1p(frame->intra_error);
+ section->log_coded_error += log1p(frame->coded_error);
section->frame_avg_wavelet_energy += frame->frame_avg_wavelet_energy;
section->coded_error += frame->coded_error;
section->sr_coded_error += frame->sr_coded_error;
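The switch from log(x + 1.0) to log1p(x) in this file is a numerical
accuracy fix: for very small x the addition 1.0 + x discards the low-order
bits of x before the logarithm is taken. A quick demonstration:

#include <math.h>
#include <stdio.h>

int main(void) {
  const double x = 1e-17;
  printf("log(1 + x) = %.3e\n", log(1.0 + x)); /* 0: x is lost in 1.0 + x */
  printf("log1p(x)   = %.3e\n", log1p(x));     /* ~1e-17, correctly rounded */
  return 0;
}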
@@ -217,7 +221,6 @@
case BLOCK_8X16: return aom_highbd_8_mse8x16;
default: return aom_highbd_8_mse16x16;
}
- break;
case 10:
switch (bsize) {
case BLOCK_8X8: return aom_highbd_10_mse8x8;
@@ -225,7 +228,6 @@
case BLOCK_8X16: return aom_highbd_10_mse8x16;
default: return aom_highbd_10_mse16x16;
}
- break;
case 12:
switch (bsize) {
case BLOCK_8X8: return aom_highbd_12_mse8x8;
@@ -233,7 +235,6 @@
case BLOCK_8X16: return aom_highbd_12_mse8x16;
default: return aom_highbd_12_mse16x16;
}
- break;
}
}
@@ -276,7 +277,7 @@
cpi->is_screen_content_type && cpi->common.features.allow_intrabc;
FULLPEL_MOTION_SEARCH_PARAMS ms_params;
av1_make_default_fullpel_ms_params(&ms_params, cpi, x, bsize, ref_mv,
- first_pass_search_sites,
+ start_mv, first_pass_search_sites,
fine_search_interval);
av1_set_mv_search_method(&ms_params, first_pass_search_sites, NSTEP);
@@ -514,7 +515,7 @@
stats->image_data_start_row = unit_row;
}
- double log_intra = log(this_intra_error + 1.0);
+ double log_intra = log1p(this_intra_error);
if (log_intra < 10.0) {
stats->intra_factor += 1.0 + ((10.0 - log_intra) * 0.05);
} else {
@@ -707,6 +708,8 @@
// Compute the motion error of the 0,0 motion using the last source
// frame as the reference. Skip the further motion search on
// reconstructed frame if this error is small.
+ // TODO(chiyotsai): The unscaled last source might have different
+ // dimensions than the current source. See BUG=aomedia:3413
struct buf_2d unscaled_last_source_buf_2d;
unscaled_last_source_buf_2d.buf =
cpi->unscaled_last_source->y_buffer + src_yoffset;
@@ -734,44 +737,43 @@
mv = tmp_mv;
}
}
+ }
- // Motion search in 2nd reference frame.
- int gf_motion_error = motion_error;
- if ((current_frame->frame_number > 1) && golden_frame != NULL) {
- FULLPEL_MV tmp_mv = kZeroFullMv;
- // Assume 0,0 motion with no mv overhead.
- xd->plane[0].pre[0].buf = golden_frame->y_buffer + recon_yoffset;
- xd->plane[0].pre[0].stride = golden_frame->y_stride;
- gf_motion_error =
- get_prediction_error_bitdepth(is_high_bitdepth, bitdepth, bsize,
- &x->plane[0].src, &xd->plane[0].pre[0]);
- first_pass_motion_search(cpi, x, &kZeroMv, &tmp_mv, &gf_motion_error);
- }
- if (gf_motion_error < motion_error && gf_motion_error < this_intra_error) {
- ++stats->second_ref_count;
- }
- // In accumulating a score for the 2nd reference frame take the
- // best of the motion predicted score and the intra coded error
- // (just as will be done for) accumulation of "coded_error" for
- // the last frame.
- if ((current_frame->frame_number > 1) && golden_frame != NULL) {
- stats->sr_coded_error += AOMMIN(gf_motion_error, this_intra_error);
- } else {
- // TODO(chengchen): I believe logically this should also be changed to
- // stats->sr_coded_error += AOMMIN(gf_motion_error, this_intra_error).
- stats->sr_coded_error += motion_error;
- }
-
- // Reset to last frame as reference buffer.
- xd->plane[0].pre[0].buf = last_frame->y_buffer + recon_yoffset;
- if (av1_num_planes(&cpi->common) > 1) {
- xd->plane[1].pre[0].buf = last_frame->u_buffer + recon_uvoffset;
- xd->plane[2].pre[0].buf = last_frame->v_buffer + recon_uvoffset;
- }
+ // Motion search in 2nd reference frame.
+ int gf_motion_error = motion_error;
+ if ((current_frame->frame_number > 1) && golden_frame != NULL) {
+ FULLPEL_MV tmp_mv = kZeroFullMv;
+ // Assume 0,0 motion with no mv overhead.
+ xd->plane[0].pre[0].buf = golden_frame->y_buffer + recon_yoffset;
+ xd->plane[0].pre[0].stride = golden_frame->y_stride;
+ xd->plane[0].pre[0].width = golden_frame->y_width;
+ gf_motion_error =
+ get_prediction_error_bitdepth(is_high_bitdepth, bitdepth, bsize,
+ &x->plane[0].src, &xd->plane[0].pre[0]);
+ first_pass_motion_search(cpi, x, &kZeroMv, &tmp_mv, &gf_motion_error);
+ }
+ if (gf_motion_error < motion_error && gf_motion_error < this_intra_error) {
+ ++stats->second_ref_count;
+ }
+ // In accumulating a score for the 2nd reference frame take the
+ // best of the motion predicted score and the intra coded error
+ // (just as will be done for) accumulation of "coded_error" for
+ // the last frame.
+ if ((current_frame->frame_number > 1) && golden_frame != NULL) {
+ stats->sr_coded_error += AOMMIN(gf_motion_error, this_intra_error);
} else {
+ // TODO(chengchen): I believe logically this should also be changed to
+ // stats->sr_coded_error += AOMMIN(gf_motion_error, this_intra_error).
stats->sr_coded_error += motion_error;
}
+ // Reset to last frame as reference buffer.
+ xd->plane[0].pre[0].buf = last_frame->y_buffer + recon_yoffset;
+ if (av1_num_planes(&cpi->common) > 1) {
+ xd->plane[1].pre[0].buf = last_frame->u_buffer + recon_uvoffset;
+ xd->plane[2].pre[0].buf = last_frame->v_buffer + recon_uvoffset;
+ }
+
// Start by assuming that intra mode is best.
*best_mv = kZeroMv;
@@ -829,7 +831,8 @@
fps->sr_coded_error /= num_mbs_16x16;
fps->intra_error /= num_mbs_16x16;
fps->frame_avg_wavelet_energy /= num_mbs_16x16;
-
+ fps->log_coded_error = log1p(fps->coded_error);
+ fps->log_intra_error = log1p(fps->intra_error);
fps->MVr /= f_h;
fps->mvr_abs /= f_h;
fps->MVc /= f_w;
@@ -889,11 +892,13 @@
fps.pcnt_neutral = (double)stats->neutral_count / num_mbs;
fps.intra_skip_pct = (double)stats->intra_skip_count / num_mbs;
fps.inactive_zone_rows = (double)stats->image_data_start_row;
- fps.inactive_zone_cols = (double)0; // Placeholder: not currently supported.
+ fps.inactive_zone_cols = 0.0; // Placeholder: not currently supported.
fps.raw_error_stdev = raw_err_stdev;
fps.is_flash = 0;
- fps.noise_var = (double)0;
- fps.cor_coeff = (double)1.0;
+ fps.noise_var = 0.0;
+ fps.cor_coeff = 1.0;
+ fps.log_coded_error = 0.0;
+ fps.log_intra_error = 0.0;
if (stats->mv_count > 0) {
fps.MVr = (double)stats->sum_mvr / stats->mv_count;
@@ -1118,10 +1123,16 @@
AV1EncRowMultiThreadInfo *const enc_row_mt = &mt_info->enc_row_mt;
AV1EncRowMultiThreadSync *const row_mt_sync = &tile_data->row_mt_sync;
- const YV12_BUFFER_CONFIG *const last_frame =
- get_ref_frame_yv12_buf(cm, LAST_FRAME);
+ const YV12_BUFFER_CONFIG *last_frame =
+ av1_get_scaled_ref_frame(cpi, LAST_FRAME);
+ if (!last_frame) {
+ last_frame = get_ref_frame_yv12_buf(cm, LAST_FRAME);
+ }
const YV12_BUFFER_CONFIG *golden_frame =
- get_ref_frame_yv12_buf(cm, GOLDEN_FRAME);
+ av1_get_scaled_ref_frame(cpi, GOLDEN_FRAME);
+ if (!golden_frame) {
+ golden_frame = get_ref_frame_yv12_buf(cm, GOLDEN_FRAME);
+ }
YV12_BUFFER_CONFIG *const this_frame = &cm->cur_frame->buf;
PICK_MODE_CONTEXT *ctx = td->firstpass_ctx;
@@ -1249,6 +1260,9 @@
const int num_planes = av1_num_planes(cm);
MACROBLOCKD *const xd = &x->e_mbd;
const int qindex = find_fp_qindex(seq_params->bit_depth);
+ const int ref_frame_flags_backup = cpi->ref_frame_flags;
+ cpi->ref_frame_flags = av1_ref_frame_flag_list[LAST_FRAME] |
+ av1_ref_frame_flag_list[GOLDEN_FRAME];
// Detect if the key frame is screen content type.
if (frame_is_intra_only(cm)) {
@@ -1300,10 +1314,18 @@
av1_init_tile_data(cpi);
- const YV12_BUFFER_CONFIG *const last_frame =
- get_ref_frame_yv12_buf(cm, LAST_FRAME);
- const YV12_BUFFER_CONFIG *golden_frame =
- get_ref_frame_yv12_buf(cm, GOLDEN_FRAME);
+ const YV12_BUFFER_CONFIG *last_frame = NULL;
+ const YV12_BUFFER_CONFIG *golden_frame = NULL;
+ if (!frame_is_intra_only(cm)) {
+ av1_scale_references(cpi, EIGHTTAP_REGULAR, 0, 0);
+ last_frame = av1_is_scaled(get_ref_scale_factors_const(cm, LAST_FRAME))
+ ? av1_get_scaled_ref_frame(cpi, LAST_FRAME)
+ : get_ref_frame_yv12_buf(cm, LAST_FRAME);
+ golden_frame = av1_is_scaled(get_ref_scale_factors_const(cm, GOLDEN_FRAME))
+ ? av1_get_scaled_ref_frame(cpi, GOLDEN_FRAME)
+ : get_ref_frame_yv12_buf(cm, GOLDEN_FRAME);
+ }
+
YV12_BUFFER_CONFIG *const this_frame = &cm->cur_frame->buf;
// First pass code requires valid last and new frame buffers.
assert(this_frame != NULL);
@@ -1425,6 +1447,10 @@
/*do_print=*/0);
++current_frame->frame_number;
+ cpi->ref_frame_flags = ref_frame_flags_backup;
+ if (!frame_is_intra_only(cm)) {
+ release_scaled_references(cpi);
+ }
}
aom_codec_err_t av1_firstpass_info_init(FIRSTPASS_INFO *firstpass_info,
diff --git a/av1/encoder/firstpass.h b/av1/encoder/firstpass.h
index d5f750f..e18e9e4 100644
--- a/av1/encoder/firstpass.h
+++ b/av1/encoder/firstpass.h
@@ -12,6 +12,8 @@
#ifndef AOM_AV1_ENCODER_FIRSTPASS_H_
#define AOM_AV1_ENCODER_FIRSTPASS_H_
+#include <stdbool.h>
+
#include "av1/common/av1_common_int.h"
#include "av1/common/enums.h"
#include "av1/encoder/lookahead.h"
@@ -161,6 +163,14 @@
* Correlation coefficient with the previous frame
*/
double cor_coeff;
+ /*!
+ * log of intra_error
+ */
+ double log_intra_error;
+ /*!
+ * log of coded_error
+ */
+ double log_coded_error;
} FIRSTPASS_STATS;
// We want to keep one past stats for key frame detection
@@ -386,9 +396,9 @@
// 2 : frame occurs later in encode order in a given parallel encode set.
int frame_parallel_level[MAX_STATIC_GF_GROUP_LENGTH];
// Indicates whether a frame should act as non-reference frame.
- // 0 : frame is a reference frame.
- // 1 : frame is a non-reference frame.
- int is_frame_non_ref[MAX_STATIC_GF_GROUP_LENGTH];
+ bool is_frame_non_ref[MAX_STATIC_GF_GROUP_LENGTH];
+ // Indicates whether a frame is dropped.
+ bool is_frame_dropped[MAX_STATIC_GF_GROUP_LENGTH];
// Stores the display order hint of the frames not to be
// refreshed by the current frame.
@@ -454,7 +464,6 @@
int last_kfgroup_zeromotion_pct;
int extend_minq;
int extend_maxq;
- int extend_minq_fast;
/*!\endcond */
} TWO_PASS;
diff --git a/av1/encoder/global_motion.c b/av1/encoder/global_motion.c
index 9e84e53..bc5e186 100644
--- a/av1/encoder/global_motion.c
+++ b/av1/encoder/global_motion.c
@@ -37,7 +37,6 @@
static void convert_to_params(const double *params, int32_t *model) {
int i;
- int alpha_present = 0;
model[0] = (int32_t)floor(params[0] * (1 << GM_TRANS_PREC_BITS) + 0.5);
model[1] = (int32_t)floor(params[1] * (1 << GM_TRANS_PREC_BITS) + 0.5);
model[0] = (int32_t)clamp(model[0], GM_TRANS_MIN, GM_TRANS_MAX) *
@@ -50,22 +49,8 @@
model[i] = (int32_t)floor(params[i] * (1 << GM_ALPHA_PREC_BITS) + 0.5);
model[i] =
(int32_t)clamp(model[i] - diag_value, GM_ALPHA_MIN, GM_ALPHA_MAX);
- alpha_present |= (model[i] != 0);
model[i] = (model[i] + diag_value) * GM_ALPHA_DECODE_FACTOR;
}
- for (; i < 8; ++i) {
- model[i] = (int32_t)floor(params[i] * (1 << GM_ROW3HOMO_PREC_BITS) + 0.5);
- model[i] = (int32_t)clamp(model[i], GM_ROW3HOMO_MIN, GM_ROW3HOMO_MAX) *
- GM_ROW3HOMO_DECODE_FACTOR;
- alpha_present |= (model[i] != 0);
- }
-
- if (!alpha_present) {
- if (abs(model[0]) < MIN_TRANS_THRESH && abs(model[1]) < MIN_TRANS_THRESH) {
- model[0] = 0;
- model[1] = 0;
- }
- }
}
void av1_convert_model_to_params(const double *params,
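The quantization in convert_to_params follows the usual round-half-up
fixed-point pattern; a generic sketch under stated assumptions (the
precision and clamp arguments are illustrative, not libaom's actual GM_*
constants):

#include <math.h>
#include <stdint.h>

static int32_t quantize_param(double param, int prec_bits, int32_t min_v,
                              int32_t max_v) {
  int32_t q = (int32_t)floor(param * (1 << prec_bits) + 0.5);
  if (q < min_v) q = min_v;
  if (q > max_v) q = max_v;
  return q;
}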
@@ -80,11 +65,10 @@
// zero-centering.
static int32_t add_param_offset(int param_index, int32_t param_value,
int32_t offset) {
- const int scale_vals[3] = { GM_TRANS_PREC_DIFF, GM_ALPHA_PREC_DIFF,
- GM_ROW3HOMO_PREC_DIFF };
- const int clamp_vals[3] = { GM_TRANS_MAX, GM_ALPHA_MAX, GM_ROW3HOMO_MAX };
- // type of param: 0 - translation, 1 - affine, 2 - homography
- const int param_type = (param_index < 2 ? 0 : (param_index < 6 ? 1 : 2));
+ const int scale_vals[2] = { GM_TRANS_PREC_DIFF, GM_ALPHA_PREC_DIFF };
+ const int clamp_vals[2] = { GM_TRANS_MAX, GM_ALPHA_MAX };
+ // type of param: 0 - translation, 1 - affine
+ const int param_type = (param_index < 2 ? 0 : 1);
const int is_one_centered = (param_index == 2 || param_index == 5);
// Make parameter zero-centered and offset the shift that was done to make
@@ -206,8 +190,9 @@
int p_height, int p_stride, int subsampling_x,
int subsampling_y, int64_t best_error,
uint8_t *segment_map, int segment_map_stride) {
- if (wm->wmtype <= AFFINE)
- if (!av1_get_shear_params(wm)) return INT64_MAX;
+ force_wmtype(wm, wm->wmtype);
+ assert(wm->wmtype <= AFFINE);
+ if (!av1_get_shear_params(wm)) return INT64_MAX;
#if CONFIG_AV1_HIGHBITDEPTH
if (use_hbd)
return highbd_warp_error(wm, CONVERT_TO_SHORTPTR(ref), width, height,
@@ -224,8 +209,8 @@
}
// Factors used to calculate the thresholds for av1_warp_error
-static double thresh_factors[GM_REFINEMENT_COUNT] = { 1.25, 1.20, 1.15, 1.10,
- 1.05 };
+static double thresh_factors[GM_MAX_REFINEMENT_STEPS] = { 1.25, 1.20, 1.15,
+ 1.10, 1.05 };
static INLINE int64_t calc_approx_erroradv_threshold(
double scaling_factor, int64_t erroradv_threshold) {
@@ -258,6 +243,12 @@
dst + border * d_stride + border, border, border,
d_width - 2 * border, d_height - 2 * border, d_stride, 0,
0, best_frame_error, segment_map, segment_map_stride);
+
+ if (n_refinements == 0) {
+ wm->wmtype = get_wmtype(wm);
+ return best_error;
+ }
+
best_error = AOMMIN(best_error, best_frame_error);
step = 1 << (n_refinements - 1);
for (i = 0; i < n_refinements; i++, step >>= 1) {
@@ -324,7 +315,7 @@
}
#define FEAT_COUNT_TR 3
-#define SEG_COUNT_TR 0.40
+#define SEG_COUNT_TR 48
void av1_compute_feature_segmentation_map(uint8_t *segment_map, int width,
int height, int *inliers,
int num_inliers) {
@@ -349,6 +340,6 @@
// If this motion does not make up a large enough portion of the frame,
// use the unsegmented version of the error metric
- if (seg_count < (width * height * SEG_COUNT_TR))
+ if (seg_count < SEG_COUNT_TR)
memset(segment_map, 1, width * height * sizeof(*segment_map));
}
diff --git a/av1/encoder/global_motion.h b/av1/encoder/global_motion.h
index 4fa3253..cf1d0fd 100644
--- a/av1/encoder/global_motion.h
+++ b/av1/encoder/global_motion.h
@@ -22,7 +22,7 @@
#endif
#define RANSAC_NUM_MOTIONS 1
-#define GM_REFINEMENT_COUNT 5
+#define GM_MAX_REFINEMENT_STEPS 5
#define MAX_DIRECTIONS 2
// The structure holds a valid reference frame type and its temporal distance
@@ -34,9 +34,9 @@
typedef struct {
// Array of structure which holds the global motion parameters for a given
- // motion model. params_by_motion[i] holds the parameters for a given motion
+ // motion model. motion_models[i] holds the parameters for a given motion
// model for the ith ransac motion.
- MotionModel params_by_motion[RANSAC_NUM_MOTIONS];
+ MotionModel motion_models[RANSAC_NUM_MOTIONS];
// Pointer to hold inliers from motion model.
uint8_t *segment_map;
diff --git a/av1/encoder/global_motion_facade.c b/av1/encoder/global_motion_facade.c
index 0df070a..1a00cbb 100644
--- a/av1/encoder/global_motion_facade.c
+++ b/av1/encoder/global_motion_facade.c
@@ -13,10 +13,12 @@
#include "aom_dsp/flow_estimation/corner_detect.h"
#include "aom_dsp/flow_estimation/flow_estimation.h"
+#include "aom_dsp/pyramid.h"
#include "av1/common/warped_motion.h"
#include "av1/encoder/encoder.h"
#include "av1/encoder/ethread.h"
#include "av1/encoder/rdopt.h"
+#include "av1/encoder/global_motion_facade.h"
// Highest motion model to search.
#define GLOBAL_TRANS_TYPES_ENC 3
@@ -80,111 +82,90 @@
// different motion models and finds the best.
static AOM_INLINE void compute_global_motion_for_ref_frame(
AV1_COMP *cpi, YV12_BUFFER_CONFIG *ref_buf[REF_FRAMES], int frame,
- int num_src_corners, int *src_corners, unsigned char *src_buffer,
- MotionModel *params_by_motion, uint8_t *segment_map,
- const int segment_map_w, const int segment_map_h,
- const WarpedMotionParams *ref_params) {
+ MotionModel *motion_models, uint8_t *segment_map, const int segment_map_w,
+ const int segment_map_h, const WarpedMotionParams *ref_params) {
ThreadData *const td = &cpi->td;
MACROBLOCK *const x = &td->mb;
AV1_COMMON *const cm = &cpi->common;
MACROBLOCKD *const xd = &x->e_mbd;
int i;
- int src_width = cpi->source->y_width;
- int src_height = cpi->source->y_height;
+ int src_width = cpi->source->y_crop_width;
+ int src_height = cpi->source->y_crop_height;
int src_stride = cpi->source->y_stride;
- // clang-format off
- static const double kIdentityParams[MAX_PARAMDIM - 1] = {
- 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0
- };
- // clang-format on
WarpedMotionParams tmp_wm_params;
const double *params_this_motion;
- int inliers_by_motion[RANSAC_NUM_MOTIONS];
assert(ref_buf[frame] != NULL);
TransformationType model;
+ int bit_depth = cpi->common.seq_params->bit_depth;
+ GlobalMotionMethod global_motion_method = default_global_motion_method;
+ int num_refinements = cpi->sf.gm_sf.num_refinement_steps;
- // TODO(sarahparker, debargha): Explore do_adaptive_gm_estimation = 1
- const int do_adaptive_gm_estimation = 0;
-
- const int ref_frame_dist = get_relative_dist(
- &cm->seq_params->order_hint_info, cm->current_frame.order_hint,
- cm->cur_frame->ref_order_hints[frame - LAST_FRAME]);
- const GlobalMotionEstimationType gm_estimation_type =
- cm->seq_params->order_hint_info.enable_order_hint &&
- abs(ref_frame_dist) <= 2 && do_adaptive_gm_estimation
- ? GLOBAL_MOTION_DISFLOW_BASED
- : GLOBAL_MOTION_FEATURE_BASED;
for (model = ROTZOOM; model < GLOBAL_TRANS_TYPES_ENC; ++model) {
- int64_t best_warp_error = INT64_MAX;
- // Initially set all params to identity.
- for (i = 0; i < RANSAC_NUM_MOTIONS; ++i) {
- memcpy(params_by_motion[i].params, kIdentityParams,
- (MAX_PARAMDIM - 1) * sizeof(*(params_by_motion[i].params)));
- params_by_motion[i].num_inliers = 0;
+ if (!aom_compute_global_motion(model, cpi->source, ref_buf[frame],
+ bit_depth, global_motion_method,
+ motion_models, RANSAC_NUM_MOTIONS)) {
+ continue;
}
- aom_compute_global_motion(model, src_buffer, src_width, src_height,
- src_stride, src_corners, num_src_corners,
- ref_buf[frame], cpi->common.seq_params->bit_depth,
- gm_estimation_type, inliers_by_motion,
- params_by_motion, RANSAC_NUM_MOTIONS);
- int64_t ref_frame_error = 0;
+ int64_t best_ref_frame_error = 0;
+ int64_t best_warp_error = INT64_MAX;
for (i = 0; i < RANSAC_NUM_MOTIONS; ++i) {
- if (inliers_by_motion[i] == 0) continue;
+ if (motion_models[i].num_inliers == 0) continue;
- params_this_motion = params_by_motion[i].params;
+ params_this_motion = motion_models[i].params;
av1_convert_model_to_params(params_this_motion, &tmp_wm_params);
- // Work around a bug in the AV1 specification
+ // Skip models that we won't use (IDENTITY or TRANSLATION)
+ //
+ // For IDENTITY type models, we don't need to evaluate anything because
+ // all the following logic is effectively comparing the estimated model
+ // to an identity model.
//
// For TRANSLATION type global motion models, gm_get_motion_vector() gives
// the wrong motion vector (see comments in that function for details).
// As translation-type models do not give much gain, we can avoid this bug
// by never choosing a TRANSLATION type model
- if (tmp_wm_params.wmtype == TRANSLATION) {
- continue;
- }
+ if (tmp_wm_params.wmtype <= TRANSLATION) continue;
- if (tmp_wm_params.wmtype != IDENTITY) {
- av1_compute_feature_segmentation_map(
- segment_map, segment_map_w, segment_map_h,
- params_by_motion[i].inliers, params_by_motion[i].num_inliers);
+ av1_compute_feature_segmentation_map(
+ segment_map, segment_map_w, segment_map_h, motion_models[i].inliers,
+ motion_models[i].num_inliers);
- ref_frame_error = av1_segmented_frame_error(
- is_cur_buf_hbd(xd), xd->bd, ref_buf[frame]->y_buffer,
- ref_buf[frame]->y_stride, cpi->source->y_buffer, src_width,
- src_height, src_stride, segment_map, segment_map_w);
+ int64_t ref_frame_error = av1_segmented_frame_error(
+ is_cur_buf_hbd(xd), xd->bd, ref_buf[frame]->y_buffer,
+ ref_buf[frame]->y_stride, cpi->source->y_buffer, src_width,
+ src_height, src_stride, segment_map, segment_map_w);
- const int64_t erroradv_threshold =
- calc_erroradv_threshold(ref_frame_error);
+ if (ref_frame_error == 0) continue;
- const int64_t warp_error = av1_refine_integerized_param(
- &tmp_wm_params, tmp_wm_params.wmtype, is_cur_buf_hbd(xd), xd->bd,
- ref_buf[frame]->y_buffer, ref_buf[frame]->y_width,
- ref_buf[frame]->y_height, ref_buf[frame]->y_stride,
- cpi->source->y_buffer, src_width, src_height, src_stride,
- GM_REFINEMENT_COUNT, best_warp_error, segment_map, segment_map_w,
- erroradv_threshold);
+ const int64_t erroradv_threshold =
+ calc_erroradv_threshold(ref_frame_error);
- // av1_refine_integerized_param() can return a TRANSLATION type model
- // even if its input is some other type, so we have to skip those too
- if (tmp_wm_params.wmtype == TRANSLATION) {
- continue;
- }
+ const int64_t warp_error = av1_refine_integerized_param(
+ &tmp_wm_params, tmp_wm_params.wmtype, is_cur_buf_hbd(xd), xd->bd,
+ ref_buf[frame]->y_buffer, ref_buf[frame]->y_crop_width,
+ ref_buf[frame]->y_crop_height, ref_buf[frame]->y_stride,
+ cpi->source->y_buffer, src_width, src_height, src_stride,
+ num_refinements, best_warp_error, segment_map, segment_map_w,
+ erroradv_threshold);
- if (warp_error < best_warp_error) {
- best_warp_error = warp_error;
- // Save the wm_params modified by
- // av1_refine_integerized_param() rather than motion index to
- // avoid rerunning refine() below.
- memcpy(&(cm->global_motion[frame]), &tmp_wm_params,
- sizeof(WarpedMotionParams));
- }
+ // av1_refine_integerized_param() can return a simpler model type than
+ // its input, so re-check model type here
+ if (tmp_wm_params.wmtype <= TRANSLATION) continue;
+
+ if (warp_error < best_warp_error) {
+ best_ref_frame_error = ref_frame_error;
+ best_warp_error = warp_error;
+ // Save the wm_params modified by
+ // av1_refine_integerized_param() rather than motion index to
+ // avoid rerunning refine() below.
+ memcpy(&(cm->global_motion[frame]), &tmp_wm_params,
+ sizeof(WarpedMotionParams));
}
}
- if (cm->global_motion[frame].wmtype <= AFFINE)
- if (!av1_get_shear_params(&cm->global_motion[frame]))
- cm->global_motion[frame] = default_warp_params;
+ assert(cm->global_motion[frame].wmtype <= AFFINE);
+ if (!av1_get_shear_params(&cm->global_motion[frame]))
+ cm->global_motion[frame] = default_warp_params;
#if 0
// We never choose translational models, so this code is disabled
@@ -202,12 +183,15 @@
if (cm->global_motion[frame].wmtype == IDENTITY) continue;
- if (ref_frame_error == 0) continue;
+ // Once we get here, best_ref_frame_error must be > 0. This is because
+ // of the logic above, which skips over any models which have
+ // ref_frame_error == 0
+ assert(best_ref_frame_error > 0);
// If the best error advantage found doesn't meet the threshold for
// this motion type, revert to IDENTITY.
if (!av1_is_enough_erroradvantage(
- (double)best_warp_error / ref_frame_error,
+ (double)best_warp_error / best_ref_frame_error,
gm_get_params_cost(&cm->global_motion[frame], ref_params,
cm->features.allow_high_precision_mv))) {
cm->global_motion[frame] = default_warp_params;
@@ -220,44 +204,37 @@
// Computes global motion for the given reference frame.
void av1_compute_gm_for_valid_ref_frames(
AV1_COMP *cpi, YV12_BUFFER_CONFIG *ref_buf[REF_FRAMES], int frame,
- int num_src_corners, int *src_corners, unsigned char *src_buffer,
- MotionModel *params_by_motion, uint8_t *segment_map, int segment_map_w,
+ MotionModel *motion_models, uint8_t *segment_map, int segment_map_w,
int segment_map_h) {
AV1_COMMON *const cm = &cpi->common;
const WarpedMotionParams *ref_params =
cm->prev_frame ? &cm->prev_frame->global_motion[frame]
: &default_warp_params;
- compute_global_motion_for_ref_frame(
- cpi, ref_buf, frame, num_src_corners, src_corners, src_buffer,
- params_by_motion, segment_map, segment_map_w, segment_map_h, ref_params);
+ compute_global_motion_for_ref_frame(cpi, ref_buf, frame, motion_models,
+ segment_map, segment_map_w, segment_map_h,
+ ref_params);
}
// Loops over valid reference frames and computes global motion estimation.
static AOM_INLINE void compute_global_motion_for_references(
AV1_COMP *cpi, YV12_BUFFER_CONFIG *ref_buf[REF_FRAMES],
FrameDistPair reference_frame[REF_FRAMES - 1], int num_ref_frames,
- int num_src_corners, int *src_corners, unsigned char *src_buffer,
- MotionModel *params_by_motion, uint8_t *segment_map,
- const int segment_map_w, const int segment_map_h) {
- // Computation of frame corners for the source frame will be done already.
- assert(num_src_corners != -1);
+ MotionModel *motion_models, uint8_t *segment_map, const int segment_map_w,
+ const int segment_map_h) {
AV1_COMMON *const cm = &cpi->common;
// Compute global motion w.r.t. reference frames starting from the nearest ref
// frame in a given direction.
for (int frame = 0; frame < num_ref_frames; frame++) {
int ref_frame = reference_frame[frame].frame;
- av1_compute_gm_for_valid_ref_frames(
- cpi, ref_buf, ref_frame, num_src_corners, src_corners, src_buffer,
- params_by_motion, segment_map, segment_map_w, segment_map_h);
+ av1_compute_gm_for_valid_ref_frames(cpi, ref_buf, ref_frame, motion_models,
+ segment_map, segment_map_w,
+ segment_map_h);
// If global motion w.r.t. current ref frame is
// INVALID/TRANSLATION/IDENTITY, skip the evaluation of global motion w.r.t
- // the remaining ref frames in that direction. The below exit is disabled
- // when ref frame distance w.r.t. current frame is zero. E.g.:
- // source_alt_ref_frame w.r.t. ARF frames.
+ // the remaining ref frames in that direction.
if (cpi->sf.gm_sf.prune_ref_frame_for_gm_search &&
- reference_frame[frame].distance != 0 &&
- cm->global_motion[ref_frame].wmtype != ROTZOOM)
+ cm->global_motion[ref_frame].wmtype <= TRANSLATION)
break;
}
}
@@ -306,6 +283,7 @@
case GM_REDUCED_REF_SEARCH_SKIP_L2_L3_ARF2:
return !(frame == LAST2_FRAME || frame == LAST3_FRAME ||
(frame == ALTREF2_FRAME));
+ case GM_SEARCH_CLOSEST_REFS_ONLY: return 1;
case GM_DISABLE_SEARCH: return 0;
default: assert(0);
}
@@ -325,6 +303,7 @@
int ref_pruning_enabled = is_frame_eligible_for_ref_pruning(
gf_group, cpi->sf.inter_sf.selective_ref_frame, 1, cpi->gf_frame_index);
int cur_frame_gm_disabled = 0;
+ int pyr_lvl = cm->cur_frame->pyramid_level;
if (cpi->sf.gm_sf.disable_gm_search_based_on_stats) {
cur_frame_gm_disabled = disable_gm_search_based_on_stats(cpi);
@@ -349,18 +328,25 @@
ref_pruning_enabled &&
prune_ref_by_selective_ref_frame(cpi, NULL, ref_frame,
cm->cur_frame->ref_display_order_hint);
+ int ref_pyr_lvl = buf->pyramid_level;
if (ref_buf[frame]->y_crop_width == cpi->source->y_crop_width &&
ref_buf[frame]->y_crop_height == cpi->source->y_crop_height &&
do_gm_search_logic(&cpi->sf, frame) && !prune_ref_frames &&
- !cur_frame_gm_disabled) {
+ ref_pyr_lvl <= pyr_lvl && !cur_frame_gm_disabled) {
assert(ref_buf[frame] != NULL);
const int relative_frame_dist = av1_encoder_get_relative_dist(
buf->display_order_hint, cm->cur_frame->display_order_hint);
// Populate past and future ref frames.
// reference_frames[0][] indicates past direction and
// reference_frames[1][] indicates future direction.
- if (relative_frame_dist <= 0) {
+ if (relative_frame_dist == 0) {
+ // Skip global motion estimation for frames at the same nominal instant.
+ // This will generally be either a "real" frame coded against a
+ // temporal filtered version, or a higher spatial layer coded against
+ // a lower spatial layer. In either case, the optimal motion model will
+ // be IDENTITY, so we don't need to search explicitly.
+ } else if (relative_frame_dist < 0) {
reference_frames[0][*num_past_ref_frames].distance =
abs(relative_frame_dist);
reference_frames[0][*num_past_ref_frames].frame = frame;
@@ -376,26 +362,26 @@
}
// Deallocates segment_map and inliers.
-static AOM_INLINE void dealloc_global_motion_data(MotionModel *params_by_motion,
+static AOM_INLINE void dealloc_global_motion_data(MotionModel *motion_models,
uint8_t *segment_map) {
aom_free(segment_map);
for (int m = 0; m < RANSAC_NUM_MOTIONS; m++) {
- aom_free(params_by_motion[m].inliers);
+ aom_free(motion_models[m].inliers);
}
}
// Allocates and initializes memory for segment_map and MotionModel.
-static AOM_INLINE bool alloc_global_motion_data(MotionModel *params_by_motion,
+static AOM_INLINE bool alloc_global_motion_data(MotionModel *motion_models,
uint8_t **segment_map,
const int segment_map_w,
const int segment_map_h) {
- av1_zero_array(params_by_motion, RANSAC_NUM_MOTIONS);
+ av1_zero_array(motion_models, RANSAC_NUM_MOTIONS);
for (int m = 0; m < RANSAC_NUM_MOTIONS; m++) {
- params_by_motion[m].inliers =
- aom_malloc(sizeof(*(params_by_motion[m].inliers)) * 2 * MAX_CORNERS);
- if (!params_by_motion[m].inliers) {
- dealloc_global_motion_data(params_by_motion, NULL);
+ motion_models[m].inliers =
+ aom_malloc(sizeof(*(motion_models[m].inliers)) * 2 * MAX_CORNERS);
+ if (!motion_models[m].inliers) {
+ dealloc_global_motion_data(motion_models, NULL);
return false;
}
}
@@ -403,7 +389,7 @@
*segment_map = (uint8_t *)aom_calloc(segment_map_w * segment_map_h,
sizeof(*segment_map));
if (!*segment_map) {
- dealloc_global_motion_data(params_by_motion, NULL);
+ dealloc_global_motion_data(motion_models, NULL);
return false;
}
return true;
@@ -414,18 +400,10 @@
GlobalMotionInfo *const gm_info = &cpi->gm_info;
YV12_BUFFER_CONFIG *source = cpi->source;
- gm_info->src_buffer = source->y_buffer;
- if (source->flags & YV12_FLAG_HIGHBITDEPTH) {
- // The source buffer is 16-bit, so we need to convert to 8 bits for the
- // following code. We cache the result until the source frame is released.
- gm_info->src_buffer =
- av1_downconvert_frame(source, cpi->common.seq_params->bit_depth);
- }
-
gm_info->segment_map_w =
- (source->y_width + WARP_ERROR_BLOCK) >> WARP_ERROR_BLOCK_LOG;
+ (source->y_crop_width + WARP_ERROR_BLOCK - 1) >> WARP_ERROR_BLOCK_LOG;
gm_info->segment_map_h =
- (source->y_height + WARP_ERROR_BLOCK) >> WARP_ERROR_BLOCK_LOG;
+ (source->y_crop_height + WARP_ERROR_BLOCK - 1) >> WARP_ERROR_BLOCK_LOG;
memset(gm_info->reference_frames, -1,
sizeof(gm_info->reference_frames[0][0]) * MAX_DIRECTIONS *
@@ -445,24 +423,27 @@
qsort(gm_info->reference_frames[1], gm_info->num_ref_frames[1],
sizeof(gm_info->reference_frames[1][0]), compare_distance);
- gm_info->num_src_corners = -1;
- // If at least one valid reference frame exists in past/future directions,
- // compute interest points of source frame using FAST features.
- if (gm_info->num_ref_frames[0] > 0 || gm_info->num_ref_frames[1] > 0) {
- gm_info->num_src_corners = av1_fast_corner_detect(
- gm_info->src_buffer, source->y_width, source->y_height,
- source->y_stride, gm_info->src_corners, MAX_CORNERS);
+ if (cpi->sf.gm_sf.gm_search_type == GM_SEARCH_CLOSEST_REFS_ONLY) {
+ // Filter down to the nearest two ref frames.
+ // Prefer one past and one future ref over two past refs, even if
+ // the second past ref is closer
+ if (gm_info->num_ref_frames[1] > 0) {
+ gm_info->num_ref_frames[0] = AOMMIN(gm_info->num_ref_frames[0], 1);
+ gm_info->num_ref_frames[1] = AOMMIN(gm_info->num_ref_frames[1], 1);
+ } else {
+ gm_info->num_ref_frames[0] = AOMMIN(gm_info->num_ref_frames[0], 2);
+ }
}
}
// Computes global motion w.r.t. valid reference frames.
static AOM_INLINE void global_motion_estimation(AV1_COMP *cpi) {
GlobalMotionInfo *const gm_info = &cpi->gm_info;
- MotionModel params_by_motion[RANSAC_NUM_MOTIONS];
+ MotionModel motion_models[RANSAC_NUM_MOTIONS];
uint8_t *segment_map = NULL;
- alloc_global_motion_data(params_by_motion, &segment_map,
- gm_info->segment_map_w, gm_info->segment_map_h);
+ alloc_global_motion_data(motion_models, &segment_map, gm_info->segment_map_w,
+ gm_info->segment_map_h);
// Compute global motion w.r.t. past reference frames and future reference
// frames
@@ -470,12 +451,11 @@
if (gm_info->num_ref_frames[dir] > 0)
compute_global_motion_for_references(
cpi, gm_info->ref_buf, gm_info->reference_frames[dir],
- gm_info->num_ref_frames[dir], gm_info->num_src_corners,
- gm_info->src_corners, gm_info->src_buffer, params_by_motion,
- segment_map, gm_info->segment_map_w, gm_info->segment_map_h);
+ gm_info->num_ref_frames[dir], motion_models, segment_map,
+ gm_info->segment_map_w, gm_info->segment_map_h);
}
- dealloc_global_motion_data(params_by_motion, segment_map);
+ dealloc_global_motion_data(motion_models, segment_map);
}
// Global motion estimation for the current frame is computed. This computation
@@ -498,7 +478,6 @@
}
if (cpi->common.current_frame.frame_type == INTER_FRAME && cpi->source &&
- cpi->superres_mode == AOM_SUPERRES_NONE &&
cpi->oxcf.tool_cfg.enable_global_motion && !gm_info->search_done) {
setup_global_motion_info_params(cpi);
if (cpi->mt_info.num_workers > 1)
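The segment map sizing fix above is a standard ceiling division: the old
expression (w + WARP_ERROR_BLOCK) >> WARP_ERROR_BLOCK_LOG over-allocates one
extra block whenever the width is an exact multiple of the block size, while
(w + WARP_ERROR_BLOCK - 1) >> WARP_ERROR_BLOCK_LOG yields exactly
ceil(w / WARP_ERROR_BLOCK). A minimal sketch, with a block size of 32 assumed
purely for illustration (the real constants live in libaom's headers):

#include <stdio.h>

#define BLOCK 32    /* stand-in for WARP_ERROR_BLOCK (assumed value) */
#define BLOCK_LOG 5 /* stand-in for WARP_ERROR_BLOCK_LOG */

/* Old form: adds a full block, so 64 -> 3 blocks instead of 2. */
static int blocks_old(int w) { return (w + BLOCK) >> BLOCK_LOG; }
/* New form: ceil(w / BLOCK), so 64 -> 2 blocks, 65 -> 3 blocks. */
static int blocks_new(int w) { return (w + BLOCK - 1) >> BLOCK_LOG; }

int main(void) {
  printf("%d %d\n", blocks_old(64), blocks_new(64)); /* prints: 3 2 */
  printf("%d %d\n", blocks_old(65), blocks_new(65)); /* prints: 3 3 */
  return 0;
}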
diff --git a/av1/encoder/global_motion_facade.h b/av1/encoder/global_motion_facade.h
index 52df19d..dfdedf7 100644
--- a/av1/encoder/global_motion_facade.h
+++ b/av1/encoder/global_motion_facade.h
@@ -19,9 +19,8 @@
struct AV1_COMP;
void av1_compute_gm_for_valid_ref_frames(
- struct AV1_COMP *cpi, YV12_BUFFER_CONFIG *ref_buf[REF_FRAMES], int frame,
- int num_src_corners, int *src_corners, unsigned char *src_buffer,
- MotionModel *params_by_motion, uint8_t *segment_map, int segment_map_w,
+ AV1_COMP *cpi, YV12_BUFFER_CONFIG *ref_buf[REF_FRAMES], int frame,
+ MotionModel *motion_models, uint8_t *segment_map, int segment_map_w,
int segment_map_h);
void av1_compute_global_motion_facade(struct AV1_COMP *cpi);
#ifdef __cplusplus
diff --git a/av1/encoder/gop_structure.c b/av1/encoder/gop_structure.c
index e0208c9..5078098 100644
--- a/av1/encoder/gop_structure.c
+++ b/av1/encoder/gop_structure.c
@@ -84,7 +84,7 @@
set_frame_parallel_level(&gf_group->frame_parallel_level[*frame_ind],
parallel_frame_count, max_parallel_frames);
// Set LF_UPDATE frames as non-reference frames.
- gf_group->is_frame_non_ref[*frame_ind] = 1;
+ gf_group->is_frame_non_ref[*frame_ind] = true;
}
set_src_offset(gf_group, first_frame_index, *cur_frame_idx, *frame_ind);
@@ -437,7 +437,7 @@
RATE_CONTROL *rc, FRAME_INFO *frame_info, int start, int end,
int *cur_frame_idx, int *frame_ind, int *parallel_frame_count,
int max_parallel_frames, int do_frame_parallel_encode,
- int *first_frame_index, int layer_depth) {
+ int *first_frame_index, int *cur_disp_idx, int layer_depth) {
const int num_frames_to_process = end - start;
// Either we are at the last level of the pyramid, or we don't have enough
@@ -449,6 +449,7 @@
gf_group->update_type[*frame_ind] = LF_UPDATE;
gf_group->arf_src_offset[*frame_ind] = 0;
gf_group->cur_frame_idx[*frame_ind] = *cur_frame_idx;
+ gf_group->display_idx[*frame_ind] = *cur_disp_idx;
gf_group->layer_depth[*frame_ind] = MAX_ARF_LAYERS;
gf_group->arf_boost[*frame_ind] =
av1_calc_arf_boost(twopass, twopass_frame, p_rc, frame_info, start,
@@ -462,11 +463,12 @@
set_frame_parallel_level(&gf_group->frame_parallel_level[*frame_ind],
parallel_frame_count, max_parallel_frames);
// Set LF_UPDATE frames as non-reference frames.
- gf_group->is_frame_non_ref[*frame_ind] = 1;
+ gf_group->is_frame_non_ref[*frame_ind] = true;
}
set_src_offset(gf_group, first_frame_index, *cur_frame_idx, *frame_ind);
++(*frame_ind);
++(*cur_frame_idx);
+ ++(*cur_disp_idx);
++start;
}
} else {
@@ -476,6 +478,8 @@
gf_group->update_type[*frame_ind] = INTNL_ARF_UPDATE;
gf_group->arf_src_offset[*frame_ind] = m - start;
gf_group->cur_frame_idx[*frame_ind] = *cur_frame_idx;
+ gf_group->display_idx[*frame_ind] =
+ *cur_disp_idx + gf_group->arf_src_offset[*frame_ind];
gf_group->layer_depth[*frame_ind] = layer_depth;
gf_group->frame_type[*frame_ind] = INTER_FRAME;
gf_group->refbuf_state[*frame_ind] = REFBUF_UPDATE;
@@ -499,15 +503,17 @@
++(*frame_ind);
// Frames displayed before this internal ARF.
- set_multi_layer_params(
- twopass, twopass_frame, gf_group, p_rc, rc, frame_info, start, m,
- cur_frame_idx, frame_ind, parallel_frame_count, max_parallel_frames,
- do_frame_parallel_encode, first_frame_index, layer_depth + 1);
+ set_multi_layer_params(twopass, twopass_frame, gf_group, p_rc, rc,
+ frame_info, start, m, cur_frame_idx, frame_ind,
+ parallel_frame_count, max_parallel_frames,
+ do_frame_parallel_encode, first_frame_index,
+ cur_disp_idx, layer_depth + 1);
// Overlay for internal ARF.
gf_group->update_type[*frame_ind] = INTNL_OVERLAY_UPDATE;
gf_group->arf_src_offset[*frame_ind] = 0;
gf_group->cur_frame_idx[*frame_ind] = *cur_frame_idx;
+ gf_group->display_idx[*frame_ind] = *cur_disp_idx;
gf_group->arf_boost[*frame_ind] = 0;
gf_group->layer_depth[*frame_ind] = layer_depth;
gf_group->frame_type[*frame_ind] = INTER_FRAME;
@@ -516,12 +522,14 @@
set_src_offset(gf_group, first_frame_index, *cur_frame_idx, *frame_ind);
++(*frame_ind);
++(*cur_frame_idx);
+ ++(*cur_disp_idx);
// Frames displayed after this internal ARF.
- set_multi_layer_params(
- twopass, twopass_frame, gf_group, p_rc, rc, frame_info, m + 1, end,
- cur_frame_idx, frame_ind, parallel_frame_count, max_parallel_frames,
- do_frame_parallel_encode, first_frame_index, layer_depth + 1);
+ set_multi_layer_params(twopass, twopass_frame, gf_group, p_rc, rc,
+ frame_info, m + 1, end, cur_frame_idx, frame_ind,
+ parallel_frame_count, max_parallel_frames,
+ do_frame_parallel_encode, first_frame_index,
+ cur_disp_idx, layer_depth + 1);
}
}
@@ -540,22 +548,19 @@
? 0
: cpi->common.current_frame.frame_number;
- // Initialize gf_group->frame_parallel_level and gf_group->is_frame_non_ref to
- // 0.
- memset(
- gf_group->frame_parallel_level, 0,
- sizeof(gf_group->frame_parallel_level[0]) * MAX_STATIC_GF_GROUP_LENGTH);
- memset(gf_group->is_frame_non_ref, 0,
- sizeof(gf_group->is_frame_non_ref[0]) * MAX_STATIC_GF_GROUP_LENGTH);
- memset(gf_group->src_offset, 0,
- sizeof(gf_group->src_offset[0]) * MAX_STATIC_GF_GROUP_LENGTH);
+ // Initialize gf_group->frame_parallel_level, gf_group->is_frame_non_ref,
+ // gf_group->src_offset and gf_group->is_frame_dropped to 0.
+ memset(gf_group->frame_parallel_level, 0,
+ sizeof(gf_group->frame_parallel_level));
+ memset(gf_group->is_frame_non_ref, 0, sizeof(gf_group->is_frame_non_ref));
+ memset(gf_group->src_offset, 0, sizeof(gf_group->src_offset));
+ memset(gf_group->is_frame_dropped, 0, sizeof(gf_group->is_frame_dropped));
// Initialize gf_group->skip_frame_refresh and gf_group->skip_frame_as_ref
// with INVALID_IDX.
memset(gf_group->skip_frame_refresh, INVALID_IDX,
- sizeof(gf_group->skip_frame_refresh[0][0]) *
- MAX_STATIC_GF_GROUP_LENGTH * REF_FRAMES);
+ sizeof(gf_group->skip_frame_refresh));
memset(gf_group->skip_frame_as_ref, INVALID_IDX,
- sizeof(gf_group->skip_frame_as_ref[0]) * MAX_STATIC_GF_GROUP_LENGTH);
+ sizeof(gf_group->skip_frame_as_ref));
int kf_decomp = cpi->oxcf.kf_cfg.enable_keyframe_filtering > 1;
// This is a patch that fixes https://crbug.com/aomedia/3163
@@ -721,11 +726,12 @@
// Rest of the frames.
if (!is_multi_layer_configured)
- set_multi_layer_params(
- twopass, &cpi->twopass_frame, gf_group, p_rc, rc, frame_info,
- cur_frame_index, gf_interval, &cur_frame_index, &frame_index,
- ¶llel_frame_count, cpi->ppi->num_fp_contexts,
- do_frame_parallel_encode, &first_frame_index, use_altref + 1);
+ set_multi_layer_params(twopass, &cpi->twopass_frame, gf_group, p_rc, rc,
+ frame_info, cur_frame_index, gf_interval,
+ &cur_frame_index, &frame_index,
+ ¶llel_frame_count, cpi->ppi->num_fp_contexts,
+ do_frame_parallel_encode, &first_frame_index,
+ &cur_disp_index, use_altref + 1);
if (use_altref) {
gf_group->update_type[frame_index] = OVERLAY_UPDATE;
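The new cur_disp_idx parameter threads a running display-order counter through
the recursive GOP construction: each leaf frame consumes one display slot,
while an internal ARF is stamped with *cur_disp_idx + arf_src_offset, i.e. the
slot of the source frame it will eventually be displayed at, and its overlay
later receives that same slot. A stripped-down sketch of the bookkeeping, with
hypothetical names and most parameters of set_multi_layer_params omitted:

/* Hypothetical, minimal model of the display-index bookkeeping in
 * set_multi_layer_params(); update types and boosts are elided. */
static void assign_display_idx(int start, int end, int *disp_idx,
                               int *display_idx_out, int *frame_ind) {
  if (end - start <= 1) { /* leaf frames: one display slot each */
    for (int i = start; i < end; i++) {
      display_idx_out[(*frame_ind)++] = (*disp_idx)++;
    }
    return;
  }
  const int m = (start + end - 1) / 2; /* internal ARF position */
  /* The ARF is coded now but displayed at its source position. */
  display_idx_out[(*frame_ind)++] = *disp_idx + (m - start);
  assign_display_idx(start, m, disp_idx, display_idx_out, frame_ind);
  /* The overlay lands on the ARF's display slot. */
  display_idx_out[(*frame_ind)++] = (*disp_idx)++;
  assign_display_idx(m + 1, end, disp_idx, display_idx_out, frame_ind);
}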
diff --git a/av1/encoder/hybrid_fwd_txfm.c b/av1/encoder/hybrid_fwd_txfm.c
index eda5ddf..4c2f8d0 100644
--- a/av1/encoder/hybrid_fwd_txfm.c
+++ b/av1/encoder/hybrid_fwd_txfm.c
@@ -18,7 +18,9 @@
#include "av1/encoder/hybrid_fwd_txfm.h"
/* 4-point reversible, orthonormal Walsh-Hadamard in 3.5 adds, 0.5 shifts per
- pixel. */
+ pixel.
+ Shared for both high and low bit depth.
+ */
void av1_fwht4x4_c(const int16_t *input, tran_low_t *output, int stride) {
int i;
tran_high_t a1, b1, c1, d1, e1;
@@ -40,21 +42,21 @@
a1 -= c1;
d1 += b1;
op[0] = (tran_low_t)a1;
- op[4] = (tran_low_t)c1;
- op[8] = (tran_low_t)d1;
- op[12] = (tran_low_t)b1;
+ op[1] = (tran_low_t)c1;
+ op[2] = (tran_low_t)d1;
+ op[3] = (tran_low_t)b1;
ip_pass0++;
- op++;
+ op += 4;
}
ip = output;
op = output;
for (i = 0; i < 4; i++) {
- a1 = ip[0];
- b1 = ip[1];
- c1 = ip[2];
- d1 = ip[3];
+ a1 = ip[4 * 0];
+ b1 = ip[4 * 1];
+ c1 = ip[4 * 2];
+ d1 = ip[4 * 3];
a1 += b1;
d1 -= c1;
@@ -63,21 +65,16 @@
c1 = e1 - c1;
a1 -= c1;
d1 += b1;
- op[0] = (tran_low_t)(a1 * UNIT_QUANT_FACTOR);
- op[1] = (tran_low_t)(c1 * UNIT_QUANT_FACTOR);
- op[2] = (tran_low_t)(d1 * UNIT_QUANT_FACTOR);
- op[3] = (tran_low_t)(b1 * UNIT_QUANT_FACTOR);
+ op[4 * 0] = (tran_low_t)(a1 * UNIT_QUANT_FACTOR);
+ op[4 * 1] = (tran_low_t)(c1 * UNIT_QUANT_FACTOR);
+ op[4 * 2] = (tran_low_t)(d1 * UNIT_QUANT_FACTOR);
+ op[4 * 3] = (tran_low_t)(b1 * UNIT_QUANT_FACTOR);
- ip += 4;
- op += 4;
+ ip++;
+ op++;
}
}
-void av1_highbd_fwht4x4_c(const int16_t *input, tran_low_t *output,
- int stride) {
- av1_fwht4x4_c(input, output, stride);
-}
-
static void highbd_fwd_txfm_4x4(const int16_t *src_diff, tran_low_t *coeff,
int diff_stride, TxfmParam *txfm_param) {
int32_t *dst_coeff = (int32_t *)coeff;
@@ -85,7 +82,7 @@
const int bd = txfm_param->bd;
if (txfm_param->lossless) {
assert(tx_type == DCT_DCT);
- av1_highbd_fwht4x4(src_diff, coeff, diff_stride);
+ av1_fwht4x4(src_diff, coeff, diff_stride);
return;
}
av1_fwd_txfm2d_4x4(src_diff, dst_coeff, diff_stride, tx_type, bd);
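The rewrite of av1_fwht4x4_c changes only where the intermediate results are
stored (the first pass now writes each transformed column contiguously and the
second pass reads with a stride of 4); the butterfly arithmetic is untouched,
which is what lets the high bitdepth wrapper av1_highbd_fwht4x4_c be deleted
so both paths share one implementation. For reference, a sketch of the 1-D
butterfly together with its exact integer inverse, which is why the comment
can call the transform reversible (helper names are illustrative; the real
code applies this per column, then per row, with UNIT_QUANT_FACTOR scaling):

/* Forward 4-point Walsh-Hadamard butterfly, as in av1_fwht4x4_c. */
static void fwht4(const int in[4], int out[4]) {
  int a = in[0], b = in[1], c = in[2], d = in[3];
  a += b;
  d -= c;
  const int e = (a - d) >> 1;
  b = e - b;
  c = e - c;
  a -= c;
  d += b;
  out[0] = a;
  out[1] = c;
  out[2] = d;
  out[3] = b;
}

/* Inverse: the same add/subtract network run backwards recovers the
 * input exactly in integer arithmetic, hence "reversible". */
static void iwht4(const int in[4], int out[4]) {
  int a = in[0], c = in[1], d = in[2], b = in[3];
  a += c;
  d -= b;
  const int e = (a - d) >> 1;
  b = e - b;
  c = e - c;
  out[0] = a - b;
  out[1] = b;
  out[2] = c;
  out[3] = d + c;
}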
diff --git a/av1/encoder/interp_search.c b/av1/encoder/interp_search.c
index 2b7eb91..247fa3e 100644
--- a/av1/encoder/interp_search.c
+++ b/av1/encoder/interp_search.c
@@ -682,6 +682,7 @@
*rd = args->interp_filter_stats[match_found_idx].rd;
x->pred_sse[ref_frame] =
args->interp_filter_stats[match_found_idx].pred_sse;
+ *skip_build_pred = 0;
return 0;
}
diff --git a/av1/encoder/interp_search.h b/av1/encoder/interp_search.h
index 8eba483..bce494e 100644
--- a/av1/encoder/interp_search.h
+++ b/av1/encoder/interp_search.h
@@ -126,6 +126,11 @@
FULLPEL_MV start_mv_stack[(MAX_REF_MV_SEARCH - 1) * 2];
/*!
+ * Stack to store ref_mv_idx of NEWMV mode.
+ */
+ uint8_t ref_mv_idx_stack[(MAX_REF_MV_SEARCH - 1) * 2];
+
+ /*!
* Count of mvs in start mv stack.
*/
int start_mv_cnt;
diff --git a/av1/encoder/intra_mode_search.c b/av1/encoder/intra_mode_search.c
index d863910..3b5dd75 100644
--- a/av1/encoder/intra_mode_search.c
+++ b/av1/encoder/intra_mode_search.c
@@ -10,6 +10,7 @@
*/
#include "av1/common/av1_common_int.h"
+#include "av1/common/cfl.h"
#include "av1/common/reconintra.h"
#include "av1/encoder/intra_mode_search.h"
@@ -149,7 +150,7 @@
x->plane[0].src.buf + i * x->plane[0].src.stride + j,
x->plane[0].src.stride, is_hbd);
block_4x4_var_info->var = src_var;
- log_src_var = log(1.0 + src_var / 16.0);
+ log_src_var = log1p(src_var / 16.0);
block_4x4_var_info->log_var = log_src_var;
} else {
// When source variance is already calculated and available for
@@ -157,7 +158,7 @@
// available, then retrieve from buffer. Else, calculate the same and
// store to the buffer.
if (log_src_var < 0) {
- log_src_var = log(1.0 + src_var / 16.0);
+ log_src_var = log1p(src_var / 16.0);
block_4x4_var_info->log_var = log_src_var;
}
}
@@ -167,7 +168,7 @@
cpi->ppi->fn_ptr[BLOCK_4X4].vf,
xd->plane[0].dst.buf + i * xd->plane[0].dst.stride + j,
xd->plane[0].dst.stride, is_hbd);
- *avg_log_recon_variance += log(1.0 + recon_var / 16.0);
+ *avg_log_recon_variance += log1p(recon_var / 16.0);
}
}
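The log(1.0 + x) to log1p(x) substitutions in this hunk are a
numerical-accuracy fix rather than a behavior change: for variances much
smaller than 1, adding 1.0 first discards most of x's significant bits before
the logarithm is taken, whereas log1p evaluates log(1 + x) accurately near
zero. A quick illustration:

#include <math.h>
#include <stdio.h>

int main(void) {
  const double x = 1e-16; /* e.g. a tiny src_var / 16.0 */
  /* 1.0 + 1e-16 rounds to exactly 1.0 in double precision... */
  printf("log(1+x) = %.17g\n", log(1.0 + x)); /* 0 */
  /* ...but log1p keeps the leading-order term, log1p(x) ~= x. */
  printf("log1p(x) = %.17g\n", log1p(x));     /* ~1e-16 */
  return 0;
}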
@@ -640,6 +641,12 @@
return est_best_cfl_idx;
}
+static AOM_INLINE void set_invalid_cfl_parameters(
+ uint8_t *best_cfl_alpha_idx, int8_t *best_cfl_alpha_signs) {
+ *best_cfl_alpha_idx = 0;
+ *best_cfl_alpha_signs = 0;
+}
+
static void cfl_pick_plane_rd(const AV1_COMP *const cpi, MACROBLOCK *x,
int plane, TX_SIZE tx_size, int cfl_search_range,
RD_STATS cfl_rd_arr[CFL_MAGS_SIZE],
@@ -717,28 +724,44 @@
av1_invalid_rd_stats(best_rd_stats);
// As the dc pred data is the same for different values of alpha, enable the
- // caching of dc pred data.
- xd->cfl.use_dc_pred_cache = 1;
+ // caching of dc pred data. Call clear_cfl_dc_pred_cache_flags() before
+ // returning to avoid the unintentional usage of cached dc pred data.
+ xd->cfl.use_dc_pred_cache = true;
// Evaluate alpha parameter of each chroma plane.
est_best_cfl_idx_u =
cfl_pick_plane_parameter(cpi, x, 1, tx_size, cfl_search_range);
est_best_cfl_idx_v =
cfl_pick_plane_parameter(cpi, x, 2, tx_size, cfl_search_range);
- // For cfl_search_range=1, further refinement of alpha is not enabled. Hence
- // CfL index=0 for both the chroma planes implies invalid CfL mode.
- if (cfl_search_range == 1 && est_best_cfl_idx_u == CFL_INDEX_ZERO &&
- est_best_cfl_idx_v == CFL_INDEX_ZERO) {
- // Set invalid CfL parameters here as CfL mode is invalid.
- *best_cfl_alpha_idx = 0;
- *best_cfl_alpha_signs = 0;
+ if (cfl_search_range == 1) {
+ // For cfl_search_range=1, further refinement of alpha is not enabled. Hence
+ // CfL index=0 for both the chroma planes implies invalid CfL mode.
+ if (est_best_cfl_idx_u == CFL_INDEX_ZERO &&
+ est_best_cfl_idx_v == CFL_INDEX_ZERO) {
+ set_invalid_cfl_parameters(best_cfl_alpha_idx, best_cfl_alpha_signs);
+ clear_cfl_dc_pred_cache_flags(&xd->cfl);
+ return 0;
+ }
- // Clear the following flags to avoid the unintentional usage of cached dc
- // pred data.
- xd->cfl.use_dc_pred_cache = 0;
- xd->cfl.dc_pred_is_cached[0] = 0;
- xd->cfl.dc_pred_is_cached[1] = 0;
- return 0;
+ int cfl_alpha_u, cfl_alpha_v;
+ CFL_SIGN_TYPE cfl_sign_u, cfl_sign_v;
+ const MB_MODE_INFO *mbmi = xd->mi[0];
+ cfl_idx_to_sign_and_alpha(est_best_cfl_idx_u, &cfl_sign_u, &cfl_alpha_u);
+ cfl_idx_to_sign_and_alpha(est_best_cfl_idx_v, &cfl_sign_v, &cfl_alpha_v);
+ const int joint_sign = cfl_sign_u * CFL_SIGNS + cfl_sign_v - 1;
+ // Compute alpha and mode signaling rate.
+ const int rate_overhead =
+ mode_costs->cfl_cost[joint_sign][CFL_PRED_U][cfl_alpha_u] +
+ mode_costs->cfl_cost[joint_sign][CFL_PRED_V][cfl_alpha_v] +
+ mode_costs
+ ->intra_uv_mode_cost[is_cfl_allowed(xd)][mbmi->mode][UV_CFL_PRED];
+ // Skip the CfL mode evaluation if the RD cost derived using the rate needed
+ // to signal the CfL mode and alpha parameter exceeds the ref_best_rd.
+ if (RDCOST(x->rdmult, rate_overhead, 0) > ref_best_rd) {
+ set_invalid_cfl_parameters(best_cfl_alpha_idx, best_cfl_alpha_signs);
+ clear_cfl_dc_pred_cache_flags(&xd->cfl);
+ return 0;
+ }
}
// Compute the rd cost of each chroma plane using the alpha parameters which
@@ -748,11 +771,7 @@
cfl_pick_plane_rd(cpi, x, 2, tx_size, cfl_search_range, cfl_rd_arr_v,
est_best_cfl_idx_v);
- // Clear the following flags to avoid the unintentional usage of cached dc
- // pred data.
- xd->cfl.use_dc_pred_cache = 0;
- xd->cfl.dc_pred_is_cached[0] = 0;
- xd->cfl.dc_pred_is_cached[1] = 0;
+ clear_cfl_dc_pred_cache_flags(&xd->cfl);
for (int ui = 0; ui < CFL_MAGS_SIZE; ++ui) {
if (cfl_rd_arr_u[ui].rate == INT_MAX) continue;
@@ -789,8 +808,7 @@
av1_invalid_rd_stats(best_rd_stats);
// Set invalid CFL parameters here since the rdcost is not better than
// ref_best_rd.
- *best_cfl_alpha_idx = 0;
- *best_cfl_alpha_signs = 0;
+ set_invalid_cfl_parameters(best_cfl_alpha_idx, best_cfl_alpha_signs);
return 0;
}
return 1;
@@ -850,12 +868,20 @@
}
IntraModeSearchState intra_search_state;
init_intra_mode_search_state(&intra_search_state);
+ const CFL_ALLOWED_TYPE cfl_allowed = is_cfl_allowed(xd);
// Search through all non-palette modes.
for (int mode_idx = 0; mode_idx < UV_INTRA_MODES; ++mode_idx) {
int this_rate;
RD_STATS tokenonly_rd_stats;
UV_PREDICTION_MODE mode = uv_rd_search_mode_order[mode_idx];
+
+ // Skip the current mode evaluation if the RD cost derived using the mode
+ // signaling rate exceeds the best_rd so far.
+ const int mode_rate =
+ mode_costs->intra_uv_mode_cost[cfl_allowed][mbmi->mode][mode];
+ if (RDCOST(x->rdmult, mode_rate, 0) > best_rd) continue;
+
const int is_diagonal_mode = av1_is_diagonal_mode(get_uv_mode(mode));
const int is_directional_mode = av1_is_directional_mode(get_uv_mode(mode));
@@ -885,7 +911,7 @@
const SPEED_FEATURES *sf = &cpi->sf;
mbmi->angle_delta[PLANE_TYPE_UV] = 0;
if (mode == UV_CFL_PRED) {
- if (!is_cfl_allowed(xd) || !intra_mode_cfg->enable_cfl_intra) continue;
+ if (!cfl_allowed || !intra_mode_cfg->enable_cfl_intra) continue;
assert(!is_directional_mode);
const TX_SIZE uv_tx_size = av1_get_tx_size(AOM_PLANE_U, xd);
if (!cfl_rd_pick_alpha(x, cpi, uv_tx_size, best_rd,
@@ -916,7 +942,7 @@
// Search through angle delta
const int rate_overhead =
- mode_costs->intra_uv_mode_cost[is_cfl_allowed(xd)][mbmi->mode][mode];
+ mode_costs->intra_uv_mode_cost[cfl_allowed][mbmi->mode][mode];
if (!rd_pick_intra_angle_sbuv(cpi, x, bsize, rate_overhead, best_rd,
&this_rate, &tokenonly_rd_stats))
continue;
@@ -932,7 +958,7 @@
}
}
const int mode_cost =
- mode_costs->intra_uv_mode_cost[is_cfl_allowed(xd)][mbmi->mode][mode];
+ mode_costs->intra_uv_mode_cost[cfl_allowed][mbmi->mode][mode];
this_rate = tokenonly_rd_stats.rate +
intra_mode_info_cost_uv(cpi, x, mbmi, bsize, mode_cost);
this_rd = RDCOST(x->rdmult, this_rate, tokenonly_rd_stats.dist);
@@ -956,8 +982,7 @@
uint8_t *best_palette_color_map = x->palette_buffer->best_palette_color_map;
av1_rd_pick_palette_intra_sbuv(
cpi, x,
- mode_costs
- ->intra_uv_mode_cost[is_cfl_allowed(xd)][mbmi->mode][UV_DC_PRED],
+ mode_costs->intra_uv_mode_cost[cfl_allowed][mbmi->mode][UV_DC_PRED],
best_palette_color_map, &best_mbmi, &best_rd, rate, rate_tokenonly,
distortion, skippable);
}
@@ -1143,9 +1168,13 @@
MACROBLOCKD *const xd = &x->e_mbd;
MB_MODE_INFO *const mbmi = xd->mi[0];
RD_STATS rd_stats;
- // In order to improve txfm search avoid rd based breakouts during winner
- // mode evaluation. Hence passing ref_best_rd as a maximum value
- av1_pick_uniform_tx_size_type_yrd(cpi, x, &rd_stats, bsize, INT64_MAX);
+ // In order to improve txfm search, avoid rd based breakouts during winner
+ // mode evaluation. Hence passing ref_best_rd as INT64_MAX by default when the
+ // speed feature use_rd_based_breakout_for_intra_tx_search is disabled.
+ int64_t ref_best_rd = cpi->sf.tx_sf.use_rd_based_breakout_for_intra_tx_search
+ ? *best_rd
+ : INT64_MAX;
+ av1_pick_uniform_tx_size_type_yrd(cpi, x, &rd_stats, bsize, ref_best_rd);
if (rd_stats.rate == INT_MAX) return 0;
int this_rate_tokenonly = rd_stats.rate;
if (!xd->lossless[mbmi->segment_id] && block_signals_txsize(mbmi->bsize)) {
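Both early exits added in this file lean on the same monotonicity argument: RD
cost never decreases as rate grows, so if the rate needed merely to signal a
mode, taken with zero distortion, already pushes the cost past the best found
so far, the full evaluation cannot win. A sketch of the predicate, where
rdcost is a simplified stand-in for libaom's RDCOST macro (the real macro
applies fixed-point shifts to both terms):

#include <stdint.h>

/* Simplified stand-in for RDCOST(rdmult, rate, dist). */
static int64_t rdcost(int rdmult, int rate, int64_t dist) {
  return (int64_t)rate * rdmult + dist;
}

/* Returns 1 if a mode whose signaling alone costs `mode_rate` bits can
 * be pruned against the current best RD cost: even with dist == 0 its
 * total cost is already worse, and distortion can only add to it. */
static int prune_mode_by_signaling_rate(int rdmult, int mode_rate,
                                        int64_t best_rd) {
  return rdcost(rdmult, mode_rate, 0) > best_rd;
}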
diff --git a/av1/encoder/k_means_template.h b/av1/encoder/k_means_template.h
index 31ffdcf..4be2038 100644
--- a/av1/encoder/k_means_template.h
+++ b/av1/encoder/k_means_template.h
@@ -123,6 +123,10 @@
l = (l == 1) ? 0 : 1;
RENAME(calc_centroids)(data, meta_centroids[l], meta_indices[prev_l], n, k);
+ if (!memcmp(meta_centroids[l], meta_centroids[prev_l],
+ sizeof(centroids[0]) * k * AV1_K_MEANS_DIM)) {
+ break;
+ }
#if AV1_K_MEANS_DIM == 1
av1_calc_indices_dim1(data, meta_centroids[l], meta_indices[l], &this_dist,
n, k);
@@ -135,9 +139,6 @@
best_l = prev_l;
break;
}
- if (!memcmp(meta_centroids[l], meta_centroids[prev_l],
- sizeof(centroids[0]) * k * AV1_K_MEANS_DIM))
- break;
}
if (i == max_itr) best_l = l;
if (best_l != 0) {
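Hoisting the memcmp above the index computation means convergence is now
detected immediately after calc_centroids: if the new centroids are
bit-identical to the previous ones, the assignments and distortion cannot
change, so the extra indices pass was pure waste. A self-contained 1-D sketch
of the loop shape, with hypothetical helper names in place of the RENAME()'d
template functions:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static void calc_indices_1d(const int *data, const int *centroids,
                            uint8_t *indices, int n, int k) {
  for (int i = 0; i < n; i++) {
    int best = 0, best_d = abs(data[i] - centroids[0]);
    for (int j = 1; j < k; j++) {
      const int d = abs(data[i] - centroids[j]);
      if (d < best_d) best_d = d, best = j;
    }
    indices[i] = (uint8_t)best;
  }
}

static void calc_centroids_1d(const int *data, int *centroids,
                              const uint8_t *indices, int n, int k) {
  for (int j = 0; j < k; j++) {
    int64_t sum = 0, cnt = 0;
    for (int i = 0; i < n; i++)
      if (indices[i] == j) sum += data[i], cnt++;
    if (cnt) centroids[j] = (int)(sum / cnt);
  }
}

static void kmeans_1d(const int *data, int *centroids, uint8_t *indices,
                      int n, int k, int max_itr) {
  int prev[16]; /* assumes k <= 16 for this sketch */
  calc_indices_1d(data, centroids, indices, n, k);
  for (int i = 0; i < max_itr; i++) {
    memcpy(prev, centroids, sizeof(*centroids) * k);
    calc_centroids_1d(data, centroids, indices, n, k);
    /* Early out, as in the diff: identical centroids cannot produce
     * different assignments, so skip the index pass entirely. */
    if (!memcmp(centroids, prev, sizeof(*centroids) * k)) break;
    calc_indices_1d(data, centroids, indices, n, k);
  }
}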
diff --git a/av1/encoder/level.c b/av1/encoder/level.c
index 5741659..5d5fe9c 100644
--- a/av1/encoder/level.c
+++ b/av1/encoder/level.c
@@ -522,9 +522,10 @@
}
#define MAX_TIME 1e16
-double time_next_buffer_is_free(int num_decoded_frame, int decoder_buffer_delay,
- const FRAME_BUFFER *frame_buffer_pool,
- double current_time) {
+static double time_next_buffer_is_free(int num_decoded_frame,
+ int decoder_buffer_delay,
+ const FRAME_BUFFER *frame_buffer_pool,
+ double current_time) {
if (num_decoded_frame == 0) {
return (double)decoder_buffer_delay / 90000.0;
}
@@ -1243,7 +1244,8 @@
AOMMAX(level_spec->max_decode_rate, decoded_samples);
level_spec->max_tile_rate = AOMMAX(level_spec->max_tile_rate, tiles);
level_stats->max_bitrate =
- AOMMAX(level_stats->max_bitrate, (int)encoded_size_in_bytes * 8);
+ AOMMAX(level_stats->max_bitrate,
+ (int)AOMMIN(encoded_size_in_bytes * 8, (size_t)INT_MAX));
}
void av1_update_level_info(AV1_COMP *cpi, size_t size, int64_t ts_start,
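The max_bitrate change guards the size_t-to-int cast: encoded_size_in_bytes * 8
is computed in size_t, and converting a value above INT_MAX to int is
implementation-defined, so an enormous frame could previously record a garbage
(possibly negative) bitrate. Clamping to INT_MAX first makes the statistic
saturate instead. The pattern in isolation:

#include <limits.h>
#include <stddef.h>

/* Saturating size_t -> int conversion, as used for max_bitrate. */
static int bits_as_int_saturated(size_t encoded_size_in_bytes) {
  const size_t bits = encoded_size_in_bytes * 8;
  return (int)(bits < (size_t)INT_MAX ? bits : (size_t)INT_MAX);
}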
diff --git a/av1/encoder/lookahead.c b/av1/encoder/lookahead.c
index 10fbb77..9ef9b88 100644
--- a/av1/encoder/lookahead.c
+++ b/av1/encoder/lookahead.c
@@ -46,7 +46,7 @@
unsigned int width, unsigned int height, unsigned int subsampling_x,
unsigned int subsampling_y, int use_highbitdepth, unsigned int depth,
const int border_in_pixels, int byte_alignment, int num_lap_buffers,
- bool is_all_intra, int enable_global_motion) {
+ bool is_all_intra, int num_pyramid_levels) {
int lag_in_frames = AOMMAX(1, depth);
// For all-intra frame encoding, previous source frames are not required.
@@ -82,7 +82,7 @@
if (aom_realloc_frame_buffer(
&ctx->buf[i].img, width, height, subsampling_x, subsampling_y,
use_highbitdepth, border_in_pixels, byte_alignment, NULL, NULL,
- NULL, enable_global_motion, 0)) {
+ NULL, num_pyramid_levels, 0)) {
goto fail;
}
}
@@ -100,7 +100,7 @@
int av1_lookahead_push(struct lookahead_ctx *ctx, const YV12_BUFFER_CONFIG *src,
int64_t ts_start, int64_t ts_end, int use_highbitdepth,
- aom_enc_frame_flags_t flags) {
+ int num_pyramid_levels, aom_enc_frame_flags_t flags) {
int width = src->y_crop_width;
int height = src->y_crop_height;
int uv_width = src->uv_crop_width;
@@ -134,7 +134,7 @@
memset(&new_img, 0, sizeof(new_img));
if (aom_alloc_frame_buffer(&new_img, width, height, subsampling_x,
subsampling_y, use_highbitdepth,
- AOM_BORDER_IN_PIXELS, 0, 0))
+ AOM_BORDER_IN_PIXELS, 0, num_pyramid_levels, 0))
return 1;
aom_free_frame_buffer(&buf->img);
buf->img = new_img;
diff --git a/av1/encoder/lookahead.h b/av1/encoder/lookahead.h
index bd7cae4..c0e6d22 100644
--- a/av1/encoder/lookahead.h
+++ b/av1/encoder/lookahead.h
@@ -70,7 +70,7 @@
unsigned int width, unsigned int height, unsigned int subsampling_x,
unsigned int subsampling_y, int use_highbitdepth, unsigned int depth,
const int border_in_pixels, int byte_alignment, int num_lap_buffers,
- bool is_all_intra, int enable_global_motion);
+ bool is_all_intra, int num_pyramid_levels);
/**\brief Destroys the lookahead stage
*/
@@ -90,11 +90,13 @@
* \param[in] ts_start Timestamp for the start of this frame
* \param[in] ts_end Timestamp for the end of this frame
* \param[in] use_highbitdepth Tell if HBD is used
+ * \param[in] num_pyramid_levels Number of pyramid levels to allocate
+ for each frame buffer
* \param[in] flags Flags set on this frame
*/
int av1_lookahead_push(struct lookahead_ctx *ctx, const YV12_BUFFER_CONFIG *src,
int64_t ts_start, int64_t ts_end, int use_highbitdepth,
- aom_enc_frame_flags_t flags);
+ int num_pyramid_levels, aom_enc_frame_flags_t flags);
/**\brief Get the next source buffer to encode
*
diff --git a/av1/encoder/mcomp.c b/av1/encoder/mcomp.c
index 8fd1ab1..cc39c81 100644
--- a/av1/encoder/mcomp.c
+++ b/av1/encoder/mcomp.c
@@ -94,10 +94,12 @@
void av1_make_default_fullpel_ms_params(
FULLPEL_MOTION_SEARCH_PARAMS *ms_params, const struct AV1_COMP *cpi,
- MACROBLOCK *x, BLOCK_SIZE bsize, const MV *ref_mv,
+ MACROBLOCK *x, BLOCK_SIZE bsize, const MV *ref_mv, FULLPEL_MV start_mv,
const search_site_config search_sites[NUM_DISTINCT_SEARCH_METHODS],
int fine_search_interval) {
const MV_SPEED_FEATURES *mv_sf = &cpi->sf.mv_sf;
+ const int is_key_frame =
+ cpi->ppi->gf_group.update_type[cpi->gf_frame_index] == KF_UPDATE;
// High level params
ms_params->bsize = bsize;
@@ -129,19 +131,6 @@
av1_set_mv_search_method(ms_params, search_sites, search_method);
- const int use_downsampled_sad =
- mv_sf->use_downsampled_sad && block_size_high[bsize] >= 16;
- if (use_downsampled_sad) {
- ms_params->sdf = ms_params->vfp->sdsf;
- ms_params->sdx4df = ms_params->vfp->sdsx4df;
- // Skip version of sadx3 is not is not available yet
- ms_params->sdx3df = ms_params->vfp->sdsx4df;
- } else {
- ms_params->sdf = ms_params->vfp->sdf;
- ms_params->sdx4df = ms_params->vfp->sdx4df;
- ms_params->sdx3df = ms_params->vfp->sdx3df;
- }
-
ms_params->mesh_patterns[0] = mv_sf->mesh_patterns;
ms_params->mesh_patterns[1] = mv_sf->intrabc_mesh_patterns;
ms_params->force_mesh_thresh = mv_sf->exhaustive_searches_thresh;
@@ -161,6 +150,47 @@
// Mvcost params
init_mv_cost_params(&ms_params->mv_cost_params, x->mv_costs, ref_mv,
x->errorperbit, x->sadperbit);
+
+ ms_params->sdf = ms_params->vfp->sdf;
+ ms_params->sdx4df = ms_params->vfp->sdx4df;
+ ms_params->sdx3df = ms_params->vfp->sdx3df;
+
+ if (mv_sf->use_downsampled_sad == 2 && block_size_high[bsize] >= 16) {
+ ms_params->sdf = ms_params->vfp->sdsf;
+ ms_params->sdx4df = ms_params->vfp->sdsx4df;
+ // Skip version of sadx3 is not available yet
+ ms_params->sdx3df = ms_params->vfp->sdsx4df;
+ } else if (mv_sf->use_downsampled_sad == 1 && block_size_high[bsize] >= 16 &&
+ !is_key_frame) {
+ FULLPEL_MV start_mv_clamped = start_mv;
+ // adjust start_mv to make sure it is within MV range
+ clamp_fullmv(&start_mv_clamped, &ms_params->mv_limits);
+
+ const struct buf_2d *const ref = ms_params->ms_buffers.ref;
+ const int ref_stride = ref->stride;
+ const uint8_t *best_address = get_buf_from_fullmv(ref, &start_mv_clamped);
+ const struct buf_2d *const src = ms_params->ms_buffers.src;
+ const uint8_t *src_buf = src->buf;
+ const int src_stride = src->stride;
+
+ unsigned int start_mv_sad_even_rows, start_mv_sad_odd_rows;
+ start_mv_sad_even_rows =
+ ms_params->vfp->sdsf(src_buf, src_stride, best_address, ref_stride);
+ start_mv_sad_odd_rows =
+ ms_params->vfp->sdsf(src_buf + src_stride, src_stride,
+ best_address + ref_stride, ref_stride);
+
+ // If the absolute difference between the even-row and odd-row
+ // pred-to-src SADs is small, skip every other row in the SAD computation.
+ const int odd_to_even_diff_sad =
+ abs((int)start_mv_sad_even_rows - (int)start_mv_sad_odd_rows);
+ const int mult_thresh = 4;
+ if (odd_to_even_diff_sad * mult_thresh < (int)start_mv_sad_even_rows) {
+ ms_params->sdf = ms_params->vfp->sdsf;
+ ms_params->sdx4df = ms_params->vfp->sdsx4df;
+ ms_params->sdx3df = ms_params->vfp->sdsx4df;
+ }
+ }
}
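The new use_downsampled_sad == 1 path is content-adaptive where level 2 is
unconditional: it computes the skip-every-other-row SAD at the start MV twice,
once anchored on even rows and once on odd rows, and only switches to the
downsampled SAD functions when the two agree closely, meaning the block is
vertically smooth enough that half the rows carry nearly the same information.
The decision in isolation (sad_skip_fn stands in for the vfp->sdsf pointer):

#include <stdint.h>
#include <stdlib.h>

/* Stand-in for ms_params->vfp->sdsf: SAD over every other row. */
typedef unsigned int (*sad_skip_fn)(const uint8_t *src, int src_stride,
                                    const uint8_t *ref, int ref_stride);

/* Returns 1 if downsampled SAD is safe for this block, per the even/odd
 * row agreement test in av1_make_default_fullpel_ms_params. */
static int should_downsample_sad(sad_skip_fn sad_skip_rows,
                                 const uint8_t *src, int src_stride,
                                 const uint8_t *ref, int ref_stride) {
  const unsigned int sad_even =
      sad_skip_rows(src, src_stride, ref, ref_stride);
  const unsigned int sad_odd =
      sad_skip_rows(src + src_stride, src_stride, ref + ref_stride,
                    ref_stride);
  const int diff = abs((int)sad_even - (int)sad_odd);
  const int mult_thresh = 4;
  return diff * mult_thresh < (int)sad_even;
}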
void av1_set_ms_to_intra_mode(FULLPEL_MOTION_SEARCH_PARAMS *ms_params,
@@ -228,6 +258,9 @@
if (mv_limits->col_max > col_max) mv_limits->col_max = col_max;
if (mv_limits->row_min < row_min) mv_limits->row_min = row_min;
if (mv_limits->row_max > row_max) mv_limits->row_max = row_max;
+
+ mv_limits->col_max = AOMMAX(mv_limits->col_min, mv_limits->col_max);
+ mv_limits->row_max = AOMMAX(mv_limits->row_min, mv_limits->row_max);
}
int av1_init_search_range(int size) {
@@ -649,6 +682,14 @@
cfg->num_search_steps = MAX_PATTERN_SCALES;
}
+const av1_init_search_site_config
+ av1_init_motion_compensation[NUM_DISTINCT_SEARCH_METHODS] = {
+ av1_init_dsmotion_compensation, av1_init_motion_compensation_nstep,
+ av1_init_motion_compensation_nstep, av1_init_dsmotion_compensation,
+ av1_init_motion_compensation_hex, av1_init_motion_compensation_bigdia,
+ av1_init_motion_compensation_square
+ };
+
// Checks whether the mv is within range of the mv_limits
static INLINE int check_bounds(const FullMvLimits *mv_limits, int row, int col,
int range) {
@@ -1312,88 +1353,76 @@
do_init_search, cost_list, best_mv);
}
-static int diamond_search_sad(FULLPEL_MV start_mv,
+static int diamond_search_sad(FULLPEL_MV start_mv, unsigned int start_mv_sad,
const FULLPEL_MOTION_SEARCH_PARAMS *ms_params,
const int search_step, int *num00,
FULLPEL_MV *best_mv, FULLPEL_MV *second_best_mv) {
+#define UPDATE_SEARCH_STEP \
+ do { \
+ if (best_site != 0) { \
+ tmp_second_best_mv = *best_mv; \
+ best_mv->row += site[best_site].mv.row; \
+ best_mv->col += site[best_site].mv.col; \
+ best_address += site[best_site].offset; \
+ is_off_center = 1; \
+ } \
+ \
+ if (is_off_center == 0) num_center_steps++; \
+ \
+ if (best_site == 0 && step > 2) { \
+ int next_step_size = cfg->radius[step - 1]; \
+ while (next_step_size == cfg->radius[step] && step > 2) { \
+ num_center_steps++; \
+ --step; \
+ next_step_size = cfg->radius[step - 1]; \
+ } \
+ } \
+ } while (0)
+
const struct buf_2d *const src = ms_params->ms_buffers.src;
const struct buf_2d *const ref = ms_params->ms_buffers.ref;
+ const uint8_t *src_buf = src->buf;
+ const int src_stride = src->stride;
const int ref_stride = ref->stride;
- const uint8_t *best_address;
- const uint8_t *mask = ms_params->ms_buffers.mask;
- const uint8_t *second_pred = ms_params->ms_buffers.second_pred;
const MV_COST_PARAMS *mv_cost_params = &ms_params->mv_cost_params;
const search_site_config *cfg = ms_params->search_sites;
- unsigned int bestsad = INT_MAX;
- int best_site = 0;
int is_off_center = 0;
-
- clamp_fullmv(&start_mv, &ms_params->mv_limits);
+ // Number of times that we have stayed in the middle. This is used to skip
+ // search steps in the future if diamond_search_sad is called again.
+ int num_center_steps = 0;
// search_step determines the length of the initial step and hence the number
// of iterations.
const int tot_steps = cfg->num_search_steps - search_step;
+ FULLPEL_MV tmp_second_best_mv;
+ if (second_best_mv) {
+ tmp_second_best_mv = *second_best_mv;
+ }
- *num00 = 0;
*best_mv = start_mv;
// Check the starting position
- best_address = get_buf_from_fullmv(ref, &start_mv);
- bestsad = get_mvpred_compound_sad(ms_params, src, best_address, ref_stride);
- bestsad += mvsad_err_cost_(best_mv, &ms_params->mv_cost_params);
+ const uint8_t *best_address = get_buf_from_fullmv(ref, &start_mv);
+ unsigned int bestsad = start_mv_sad;
- int next_step_size = tot_steps > 2 ? cfg->radius[tot_steps - 2] : 1;
- for (int step = tot_steps - 1; step >= 0; --step) {
- const search_site *site = cfg->site[step];
- best_site = 0;
- if (step > 0) next_step_size = cfg->radius[step - 1];
+ // TODO([email protected]): Implement 4 points search for msdf&sdaf
+ if (ms_params->ms_buffers.second_pred) {
+ for (int step = tot_steps - 1; step >= 0; --step) {
+ const search_site *site = cfg->site[step];
+ const int num_searches = cfg->searches_per_step[step];
+ int best_site = 0;
- int all_in = 1, j;
- // Trap illegal vectors
- all_in &= best_mv->row + site[1].mv.row >= ms_params->mv_limits.row_min;
- all_in &= best_mv->row + site[2].mv.row <= ms_params->mv_limits.row_max;
- all_in &= best_mv->col + site[3].mv.col >= ms_params->mv_limits.col_min;
- all_in &= best_mv->col + site[4].mv.col <= ms_params->mv_limits.col_max;
-
- // TODO(anyone): Implement 4 points search for msdf&sdaf
- if (all_in && !mask && !second_pred) {
- const uint8_t *src_buf = src->buf;
- const int src_stride = src->stride;
- for (int idx = 1; idx <= cfg->searches_per_step[step]; idx += 4) {
- unsigned char const *block_offset[4];
- unsigned int sads[4];
-
- for (j = 0; j < 4; j++)
- block_offset[j] = site[idx + j].offset + best_address;
-
- ms_params->sdx4df(src_buf, src_stride, block_offset, ref_stride, sads);
- for (j = 0; j < 4; j++) {
- if (sads[j] < bestsad) {
- const FULLPEL_MV this_mv = { best_mv->row + site[idx + j].mv.row,
- best_mv->col + site[idx + j].mv.col };
- unsigned int thissad =
- sads[j] + mvsad_err_cost_(&this_mv, mv_cost_params);
- if (thissad < bestsad) {
- bestsad = thissad;
- best_site = idx + j;
- }
- }
- }
- }
- } else {
- for (int idx = 1; idx <= cfg->searches_per_step[step]; idx++) {
+ for (int idx = 1; idx <= num_searches; idx++) {
const FULLPEL_MV this_mv = { best_mv->row + site[idx].mv.row,
best_mv->col + site[idx].mv.col };
if (av1_is_fullmv_in_range(&ms_params->mv_limits, this_mv)) {
const uint8_t *const check_here = site[idx].offset + best_address;
- unsigned int thissad;
-
- thissad =
+ unsigned int thissad =
get_mvpred_compound_sad(ms_params, src, check_here, ref_stride);
if (thissad < bestsad) {
@@ -1405,47 +1434,112 @@
}
}
}
+ UPDATE_SEARCH_STEP;
}
+ } else {
+ for (int step = tot_steps - 1; step >= 0; --step) {
+ const search_site *site = cfg->site[step];
+ const int num_searches = cfg->searches_per_step[step];
+ int best_site = 0;
- if (best_site != 0) {
- if (second_best_mv) {
- *second_best_mv = *best_mv;
+ int all_in = 1;
+ // Trap illegal vectors
+ all_in &= best_mv->row + site[1].mv.row >= ms_params->mv_limits.row_min;
+ all_in &= best_mv->row + site[2].mv.row <= ms_params->mv_limits.row_max;
+ all_in &= best_mv->col + site[3].mv.col >= ms_params->mv_limits.col_min;
+ all_in &= best_mv->col + site[4].mv.col <= ms_params->mv_limits.col_max;
+
+ if (all_in) {
+ for (int idx = 1; idx <= num_searches; idx += 4) {
+ unsigned char const *block_offset[4];
+ unsigned int sads[4];
+
+ for (int j = 0; j < 4; j++)
+ block_offset[j] = site[idx + j].offset + best_address;
+
+ ms_params->sdx4df(src_buf, src_stride, block_offset, ref_stride,
+ sads);
+ for (int j = 0; j < 4; j++) {
+ if (sads[j] < bestsad) {
+ const FULLPEL_MV this_mv = { best_mv->row + site[idx + j].mv.row,
+ best_mv->col +
+ site[idx + j].mv.col };
+ unsigned int thissad =
+ sads[j] + mvsad_err_cost_(&this_mv, mv_cost_params);
+ if (thissad < bestsad) {
+ bestsad = thissad;
+ best_site = idx + j;
+ }
+ }
+ }
+ }
+ } else {
+ for (int idx = 1; idx <= num_searches; idx++) {
+ const FULLPEL_MV this_mv = { best_mv->row + site[idx].mv.row,
+ best_mv->col + site[idx].mv.col };
+
+ if (av1_is_fullmv_in_range(&ms_params->mv_limits, this_mv)) {
+ const uint8_t *const check_here = site[idx].offset + best_address;
+ unsigned int thissad =
+ get_mvpred_sad(ms_params, src, check_here, ref_stride);
+
+ if (thissad < bestsad) {
+ thissad += mvsad_err_cost_(&this_mv, mv_cost_params);
+ if (thissad < bestsad) {
+ bestsad = thissad;
+ best_site = idx;
+ }
+ }
+ }
+ }
}
- best_mv->row += site[best_site].mv.row;
- best_mv->col += site[best_site].mv.col;
- best_address += site[best_site].offset;
- is_off_center = 1;
- }
-
- if (is_off_center == 0) (*num00)++;
-
- if (best_site == 0) {
- while (next_step_size == cfg->radius[step] && step > 2) {
- ++(*num00);
- --step;
- next_step_size = cfg->radius[step - 1];
- }
+ UPDATE_SEARCH_STEP;
}
}
+ *num00 = num_center_steps;
+ if (second_best_mv) {
+ *second_best_mv = tmp_second_best_mv;
+ }
+
return bestsad;
+
+#undef UPDATE_SEARCH_STEP
}
-/* do_refine: If last step (1-away) of n-step search doesn't pick the center
- point as the best match, we will do a final 1-away diamond
- refining search */
-static int full_pixel_diamond(const FULLPEL_MV start_mv,
+static INLINE unsigned int get_start_mvpred_sad_cost(
+ const FULLPEL_MOTION_SEARCH_PARAMS *ms_params, FULLPEL_MV start_mv) {
+ const struct buf_2d *const src = ms_params->ms_buffers.src;
+ const struct buf_2d *const ref = ms_params->ms_buffers.ref;
+ const uint8_t *best_address = get_buf_from_fullmv(ref, &start_mv);
+
+ unsigned int start_mv_sad =
+ mvsad_err_cost_(&start_mv, &ms_params->mv_cost_params);
+
+ if (ms_params->ms_buffers.second_pred)
+ start_mv_sad +=
+ get_mvpred_compound_sad(ms_params, src, best_address, ref->stride);
+ else
+ start_mv_sad += get_mvpred_sad(ms_params, src, best_address, ref->stride);
+
+ return start_mv_sad;
+}
+
+static int full_pixel_diamond(FULLPEL_MV start_mv,
const FULLPEL_MOTION_SEARCH_PARAMS *ms_params,
const int step_param, int *cost_list,
FULLPEL_MV *best_mv, FULLPEL_MV *second_best_mv) {
const search_site_config *cfg = ms_params->search_sites;
int thissme, n, num00 = 0;
- int bestsme = diamond_search_sad(start_mv, ms_params, step_param, &n, best_mv,
- second_best_mv);
- if (bestsme < INT_MAX) {
- bestsme = get_mvpred_compound_var_cost(ms_params, best_mv);
- }
+ // Clamp start mv and calculate the cost
+ clamp_fullmv(&start_mv, &ms_params->mv_limits);
+ unsigned int start_mv_sad = get_start_mvpred_sad_cost(ms_params, start_mv);
+
+ diamond_search_sad(start_mv, start_mv_sad, ms_params, step_param, &n, best_mv,
+ second_best_mv);
+
+ int bestsme = get_mvpred_compound_var_cost(ms_params, best_mv);
// If there won't be more n-step search, check to see if refining search is
// needed.
@@ -1453,23 +1547,23 @@
while (n < further_steps) {
++n;
+ // TODO([email protected]): There is another bug here where the second
+ // best mv gets incorrectly overwritten. Fix it later.
+ FULLPEL_MV tmp_best_mv;
+ diamond_search_sad(start_mv, start_mv_sad, ms_params, step_param + n,
+ &num00, &tmp_best_mv, second_best_mv);
+
+ thissme = get_mvpred_compound_var_cost(ms_params, &tmp_best_mv);
+
+ if (thissme < bestsme) {
+ bestsme = thissme;
+ *best_mv = tmp_best_mv;
+ }
+
if (num00) {
- num00--;
- } else {
- // TODO([email protected]): There is another bug here where the second
- // best mv gets incorrectly overwritten. Fix it later.
- FULLPEL_MV tmp_best_mv;
- thissme = diamond_search_sad(start_mv, ms_params, step_param + n, &num00,
- &tmp_best_mv, second_best_mv);
-
- if (thissme < INT_MAX) {
- thissme = get_mvpred_compound_var_cost(ms_params, &tmp_best_mv);
- }
-
- if (thissme < bestsme) {
- bestsme = thissme;
- *best_mv = tmp_best_mv;
- }
+ // Advance the loop by num00 steps
+ n += num00;
+ num00 = 0;
}
}
@@ -1575,6 +1669,12 @@
int range = mesh_patterns[0].range;
int baseline_interval_divisor;
+ // TODO([email protected]): Currently exhaustive search calls single ref
+ // version of sad and variance function. We still need to check the
+ // performance when compound ref exhaustive search is enabled.
+ assert(!ms_params->ms_buffers.second_pred &&
+ "Mesh search does not support compound mode!");
+
*best_mv = start_mv;
// Trap illegal values for interval and range for this function.
@@ -1772,7 +1872,8 @@
// Should we allow a follow on exhaustive search?
if (!run_mesh_search &&
- ((search_method == NSTEP) || (search_method == NSTEP_8PT))) {
+ ((search_method == NSTEP) || (search_method == NSTEP_8PT)) &&
+ !ms_params->ms_buffers.second_pred) {
int exhaustive_thr = ms_params->force_mesh_thresh;
exhaustive_thr >>=
10 - (mi_size_wide_log2[bsize] + mi_size_high_log2[bsize]);
@@ -2018,16 +2119,15 @@
}
if (xd->bd != 8) {
- unsigned int sad;
best_int_mv->as_fullmv = kZeroFullMv;
- sad = cpi->ppi->fn_ptr[bsize].sdf(x->plane[0].src.buf, src_stride,
- xd->plane[0].pre[0].buf, ref_stride);
+ best_sad = cpi->ppi->fn_ptr[bsize].sdf(x->plane[0].src.buf, src_stride,
+ xd->plane[0].pre[0].buf, ref_stride);
if (scaled_ref_frame) {
int i;
for (i = 0; i < MAX_MB_PLANE; i++) xd->plane[i].pre[0] = backup_yv12[i];
}
- return sad;
+ return best_sad;
}
// Set up prediction 1-D reference set
@@ -2055,6 +2155,19 @@
best_sad =
cpi->ppi->fn_ptr[bsize].sdf(src_buf, src_stride, ref_buf, ref_stride);
+ // Evaluate zero MV if found MV is non-zero.
+ if (best_int_mv->as_int != 0) {
+ tmp_sad = cpi->ppi->fn_ptr[bsize].sdf(x->plane[0].src.buf, src_stride,
+ xd->plane[0].pre[0].buf, ref_stride);
+
+ if (tmp_sad < best_sad) {
+ best_int_mv->as_fullmv = kZeroFullMv;
+ this_mv = best_int_mv->as_fullmv;
+ ref_buf = xd->plane[0].pre[0].buf;
+ best_sad = tmp_sad;
+ }
+ }
+
{
const uint8_t *const pos[4] = {
ref_buf - ref_stride,
@@ -3225,13 +3338,111 @@
}
// Refines MV in a small range
+
+// Macros to build bitmasks which help us avoid redundant computations
+//
+// To explain the idea here, imagine that on the first iteration of the
+// loop below, we step rightwards. Then, on the second iteration, the neighbors
+// to consider are:
+// . . .
+// 0 1 .
+// . . .
+// Where 0 is the initial search point, 1 is the best candidate found in the
+// first iteration, and the dots are the other neighbors of point 1.
+//
+// Naively, we would now need to scan all 8 neighbors of point 1 (point 0 and
+// the seven points marked with dots), and compare them to see where to move
+// next. However, we already evaluated 5 of those 8 neighbors in the last
+// iteration, and decided that they are worse than point 1. So we don't need
+// to re-consider these points. We only really need to consider the three
+// points which are adjacent to point 1 but *not* to point 0.
+//
+// As the algorithm goes on, there are other ways that redundant evaluations
+// can happen, if the search path curls back around on itself.
+//
+// To avoid all possible redundancies, we'd have to build a set containing
+// every point we have already checked, and this would be quite expensive.
+//
+// So instead, we apply a 95%-effective solution with a much lower overhead:
+// we prune out the points which were considered during the previous
+// iteration, but we don't worry about any prior iteration. This can be done
+// as follows:
+//
+// We build a static table, called neighbor_mask, which answers the question
+// "if we moved in direction X last time, which neighbors are new, and which
+// were scanned last iteration?"
+// Then we can query this table to quickly determine which points we need to
+// evaluate, and which we can skip.
+//
+// To query the table, the logic is simply:
+// neighbor_mask[i] & (1 << j) == "if we moved in direction i last iteration,
+// do we need to scan neighbor j this iteration?"
+#define NEIGHBOR_MASK_DIA(left, down, right, up) \
+ (left | (down << 1) | (right << 2) | (up << 3))
+
+#define NEIGHBOR_MASK_SQR(left, down, right, up, down_left, down_right, \
+ up_left, up_right) \
+ (left | (down << 1) | (right << 2) | (up << 3) | (down_left << 4) | \
+ (down_right << 5) | (up_left << 6) | (up_right << 7))
+
+static const warp_search_config warp_search_info[WARP_SEARCH_METHODS] = {
+ // WARP_SEARCH_DIAMOND
+ {
+ .num_neighbors = 4,
+ .neighbors = { { 0, -1 }, { 1, 0 }, { 0, 1 }, { -1, 0 } },
+ .neighbor_mask = {
+ // If we stepped left last time, consider all points except right
+ NEIGHBOR_MASK_DIA(1, 1, 0, 1),
+ // If we stepped down last time, consider all points except up
+ NEIGHBOR_MASK_DIA(1, 1, 1, 0),
+ // Stepped right last time
+ NEIGHBOR_MASK_DIA(0, 1, 1, 1),
+ // Stepped up last time
+ NEIGHBOR_MASK_DIA(1, 0, 1, 1),
+ },
+ },
+ // WARP_SEARCH_SQUARE
+ {
+ .num_neighbors = 8,
+ .neighbors = { { 0, -1 }, { 1, 0 }, { 0, 1 }, { -1, 0 },
+ { 1, -1 }, { 1, 1 }, { -1, -1 }, { -1, 1 } },
+ .neighbor_mask = {
+ // If we stepped left last time, then we only need to consider 3 points:
+ // left, down+left, up+left
+ NEIGHBOR_MASK_SQR(1, 0, 0, 0, 1, 0, 1, 0),
+ // If we stepped down last time, then we only need to consider 3 points:
+ // down, down+left, down+right
+ NEIGHBOR_MASK_SQR(0, 1, 0, 0, 1, 1, 0, 0),
+ // Stepped right last time
+ NEIGHBOR_MASK_SQR(0, 0, 1, 0, 0, 1, 0, 1),
+ // Stepped up last time
+ NEIGHBOR_MASK_SQR(0, 0, 0, 1, 0, 0, 1, 1),
+
+ // If we stepped down+left last time, then we need to consider 5 points:
+ // left, down, down+left, down+right, up+left
+ NEIGHBOR_MASK_SQR(1, 1, 0, 0, 1, 1, 1, 0),
+ // Stepped down+right last time
+ NEIGHBOR_MASK_SQR(0, 1, 1, 0, 1, 1, 0, 1),
+ // Stepped up+left last time
+ NEIGHBOR_MASK_SQR(1, 0, 0, 1, 1, 0, 1, 1),
+ // Stepped up+right last time
+ NEIGHBOR_MASK_SQR(0, 0, 1, 1, 0, 1, 1, 1),
+ },
+ },
+};
+
unsigned int av1_refine_warped_mv(MACROBLOCKD *xd, const AV1_COMMON *const cm,
const SUBPEL_MOTION_SEARCH_PARAMS *ms_params,
BLOCK_SIZE bsize, const int *pts0,
- const int *pts_inref0, int total_samples) {
+ const int *pts_inref0, int total_samples,
+ WARP_SEARCH_METHOD search_method,
+ int num_iterations) {
MB_MODE_INFO *mbmi = xd->mi[0];
- static const MV neighbors[8] = { { 0, -1 }, { 1, 0 }, { 0, 1 }, { -1, 0 },
- { 0, -2 }, { 2, 0 }, { 0, 2 }, { -2, 0 } };
+
+ const MV *neighbors = warp_search_info[search_method].neighbors;
+ const int num_neighbors = warp_search_info[search_method].num_neighbors;
+ const uint8_t *neighbor_mask = warp_search_info[search_method].neighbor_mask;
+
MV *best_mv = &mbmi->mv[0].as_mv;
WarpedMotionParams best_wm_params = mbmi->wm_params;
@@ -3239,7 +3450,7 @@
unsigned int bestmse;
const SubpelMvLimits *mv_limits = &ms_params->mv_limits;
- const int start = ms_params->allow_hp ? 0 : 4;
+ const int mv_shift = ms_params->allow_hp ? 0 : 1;
// Calculate the center position's error
assert(av1_is_subpelmv_in_range(mv_limits, *best_mv));
@@ -3249,14 +3460,22 @@
int pts[SAMPLES_ARRAY_SIZE], pts_inref[SAMPLES_ARRAY_SIZE];
const int mi_row = xd->mi_row;
const int mi_col = xd->mi_col;
- for (int ite = 0; ite < 2; ++ite) {
+
+ // First step always scans all neighbors
+ uint8_t valid_neighbors = UINT8_MAX;
+
+ for (int ite = 0; ite < num_iterations; ++ite) {
int best_idx = -1;
- for (int idx = start; idx < start + 4; ++idx) {
+ for (int idx = 0; idx < num_neighbors; ++idx) {
+ if ((valid_neighbors & (1 << idx)) == 0) {
+ continue;
+ }
+
unsigned int thismse;
- MV this_mv = { best_mv->row + neighbors[idx].row,
- best_mv->col + neighbors[idx].col };
+ MV this_mv = { best_mv->row + neighbors[idx].row * (1 << mv_shift),
+ best_mv->col + neighbors[idx].col * (1 << mv_shift) };
if (av1_is_subpelmv_in_range(mv_limits, this_mv)) {
memcpy(pts, pts0, total_samples * 2 * sizeof(*pts0));
memcpy(pts_inref, pts_inref0, total_samples * 2 * sizeof(*pts_inref0));
@@ -3283,8 +3502,9 @@
if (best_idx == -1) break;
if (best_idx >= 0) {
- best_mv->row += neighbors[best_idx].row;
- best_mv->col += neighbors[best_idx].col;
+ best_mv->row += neighbors[best_idx].row * (1 << mv_shift);
+ best_mv->col += neighbors[best_idx].col * (1 << mv_shift);
+ valid_neighbors = neighbor_mask[best_idx];
}
}
@@ -3292,6 +3512,7 @@
mbmi->num_proj_ref = best_num_proj_ref;
return bestmse;
}
+
#endif // !CONFIG_REALTIME_ONLY
// =============================================================================
// Subpixel Motion Search: OBMC
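Tying this file's pieces together, the neighbor_mask tables above are consumed
by a loop of the following shape inside av1_refine_warped_mv: each iteration
scans only the neighbors whose bit survives from the previous step's mask, and
a step in direction i replaces the mask with neighbor_mask[i]. A hedged sketch
with a hypothetical eval_error standing in for the warp-model refit and MSE
computation:

#include <stdint.h>

typedef struct { int row, col; } MV_SKETCH;

/* One refinement iteration. Returns the chosen direction, or -1 when no
 * neighbor improves on the current best (i.e. the search converged). */
static int refine_step(const MV_SKETCH *neighbors, int num_neighbors,
                       const uint8_t *neighbor_mask, uint8_t *valid,
                       MV_SKETCH *best_mv,
                       unsigned int (*eval_error)(const MV_SKETCH *),
                       unsigned int *best_err) {
  int best_idx = -1;
  for (int idx = 0; idx < num_neighbors; ++idx) {
    if ((*valid & (1 << idx)) == 0) continue; /* scanned last time */
    const MV_SKETCH cand = { best_mv->row + neighbors[idx].row,
                             best_mv->col + neighbors[idx].col };
    const unsigned int err = eval_error(&cand);
    if (err < *best_err) {
      *best_err = err;
      best_idx = idx;
    }
  }
  if (best_idx >= 0) {
    best_mv->row += neighbors[best_idx].row;
    best_mv->col += neighbors[best_idx].col;
    *valid = neighbor_mask[best_idx]; /* prune the next scan */
  }
  return best_idx;
}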
diff --git a/av1/encoder/mcomp.h b/av1/encoder/mcomp.h
index 1e8bbab..6b9af07 100644
--- a/av1/encoder/mcomp.h
+++ b/av1/encoder/mcomp.h
@@ -144,7 +144,7 @@
void av1_make_default_fullpel_ms_params(
FULLPEL_MOTION_SEARCH_PARAMS *ms_params, const struct AV1_COMP *cpi,
- MACROBLOCK *x, BLOCK_SIZE bsize, const MV *ref_mv,
+ MACROBLOCK *x, BLOCK_SIZE bsize, const MV *ref_mv, FULLPEL_MV start_mv,
const search_site_config search_sites[NUM_DISTINCT_SEARCH_METHODS],
int fine_search_interval);
@@ -176,14 +176,9 @@
typedef void (*av1_init_search_site_config)(search_site_config *cfg, int stride,
int level);
-/*! Array of function pointer used to set the motion search config. */
-static const av1_init_search_site_config
- av1_init_motion_compensation[NUM_DISTINCT_SEARCH_METHODS] = {
- av1_init_dsmotion_compensation, av1_init_motion_compensation_nstep,
- av1_init_motion_compensation_nstep, av1_init_dsmotion_compensation,
- av1_init_motion_compensation_hex, av1_init_motion_compensation_bigdia,
- av1_init_motion_compensation_square
- };
+/*! Array of function pointers used to set the motion search config. */
+extern const av1_init_search_site_config
+ av1_init_motion_compensation[NUM_DISTINCT_SEARCH_METHODS];
// Array indicating which search methods share the same candidates
// but differ in the number of search steps.
@@ -344,7 +339,9 @@
unsigned int av1_refine_warped_mv(MACROBLOCKD *xd, const AV1_COMMON *const cm,
const SUBPEL_MOTION_SEARCH_PARAMS *ms_params,
BLOCK_SIZE bsize, const int *pts0,
- const int *pts_inref0, int total_samples);
+ const int *pts_inref0, int total_samples,
+ WARP_SEARCH_METHOD search_method,
+ int num_iterations);
static INLINE void av1_set_fractional_mv(int_mv *fractional_best_mv) {
for (int z = 0; z < 3; z++) {
@@ -356,14 +353,13 @@
const FullMvLimits *mv_limits,
const MV *ref_mv) {
const int max_mv = GET_MV_SUBPEL(MAX_FULL_PEL_VAL);
- const int minc =
- AOMMAX(GET_MV_SUBPEL(mv_limits->col_min), ref_mv->col - max_mv);
- const int maxc =
- AOMMIN(GET_MV_SUBPEL(mv_limits->col_max), ref_mv->col + max_mv);
- const int minr =
- AOMMAX(GET_MV_SUBPEL(mv_limits->row_min), ref_mv->row - max_mv);
- const int maxr =
- AOMMIN(GET_MV_SUBPEL(mv_limits->row_max), ref_mv->row + max_mv);
+ int minc = AOMMAX(GET_MV_SUBPEL(mv_limits->col_min), ref_mv->col - max_mv);
+ int maxc = AOMMIN(GET_MV_SUBPEL(mv_limits->col_max), ref_mv->col + max_mv);
+ int minr = AOMMAX(GET_MV_SUBPEL(mv_limits->row_min), ref_mv->row - max_mv);
+ int maxr = AOMMIN(GET_MV_SUBPEL(mv_limits->row_max), ref_mv->row + max_mv);
+
+ maxc = AOMMAX(minc, maxc);
+ maxr = AOMMAX(minr, maxr);
subpel_limits->col_min = AOMMAX(MV_LOW + 1, minc);
subpel_limits->col_max = AOMMIN(MV_UPP - 1, maxc);
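Moving av1_init_motion_compensation out of the header is the usual cure for a
static const array defined in a header: every translation unit including it
otherwise carries its own private copy of the table, bloating the binary and
tripping unused-variable warnings in files that never reference it. The header
keeps a single extern declaration and exactly one .c file owns the definition:

/* mcomp.h - declaration only; no per-includer storage. */
extern const av1_init_search_site_config
    av1_init_motion_compensation[NUM_DISTINCT_SEARCH_METHODS];

/* mcomp.c - the single definition shared by the whole library. */
const av1_init_search_site_config
    av1_init_motion_compensation[NUM_DISTINCT_SEARCH_METHODS] = {
      av1_init_dsmotion_compensation,     av1_init_motion_compensation_nstep,
      av1_init_motion_compensation_nstep, av1_init_dsmotion_compensation,
      av1_init_motion_compensation_hex,   av1_init_motion_compensation_bigdia,
      av1_init_motion_compensation_square
    };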
diff --git a/av1/encoder/mcomp_structs.h b/av1/encoder/mcomp_structs.h
index 3fc1ab8..06660cf 100644
--- a/av1/encoder/mcomp_structs.h
+++ b/av1/encoder/mcomp_structs.h
@@ -22,6 +22,12 @@
#define MAX_FULL_PEL_VAL ((1 << (MAX_MVSEARCH_STEPS - 1)) - 1)
// Maximum size of the first step in full pel units
#define MAX_FIRST_STEP (1 << (MAX_MVSEARCH_STEPS - 1))
+// Maximum number of neighbors to scan per iteration during
+// WARPED_CAUSAL refinement.
+// Note: The elements of warp_search_config.neighbor_mask must be at least
+// MAX_WARP_SEARCH_NEIGHBORS bits wide, so the type may need to be
+// widened if this value is increased.
+#define MAX_WARP_SEARCH_NEIGHBORS 8
#define SEARCH_RANGE_8P 3
#define SEARCH_GRID_STRIDE_8P (2 * SEARCH_RANGE_8P + 1)
@@ -82,4 +88,22 @@
NUM_DISTINCT_SEARCH_METHODS = SQUARE + 1,
} UENUM1BYTE(SEARCH_METHODS);
+typedef struct warp_search_config {
+ int num_neighbors;
+ MV neighbors[MAX_WARP_SEARCH_NEIGHBORS];
+ // Bitmask which is used to prune the search neighbors at one iteration
+ // based on which direction we chose in the previous iteration.
+ // See comments in av1_refine_warped_mv for details.
+ uint8_t neighbor_mask[MAX_WARP_SEARCH_NEIGHBORS];
+} warp_search_config;
+
+// Methods for refining WARPED_CAUSAL motion vectors
+enum {
+ // Search 4 adjacent points in a diamond shape at each iteration
+ WARP_SEARCH_DIAMOND,
+ // Search 8 adjacent points in a square at each iteration
+ WARP_SEARCH_SQUARE,
+ WARP_SEARCH_METHODS
+} UENUM1BYTE(WARP_SEARCH_METHOD);
+
#endif // AOM_AV1_ENCODER_MCOMP_STRUCTS_H_
diff --git a/av1/encoder/ml.c b/av1/encoder/ml.c
index 5078fb1..94cd56c 100644
--- a/av1/encoder/ml.c
+++ b/av1/encoder/ml.c
@@ -13,6 +13,7 @@
#include <math.h>
#include "aom_dsp/aom_dsp_common.h"
+#include "aom_dsp/mathutils.h"
#include "av1/encoder/ml.h"
void av1_nn_output_prec_reduce(float *const output, int num_output) {
@@ -155,22 +156,6 @@
for (int i = 0; i < n; i++) output[i] /= sum_out;
}
-static AOM_INLINE float approx_exp(float y) {
-#define A ((1 << 23) / 0.69314718056f) // (1 << 23) / ln(2)
-#define B \
- 127 // Offset for the exponent according to IEEE floating point standard.
-#define C 60801 // Magic number controls the accuracy of approximation
- union {
- float as_float;
- int32_t as_int32;
- } container;
- container.as_int32 = ((int32_t)(y * A)) + ((B << 23) - C);
- return container.as_float;
-#undef A
-#undef B
-#undef C
-}
-
void av1_nn_fast_softmax_16_c(const float *input, float *output) {
const int kNumClasses = 16;
float max_input = input[0];
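The deleted approx_exp is not gone; per the new include it now lives in
aom_dsp/mathutils.h so other callers can share it. The trick itself appears to
be Schraudolph's classic exponential approximation: writing y*A + (B << 23) - C
straight into the bit pattern of an IEEE-754 float lands y/ln(2) in the
exponent field, so the float's value approximates e^y; C is an empirical bias
that tunes the accuracy. A sketch comparing it against expf:

#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Schraudolph-style exp approximation, mirroring the deleted helper.
 * A = (1 << 23) / ln(2) scales y into the exponent at bit 23;
 * B = 127 is the single-precision exponent bias;
 * C = 60801 is an empirical correction controlling the accuracy. */
static float approx_exp_sketch(float y) {
  union {
    float as_float;
    int32_t as_int32;
  } u;
  u.as_int32 = (int32_t)(y * ((1 << 23) / 0.69314718056f)) +
               ((127 << 23) - 60801);
  return u.as_float;
}

int main(void) {
  for (float y = -2.0f; y <= 2.0f; y += 1.0f)
    printf("y=%+.0f approx=%f expf=%f\n", y, approx_exp_sketch(y), expf(y));
  return 0;
}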
diff --git a/av1/encoder/motion_search_facade.c b/av1/encoder/motion_search_facade.c
index 30e1b73..b771b05 100644
--- a/av1/encoder/motion_search_facade.c
+++ b/av1/encoder/motion_search_facade.c
@@ -192,23 +192,42 @@
// Check difference between mvs in the stack and candidate mv.
for (int stack_idx = 0; stack_idx < stack_size; stack_idx++) {
- FULLPEL_MV *fmv_stack = &args->start_mv_stack[stack_idx];
- const int row = abs(fmv_stack->row - fmv_cand->as_fullmv.row);
- const int col = abs(fmv_stack->col - fmv_cand->as_fullmv.col);
+ const uint8_t this_ref_mv_idx = args->ref_mv_idx_stack[stack_idx];
+ const FULLPEL_MV *fmv_stack = &args->start_mv_stack[stack_idx];
+ const int this_newmv_valid =
+ args->single_newmv_valid[this_ref_mv_idx][ref];
+ const int row_diff = abs(fmv_stack->row - fmv_cand->as_fullmv.row);
+ const int col_diff = abs(fmv_stack->col - fmv_cand->as_fullmv.col);
- if (row <= 1 && col <= 1) {
- skip_cand_mv = 1;
- break;
+ if (!this_newmv_valid) continue;
+
+ if (cpi->sf.mv_sf.skip_fullpel_search_using_startmv >= 2) {
+ // Prunes the current start_mv candidate, if the absolute mv
+ // difference of both row and column are <= 1.
+ if (row_diff <= 1 && col_diff <= 1) {
+ skip_cand_mv = 1;
+ break;
+ }
+ } else if (cpi->sf.mv_sf.skip_fullpel_search_using_startmv >= 1) {
+ // Prunes the current start_mv candidate, if the sum of the absolute
+ // mv difference of row and column is <= 1.
+ if (row_diff + col_diff <= 1) {
+ skip_cand_mv = 1;
+ break;
+ }
}
}
if (skip_cand_mv) {
+ // Ensure at least one full-pel motion search is not pruned.
+ assert(mbmi->ref_mv_idx != 0);
// Mark the candidate mv as invalid so that motion search gets skipped.
cand[cand_idx].fmv.as_int = INVALID_MV;
} else {
- // Store start mv candidate of full-pel search in the mv stack (except
- // last ref_mv_idx).
+ // Store start_mv candidate and corresponding ref_mv_idx of full-pel
+ // search in the mv stack (except last ref_mv_idx).
if (mbmi->ref_mv_idx != MAX_REF_MV_SEARCH - 1) {
args->start_mv_stack[args->start_mv_cnt] = fmv_cand->as_fullmv;
+ args->ref_mv_idx_stack[args->start_mv_cnt] = mbmi->ref_mv_idx;
args->start_mv_cnt++;
assert(args->start_mv_cnt <= (MAX_REF_MV_SEARCH - 1) * 2);
}
@@ -246,8 +265,6 @@
// Allow more mesh searches for screen content type on the ARF.
const int fine_search_interval = use_fine_search_interval(cpi);
FULLPEL_MOTION_SEARCH_PARAMS full_ms_params;
- av1_make_default_fullpel_ms_params(&full_ms_params, cpi, x, bsize, &ref_mv,
- src_search_site_cfg, fine_search_interval);
switch (mbmi->motion_mode) {
case SIMPLE_TRANSLATION: {
@@ -259,7 +276,11 @@
if (smv.as_int == INVALID_MV) continue;
- int thissme =
+ av1_make_default_fullpel_ms_params(
+ &full_ms_params, cpi, x, bsize, &ref_mv, smv.as_fullmv,
+ src_search_site_cfg, fine_search_interval);
+
+ const int thissme =
av1_full_pixel_search(smv.as_fullmv, &full_ms_params, step_param,
cond_cost_list(cpi, cost_list), &this_best_mv,
&this_second_best_mv);
@@ -275,6 +296,10 @@
}
} break;
case OBMC_CAUSAL:
+ av1_make_default_fullpel_ms_params(&full_ms_params, cpi, x, bsize,
+ &ref_mv, start_mv, src_search_site_cfg,
+ fine_search_interval);
+
bestsme = av1_obmc_full_pixel_search(start_mv, &full_ms_params,
step_param, &best_mv->as_fullmv);
break;
@@ -496,7 +521,7 @@
int av1_joint_motion_search(const AV1_COMP *cpi, MACROBLOCK *x,
BLOCK_SIZE bsize, int_mv *cur_mv,
const uint8_t *mask, int mask_stride, int *rate_mv,
- int allow_second_mv) {
+ int allow_second_mv, int joint_me_num_refine_iter) {
const AV1_COMMON *const cm = &cpi->common;
const int num_planes = av1_num_planes(cm);
const int pw = block_size_wide[bsize];
@@ -536,7 +561,7 @@
// Allow joint search multiple times iteratively for each reference frame
// and break out of the search loop if it couldn't find a better mv.
- for (ite = 0; ite < 4; ite++) {
+ for (ite = 0; ite < (2 * joint_me_num_refine_iter); ite++) {
struct buf_2d ref_yv12[2];
int bestsme = INT_MAX;
int id = ite % 2; // Even iterations search in the first reference frame,
@@ -599,16 +624,16 @@
const SEARCH_METHODS search_method = cpi->sf.mv_sf.search_method;
const search_site_config *src_search_sites =
av1_get_search_site_config(cpi, x, search_method);
+ // Use the mv result from the single mode as mv predictor.
+ const FULLPEL_MV start_fullmv = get_fullmv_from_mv(&cur_mv[id].as_mv);
av1_make_default_fullpel_ms_params(&full_ms_params, cpi, x, bsize,
- &ref_mv[id].as_mv, src_search_sites,
+ &ref_mv[id].as_mv, start_fullmv,
+ src_search_sites,
/*fine_search_interval=*/0);
av1_set_ms_compound_refs(&full_ms_params.ms_buffers, second_pred, mask,
mask_stride, id);
- // Use the mv result from the single mode as mv predictor.
- const FULLPEL_MV start_fullmv = get_fullmv_from_mv(&cur_mv[id].as_mv);
-
// Small-range full-pixel motion search.
if (!cpi->sf.mv_sf.disable_extensive_joint_motion_search &&
mbmi->interinter_comp.type != COMPOUND_WEDGE) {
@@ -737,7 +762,11 @@
}
const int mi_row = xd->mi_row;
const int mi_col = xd->mi_col;
- av1_setup_pre_planes(xd, ref_idx, scaled_ref_frame, mi_row, mi_col, NULL,
+ // The index below needs to be 0 instead of ref_idx since we assume the
+ // 0th slot to be used for subsequent searches. Note that the ref_idx
+ // reference buffer has been copied to the 0th slot in the code above.
+ // Now we need to swap the reference frame for the 0th slot.
+ av1_setup_pre_planes(xd, 0, scaled_ref_frame, mi_row, mi_col, NULL,
num_planes);
}
@@ -749,24 +778,24 @@
const SEARCH_METHODS search_method = cpi->sf.mv_sf.search_method;
const search_site_config *src_search_sites =
av1_get_search_site_config(cpi, x, search_method);
+ // Use the mv result from the single mode as mv predictor.
+ const FULLPEL_MV start_fullmv = get_fullmv_from_mv(this_mv);
av1_make_default_fullpel_ms_params(&full_ms_params, cpi, x, bsize,
- &ref_mv.as_mv, src_search_sites,
+ &ref_mv.as_mv, start_fullmv,
+ src_search_sites,
/*fine_search_interval=*/0);
av1_set_ms_compound_refs(&full_ms_params.ms_buffers, second_pred, mask,
mask_stride, ref_idx);
- // Use the mv result from the single mode as mv predictor.
- const FULLPEL_MV start_fullmv = get_fullmv_from_mv(this_mv);
-
// Small-range full-pixel motion search.
bestsme = av1_full_pixel_search(start_fullmv, &full_ms_params, 5, NULL,
&best_mv.as_fullmv, NULL);
if (scaled_ref_frame) {
- // Swap back the original buffers for subpel motion search.
+ // Swap back the original buffers of the 0th slot for subpel motion search.
for (int i = 0; i < num_planes; i++) {
- xd->plane[i].pre[ref_idx] = backup_yv12[i];
+ xd->plane[i].pre[0] = backup_yv12[i];
}
}
@@ -883,8 +912,13 @@
av1_compound_single_motion_search_interinter(cpi, x, bsize, tmp_mv, mask,
mask_stride, rate_mv, which);
} else if (which == 2) {
+ const int joint_me_num_refine_iter =
+ cpi->sf.inter_sf.enable_fast_compound_mode_search == 2
+ ? REDUCED_JOINT_ME_REFINE_ITER
+ : NUM_JOINT_ME_REFINE_ITER;
av1_joint_motion_search(cpi, x, bsize, tmp_mv, mask, mask_stride, rate_mv,
- !cpi->sf.mv_sf.disable_second_mv);
+ !cpi->sf.mv_sf.disable_second_mv,
+ joint_me_num_refine_iter);
}
}
@@ -971,7 +1005,8 @@
const search_site_config *src_search_sites =
av1_get_search_site_config(cpi, x, search_method);
av1_make_default_fullpel_ms_params(&full_ms_params, cpi, x, bsize, &ref_mv,
- src_search_sites, fine_search_interval);
+ start_mv, src_search_sites,
+ fine_search_interval);
var = av1_full_pixel_search(start_mv, &full_ms_params, step_param,
cond_cost_list(cpi, cost_list),
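With the new parameter, the joint search loop runs 2 * joint_me_num_refine_iter iterations: NUM_JOINT_ME_REFINE_ITER (2) preserves the previous fixed bound of four, while REDUCED_JOINT_ME_REFINE_ITER (1) halves the work on the fast compound search path. A sketch of the alternation, with the per-reference refinement stubbed out:

/* Stub for the per-reference refinement; in the real code this is the
 * small-range full-pel plus subpel search for one reference. Returns
 * nonzero if the MV for that reference improved. */
static int refine_one_ref(int ref_id) {
  (void)ref_id;
  return 0;
}

/* Even iterations refine the first reference, odd iterations the second;
 * the caller-supplied iteration budget bounds the total work. */
static void joint_refine(int joint_me_num_refine_iter) {
  for (int ite = 0; ite < 2 * joint_me_num_refine_iter; ++ite) {
    const int id = ite % 2;
    if (!refine_one_ref(id)) break; /* Stop once a pass fails to improve. */
  }
}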
diff --git a/av1/encoder/motion_search_facade.h b/av1/encoder/motion_search_facade.h
index 4d76287..d2996bc 100644
--- a/av1/encoder/motion_search_facade.h
+++ b/av1/encoder/motion_search_facade.h
@@ -18,6 +18,8 @@
extern "C" {
#endif
+#define NUM_JOINT_ME_REFINE_ITER 2
+#define REDUCED_JOINT_ME_REFINE_ITER 1
// TODO(any): rename this struct to something else. There is already another
// struct called inter_modes_info, which makes this terribly confusing.
typedef struct {
@@ -38,7 +40,7 @@
int av1_joint_motion_search(const AV1_COMP *cpi, MACROBLOCK *x,
BLOCK_SIZE bsize, int_mv *cur_mv,
const uint8_t *mask, int mask_stride, int *rate_mv,
- int allow_second_mv);
+ int allow_second_mv, int joint_me_num_refine_iter);
int av1_interinter_compound_motion_search(const AV1_COMP *const cpi,
MACROBLOCK *x,
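The start-MV pruning hunk earlier in motion_search_facade.c distinguishes two levels of skip_fullpel_search_using_startmv: level 2 prunes a candidate whose row and column each differ by at most 1 from an already-searched start MV (Chebyshev distance <= 1), while level 1 prunes only when the sum of the differences is at most 1 (Manhattan distance <= 1). A hedged sketch of that decision, with a stand-in MV type:

#include <stdlib.h>

typedef struct { int row, col; } FULLPEL_MV_SKETCH; /* Stand-in type. */

/* Decide whether a candidate start MV duplicates a previously searched one,
 * under the two pruning levels described above. */
static int prune_start_mv(const FULLPEL_MV_SKETCH *prev,
                          const FULLPEL_MV_SKETCH *cand, int prune_level) {
  const int row_diff = abs(prev->row - cand->row);
  const int col_diff = abs(prev->col - cand->col);
  if (prune_level >= 2) {
    /* Aggressive: prune anywhere in the 3x3 neighborhood of prev. */
    return row_diff <= 1 && col_diff <= 1;
  } else if (prune_level >= 1) {
    /* Conservative: prune only immediate horizontal/vertical neighbors. */
    return row_diff + col_diff <= 1;
  }
  return 0;
}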
diff --git a/av1/encoder/nonrd_opt.c b/av1/encoder/nonrd_opt.c
new file mode 100644
index 0000000..651ca43
--- /dev/null
+++ b/av1/encoder/nonrd_opt.c
@@ -0,0 +1,933 @@
+/*
+ * Copyright (c) 2023, Alliance for Open Media. All rights reserved
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#include "config/aom_dsp_rtcd.h"
+
+#include "av1/common/reconinter.h"
+
+#include "av1/encoder/encodemv.h"
+#include "av1/encoder/nonrd_opt.h"
+#include "av1/encoder/rdopt.h"
+
+static const SCAN_ORDER av1_fast_idtx_scan_order_16x16 = {
+ av1_fast_idtx_scan_16x16, av1_fast_idtx_iscan_16x16
+};
+
+#define DECLARE_BLOCK_YRD_BUFFERS() \
+ DECLARE_ALIGNED(64, tran_low_t, dqcoeff_buf[16 * 16]); \
+ DECLARE_ALIGNED(64, tran_low_t, qcoeff_buf[16 * 16]); \
+ DECLARE_ALIGNED(64, tran_low_t, coeff_buf[16 * 16]); \
+ uint16_t eob[1];
+
+#define DECLARE_BLOCK_YRD_VARS() \
+ /* When is_tx_8x8_dual_applicable is true, we compute the txfm for the \
+ * entire bsize and write macroblock_plane::coeff. So low_coeff is kept \
+ * as a non-const so we can reassign it to macroblock_plane::coeff. */ \
+ int16_t *low_coeff = (int16_t *)coeff_buf; \
+ int16_t *const low_qcoeff = (int16_t *)qcoeff_buf; \
+ int16_t *const low_dqcoeff = (int16_t *)dqcoeff_buf; \
+ const int diff_stride = bw;
+
+#define DECLARE_LOOP_VARS_BLOCK_YRD() \
+ const int16_t *src_diff = &p->src_diff[(r * diff_stride + c) << 2];
+
+static AOM_FORCE_INLINE void update_yrd_loop_vars(
+ MACROBLOCK *x, int *skippable, int step, int ncoeffs,
+ int16_t *const low_coeff, int16_t *const low_qcoeff,
+ int16_t *const low_dqcoeff, RD_STATS *this_rdc, int *eob_cost,
+ int tx_blk_id) {
+ const int is_txfm_skip = (ncoeffs == 0);
+ *skippable &= is_txfm_skip;
+ x->txfm_search_info.blk_skip[tx_blk_id] = is_txfm_skip;
+ *eob_cost += get_msb(ncoeffs + 1);
+ if (ncoeffs == 1)
+ this_rdc->rate += (int)abs(low_qcoeff[0]);
+ else if (ncoeffs > 1)
+ this_rdc->rate += aom_satd_lp(low_qcoeff, step << 4);
+
+ this_rdc->dist += av1_block_error_lp(low_coeff, low_dqcoeff, step << 4) >> 2;
+}
+
+static INLINE void aom_process_hadamard_lp_8x16(MACROBLOCK *x,
+ int max_blocks_high,
+ int max_blocks_wide,
+ int num_4x4_w, int step,
+ int block_step) {
+ struct macroblock_plane *const p = &x->plane[AOM_PLANE_Y];
+ const int bw = 4 * num_4x4_w;
+ const int num_4x4 = AOMMIN(num_4x4_w, max_blocks_wide);
+ int block = 0;
+
+ for (int r = 0; r < max_blocks_high; r += block_step) {
+ for (int c = 0; c < num_4x4; c += 2 * block_step) {
+ const int16_t *src_diff = &p->src_diff[(r * bw + c) << 2];
+ int16_t *low_coeff = (int16_t *)p->coeff + BLOCK_OFFSET(block);
+ aom_hadamard_lp_8x8_dual(src_diff, (ptrdiff_t)bw, low_coeff);
+ block += 2 * step;
+ }
+ }
+}
+
+#if CONFIG_AV1_HIGHBITDEPTH
+#define DECLARE_BLOCK_YRD_HBD_VARS() \
+ tran_low_t *const coeff = coeff_buf; \
+ tran_low_t *const qcoeff = qcoeff_buf; \
+ tran_low_t *const dqcoeff = dqcoeff_buf;
+
+static AOM_FORCE_INLINE void update_yrd_loop_vars_hbd(
+ MACROBLOCK *x, int *skippable, int step, int ncoeffs,
+ tran_low_t *const coeff, tran_low_t *const qcoeff,
+ tran_low_t *const dqcoeff, RD_STATS *this_rdc, int *eob_cost,
+ int tx_blk_id) {
+ const MACROBLOCKD *xd = &x->e_mbd;
+ const int is_txfm_skip = (ncoeffs == 0);
+ *skippable &= is_txfm_skip;
+ x->txfm_search_info.blk_skip[tx_blk_id] = is_txfm_skip;
+ *eob_cost += get_msb(ncoeffs + 1);
+
+ int64_t dummy;
+ if (ncoeffs == 1)
+ this_rdc->rate += (int)abs(qcoeff[0]);
+ else if (ncoeffs > 1)
+ this_rdc->rate += aom_satd(qcoeff, step << 4);
+ this_rdc->dist +=
+ av1_highbd_block_error(coeff, dqcoeff, step << 4, &dummy, xd->bd) >> 2;
+}
+#endif
+
+/*!\brief Calculates RD Cost using Hadamard transform.
+ *
+ * \ingroup nonrd_mode_search
+ * \callgraph
+ * \callergraph
+ * Calculates RD Cost using Hadamard transform. For low bit depth this function
+ * uses a low-precision set of functions (16-bit), and 32-bit for high bit depth.
+ * \param[in] x Pointer to structure holding all the data for
+ the current macroblock
+ * \param[in] this_rdc Pointer to calculated RD Cost
+ * \param[in] skippable Pointer to a flag indicating possible tx skip
+ * \param[in] bsize Current block size
+ * \param[in] tx_size Transform size
+ * \param[in] is_inter_mode Flag to indicate inter mode
+ *
+ * \remark Nothing is returned. Instead, the calculated RD cost is placed in
+ * \c this_rdc. The \c skippable flag is set if there are no non-zero quantized
+ * coefficients for the Hadamard transform.
+ */
+void av1_block_yrd(MACROBLOCK *x, RD_STATS *this_rdc, int *skippable,
+ BLOCK_SIZE bsize, TX_SIZE tx_size) {
+ MACROBLOCKD *xd = &x->e_mbd;
+ const struct macroblockd_plane *pd = &xd->plane[AOM_PLANE_Y];
+ struct macroblock_plane *const p = &x->plane[AOM_PLANE_Y];
+ assert(bsize < BLOCK_SIZES_ALL);
+ const int num_4x4_w = mi_size_wide[bsize];
+ const int num_4x4_h = mi_size_high[bsize];
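+ // Number of 4x4 sub-blocks covered by one transform block (1, 4 or 16).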
+ const int step = 1 << (tx_size << 1);
+ const int block_step = (1 << tx_size);
+ const int row_step = step * num_4x4_w >> tx_size;
+ int block = 0;
+ const int max_blocks_wide =
+ num_4x4_w + (xd->mb_to_right_edge >= 0 ? 0 : xd->mb_to_right_edge >> 5);
+ const int max_blocks_high =
+ num_4x4_h + (xd->mb_to_bottom_edge >= 0 ? 0 : xd->mb_to_bottom_edge >> 5);
+ int eob_cost = 0;
+ const int bw = 4 * num_4x4_w;
+ const int bh = 4 * num_4x4_h;
+ const int use_hbd = is_cur_buf_hbd(xd);
+ int num_blk_skip_w = num_4x4_w;
+
+#if CONFIG_AV1_HIGHBITDEPTH
+ if (use_hbd) {
+ aom_highbd_subtract_block(bh, bw, p->src_diff, bw, p->src.buf,
+ p->src.stride, pd->dst.buf, pd->dst.stride);
+ } else {
+ aom_subtract_block(bh, bw, p->src_diff, bw, p->src.buf, p->src.stride,
+ pd->dst.buf, pd->dst.stride);
+ }
+#else
+ aom_subtract_block(bh, bw, p->src_diff, bw, p->src.buf, p->src.stride,
+ pd->dst.buf, pd->dst.stride);
+#endif
+
+ // Keep the intermediate value on the stack here. Writing directly to
+ // skippable causes speed regression due to load-and-store issues in
+ // update_yrd_loop_vars.
+ int temp_skippable = 1;
+ this_rdc->dist = 0;
+ this_rdc->rate = 0;
+ // For block sizes 8x16 or above, the Hadamard txfm of two adjacent 8x8
+ // blocks can be done in one function call. Hence the Hadamard txfm call is
+ // abstracted here for those cases.
+ int is_tx_8x8_dual_applicable =
+ (tx_size == TX_8X8 && block_size_wide[bsize] >= 16 &&
+ block_size_high[bsize] >= 8);
+
+#if CONFIG_AV1_HIGHBITDEPTH
+ // As of now, the dual implementation of the Hadamard txfm is only available
+ // for low bitdepth.
+ if (use_hbd) is_tx_8x8_dual_applicable = 0;
+#endif
+
+ if (is_tx_8x8_dual_applicable) {
+ aom_process_hadamard_lp_8x16(x, max_blocks_high, max_blocks_wide, num_4x4_w,
+ step, block_step);
+ }
+
+ const SCAN_ORDER *const scan_order = &av1_scan_orders[tx_size][DCT_DCT];
+ DECLARE_BLOCK_YRD_BUFFERS()
+ DECLARE_BLOCK_YRD_VARS()
+#if CONFIG_AV1_HIGHBITDEPTH
+ DECLARE_BLOCK_YRD_HBD_VARS()
+#else
+ (void)use_hbd;
+#endif
+
+ // Keep track of the row and column of the blocks we use so that we know
+ // if we are in the unrestricted motion border.
+ for (int r = 0; r < max_blocks_high; r += block_step) {
+ for (int c = 0, s = 0; c < max_blocks_wide; c += block_step, s += step) {
+ DECLARE_LOOP_VARS_BLOCK_YRD()
+
+ switch (tx_size) {
+#if CONFIG_AV1_HIGHBITDEPTH
+ case TX_16X16:
+ if (use_hbd) {
+ aom_hadamard_16x16(src_diff, diff_stride, coeff);
+ av1_quantize_fp(coeff, 16 * 16, p->zbin_QTX, p->round_fp_QTX,
+ p->quant_fp_QTX, p->quant_shift_QTX, qcoeff,
+ dqcoeff, p->dequant_QTX, eob,
+ // default_scan_fp_16x16_transpose and
+ // av1_default_iscan_fp_16x16_transpose have to be
+ // used together.
+ default_scan_fp_16x16_transpose,
+ av1_default_iscan_fp_16x16_transpose);
+ } else {
+ aom_hadamard_lp_16x16(src_diff, diff_stride, low_coeff);
+ av1_quantize_lp(low_coeff, 16 * 16, p->round_fp_QTX,
+ p->quant_fp_QTX, low_qcoeff, low_dqcoeff,
+ p->dequant_QTX, eob,
+ // default_scan_lp_16x16_transpose and
+ // av1_default_iscan_lp_16x16_transpose have to be
+ // used together.
+ default_scan_lp_16x16_transpose,
+ av1_default_iscan_lp_16x16_transpose);
+ }
+ break;
+ case TX_8X8:
+ if (use_hbd) {
+ aom_hadamard_8x8(src_diff, diff_stride, coeff);
+ av1_quantize_fp(
+ coeff, 8 * 8, p->zbin_QTX, p->round_fp_QTX, p->quant_fp_QTX,
+ p->quant_shift_QTX, qcoeff, dqcoeff, p->dequant_QTX, eob,
+ default_scan_8x8_transpose, av1_default_iscan_8x8_transpose);
+ } else {
+ if (is_tx_8x8_dual_applicable) {
+ // The coeffs are pre-computed for the whole block, so re-assign
+ // low_coeff to the appropriate location.
+ const int block_offset = BLOCK_OFFSET(block + s);
+ low_coeff = (int16_t *)p->coeff + block_offset;
+ } else {
+ aom_hadamard_lp_8x8(src_diff, diff_stride, low_coeff);
+ }
+ av1_quantize_lp(
+ low_coeff, 8 * 8, p->round_fp_QTX, p->quant_fp_QTX, low_qcoeff,
+ low_dqcoeff, p->dequant_QTX, eob,
+ // default_scan_8x8_transpose and
+ // av1_default_iscan_8x8_transpose have to be used together.
+ default_scan_8x8_transpose, av1_default_iscan_8x8_transpose);
+ }
+ break;
+ default:
+ assert(tx_size == TX_4X4);
+ // In the tx_size=4x4 case, aom_fdct4x4 and aom_fdct4x4_lp generate the
+ // normal coefficient order, so we don't need to change the scan
+ // order here.
+ if (use_hbd) {
+ aom_fdct4x4(src_diff, coeff, diff_stride);
+ av1_quantize_fp(coeff, 4 * 4, p->zbin_QTX, p->round_fp_QTX,
+ p->quant_fp_QTX, p->quant_shift_QTX, qcoeff,
+ dqcoeff, p->dequant_QTX, eob, scan_order->scan,
+ scan_order->iscan);
+ } else {
+ aom_fdct4x4_lp(src_diff, low_coeff, diff_stride);
+ av1_quantize_lp(low_coeff, 4 * 4, p->round_fp_QTX, p->quant_fp_QTX,
+ low_qcoeff, low_dqcoeff, p->dequant_QTX, eob,
+ scan_order->scan, scan_order->iscan);
+ }
+ break;
+#else
+ case TX_16X16:
+ aom_hadamard_lp_16x16(src_diff, diff_stride, low_coeff);
+ av1_quantize_lp(low_coeff, 16 * 16, p->round_fp_QTX, p->quant_fp_QTX,
+ low_qcoeff, low_dqcoeff, p->dequant_QTX, eob,
+ default_scan_lp_16x16_transpose,
+ av1_default_iscan_lp_16x16_transpose);
+ break;
+ case TX_8X8:
+ if (is_tx_8x8_dual_applicable) {
+ // The coeffs are pre-computed for the whole block, so re-assign
+ // low_coeff to the appropriate location.
+ const int block_offset = BLOCK_OFFSET(block + s);
+ low_coeff = (int16_t *)p->coeff + block_offset;
+ } else {
+ aom_hadamard_lp_8x8(src_diff, diff_stride, low_coeff);
+ }
+ av1_quantize_lp(low_coeff, 8 * 8, p->round_fp_QTX, p->quant_fp_QTX,
+ low_qcoeff, low_dqcoeff, p->dequant_QTX, eob,
+ default_scan_8x8_transpose,
+ av1_default_iscan_8x8_transpose);
+ break;
+ default:
+ aom_fdct4x4_lp(src_diff, low_coeff, diff_stride);
+ av1_quantize_lp(low_coeff, 4 * 4, p->round_fp_QTX, p->quant_fp_QTX,
+ low_qcoeff, low_dqcoeff, p->dequant_QTX, eob,
+ scan_order->scan, scan_order->iscan);
+ break;
+#endif
+ }
+ assert(*eob <= 1024);
+#if CONFIG_AV1_HIGHBITDEPTH
+ if (use_hbd)
+ update_yrd_loop_vars_hbd(x, &temp_skippable, step, *eob, coeff, qcoeff,
+ dqcoeff, this_rdc, &eob_cost,
+ r * num_blk_skip_w + c);
+ else
+#endif
+ update_yrd_loop_vars(x, &temp_skippable, step, *eob, low_coeff,
+ low_qcoeff, low_dqcoeff, this_rdc, &eob_cost,
+ r * num_blk_skip_w + c);
+ }
+ block += row_step;
+ }
+
+ this_rdc->skip_txfm = *skippable = temp_skippable;
+ if (this_rdc->sse < INT64_MAX) {
+ this_rdc->sse = (this_rdc->sse << 6) >> 2;
+ if (temp_skippable) {
+ this_rdc->dist = this_rdc->sse;
+ return;
+ }
+ }
+
+ // If skippable is set, rate gets clobbered later.
+ this_rdc->rate <<= (2 + AV1_PROB_COST_SHIFT);
+ this_rdc->rate += (eob_cost << AV1_PROB_COST_SHIFT);
+}
+
+// Explicitly enumerate the cases so the compiler can generate SIMD for the
+// function. According to the disassembler, gcc generates SSE code for each of
+// the possible block sizes. The hottest case is tx_width 16, which takes up
+// about 8% of the self cycles of av1_nonrd_pick_inter_mode_sb. Since
+// av1_nonrd_pick_inter_mode_sb takes up about 3% of total encoding time, the
+// potential gain from an AVX2 optimization is only 3% * 8% = 0.24% of total
+// encoding time.
+static AOM_INLINE void scale_square_buf_vals(int16_t *dst, int tx_width,
+ const int16_t *src,
+ int src_stride) {
+#define DO_SCALING \
+ do { \
+ for (int idy = 0; idy < tx_width; ++idy) { \
+ for (int idx = 0; idx < tx_width; ++idx) { \
+ dst[idy * tx_width + idx] = src[idy * src_stride + idx] * 8; \
+ } \
+ } \
+ } while (0)
+
+ if (tx_width == 4) {
+ DO_SCALING;
+ } else if (tx_width == 8) {
+ DO_SCALING;
+ } else if (tx_width == 16) {
+ DO_SCALING;
+ } else {
+ assert(0);
+ }
+
+#undef DO_SCALING
+}
+
+/*!\brief Calculates RD Cost when the block uses Identity transform.
+ * Note that this function is only for low bit depth encoding, since it
+ * is called in real-time mode for now, which sets high bit depth to 0:
+ * -DCONFIG_AV1_HIGHBITDEPTH=0
+ *
+ * \ingroup nonrd_mode_search
+ * \callgraph
+ * \callergraph
+ * Calculates RD Cost. For low bit depth this function uses a low-precision
+ * set of functions (16-bit), and 32-bit for high bit depth.
+ * \param[in] x Pointer to structure holding all the data for
+ the current macroblock
+ * \param[in] pred_buf Pointer to the prediction buffer
+ * \param[in] pred_stride Stride for the prediction buffer
+ * \param[in] this_rdc Pointer to calculated RD Cost
+ * \param[in] skippable Pointer to a flag indicating possible tx skip
+ * \param[in] bsize Current block size
+ * \param[in] tx_size Transform size
+ *
+ * \remark Nothing is returned. Instead, calculated RD cost is placed to
+ * \c this_rdc. \c skippable flag is set if all coefficients are zero.
+ */
+void av1_block_yrd_idtx(MACROBLOCK *x, const uint8_t *const pred_buf,
+ int pred_stride, RD_STATS *this_rdc, int *skippable,
+ BLOCK_SIZE bsize, TX_SIZE tx_size) {
+ MACROBLOCKD *xd = &x->e_mbd;
+ struct macroblock_plane *const p = &x->plane[AOM_PLANE_Y];
+ assert(bsize < BLOCK_SIZES_ALL);
+ const int num_4x4_w = mi_size_wide[bsize];
+ const int num_4x4_h = mi_size_high[bsize];
+ const int step = 1 << (tx_size << 1);
+ const int block_step = (1 << tx_size);
+ const int max_blocks_wide =
+ num_4x4_w + (xd->mb_to_right_edge >= 0 ? 0 : xd->mb_to_right_edge >> 5);
+ const int max_blocks_high =
+ num_4x4_h + (xd->mb_to_bottom_edge >= 0 ? 0 : xd->mb_to_bottom_edge >> 5);
+ int eob_cost = 0;
+ const int bw = 4 * num_4x4_w;
+ const int bh = 4 * num_4x4_h;
+ const int num_blk_skip_w = num_4x4_w;
+ // Keep the intermediate value on the stack here. Writing directly to
+ // skippable causes speed regression due to load-and-store issues in
+ // update_yrd_loop_vars.
+ int temp_skippable = 1;
+ int tx_wd = 0;
+ const SCAN_ORDER *scan_order = NULL;
+ switch (tx_size) {
+ case TX_64X64:
+ assert(0); // Not implemented
+ break;
+ case TX_32X32:
+ assert(0); // Not used
+ break;
+ case TX_16X16:
+ scan_order = &av1_fast_idtx_scan_order_16x16;
+ tx_wd = 16;
+ break;
+ case TX_8X8:
+ scan_order = &av1_fast_idtx_scan_order_8x8;
+ tx_wd = 8;
+ break;
+ default:
+ assert(tx_size == TX_4X4);
+ scan_order = &av1_fast_idtx_scan_order_4x4;
+ tx_wd = 4;
+ break;
+ }
+ assert(scan_order != NULL);
+
+ this_rdc->dist = 0;
+ this_rdc->rate = 0;
+ aom_subtract_block(bh, bw, p->src_diff, bw, p->src.buf, p->src.stride,
+ pred_buf, pred_stride);
+ // Keep track of the row and column of the blocks we use so that we know
+ // if we are in the unrestricted motion border.
+ DECLARE_BLOCK_YRD_BUFFERS()
+ DECLARE_BLOCK_YRD_VARS()
+ for (int r = 0; r < max_blocks_high; r += block_step) {
+ for (int c = 0, s = 0; c < max_blocks_wide; c += block_step, s += step) {
+ DECLARE_LOOP_VARS_BLOCK_YRD()
+ scale_square_buf_vals(low_coeff, tx_wd, src_diff, diff_stride);
+ av1_quantize_lp(low_coeff, tx_wd * tx_wd, p->round_fp_QTX,
+ p->quant_fp_QTX, low_qcoeff, low_dqcoeff, p->dequant_QTX,
+ eob, scan_order->scan, scan_order->iscan);
+ assert(*eob <= 1024);
+ update_yrd_loop_vars(x, &temp_skippable, step, *eob, low_coeff,
+ low_qcoeff, low_dqcoeff, this_rdc, &eob_cost,
+ r * num_blk_skip_w + c);
+ }
+ }
+ this_rdc->skip_txfm = *skippable = temp_skippable;
+ if (this_rdc->sse < INT64_MAX) {
+ this_rdc->sse = (this_rdc->sse << 6) >> 2;
+ if (temp_skippable) {
+ this_rdc->dist = this_rdc->sse;
+ return;
+ }
+ }
+ // If skippable is set, rate gets clobbered later.
+ this_rdc->rate <<= (2 + AV1_PROB_COST_SHIFT);
+ this_rdc->rate += (eob_cost << AV1_PROB_COST_SHIFT);
+}
+
+int64_t av1_model_rd_for_sb_uv(AV1_COMP *cpi, BLOCK_SIZE plane_bsize,
+ MACROBLOCK *x, MACROBLOCKD *xd,
+ RD_STATS *this_rdc, int start_plane,
+ int stop_plane) {
+ // Note our transform coeffs are 8 times an orthogonal transform.
+ // Hence the quantizer step is also scaled by 8. To get the effective
+ // quantizer we need to divide by 8 before sending to the modeling function.
+ unsigned int sse;
+ int rate;
+ int64_t dist;
+ int plane;
+ int64_t tot_sse = 0;
+
+ this_rdc->rate = 0;
+ this_rdc->dist = 0;
+ this_rdc->skip_txfm = 0;
+
+ for (plane = start_plane; plane <= stop_plane; ++plane) {
+ struct macroblock_plane *const p = &x->plane[plane];
+ struct macroblockd_plane *const pd = &xd->plane[plane];
+ const uint32_t dc_quant = p->dequant_QTX[0];
+ const uint32_t ac_quant = p->dequant_QTX[1];
+ const BLOCK_SIZE bs = plane_bsize;
+ unsigned int var;
+ if (!x->color_sensitivity[COLOR_SENS_IDX(plane)]) continue;
+
+ var = cpi->ppi->fn_ptr[bs].vf(p->src.buf, p->src.stride, pd->dst.buf,
+ pd->dst.stride, &sse);
+ assert(sse >= var);
+ tot_sse += sse;
+
+ av1_model_rd_from_var_lapndz(sse - var, num_pels_log2_lookup[bs],
+ dc_quant >> 3, &rate, &dist);
+
+ this_rdc->rate += rate >> 1;
+ this_rdc->dist += dist << 3;
+
+ av1_model_rd_from_var_lapndz(var, num_pels_log2_lookup[bs], ac_quant >> 3,
+ &rate, &dist);
+
+ this_rdc->rate += rate;
+ this_rdc->dist += dist << 4;
+ }
+
+ if (this_rdc->rate == 0) {
+ this_rdc->skip_txfm = 1;
+ }
+
+ if (RDCOST(x->rdmult, this_rdc->rate, this_rdc->dist) >=
+ RDCOST(x->rdmult, 0, tot_sse << 4)) {
+ this_rdc->rate = 0;
+ this_rdc->dist = tot_sse << 4;
+ this_rdc->skip_txfm = 1;
+ }
+
+ return tot_sse;
+}
+
+static void compute_intra_yprediction(const AV1_COMMON *cm,
+ PREDICTION_MODE mode, BLOCK_SIZE bsize,
+ MACROBLOCK *x, MACROBLOCKD *xd) {
+ const SequenceHeader *seq_params = cm->seq_params;
+ struct macroblockd_plane *const pd = &xd->plane[AOM_PLANE_Y];
+ struct macroblock_plane *const p = &x->plane[AOM_PLANE_Y];
+ uint8_t *const src_buf_base = p->src.buf;
+ uint8_t *const dst_buf_base = pd->dst.buf;
+ const int src_stride = p->src.stride;
+ const int dst_stride = pd->dst.stride;
+ int plane = 0;
+ int row, col;
+ // Block and transform sizes, in number of 4x4 blocks log 2 ("*_b"):
+ // 4x4=0, 8x8=2, 16x16=4, 32x32=6, 64x64=8.
+ // The transform size varies per plane; look it up in a common way.
+ const TX_SIZE tx_size = max_txsize_lookup[bsize];
+ const BLOCK_SIZE plane_bsize =
+ get_plane_block_size(bsize, pd->subsampling_x, pd->subsampling_y);
+ // If mb_to_right_edge is < 0 we are in a situation in which
+ // the current block size extends into the UMV and we won't
+ // visit the sub blocks that are wholly within the UMV.
+ const int max_blocks_wide = max_block_wide(xd, plane_bsize, plane);
+ const int max_blocks_high = max_block_high(xd, plane_bsize, plane);
+ // Keep track of the row and column of the blocks we use so that we know
+ // if we are in the unrestricted motion border.
+ for (row = 0; row < max_blocks_high; row += (1 << tx_size)) {
+ // Skip visiting the sub blocks that are wholly within the UMV.
+ for (col = 0; col < max_blocks_wide; col += (1 << tx_size)) {
+ p->src.buf = &src_buf_base[4 * (row * (int64_t)src_stride + col)];
+ pd->dst.buf = &dst_buf_base[4 * (row * (int64_t)dst_stride + col)];
+ av1_predict_intra_block(
+ xd, seq_params->sb_size, seq_params->enable_intra_edge_filter,
+ block_size_wide[bsize], block_size_high[bsize], tx_size, mode, 0, 0,
+ FILTER_INTRA_MODES, pd->dst.buf, dst_stride, pd->dst.buf, dst_stride,
+ 0, 0, plane);
+ }
+ }
+ p->src.buf = src_buf_base;
+ pd->dst.buf = dst_buf_base;
+}
+
+// Checks whether the intra mode needs to be pruned based on the
+// 'intra_y_mode_bsize_mask_nrd' and 'prune_hv_pred_modes_using_src_sad'
+// speed features.
+static INLINE bool is_prune_intra_mode(
+ AV1_COMP *cpi, int mode_index, int force_intra_check, BLOCK_SIZE bsize,
+ uint8_t segment_id, SOURCE_SAD source_sad_nonrd,
+ uint8_t color_sensitivity[MAX_MB_PLANE - 1]) {
+ const PREDICTION_MODE this_mode = intra_mode_list[mode_index];
+ if (mode_index > 2 || force_intra_check == 0) {
+ if (!((1 << this_mode) & cpi->sf.rt_sf.intra_y_mode_bsize_mask_nrd[bsize]))
+ return true;
+
+ if (this_mode == DC_PRED) return false;
+
+ if (!cpi->sf.rt_sf.prune_hv_pred_modes_using_src_sad) return false;
+
+ const bool has_color_sensitivity =
+ color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_U)] &&
+ color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_V)];
+ if (has_color_sensitivity &&
+ (cpi->rc.frame_source_sad > 1.1 * cpi->rc.avg_source_sad ||
+ cyclic_refresh_segment_id_boosted(segment_id) ||
+ source_sad_nonrd > kMedSad))
+ return false;
+
+ return true;
+ }
+ return false;
+}
+
+/*!\brief Estimation of RD cost of an intra mode for Non-RD optimized case.
+ *
+ * \ingroup nonrd_mode_search
+ * \callgraph
+ * \callergraph
+ * Calculates RD Cost for an intra mode for a single TX block using Hadamard
+ * transform.
+ * \param[in] plane Color plane
+ * \param[in] block Index of a TX block in a prediction block
+ * \param[in] row Row of a current TX block
+ * \param[in] col Column of a current TX block
+ * \param[in] plane_bsize Block size of a current prediction block
+ * \param[in] tx_size Transform size
+ * \param[in] arg Pointer to a structure that holds parameters
+ * for intra mode search
+ *
+ * \remark Nothing is returned. Instead, the rate and distortion of this TX
+ * block are accumulated in \c args->rdc
+ */
+void av1_estimate_block_intra(int plane, int block, int row, int col,
+ BLOCK_SIZE plane_bsize, TX_SIZE tx_size,
+ void *arg) {
+ struct estimate_block_intra_args *const args = arg;
+ AV1_COMP *const cpi = args->cpi;
+ AV1_COMMON *const cm = &cpi->common;
+ MACROBLOCK *const x = args->x;
+ MACROBLOCKD *const xd = &x->e_mbd;
+ struct macroblock_plane *const p = &x->plane[plane];
+ struct macroblockd_plane *const pd = &xd->plane[plane];
+ const BLOCK_SIZE bsize_tx = txsize_to_bsize[tx_size];
+ uint8_t *const src_buf_base = p->src.buf;
+ uint8_t *const dst_buf_base = pd->dst.buf;
+ const int64_t src_stride = p->src.stride;
+ const int64_t dst_stride = pd->dst.stride;
+
+ (void)block;
+
+ av1_predict_intra_block_facade(cm, xd, plane, col, row, tx_size);
+
+ if (args->prune_mode_based_on_sad) {
+ unsigned int this_sad = cpi->ppi->fn_ptr[plane_bsize].sdf(
+ p->src.buf, p->src.stride, pd->dst.buf, pd->dst.stride);
+ const unsigned int sad_threshold =
+ args->best_sad != UINT_MAX ? args->best_sad + (args->best_sad >> 4)
+ : UINT_MAX;
+ // Skip the evaluation of current mode if its SAD is more than a threshold.
+ if (this_sad > sad_threshold) {
+ // For the current mode, set rate and distortion to maximum possible
+ // values and return.
+ // Note: args->rdc->rate is checked in av1_nonrd_pick_intra_mode() to skip
+ // the evaluation of the current mode.
+ args->rdc->rate = INT_MAX;
+ args->rdc->dist = INT64_MAX;
+ return;
+ }
+ if (this_sad < args->best_sad) {
+ args->best_sad = this_sad;
+ }
+ }
+
+ RD_STATS this_rdc;
+ av1_invalid_rd_stats(&this_rdc);
+
+ p->src.buf = &src_buf_base[4 * (row * src_stride + col)];
+ pd->dst.buf = &dst_buf_base[4 * (row * dst_stride + col)];
+
+ if (plane == 0) {
+ av1_block_yrd(x, &this_rdc, &args->skippable, bsize_tx,
+ AOMMIN(tx_size, TX_16X16));
+ } else {
+ av1_model_rd_for_sb_uv(cpi, bsize_tx, x, xd, &this_rdc, plane, plane);
+ }
+
+ p->src.buf = src_buf_base;
+ pd->dst.buf = dst_buf_base;
+ assert(args->rdc->rate != INT_MAX && args->rdc->dist != INT64_MAX);
+ args->rdc->rate += this_rdc.rate;
+ args->rdc->dist += this_rdc.dist;
+}
+
+/*!\brief Estimates best intra mode for inter mode search
+ *
+ * \ingroup nonrd_mode_search
+ * \callgraph
+ * \callergraph
+ *
+ * Using heuristics based on the best inter mode, block size, and other
+ * criteria, decides whether to check intra modes. If so, estimates and selects
+ * the best intra mode from a reduced set of intra modes (at most 4 checked)
+ *
+ * \param[in] cpi Top-level encoder structure
+ * \param[in] x Pointer to structure holding all the
+ * data for the current macroblock
+ * \param[in] bsize Current block size
+ * \param[in] best_early_term Flag, indicating that TX for the
+ * best inter mode was skipped
+ * \param[in] ref_cost_intra Cost of signalling intra mode
+ * \param[in] reuse_prediction Flag, indicating prediction re-use
+ * \param[in] orig_dst Original destination buffer
+ * \param[in] tmp_buffers Pointer to a temporary buffers for
+ * prediction re-use
+ * \param[out] this_mode_pred Pointer to store prediction buffer
+ * for prediction re-use
+ * \param[in] best_rdc Pointer to RD cost for the best
+ * selected intra mode
+ * \param[in] best_pickmode Pointer to a structure containing
+ * best mode picked so far
+ * \param[in] ctx Pointer to structure holding coding
+ * contexts and modes for the block
+ *
+ * \remark Nothing is returned. Instead, calculated RD cost is placed to
+ * \c best_rdc and best selected mode is placed to \c best_pickmode
+ *
+ */
+void av1_estimate_intra_mode(AV1_COMP *cpi, MACROBLOCK *x, BLOCK_SIZE bsize,
+ int best_early_term, unsigned int ref_cost_intra,
+ int reuse_prediction, struct buf_2d *orig_dst,
+ PRED_BUFFER *tmp_buffers,
+ PRED_BUFFER **this_mode_pred, RD_STATS *best_rdc,
+ BEST_PICKMODE *best_pickmode,
+ PICK_MODE_CONTEXT *ctx) {
+ AV1_COMMON *const cm = &cpi->common;
+ MACROBLOCKD *const xd = &x->e_mbd;
+ MB_MODE_INFO *const mi = xd->mi[0];
+ const TxfmSearchParams *txfm_params = &x->txfm_search_params;
+ const unsigned char segment_id = mi->segment_id;
+ const int *const rd_threshes = cpi->rd.threshes[segment_id][bsize];
+ const int *const rd_thresh_freq_fact = x->thresh_freq_fact[bsize];
+ const bool is_screen_content =
+ cpi->oxcf.tune_cfg.content == AOM_CONTENT_SCREEN;
+ struct macroblockd_plane *const pd = &xd->plane[AOM_PLANE_Y];
+ const REAL_TIME_SPEED_FEATURES *const rt_sf = &cpi->sf.rt_sf;
+
+ const CommonQuantParams *quant_params = &cm->quant_params;
+
+ RD_STATS this_rdc;
+
+ int intra_cost_penalty = av1_get_intra_cost_penalty(
+ quant_params->base_qindex, quant_params->y_dc_delta_q,
+ cm->seq_params->bit_depth);
+ int64_t inter_mode_thresh =
+ RDCOST(x->rdmult, ref_cost_intra + intra_cost_penalty, 0);
+ int perform_intra_pred = rt_sf->check_intra_pred_nonrd;
+ int force_intra_check = 0;
+ // For spatial enhancement layers: turn off intra prediction if the
+ // previous spatial layer (used as golden ref) is not chosen as the best
+ // reference. Only do this for temporal enhancement layers on non-key frames.
+ if (cpi->svc.spatial_layer_id > 0 &&
+ best_pickmode->best_ref_frame != GOLDEN_FRAME &&
+ cpi->svc.temporal_layer_id > 0 &&
+ !cpi->svc.layer_context[cpi->svc.temporal_layer_id].is_key_frame)
+ perform_intra_pred = 0;
+
+ int do_early_exit_rdthresh = 1;
+
+ uint32_t spatial_var_thresh = 50;
+ int motion_thresh = 32;
+ // Adjust thresholds to make intra mode more likely to be tested if the
+ // other references (golden, alt) are skipped/not checked. For now, always
+ // adjust for svc mode.
+ if (cpi->ppi->use_svc || (rt_sf->use_nonrd_altref_frame == 0 &&
+ rt_sf->nonrd_prune_ref_frame_search > 0)) {
+ spatial_var_thresh = 150;
+ motion_thresh = 0;
+ }
+
+ // Some adjustments to checking intra mode based on source variance.
+ if (x->source_variance < spatial_var_thresh) {
+ // If the best inter mode has large motion or a non-LAST ref, reduce the
+ // intra cost penalty so intra mode is more likely to be tested.
+ if (best_rdc->rdcost != INT64_MAX &&
+ (best_pickmode->best_ref_frame != LAST_FRAME ||
+ abs(mi->mv[0].as_mv.row) >= motion_thresh ||
+ abs(mi->mv[0].as_mv.col) >= motion_thresh)) {
+ intra_cost_penalty = intra_cost_penalty >> 2;
+ inter_mode_thresh =
+ RDCOST(x->rdmult, ref_cost_intra + intra_cost_penalty, 0);
+ do_early_exit_rdthresh = 0;
+ }
+ if ((x->source_variance < AOMMAX(50, (spatial_var_thresh >> 1)) &&
+ x->content_state_sb.source_sad_nonrd >= kHighSad) ||
+ (is_screen_content && x->source_variance < 50 &&
+ ((bsize >= BLOCK_32X32 &&
+ x->content_state_sb.source_sad_nonrd != kZeroSad) ||
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_U)] == 1 ||
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_V)] == 1)))
+ force_intra_check = 1;
+ // For big blocks it is worth checking intra (since only DC will be
+ // checked), even if best_early_term is set.
+ if (bsize >= BLOCK_32X32) best_early_term = 0;
+ } else if (rt_sf->source_metrics_sb_nonrd &&
+ x->content_state_sb.source_sad_nonrd <= kLowSad) {
+ perform_intra_pred = 0;
+ }
+
+ if (best_rdc->skip_txfm && best_pickmode->best_mode_initial_skip_flag) {
+ if (rt_sf->skip_intra_pred == 1 && best_pickmode->best_mode != NEWMV)
+ perform_intra_pred = 0;
+ else if (rt_sf->skip_intra_pred == 2)
+ perform_intra_pred = 0;
+ }
+
+ if (!(best_rdc->rdcost == INT64_MAX || force_intra_check ||
+ (perform_intra_pred && !best_early_term &&
+ bsize <= cpi->sf.part_sf.max_intra_bsize))) {
+ return;
+ }
+
+ // Early exit based on RD cost calculated using known rate. When
+ // is_screen_content is true, more bias is given to intra modes; hence, a
+ // more conservative threshold is used for the early exit in that case.
+ const int64_t known_rd = is_screen_content
+ ? CALC_BIASED_RDCOST(inter_mode_thresh)
+ : inter_mode_thresh;
+ if (known_rd > best_rdc->rdcost) return;
+
+ struct estimate_block_intra_args args;
+ init_estimate_block_intra_args(&args, cpi, x);
+ TX_SIZE intra_tx_size = AOMMIN(
+ AOMMIN(max_txsize_lookup[bsize],
+ tx_mode_to_biggest_tx_size[txfm_params->tx_mode_search_type]),
+ TX_16X16);
+ if (is_screen_content && cpi->rc.high_source_sad &&
+ x->source_variance > spatial_var_thresh && bsize <= BLOCK_16X16)
+ intra_tx_size = TX_4X4;
+
+ PRED_BUFFER *const best_pred = best_pickmode->best_pred;
+ if (reuse_prediction && best_pred != NULL) {
+ const int bh = block_size_high[bsize];
+ const int bw = block_size_wide[bsize];
+ if (best_pred->data == orig_dst->buf) {
+ *this_mode_pred = &tmp_buffers[get_pred_buffer(tmp_buffers, 3)];
+ aom_convolve_copy(best_pred->data, best_pred->stride,
+ (*this_mode_pred)->data, (*this_mode_pred)->stride, bw,
+ bh);
+ best_pickmode->best_pred = *this_mode_pred;
+ }
+ }
+ pd->dst = *orig_dst;
+
+ for (int midx = 0; midx < RTC_INTRA_MODES; ++midx) {
+ const PREDICTION_MODE this_mode = intra_mode_list[midx];
+ const THR_MODES mode_index = mode_idx[INTRA_FRAME][mode_offset(this_mode)];
+ const int64_t mode_rd_thresh = rd_threshes[mode_index];
+
+ if (is_prune_intra_mode(cpi, midx, force_intra_check, bsize, segment_id,
+ x->content_state_sb.source_sad_nonrd,
+ x->color_sensitivity))
+ continue;
+
+ if (is_screen_content && rt_sf->source_metrics_sb_nonrd) {
+ // For spatially flat blocks with zero motion only check
+ // DC mode.
+ if (x->content_state_sb.source_sad_nonrd == kZeroSad &&
+ x->source_variance == 0 && this_mode != DC_PRED)
+ continue;
+ // Only test Intra for big blocks if spatial_variance is small.
+ else if (bsize > BLOCK_32X32 && x->source_variance > 50)
+ continue;
+ }
+
+ if (rd_less_than_thresh(best_rdc->rdcost, mode_rd_thresh,
+ rd_thresh_freq_fact[mode_index]) &&
+ (do_early_exit_rdthresh || this_mode == SMOOTH_PRED)) {
+ continue;
+ }
+ const BLOCK_SIZE uv_bsize =
+ get_plane_block_size(bsize, xd->plane[AOM_PLANE_U].subsampling_x,
+ xd->plane[AOM_PLANE_U].subsampling_y);
+
+ mi->mode = this_mode;
+ mi->ref_frame[0] = INTRA_FRAME;
+ mi->ref_frame[1] = NONE_FRAME;
+
+ av1_invalid_rd_stats(&this_rdc);
+ args.mode = this_mode;
+ args.skippable = 1;
+ args.rdc = &this_rdc;
+ mi->tx_size = intra_tx_size;
+ compute_intra_yprediction(cm, this_mode, bsize, x, xd);
+ // Look into selecting tx_size here, based on prediction residual.
+ av1_block_yrd(x, &this_rdc, &args.skippable, bsize, mi->tx_size);
+ // TODO(kyslov@) Need to account for skippable
+ if (x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_U)]) {
+ av1_foreach_transformed_block_in_plane(xd, uv_bsize, AOM_PLANE_U,
+ av1_estimate_block_intra, &args);
+ }
+ if (x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_V)]) {
+ av1_foreach_transformed_block_in_plane(xd, uv_bsize, AOM_PLANE_V,
+ av1_estimate_block_intra, &args);
+ }
+
+ int mode_cost = 0;
+ if (av1_is_directional_mode(this_mode) && av1_use_angle_delta(bsize)) {
+ mode_cost +=
+ x->mode_costs.angle_delta_cost[this_mode - V_PRED]
+ [MAX_ANGLE_DELTA +
+ mi->angle_delta[PLANE_TYPE_Y]];
+ }
+ if (this_mode == DC_PRED && av1_filter_intra_allowed_bsize(cm, bsize)) {
+ mode_cost += x->mode_costs.filter_intra_cost[bsize][0];
+ }
+ this_rdc.rate += ref_cost_intra;
+ this_rdc.rate += intra_cost_penalty;
+ this_rdc.rate += mode_cost;
+ this_rdc.rdcost = RDCOST(x->rdmult, this_rdc.rate, this_rdc.dist);
+
+ if (is_screen_content && rt_sf->source_metrics_sb_nonrd) {
+ // For blocks with low spatial variance and color SAD,
+ // favor intra modes, but only on scene/slide changes.
+ if (cpi->rc.high_source_sad && x->source_variance < 800 &&
+ (x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_U)] ||
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_V)]))
+ this_rdc.rdcost = CALC_BIASED_RDCOST(this_rdc.rdcost);
+ // Otherwise bias against intra for blocks with zero
+ // motion and no color, on non-scene/slide changes.
+ else if (!cpi->rc.high_source_sad && x->source_variance > 0 &&
+ x->content_state_sb.source_sad_nonrd == kZeroSad &&
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_U)] == 0 &&
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_V)] == 0)
+ this_rdc.rdcost = (3 * this_rdc.rdcost) >> 1;
+ }
+
+ if (this_rdc.rdcost < best_rdc->rdcost) {
+ *best_rdc = this_rdc;
+ best_pickmode->best_mode = this_mode;
+ best_pickmode->best_tx_size = mi->tx_size;
+ best_pickmode->best_ref_frame = INTRA_FRAME;
+ best_pickmode->best_second_ref_frame = NONE;
+ best_pickmode->best_mode_skip_txfm = this_rdc.skip_txfm;
+ mi->uv_mode = this_mode;
+ mi->mv[0].as_int = INVALID_MV;
+ mi->mv[1].as_int = INVALID_MV;
+ if (!this_rdc.skip_txfm)
+ memset(ctx->blk_skip, 0,
+ sizeof(x->txfm_search_info.blk_skip[0]) * ctx->num_4x4_blk);
+ }
+ }
+ if (best_pickmode->best_ref_frame == INTRA_FRAME)
+ memset(ctx->blk_skip, 0,
+ sizeof(x->txfm_search_info.blk_skip[0]) * ctx->num_4x4_blk);
+ mi->tx_size = best_pickmode->best_tx_size;
+}
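Note that av1_block_yrd_idtx above never calls a transform kernel: for IDTX the forward transform is the identity up to a gain of 8, matching the 8x scaling of the orthogonal transforms noted in av1_model_rd_for_sb_uv, so scale_square_buf_vals simply copies the residual multiplied by 8. A minimal equivalent, without the per-size case enumeration used above to coax the compiler into SIMD:

#include <stdint.h>

/* Identity "transform" for IDTX: copy the residual scaled by 8 so the
 * coefficients match the 8x gain of the orthogonal transforms and the same
 * quantizers apply. */
static void fast_idtx_sketch(int16_t *dst, int tx_width, const int16_t *src,
                             int src_stride) {
  for (int idy = 0; idy < tx_width; ++idy)
    for (int idx = 0; idx < tx_width; ++idx)
      dst[idy * tx_width + idx] = src[idy * src_stride + idx] * 8;
}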
diff --git a/av1/encoder/nonrd_opt.h b/av1/encoder/nonrd_opt.h
index 0d0db81..7948c78 100644
--- a/av1/encoder/nonrd_opt.h
+++ b/av1/encoder/nonrd_opt.h
@@ -13,10 +13,104 @@
#define AOM_AV1_ENCODER_NONRD_OPT_H_
#include "av1/encoder/rdopt_utils.h"
+#include "av1/encoder/rdopt.h"
#define RTC_INTER_MODES (4)
#define RTC_INTRA_MODES (4)
#define RTC_MODES (AOMMAX(RTC_INTER_MODES, RTC_INTRA_MODES))
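+// Scales an RD cost by 7/8, i.e. biases the comparison in favor of the mode
+// whose cost is being scaled.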
+#define CALC_BIASED_RDCOST(rdcost) (7 * (rdcost) >> 3)
+#define NUM_COMP_INTER_MODES_RT (6)
+#define NUM_INTER_MODES 12
+#define CAP_TX_SIZE_FOR_BSIZE_GT32(tx_mode_search_type, bsize) \
+ (((tx_mode_search_type) != ONLY_4X4 && (bsize) > BLOCK_32X32) ? true : false)
+#define TX_SIZE_FOR_BSIZE_GT32 (TX_16X16)
+#define FILTER_SEARCH_SIZE 2
+#if !CONFIG_REALTIME_ONLY
+#define MOTION_MODE_SEARCH_SIZE 2
+#endif
+
+extern int g_pick_inter_mode_cnt;
+/*!\cond */
+typedef struct {
+ uint8_t *data;
+ int stride;
+ int in_use;
+} PRED_BUFFER;
+
+typedef struct {
+ PRED_BUFFER *best_pred;
+ PREDICTION_MODE best_mode;
+ TX_SIZE best_tx_size;
+ TX_TYPE tx_type;
+ MV_REFERENCE_FRAME best_ref_frame;
+ MV_REFERENCE_FRAME best_second_ref_frame;
+ uint8_t best_mode_skip_txfm;
+ uint8_t best_mode_initial_skip_flag;
+ int_interpfilters best_pred_filter;
+ MOTION_MODE best_motion_mode;
+ WarpedMotionParams wm_params;
+ int num_proj_ref;
+ PALETTE_MODE_INFO pmi;
+ int64_t best_sse;
+} BEST_PICKMODE;
+
+typedef struct {
+ MV_REFERENCE_FRAME ref_frame;
+ PREDICTION_MODE pred_mode;
+} REF_MODE;
+
+typedef struct {
+ MV_REFERENCE_FRAME ref_frame[2];
+ PREDICTION_MODE pred_mode;
+} COMP_REF_MODE;
+
+struct estimate_block_intra_args {
+ AV1_COMP *cpi;
+ MACROBLOCK *x;
+ PREDICTION_MODE mode;
+ int skippable;
+ RD_STATS *rdc;
+ unsigned int best_sad;
+ bool prune_mode_based_on_sad;
+};
+/*!\endcond */
+
+/*!\brief Structure to store parameters and statistics used in non-rd inter mode
+ * evaluation.
+ */
+typedef struct {
+ //! Structure to hold best inter mode data
+ BEST_PICKMODE best_pickmode;
+ //! Structure to hold the RD cost of the current mode
+ RD_STATS this_rdc;
+ //! Pointer to the RD Cost for the best mode found so far
+ RD_STATS best_rdc;
+ //! Distortion of chroma planes for all modes and reference frames
+ int64_t uv_dist[RTC_INTER_MODES][REF_FRAMES];
+ //! Buffer to hold predicted block for all reference frames and planes
+ struct buf_2d yv12_mb[REF_FRAMES][MAX_MB_PLANE];
+ //! Array to hold variance of all modes and reference frames
+ unsigned int vars[RTC_INTER_MODES][REF_FRAMES];
+ //! Array to hold ref cost of single reference mode for all ref frames
+ unsigned int ref_costs_single[REF_FRAMES];
+ //! Array to hold motion vector for all modes and reference frames
+ int_mv frame_mv[MB_MODE_COUNT][REF_FRAMES];
+ //! Array to hold best mv for all modes and reference frames
+ int_mv frame_mv_best[MB_MODE_COUNT][REF_FRAMES];
+ //! Array to hold inter mode cost of single ref mode for all ref frames
+ int single_inter_mode_costs[RTC_INTER_MODES][REF_FRAMES];
+ //! Array to hold the use-reference-frame mask for each reference frame
+ int use_ref_frame_mask[REF_FRAMES];
+ //! Array to hold flags of evaluated modes for each reference frame
+ uint8_t mode_checked[MB_MODE_COUNT][REF_FRAMES];
+} InterModeSearchStateNonrd;
+
+static const uint8_t b_width_log2_lookup[BLOCK_SIZES] = { 0, 0, 1, 1, 1, 2,
+ 2, 2, 3, 3, 3, 4,
+ 4, 4, 5, 5 };
+static const uint8_t b_height_log2_lookup[BLOCK_SIZES] = { 0, 1, 0, 1, 2, 1,
+ 2, 3, 2, 3, 4, 3,
+ 4, 5, 4, 5 };
static const PREDICTION_MODE intra_mode_list[] = { DC_PRED, V_PRED, H_PRED,
SMOOTH_PRED };
@@ -35,6 +129,266 @@
{ THR_NEARESTA, THR_NEARA, THR_GLOBALA, THR_NEWA },
};
+// GLOBALMV in the set below is in fact ZEROMV as we don't do global ME in RT
+// mode
+static const REF_MODE ref_mode_set[NUM_INTER_MODES] = {
+ { LAST_FRAME, NEARESTMV }, { LAST_FRAME, NEARMV },
+ { LAST_FRAME, GLOBALMV }, { LAST_FRAME, NEWMV },
+ { GOLDEN_FRAME, NEARESTMV }, { GOLDEN_FRAME, NEARMV },
+ { GOLDEN_FRAME, GLOBALMV }, { GOLDEN_FRAME, NEWMV },
+ { ALTREF_FRAME, NEARESTMV }, { ALTREF_FRAME, NEARMV },
+ { ALTREF_FRAME, GLOBALMV }, { ALTREF_FRAME, NEWMV },
+};
+
+static const COMP_REF_MODE comp_ref_mode_set[NUM_COMP_INTER_MODES_RT] = {
+ { { LAST_FRAME, GOLDEN_FRAME }, GLOBAL_GLOBALMV },
+ { { LAST_FRAME, GOLDEN_FRAME }, NEAREST_NEARESTMV },
+ { { LAST_FRAME, LAST2_FRAME }, GLOBAL_GLOBALMV },
+ { { LAST_FRAME, LAST2_FRAME }, NEAREST_NEARESTMV },
+ { { LAST_FRAME, ALTREF_FRAME }, GLOBAL_GLOBALMV },
+ { { LAST_FRAME, ALTREF_FRAME }, NEAREST_NEARESTMV },
+};
+
+static const int_interpfilters filters_ref_set[9] = {
+ [0].as_filters = { EIGHTTAP_REGULAR, EIGHTTAP_REGULAR },
+ [1].as_filters = { EIGHTTAP_SMOOTH, EIGHTTAP_SMOOTH },
+ [2].as_filters = { EIGHTTAP_REGULAR, EIGHTTAP_SMOOTH },
+ [3].as_filters = { EIGHTTAP_SMOOTH, EIGHTTAP_REGULAR },
+ [4].as_filters = { MULTITAP_SHARP, MULTITAP_SHARP },
+ [5].as_filters = { EIGHTTAP_REGULAR, MULTITAP_SHARP },
+ [6].as_filters = { MULTITAP_SHARP, EIGHTTAP_REGULAR },
+ [7].as_filters = { EIGHTTAP_SMOOTH, MULTITAP_SHARP },
+ [8].as_filters = { MULTITAP_SHARP, EIGHTTAP_SMOOTH }
+};
+
+enum {
+ // INTER_ALL = (1 << NEARESTMV) | (1 << NEARMV) | (1 << NEWMV),
+ INTER_NEAREST = (1 << NEARESTMV),
+ INTER_NEAREST_NEW = (1 << NEARESTMV) | (1 << NEWMV),
+ INTER_NEAREST_NEAR = (1 << NEARESTMV) | (1 << NEARMV),
+ INTER_NEAR_NEW = (1 << NEARMV) | (1 << NEWMV),
+};
+
+// The original scan order (default_scan_8x8) is modified according to the
+// extra transpose in the Hadamard C implementations, i.e.,
+// aom_hadamard_lp_8x8_c and aom_hadamard_8x8_c.
+DECLARE_ALIGNED(16, static const int16_t, default_scan_8x8_transpose[64]) = {
+ 0, 8, 1, 2, 9, 16, 24, 17, 10, 3, 4, 11, 18, 25, 32, 40,
+ 33, 26, 19, 12, 5, 6, 13, 20, 27, 34, 41, 48, 56, 49, 42, 35,
+ 28, 21, 14, 7, 15, 22, 29, 36, 43, 50, 57, 58, 51, 44, 37, 30,
+ 23, 31, 38, 45, 52, 59, 60, 53, 46, 39, 47, 54, 61, 62, 55, 63
+};
+
+// The original scan order (av1_default_iscan_8x8) is modified to match
+// the Hadamard AVX2 implementations, i.e., aom_hadamard_lp_8x8_avx2 and
+// aom_hadamard_8x8_avx2. Since the Hadamard AVX2 implementations modify the
+// order of coefficients such that the normal scan order is no longer
+// guaranteed to scan low coefficients first, we modify the scan order
+// accordingly.
+// Note that this one has to be used together with default_scan_8x8_transpose.
+DECLARE_ALIGNED(16, static const int16_t,
+ av1_default_iscan_8x8_transpose[64]) = {
+ 0, 2, 3, 9, 10, 20, 21, 35, 1, 4, 8, 11, 19, 22, 34, 36,
+ 5, 7, 12, 18, 23, 33, 37, 48, 6, 13, 17, 24, 32, 38, 47, 49,
+ 14, 16, 25, 31, 39, 46, 50, 57, 15, 26, 30, 40, 45, 51, 56, 58,
+ 27, 29, 41, 44, 52, 55, 59, 62, 28, 42, 43, 53, 54, 60, 61, 63
+};
+
+// The original scan order (default_scan_16x16) is modified according to the
+// extra transpose in the Hadamard C implementation in the lp case, i.e.,
+// aom_hadamard_lp_16x16_c.
+DECLARE_ALIGNED(16, static const int16_t,
+ default_scan_lp_16x16_transpose[256]) = {
+ 0, 8, 2, 4, 10, 16, 24, 18, 12, 6, 64, 14, 20, 26, 32,
+ 40, 34, 28, 22, 72, 66, 68, 74, 80, 30, 36, 42, 48, 56, 50,
+ 44, 38, 88, 82, 76, 70, 128, 78, 84, 90, 96, 46, 52, 58, 1,
+ 9, 3, 60, 54, 104, 98, 92, 86, 136, 130, 132, 138, 144, 94, 100,
+ 106, 112, 62, 5, 11, 17, 25, 19, 13, 7, 120, 114, 108, 102, 152,
+ 146, 140, 134, 192, 142, 148, 154, 160, 110, 116, 122, 65, 15, 21, 27,
+ 33, 41, 35, 29, 23, 73, 67, 124, 118, 168, 162, 156, 150, 200, 194,
+ 196, 202, 208, 158, 164, 170, 176, 126, 69, 75, 81, 31, 37, 43, 49,
+ 57, 51, 45, 39, 89, 83, 77, 71, 184, 178, 172, 166, 216, 210, 204,
+ 198, 206, 212, 218, 224, 174, 180, 186, 129, 79, 85, 91, 97, 47, 53,
+ 59, 61, 55, 105, 99, 93, 87, 137, 131, 188, 182, 232, 226, 220, 214,
+ 222, 228, 234, 240, 190, 133, 139, 145, 95, 101, 107, 113, 63, 121, 115,
+ 109, 103, 153, 147, 141, 135, 248, 242, 236, 230, 238, 244, 250, 193, 143,
+ 149, 155, 161, 111, 117, 123, 125, 119, 169, 163, 157, 151, 201, 195, 252,
+ 246, 254, 197, 203, 209, 159, 165, 171, 177, 127, 185, 179, 173, 167, 217,
+ 211, 205, 199, 207, 213, 219, 225, 175, 181, 187, 189, 183, 233, 227, 221,
+ 215, 223, 229, 235, 241, 191, 249, 243, 237, 231, 239, 245, 251, 253, 247,
+ 255
+};
+
+#if CONFIG_AV1_HIGHBITDEPTH
+// The original scan order (default_scan_16x16) is modified according to the
+// extra shift in the Hadamard C implementation in the fp case, i.e.,
+// aom_hadamard_16x16_c. Note that the 16x16 lp and fp Hadamard generate
+// different outputs, so we handle them separately.
+DECLARE_ALIGNED(16, static const int16_t,
+ default_scan_fp_16x16_transpose[256]) = {
+ 0, 4, 2, 8, 6, 16, 20, 18, 12, 10, 64, 14, 24, 22, 32,
+ 36, 34, 28, 26, 68, 66, 72, 70, 80, 30, 40, 38, 48, 52, 50,
+ 44, 42, 84, 82, 76, 74, 128, 78, 88, 86, 96, 46, 56, 54, 1,
+ 5, 3, 60, 58, 100, 98, 92, 90, 132, 130, 136, 134, 144, 94, 104,
+ 102, 112, 62, 9, 7, 17, 21, 19, 13, 11, 116, 114, 108, 106, 148,
+ 146, 140, 138, 192, 142, 152, 150, 160, 110, 120, 118, 65, 15, 25, 23,
+ 33, 37, 35, 29, 27, 69, 67, 124, 122, 164, 162, 156, 154, 196, 194,
+ 200, 198, 208, 158, 168, 166, 176, 126, 73, 71, 81, 31, 41, 39, 49,
+ 53, 51, 45, 43, 85, 83, 77, 75, 180, 178, 172, 170, 212, 210, 204,
+ 202, 206, 216, 214, 224, 174, 184, 182, 129, 79, 89, 87, 97, 47, 57,
+ 55, 61, 59, 101, 99, 93, 91, 133, 131, 188, 186, 228, 226, 220, 218,
+ 222, 232, 230, 240, 190, 137, 135, 145, 95, 105, 103, 113, 63, 117, 115,
+ 109, 107, 149, 147, 141, 139, 244, 242, 236, 234, 238, 248, 246, 193, 143,
+ 153, 151, 161, 111, 121, 119, 125, 123, 165, 163, 157, 155, 197, 195, 252,
+ 250, 254, 201, 199, 209, 159, 169, 167, 177, 127, 181, 179, 173, 171, 213,
+ 211, 205, 203, 207, 217, 215, 225, 175, 185, 183, 189, 187, 229, 227, 221,
+ 219, 223, 233, 231, 241, 191, 245, 243, 237, 235, 239, 249, 247, 253, 251,
+ 255
+};
+#endif
+
+// The original scan order (av1_default_iscan_16x16) is modified to match the
+// Hadamard AVX2 implementation, i.e., aom_hadamard_lp_16x16_avx2.
+// Since the Hadamard AVX2 implementation modifies the order of coefficients
+// such that the normal scan order is no longer guaranteed to scan low
+// coefficients first, we modify the scan order accordingly. Note that
+// this one has to be used together with default_scan_lp_16x16_transpose.
+DECLARE_ALIGNED(16, static const int16_t,
+ av1_default_iscan_lp_16x16_transpose[256]) = {
+ 0, 44, 2, 46, 3, 63, 9, 69, 1, 45, 4, 64, 8, 68, 11,
+ 87, 5, 65, 7, 67, 12, 88, 18, 94, 6, 66, 13, 89, 17, 93,
+ 24, 116, 14, 90, 16, 92, 25, 117, 31, 123, 15, 91, 26, 118, 30,
+ 122, 41, 148, 27, 119, 29, 121, 42, 149, 48, 152, 28, 120, 43, 150,
+ 47, 151, 62, 177, 10, 86, 20, 96, 21, 113, 35, 127, 19, 95, 22,
+ 114, 34, 126, 37, 144, 23, 115, 33, 125, 38, 145, 52, 156, 32, 124,
+ 39, 146, 51, 155, 58, 173, 40, 147, 50, 154, 59, 174, 73, 181, 49,
+ 153, 60, 175, 72, 180, 83, 198, 61, 176, 71, 179, 84, 199, 98, 202,
+ 70, 178, 85, 200, 97, 201, 112, 219, 36, 143, 54, 158, 55, 170, 77,
+ 185, 53, 157, 56, 171, 76, 184, 79, 194, 57, 172, 75, 183, 80, 195,
+ 102, 206, 74, 182, 81, 196, 101, 205, 108, 215, 82, 197, 100, 204, 109,
+ 216, 131, 223, 99, 203, 110, 217, 130, 222, 140, 232, 111, 218, 129, 221,
+ 141, 233, 160, 236, 128, 220, 142, 234, 159, 235, 169, 245, 78, 193, 104,
+ 208, 105, 212, 135, 227, 103, 207, 106, 213, 134, 226, 136, 228, 107, 214,
+ 133, 225, 137, 229, 164, 240, 132, 224, 138, 230, 163, 239, 165, 241, 139,
+ 231, 162, 238, 166, 242, 189, 249, 161, 237, 167, 243, 188, 248, 190, 250,
+ 168, 244, 187, 247, 191, 251, 210, 254, 186, 246, 192, 252, 209, 253, 211,
+ 255
+};
+
+#if CONFIG_AV1_HIGHBITDEPTH
+// The original scan order (av1_default_iscan_16x16) is modified to match the
+// Hadamard AVX2 implementation, i.e., aom_hadamard_16x16_avx2.
+// Since the Hadamard AVX2 implementation modifies the order of coefficients
+// such that the normal scan order is no longer guaranteed to scan low
+// coefficients first, we modify the scan order accordingly. Note that
+// this one has to be used together with default_scan_fp_16x16_transpose.
+DECLARE_ALIGNED(16, static const int16_t,
+ av1_default_iscan_fp_16x16_transpose[256]) = {
+ 0, 44, 2, 46, 1, 45, 4, 64, 3, 63, 9, 69, 8, 68, 11,
+ 87, 5, 65, 7, 67, 6, 66, 13, 89, 12, 88, 18, 94, 17, 93,
+ 24, 116, 14, 90, 16, 92, 15, 91, 26, 118, 25, 117, 31, 123, 30,
+ 122, 41, 148, 27, 119, 29, 121, 28, 120, 43, 150, 42, 149, 48, 152,
+ 47, 151, 62, 177, 10, 86, 20, 96, 19, 95, 22, 114, 21, 113, 35,
+ 127, 34, 126, 37, 144, 23, 115, 33, 125, 32, 124, 39, 146, 38, 145,
+ 52, 156, 51, 155, 58, 173, 40, 147, 50, 154, 49, 153, 60, 175, 59,
+ 174, 73, 181, 72, 180, 83, 198, 61, 176, 71, 179, 70, 178, 85, 200,
+ 84, 199, 98, 202, 97, 201, 112, 219, 36, 143, 54, 158, 53, 157, 56,
+ 171, 55, 170, 77, 185, 76, 184, 79, 194, 57, 172, 75, 183, 74, 182,
+ 81, 196, 80, 195, 102, 206, 101, 205, 108, 215, 82, 197, 100, 204, 99,
+ 203, 110, 217, 109, 216, 131, 223, 130, 222, 140, 232, 111, 218, 129, 221,
+ 128, 220, 142, 234, 141, 233, 160, 236, 159, 235, 169, 245, 78, 193, 104,
+ 208, 103, 207, 106, 213, 105, 212, 135, 227, 134, 226, 136, 228, 107, 214,
+ 133, 225, 132, 224, 138, 230, 137, 229, 164, 240, 163, 239, 165, 241, 139,
+ 231, 162, 238, 161, 237, 167, 243, 166, 242, 189, 249, 188, 248, 190, 250,
+ 168, 244, 187, 247, 186, 246, 192, 252, 191, 251, 210, 254, 209, 253, 211,
+ 255
+};
+#endif
+
+// For entropy coding, IDTX shares the scan orders of the other 2D transforms,
+// but the fastest way to calculate the IDTX transform (i.e., no transposes)
+// results in coefficients that are a transposition of the entropy coding
+// versions. These tables are used as a substitute for the scan order for the
+// faster version of IDTX.
+
+// Must be used together with av1_fast_idtx_iscan_4x4
+DECLARE_ALIGNED(16, static const int16_t,
+ av1_fast_idtx_scan_4x4[16]) = { 0, 1, 4, 8, 5, 2, 3, 6,
+ 9, 12, 13, 10, 7, 11, 14, 15 };
+
+// Must be used together with av1_fast_idtx_scan_4x4
+DECLARE_ALIGNED(16, static const int16_t,
+ av1_fast_idtx_iscan_4x4[16]) = { 0, 1, 5, 6, 2, 4, 7, 12,
+ 3, 8, 11, 13, 9, 10, 14, 15 };
+
+static const SCAN_ORDER av1_fast_idtx_scan_order_4x4 = {
+ av1_fast_idtx_scan_4x4, av1_fast_idtx_iscan_4x4
+};
+
+// Must be used together with av1_fast_idtx_iscan_8x8
+DECLARE_ALIGNED(16, static const int16_t, av1_fast_idtx_scan_8x8[64]) = {
+ 0, 1, 8, 16, 9, 2, 3, 10, 17, 24, 32, 25, 18, 11, 4, 5,
+ 12, 19, 26, 33, 40, 48, 41, 34, 27, 20, 13, 6, 7, 14, 21, 28,
+ 35, 42, 49, 56, 57, 50, 43, 36, 29, 22, 15, 23, 30, 37, 44, 51,
+ 58, 59, 52, 45, 38, 31, 39, 46, 53, 60, 61, 54, 47, 55, 62, 63
+};
+
+// Must be used together with av1_fast_idtx_scan_8x8
+DECLARE_ALIGNED(16, static const int16_t, av1_fast_idtx_iscan_8x8[64]) = {
+ 0, 1, 5, 6, 14, 15, 27, 28, 2, 4, 7, 13, 16, 26, 29, 42,
+ 3, 8, 12, 17, 25, 30, 41, 43, 9, 11, 18, 24, 31, 40, 44, 53,
+ 10, 19, 23, 32, 39, 45, 52, 54, 20, 22, 33, 38, 46, 51, 55, 60,
+ 21, 34, 37, 47, 50, 56, 59, 61, 35, 36, 48, 49, 57, 58, 62, 63
+};
+
+static const SCAN_ORDER av1_fast_idtx_scan_order_8x8 = {
+ av1_fast_idtx_scan_8x8, av1_fast_idtx_iscan_8x8
+};
+
+// Must be used together with av1_fast_idtx_iscan_16x16
+DECLARE_ALIGNED(16, static const int16_t, av1_fast_idtx_scan_16x16[256]) = {
+ 0, 1, 16, 32, 17, 2, 3, 18, 33, 48, 64, 49, 34, 19, 4,
+ 5, 20, 35, 50, 65, 80, 96, 81, 66, 51, 36, 21, 6, 7, 22,
+ 37, 52, 67, 82, 97, 112, 128, 113, 98, 83, 68, 53, 38, 23, 8,
+ 9, 24, 39, 54, 69, 84, 99, 114, 129, 144, 160, 145, 130, 115, 100,
+ 85, 70, 55, 40, 25, 10, 11, 26, 41, 56, 71, 86, 101, 116, 131,
+ 146, 161, 176, 192, 177, 162, 147, 132, 117, 102, 87, 72, 57, 42, 27,
+ 12, 13, 28, 43, 58, 73, 88, 103, 118, 133, 148, 163, 178, 193, 208,
+ 224, 209, 194, 179, 164, 149, 134, 119, 104, 89, 74, 59, 44, 29, 14,
+ 15, 30, 45, 60, 75, 90, 105, 120, 135, 150, 165, 180, 195, 210, 225,
+ 240, 241, 226, 211, 196, 181, 166, 151, 136, 121, 106, 91, 76, 61, 46,
+ 31, 47, 62, 77, 92, 107, 122, 137, 152, 167, 182, 197, 212, 227, 242,
+ 243, 228, 213, 198, 183, 168, 153, 138, 123, 108, 93, 78, 63, 79, 94,
+ 109, 124, 139, 154, 169, 184, 199, 214, 229, 244, 245, 230, 215, 200, 185,
+ 170, 155, 140, 125, 110, 95, 111, 126, 141, 156, 171, 186, 201, 216, 231,
+ 246, 247, 232, 217, 202, 187, 172, 157, 142, 127, 143, 158, 173, 188, 203,
+ 218, 233, 248, 249, 234, 219, 204, 189, 174, 159, 175, 190, 205, 220, 235,
+ 250, 251, 236, 221, 206, 191, 207, 222, 237, 252, 253, 238, 223, 239, 254,
+ 255
+};
+
+// Must be used together with av1_fast_idtx_scan_16x16
+DECLARE_ALIGNED(16, static const int16_t, av1_fast_idtx_iscan_16x16[256]) = {
+ 0, 1, 5, 6, 14, 15, 27, 28, 44, 45, 65, 66, 90, 91, 119,
+ 120, 2, 4, 7, 13, 16, 26, 29, 43, 46, 64, 67, 89, 92, 118,
+ 121, 150, 3, 8, 12, 17, 25, 30, 42, 47, 63, 68, 88, 93, 117,
+ 122, 149, 151, 9, 11, 18, 24, 31, 41, 48, 62, 69, 87, 94, 116,
+ 123, 148, 152, 177, 10, 19, 23, 32, 40, 49, 61, 70, 86, 95, 115,
+ 124, 147, 153, 176, 178, 20, 22, 33, 39, 50, 60, 71, 85, 96, 114,
+ 125, 146, 154, 175, 179, 200, 21, 34, 38, 51, 59, 72, 84, 97, 113,
+ 126, 145, 155, 174, 180, 199, 201, 35, 37, 52, 58, 73, 83, 98, 112,
+ 127, 144, 156, 173, 181, 198, 202, 219, 36, 53, 57, 74, 82, 99, 111,
+ 128, 143, 157, 172, 182, 197, 203, 218, 220, 54, 56, 75, 81, 100, 110,
+ 129, 142, 158, 171, 183, 196, 204, 217, 221, 234, 55, 76, 80, 101, 109,
+ 130, 141, 159, 170, 184, 195, 205, 216, 222, 233, 235, 77, 79, 102, 108,
+ 131, 140, 160, 169, 185, 194, 206, 215, 223, 232, 236, 245, 78, 103, 107,
+ 132, 139, 161, 168, 186, 193, 207, 214, 224, 231, 237, 244, 246, 104, 106,
+ 133, 138, 162, 167, 187, 192, 208, 213, 225, 230, 238, 243, 247, 252, 105,
+ 134, 137, 163, 166, 188, 191, 209, 212, 226, 229, 239, 242, 248, 251, 253,
+ 135, 136, 164, 165, 189, 190, 210, 211, 227, 228, 240, 241, 249, 250, 254,
+ 255
+};
+
// Indicates the blocks for which RD model should be based on special logic
static INLINE int get_model_rd_flag(const AV1_COMP *cpi, const MACROBLOCKD *xd,
BLOCK_SIZE bsize) {
@@ -59,9 +413,6 @@
* \param[in] ref_frame Reference frame for which to find
* ref MVs
* \param[in] frame_mv Predicted MVs for a block
- * \param[in] tile_data Pointer to struct holding adaptive
- * data/contexts/models for the tile
- * during encoding
* \param[in] yv12_mb Buffer to hold predicted block
* \param[in] bsize Current block size
* \param[in] force_skip_low_temp_var Flag indicating possible mode search
@@ -71,18 +422,19 @@
* \remark Nothing is returned. Instead, predicted MVs are placed into
* \c frame_mv array
*/
-static INLINE void find_predictors(
- AV1_COMP *cpi, MACROBLOCK *x, MV_REFERENCE_FRAME ref_frame,
- int_mv frame_mv[MB_MODE_COUNT][REF_FRAMES], TileDataEnc *tile_data,
- struct buf_2d yv12_mb[8][MAX_MB_PLANE], BLOCK_SIZE bsize,
- int force_skip_low_temp_var, int skip_pred_mv) {
+static INLINE void find_predictors(AV1_COMP *cpi, MACROBLOCK *x,
+ MV_REFERENCE_FRAME ref_frame,
+ int_mv frame_mv[MB_MODE_COUNT][REF_FRAMES],
+ struct buf_2d yv12_mb[8][MAX_MB_PLANE],
+ BLOCK_SIZE bsize,
+ int force_skip_low_temp_var,
+ int skip_pred_mv) {
AV1_COMMON *const cm = &cpi->common;
MACROBLOCKD *const xd = &x->e_mbd;
MB_MODE_INFO *const mbmi = xd->mi[0];
MB_MODE_INFO_EXT *const mbmi_ext = &x->mbmi_ext;
const YV12_BUFFER_CONFIG *yv12 = get_ref_frame_yv12_buf(cm, ref_frame);
const int num_planes = av1_num_planes(cm);
- (void)tile_data;
x->pred_mv_sad[ref_frame] = INT_MAX;
x->pred_mv0_sad[ref_frame] = INT_MAX;
@@ -117,4 +469,99 @@
mbmi->num_proj_ref = 1;
}
+static INLINE void init_mbmi_nonrd(MB_MODE_INFO *mbmi,
+ PREDICTION_MODE pred_mode,
+ MV_REFERENCE_FRAME ref_frame0,
+ MV_REFERENCE_FRAME ref_frame1,
+ const AV1_COMMON *cm) {
+ PALETTE_MODE_INFO *const pmi = &mbmi->palette_mode_info;
+ mbmi->ref_mv_idx = 0;
+ mbmi->mode = pred_mode;
+ mbmi->uv_mode = UV_DC_PRED;
+ mbmi->ref_frame[0] = ref_frame0;
+ mbmi->ref_frame[1] = ref_frame1;
+ pmi->palette_size[PLANE_TYPE_Y] = 0;
+ pmi->palette_size[PLANE_TYPE_UV] = 0;
+ mbmi->filter_intra_mode_info.use_filter_intra = 0;
+ mbmi->mv[0].as_int = mbmi->mv[1].as_int = 0;
+ mbmi->motion_mode = SIMPLE_TRANSLATION;
+ mbmi->num_proj_ref = 1;
+ mbmi->interintra_mode = 0;
+ set_default_interp_filters(mbmi, cm->features.interp_filter);
+}
+
+static INLINE void init_estimate_block_intra_args(
+ struct estimate_block_intra_args *args, AV1_COMP *cpi, MACROBLOCK *x) {
+ args->cpi = cpi;
+ args->x = x;
+ args->mode = DC_PRED;
+ args->skippable = 1;
+ args->rdc = 0;
+ args->best_sad = UINT_MAX;
+ args->prune_mode_based_on_sad = false;
+}
+
+static INLINE int get_pred_buffer(PRED_BUFFER *p, int len) {
+ for (int buf_idx = 0; buf_idx < len; buf_idx++) {
+ if (!p[buf_idx].in_use) {
+ p[buf_idx].in_use = 1;
+ return buf_idx;
+ }
+ }
+ return -1;
+}
+
+static INLINE void free_pred_buffer(PRED_BUFFER *p) {
+ if (p != NULL) p->in_use = 0;
+}
+
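
get_pred_buffer and free_pred_buffer implement a small fixed-capacity pool
keyed on an in_use flag: acquire the first free slot, mark it busy, and clear
the flag to return it. A self-contained sketch of the same pattern, with
PRED_BUFFER trimmed to the fields the pool logic touches:

#include <assert.h>
#include <stddef.h>

typedef struct {
  unsigned char *data;
  int stride;
  int in_use;
} PredBuf;  // trimmed stand-in for PRED_BUFFER

static int acquire(PredBuf *p, int len) {
  for (int i = 0; i < len; i++) {
    if (!p[i].in_use) {
      p[i].in_use = 1;  // mark the first free slot busy
      return i;
    }
  }
  return -1;  // pool exhausted
}

static void release(PredBuf *p) {
  if (p != NULL) p->in_use = 0;  // slot becomes reusable
}

int main(void) {
  PredBuf pool[3] = { { NULL, 0, 0 }, { NULL, 0, 0 }, { NULL, 0, 0 } };
  const int idx = acquire(pool, 3);
  assert(idx == 0);
  release(&pool[idx]);
  assert(acquire(pool, 3) == 0);  // freed slot is handed out again
  return 0;
}
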
+#if CONFIG_INTERNAL_STATS
+static INLINE void store_coding_context_nonrd(MACROBLOCK *x,
+ PICK_MODE_CONTEXT *ctx,
+ int mode_index) {
+#else
+static INLINE void store_coding_context_nonrd(MACROBLOCK *x,
+ PICK_MODE_CONTEXT *ctx) {
+#endif // CONFIG_INTERNAL_STATS
+ MACROBLOCKD *const xd = &x->e_mbd;
+ TxfmSearchInfo *txfm_info = &x->txfm_search_info;
+
+ // Take a snapshot of the coding context so it can be
+ // restored if we decide to encode this way
+ ctx->rd_stats.skip_txfm = txfm_info->skip_txfm;
+
+ ctx->skippable = txfm_info->skip_txfm;
+#if CONFIG_INTERNAL_STATS
+ ctx->best_mode_index = mode_index;
+#endif // CONFIG_INTERNAL_STATS
+ ctx->mic = *xd->mi[0];
+ av1_copy_mbmi_ext_to_mbmi_ext_frame(&ctx->mbmi_ext_best, &x->mbmi_ext,
+ av1_ref_frame_type(xd->mi[0]->ref_frame));
+}
+
+void av1_block_yrd(MACROBLOCK *x, RD_STATS *this_rdc, int *skippable,
+ BLOCK_SIZE bsize, TX_SIZE tx_size);
+
+void av1_block_yrd_idtx(MACROBLOCK *x, const uint8_t *const pred_buf,
+ int pred_stride, RD_STATS *this_rdc, int *skippable,
+ BLOCK_SIZE bsize, TX_SIZE tx_size);
+
+int64_t av1_model_rd_for_sb_uv(AV1_COMP *cpi, BLOCK_SIZE plane_bsize,
+ MACROBLOCK *x, MACROBLOCKD *xd,
+ RD_STATS *this_rdc, int start_plane,
+ int stop_plane);
+
+void av1_estimate_block_intra(int plane, int block, int row, int col,
+ BLOCK_SIZE plane_bsize, TX_SIZE tx_size,
+ void *arg);
+
+void av1_estimate_intra_mode(AV1_COMP *cpi, MACROBLOCK *x, BLOCK_SIZE bsize,
+ int best_early_term, unsigned int ref_cost_intra,
+ int reuse_prediction, struct buf_2d *orig_dst,
+ PRED_BUFFER *tmp_buffers,
+ PRED_BUFFER **this_mode_pred, RD_STATS *best_rdc,
+ BEST_PICKMODE *best_pickmode,
+ PICK_MODE_CONTEXT *ctx);
+
#endif // AOM_AV1_ENCODER_NONRD_OPT_H_
diff --git a/av1/encoder/nonrd_pickmode.c b/av1/encoder/nonrd_pickmode.c
index 4bad7a6..24a5264 100644
--- a/av1/encoder/nonrd_pickmode.c
+++ b/av1/encoder/nonrd_pickmode.c
@@ -15,265 +15,17 @@
#include <math.h>
#include <stdio.h>
-#include "config/aom_dsp_rtcd.h"
-#include "config/av1_rtcd.h"
-
-#include "aom_dsp/aom_dsp_common.h"
-#include "aom_dsp/txfm_common.h"
-#include "aom_ports/mem.h"
-
-#include "av1/common/blockd.h"
-#include "av1/common/mvref_common.h"
-#include "av1/common/pred_common.h"
#include "av1/common/reconinter.h"
#include "av1/common/reconintra.h"
#include "av1/encoder/encodemv.h"
-#include "av1/encoder/encoder.h"
#include "av1/encoder/intra_mode_search.h"
#include "av1/encoder/model_rd.h"
#include "av1/encoder/motion_search_facade.h"
#include "av1/encoder/nonrd_opt.h"
-#include "av1/encoder/rdopt.h"
#include "av1/encoder/reconinter_enc.h"
#include "av1/encoder/var_based_part.h"
-#define CALC_BIASED_RDCOST(rdcost) (7 * (rdcost) >> 3)
-extern int g_pick_inter_mode_cnt;
-/*!\cond */
-typedef struct {
- uint8_t *data;
- int stride;
- int in_use;
-} PRED_BUFFER;
-
-typedef struct {
- PRED_BUFFER *best_pred;
- PREDICTION_MODE best_mode;
- TX_SIZE best_tx_size;
- TX_TYPE tx_type;
- MV_REFERENCE_FRAME best_ref_frame;
- MV_REFERENCE_FRAME best_second_ref_frame;
- uint8_t best_mode_skip_txfm;
- uint8_t best_mode_initial_skip_flag;
- int_interpfilters best_pred_filter;
- MOTION_MODE best_motion_mode;
- WarpedMotionParams wm_params;
- int num_proj_ref;
- uint8_t blk_skip[MAX_MIB_SIZE * MAX_MIB_SIZE / 4];
- PALETTE_MODE_INFO pmi;
- int64_t best_sse;
-} BEST_PICKMODE;
-
-typedef struct {
- MV_REFERENCE_FRAME ref_frame;
- PREDICTION_MODE pred_mode;
-} REF_MODE;
-
-typedef struct {
- MV_REFERENCE_FRAME ref_frame[2];
- PREDICTION_MODE pred_mode;
-} COMP_REF_MODE;
-
-typedef struct {
- InterpFilter filter_x;
- InterpFilter filter_y;
-} INTER_FILTER;
-
-/*!\brief Structure to store parameters and statistics used in non-rd inter mode
- * evaluation.
- */
-typedef struct {
- BEST_PICKMODE best_pickmode;
- RD_STATS this_rdc;
- RD_STATS best_rdc;
- int64_t uv_dist[RTC_INTER_MODES][REF_FRAMES];
- struct buf_2d yv12_mb[REF_FRAMES][MAX_MB_PLANE];
- unsigned int vars[RTC_INTER_MODES][REF_FRAMES];
- unsigned int ref_costs_single[REF_FRAMES];
- int_mv frame_mv[MB_MODE_COUNT][REF_FRAMES];
- int_mv frame_mv_best[MB_MODE_COUNT][REF_FRAMES];
- int single_inter_mode_costs[RTC_INTER_MODES][REF_FRAMES];
- int use_ref_frame_mask[REF_FRAMES];
- uint8_t mode_checked[MB_MODE_COUNT][REF_FRAMES];
-} InterModeSearchStateNonrd;
-/*!\endcond */
-
-#define NUM_COMP_INTER_MODES_RT (6)
-#define NUM_INTER_MODES 12
-
-// GLOBALMV in the set below is in fact ZEROMV as we don't do global ME in RT
-// mode
-static const REF_MODE ref_mode_set[NUM_INTER_MODES] = {
- { LAST_FRAME, NEARESTMV }, { LAST_FRAME, NEARMV },
- { LAST_FRAME, GLOBALMV }, { LAST_FRAME, NEWMV },
- { GOLDEN_FRAME, NEARESTMV }, { GOLDEN_FRAME, NEARMV },
- { GOLDEN_FRAME, GLOBALMV }, { GOLDEN_FRAME, NEWMV },
- { ALTREF_FRAME, NEARESTMV }, { ALTREF_FRAME, NEARMV },
- { ALTREF_FRAME, GLOBALMV }, { ALTREF_FRAME, NEWMV },
-};
-
-static const COMP_REF_MODE comp_ref_mode_set[NUM_COMP_INTER_MODES_RT] = {
- { { LAST_FRAME, GOLDEN_FRAME }, GLOBAL_GLOBALMV },
- { { LAST_FRAME, GOLDEN_FRAME }, NEAREST_NEARESTMV },
- { { LAST_FRAME, LAST2_FRAME }, GLOBAL_GLOBALMV },
- { { LAST_FRAME, LAST2_FRAME }, NEAREST_NEARESTMV },
- { { LAST_FRAME, ALTREF_FRAME }, GLOBAL_GLOBALMV },
- { { LAST_FRAME, ALTREF_FRAME }, NEAREST_NEARESTMV },
-};
-
-static const INTER_FILTER filters_ref_set[9] = {
- { EIGHTTAP_REGULAR, EIGHTTAP_REGULAR }, { EIGHTTAP_SMOOTH, EIGHTTAP_SMOOTH },
- { EIGHTTAP_REGULAR, EIGHTTAP_SMOOTH }, { EIGHTTAP_SMOOTH, EIGHTTAP_REGULAR },
- { MULTITAP_SHARP, MULTITAP_SHARP }, { EIGHTTAP_REGULAR, MULTITAP_SHARP },
- { MULTITAP_SHARP, EIGHTTAP_REGULAR }, { EIGHTTAP_SMOOTH, MULTITAP_SHARP },
- { MULTITAP_SHARP, EIGHTTAP_SMOOTH }
-};
-
-enum {
- // INTER_ALL = (1 << NEARESTMV) | (1 << NEARMV) | (1 << NEWMV),
- INTER_NEAREST = (1 << NEARESTMV),
- INTER_NEAREST_NEW = (1 << NEARESTMV) | (1 << NEWMV),
- INTER_NEAREST_NEAR = (1 << NEARESTMV) | (1 << NEARMV),
- INTER_NEAR_NEW = (1 << NEARMV) | (1 << NEWMV),
-};
-
-// The original scan order (default_scan_8x8) is modified according to the extra
-// transpose in hadamard c implementation, i.e., aom_hadamard_lp_8x8_c and
-// aom_hadamard_8x8_c.
-DECLARE_ALIGNED(16, static const int16_t, default_scan_8x8_transpose[64]) = {
- 0, 8, 1, 2, 9, 16, 24, 17, 10, 3, 4, 11, 18, 25, 32, 40,
- 33, 26, 19, 12, 5, 6, 13, 20, 27, 34, 41, 48, 56, 49, 42, 35,
- 28, 21, 14, 7, 15, 22, 29, 36, 43, 50, 57, 58, 51, 44, 37, 30,
- 23, 31, 38, 45, 52, 59, 60, 53, 46, 39, 47, 54, 61, 62, 55, 63
-};
-
-// The original scan order (av1_default_iscan_8x8) is modified to match
-// hadamard AVX2 implementation, i.e., aom_hadamard_lp_8x8_avx2 and
-// aom_hadamard_8x8_avx2. Since hadamard AVX2 implementation will modify the
-// order of coefficients, such that the normal scan order is no longer
-// guaranteed to scan low coefficients first, therefore we modify the scan order
-// accordingly.
-// Note that this one has to be used together with default_scan_8x8_transpose.
-DECLARE_ALIGNED(16, static const int16_t,
- av1_default_iscan_8x8_transpose[64]) = {
- 0, 2, 3, 9, 10, 20, 21, 35, 1, 4, 8, 11, 19, 22, 34, 36,
- 5, 7, 12, 18, 23, 33, 37, 48, 6, 13, 17, 24, 32, 38, 47, 49,
- 14, 16, 25, 31, 39, 46, 50, 57, 15, 26, 30, 40, 45, 51, 56, 58,
- 27, 29, 41, 44, 52, 55, 59, 62, 28, 42, 43, 53, 54, 60, 61, 63
-};
-
-// The original scan order (default_scan_16x16) is modified according to the
-// extra transpose in hadamard c implementation in lp case, i.e.,
-// aom_hadamard_lp_16x16_c.
-DECLARE_ALIGNED(16, static const int16_t,
- default_scan_lp_16x16_transpose[256]) = {
- 0, 8, 2, 4, 10, 16, 24, 18, 12, 6, 64, 14, 20, 26, 32,
- 40, 34, 28, 22, 72, 66, 68, 74, 80, 30, 36, 42, 48, 56, 50,
- 44, 38, 88, 82, 76, 70, 128, 78, 84, 90, 96, 46, 52, 58, 1,
- 9, 3, 60, 54, 104, 98, 92, 86, 136, 130, 132, 138, 144, 94, 100,
- 106, 112, 62, 5, 11, 17, 25, 19, 13, 7, 120, 114, 108, 102, 152,
- 146, 140, 134, 192, 142, 148, 154, 160, 110, 116, 122, 65, 15, 21, 27,
- 33, 41, 35, 29, 23, 73, 67, 124, 118, 168, 162, 156, 150, 200, 194,
- 196, 202, 208, 158, 164, 170, 176, 126, 69, 75, 81, 31, 37, 43, 49,
- 57, 51, 45, 39, 89, 83, 77, 71, 184, 178, 172, 166, 216, 210, 204,
- 198, 206, 212, 218, 224, 174, 180, 186, 129, 79, 85, 91, 97, 47, 53,
- 59, 61, 55, 105, 99, 93, 87, 137, 131, 188, 182, 232, 226, 220, 214,
- 222, 228, 234, 240, 190, 133, 139, 145, 95, 101, 107, 113, 63, 121, 115,
- 109, 103, 153, 147, 141, 135, 248, 242, 236, 230, 238, 244, 250, 193, 143,
- 149, 155, 161, 111, 117, 123, 125, 119, 169, 163, 157, 151, 201, 195, 252,
- 246, 254, 197, 203, 209, 159, 165, 171, 177, 127, 185, 179, 173, 167, 217,
- 211, 205, 199, 207, 213, 219, 225, 175, 181, 187, 189, 183, 233, 227, 221,
- 215, 223, 229, 235, 241, 191, 249, 243, 237, 231, 239, 245, 251, 253, 247,
- 255
-};
-
-#if CONFIG_AV1_HIGHBITDEPTH
-// The original scan order (default_scan_16x16) is modified according to the
-// extra shift in hadamard c implementation in fp case, i.e.,
-// aom_hadamard_16x16_c. Note that 16x16 lp and fp hadamard generate different
-// outputs, so we handle them separately.
-DECLARE_ALIGNED(16, static const int16_t,
- default_scan_fp_16x16_transpose[256]) = {
- 0, 4, 2, 8, 6, 16, 20, 18, 12, 10, 64, 14, 24, 22, 32,
- 36, 34, 28, 26, 68, 66, 72, 70, 80, 30, 40, 38, 48, 52, 50,
- 44, 42, 84, 82, 76, 74, 128, 78, 88, 86, 96, 46, 56, 54, 1,
- 5, 3, 60, 58, 100, 98, 92, 90, 132, 130, 136, 134, 144, 94, 104,
- 102, 112, 62, 9, 7, 17, 21, 19, 13, 11, 116, 114, 108, 106, 148,
- 146, 140, 138, 192, 142, 152, 150, 160, 110, 120, 118, 65, 15, 25, 23,
- 33, 37, 35, 29, 27, 69, 67, 124, 122, 164, 162, 156, 154, 196, 194,
- 200, 198, 208, 158, 168, 166, 176, 126, 73, 71, 81, 31, 41, 39, 49,
- 53, 51, 45, 43, 85, 83, 77, 75, 180, 178, 172, 170, 212, 210, 204,
- 202, 206, 216, 214, 224, 174, 184, 182, 129, 79, 89, 87, 97, 47, 57,
- 55, 61, 59, 101, 99, 93, 91, 133, 131, 188, 186, 228, 226, 220, 218,
- 222, 232, 230, 240, 190, 137, 135, 145, 95, 105, 103, 113, 63, 117, 115,
- 109, 107, 149, 147, 141, 139, 244, 242, 236, 234, 238, 248, 246, 193, 143,
- 153, 151, 161, 111, 121, 119, 125, 123, 165, 163, 157, 155, 197, 195, 252,
- 250, 254, 201, 199, 209, 159, 169, 167, 177, 127, 181, 179, 173, 171, 213,
- 211, 205, 203, 207, 217, 215, 225, 175, 185, 183, 189, 187, 229, 227, 221,
- 219, 223, 233, 231, 241, 191, 245, 243, 237, 235, 239, 249, 247, 253, 251,
- 255
-};
-#endif
-
-// The original scan order (av1_default_iscan_16x16) is modified to match
-// hadamard AVX2 implementation, i.e., aom_hadamard_lp_16x16_avx2.
-// Since hadamard AVX2 implementation will modify the order of coefficients,
-// such that the normal scan order is no longer guaranteed to scan low
-// coefficients first, therefore we modify the scan order accordingly. Note that
-// this one has to be used together with default_scan_lp_16x16_transpose.
-DECLARE_ALIGNED(16, static const int16_t,
- av1_default_iscan_lp_16x16_transpose[256]) = {
- 0, 44, 2, 46, 3, 63, 9, 69, 1, 45, 4, 64, 8, 68, 11,
- 87, 5, 65, 7, 67, 12, 88, 18, 94, 6, 66, 13, 89, 17, 93,
- 24, 116, 14, 90, 16, 92, 25, 117, 31, 123, 15, 91, 26, 118, 30,
- 122, 41, 148, 27, 119, 29, 121, 42, 149, 48, 152, 28, 120, 43, 150,
- 47, 151, 62, 177, 10, 86, 20, 96, 21, 113, 35, 127, 19, 95, 22,
- 114, 34, 126, 37, 144, 23, 115, 33, 125, 38, 145, 52, 156, 32, 124,
- 39, 146, 51, 155, 58, 173, 40, 147, 50, 154, 59, 174, 73, 181, 49,
- 153, 60, 175, 72, 180, 83, 198, 61, 176, 71, 179, 84, 199, 98, 202,
- 70, 178, 85, 200, 97, 201, 112, 219, 36, 143, 54, 158, 55, 170, 77,
- 185, 53, 157, 56, 171, 76, 184, 79, 194, 57, 172, 75, 183, 80, 195,
- 102, 206, 74, 182, 81, 196, 101, 205, 108, 215, 82, 197, 100, 204, 109,
- 216, 131, 223, 99, 203, 110, 217, 130, 222, 140, 232, 111, 218, 129, 221,
- 141, 233, 160, 236, 128, 220, 142, 234, 159, 235, 169, 245, 78, 193, 104,
- 208, 105, 212, 135, 227, 103, 207, 106, 213, 134, 226, 136, 228, 107, 214,
- 133, 225, 137, 229, 164, 240, 132, 224, 138, 230, 163, 239, 165, 241, 139,
- 231, 162, 238, 166, 242, 189, 249, 161, 237, 167, 243, 188, 248, 190, 250,
- 168, 244, 187, 247, 191, 251, 210, 254, 186, 246, 192, 252, 209, 253, 211,
- 255
-};
-
-#if CONFIG_AV1_HIGHBITDEPTH
-// The original scan order (av1_default_iscan_16x16) is modified to match
-// hadamard AVX2 implementation, i.e., aom_hadamard_16x16_avx2.
-// Since hadamard AVX2 implementation will modify the order of coefficients,
-// such that the normal scan order is no longer guaranteed to scan low
-// coefficients first, therefore we modify the scan order accordingly. Note that
-// this one has to be used together with default_scan_fp_16x16_transpose.
-DECLARE_ALIGNED(16, static const int16_t,
- av1_default_iscan_fp_16x16_transpose[256]) = {
- 0, 44, 2, 46, 1, 45, 4, 64, 3, 63, 9, 69, 8, 68, 11,
- 87, 5, 65, 7, 67, 6, 66, 13, 89, 12, 88, 18, 94, 17, 93,
- 24, 116, 14, 90, 16, 92, 15, 91, 26, 118, 25, 117, 31, 123, 30,
- 122, 41, 148, 27, 119, 29, 121, 28, 120, 43, 150, 42, 149, 48, 152,
- 47, 151, 62, 177, 10, 86, 20, 96, 19, 95, 22, 114, 21, 113, 35,
- 127, 34, 126, 37, 144, 23, 115, 33, 125, 32, 124, 39, 146, 38, 145,
- 52, 156, 51, 155, 58, 173, 40, 147, 50, 154, 49, 153, 60, 175, 59,
- 174, 73, 181, 72, 180, 83, 198, 61, 176, 71, 179, 70, 178, 85, 200,
- 84, 199, 98, 202, 97, 201, 112, 219, 36, 143, 54, 158, 53, 157, 56,
- 171, 55, 170, 77, 185, 76, 184, 79, 194, 57, 172, 75, 183, 74, 182,
- 81, 196, 80, 195, 102, 206, 101, 205, 108, 215, 82, 197, 100, 204, 99,
- 203, 110, 217, 109, 216, 131, 223, 130, 222, 140, 232, 111, 218, 129, 221,
- 128, 220, 142, 234, 141, 233, 160, 236, 159, 235, 169, 245, 78, 193, 104,
- 208, 103, 207, 106, 213, 105, 212, 135, 227, 134, 226, 136, 228, 107, 214,
- 133, 225, 132, 224, 138, 230, 137, 229, 164, 240, 163, 239, 165, 241, 139,
- 231, 162, 238, 161, 237, 167, 243, 166, 242, 189, 249, 188, 248, 190, 250,
- 168, 244, 187, 247, 186, 246, 192, 252, 191, 251, 210, 254, 209, 253, 211,
- 255
-};
-#endif
-
static INLINE int early_term_inter_search_with_sse(int early_term_idx,
BLOCK_SIZE bsize,
int64_t this_sse,
@@ -317,17 +69,44 @@
bp->best_pred = NULL;
bp->best_motion_mode = SIMPLE_TRANSLATION;
bp->num_proj_ref = 0;
- memset(&bp->wm_params, 0, sizeof(bp->wm_params));
- memset(&bp->blk_skip, 0, sizeof(bp->blk_skip));
- memset(&bp->pmi, 0, sizeof(bp->pmi));
+ av1_zero(bp->wm_params);
+ av1_zero(bp->pmi);
+}
+
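
av1_zero replaces the hand-written memsets; it derives both the address and
the size from the object itself, so the two can never drift apart the way a
raw memset's arguments can. A sketch assuming the usual shape of such a
helper (zero_object is a hypothetical stand-in):

#include <string.h>

// Assumed shape of the helper: both the address and the size come from
// `dest`, so the memset cannot be paired with the wrong object.
#define zero_object(dest) memset(&(dest), 0, sizeof(dest))

struct Params {
  int refs;
  double scale;
};

int main(void) {
  struct Params p;
  zero_object(p);  // equivalent to memset(&p, 0, sizeof(p))
  return p.refs;   // 0
}
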
+// Copy best inter mode parameters to best_pickmode
+static INLINE void update_search_state_nonrd(
+ InterModeSearchStateNonrd *search_state, MB_MODE_INFO *const mi,
+ TxfmSearchInfo *txfm_info, RD_STATS *nonskip_rdc, PICK_MODE_CONTEXT *ctx,
+ PREDICTION_MODE this_best_mode, const int64_t sse_y) {
+ BEST_PICKMODE *const best_pickmode = &search_state->best_pickmode;
+
+ best_pickmode->best_sse = sse_y;
+ best_pickmode->best_mode = this_best_mode;
+ best_pickmode->best_motion_mode = mi->motion_mode;
+ best_pickmode->wm_params = mi->wm_params;
+ best_pickmode->num_proj_ref = mi->num_proj_ref;
+ best_pickmode->best_pred_filter = mi->interp_filters;
+ best_pickmode->best_tx_size = mi->tx_size;
+ best_pickmode->best_ref_frame = mi->ref_frame[0];
+ best_pickmode->best_second_ref_frame = mi->ref_frame[1];
+ best_pickmode->best_mode_skip_txfm = search_state->this_rdc.skip_txfm;
+ best_pickmode->best_mode_initial_skip_flag =
+ (nonskip_rdc->rate == INT_MAX && search_state->this_rdc.skip_txfm);
+ if (!best_pickmode->best_mode_skip_txfm) {
+ memcpy(ctx->blk_skip, txfm_info->blk_skip,
+ sizeof(txfm_info->blk_skip[0]) * ctx->num_4x4_blk);
+ }
}
static INLINE int subpel_select(AV1_COMP *cpi, MACROBLOCK *x, BLOCK_SIZE bsize,
int_mv *mv, MV ref_mv, FULLPEL_MV start_mv,
bool fullpel_performed_well) {
const int frame_lowmotion = cpi->rc.avg_frame_low_motion;
+ const int reduce_mv_pel_precision_highmotion =
+ cpi->sf.rt_sf.reduce_mv_pel_precision_highmotion;
+
// Reduce MV precision for higher int MV value & frame-level motion
- if (cpi->sf.rt_sf.reduce_mv_pel_precision_highmotion >= 3) {
+ if (reduce_mv_pel_precision_highmotion >= 3) {
int mv_thresh = 4;
const int is_low_resoln =
(cpi->common.width * cpi->common.height <= 320 * 240);
@@ -337,10 +116,10 @@
if (abs(mv->as_fullmv.row) >= mv_thresh ||
abs(mv->as_fullmv.col) >= mv_thresh)
return HALF_PEL;
- } else if (cpi->sf.rt_sf.reduce_mv_pel_precision_highmotion >= 1) {
+ } else if (reduce_mv_pel_precision_highmotion >= 1) {
int mv_thresh;
const int th_vals[2][3] = { { 4, 8, 10 }, { 4, 6, 8 } };
- const int th_idx = cpi->sf.rt_sf.reduce_mv_pel_precision_highmotion - 1;
+ const int th_idx = reduce_mv_pel_precision_highmotion - 1;
assert(th_idx >= 0 && th_idx < 2);
if (frame_lowmotion > 0 && frame_lowmotion < 40)
mv_thresh = 12;
@@ -375,9 +154,9 @@
return cpi->sf.mv_sf.subpel_force_stop;
}
-static bool use_aggressive_subpel_search_method(
- MACROBLOCK *x, bool use_adaptive_subpel_search,
- const bool fullpel_performed_well) {
+static bool use_aggressive_subpel_search_method(MACROBLOCK *x,
+ bool use_adaptive_subpel_search,
+ bool fullpel_performed_well) {
if (!use_adaptive_subpel_search) return false;
const int qband = x->qindex >> (QINDEX_BITS - 2);
assert(qband < 4);
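
The qband computation folds the full quantizer range into four coarse bands:
with QINDEX_BITS == 8 (qindex spans [0, 255]), the shift by QINDEX_BITS - 2
maps 0-63 to band 0, 64-127 to band 1, and so on. A one-line check of the
banding:

#include <assert.h>

int main(void) {
  const int kQindexBits = 8;  // AV1 qindex spans [0, 255]
  assert((0 >> (kQindexBits - 2)) == 0);
  assert((63 >> (kQindexBits - 2)) == 0);
  assert((64 >> (kQindexBits - 2)) == 1);
  assert((255 >> (kQindexBits - 2)) == 3);  // four bands in total
  return 0;
}
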
@@ -437,11 +216,12 @@
av1_get_scaled_ref_frame(cpi, ref);
if (scaled_ref_frame) {
- int i;
+ int plane;
// Swap out the reference frame for a version that's been scaled to
// match the resolution of the current frame, allowing the existing
// motion search code to be used without additional modifications.
- for (i = 0; i < MAX_MB_PLANE; i++) backup_yv12[i] = xd->plane[i].pre[0];
+ for (plane = 0; plane < MAX_MB_PLANE; plane++)
+ backup_yv12[plane] = xd->plane[plane].pre[0];
av1_setup_pre_planes(xd, 0, scaled_ref_frame, mi_row, mi_col, NULL,
num_planes);
}
@@ -458,7 +238,7 @@
av1_get_search_site_config(cpi, x, search_method);
FULLPEL_MOTION_SEARCH_PARAMS full_ms_params;
av1_make_default_fullpel_ms_params(&full_ms_params, cpi, x, bsize, ¢er_mv,
- src_search_sites,
+ start_mv, src_search_sites,
/*fine_search_interval=*/0);
const unsigned int full_var_rd = av1_full_pixel_search(
@@ -505,8 +285,8 @@
}
if (scaled_ref_frame) {
- int i;
- for (i = 0; i < MAX_MB_PLANE; i++) xd->plane[i].pre[0] = backup_yv12[i];
+ for (int plane = 0; plane < MAX_MB_PLANE; plane++)
+ xd->plane[plane].pre[0] = backup_yv12[plane];
}
// The final MV can not be equal to the reference MV as this will trigger an
// assert later. This can happen if both NEAREST and NEAR modes were skipped.
@@ -550,6 +330,7 @@
MACROBLOCKD *const xd = &x->e_mbd;
MB_MODE_INFO *const mi = xd->mi[0];
AV1_COMMON *cm = &cpi->common;
+ int_mv *this_ref_frm_newmv = &frame_mv[NEWMV][ref_frame];
if (ref_frame > LAST_FRAME && cpi->oxcf.rc_cfg.mode == AOM_CBR &&
gf_temporal_ref) {
int tmp_sad;
@@ -563,13 +344,13 @@
if (tmp_sad > x->pred_mv_sad[LAST_FRAME]) return -1;
- frame_mv[NEWMV][ref_frame].as_int = mi->mv[0].as_int;
+ this_ref_frm_newmv->as_int = mi->mv[0].as_int;
int_mv best_mv = mi->mv[0];
best_mv.as_mv.row >>= 3;
best_mv.as_mv.col >>= 3;
MV ref_mv = av1_get_ref_mv(x, 0).as_mv;
- frame_mv[NEWMV][ref_frame].as_mv.row >>= 3;
- frame_mv[NEWMV][ref_frame].as_mv.col >>= 3;
+ this_ref_frm_newmv->as_mv.row >>= 3;
+ this_ref_frm_newmv->as_mv.col >>= 3;
SUBPEL_MOTION_SEARCH_PARAMS ms_params;
av1_make_default_subpel_ms_params(&ms_params, cpi, x, bsize, &ref_mv, NULL);
@@ -584,17 +365,17 @@
cpi->mv_search_params.find_fractional_mv_step(
xd, cm, &ms_params, start_mv, &best_mv.as_mv, &dis,
&x->pred_sse[ref_frame], NULL);
- frame_mv[NEWMV][ref_frame].as_int = best_mv.as_int;
+ this_ref_frm_newmv->as_int = best_mv.as_int;
// When NEWMV is same as ref_mv from the drl, it is preferred to code the
// MV as NEARESTMV or NEARMV. In this case, NEWMV needs to be skipped to
// avoid an assert failure at a later stage. The scenario can occur if
// NEARESTMV was not evaluated for ALTREF.
- if (frame_mv[NEWMV][ref_frame].as_mv.col == ref_mv.col &&
- frame_mv[NEWMV][ref_frame].as_mv.row == ref_mv.row)
+ if (this_ref_frm_newmv->as_mv.col == ref_mv.col &&
+ this_ref_frm_newmv->as_mv.row == ref_mv.row)
return -1;
- *rate_mv = av1_mv_bit_cost(&frame_mv[NEWMV][ref_frame].as_mv, &ref_mv,
+ *rate_mv = av1_mv_bit_cost(&this_ref_frm_newmv->as_mv, &ref_mv,
x->mv_costs->nmv_joint_cost,
x->mv_costs->mv_cost_stack, MV_COST_WEIGHT);
} else if (!combined_motion_search(cpi, x, bsize, mi_row, mi_col,
@@ -643,7 +424,7 @@
if (x->txfm_search_params.tx_mode_search_type == TX_MODE_SELECT &&
cpi->sf.rt_sf.tx_size_level_based_on_qstep &&
cpi->sf.rt_sf.tx_size_level_based_on_qstep >= 2) {
- const int qstep = x->plane[0].dequant_QTX[1] >> (x->e_mbd.bd - 5);
+ const int qstep = x->plane[AOM_PLANE_Y].dequant_QTX[1] >> (x->e_mbd.bd - 5);
const unsigned int qstep_sq = qstep * qstep;
// If the sse is low for low source variance blocks, mark those as
// transform skip.
@@ -651,7 +432,8 @@
// low so that reliable early estimate of tx skip can be obtained
// through its comparison with sse.
if (sse < qstep_sq && x->source_variance < qstep_sq &&
- x->color_sensitivity[0] == 0 && x->color_sensitivity[1] == 0)
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_U)] == 0 &&
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_V)] == 0)
*force_skip = 1;
}
}
@@ -676,7 +458,7 @@
const int mult[4] = { 8, 7, 6, 5 };
assert(qband < 4);
multiplier = mult[qband];
- const int qstep = x->plane[0].dequant_QTX[1] >> (xd->bd - 5);
+ const int qstep = x->plane[AOM_PLANE_Y].dequant_QTX[1] >> (xd->bd - 5);
const unsigned int qstep_sq = qstep * qstep;
var_thresh = qstep_sq * 2;
if (cpi->sf.rt_sf.tx_size_level_based_on_qstep >= 2) {
@@ -686,7 +468,8 @@
// low so that reliable early estimate of tx skip can be obtained
// through its comparison with sse.
if (sse < qstep_sq && x->source_variance < qstep_sq &&
- x->color_sensitivity[0] == 0 && x->color_sensitivity[1] == 0)
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_U)] == 0 &&
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_V)] == 0)
*force_skip = 1;
// Further lower transform size based on aq mode only if residual
// variance is high.
@@ -719,13 +502,6 @@
return AOMMIN(tx_size, TX_16X16);
}
-static const uint8_t b_width_log2_lookup[BLOCK_SIZES] = { 0, 0, 1, 1, 1, 2,
- 2, 2, 3, 3, 3, 4,
- 4, 4, 5, 5 };
-static const uint8_t b_height_log2_lookup[BLOCK_SIZES] = { 0, 1, 0, 1, 2, 1,
- 2, 3, 2, 3, 4, 3,
- 4, 5, 4, 5 };
-
static void block_variance(const uint8_t *src, int src_stride,
const uint8_t *ref, int ref_stride, int w, int h,
unsigned int *sse, int *sum, int block_size,
@@ -740,11 +516,12 @@
// 32 samples respectively.
assert(w >= 32);
assert(h >= 8);
- for (int i = 0; i < h; i += block_size) {
- for (int j = 0; j < w; j += 32) {
- aom_get_var_sse_sum_8x8_quad(
- src + src_stride * i + j, src_stride, ref + ref_stride * i + j,
- ref_stride, &sse8x8[k], &sum8x8[k], sse, sum, &var8x8[k]);
+ for (int row = 0; row < h; row += block_size) {
+ for (int col = 0; col < w; col += 32) {
+ aom_get_var_sse_sum_8x8_quad(src + src_stride * row + col, src_stride,
+ ref + ref_stride * row + col, ref_stride,
+ &sse8x8[k], &sum8x8[k], sse, sum,
+ &var8x8[k]);
k += 4;
}
}
@@ -764,10 +541,10 @@
// least 16 and 32 samples respectively.
assert(w >= 32);
assert(h >= 16);
- for (int i = 0; i < h; i += block_size) {
- for (int j = 0; j < w; j += 32) {
- aom_get_var_sse_sum_16x16_dual(src + src_stride * i + j, src_stride,
- ref + ref_stride * i + j, ref_stride,
+ for (int row = 0; row < h; row += block_size) {
+ for (int col = 0; col < w; col += 32) {
+ aom_get_var_sse_sum_16x16_dual(src + src_stride * row + col, src_stride,
+ ref + ref_stride * row + col, ref_stride,
&sse16x16[k], sse, sum, &var16x16[k]);
k += 2;
}
@@ -781,14 +558,14 @@
const BLOCK_SIZE unit_size = txsize_to_bsize[tx_size];
const int nw = 1 << (bw - b_width_log2_lookup[unit_size]);
const int nh = 1 << (bh - b_height_log2_lookup[unit_size]);
- int i, j, k = 0;
+ int row, col, k = 0;
- for (i = 0; i < nh; i += 2) {
- for (j = 0; j < nw; j += 2) {
- sse_o[k] = sse_i[i * nw + j] + sse_i[i * nw + j + 1] +
- sse_i[(i + 1) * nw + j] + sse_i[(i + 1) * nw + j + 1];
- sum_o[k] = sum_i[i * nw + j] + sum_i[i * nw + j + 1] +
- sum_i[(i + 1) * nw + j] + sum_i[(i + 1) * nw + j + 1];
+ for (row = 0; row < nh; row += 2) {
+ for (col = 0; col < nw; col += 2) {
+ sse_o[k] = sse_i[row * nw + col] + sse_i[row * nw + col + 1] +
+ sse_i[(row + 1) * nw + col] + sse_i[(row + 1) * nw + col + 1];
+ sum_o[k] = sum_i[row * nw + col] + sum_i[row * nw + col + 1] +
+ sum_i[(row + 1) * nw + col] + sum_i[(row + 1) * nw + col + 1];
var_o[k] = sse_o[k] - (uint32_t)(((int64_t)sum_o[k] * sum_o[k]) >>
(b_width_log2_lookup[unit_size] +
b_height_log2_lookup[unit_size] + 6));
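
The shift above is sum^2/N in disguise: each merged output covers a 2x2 group
of unit blocks of (4 << b_width_log2) x (4 << b_height_log2) pixels, so
log2(N) = b_width_log2 + b_height_log2 + 6. A toy check that the shift
matches the exact division (sum * sum is never negative):

#include <assert.h>
#include <stdint.h>

int main(void) {
  // 8x8 unit blocks: b_width_log2 = b_height_log2 = 1, merged 2x2,
  // so N = 1 << (1 + 1 + 6) = 256 pixels.
  const int log2_n = 1 + 1 + 6;
  const int64_t sum = -1234, sse = 90000;
  const uint32_t var_shift = (uint32_t)(sse - ((sum * sum) >> log2_n));
  const uint32_t var_div =
      (uint32_t)(sse - (sum * sum) / ((int64_t)1 << log2_n));
  assert(var_shift == var_div);  // shift equals exact division: sum*sum >= 0
  return 0;
}
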
@@ -798,8 +575,7 @@
}
// Adjust the ac_thr according to speed, width, height and normalized sum
-static int ac_thr_factor(const int speed, const int width, const int height,
- const int norm_sum) {
+static int ac_thr_factor(int speed, int width, int height, int norm_sum) {
if (speed >= 8 && norm_sum < 5) {
if (width <= 640 && height <= 480)
return 4;
@@ -815,7 +591,7 @@
int mi_col, int *early_term, int num_blk, const unsigned int *sse_tx,
const unsigned int *var_tx, int sum, unsigned int var, unsigned int sse) {
AV1_COMMON *const cm = &cpi->common;
- struct macroblock_plane *const p = &x->plane[0];
+ struct macroblock_plane *const p = &x->plane[AOM_PLANE_Y];
const uint32_t dc_quant = p->dequant_QTX[0];
const uint32_t ac_quant = p->dequant_QTX[1];
const int64_t dc_thr = dc_quant * dc_quant >> 6;
@@ -857,13 +633,13 @@
unsigned int var_uv[2];
unsigned int sse_uv[2];
// Transform skipping test in UV planes.
- for (int i = 1; i <= 2; i++) {
- int j = i - 1;
+ for (int plane = AOM_PLANE_U; plane <= AOM_PLANE_V; plane++) {
+ int j = plane - 1;
skip_uv[j] = 1;
- if (x->color_sensitivity[j]) {
+ if (x->color_sensitivity[COLOR_SENS_IDX(plane)]) {
skip_uv[j] = 0;
- struct macroblock_plane *const puv = &x->plane[i];
- struct macroblockd_plane *const puvd = &xd->plane[i];
+ struct macroblock_plane *const puv = &x->plane[plane];
+ struct macroblockd_plane *const puvd = &xd->plane[plane];
const BLOCK_SIZE uv_bsize = get_plane_block_size(
bsize, puvd->subsampling_x, puvd->subsampling_y);
// Adjust these thresholds for UV.
@@ -871,8 +647,8 @@
(puv->dequant_QTX[0] * puv->dequant_QTX[0]) >> 3;
const int64_t uv_ac_thr =
(puv->dequant_QTX[1] * puv->dequant_QTX[1]) >> 3;
- av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, NULL, bsize, i,
- i);
+ av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, NULL, bsize,
+ plane, plane);
var_uv[j] = cpi->ppi->fn_ptr[uv_bsize].vf(puv->src.buf, puv->src.stride,
puvd->dst.buf,
puvd->dst.stride, &sse_uv[j]);
@@ -921,8 +697,8 @@
// Hence quantizer step is also 8 times. To get effective quantizer
// we need to divide by 8 before sending to modeling function.
unsigned int sse;
- struct macroblock_plane *const p = &x->plane[0];
- struct macroblockd_plane *const pd = &xd->plane[0];
+ struct macroblock_plane *const p = &x->plane[AOM_PLANE_Y];
+ struct macroblockd_plane *const pd = &xd->plane[AOM_PLANE_Y];
int test_skip = 1;
unsigned int var;
int sum;
@@ -1007,8 +783,8 @@
// Hence quantizer step is also 8 times. To get effective quantizer
// we need to divide by 8 before sending to modeling function.
unsigned int sse;
- struct macroblock_plane *const p = &x->plane[0];
- struct macroblockd_plane *const pd = &xd->plane[0];
+ struct macroblock_plane *const p = &x->plane[AOM_PLANE_Y];
+ struct macroblockd_plane *const pd = &xd->plane[AOM_PLANE_Y];
int test_skip = 1;
unsigned int var;
int sum;
@@ -1093,8 +869,8 @@
assert(bsize < BLOCK_SIZES_ALL);
- struct macroblock_plane *const p = &x->plane[0];
- struct macroblockd_plane *const pd = &xd->plane[0];
+ struct macroblock_plane *const p = &x->plane[AOM_PLANE_Y];
+ struct macroblockd_plane *const pd = &xd->plane[AOM_PLANE_Y];
unsigned int sse;
int rate;
int64_t dist;
@@ -1113,7 +889,7 @@
model_rd_with_curvfit(cpi, x, bsize, AOM_PLANE_Y, sse, bwide * bhigh, &rate,
&dist);
} else {
- rate = INT_MAX; // this will be overwritten later with block_yrd
+ rate = INT_MAX; // this will be overwritten later with av1_block_yrd
dist = INT_MAX;
}
rd_stats->sse = sse;
@@ -1132,496 +908,7 @@
rd_stats->dist = dist;
}
-static INLINE void aom_process_hadamard_lp_8x16(MACROBLOCK *x,
- int max_blocks_high,
- int max_blocks_wide,
- int num_4x4_w, int step,
- int block_step) {
- struct macroblock_plane *const p = &x->plane[0];
- const int bw = 4 * num_4x4_w;
- const int num_4x4 = AOMMIN(num_4x4_w, max_blocks_wide);
- int block = 0;
-
- for (int r = 0; r < max_blocks_high; r += block_step) {
- for (int c = 0; c < num_4x4; c += 2 * block_step) {
- const int16_t *src_diff = &p->src_diff[(r * bw + c) << 2];
- int16_t *low_coeff = (int16_t *)p->coeff + BLOCK_OFFSET(block);
- aom_hadamard_lp_8x8_dual(src_diff, (ptrdiff_t)bw, low_coeff);
- block += 2 * step;
- }
- }
-}
-
-#define DECLARE_BLOCK_YRD_BUFFERS() \
- DECLARE_ALIGNED(64, tran_low_t, dqcoeff_buf[16 * 16]); \
- DECLARE_ALIGNED(64, tran_low_t, qcoeff_buf[16 * 16]); \
- DECLARE_ALIGNED(64, tran_low_t, coeff_buf[16 * 16]); \
- uint16_t eob[1];
-
-#define DECLARE_BLOCK_YRD_VARS() \
- /* When is_tx_8x8_dual_applicable is true, we compute the txfm for the \
- * entire bsize and write macroblock_plane::coeff. So low_coeff is kept \
- * as a non-const so we can reassign it to macroblock_plane::coeff. */ \
- int16_t *low_coeff = (int16_t *)coeff_buf; \
- int16_t *const low_qcoeff = (int16_t *)qcoeff_buf; \
- int16_t *const low_dqcoeff = (int16_t *)dqcoeff_buf; \
- const SCAN_ORDER *const scan_order = &av1_scan_orders[tx_size][DCT_DCT]; \
- const int diff_stride = bw;
-
-#define DECLARE_LOOP_VARS_BLOCK_YRD() \
- const int16_t *src_diff = &p->src_diff[(r * diff_stride + c) << 2];
-
-#if CONFIG_AV1_HIGHBITDEPTH
-#define DECLARE_BLOCK_YRD_HBD_VARS() \
- tran_low_t *const coeff = coeff_buf; \
- tran_low_t *const qcoeff = qcoeff_buf; \
- tran_low_t *const dqcoeff = dqcoeff_buf;
-
-static AOM_FORCE_INLINE void update_yrd_loop_vars_hbd(
- MACROBLOCK *x, int *skippable, const int step, const int ncoeffs,
- tran_low_t *const coeff, tran_low_t *const qcoeff,
- tran_low_t *const dqcoeff, RD_STATS *this_rdc, int *eob_cost,
- const int tx_blk_id) {
- const int is_txfm_skip = (ncoeffs == 0);
- *skippable &= is_txfm_skip;
- x->txfm_search_info.blk_skip[tx_blk_id] = is_txfm_skip;
- *eob_cost += get_msb(ncoeffs + 1);
-
- int64_t dummy;
- if (ncoeffs == 1)
- this_rdc->rate += (int)abs(qcoeff[0]);
- else if (ncoeffs > 1)
- this_rdc->rate += aom_satd(qcoeff, step << 4);
-
- this_rdc->dist += av1_block_error(coeff, dqcoeff, step << 4, &dummy) >> 2;
-}
-#endif
-static AOM_FORCE_INLINE void update_yrd_loop_vars(
- MACROBLOCK *x, int *skippable, const int step, const int ncoeffs,
- int16_t *const low_coeff, int16_t *const low_qcoeff,
- int16_t *const low_dqcoeff, RD_STATS *this_rdc, int *eob_cost,
- const int tx_blk_id) {
- const int is_txfm_skip = (ncoeffs == 0);
- *skippable &= is_txfm_skip;
- x->txfm_search_info.blk_skip[tx_blk_id] = is_txfm_skip;
- *eob_cost += get_msb(ncoeffs + 1);
- if (ncoeffs == 1)
- this_rdc->rate += (int)abs(low_qcoeff[0]);
- else if (ncoeffs > 1)
- this_rdc->rate += aom_satd_lp(low_qcoeff, step << 4);
-
- this_rdc->dist += av1_block_error_lp(low_coeff, low_dqcoeff, step << 4) >> 2;
-}
-
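
update_yrd_loop_vars accumulates cheap RD proxies: rate from the sum of
absolute quantized coefficients (what aom_satd_lp computes for the 16-bit
path) and distortion from the squared coeff/dqcoeff error scaled down by 4.
A standalone sketch of the two proxies, with hypothetical names:

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

// Rate proxy: sum of absolute quantized coefficients.
static int satd_rate_proxy(const int16_t *qcoeff, int n) {
  int rate = 0;
  for (int i = 0; i < n; ++i) rate += abs(qcoeff[i]);
  return rate;
}

// Distortion proxy: squared error between transform coefficients and
// their dequantized reconstruction, with the same ">> 2" scaling used in
// update_yrd_loop_vars.
static int64_t block_error_proxy(const int16_t *coeff, const int16_t *dqcoeff,
                                 int n) {
  int64_t err = 0;
  for (int i = 0; i < n; ++i) {
    const int d = coeff[i] - dqcoeff[i];
    err += (int64_t)d * d;
  }
  return err >> 2;
}

int main(void) {
  const int16_t coeff[4] = { 40, -12, 3, 0 };
  const int16_t qcoeff[4] = { 5, -1, 0, 0 };
  const int16_t dqcoeff[4] = { 40, -8, 0, 0 };
  assert(satd_rate_proxy(qcoeff, 4) == 6);
  assert(block_error_proxy(coeff, dqcoeff, 4) == (16 + 9) >> 2);
  return 0;
}
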
-/*!\brief Calculates RD Cost using Hadamard transform.
- *
- * \ingroup nonrd_mode_search
- * \callgraph
- * \callergraph
- * Calculates RD Cost using Hadamard transform. For low bit depth this function
- * uses a low-precision (16-bit) set of functions; 32-bit is used for high bit
- * depth.
- * \param[in] x Pointer to structure holding all the data for
- the current macroblock
- * \param[in] this_rdc Pointer to calculated RD Cost
- * \param[in] skippable Pointer to a flag indicating possible tx skip
- * \param[in] bsize Current block size
- * \param[in] tx_size Transform size
- * \param[in] is_inter_mode Flag to indicate inter mode
- *
- * \remark Nothing is returned. Instead, calculated RD cost is placed to
- * \c this_rdc. \c skippable flag is set if there is no non-zero quantized
- * coefficients for Hadamard transform
- */
-static void block_yrd(MACROBLOCK *x, RD_STATS *this_rdc, int *skippable,
- const BLOCK_SIZE bsize, const TX_SIZE tx_size,
- const int is_inter_mode) {
- MACROBLOCKD *xd = &x->e_mbd;
- const struct macroblockd_plane *pd = &xd->plane[0];
- struct macroblock_plane *const p = &x->plane[0];
- assert(bsize < BLOCK_SIZES_ALL);
- const int num_4x4_w = mi_size_wide[bsize];
- const int num_4x4_h = mi_size_high[bsize];
- const int step = 1 << (tx_size << 1);
- const int block_step = (1 << tx_size);
- const int row_step = step * num_4x4_w >> tx_size;
- int block = 0;
- const int max_blocks_wide =
- num_4x4_w + (xd->mb_to_right_edge >= 0 ? 0 : xd->mb_to_right_edge >> 5);
- const int max_blocks_high =
- num_4x4_h + (xd->mb_to_bottom_edge >= 0 ? 0 : xd->mb_to_bottom_edge >> 5);
- int eob_cost = 0;
- const int bw = 4 * num_4x4_w;
- const int bh = 4 * num_4x4_h;
- const int use_hbd = is_cur_buf_hbd(xd);
- int num_blk_skip_w = num_4x4_w;
- int sh_blk_skip = 0;
- if (is_inter_mode) {
- num_blk_skip_w = num_4x4_w >> 1;
- sh_blk_skip = 1;
- }
-
-#if CONFIG_AV1_HIGHBITDEPTH
- if (use_hbd) {
- aom_highbd_subtract_block(bh, bw, p->src_diff, bw, p->src.buf,
- p->src.stride, pd->dst.buf, pd->dst.stride);
- } else {
- aom_subtract_block(bh, bw, p->src_diff, bw, p->src.buf, p->src.stride,
- pd->dst.buf, pd->dst.stride);
- }
-#else
- aom_subtract_block(bh, bw, p->src_diff, bw, p->src.buf, p->src.stride,
- pd->dst.buf, pd->dst.stride);
-#endif
-
- // Keep the intermediate value on the stack here. Writing directly to
- // skippable causes speed regression due to load-and-store issues in
- // update_yrd_loop_vars.
- int temp_skippable = 1;
- this_rdc->dist = 0;
- this_rdc->rate = 0;
- // For block sizes 8x16 or above, Hadamard txfm of two adjacent 8x8 blocks
- // can be done per function call. Hence the call of Hadamard txfm is
- // abstracted here for the specified cases.
- int is_tx_8x8_dual_applicable =
- (tx_size == TX_8X8 && block_size_wide[bsize] >= 16 &&
- block_size_high[bsize] >= 8);
-
-#if CONFIG_AV1_HIGHBITDEPTH
- // As of now, dual implementation of hadamard txfm is available for low
- // bitdepth.
- if (use_hbd) is_tx_8x8_dual_applicable = 0;
-#endif
-
- if (is_tx_8x8_dual_applicable) {
- aom_process_hadamard_lp_8x16(x, max_blocks_high, max_blocks_wide, num_4x4_w,
- step, block_step);
- }
-
- DECLARE_BLOCK_YRD_BUFFERS()
- DECLARE_BLOCK_YRD_VARS()
-#if CONFIG_AV1_HIGHBITDEPTH
- DECLARE_BLOCK_YRD_HBD_VARS()
-#else
- (void)use_hbd;
-#endif
-
- // Keep track of the row and column of the blocks we use so that we know
- // if we are in the unrestricted motion border.
- for (int r = 0; r < max_blocks_high; r += block_step) {
- for (int c = 0, s = 0; c < max_blocks_wide; c += block_step, s += step) {
- DECLARE_LOOP_VARS_BLOCK_YRD()
-
- switch (tx_size) {
-#if CONFIG_AV1_HIGHBITDEPTH
- case TX_16X16:
- if (use_hbd) {
- aom_hadamard_16x16(src_diff, diff_stride, coeff);
- av1_quantize_fp(coeff, 16 * 16, p->zbin_QTX, p->round_fp_QTX,
- p->quant_fp_QTX, p->quant_shift_QTX, qcoeff,
- dqcoeff, p->dequant_QTX, eob,
- // default_scan_fp_16x16_transpose and
- // av1_default_iscan_fp_16x16_transpose have to be
- // used together.
- default_scan_fp_16x16_transpose,
- av1_default_iscan_fp_16x16_transpose);
- } else {
- aom_hadamard_lp_16x16(src_diff, diff_stride, low_coeff);
- av1_quantize_lp(low_coeff, 16 * 16, p->round_fp_QTX,
- p->quant_fp_QTX, low_qcoeff, low_dqcoeff,
- p->dequant_QTX, eob,
- // default_scan_lp_16x16_transpose and
- // av1_default_iscan_lp_16x16_transpose have to be
- // used together.
- default_scan_lp_16x16_transpose,
- av1_default_iscan_lp_16x16_transpose);
- }
- break;
- case TX_8X8:
- if (use_hbd) {
- aom_hadamard_8x8(src_diff, diff_stride, coeff);
- av1_quantize_fp(
- coeff, 8 * 8, p->zbin_QTX, p->round_fp_QTX, p->quant_fp_QTX,
- p->quant_shift_QTX, qcoeff, dqcoeff, p->dequant_QTX, eob,
- default_scan_8x8_transpose, av1_default_iscan_8x8_transpose);
- } else {
- if (is_tx_8x8_dual_applicable) {
- // The coeffs are pre-computed for the whole block, so re-assign
- // low_coeff to the appropriate location.
- const int block_offset = BLOCK_OFFSET(block + s);
- low_coeff = (int16_t *)p->coeff + block_offset;
- } else {
- aom_hadamard_lp_8x8(src_diff, diff_stride, low_coeff);
- }
- av1_quantize_lp(
- low_coeff, 8 * 8, p->round_fp_QTX, p->quant_fp_QTX, low_qcoeff,
- low_dqcoeff, p->dequant_QTX, eob,
- // default_scan_8x8_transpose and
- // av1_default_iscan_8x8_transpose have to be used together.
- default_scan_8x8_transpose, av1_default_iscan_8x8_transpose);
- }
- break;
- default:
- assert(tx_size == TX_4X4);
- // In tx_size=4x4 case, aom_fdct4x4 and aom_fdct4x4_lp generate
- // normal coefficients order, so we don't need to change the scan
- // order here.
- if (use_hbd) {
- aom_fdct4x4(src_diff, coeff, diff_stride);
- av1_quantize_fp(coeff, 4 * 4, p->zbin_QTX, p->round_fp_QTX,
- p->quant_fp_QTX, p->quant_shift_QTX, qcoeff,
- dqcoeff, p->dequant_QTX, eob, scan_order->scan,
- scan_order->iscan);
- } else {
- aom_fdct4x4_lp(src_diff, low_coeff, diff_stride);
- av1_quantize_lp(low_coeff, 4 * 4, p->round_fp_QTX, p->quant_fp_QTX,
- low_qcoeff, low_dqcoeff, p->dequant_QTX, eob,
- scan_order->scan, scan_order->iscan);
- }
- break;
-#else
- case TX_16X16:
- aom_hadamard_lp_16x16(src_diff, diff_stride, low_coeff);
- av1_quantize_lp(low_coeff, 16 * 16, p->round_fp_QTX, p->quant_fp_QTX,
- low_qcoeff, low_dqcoeff, p->dequant_QTX, eob,
- default_scan_lp_16x16_transpose,
- av1_default_iscan_lp_16x16_transpose);
- break;
- case TX_8X8:
- if (is_tx_8x8_dual_applicable) {
- // The coeffs are pre-computed for the whole block, so re-assign
- // low_coeff to the appropriate location.
- const int block_offset = BLOCK_OFFSET(block + s);
- low_coeff = (int16_t *)p->coeff + block_offset;
- } else {
- aom_hadamard_lp_8x8(src_diff, diff_stride, low_coeff);
- }
- av1_quantize_lp(low_coeff, 8 * 8, p->round_fp_QTX, p->quant_fp_QTX,
- low_qcoeff, low_dqcoeff, p->dequant_QTX, eob,
- default_scan_8x8_transpose,
- av1_default_iscan_8x8_transpose);
- break;
- default:
- aom_fdct4x4_lp(src_diff, low_coeff, diff_stride);
- av1_quantize_lp(low_coeff, 4 * 4, p->round_fp_QTX, p->quant_fp_QTX,
- low_qcoeff, low_dqcoeff, p->dequant_QTX, eob,
- scan_order->scan, scan_order->iscan);
- break;
-#endif
- }
- assert(*eob <= 1024);
-#if CONFIG_AV1_HIGHBITDEPTH
- if (use_hbd)
- update_yrd_loop_vars_hbd(x, &temp_skippable, step, *eob, coeff, qcoeff,
- dqcoeff, this_rdc, &eob_cost,
- (r * num_blk_skip_w + c) >> sh_blk_skip);
- else
-#endif
- update_yrd_loop_vars(x, &temp_skippable, step, *eob, low_coeff,
- low_qcoeff, low_dqcoeff, this_rdc, &eob_cost,
- (r * num_blk_skip_w + c) >> sh_blk_skip);
- }
- block += row_step;
- }
-
- this_rdc->skip_txfm = *skippable = temp_skippable;
- if (this_rdc->sse < INT64_MAX) {
- this_rdc->sse = (this_rdc->sse << 6) >> 2;
- if (temp_skippable) {
- this_rdc->dist = 0;
- this_rdc->dist = this_rdc->sse;
- return;
- }
- }
-
- // If skippable is set, rate gets clobbered later.
- this_rdc->rate <<= (2 + AV1_PROB_COST_SHIFT);
- this_rdc->rate += (eob_cost << AV1_PROB_COST_SHIFT);
-}
-
-// Explicitly enumerate the cases so the compiler can generate SIMD for the
-// function. According to the disassembler, gcc generates SSE code for each of
-// the possible block sizes. The hottest case is tx_width 16, which takes up
-// about 8% of the self cycles of av1_nonrd_pick_inter_mode_sb. Since
-// av1_nonrd_pick_inter_mode_sb takes up about 3% of total encoding time, the
-// potential room for improvement from an AVX2 optimization is only
-// 3% * 8% = 0.24% of total encoding time.
-static AOM_INLINE void scale_square_buf_vals(int16_t *dst, const int tx_width,
- const int16_t *src,
- const int src_stride) {
-#define DO_SCALING \
- do { \
- for (int idy = 0; idy < tx_width; ++idy) { \
- for (int idx = 0; idx < tx_width; ++idx) { \
- dst[idy * tx_width + idx] = src[idy * src_stride + idx] * 8; \
- } \
- } \
- } while (0)
-
- if (tx_width == 4) {
- DO_SCALING;
- } else if (tx_width == 8) {
- DO_SCALING;
- } else if (tx_width == 16) {
- DO_SCALING;
- } else {
- assert(0);
- }
-
-#undef DO_SCALING
-}
-
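
The enumerated branches in scale_square_buf_vals give each loop nest a
compile-time trip count, which is what lets the compiler unroll and vectorize
without a hand-written intrinsics path. A minimal sketch of the pattern, with
hypothetical names (the factor of 8 follows the convention that transform
coefficients are 8 times an orthogonal transform):

#include <assert.h>
#include <stdint.h>

#define SCALE_LOOP(w)                                 \
  do {                                                \
    for (int r = 0; r < (w); ++r)                     \
      for (int c = 0; c < (w); ++c)                   \
        dst[r * (w) + c] = src[r * stride + c] * 8;   \
  } while (0)

// Each branch fixes the trip count at compile time, so the compiler can
// unroll and emit SIMD for that exact size; a single variable-bound loop
// usually vectorizes much worse.
static void scale_square(int16_t *dst, int w, const int16_t *src, int stride) {
  if (w == 4) SCALE_LOOP(4);
  else if (w == 8) SCALE_LOOP(8);
  else if (w == 16) SCALE_LOOP(16);
  else assert(0);
}

int main(void) {
  int16_t src[16] = { 1 }, dst[16];
  scale_square(dst, 4, src, 4);
  assert(dst[0] == 8);  // every output is 8x the input
  return 0;
}
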
-/*!\brief Calculates RD Cost when the block uses Identity transform.
- * Note that this function is only for low bit depth encoding, since it
- * is called in real-time mode for now, which sets high bit depth to 0:
- * -DCONFIG_AV1_HIGHBITDEPTH=0
- *
- * \ingroup nonrd_mode_search
- * \callgraph
- * \callergraph
- * Calculates RD Cost. Since this function is only used for low bit depth,
- * it relies on the low-precision (16-bit) set of functions.
- * \param[in] x Pointer to structure holding all the data for
- the current macroblock
- * \param[in] this_rdc Pointer to calculated RD Cost
- * \param[in] skippable Pointer to a flag indicating possible tx skip
- * \param[in] bsize Current block size
- * \param[in] tx_size Transform size
- *
- * \remark Nothing is returned. Instead, calculated RD cost is placed to
- * \c this_rdc. \c skippable flag is set if all coefficients are zero.
- */
-static void block_yrd_idtx(MACROBLOCK *x, RD_STATS *this_rdc, int *skippable,
- const BLOCK_SIZE bsize, const TX_SIZE tx_size) {
- MACROBLOCKD *xd = &x->e_mbd;
- const struct macroblockd_plane *pd = &xd->plane[0];
- struct macroblock_plane *const p = &x->plane[0];
- assert(bsize < BLOCK_SIZES_ALL);
- const int num_4x4_w = mi_size_wide[bsize];
- const int num_4x4_h = mi_size_high[bsize];
- const int step = 1 << (tx_size << 1);
- const int block_step = (1 << tx_size);
- const int max_blocks_wide =
- num_4x4_w + (xd->mb_to_right_edge >= 0 ? 0 : xd->mb_to_right_edge >> 5);
- const int max_blocks_high =
- num_4x4_h + (xd->mb_to_bottom_edge >= 0 ? 0 : xd->mb_to_bottom_edge >> 5);
- int eob_cost = 0;
- const int bw = 4 * num_4x4_w;
- const int bh = 4 * num_4x4_h;
- const int num_blk_skip_w = num_4x4_w >> 1;
- const int sh_blk_skip = 1;
- // Keep the intermediate value on the stack here. Writing directly to
- // skippable causes speed regression due to load-and-store issues in
- // update_yrd_loop_vars.
- int temp_skippable = 1;
- int tx_wd = 0;
- switch (tx_size) {
- case TX_64X64:
- assert(0); // Not implemented
- break;
- case TX_32X32:
- assert(0); // Not used
- break;
- case TX_16X16: tx_wd = 16; break;
- case TX_8X8: tx_wd = 8; break;
- default:
- assert(tx_size == TX_4X4);
- tx_wd = 4;
- break;
- }
- this_rdc->dist = 0;
- this_rdc->rate = 0;
- aom_subtract_block(bh, bw, p->src_diff, bw, p->src.buf, p->src.stride,
- pd->dst.buf, pd->dst.stride);
- // Keep track of the row and column of the blocks we use so that we know
- // if we are in the unrestricted motion border.
- DECLARE_BLOCK_YRD_BUFFERS()
- DECLARE_BLOCK_YRD_VARS()
- for (int r = 0; r < max_blocks_high; r += block_step) {
- for (int c = 0, s = 0; c < max_blocks_wide; c += block_step, s += step) {
- DECLARE_LOOP_VARS_BLOCK_YRD()
- scale_square_buf_vals(low_coeff, tx_wd, src_diff, diff_stride);
- av1_quantize_lp(low_coeff, tx_wd * tx_wd, p->round_fp_QTX,
- p->quant_fp_QTX, low_qcoeff, low_dqcoeff, p->dequant_QTX,
- eob, scan_order->scan, scan_order->iscan);
- assert(*eob <= 1024);
- update_yrd_loop_vars(x, &temp_skippable, step, *eob, low_coeff,
- low_qcoeff, low_dqcoeff, this_rdc, &eob_cost,
- (r * num_blk_skip_w + c) >> sh_blk_skip);
- }
- }
- this_rdc->skip_txfm = *skippable = temp_skippable;
- if (this_rdc->sse < INT64_MAX) {
- this_rdc->sse = (this_rdc->sse << 6) >> 2;
- if (temp_skippable) {
- this_rdc->dist = 0;
- this_rdc->dist = this_rdc->sse;
- return;
- }
- }
- // If skippable is set, rate gets clobbered later.
- this_rdc->rate <<= (2 + AV1_PROB_COST_SHIFT);
- this_rdc->rate += (eob_cost << AV1_PROB_COST_SHIFT);
-}
-
-static INLINE void init_mbmi(MB_MODE_INFO *mbmi, PREDICTION_MODE pred_mode,
- MV_REFERENCE_FRAME ref_frame0,
- MV_REFERENCE_FRAME ref_frame1,
- const AV1_COMMON *cm) {
- PALETTE_MODE_INFO *const pmi = &mbmi->palette_mode_info;
- mbmi->ref_mv_idx = 0;
- mbmi->mode = pred_mode;
- mbmi->uv_mode = UV_DC_PRED;
- mbmi->ref_frame[0] = ref_frame0;
- mbmi->ref_frame[1] = ref_frame1;
- pmi->palette_size[0] = 0;
- pmi->palette_size[1] = 0;
- mbmi->filter_intra_mode_info.use_filter_intra = 0;
- mbmi->mv[0].as_int = mbmi->mv[1].as_int = 0;
- mbmi->motion_mode = SIMPLE_TRANSLATION;
- mbmi->num_proj_ref = 1;
- mbmi->interintra_mode = 0;
- set_default_interp_filters(mbmi, cm->features.interp_filter);
-}
-
-#if CONFIG_INTERNAL_STATS
-static void store_coding_context(MACROBLOCK *x, PICK_MODE_CONTEXT *ctx,
- int mode_index) {
-#else
-static void store_coding_context(MACROBLOCK *x, PICK_MODE_CONTEXT *ctx) {
-#endif // CONFIG_INTERNAL_STATS
- MACROBLOCKD *const xd = &x->e_mbd;
- TxfmSearchInfo *txfm_info = &x->txfm_search_info;
-
- // Take a snapshot of the coding context so it can be
- // restored if we decide to encode this way
- ctx->rd_stats.skip_txfm = txfm_info->skip_txfm;
-
- ctx->skippable = txfm_info->skip_txfm;
-#if CONFIG_INTERNAL_STATS
- ctx->best_mode_index = mode_index;
-#endif // CONFIG_INTERNAL_STATS
- ctx->mic = *xd->mi[0];
- ctx->skippable = txfm_info->skip_txfm;
- av1_copy_mbmi_ext_to_mbmi_ext_frame(&ctx->mbmi_ext_best, &x->mbmi_ext,
- av1_ref_frame_type(xd->mi[0]->ref_frame));
-}
-
-static int get_pred_buffer(PRED_BUFFER *p, int len) {
- for (int i = 0; i < len; i++) {
- if (!p[i].in_use) {
- p[i].in_use = 1;
- return i;
- }
- }
- return -1;
-}
-
-static void free_pred_buffer(PRED_BUFFER *p) {
- if (p != NULL) p->in_use = 0;
-}
-
-static INLINE int get_drl_cost(const PREDICTION_MODE this_mode,
- const int ref_mv_idx,
+static INLINE int get_drl_cost(PREDICTION_MODE this_mode, int ref_mv_idx,
const MB_MODE_INFO_EXT *mbmi_ext,
const int (*const drl_mode_cost0)[2],
int8_t ref_frame_type) {
@@ -1739,132 +1026,6 @@
}
}
-static int64_t model_rd_for_sb_uv(AV1_COMP *cpi, BLOCK_SIZE plane_bsize,
- MACROBLOCK *x, MACROBLOCKD *xd,
- RD_STATS *this_rdc, int start_plane,
- int stop_plane) {
- // Note our transform coeffs are 8 times an orthogonal transform.
- // Hence quantizer step is also 8 times. To get effective quantizer
- // we need to divide by 8 before sending to modeling function.
- unsigned int sse;
- int rate;
- int64_t dist;
- int i;
- int64_t tot_sse = 0;
-
- this_rdc->rate = 0;
- this_rdc->dist = 0;
- this_rdc->skip_txfm = 0;
-
- for (i = start_plane; i <= stop_plane; ++i) {
- struct macroblock_plane *const p = &x->plane[i];
- struct macroblockd_plane *const pd = &xd->plane[i];
- const uint32_t dc_quant = p->dequant_QTX[0];
- const uint32_t ac_quant = p->dequant_QTX[1];
- const BLOCK_SIZE bs = plane_bsize;
- unsigned int var;
- if (!x->color_sensitivity[i - 1]) continue;
-
- var = cpi->ppi->fn_ptr[bs].vf(p->src.buf, p->src.stride, pd->dst.buf,
- pd->dst.stride, &sse);
- assert(sse >= var);
- tot_sse += sse;
-
- av1_model_rd_from_var_lapndz(sse - var, num_pels_log2_lookup[bs],
- dc_quant >> 3, &rate, &dist);
-
- this_rdc->rate += rate >> 1;
- this_rdc->dist += dist << 3;
-
- av1_model_rd_from_var_lapndz(var, num_pels_log2_lookup[bs], ac_quant >> 3,
- &rate, &dist);
-
- this_rdc->rate += rate;
- this_rdc->dist += dist << 4;
- }
-
- if (this_rdc->rate == 0) {
- this_rdc->skip_txfm = 1;
- }
-
- if (RDCOST(x->rdmult, this_rdc->rate, this_rdc->dist) >=
- RDCOST(x->rdmult, 0, tot_sse << 4)) {
- this_rdc->rate = 0;
- this_rdc->dist = tot_sse << 4;
- this_rdc->skip_txfm = 1;
- }
-
- return tot_sse;
-}
-
-/*!\cond */
-struct estimate_block_intra_args {
- AV1_COMP *cpi;
- MACROBLOCK *x;
- PREDICTION_MODE mode;
- int skippable;
- RD_STATS *rdc;
-};
-/*!\endcond */
-
-/*!\brief Estimation of RD cost of an intra mode for Non-RD optimized case.
- *
- * \ingroup nonrd_mode_search
- * \callgraph
- * \callergraph
- * Calculates RD Cost for an intra mode for a single TX block using Hadamard
- * transform.
- * \param[in] plane Color plane
- * \param[in] block Index of a TX block in a prediction block
- * \param[in] row Row of a current TX block
- * \param[in] col Column of a current TX block
- * \param[in] plane_bsize Block size of a current prediction block
- * \param[in] tx_size Transform size
- * \param[in] arg Pointer to a structure that holds parameters
- * for intra mode search
- *
- * \remark Nothing is returned. Instead, best mode and RD Cost of the best mode
- * are set in \c args->rdc and \c args->mode
- */
-static void estimate_block_intra(int plane, int block, int row, int col,
- BLOCK_SIZE plane_bsize, TX_SIZE tx_size,
- void *arg) {
- struct estimate_block_intra_args *const args = arg;
- AV1_COMP *const cpi = args->cpi;
- AV1_COMMON *const cm = &cpi->common;
- MACROBLOCK *const x = args->x;
- MACROBLOCKD *const xd = &x->e_mbd;
- struct macroblock_plane *const p = &x->plane[plane];
- struct macroblockd_plane *const pd = &xd->plane[plane];
- const BLOCK_SIZE bsize_tx = txsize_to_bsize[tx_size];
- uint8_t *const src_buf_base = p->src.buf;
- uint8_t *const dst_buf_base = pd->dst.buf;
- const int64_t src_stride = p->src.stride;
- const int64_t dst_stride = pd->dst.stride;
- RD_STATS this_rdc;
-
- (void)block;
- (void)plane_bsize;
-
- av1_predict_intra_block_facade(cm, xd, plane, col, row, tx_size);
- av1_invalid_rd_stats(&this_rdc);
-
- p->src.buf = &src_buf_base[4 * (row * src_stride + col)];
- pd->dst.buf = &dst_buf_base[4 * (row * dst_stride + col)];
-
- if (plane == 0) {
- block_yrd(x, &this_rdc, &args->skippable, bsize_tx,
- AOMMIN(tx_size, TX_16X16), 0);
- } else {
- model_rd_for_sb_uv(cpi, bsize_tx, x, xd, &this_rdc, plane, plane);
- }
-
- p->src.buf = src_buf_base;
- pd->dst.buf = dst_buf_base;
- args->rdc->rate += this_rdc.rate;
- args->rdc->dist += this_rdc.dist;
-}
-
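
estimate_block_intra is written as a per-TX-block callback: a driver walks
the transform blocks of a prediction block and invokes the callback with the
block's plane, index, and position, while the callback accumulates RD stats
through the arg pointer. A minimal sketch of that visitor pattern, with
hypothetical names:

#include <assert.h>

// Signature mirrors the shape of the callback above: the driver supplies
// plane/block/position, the callback pulls its state out of `arg`.
typedef void (*tx_block_visitor)(int plane, int block, int row, int col,
                                 void *arg);

static void foreach_tx_block(int plane, int rows, int cols,
                             tx_block_visitor visit, void *arg) {
  int block = 0;
  for (int r = 0; r < rows; ++r)
    for (int c = 0; c < cols; ++c) visit(plane, block++, r, c, arg);
}

static void count_blocks(int plane, int block, int row, int col, void *arg) {
  (void)plane; (void)block; (void)row; (void)col;
  ++*(int *)arg;  // stand-in for accumulating rate/dist into args->rdc
}

int main(void) {
  int visited = 0;
  foreach_tx_block(0, 2, 2, count_blocks, &visited);
  assert(visited == 4);
  return 0;
}
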
static INLINE void update_thresh_freq_fact(AV1_COMP *cpi, MACROBLOCK *x,
BLOCK_SIZE bsize,
MV_REFERENCE_FRAME ref_frame,
@@ -1930,7 +1091,7 @@
set_ref_ptrs(cm, xd, mi->ref_frame[0], NONE_FRAME);
mi->mv[0].as_int = 0;
mi->interp_filters = av1_broadcast_interp_filter(EIGHTTAP_REGULAR);
- xd->plane[0].pre[0] = yv12_mb[LAST_FRAME][0];
+ xd->plane[AOM_PLANE_Y].pre[0] = yv12_mb[LAST_FRAME][AOM_PLANE_Y];
av1_enc_build_inter_predictor_y(xd, mi_row, mi_col);
unsigned int var;
model_rd_for_sb_y(cpi, bsize, x, xd, &this_rdc, &var, 1, NULL);
@@ -1958,7 +1119,7 @@
[best_pickmode->best_ref_frame]
.as_int;
if (ctx_den->reuse_inter_pred) {
- xd->plane[0].pre[0] = yv12_mb[GOLDEN_FRAME][0];
+ xd->plane[AOM_PLANE_Y].pre[0] = yv12_mb[GOLDEN_FRAME][AOM_PLANE_Y];
av1_enc_build_inter_predictor_y(xd, mi_row, mi_col);
}
}
@@ -1972,8 +1133,6 @@
}
#endif // CONFIG_AV1_TEMPORAL_DENOISING
-#define FILTER_SEARCH_SIZE 2
-
/*!\brief Searches for the best interpolation filter
*
* \ingroup nonrd_mode_search
@@ -2006,7 +1165,7 @@
* \param[in] use_model_yrd_large Flag, indicating special logic to handle
* large blocks
* \param[in] best_sse Best sse so far.
- * \param[in] comp_pred Flag, indicating compound mode.
+ * \param[in]    is_single_pred           Flag, indicating single reference
+ *                                        mode.
*
* \remark Nothing is returned. Instead, calculated RD cost is placed to
* \c this_rdc and best filter is placed to \c mi->interp_filters. In case
@@ -2021,10 +1180,10 @@
PRED_BUFFER **this_mode_pred,
int *this_early_term, unsigned int *var,
int use_model_yrd_large, int64_t best_sse,
- int comp_pred) {
+ int is_single_pred) {
AV1_COMMON *const cm = &cpi->common;
MACROBLOCKD *const xd = &x->e_mbd;
- struct macroblockd_plane *const pd = &xd->plane[0];
+ struct macroblockd_plane *const pd = &xd->plane[AOM_PLANE_Y];
MB_MODE_INFO *const mi = xd->mi[0];
const int bw = block_size_wide[bsize];
int dim_factor =
@@ -2040,38 +1199,43 @@
SubpelParams subpel_params;
// Initialize inter prediction params at mode level for single reference
// mode.
- if (!comp_pred)
+ if (is_single_pred)
init_inter_mode_params(&mi->mv[0].as_mv, inter_pred_params_sr,
&subpel_params, xd->block_ref_scale_factors[0],
pd->pre->width, pd->pre->height);
- for (int i = 0; i < FILTER_SEARCH_SIZE * FILTER_SEARCH_SIZE; ++i) {
+ for (int filter_idx = 0; filter_idx < FILTER_SEARCH_SIZE * FILTER_SEARCH_SIZE;
+ ++filter_idx) {
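+ // filter_idx enumerates the FILTER_SEARCH_SIZE x FILTER_SEARCH_SIZE
+ // (x_filter, y_filter) combinations packed in filters_ref_set; when dual
+ // filter is disabled, mismatched pairs are skipped below.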
int64_t cost;
if (cpi->sf.interp_sf.disable_dual_filter &&
- filters_ref_set[i].filter_x != filters_ref_set[i].filter_y)
+ filters_ref_set[filter_idx].as_filters.x_filter !=
+ filters_ref_set[filter_idx].as_filters.y_filter)
continue;
- mi->interp_filters.as_filters.x_filter = filters_ref_set[i].filter_x;
- mi->interp_filters.as_filters.y_filter = filters_ref_set[i].filter_y;
- if (!comp_pred)
+
+ mi->interp_filters.as_int = filters_ref_set[filter_idx].as_int;
+ if (is_single_pred)
av1_enc_build_inter_predictor_y_nonrd(xd, inter_pred_params_sr,
&subpel_params);
else
- av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, NULL, bsize, 0, 0);
+ av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, NULL, bsize,
+ AOM_PLANE_Y, AOM_PLANE_Y);
unsigned int curr_var = UINT_MAX;
if (use_model_yrd_large)
model_skip_for_sb_y_large(cpi, bsize, mi_row, mi_col, x, xd,
- &pf_rd_stats[i], this_early_term, 1, best_sse,
- &curr_var, UINT_MAX);
+ &pf_rd_stats[filter_idx], this_early_term, 1,
+ best_sse, &curr_var, UINT_MAX);
else
- model_rd_for_sb_y(cpi, bsize, x, xd, &pf_rd_stats[i], &curr_var, 1, NULL);
- pf_rd_stats[i].rate += av1_get_switchable_rate(
+ model_rd_for_sb_y(cpi, bsize, x, xd, &pf_rd_stats[filter_idx], &curr_var,
+ 1, NULL);
+ pf_rd_stats[filter_idx].rate += av1_get_switchable_rate(
x, xd, cm->features.interp_filter, cm->seq_params->enable_dual_filter);
- cost = RDCOST(x->rdmult, pf_rd_stats[i].rate, pf_rd_stats[i].dist);
- pf_tx_size[i] = mi->tx_size;
+ cost = RDCOST(x->rdmult, pf_rd_stats[filter_idx].rate,
+ pf_rd_stats[filter_idx].dist);
+ pf_tx_size[filter_idx] = mi->tx_size;
if (cost < best_cost) {
*var = curr_var;
- best_filter_index = i;
+ best_filter_index = filter_idx;
best_cost = cost;
- best_skip = pf_rd_stats[i].skip_txfm;
+ best_skip = pf_rd_stats[filter_idx].skip_txfm;
best_early_term = *this_early_term;
if (reuse_inter_pred) {
if (*this_mode_pred != current_pred) {
@@ -2089,10 +1253,7 @@
if (reuse_inter_pred && *this_mode_pred != current_pred)
free_pred_buffer(current_pred);
- mi->interp_filters.as_filters.x_filter =
- filters_ref_set[best_filter_index].filter_x;
- mi->interp_filters.as_filters.y_filter =
- filters_ref_set[best_filter_index].filter_y;
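+ // interp_filters is a union: assigning as_int copies the x and y filters
+ // of the best candidate in a single store.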
+ mi->interp_filters.as_int = filters_ref_set[best_filter_index].as_int;
mi->tx_size = pf_tx_size[best_filter_index];
this_rdc->rate = pf_rd_stats[best_filter_index].rate;
this_rdc->dist = pf_rd_stats[best_filter_index].dist;
@@ -2103,15 +1264,15 @@
pd->dst.buf = (*this_mode_pred)->data;
pd->dst.stride = (*this_mode_pred)->stride;
} else if (best_filter_index < dim_factor * FILTER_SEARCH_SIZE - 1) {
- if (!comp_pred)
+ if (is_single_pred)
av1_enc_build_inter_predictor_y_nonrd(xd, inter_pred_params_sr,
&subpel_params);
else
- av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, NULL, bsize, 0, 0);
+ av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, NULL, bsize,
+ AOM_PLANE_Y, AOM_PLANE_Y);
}
}
#if !CONFIG_REALTIME_ONLY
-#define MOTION_MODE_SEARCH_SIZE 2
static AOM_INLINE int is_warped_mode_allowed(const AV1_COMP *cpi,
MACROBLOCK *const x,
@@ -2199,25 +1360,28 @@
const MB_MODE_INFO base_mbmi = *mi;
MB_MODE_INFO best_mbmi;
- for (int i = 0; i < mode_search_size; ++i) {
+ for (int mode_index = 0; mode_index < mode_search_size; ++mode_index) {
int64_t cost = INT64_MAX;
- MOTION_MODE motion_mode = motion_modes[i];
+ MOTION_MODE motion_mode = motion_modes[mode_index];
*mi = base_mbmi;
mi->motion_mode = motion_mode;
if (motion_mode == SIMPLE_TRANSLATION) {
mi->interp_filters = av1_broadcast_interp_filter(EIGHTTAP_REGULAR);
- av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, NULL, bsize, 0, 0);
+ av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, NULL, bsize,
+ AOM_PLANE_Y, AOM_PLANE_Y);
if (use_model_yrd_large)
model_skip_for_sb_y_large(cpi, bsize, mi_row, mi_col, x, xd,
- &pf_rd_stats[i], this_early_term, 1, best_sse,
- NULL, UINT_MAX);
+ &pf_rd_stats[mode_index], this_early_term, 1,
+ best_sse, NULL, UINT_MAX);
else
- model_rd_for_sb_y(cpi, bsize, x, xd, &pf_rd_stats[i], NULL, 1, NULL);
- pf_rd_stats[i].rate +=
+ model_rd_for_sb_y(cpi, bsize, x, xd, &pf_rd_stats[mode_index], NULL, 1,
+ NULL);
+ pf_rd_stats[mode_index].rate +=
av1_get_switchable_rate(x, xd, cm->features.interp_filter,
cm->seq_params->enable_dual_filter);
- cost = RDCOST(x->rdmult, pf_rd_stats[i].rate, pf_rd_stats[i].dist);
+ cost = RDCOST(x->rdmult, pf_rd_stats[mode_index].rate,
+ pf_rd_stats[mode_index].dist);
} else if (motion_mode == WARPED_CAUSAL) {
int pts[SAMPLES_ARRAY_SIZE], pts_inref[SAMPLES_ARRAY_SIZE];
const ModeCosts *mode_costs = &x->mode_costs;
@@ -2250,7 +1414,8 @@
// Refine MV in a small range.
av1_refine_warped_mv(xd, cm, &ms_params, bsize, pts0, pts_inref0,
- total_samples);
+ total_samples, cpi->sf.mv_sf.warp_search_method,
+ cpi->sf.mv_sf.warp_search_iters);
if (mi->mv[0].as_int == ref_mv.as_int) {
continue;
}
@@ -2269,26 +1434,28 @@
}
}
// Build the warped predictor
- av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, NULL, bsize, 0,
- av1_num_planes(cm) - 1);
+ av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, NULL, bsize,
+ AOM_PLANE_Y, av1_num_planes(cm) - 1);
if (use_model_yrd_large)
model_skip_for_sb_y_large(cpi, bsize, mi_row, mi_col, x, xd,
- &pf_rd_stats[i], this_early_term, 1,
- best_sse, NULL, UINT_MAX);
+ &pf_rd_stats[mode_index], this_early_term,
+ 1, best_sse, NULL, UINT_MAX);
else
- model_rd_for_sb_y(cpi, bsize, x, xd, &pf_rd_stats[i], NULL, 1, NULL);
+ model_rd_for_sb_y(cpi, bsize, x, xd, &pf_rd_stats[mode_index], NULL,
+ 1, NULL);
- pf_rd_stats[i].rate +=
+ pf_rd_stats[mode_index].rate +=
mode_costs->motion_mode_cost[bsize][mi->motion_mode];
- cost = RDCOST(x->rdmult, pf_rd_stats[i].rate, pf_rd_stats[i].dist);
+ cost = RDCOST(x->rdmult, pf_rd_stats[mode_index].rate,
+ pf_rd_stats[mode_index].dist);
} else {
cost = INT64_MAX;
}
}
if (cost < best_cost) {
- best_mode_index = i;
+ best_mode_index = mode_index;
best_cost = cost;
- best_skip = pf_rd_stats[i].skip_txfm;
+ best_skip = pf_rd_stats[mode_index].skip_txfm;
best_early_term = *this_early_term;
best_mbmi = *mi;
}
@@ -2302,33 +1469,15 @@
this_rdc->skip_txfm = (best_skip || best_early_term);
*this_early_term = best_early_term;
if (best_mode_index < FILTER_SEARCH_SIZE - 1) {
- av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, NULL, bsize, 0, 0);
+ av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, NULL, bsize,
+ AOM_PLANE_Y, AOM_PLANE_Y);
}
}
#endif // !CONFIG_REALTIME_ONLY
-#define COLLECT_PICK_MODE_STAT 0
#define COLLECT_NON_SQR_STAT 0
-#if COLLECT_PICK_MODE_STAT
-#include "aom_ports/aom_timer.h"
-typedef struct _mode_search_stat {
- int32_t num_blocks[BLOCK_SIZES];
- int64_t total_block_times[BLOCK_SIZES];
- int32_t num_searches[BLOCK_SIZES][MB_MODE_COUNT];
- int32_t num_nonskipped_searches[BLOCK_SIZES][MB_MODE_COUNT];
- int64_t search_times[BLOCK_SIZES][MB_MODE_COUNT];
- int64_t nonskipped_search_times[BLOCK_SIZES][MB_MODE_COUNT];
- int64_t ms_time[BLOCK_SIZES][MB_MODE_COUNT];
- int64_t ifs_time[BLOCK_SIZES][MB_MODE_COUNT];
- int64_t model_rd_time[BLOCK_SIZES][MB_MODE_COUNT];
- int64_t txfm_time[BLOCK_SIZES][MB_MODE_COUNT];
- struct aom_usec_timer timer1;
- struct aom_usec_timer timer2;
- struct aom_usec_timer bsize_timer;
-} mode_search_stat;
-
-static mode_search_stat ms_stat;
+#if COLLECT_NONRD_PICK_MODE_STAT
static AOM_INLINE void print_stage_time(const char *stage_name,
int64_t stage_time,
@@ -2337,9 +1486,9 @@
100 * stage_time / (float)total_time);
}
-static void print_time(const mode_search_stat *const ms_stat,
- const BLOCK_SIZE bsize, const int mi_rows,
- const int mi_cols, const int mi_row, const int mi_col) {
+static void print_time(const mode_search_stat_nonrd *const ms_stat,
+ BLOCK_SIZE bsize, int mi_rows, int mi_cols, int mi_row,
+ int mi_col) {
if ((mi_row + mi_size_high[bsize] >= mi_rows) &&
(mi_col + mi_size_wide[bsize] >= mi_cols)) {
int64_t total_time = 0l;
@@ -2396,47 +1545,22 @@
printf("Total time = %ld. Total blocks = %d\n", total_time, total_blocks);
}
}
-#endif // COLLECT_PICK_MODE_STAT
+#endif // COLLECT_NONRD_PICK_MODE_STAT
-static void compute_intra_yprediction(const AV1_COMMON *cm,
- PREDICTION_MODE mode, BLOCK_SIZE bsize,
- MACROBLOCK *x, MACROBLOCKD *xd) {
- const SequenceHeader *seq_params = cm->seq_params;
- struct macroblockd_plane *const pd = &xd->plane[0];
- struct macroblock_plane *const p = &x->plane[0];
- uint8_t *const src_buf_base = p->src.buf;
- uint8_t *const dst_buf_base = pd->dst.buf;
- const int src_stride = p->src.stride;
- const int dst_stride = pd->dst.stride;
- int plane = 0;
- int row, col;
- // block and transform sizes, in number of 4x4 blocks log 2 ("*_b")
- // 4x4=0, 8x8=2, 16x16=4, 32x32=6, 64x64=8
- // transform size varies per plane, look it up in a common way.
- const TX_SIZE tx_size = max_txsize_lookup[bsize];
- const BLOCK_SIZE plane_bsize =
- get_plane_block_size(bsize, pd->subsampling_x, pd->subsampling_y);
- // If mb_to_right_edge is < 0 we are in a situation in which
- // the current block size extends into the UMV and we won't
- // visit the sub blocks that are wholly within the UMV.
- const int max_blocks_wide = max_block_wide(xd, plane_bsize, plane);
- const int max_blocks_high = max_block_high(xd, plane_bsize, plane);
- // Keep track of the row and column of the blocks we use so that we know
- // if we are in the unrestricted motion border.
- for (row = 0; row < max_blocks_high; row += (1 << tx_size)) {
- // Skip visiting the sub blocks that are wholly within the UMV.
- for (col = 0; col < max_blocks_wide; col += (1 << tx_size)) {
- p->src.buf = &src_buf_base[4 * (row * (int64_t)src_stride + col)];
- pd->dst.buf = &dst_buf_base[4 * (row * (int64_t)dst_stride + col)];
- av1_predict_intra_block(
- xd, seq_params->sb_size, seq_params->enable_intra_edge_filter,
- block_size_wide[bsize], block_size_high[bsize], tx_size, mode, 0, 0,
- FILTER_INTRA_MODES, pd->dst.buf, dst_stride, pd->dst.buf, dst_stride,
- 0, 0, plane);
- }
- }
- p->src.buf = src_buf_base;
- pd->dst.buf = dst_buf_base;
+static bool should_prune_intra_modes_using_neighbors(
+ const MACROBLOCKD *xd, bool enable_intra_mode_pruning_using_neighbors,
+ PREDICTION_MODE this_mode, PREDICTION_MODE above_mode,
+ PREDICTION_MODE left_mode) {
+ if (!enable_intra_mode_pruning_using_neighbors) return false;
+
+ // Avoid pruning of DC_PRED as it is the most probable mode to win as per the
+ // statistics generated for nonrd intra mode evaluations.
+ if (this_mode == DC_PRED) return false;
+
+ // Enable pruning for the current mode only if it is not the winner mode of
+ // either of the neighboring blocks (left/top).
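+ // For example, with above_mode == V_PRED and left_mode == DC_PRED, H_PRED
+ // and SMOOTH_PRED become pruning candidates here, while V_PRED and DC_PRED
+ // are retained.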
+ return xd->up_available && this_mode != above_mode && xd->left_available &&
+ this_mode != left_mode;
}
void av1_nonrd_pick_intra_mode(AV1_COMP *cpi, MACROBLOCK *x, RD_STATS *rd_cost,
@@ -2445,11 +1569,20 @@
MACROBLOCKD *const xd = &x->e_mbd;
MB_MODE_INFO *const mi = xd->mi[0];
RD_STATS this_rdc, best_rdc;
- struct estimate_block_intra_args args = { cpi, x, DC_PRED, 1, 0 };
+ struct estimate_block_intra_args args;
+ init_estimate_block_intra_args(&args, cpi, x);
const TxfmSearchParams *txfm_params = &x->txfm_search_params;
- const TX_SIZE intra_tx_size =
+ mi->tx_size =
AOMMIN(max_txsize_lookup[bsize],
tx_mode_to_biggest_tx_size[txfm_params->tx_mode_search_type]);
+ assert(IMPLIES(xd->lossless[mi->segment_id], mi->tx_size == TX_4X4));
+ const BLOCK_SIZE tx_bsize = txsize_to_bsize[mi->tx_size];
+
+ // If the current block size is the same as the transform block size, enable
+ // mode pruning based on the best SAD so far.
+ if (cpi->sf.rt_sf.prune_intra_mode_using_best_sad_so_far && bsize == tx_bsize)
+ args.prune_mode_based_on_sad = true;
+
int *bmode_costs;
PREDICTION_MODE best_mode = DC_PRED;
const MB_MODE_INFO *above_mi = xd->above_mbmi;
@@ -2458,37 +1591,54 @@
const PREDICTION_MODE L = av1_left_block_mode(left_mi);
const int above_ctx = intra_mode_context[A];
const int left_ctx = intra_mode_context[L];
+ const unsigned int source_variance = x->source_variance;
bmode_costs = x->mode_costs.y_mode_costs[above_ctx][left_ctx];
av1_invalid_rd_stats(&best_rdc);
av1_invalid_rd_stats(&this_rdc);
- init_mbmi(mi, DC_PRED, INTRA_FRAME, NONE_FRAME, cm);
+ init_mbmi_nonrd(mi, DC_PRED, INTRA_FRAME, NONE_FRAME, cm);
mi->mv[0].as_int = mi->mv[1].as_int = INVALID_MV;
// Change the limit of this loop to add other intra prediction
// mode tests.
- for (int i = 0; i < 4; ++i) {
- PREDICTION_MODE this_mode = intra_mode_list[i];
+ for (int mode_index = 0; mode_index < RTC_INTRA_MODES; ++mode_index) {
+ PREDICTION_MODE this_mode = intra_mode_list[mode_index];
// As per the statistics generated for intra mode evaluation in the nonrd
// path, it is found that the probability of H_PRED mode being the winner is
- // very less when the best mode so far is V_PRED (out of DC_PRED and
- // V_PRED). If V_PRED is the winner mode out of DC_PRED and V_PRED, it could
- // imply the presence of a vertically dominant pattern. Hence, H_PRED mode
- // is not evaluated.
+ // very low when the best mode so far is V_PRED (out of DC_PRED and V_PRED).
+ // If V_PRED is the winner mode out of DC_PRED and V_PRED, it could imply
+ // the presence of a vertically dominant pattern. Hence, H_PRED mode is not
+ // evaluated.
if (cpi->sf.rt_sf.prune_h_pred_using_best_mode_so_far &&
this_mode == H_PRED && best_mode == V_PRED)
continue;
+ if (should_prune_intra_modes_using_neighbors(
+ xd, cpi->sf.rt_sf.enable_intra_mode_pruning_using_neighbors,
+ this_mode, A, L)) {
+ // Prune V_PRED and H_PRED if source variance of the block is less than
+ // or equal to 50. The source variance threshold is obtained empirically.
+ if ((this_mode == V_PRED || this_mode == H_PRED) && source_variance <= 50)
+ continue;
+
+ // As per the statistics, probability of SMOOTH_PRED being the winner is
+ // low when best mode so far is DC_PRED (out of DC_PRED, V_PRED and
+ // H_PRED). Hence, SMOOTH_PRED mode is not evaluated.
+ if (best_mode == DC_PRED && this_mode == SMOOTH_PRED) continue;
+ }
+
this_rdc.dist = this_rdc.rate = 0;
args.mode = this_mode;
args.skippable = 1;
args.rdc = &this_rdc;
- mi->tx_size = intra_tx_size;
mi->mode = this_mode;
- av1_foreach_transformed_block_in_plane(xd, bsize, 0, estimate_block_intra,
- &args);
+ av1_foreach_transformed_block_in_plane(xd, bsize, AOM_PLANE_Y,
+ av1_estimate_block_intra, &args);
+
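+ // A rate of INT_MAX here presumably indicates the mode was pruned inside
+ // av1_estimate_block_intra (e.g. by the SAD-based pruning enabled above).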
+ if (this_rdc.rate == INT_MAX) continue;
+
const int skip_ctx = av1_get_skip_txfm_context(xd);
if (args.skippable) {
this_rdc.rate = x->mode_costs.skip_txfm_cost[skip_ctx][1];
@@ -2513,10 +1663,19 @@
mi->uv_mode = UV_DC_PRED;
*rd_cost = best_rdc;
+ // For lossless: always force the skip flags off.
+ // Even though blk_skip is already set to 0 above in the rdcost comparison,
+ // do it here again in case the logic above changes.
+ if (is_lossless_requested(&cpi->oxcf.rc_cfg)) {
+ x->txfm_search_info.skip_txfm = 0;
+ memset(ctx->blk_skip, 0,
+ sizeof(x->txfm_search_info.blk_skip[0]) * ctx->num_4x4_blk);
+ }
+
#if CONFIG_INTERNAL_STATS
- store_coding_context(x, ctx, mi->mode);
+ store_coding_context_nonrd(x, ctx, mi->mode);
#else
- store_coding_context(x, ctx);
+ store_coding_context_nonrd(x, ctx);
#endif // CONFIG_INTERNAL_STATS
}
@@ -2588,18 +1747,26 @@
use_alt_ref_frame = 0;
}
- // Skip golden reference if color is set, on flat blocks with motion.
- // For screen: always skip golden (if color_sensitivity_sb_g is set)
+ // Skip golden/altref reference if color is set, on flat blocks with motion.
+ // For screen: always skip golden/alt (if color_sensitivity_sb_g/alt is set)
// except when x->nonrd_prune_ref_frame_search = 0. This latter flag
// may be set in the variance partition when golden is a much better
// reference than last, in which case it may not be worth skipping
- // golden completely.
- if (((cpi->oxcf.tune_cfg.content == AOM_CONTENT_SCREEN &&
+ // golden/altref completely.
+ // Condition on use_last_ref_frame to make sure there remains at least one
+ // reference.
+ if (use_last_ref_frame &&
+ ((cpi->oxcf.tune_cfg.content == AOM_CONTENT_SCREEN &&
x->nonrd_prune_ref_frame_search != 0) ||
- (x->source_variance < 500 &&
- x->content_state_sb.source_sad_nonrd > kLowSad)) &&
- (x->color_sensitivity_sb_g[0] == 1 || x->color_sensitivity_sb_g[1] == 1))
- use_golden_ref_frame = 0;
+ (x->source_variance < 200 &&
+ x->content_state_sb.source_sad_nonrd >= kLowSad))) {
+ if (x->color_sensitivity_sb_g[COLOR_SENS_IDX(AOM_PLANE_U)] == 1 ||
+ x->color_sensitivity_sb_g[COLOR_SENS_IDX(AOM_PLANE_V)] == 1)
+ use_golden_ref_frame = 0;
+ if (x->color_sensitivity_sb_alt[COLOR_SENS_IDX(AOM_PLANE_U)] == 1 ||
+ x->color_sensitivity_sb_alt[COLOR_SENS_IDX(AOM_PLANE_V)] == 1)
+ use_alt_ref_frame = 0;
+ }
// For non-screen: if golden and altref are not being selected as references
// (use_golden_ref_frame/use_alt_ref_frame = 0) check to allow golden back
@@ -2610,7 +1777,8 @@
(cpi->ref_frame_flags & AOM_LAST_FLAG) && !use_golden_ref_frame &&
!use_alt_ref_frame && x->pred_mv_sad[LAST_FRAME] != INT_MAX &&
x->nonrd_prune_ref_frame_search > 2 &&
- x->color_sensitivity_sb_g[0] == 0 && x->color_sensitivity_sb_g[1] == 0) {
+ x->color_sensitivity_sb_g[COLOR_SENS_IDX(AOM_PLANE_U)] == 0 &&
+ x->color_sensitivity_sb_g[COLOR_SENS_IDX(AOM_PLANE_V)] == 0) {
int thr = (cm->width * cm->height >= 640 * 360) ? 100 : 150;
int pred = x->pred_mv_sad[LAST_FRAME] >>
(b_width_log2_lookup[bsize] + b_height_log2_lookup[bsize]);
@@ -2628,7 +1796,7 @@
x->content_state_sb.source_sad_nonrd < kHighSad) {
const int buffslot_golden =
cpi->ppi->rtc_ref.ref_idx[GOLDEN_FRAME - LAST_FRAME];
- if (cpi->svc.buffer_time_index[buffslot_golden] ==
+ if (cpi->ppi->rtc_ref.buffer_time_index[buffslot_golden] ==
cpi->svc.current_superframe)
use_golden_ref_frame = 1;
}
@@ -2643,289 +1811,6 @@
assert(use_last_ref_frame || use_golden_ref_frame || use_alt_ref_frame);
}
-// Checks whether Intra mode needs to be pruned based on
-// 'intra_y_mode_bsize_mask_nrd' and 'prune_hv_pred_modes_using_blksad'
-// speed features.
-static INLINE bool is_prune_intra_mode(AV1_COMP *cpi, int mode_index,
- int force_intra_check, BLOCK_SIZE bsize,
- uint8_t segment_id,
- SOURCE_SAD source_sad_nonrd,
- uint8_t color_sensitivity[2]) {
- const PREDICTION_MODE this_mode = intra_mode_list[mode_index];
- if (mode_index > 2 || force_intra_check == 0) {
- if (!((1 << this_mode) & cpi->sf.rt_sf.intra_y_mode_bsize_mask_nrd[bsize]))
- return true;
-
- if (this_mode == DC_PRED) return false;
-
- if (!cpi->sf.rt_sf.prune_hv_pred_modes_using_src_sad) return false;
-
- const bool has_color_sensitivity =
- color_sensitivity[0] && color_sensitivity[1];
- if (has_color_sensitivity &&
- (cpi->rc.frame_source_sad > 1.1 * cpi->rc.avg_source_sad ||
- cyclic_refresh_segment_id_boosted(segment_id) ||
- source_sad_nonrd > kMedSad))
- return false;
-
- return true;
- }
- return false;
-}
-
-/*!\brief Estimates best intra mode for inter mode search
- *
- * \ingroup nonrd_mode_search
- * \callgraph
- * \callergraph
- *
- * Using heuristics based on best inter mode, block size, and other decides
- * whether to check intra modes. If so, estimates and selects best intra mode
- * from the reduced set of intra modes (max 4 intra modes checked)
- *
- * \param[in] cpi Top-level encoder structure
- * \param[in] x Pointer to structure holding all the
- * data for the current macroblock
- * \param[in] bsize Current block size
- * \param[in] best_early_term Flag, indicating that TX for the
- * best inter mode was skipped
- * \param[in] ref_cost_intra Cost of signalling intra mode
- * \param[in] reuse_prediction Flag, indicating prediction re-use
- * \param[in] orig_dst Original destination buffer
- * \param[in] tmp_buffers Pointer to a temporary buffers for
- * prediction re-use
- * \param[out] this_mode_pred Pointer to store prediction buffer
- * for prediction re-use
- * \param[in] best_rdc Pointer to RD cost for the best
- * selected intra mode
- * \param[in] best_pickmode Pointer to a structure containing
- * best mode picked so far
- * \param[in] ctx Pointer to structure holding coding
- * contexts and modes for the block
- *
- * \remark Nothing is returned. Instead, calculated RD cost is placed to
- * \c best_rdc and best selected mode is placed to \c best_pickmode
- */
-static void estimate_intra_mode(
- AV1_COMP *cpi, MACROBLOCK *x, BLOCK_SIZE bsize, int best_early_term,
- unsigned int ref_cost_intra, int reuse_prediction, struct buf_2d *orig_dst,
- PRED_BUFFER *tmp_buffers, PRED_BUFFER **this_mode_pred, RD_STATS *best_rdc,
- BEST_PICKMODE *best_pickmode, PICK_MODE_CONTEXT *ctx) {
- AV1_COMMON *const cm = &cpi->common;
- MACROBLOCKD *const xd = &x->e_mbd;
- MB_MODE_INFO *const mi = xd->mi[0];
- const TxfmSearchParams *txfm_params = &x->txfm_search_params;
- const unsigned char segment_id = mi->segment_id;
- const int *const rd_threshes = cpi->rd.threshes[segment_id][bsize];
- const int *const rd_thresh_freq_fact = x->thresh_freq_fact[bsize];
- const bool is_screen_content =
- cpi->oxcf.tune_cfg.content == AOM_CONTENT_SCREEN;
- struct macroblockd_plane *const pd = &xd->plane[0];
-
- const CommonQuantParams *quant_params = &cm->quant_params;
-
- RD_STATS this_rdc;
-
- int intra_cost_penalty = av1_get_intra_cost_penalty(
- quant_params->base_qindex, quant_params->y_dc_delta_q,
- cm->seq_params->bit_depth);
- int64_t inter_mode_thresh =
- RDCOST(x->rdmult, ref_cost_intra + intra_cost_penalty, 0);
- int perform_intra_pred = cpi->sf.rt_sf.check_intra_pred_nonrd;
- int force_intra_check = 0;
- // For spatial enhancement layer: turn off intra prediction if the
- // previous spatial layer as golden ref is not chosen as best reference.
- // only do this for temporal enhancement layer and on non-key frames.
- if (cpi->svc.spatial_layer_id > 0 &&
- best_pickmode->best_ref_frame != GOLDEN_FRAME &&
- cpi->svc.temporal_layer_id > 0 &&
- !cpi->svc.layer_context[cpi->svc.temporal_layer_id].is_key_frame)
- perform_intra_pred = 0;
-
- int do_early_exit_rdthresh = 1;
-
- uint32_t spatial_var_thresh = 50;
- int motion_thresh = 32;
- // Adjust thresholds to make intra mode likely tested if the other
- // references (golden, alt) are skipped/not checked. For now always
- // adjust for svc mode.
- if (cpi->ppi->use_svc || (cpi->sf.rt_sf.use_nonrd_altref_frame == 0 &&
- cpi->sf.rt_sf.nonrd_prune_ref_frame_search > 0)) {
- spatial_var_thresh = 150;
- motion_thresh = 0;
- }
-
- // Some adjustments to checking intra mode based on source variance.
- if (x->source_variance < spatial_var_thresh) {
- // If the best inter mode is large motion or non-LAST ref reduce intra cost
- // penalty, so intra mode is more likely tested.
- if (best_rdc->rdcost != INT64_MAX &&
- (best_pickmode->best_ref_frame != LAST_FRAME ||
- abs(mi->mv[0].as_mv.row) >= motion_thresh ||
- abs(mi->mv[0].as_mv.col) >= motion_thresh)) {
- intra_cost_penalty = intra_cost_penalty >> 2;
- inter_mode_thresh =
- RDCOST(x->rdmult, ref_cost_intra + intra_cost_penalty, 0);
- do_early_exit_rdthresh = 0;
- }
- if ((x->source_variance < AOMMAX(50, (spatial_var_thresh >> 1)) &&
- x->content_state_sb.source_sad_nonrd >= kHighSad) ||
- (is_screen_content && x->source_variance < 50 &&
- ((bsize >= BLOCK_32X32 &&
- x->content_state_sb.source_sad_nonrd != kZeroSad) ||
- x->color_sensitivity[0] == 1 || x->color_sensitivity[1] == 1)))
- force_intra_check = 1;
- // For big blocks worth checking intra (since only DC will be checked),
- // even if best_early_term is set.
- if (bsize >= BLOCK_32X32) best_early_term = 0;
- } else if (cpi->sf.rt_sf.source_metrics_sb_nonrd &&
- x->content_state_sb.source_sad_nonrd <= kLowSad) {
- perform_intra_pred = 0;
- }
-
- if (best_rdc->skip_txfm && best_pickmode->best_mode_initial_skip_flag) {
- if (cpi->sf.rt_sf.skip_intra_pred == 1 && best_pickmode->best_mode != NEWMV)
- perform_intra_pred = 0;
- else if (cpi->sf.rt_sf.skip_intra_pred == 2)
- perform_intra_pred = 0;
- }
-
- if (!(best_rdc->rdcost == INT64_MAX || force_intra_check ||
- (perform_intra_pred && !best_early_term &&
- bsize <= cpi->sf.part_sf.max_intra_bsize))) {
- return;
- }
-
- // Early exit based on RD cost calculated using known rate. When
- // is_screen_content is true, more bias is given to intra modes. Hence,
- // considered conservative threshold in early exit for the same.
- const int64_t known_rd = is_screen_content
- ? CALC_BIASED_RDCOST(inter_mode_thresh)
- : inter_mode_thresh;
- if (known_rd > best_rdc->rdcost) return;
-
- struct estimate_block_intra_args args = { cpi, x, DC_PRED, 1, 0 };
- TX_SIZE intra_tx_size = AOMMIN(
- AOMMIN(max_txsize_lookup[bsize],
- tx_mode_to_biggest_tx_size[txfm_params->tx_mode_search_type]),
- TX_16X16);
- if (is_screen_content && cpi->rc.high_source_sad &&
- x->source_variance > spatial_var_thresh && bsize <= BLOCK_16X16)
- intra_tx_size = TX_4X4;
-
- PRED_BUFFER *const best_pred = best_pickmode->best_pred;
- if (reuse_prediction && best_pred != NULL) {
- const int bh = block_size_high[bsize];
- const int bw = block_size_wide[bsize];
- if (best_pred->data == orig_dst->buf) {
- *this_mode_pred = &tmp_buffers[get_pred_buffer(tmp_buffers, 3)];
- aom_convolve_copy(best_pred->data, best_pred->stride,
- (*this_mode_pred)->data, (*this_mode_pred)->stride, bw,
- bh);
- best_pickmode->best_pred = *this_mode_pred;
- }
- }
- pd->dst = *orig_dst;
-
- for (int i = 0; i < 4; ++i) {
- const PREDICTION_MODE this_mode = intra_mode_list[i];
- const THR_MODES mode_index = mode_idx[INTRA_FRAME][mode_offset(this_mode)];
- const int64_t mode_rd_thresh = rd_threshes[mode_index];
-
- if (is_prune_intra_mode(cpi, i, force_intra_check, bsize, segment_id,
- x->content_state_sb.source_sad_nonrd,
- x->color_sensitivity))
- continue;
-
- if (is_screen_content && cpi->sf.rt_sf.source_metrics_sb_nonrd) {
- // For spatially flat blocks with zero motion only check
- // DC mode.
- if (x->content_state_sb.source_sad_nonrd == kZeroSad &&
- x->source_variance == 0 && this_mode != DC_PRED)
- continue;
- // Only test Intra for big blocks if spatial_variance is small.
- else if (bsize > BLOCK_32X32 && x->source_variance > 50)
- continue;
- }
-
- if (rd_less_than_thresh(best_rdc->rdcost, mode_rd_thresh,
- rd_thresh_freq_fact[mode_index]) &&
- (do_early_exit_rdthresh || this_mode == SMOOTH_PRED)) {
- continue;
- }
- const BLOCK_SIZE uv_bsize = get_plane_block_size(
- bsize, xd->plane[1].subsampling_x, xd->plane[1].subsampling_y);
-
- mi->mode = this_mode;
- mi->ref_frame[0] = INTRA_FRAME;
- mi->ref_frame[1] = NONE_FRAME;
-
- av1_invalid_rd_stats(&this_rdc);
- args.mode = this_mode;
- args.skippable = 1;
- args.rdc = &this_rdc;
- mi->tx_size = intra_tx_size;
- compute_intra_yprediction(cm, this_mode, bsize, x, xd);
- // Look into selecting tx_size here, based on prediction residual.
- block_yrd(x, &this_rdc, &args.skippable, bsize, mi->tx_size, 0);
- // TODO(kyslov@) Need to account for skippable
- if (x->color_sensitivity[0]) {
- av1_foreach_transformed_block_in_plane(xd, uv_bsize, 1,
- estimate_block_intra, &args);
- }
- if (x->color_sensitivity[1]) {
- av1_foreach_transformed_block_in_plane(xd, uv_bsize, 2,
- estimate_block_intra, &args);
- }
-
- int mode_cost = 0;
- if (av1_is_directional_mode(this_mode) && av1_use_angle_delta(bsize)) {
- mode_cost +=
- x->mode_costs.angle_delta_cost[this_mode - V_PRED]
- [MAX_ANGLE_DELTA +
- mi->angle_delta[PLANE_TYPE_Y]];
- }
- if (this_mode == DC_PRED && av1_filter_intra_allowed_bsize(cm, bsize)) {
- mode_cost += x->mode_costs.filter_intra_cost[bsize][0];
- }
- this_rdc.rate += ref_cost_intra;
- this_rdc.rate += intra_cost_penalty;
- this_rdc.rate += mode_cost;
- this_rdc.rdcost = RDCOST(x->rdmult, this_rdc.rate, this_rdc.dist);
-
- if (is_screen_content && cpi->sf.rt_sf.source_metrics_sb_nonrd) {
- // For blocks with low spatial variance and color sad,
- // favor the intra-modes, only on scene/slide change.
- if (cpi->rc.high_source_sad && x->source_variance < 800 &&
- (x->color_sensitivity[0] || x->color_sensitivity[1]))
- this_rdc.rdcost = CALC_BIASED_RDCOST(this_rdc.rdcost);
- // Otherwise bias against intra for blocks with zero
- // motion and no color, on non-scene/slide changes.
- else if (!cpi->rc.high_source_sad && x->source_variance > 0 &&
- x->content_state_sb.source_sad_nonrd == kZeroSad &&
- x->color_sensitivity[0] == 0 && x->color_sensitivity[1] == 0)
- this_rdc.rdcost = (3 * this_rdc.rdcost) >> 1;
- }
-
- if (this_rdc.rdcost < best_rdc->rdcost) {
- *best_rdc = this_rdc;
- best_pickmode->best_mode = this_mode;
- best_pickmode->best_tx_size = mi->tx_size;
- best_pickmode->best_ref_frame = INTRA_FRAME;
- best_pickmode->best_second_ref_frame = NONE;
- best_pickmode->best_mode_skip_txfm = this_rdc.skip_txfm;
- if (!this_rdc.skip_txfm) {
- memcpy(ctx->blk_skip, x->txfm_search_info.blk_skip,
- sizeof(x->txfm_search_info.blk_skip[0]) * ctx->num_4x4_blk);
- }
- mi->uv_mode = this_mode;
- mi->mv[0].as_int = INVALID_MV;
- mi->mv[1].as_int = INVALID_MV;
- }
- }
- mi->tx_size = best_pickmode->best_tx_size;
-}
-
static AOM_INLINE int is_filter_search_enabled_blk(
AV1_COMP *cpi, MACROBLOCK *x, int mi_row, int mi_col, BLOCK_SIZE bsize,
int segment_id, int cb_pred_filter_search, InterpFilter *filt_select) {
@@ -3043,10 +1928,9 @@
int shift = 3;
if (source_sad_nonrd >= kMedSad &&
cpi->oxcf.tune_cfg.content != AOM_CONTENT_SCREEN &&
- (int64_t) cpi->common.width * (int64_t) cpi->common.height >=
- (int64_t) 640 * 360) {
+ cpi->common.width * cpi->common.height >= 640 * 360)
shift = 4;
- } else if (cpi->oxcf.tune_cfg.content == AOM_CONTENT_SCREEN &&
+ if (cpi->oxcf.tune_cfg.content == AOM_CONTENT_SCREEN &&
cpi->rc.high_source_sad) {
shift = 6;
}
@@ -3062,26 +1946,28 @@
noise_level = av1_noise_estimate_extract_level(&cpi->noise_estimate);
if (noise_level == kLow && source_variance > thresh_spatial &&
cpi->oxcf.tune_cfg.content != AOM_CONTENT_SCREEN && norm_sad < 50) {
- x->color_sensitivity[0] = 0;
- x->color_sensitivity[1] = 0;
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_U)] = 0;
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_V)] = 0;
return;
}
const int num_planes = av1_num_planes(&cpi->common);
- for (int i = 1; i < num_planes; ++i) {
- if (x->color_sensitivity[i - 1] == 2 || source_variance < 50) {
- struct macroblock_plane *const p = &x->plane[i];
+
+ for (int plane = AOM_PLANE_U; plane < num_planes; ++plane) {
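+ // COLOR_SENS_IDX() is assumed to map a chroma plane to its slot in the
+ // two-entry color sensitivity arrays (AOM_PLANE_U -> 0, AOM_PLANE_V -> 1).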
+ if (x->color_sensitivity[COLOR_SENS_IDX(plane)] == 2 ||
+ source_variance < 50) {
+ struct macroblock_plane *const p = &x->plane[plane];
const BLOCK_SIZE bs =
get_plane_block_size(bsize, subsampling_x, subsampling_y);
const int uv_sad = cpi->ppi->fn_ptr[bs].sdf(
- p->src.buf, p->src.stride, yv12_mb[i].buf, yv12_mb[i].stride);
+ p->src.buf, p->src.stride, yv12_mb[plane].buf, yv12_mb[plane].stride);
const int norm_uv_sad =
uv_sad >> (b_width_log2_lookup[bs] + b_height_log2_lookup[bs]);
- x->color_sensitivity[i - 1] =
+ x->color_sensitivity[COLOR_SENS_IDX(plane)] =
uv_sad > (y_sad >> shift) && norm_uv_sad > 40;
if (source_variance < 50 && norm_uv_sad > 100)
- x->color_sensitivity[i - 1] = 1;
+ x->color_sensitivity[COLOR_SENS_IDX(plane)] = 1;
}
}
}
@@ -3115,8 +2001,8 @@
*ref_mv_idx = mbmi->ref_mv_idx + 1;
}
-static void set_compound_mode(MACROBLOCK *x, int ref_frame, int ref_frame2,
- int ref_mv_idx,
+static void set_compound_mode(MACROBLOCK *x, MV_REFERENCE_FRAME ref_frame,
+ MV_REFERENCE_FRAME ref_frame2, int ref_mv_idx,
int_mv frame_mv[MB_MODE_COUNT][REF_FRAMES],
PREDICTION_MODE this_mode) {
MACROBLOCKD *const xd = &x->e_mbd;
@@ -3168,7 +2054,7 @@
}
static AOM_FORCE_INLINE void fill_single_inter_mode_costs(
- int (*single_inter_mode_costs)[REF_FRAMES], const int num_inter_modes,
+ int (*single_inter_mode_costs)[REF_FRAMES], int num_inter_modes,
const REF_MODE *reference_mode_set, const ModeCosts *mode_costs,
const int16_t *mode_context) {
bool ref_frame_used[REF_FRAMES] = { false };
@@ -3216,18 +2102,29 @@
PREDICTION_MODE *this_mode, MV_REFERENCE_FRAME *ref_frame,
MV_REFERENCE_FRAME *ref_frame2, int_mv frame_mv[MB_MODE_COUNT][REF_FRAMES],
const int *use_ref_frame_mask, int comp_index,
- bool comp_use_zero_zeromv_only, MV_REFERENCE_FRAME *last_comp_ref_frame) {
+ bool comp_use_zero_zeromv_only, MV_REFERENCE_FRAME *last_comp_ref_frame,
+ BLOCK_SIZE bsize) {
const MV_REFERENCE_FRAME *rf = comp_ref_mode_set[comp_index].ref_frame;
+ int skip_gf = 0;
+ int skip_alt = 0;
*this_mode = comp_ref_mode_set[comp_index].pred_mode;
*ref_frame = rf[0];
*ref_frame2 = rf[1];
assert(*ref_frame == LAST_FRAME);
assert(*this_mode == GLOBAL_GLOBALMV || *this_mode == NEAREST_NEARESTMV);
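+ // On low-variance blocks larger than 16x16, compound candidates pairing
+ // LAST with GOLDEN/ALTREF are dropped when the superblock color
+ // sensitivity is set for that reference (via skip_gf/skip_alt below).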
+ if (x->source_variance < 50 && bsize > BLOCK_16X16) {
+ if (x->color_sensitivity_sb_g[COLOR_SENS_IDX(AOM_PLANE_U)] == 1 ||
+ x->color_sensitivity_sb_g[COLOR_SENS_IDX(AOM_PLANE_V)] == 1)
+ skip_gf = 1;
+ if (x->color_sensitivity_sb_alt[COLOR_SENS_IDX(AOM_PLANE_U)] == 1 ||
+ x->color_sensitivity_sb_alt[COLOR_SENS_IDX(AOM_PLANE_V)] == 1)
+ skip_alt = 1;
+ }
if (comp_use_zero_zeromv_only && *this_mode != GLOBAL_GLOBALMV) {
return 0;
}
if (*ref_frame2 == GOLDEN_FRAME &&
- (cpi->sf.rt_sf.ref_frame_comp_nonrd[0] == 0 ||
+ (cpi->sf.rt_sf.ref_frame_comp_nonrd[0] == 0 || skip_gf ||
!(cpi->ref_frame_flags & AOM_GOLD_FLAG))) {
return 0;
} else if (*ref_frame2 == LAST2_FRAME &&
@@ -3235,7 +2132,7 @@
!(cpi->ref_frame_flags & AOM_LAST2_FLAG))) {
return 0;
} else if (*ref_frame2 == ALTREF_FRAME &&
- (cpi->sf.rt_sf.ref_frame_comp_nonrd[2] == 0 ||
+ (cpi->sf.rt_sf.ref_frame_comp_nonrd[2] == 0 || skip_alt ||
!(cpi->ref_frame_flags & AOM_ALT_FLAG))) {
return 0;
}
@@ -3313,16 +2210,15 @@
return false;
}
-// Function to setup parameters used for inter mode evaluation.
+// Function to set up parameters used for inter mode evaluation in non-rd.
static AOM_FORCE_INLINE void set_params_nonrd_pick_inter_mode(
AV1_COMP *cpi, MACROBLOCK *x, InterModeSearchStateNonrd *search_state,
- TileDataEnc *tile_data, PICK_MODE_CONTEXT *ctx, RD_STATS *rd_cost,
- int *force_skip_low_temp_var, int *skip_pred_mv, const int mi_row,
- const int mi_col, const int gf_temporal_ref, const unsigned char segment_id,
+ RD_STATS *rd_cost, int *force_skip_low_temp_var, int *skip_pred_mv,
+ int mi_row, int mi_col, int gf_temporal_ref, unsigned char segment_id,
BLOCK_SIZE bsize
#if CONFIG_AV1_TEMPORAL_DENOISING
,
- int denoise_svc_pickmode
+ PICK_MODE_CONTEXT *ctx, int denoise_svc_pickmode
#endif
) {
AV1_COMMON *const cm = &cpi->common;
@@ -3330,8 +2226,9 @@
TxfmSearchInfo *txfm_info = &x->txfm_search_info;
MB_MODE_INFO *const mi = xd->mi[0];
const ModeCosts *mode_costs = &x->mode_costs;
- (void)ctx;
+ // Initialize variance and distortion (chroma) for all modes and reference
+ // frames.
for (int idx = 0; idx < RTC_INTER_MODES; idx++) {
for (int ref = 0; ref < REF_FRAMES; ref++) {
search_state->vars[idx][ref] = UINT_MAX;
@@ -3339,23 +2236,26 @@
}
}
- x->color_sensitivity[0] = x->color_sensitivity_sb[0];
- x->color_sensitivity[1] = x->color_sensitivity_sb[1];
+ // Initialize color sensitivity from the superblock-level values.
+ av1_copy(x->color_sensitivity, x->color_sensitivity_sb);
+
init_best_pickmode(&search_state->best_pickmode);
+ // Estimate costs for single reference frames.
estimate_single_ref_frame_costs(cm, xd, mode_costs, segment_id, bsize,
search_state->ref_costs_single);
- memset(&search_state->mode_checked[0][0], 0, MB_MODE_COUNT * REF_FRAMES);
+ // Reset the flags that track which modes have been evaluated.
+ av1_zero(search_state->mode_checked);
txfm_info->skip_txfm = 0;
- // initialize mode decisions
+ // Initialize mode decisions
av1_invalid_rd_stats(&search_state->best_rdc);
av1_invalid_rd_stats(&search_state->this_rdc);
av1_invalid_rd_stats(rd_cost);
- for (int i = 0; i < REF_FRAMES; ++i) {
- x->warp_sample_info[i].num = -1;
+ for (int ref_idx = 0; ref_idx < REF_FRAMES; ++ref_idx) {
+ x->warp_sample_info[ref_idx].num = -1;
}
mi->bsize = bsize;
@@ -3371,25 +2271,28 @@
}
#endif
+ // Populate predicted motion vectors for LAST_FRAME.
if (cpi->ref_frame_flags & AOM_LAST_FLAG)
- find_predictors(cpi, x, LAST_FRAME, search_state->frame_mv, tile_data,
+ find_predictors(cpi, x, LAST_FRAME, search_state->frame_mv,
search_state->yv12_mb, bsize, *force_skip_low_temp_var,
x->force_zeromv_skip_for_blk);
+ // Update the mask of reference frames to be used.
get_ref_frame_use_mask(cpi, x, mi, mi_row, mi_col, bsize, gf_temporal_ref,
search_state->use_ref_frame_mask,
force_skip_low_temp_var);
- *skip_pred_mv =
- x->force_zeromv_skip_for_blk ||
- (x->nonrd_prune_ref_frame_search > 2 && x->color_sensitivity[0] != 2 &&
- x->color_sensitivity[1] != 2);
+ *skip_pred_mv = x->force_zeromv_skip_for_blk ||
+ (x->nonrd_prune_ref_frame_search > 2 &&
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_U)] != 2 &&
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_V)] != 2);
+ // Populate predicted motion vectors for the other single reference frames.
// Start at LAST_FRAME + 1.
for (MV_REFERENCE_FRAME ref_frame_iter = LAST_FRAME + 1;
ref_frame_iter <= ALTREF_FRAME; ++ref_frame_iter) {
if (search_state->use_ref_frame_mask[ref_frame_iter]) {
- find_predictors(cpi, x, ref_frame_iter, search_state->frame_mv, tile_data,
+ find_predictors(cpi, x, ref_frame_iter, search_state->frame_mv,
search_state->yv12_mb, bsize, *force_skip_low_temp_var,
*skip_pred_mv);
}
@@ -3400,52 +2303,60 @@
// speed features settings.
static AOM_FORCE_INLINE bool skip_inter_mode_nonrd(
AV1_COMP *cpi, MACROBLOCK *x, InterModeSearchStateNonrd *search_state,
- int64_t *thresh_sad_pred, int *force_mv_inter_layer, int *comp_pred,
+ int64_t *thresh_sad_pred, int *force_mv_inter_layer, int *is_single_pred,
PREDICTION_MODE *this_mode, MV_REFERENCE_FRAME *last_comp_ref_frame,
MV_REFERENCE_FRAME *ref_frame, MV_REFERENCE_FRAME *ref_frame2, int idx,
- int svc_mv_col, int svc_mv_row, int force_skip_low_temp_var,
- unsigned int sse_zeromv_norm, const int num_inter_modes,
- const unsigned char segment_id, BLOCK_SIZE bsize,
+ int_mv svc_mv, int force_skip_low_temp_var, unsigned int sse_zeromv_norm,
+ int num_inter_modes, unsigned char segment_id, BLOCK_SIZE bsize,
bool comp_use_zero_zeromv_only, bool check_globalmv) {
AV1_COMMON *const cm = &cpi->common;
const struct segmentation *const seg = &cm->seg;
const SVC *const svc = &cpi->svc;
MACROBLOCKD *const xd = &x->e_mbd;
MB_MODE_INFO *const mi = xd->mi[0];
+ const REAL_TIME_SPEED_FEATURES *const rt_sf = &cpi->sf.rt_sf;
+ // Skip compound modes based on the reference frame mask and the mode type;
+ // for allowed compound modes, set up the ref-mv stack and reference frames.
if (idx >= num_inter_modes) {
const int comp_index = idx - num_inter_modes;
if (!setup_compound_params_from_comp_idx(
cpi, x, search_state->yv12_mb, this_mode, ref_frame, ref_frame2,
search_state->frame_mv, search_state->use_ref_frame_mask,
- comp_index, comp_use_zero_zeromv_only, last_comp_ref_frame)) {
+ comp_index, comp_use_zero_zeromv_only, last_comp_ref_frame,
+ bsize)) {
return true;
}
- *comp_pred = 1;
+ *is_single_pred = 0;
} else {
*this_mode = ref_mode_set[idx].pred_mode;
*ref_frame = ref_mode_set[idx].ref_frame;
*ref_frame2 = NONE_FRAME;
}
- if (!*comp_pred && search_state->mode_checked[*this_mode][*ref_frame]) {
+ // Skip a single reference mode whose mode-checked flag is already set.
+ if (*is_single_pred && search_state->mode_checked[*this_mode][*ref_frame]) {
return true;
}
+ // Skip GLOBALMV mode if the check_globalmv flag is not enabled.
if (!check_globalmv && *this_mode == GLOBALMV) {
return true;
}
-#if COLLECT_PICK_MODE_STAT
- aom_usec_timer_start(&ms_stat.timer1);
- ms_stat.num_searches[bsize][*this_mode]++;
+#if COLLECT_NONRD_PICK_MODE_STAT
+ aom_usec_timer_start(&x->ms_stat_nonrd.timer1);
+ x->ms_stat_nonrd.num_searches[bsize][*this_mode]++;
#endif
mi->mode = *this_mode;
mi->ref_frame[0] = *ref_frame;
mi->ref_frame[1] = *ref_frame2;
+ // Skip the mode if its reference frame is disabled in use_ref_frame_mask.
if (!search_state->use_ref_frame_mask[*ref_frame]) return true;
+ // Skip certain modes and reference frames when the
+ // force_zeromv_skip_for_blk flag is true.
if (x->force_zeromv_skip_for_blk &&
((!(*this_mode == NEARESTMV &&
search_state->frame_mv[*this_mode][*ref_frame].as_int == 0) &&
@@ -3453,7 +2364,9 @@
*ref_frame != LAST_FRAME))
return true;
- if (cpi->sf.rt_sf.prune_compoundmode_with_singlemode_var && *comp_pred &&
+ // Skip compound mode based on variance of previously evaluated single
+ // reference modes.
+ if (rt_sf->prune_compoundmode_with_singlemode_var && !*is_single_pred &&
prune_compoundmode_with_singlemode_var(
*this_mode, *ref_frame, *ref_frame2, search_state->frame_mv,
search_state->mode_checked, search_state->vars,
@@ -3466,17 +2379,14 @@
((*ref_frame == LAST_FRAME && svc->skip_mvsearch_last) ||
(*ref_frame == GOLDEN_FRAME && svc->skip_mvsearch_gf) ||
(*ref_frame == ALTREF_FRAME && svc->skip_mvsearch_altref))) {
- // Only test mode if NEARESTMV/NEARMV is (svc_mv_col, svc_mv_row),
- // otherwise set NEWMV to (svc_mv_col, svc_mv_row).
+ // Only test mode if NEARESTMV/NEARMV is (svc_mv.as_mv.col, svc_mv.as_mv.row),
+ // otherwise set NEWMV to (svc_mv.as_mv.col, svc_mv.as_mv.row).
// Skip newmv and filter search.
*force_mv_inter_layer = 1;
if (*this_mode == NEWMV) {
- search_state->frame_mv[*this_mode][*ref_frame].as_mv.col = svc_mv_col;
- search_state->frame_mv[*this_mode][*ref_frame].as_mv.row = svc_mv_row;
- } else if (search_state->frame_mv[*this_mode][*ref_frame].as_mv.col !=
- svc_mv_col ||
- search_state->frame_mv[*this_mode][*ref_frame].as_mv.row !=
- svc_mv_row) {
+ search_state->frame_mv[*this_mode][*ref_frame] = svc_mv;
+ } else if (search_state->frame_mv[*this_mode][*ref_frame].as_int !=
+ svc_mv.as_int) {
return true;
}
}
@@ -3497,12 +2407,13 @@
// For the latter condition: the same condition should apply
// to newmv if (0, 0), so this latter condition is repeated
// below after search_new_mv.
- if (cpi->sf.rt_sf.source_metrics_sb_nonrd) {
+ if (rt_sf->source_metrics_sb_nonrd) {
if ((search_state->frame_mv[*this_mode][*ref_frame].as_int != 0 &&
x->content_state_sb.source_sad_nonrd == kZeroSad) ||
(search_state->frame_mv[*this_mode][*ref_frame].as_int == 0 &&
x->content_state_sb.source_sad_nonrd != kZeroSad &&
- ((x->color_sensitivity[0] == 0 && x->color_sensitivity[1] == 0) ||
+ ((x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_U)] == 0 &&
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_V)] == 0) ||
cpi->rc.high_source_sad) &&
x->source_variance == 0))
return true;
@@ -3511,15 +2422,19 @@
if (*this_mode == NEWMV && x->source_variance < 100) return true;
// Skip non-LAST for color on flat blocks.
if (*ref_frame > LAST_FRAME && x->source_variance == 0 &&
- (x->color_sensitivity[0] == 1 || x->color_sensitivity[1] == 1))
+ (x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_U)] == 1 ||
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_V)] == 1))
return true;
}
+ // Skip the mode based on block size, reference frame and other block
+ // properties.
if (skip_mode_by_bsize_and_ref_frame(
*this_mode, *ref_frame, bsize, x->nonrd_prune_ref_frame_search,
- sse_zeromv_norm, cpi->sf.rt_sf.nonrd_aggressive_skip))
+ sse_zeromv_norm, rt_sf->nonrd_aggressive_skip))
return true;
+ // Skip the mode based on low temporal variance and source sad.
if (skip_mode_by_low_temp(*this_mode, *ref_frame, bsize, x->content_state_sb,
search_state->frame_mv[*this_mode][*ref_frame],
force_skip_low_temp_var))
@@ -3530,7 +2445,7 @@
// end up unable to pick any mode.
if (!segfeature_active(seg, segment_id, SEG_LVL_REF_FRAME)) {
// Check for skipping GOLDEN and ALTREF based pred_mv_sad.
- if (cpi->sf.rt_sf.nonrd_prune_ref_frame_search > 0 &&
+ if (rt_sf->nonrd_prune_ref_frame_search > 0 &&
x->pred_mv_sad[*ref_frame] != INT_MAX && *ref_frame != LAST_FRAME) {
if ((int64_t)(x->pred_mv_sad[*ref_frame]) > *thresh_sad_pred) return true;
}
@@ -3541,19 +2456,607 @@
x->pred_mv1_sad[*ref_frame] > (x->pred_mv0_sad[*ref_frame] << 1))
return true;
- if (!*comp_pred) {
+ // Skip the single reference mode based on the RD threshold.
+ if (*is_single_pred) {
if (skip_mode_by_threshold(
*this_mode, *ref_frame,
search_state->frame_mv[*this_mode][*ref_frame],
cpi->rc.frames_since_golden, cpi->rd.threshes[segment_id][bsize],
x->thresh_freq_fact[bsize], search_state->best_rdc.rdcost,
search_state->best_pickmode.best_mode_skip_txfm,
- (cpi->sf.rt_sf.nonrd_aggressive_skip ? 1 : 0)))
+ (rt_sf->nonrd_aggressive_skip ? 1 : 0)))
return true;
}
return false;
}
+// Function to perform inter mode evaluation for non-rd.
+static AOM_FORCE_INLINE bool handle_inter_mode_nonrd(
+ AV1_COMP *cpi, MACROBLOCK *x, InterModeSearchStateNonrd *search_state,
+ PICK_MODE_CONTEXT *ctx, PRED_BUFFER **this_mode_pred,
+ PRED_BUFFER *tmp_buffer, InterPredParams inter_pred_params_sr,
+ int *best_early_term, unsigned int *sse_zeromv_norm, bool *check_globalmv,
+#if CONFIG_AV1_TEMPORAL_DENOISING
+ int64_t *zero_last_cost_orig, int denoise_svc_pickmode,
+#endif
+ int idx, int force_mv_inter_layer, int is_single_pred, int skip_pred_mv,
+ int gf_temporal_ref, int use_model_yrd_large, int filter_search_enabled_blk,
+ BLOCK_SIZE bsize, PREDICTION_MODE this_mode, InterpFilter filt_select,
+ int cb_pred_filter_search, int reuse_inter_pred) {
+ AV1_COMMON *const cm = &cpi->common;
+ MACROBLOCKD *const xd = &x->e_mbd;
+ MB_MODE_INFO *const mi = xd->mi[0];
+ const MB_MODE_INFO_EXT *const mbmi_ext = &x->mbmi_ext;
+ const int mi_row = xd->mi_row;
+ const int mi_col = xd->mi_col;
+ struct macroblockd_plane *const pd = &xd->plane[AOM_PLANE_Y];
+ const int bw = block_size_wide[bsize];
+ const InterpFilter filter_ref = cm->features.interp_filter;
+ const InterpFilter default_interp_filter = EIGHTTAP_REGULAR;
+ TxfmSearchInfo *txfm_info = &x->txfm_search_info;
+ const ModeCosts *mode_costs = &x->mode_costs;
+ const REAL_TIME_SPEED_FEATURES *const rt_sf = &cpi->sf.rt_sf;
+ BEST_PICKMODE *const best_pickmode = &search_state->best_pickmode;
+
+ MV_REFERENCE_FRAME ref_frame = mi->ref_frame[0];
+ MV_REFERENCE_FRAME ref_frame2 = mi->ref_frame[1];
+ int_mv *const this_mv = &search_state->frame_mv[this_mode][ref_frame];
+ unsigned int var = UINT_MAX;
+ int this_early_term = 0;
+ int rate_mv = 0;
+ int is_skippable;
+ int skip_this_mv = 0;
+ unsigned int var_threshold = UINT_MAX;
+ PREDICTION_MODE this_best_mode;
+ RD_STATS nonskip_rdc;
+ av1_invalid_rd_stats(&nonskip_rdc);
+
+ if (this_mode == NEWMV && !force_mv_inter_layer) {
+#if COLLECT_NONRD_PICK_MODE_STAT
+ aom_usec_timer_start(&x->ms_stat_nonrd.timer2);
+#endif
+ // Find the best motion vector for single/compound mode.
+ const bool skip_newmv = search_new_mv(
+ cpi, x, search_state->frame_mv, ref_frame, gf_temporal_ref, bsize,
+ mi_row, mi_col, &rate_mv, &search_state->best_rdc);
+#if COLLECT_NONRD_PICK_MODE_STAT
+ aom_usec_timer_mark(&x->ms_stat_nonrd.timer2);
+ x->ms_stat_nonrd.ms_time[bsize][this_mode] +=
+ aom_usec_timer_elapsed(&x->ms_stat_nonrd.timer2);
+#endif
+ // Skip NEWMV mode:
+ // (i) for bsize smaller than 16x16,
+ // (ii) based on the SAD of the predicted MV w.r.t. LAST_FRAME,
+ // (iii) when the motion vector is the same as the reference MV.
+ if (skip_newmv) {
+ return true;
+ }
+ }
+
+ // Check whether the current motion vector matches one of the previously
+ // evaluated motion vectors.
+ for (PREDICTION_MODE inter_mv_mode = NEARESTMV; inter_mv_mode <= NEWMV;
+ inter_mv_mode++) {
+ if (inter_mv_mode == this_mode) continue;
+ if (is_single_pred &&
+ search_state->mode_checked[inter_mv_mode][ref_frame] &&
+ this_mv->as_int ==
+ search_state->frame_mv[inter_mv_mode][ref_frame].as_int) {
+ skip_this_mv = 1;
+ break;
+ }
+ }
+
+ // Skip a single reference mode if its motion vector duplicates one of the
+ // previously evaluated motion vectors.
+ if (skip_this_mv && is_single_pred) return true;
+
+ // For screen content: on spatially flat blocks with non-zero source motion,
+ // skip NEWMV when its motion vector is (0, 0) and color sensitivity is not
+ // set.
+ if (this_mode == NEWMV && cpi->oxcf.tune_cfg.content == AOM_CONTENT_SCREEN &&
+ cpi->svc.spatial_layer_id == 0 && rt_sf->source_metrics_sb_nonrd) {
+ if (this_mv->as_int == 0 &&
+ x->content_state_sb.source_sad_nonrd != kZeroSad &&
+ ((x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_U)] == 0 &&
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_V)] == 0) ||
+ cpi->rc.high_source_sad) &&
+ x->source_variance == 0)
+ return true;
+ }
+
+ mi->mode = this_mode;
+ mi->mv[0].as_int = this_mv->as_int;
+ mi->mv[1].as_int = 0;
+ if (!is_single_pred)
+ mi->mv[1].as_int = search_state->frame_mv[this_mode][ref_frame2].as_int;
+
+ // Set buffers to store predicted samples for reuse.
+ if (reuse_inter_pred) {
+ if (!*this_mode_pred) {
+ *this_mode_pred = &tmp_buffer[3];
+ } else {
+ *this_mode_pred = &tmp_buffer[get_pred_buffer(tmp_buffer, 3)];
+ pd->dst.buf = (*this_mode_pred)->data;
+ pd->dst.stride = bw;
+ }
+ }
+
+ if (idx == 0 && !skip_pred_mv) {
+ // Set color sensitivity on first tested mode only.
+ // Use y-sad already computed in find_predictors: take the sad with motion
+ // vector closest to 0; the uv-sad computed below in set_color_sensitivity
+ // is for zeromv.
+ // For screen: first check if the golden reference is being used; if so,
+ // force color_sensitivity on when the superblock-level color sensitivity
+ // (sb_g) is on.
+ if (cpi->oxcf.tune_cfg.content == AOM_CONTENT_SCREEN &&
+ search_state->use_ref_frame_mask[GOLDEN_FRAME]) {
+ if (x->color_sensitivity_sb_g[COLOR_SENS_IDX(AOM_PLANE_U)] == 1)
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_U)] = 1;
+ if (x->color_sensitivity_sb_g[COLOR_SENS_IDX(AOM_PLANE_V)] == 1)
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_V)] = 1;
+ } else {
+ int y_sad = x->pred_mv0_sad[LAST_FRAME];
+ if (x->pred_mv1_sad[LAST_FRAME] != INT_MAX &&
+ (abs(search_state->frame_mv[NEARMV][LAST_FRAME].as_mv.col) +
+ abs(search_state->frame_mv[NEARMV][LAST_FRAME].as_mv.row)) <
+ (abs(search_state->frame_mv[NEARESTMV][LAST_FRAME].as_mv.col) +
+ abs(search_state->frame_mv[NEARESTMV][LAST_FRAME].as_mv.row)))
+ y_sad = x->pred_mv1_sad[LAST_FRAME];
+ set_color_sensitivity(cpi, x, bsize, y_sad, x->source_variance,
+ search_state->yv12_mb[LAST_FRAME]);
+ }
+ }
+
+ mi->motion_mode = SIMPLE_TRANSLATION;
+#if !CONFIG_REALTIME_ONLY
+ if (cpi->oxcf.motion_mode_cfg.allow_warped_motion) {
+ calc_num_proj_ref(cpi, x, mi);
+ }
+#endif
+ // Set the variance threshold for compound mode pruning.
+ if (rt_sf->prune_compoundmode_with_singlecompound_var && !is_single_pred &&
+ use_model_yrd_large) {
+ const PREDICTION_MODE single_mode0 = compound_ref0_mode(this_mode);
+ const PREDICTION_MODE single_mode1 = compound_ref1_mode(this_mode);
+ var_threshold =
+ AOMMIN(var_threshold,
+ search_state->vars[INTER_OFFSET(single_mode0)][ref_frame]);
+ var_threshold =
+ AOMMIN(var_threshold,
+ search_state->vars[INTER_OFFSET(single_mode1)][ref_frame2]);
+ }
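+ // var_threshold now holds the smaller of the two constituent single-mode
+ // variances; a compound candidate whose modeled variance exceeds it is
+ // pruned right after prediction below.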
+
+ // Decide the interpolation filter, build the prediction signal, get the SSE.
+ const bool is_mv_subpel =
+ (mi->mv[0].as_mv.row & 0x07) || (mi->mv[0].as_mv.col & 0x07);
+ const bool enable_filt_search_this_mode =
+ (filter_search_enabled_blk == 2)
+ ? true
+ : (filter_search_enabled_blk && !force_mv_inter_layer &&
+ is_single_pred &&
+ (ref_frame == LAST_FRAME || !x->nonrd_prune_ref_frame_search));
+ if (is_mv_subpel && enable_filt_search_this_mode) {
+#if COLLECT_NONRD_PICK_MODE_STAT
+ aom_usec_timer_start(&x->ms_stat_nonrd.timer2);
+#endif
+ search_filter_ref(
+ cpi, x, &search_state->this_rdc, &inter_pred_params_sr, mi_row, mi_col,
+ tmp_buffer, bsize, reuse_inter_pred, this_mode_pred, &this_early_term,
+ &var, use_model_yrd_large, best_pickmode->best_sse, is_single_pred);
+#if COLLECT_NONRD_PICK_MODE_STAT
+ aom_usec_timer_mark(&x->ms_stat_nonrd.timer2);
+ x->ms_stat_nonrd.ifs_time[bsize][this_mode] +=
+ aom_usec_timer_elapsed(&x->ms_stat_nonrd.timer2);
+#endif
+#if !CONFIG_REALTIME_ONLY
+ } else if (cpi->oxcf.motion_mode_cfg.allow_warped_motion &&
+ this_mode == NEWMV) {
+ // Find the best motion mode when the current mode is NEWMV.
+ search_motion_mode(cpi, x, &search_state->this_rdc, mi_row, mi_col, bsize,
+ &this_early_term, use_model_yrd_large, &rate_mv,
+ best_pickmode->best_sse);
+ if (this_mode == NEWMV) {
+ this_mv[0] = mi->mv[0];
+ }
+#endif
+ } else {
+ mi->interp_filters =
+ (filter_ref == SWITCHABLE)
+ ? av1_broadcast_interp_filter(default_interp_filter)
+ : av1_broadcast_interp_filter(filter_ref);
+ if (force_mv_inter_layer)
+ mi->interp_filters = av1_broadcast_interp_filter(EIGHTTAP_REGULAR);
+
+ // If it is sub-pel motion and cb_pred_filter_search is enabled, select
+ // the pre-decided filter.
+ if (is_mv_subpel && cb_pred_filter_search)
+ mi->interp_filters = av1_broadcast_interp_filter(filt_select);
+
+#if COLLECT_NONRD_PICK_MODE_STAT
+ aom_usec_timer_start(&x->ms_stat_nonrd.timer2);
+#endif
+ if (is_single_pred) {
+ SubpelParams subpel_params;
+ // Initialize inter mode level params for single reference mode.
+ init_inter_mode_params(&mi->mv[0].as_mv, &inter_pred_params_sr,
+ &subpel_params, xd->block_ref_scale_factors[0],
+ pd->pre->width, pd->pre->height);
+ av1_enc_build_inter_predictor_y_nonrd(xd, &inter_pred_params_sr,
+ &subpel_params);
+ } else {
+ av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, NULL, bsize,
+ AOM_PLANE_Y, AOM_PLANE_Y);
+ }
+
+ if (use_model_yrd_large) {
+ model_skip_for_sb_y_large(cpi, bsize, mi_row, mi_col, x, xd,
+ &search_state->this_rdc, &this_early_term, 0,
+ best_pickmode->best_sse, &var, var_threshold);
+ } else {
+ model_rd_for_sb_y(cpi, bsize, x, xd, &search_state->this_rdc, &var, 0,
+ &this_early_term);
+ }
+#if COLLECT_NONRD_PICK_MODE_STAT
+ aom_usec_timer_mark(&x->ms_stat_nonrd.timer2);
+ x->ms_stat_nonrd.model_rd_time[bsize][this_mode] +=
+ aom_usec_timer_elapsed(&x->ms_stat_nonrd.timer2);
+#endif
+ }
+
+ // Update the variance for the single reference mode.
+ if (is_single_pred) {
+ search_state->vars[INTER_OFFSET(this_mode)][ref_frame] = var;
+ if (this_mv->as_int == 0) {
+ search_state->vars[INTER_OFFSET(GLOBALMV)][ref_frame] = var;
+ }
+ }
+ // Prune the compound mode based on the single mode variance threshold.
+ if (!is_single_pred && var > var_threshold) {
+ if (reuse_inter_pred) free_pred_buffer(*this_mode_pred);
+ return true;
+ }
+
+ if (ref_frame == LAST_FRAME && this_mv->as_int == 0) {
+ *sse_zeromv_norm = (unsigned int)(search_state->this_rdc.sse >>
+ (b_width_log2_lookup[bsize] +
+ b_height_log2_lookup[bsize]));
+ }
+
+ // Perform early termination based on sse.
+ if (rt_sf->sse_early_term_inter_search &&
+ early_term_inter_search_with_sse(rt_sf->sse_early_term_inter_search,
+ bsize, search_state->this_rdc.sse,
+ best_pickmode->best_sse, this_mode)) {
+ if (reuse_inter_pred) free_pred_buffer(*this_mode_pred);
+ return true;
+ }
+
+#if COLLECT_NONRD_PICK_MODE_STAT
+ x->ms_stat_nonrd.num_nonskipped_searches[bsize][this_mode]++;
+#endif
+
+ const int skip_ctx = av1_get_skip_txfm_context(xd);
+ const int skip_txfm_cost = mode_costs->skip_txfm_cost[skip_ctx][1];
+ const int no_skip_txfm_cost = mode_costs->skip_txfm_cost[skip_ctx][0];
+ const int64_t sse_y = search_state->this_rdc.sse;
+
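+ // On early termination, treat the block as skip: charge only the skip_txfm
+ // flag cost and approximate the distortion from the modeled SSE (scaled
+ // into the distortion domain used by RDCOST).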
+ if (this_early_term) {
+ search_state->this_rdc.skip_txfm = 1;
+ search_state->this_rdc.rate = skip_txfm_cost;
+ search_state->this_rdc.dist = search_state->this_rdc.sse << 4;
+ } else {
+#if COLLECT_NONRD_PICK_MODE_STAT
+ aom_usec_timer_start(&x->ms_stat_nonrd.timer2);
+#endif
+ // Calculate the RD cost using the Hadamard transform.
+ av1_block_yrd(x, &search_state->this_rdc, &is_skippable, bsize,
+ mi->tx_size);
+ if (search_state->this_rdc.skip_txfm ||
+ RDCOST(x->rdmult, search_state->this_rdc.rate,
+ search_state->this_rdc.dist) >=
+ RDCOST(x->rdmult, 0, search_state->this_rdc.sse)) {
+ if (!search_state->this_rdc.skip_txfm) {
+ // Need to store "real" rdc for possible future use if UV rdc
+ // disallows tx skip
+ nonskip_rdc = search_state->this_rdc;
+ nonskip_rdc.rate += no_skip_txfm_cost;
+ }
+ search_state->this_rdc.rate = skip_txfm_cost;
+ search_state->this_rdc.skip_txfm = 1;
+ search_state->this_rdc.dist = search_state->this_rdc.sse;
+ } else {
+ search_state->this_rdc.rate += no_skip_txfm_cost;
+ }
+
+ // Populate predicted samples for chroma planes based on color sensitivity.
+ if ((x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_U)] ||
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_V)])) {
+ RD_STATS rdc_uv;
+ const BLOCK_SIZE uv_bsize =
+ get_plane_block_size(bsize, xd->plane[AOM_PLANE_U].subsampling_x,
+ xd->plane[AOM_PLANE_U].subsampling_y);
+ if (x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_U)]) {
+ av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, NULL, bsize,
+ AOM_PLANE_U, AOM_PLANE_U);
+ }
+ if (x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_V)]) {
+ av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, NULL, bsize,
+ AOM_PLANE_V, AOM_PLANE_V);
+ }
+ // Compute sse for chroma planes.
+ const int64_t sse_uv = av1_model_rd_for_sb_uv(
+ cpi, uv_bsize, x, xd, &rdc_uv, AOM_PLANE_U, AOM_PLANE_V);
+ search_state->this_rdc.sse += sse_uv;
+ // Restore Y rdc if UV rdc disallows txfm skip
+ if (search_state->this_rdc.skip_txfm && !rdc_uv.skip_txfm &&
+ nonskip_rdc.rate != INT_MAX)
+ search_state->this_rdc = nonskip_rdc;
+ if (is_single_pred) {
+ search_state->uv_dist[INTER_OFFSET(this_mode)][ref_frame] = rdc_uv.dist;
+ }
+ search_state->this_rdc.rate += rdc_uv.rate;
+ search_state->this_rdc.dist += rdc_uv.dist;
+ search_state->this_rdc.skip_txfm =
+ search_state->this_rdc.skip_txfm && rdc_uv.skip_txfm;
+ }
+#if COLLECT_NONRD_PICK_MODE_STAT
+ aom_usec_timer_mark(&x->ms_stat_nonrd.timer2);
+ x->ms_stat_nonrd.txfm_time[bsize][this_mode] +=
+ aom_usec_timer_elapsed(&x->ms_stat_nonrd.timer2);
+#endif
+ }
+
+ this_best_mode = this_mode;
+ // TODO(kyslov) account for UV prediction cost
+ search_state->this_rdc.rate += rate_mv;
+ if (!is_single_pred) {
+ const int16_t mode_ctx =
+ av1_mode_context_analyzer(mbmi_ext->mode_context, mi->ref_frame);
+ search_state->this_rdc.rate += cost_mv_ref(mode_costs, this_mode, mode_ctx);
+ } else {
+ // If the current mode has zeromv but is not GLOBALMV, compare the rate
+ // cost. If GLOBALMV is cheaper, use GLOBALMV instead.
+ if (this_mode != GLOBALMV &&
+ this_mv->as_int == search_state->frame_mv[GLOBALMV][ref_frame].as_int) {
+ if (is_globalmv_better(this_mode, ref_frame, rate_mv, mode_costs,
+ search_state->single_inter_mode_costs, mbmi_ext)) {
+ this_best_mode = GLOBALMV;
+ }
+ }
+
+ search_state->this_rdc.rate +=
+ search_state
+ ->single_inter_mode_costs[INTER_OFFSET(this_best_mode)][ref_frame];
+ }
+
+ if (is_single_pred && this_mv->as_int == 0 && var < UINT_MAX) {
+ search_state->vars[INTER_OFFSET(GLOBALMV)][ref_frame] = var;
+ }
+
+ search_state->this_rdc.rate += search_state->ref_costs_single[ref_frame];
+
+ search_state->this_rdc.rdcost = RDCOST(x->rdmult, search_state->this_rdc.rate,
+ search_state->this_rdc.dist);
+ if (cpi->oxcf.rc_cfg.mode == AOM_CBR && is_single_pred) {
+ newmv_diff_bias(xd, this_best_mode, &search_state->this_rdc, bsize,
+ search_state->frame_mv[this_best_mode][ref_frame].as_mv.row,
+ search_state->frame_mv[this_best_mode][ref_frame].as_mv.col,
+ cpi->speed, x->source_variance, x->content_state_sb);
+ }
+
+#if CONFIG_AV1_TEMPORAL_DENOISING
+ if (cpi->oxcf.noise_sensitivity > 0 && denoise_svc_pickmode &&
+ cpi->denoiser.denoising_level > kDenLowLow) {
+ av1_denoiser_update_frame_stats(mi, sse_y, this_mode, ctx);
+ // Keep track of zero_last cost.
+ if (ref_frame == LAST_FRAME && this_mv->as_int == 0)
+ *zero_last_cost_orig = search_state->this_rdc.rdcost;
+ }
+#else
+ (void)(sse_y);
+#endif
+
+ search_state->mode_checked[this_mode][ref_frame] = 1;
+ search_state->mode_checked[this_best_mode][ref_frame] = 1;
+
+ if (*check_globalmv) {
+ int32_t abs_mv =
+ abs(search_state->frame_mv[this_best_mode][ref_frame].as_mv.row) +
+ abs(search_state->frame_mv[this_best_mode][ref_frame].as_mv.col);
+ // Early exit check: if the magnitude of this_best_mode's mv is small
+ // enough, we skip GLOBALMV check in the next loop iteration.
+ if (abs_mv < 2) {
+ *check_globalmv = false;
+ }
+ }
+#if COLLECT_NONRD_PICK_MODE_STAT
+ aom_usec_timer_mark(&x->ms_stat_nonrd.timer1);
+ x->ms_stat_nonrd.nonskipped_search_times[bsize][this_mode] +=
+ aom_usec_timer_elapsed(&x->ms_stat_nonrd.timer1);
+#endif
+
+ // Copy best mode params to search state
+ if (search_state->this_rdc.rdcost < search_state->best_rdc.rdcost) {
+ search_state->best_rdc = search_state->this_rdc;
+ *best_early_term = this_early_term;
+ update_search_state_nonrd(search_state, mi, txfm_info, &nonskip_rdc, ctx,
+ this_best_mode, sse_y);
+
+ // This is needed for the compound modes.
+ search_state->frame_mv_best[this_best_mode][ref_frame].as_int =
+ search_state->frame_mv[this_best_mode][ref_frame].as_int;
+ if (ref_frame2 > NONE_FRAME) {
+ search_state->frame_mv_best[this_best_mode][ref_frame2].as_int =
+ search_state->frame_mv[this_best_mode][ref_frame2].as_int;
+ }
+
+ if (reuse_inter_pred) {
+ free_pred_buffer(best_pickmode->best_pred);
+ best_pickmode->best_pred = *this_mode_pred;
+ }
+ } else {
+ if (reuse_inter_pred) free_pred_buffer(*this_mode_pred);
+ }
+
+ if (*best_early_term && (idx > 0 || rt_sf->nonrd_aggressive_skip)) {
+ txfm_info->skip_txfm = 1;
+ return false;
+ }
+ return true;
+}
+
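/*
 * Editor's sketch (not part of the patch): the skip-vs-no-skip transform
 * decision in handle_inter_mode_nonrd() forces skip_txfm whenever coding the
 * residual is modelled to be no cheaper than signalling skip and paying the
 * full SSE as distortion. The rdcost() below only mirrors the shape of
 * libaom's RDCOST macro; the shift constants are illustrative assumptions,
 * not the library's exact values.
 */
#include <stdint.h>
#include <stdio.h>

#define RDDIV_BITS 7 /* assumed distortion scaling */
#define RATE_SHIFT 9 /* assumed probability-cost shift */

static int64_t rdcost(int rdmult, int rate, int64_t dist) {
  return (((int64_t)rate * rdmult) >> RATE_SHIFT) + (dist << RDDIV_BITS);
}

int main(void) {
  const int rdmult = 128;
  const int rate = 900;     /* modelled bits for coding the residual */
  const int skip_rate = 40; /* bits for the skip_txfm flag itself */
  const int64_t dist = 5000, sse = 5200;
  /* Same comparison as the patch: residual coding vs. rate-free skip. */
  const int skip_txfm = rdcost(rdmult, rate, dist) >= rdcost(rdmult, 0, sse);
  printf("skip_txfm=%d rate=%d dist=%lld\n", skip_txfm,
         skip_txfm ? skip_rate : rate, (long long)(skip_txfm ? sse : dist));
  return 0;
}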
+// Function to perform screen content mode evaluation for non-rd
+static AOM_FORCE_INLINE void handle_screen_content_mode_nonrd(
+ AV1_COMP *cpi, MACROBLOCK *x, InterModeSearchStateNonrd *search_state,
+ PRED_BUFFER *this_mode_pred, PICK_MODE_CONTEXT *ctx,
+ PRED_BUFFER *tmp_buffer, struct buf_2d *orig_dst, int skip_idtx_palette,
+ int try_palette, BLOCK_SIZE bsize, int reuse_inter_pred, int mi_col,
+ int mi_row) {
+ AV1_COMMON *const cm = &cpi->common;
+ const REAL_TIME_SPEED_FEATURES *const rt_sf = &cpi->sf.rt_sf;
+ MACROBLOCKD *const xd = &x->e_mbd;
+ MB_MODE_INFO *const mi = xd->mi[0];
+ struct macroblockd_plane *const pd = &xd->plane[0];
+ const int bw = block_size_wide[bsize];
+ const int bh = block_size_high[bsize];
+ TxfmSearchInfo *txfm_info = &x->txfm_search_info;
+ BEST_PICKMODE *const best_pickmode = &search_state->best_pickmode;
+
+ // TODO(marpan): Only allow for 8 bit-depth for now, re-enable for 10/12 bit
+ // when issue 3359 is fixed.
+ if (cm->seq_params->bit_depth == 8 &&
+ cpi->oxcf.tune_cfg.content == AOM_CONTENT_SCREEN && !skip_idtx_palette &&
+ !cpi->oxcf.txfm_cfg.use_inter_dct_only && !x->force_zeromv_skip_for_blk &&
+ is_inter_mode(best_pickmode->best_mode) &&
+ best_pickmode->best_pred != NULL &&
+ (!rt_sf->prune_idtx_nonrd ||
+ (rt_sf->prune_idtx_nonrd && bsize <= BLOCK_32X32 &&
+ best_pickmode->best_mode_skip_txfm != 1 && x->source_variance > 200))) {
+ RD_STATS idtx_rdc;
+ av1_init_rd_stats(&idtx_rdc);
+ int is_skippable;
+ this_mode_pred = &tmp_buffer[get_pred_buffer(tmp_buffer, 3)];
+ pd->dst.buf = this_mode_pred->data;
+ pd->dst.stride = bw;
+ const PRED_BUFFER *const best_pred = best_pickmode->best_pred;
+ av1_block_yrd_idtx(x, best_pred->data, best_pred->stride, &idtx_rdc,
+ &is_skippable, bsize, mi->tx_size);
+ int64_t idx_rdcost_y = RDCOST(x->rdmult, idtx_rdc.rate, idtx_rdc.dist);
+ int allow_idtx = 1;
+ // Incorporate color into rd cost.
+ if ((x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_U)] ||
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_V)])) {
+ RD_STATS rdc_uv;
+ const BLOCK_SIZE uv_bsize =
+ get_plane_block_size(bsize, xd->plane[AOM_PLANE_U].subsampling_x,
+ xd->plane[AOM_PLANE_U].subsampling_y);
+ if (x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_U)]) {
+ av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, NULL, bsize,
+ AOM_PLANE_U, AOM_PLANE_U);
+ }
+ if (x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_V)]) {
+ av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, NULL, bsize,
+ AOM_PLANE_V, AOM_PLANE_V);
+ }
+ av1_model_rd_for_sb_uv(cpi, uv_bsize, x, xd, &rdc_uv, AOM_PLANE_U,
+ AOM_PLANE_V);
+ idtx_rdc.rate += rdc_uv.rate;
+ idtx_rdc.dist += rdc_uv.dist;
+ idtx_rdc.skip_txfm = idtx_rdc.skip_txfm && rdc_uv.skip_txfm;
+ if (idx_rdcost_y == 0 && rdc_uv.dist > 0 && x->source_variance < 3000 &&
+ x->content_state_sb.source_sad_nonrd > kMedSad)
+ allow_idtx = 0;
+ }
+ int64_t idx_rdcost = RDCOST(x->rdmult, idtx_rdc.rate, idtx_rdc.dist);
+ if (allow_idtx && idx_rdcost < search_state->best_rdc.rdcost) {
+ best_pickmode->tx_type = IDTX;
+ search_state->best_rdc.rdcost = idx_rdcost;
+ best_pickmode->best_mode_skip_txfm = idtx_rdc.skip_txfm;
+ if (!idtx_rdc.skip_txfm) {
+ memcpy(ctx->blk_skip, txfm_info->blk_skip,
+ sizeof(txfm_info->blk_skip[0]) * ctx->num_4x4_blk);
+ }
+ xd->tx_type_map[0] = best_pickmode->tx_type;
+ memset(ctx->tx_type_map, best_pickmode->tx_type, ctx->num_4x4_blk);
+ memset(xd->tx_type_map, best_pickmode->tx_type, ctx->num_4x4_blk);
+ }
+ pd->dst = *orig_dst;
+ }
+
+ if (!try_palette) return;
+ const unsigned int intra_ref_frame_cost =
+ search_state->ref_costs_single[INTRA_FRAME];
+
+ if (!is_mode_intra(best_pickmode->best_mode)) {
+ PRED_BUFFER *const best_pred = best_pickmode->best_pred;
+ if (reuse_inter_pred && best_pred != NULL) {
+ if (best_pred->data == orig_dst->buf) {
+ this_mode_pred = &tmp_buffer[get_pred_buffer(tmp_buffer, 3)];
+ aom_convolve_copy(best_pred->data, best_pred->stride,
+ this_mode_pred->data, this_mode_pred->stride, bw, bh);
+ best_pickmode->best_pred = this_mode_pred;
+ }
+ }
+ pd->dst = *orig_dst;
+ }
+ // Search palette mode for Luma plane in inter frame.
+ av1_search_palette_mode_luma(cpi, x, bsize, intra_ref_frame_cost, ctx,
+ &search_state->this_rdc,
+ search_state->best_rdc.rdcost);
+ // Update best mode data in search_state
+ if (search_state->this_rdc.rdcost < search_state->best_rdc.rdcost) {
+ best_pickmode->pmi = mi->palette_mode_info;
+ best_pickmode->best_mode = DC_PRED;
+ mi->mv[0].as_int = INVALID_MV;
+ mi->mv[1].as_int = INVALID_MV;
+ best_pickmode->best_ref_frame = INTRA_FRAME;
+    best_pickmode->best_second_ref_frame = NONE_FRAME;
+ search_state->best_rdc.rate = search_state->this_rdc.rate;
+ search_state->best_rdc.dist = search_state->this_rdc.dist;
+ search_state->best_rdc.rdcost = search_state->this_rdc.rdcost;
+ best_pickmode->best_mode_skip_txfm = search_state->this_rdc.skip_txfm;
+ // Keep the skip_txfm off if the color_sensitivity is set.
+ if (x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_U)] ||
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_V)])
+ search_state->this_rdc.skip_txfm = 0;
+ if (!search_state->this_rdc.skip_txfm) {
+ memcpy(ctx->blk_skip, txfm_info->blk_skip,
+ sizeof(txfm_info->blk_skip[0]) * ctx->num_4x4_blk);
+ }
+ if (xd->tx_type_map[0] != DCT_DCT)
+ av1_copy_array(ctx->tx_type_map, xd->tx_type_map, ctx->num_4x4_blk);
+ }
+}
+
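/*
 * Editor's sketch (not part of the patch): the IDTX acceptance gate in
 * handle_screen_content_mode_nonrd(), reduced to its inputs. IDTX replaces
 * the current best mode only when its combined luma+chroma RD cost beats the
 * best cost so far, and it is vetoed when the luma cost rounds to zero while
 * chroma still carries distortion on a flat, fast-moving block. The struct
 * and the rdcost() scaling are simplifications.
 */
#include <stdint.h>

typedef struct { int rate; int64_t dist; } rd_stats_t;

static int64_t rdcost(int rdmult, int rate, int64_t dist) {
  return (int64_t)rate * rdmult + (dist << 7); /* illustrative scaling */
}

/* Returns 1 when IDTX should be adopted as the new best mode. */
int accept_idtx(int rdmult, rd_stats_t y, rd_stats_t uv,
                unsigned source_variance, int source_sad_above_med,
                int64_t best_rdcost) {
  /* Veto from the patch: luma is "free", chroma is not, and the block is
   * flat and fast-moving. */
  if (rdcost(rdmult, y.rate, y.dist) == 0 && uv.dist > 0 &&
      source_variance < 3000 && source_sad_above_med)
    return 0;
  return rdcost(rdmult, y.rate + uv.rate, y.dist + uv.dist) < best_rdcost;
}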
+/*!\brief AV1 inter mode selection based on Non-RD optimized model.
+ *
+ * \ingroup nonrd_mode_search
+ * \callgraph
+ * Top level function for Non-RD optimized inter mode selection.
+ * This function loops over a subset of inter modes and selects the best one
+ * based on the calculated modelled RD cost. While deciding which modes to
+ * check, it applies heuristics based on previously checked modes, block
+ * residual variance, block size, and other factors to prune certain modes
+ * and reference frames. Single reference frame modes are always considered;
+ * a small set of compound modes may additionally be checked on larger
+ * blocks. Further heuristics are applied to decide if intra modes need to
+ * be checked.
+ *
+ * \param[in] cpi Top-level encoder structure
+ * \param[in] tile_data Pointer to struct holding adaptive
+ data/contexts/models for the tile during
+ encoding
+ * \param[in] x Pointer to structure holding all the data for
+ the current macroblock
+ * \param[in] rd_cost Struct to keep track of the RD information
+ * \param[in] bsize Current block size
+ * \param[in] ctx Structure to hold snapshot of coding context
+ during the mode picking process
+ *
+ * \remark Nothing is returned. Instead, the MB_MODE_INFO struct inside x
+ * is modified to store information about the best mode computed
+ * in this function. The rd_cost struct is also updated with the RD stats
+ * corresponding to the best mode found.
+ */
void av1_nonrd_pick_inter_mode_sb(AV1_COMP *cpi, TileDataEnc *tile_data,
MACROBLOCK *x, RD_STATS *rd_cost,
BLOCK_SIZE bsize, PICK_MODE_CONTEXT *ctx) {
@@ -3561,10 +3064,8 @@
SVC *const svc = &cpi->svc;
MACROBLOCKD *const xd = &x->e_mbd;
MB_MODE_INFO *const mi = xd->mi[0];
- struct macroblockd_plane *const pd = &xd->plane[0];
+ struct macroblockd_plane *const pd = &xd->plane[AOM_PLANE_Y];
const MB_MODE_INFO_EXT *const mbmi_ext = &x->mbmi_ext;
- const InterpFilter filter_ref = cm->features.interp_filter;
- const InterpFilter default_interp_filter = EIGHTTAP_REGULAR;
MV_REFERENCE_FRAME ref_frame, ref_frame2;
const unsigned char segment_id = mi->segment_id;
int best_early_term = 0;
@@ -3572,30 +3073,33 @@
unsigned int sse_zeromv_norm = UINT_MAX;
int skip_pred_mv = 0;
const int num_inter_modes = NUM_INTER_MODES;
- bool check_globalmv = cpi->sf.rt_sf.check_globalmv_on_single_ref;
+ const REAL_TIME_SPEED_FEATURES *const rt_sf = &cpi->sf.rt_sf;
+ bool check_globalmv = rt_sf->check_globalmv_on_single_ref;
PRED_BUFFER tmp_buffer[4];
- DECLARE_ALIGNED(16, uint8_t, pred_buf[3 * 128 * 128]);
+ DECLARE_ALIGNED(16, uint8_t, pred_buf[MAX_MB_PLANE * MAX_SB_SQUARE]);
PRED_BUFFER *this_mode_pred = NULL;
- const int reuse_inter_pred = cpi->sf.rt_sf.reuse_inter_pred_nonrd &&
- cm->seq_params->bit_depth == AOM_BITS_8;
+ const int reuse_inter_pred =
+ rt_sf->reuse_inter_pred_nonrd && cm->seq_params->bit_depth == AOM_BITS_8;
InterModeSearchStateNonrd search_state;
av1_zero(search_state.use_ref_frame_mask);
+ BEST_PICKMODE *const best_pickmode = &search_state.best_pickmode;
+ (void)tile_data;
const int bh = block_size_high[bsize];
const int bw = block_size_wide[bsize];
const int pixels_in_block = bh * bw;
- const int num_8x8_blocks = ctx->num_4x4_blk / 4;
struct buf_2d orig_dst = pd->dst;
const TxfmSearchParams *txfm_params = &x->txfm_search_params;
TxfmSearchInfo *txfm_info = &x->txfm_search_info;
-#if COLLECT_PICK_MODE_STAT
- aom_usec_timer_start(&ms_stat.bsize_timer);
+#if COLLECT_NONRD_PICK_MODE_STAT
+ // Mode statistics can be collected only when num_workers is 1
+ assert(cpi->mt_info.num_workers <= 1);
+ aom_usec_timer_start(&x->ms_stat_nonrd.bsize_timer);
#endif
int64_t thresh_sad_pred = INT64_MAX;
const int mi_row = xd->mi_row;
const int mi_col = xd->mi_col;
- int svc_mv_col = 0;
- int svc_mv_row = 0;
+ int_mv svc_mv = { .as_int = 0 };
int force_mv_inter_layer = 0;
bool comp_use_zero_zeromv_only = 0;
int tot_num_comp_modes = NUM_COMP_INTER_MODES_RT;
@@ -3609,10 +3113,10 @@
const ModeCosts *mode_costs = &x->mode_costs;
if (reuse_inter_pred) {
- for (int i = 0; i < 3; i++) {
- tmp_buffer[i].data = &pred_buf[pixels_in_block * i];
- tmp_buffer[i].stride = bw;
- tmp_buffer[i].in_use = 0;
+ for (int buf_idx = 0; buf_idx < 3; buf_idx++) {
+ tmp_buffer[buf_idx].data = &pred_buf[pixels_in_block * buf_idx];
+ tmp_buffer[buf_idx].stride = bw;
+ tmp_buffer[buf_idx].in_use = 0;
}
tmp_buffer[3].data = pd->dst.buf;
tmp_buffer[3].stride = pd->dst.stride;
@@ -3629,25 +3133,24 @@
if (cpi->ppi->use_svc && svc->spatial_layer_id > 0 &&
svc->downsample_filter_phase[svc->spatial_layer_id - 1] == 8 &&
cm->width * cm->height > 640 * 480) {
- svc_mv_col = -4;
- svc_mv_row = -4;
+ svc_mv.as_mv.row = -4;
+ svc_mv.as_mv.col = -4;
}
// Setup parameters used for inter mode evaluation.
set_params_nonrd_pick_inter_mode(
- cpi, x, &search_state, tile_data, ctx, rd_cost, &force_skip_low_temp_var,
- &skip_pred_mv, mi_row, mi_col, gf_temporal_ref, segment_id, bsize
+ cpi, x, &search_state, rd_cost, &force_skip_low_temp_var, &skip_pred_mv,
+ mi_row, mi_col, gf_temporal_ref, segment_id, bsize
#if CONFIG_AV1_TEMPORAL_DENOISING
,
- denoise_svc_pickmode
+ ctx, denoise_svc_pickmode
#endif
);
- if (cpi->sf.rt_sf.use_comp_ref_nonrd && is_comp_ref_allowed(bsize)) {
+ if (rt_sf->use_comp_ref_nonrd && is_comp_ref_allowed(bsize)) {
    // Only search compound if bsize > BLOCK_16X16.
if (bsize > BLOCK_16X16) {
- comp_use_zero_zeromv_only =
- cpi->sf.rt_sf.check_only_zero_zeromv_on_large_blocks;
+ comp_use_zero_zeromv_only = rt_sf->check_only_zero_zeromv_on_large_blocks;
} else {
tot_num_comp_modes = 0;
}
@@ -3658,7 +3161,7 @@
if (x->pred_mv_sad[LAST_FRAME] != INT_MAX) {
thresh_sad_pred = ((int64_t)x->pred_mv_sad[LAST_FRAME]) << 1;
// Increase threshold for less aggressive pruning.
- if (cpi->sf.rt_sf.nonrd_prune_ref_frame_search == 1)
+ if (rt_sf->nonrd_prune_ref_frame_search == 1)
thresh_sad_pred += (x->pred_mv_sad[LAST_FRAME] >> 2);
}
@@ -3679,10 +3182,10 @@
is_filter_search_enabled_blk(cpi, x, mi_row, mi_col, bsize, segment_id,
cb_pred_filter_search, &filt_select);
-#if COLLECT_PICK_MODE_STAT
- ms_stat.num_blocks[bsize]++;
+#if COLLECT_NONRD_PICK_MODE_STAT
+ x->ms_stat_nonrd.num_blocks[bsize]++;
#endif
- init_mbmi(mi, DC_PRED, NONE_FRAME, NONE_FRAME, cm);
+ init_mbmi_nonrd(mi, DC_PRED, NONE_FRAME, NONE_FRAME, cm);
mi->tx_size = AOMMIN(
AOMMIN(max_txsize_lookup[bsize],
tx_mode_to_biggest_tx_size[txfm_params->tx_mode_search_type]),
@@ -3707,456 +3210,71 @@
for (int idx = 0; idx < num_inter_modes + tot_num_comp_modes; ++idx) {
// If we are at the first compound mode, and the single modes already
// perform well, then end the search.
- if (cpi->sf.rt_sf.skip_compound_based_on_var && idx == num_inter_modes &&
+ if (rt_sf->skip_compound_based_on_var && idx == num_inter_modes &&
skip_comp_based_on_var(search_state.vars, bsize)) {
break;
}
- int rate_mv = 0;
- int is_skippable;
- int this_early_term = 0;
- int skip_this_mv = 0;
- int comp_pred = 0;
- unsigned int var = UINT_MAX;
+ int is_single_pred = 1;
PREDICTION_MODE this_mode;
- RD_STATS nonskip_rdc;
- av1_invalid_rd_stats(&nonskip_rdc);
- memset(txfm_info->blk_skip, 0,
- sizeof(txfm_info->blk_skip[0]) * num_8x8_blocks);
    // Check whether the inter mode can be skipped based on mode statistics
    // and speed feature settings.
- if (skip_inter_mode_nonrd(
- cpi, x, &search_state, &thresh_sad_pred, &force_mv_inter_layer,
- &comp_pred, &this_mode, &last_comp_ref_frame, &ref_frame,
- &ref_frame2, idx, svc_mv_col, svc_mv_row, force_skip_low_temp_var,
- sse_zeromv_norm, num_inter_modes, segment_id, bsize,
- comp_use_zero_zeromv_only, check_globalmv))
+ if (skip_inter_mode_nonrd(cpi, x, &search_state, &thresh_sad_pred,
+ &force_mv_inter_layer, &is_single_pred,
+ &this_mode, &last_comp_ref_frame, &ref_frame,
+ &ref_frame2, idx, svc_mv, force_skip_low_temp_var,
+ sse_zeromv_norm, num_inter_modes, segment_id,
+ bsize, comp_use_zero_zeromv_only, check_globalmv))
continue;
// Select prediction reference frames.
- for (int i = 0; i < MAX_MB_PLANE; i++) {
- xd->plane[i].pre[0] = search_state.yv12_mb[ref_frame][i];
- if (comp_pred) xd->plane[i].pre[1] = search_state.yv12_mb[ref_frame2][i];
+ for (int plane = 0; plane < MAX_MB_PLANE; plane++) {
+ xd->plane[plane].pre[0] = search_state.yv12_mb[ref_frame][plane];
+ if (!is_single_pred)
+ xd->plane[plane].pre[1] = search_state.yv12_mb[ref_frame2][plane];
}
mi->ref_frame[0] = ref_frame;
mi->ref_frame[1] = ref_frame2;
set_ref_ptrs(cm, xd, ref_frame, ref_frame2);
- if (this_mode == NEWMV && !force_mv_inter_layer) {
-#if COLLECT_PICK_MODE_STAT
- aom_usec_timer_start(&ms_stat.timer2);
-#endif
- const bool skip_newmv = search_new_mv(
- cpi, x, search_state.frame_mv, ref_frame, gf_temporal_ref, bsize,
- mi_row, mi_col, &rate_mv, &search_state.best_rdc);
-#if COLLECT_PICK_MODE_STAT
- aom_usec_timer_mark(&ms_stat.timer2);
- ms_stat.ms_time[bsize][this_mode] +=
- aom_usec_timer_elapsed(&ms_stat.timer2);
-#endif
- if (skip_newmv) {
- continue;
- }
- }
-
- for (PREDICTION_MODE inter_mv_mode = NEARESTMV; inter_mv_mode <= NEWMV;
- inter_mv_mode++) {
- if (inter_mv_mode == this_mode) continue;
- if (!comp_pred && search_state.mode_checked[inter_mv_mode][ref_frame] &&
- search_state.frame_mv[this_mode][ref_frame].as_int ==
- search_state.frame_mv[inter_mv_mode][ref_frame].as_int) {
- skip_this_mv = 1;
- break;
- }
- }
-
- if (skip_this_mv && !comp_pred) continue;
-
- // For screen: for spatially flat blocks with non-zero motion,
- // skip newmv if the motion vector is (0, 0), and color is not set.
- if (this_mode == NEWMV &&
- cpi->oxcf.tune_cfg.content == AOM_CONTENT_SCREEN &&
- cpi->svc.spatial_layer_id == 0 &&
- cpi->sf.rt_sf.source_metrics_sb_nonrd) {
- if (search_state.frame_mv[this_mode][ref_frame].as_int == 0 &&
- x->content_state_sb.source_sad_nonrd != kZeroSad &&
- ((x->color_sensitivity[0] == 0 && x->color_sensitivity[1] == 0) ||
- cpi->rc.high_source_sad) &&
- x->source_variance == 0)
- continue;
- }
-
- mi->mode = this_mode;
- mi->mv[0].as_int = search_state.frame_mv[this_mode][ref_frame].as_int;
- mi->mv[1].as_int = 0;
- if (comp_pred)
- mi->mv[1].as_int = search_state.frame_mv[this_mode][ref_frame2].as_int;
-
- if (reuse_inter_pred) {
- if (!this_mode_pred) {
- this_mode_pred = &tmp_buffer[3];
- } else {
- this_mode_pred = &tmp_buffer[get_pred_buffer(tmp_buffer, 3)];
- pd->dst.buf = this_mode_pred->data;
- pd->dst.stride = bw;
- }
- }
-
- if (idx == 0 && !skip_pred_mv) {
- // Set color sensitivity on first tested mode only.
- // Use y-sad already computed in find_predictors: take the sad with motion
- // vector closest to 0; the uv-sad computed below in set_color_sensitivity
- // is for zeromv.
- // For screen: first check if golden reference is being used, if so,
- // force color_sensitivity on if the color sensitivity for sb_g is on.
- if (cpi->oxcf.tune_cfg.content == AOM_CONTENT_SCREEN &&
- search_state.use_ref_frame_mask[GOLDEN_FRAME]) {
- if (x->color_sensitivity_sb_g[0] == 1) x->color_sensitivity[0] = 1;
- if (x->color_sensitivity_sb_g[1] == 1) x->color_sensitivity[1] = 1;
- } else {
- int y_sad = x->pred_mv0_sad[LAST_FRAME];
- if (x->pred_mv1_sad[LAST_FRAME] != INT_MAX &&
- (abs(search_state.frame_mv[NEARMV][LAST_FRAME].as_mv.col) +
- abs(search_state.frame_mv[NEARMV][LAST_FRAME].as_mv.row)) <
- (abs(search_state.frame_mv[NEARESTMV][LAST_FRAME].as_mv.col) +
- abs(search_state.frame_mv[NEARESTMV][LAST_FRAME].as_mv.row)))
- y_sad = x->pred_mv1_sad[LAST_FRAME];
- set_color_sensitivity(cpi, x, bsize, y_sad, x->source_variance,
- search_state.yv12_mb[LAST_FRAME]);
- }
- }
- mi->motion_mode = SIMPLE_TRANSLATION;
-#if !CONFIG_REALTIME_ONLY
- if (cpi->oxcf.motion_mode_cfg.allow_warped_motion) {
- calc_num_proj_ref(cpi, x, mi);
- }
-#endif
- // set variance threshold for compound more pruning
- unsigned int var_threshold = UINT_MAX;
- if (cpi->sf.rt_sf.prune_compoundmode_with_singlecompound_var && comp_pred &&
- use_model_yrd_large) {
- const PREDICTION_MODE single_mode0 = compound_ref0_mode(this_mode);
- const PREDICTION_MODE single_mode1 = compound_ref1_mode(this_mode);
- var_threshold =
- AOMMIN(var_threshold,
- search_state.vars[INTER_OFFSET(single_mode0)][ref_frame]);
- var_threshold =
- AOMMIN(var_threshold,
- search_state.vars[INTER_OFFSET(single_mode1)][ref_frame2]);
- }
- // decide interpolation filter, build prediction signal, get sse
- const bool is_mv_subpel =
- (mi->mv[0].as_mv.row & 0x07) || (mi->mv[0].as_mv.col & 0x07);
- const bool enable_filt_search_this_mode =
- (filter_search_enabled_blk == 2)
- ? true
- : (filter_search_enabled_blk && !force_mv_inter_layer &&
- !comp_pred &&
- (ref_frame == LAST_FRAME || !x->nonrd_prune_ref_frame_search));
- if (is_mv_subpel && enable_filt_search_this_mode) {
-#if COLLECT_PICK_MODE_STAT
- aom_usec_timer_start(&ms_stat.timer2);
-#endif
- search_filter_ref(cpi, x, &search_state.this_rdc, &inter_pred_params_sr,
- mi_row, mi_col, tmp_buffer, bsize, reuse_inter_pred,
- &this_mode_pred, &this_early_term, &var,
- use_model_yrd_large,
- search_state.best_pickmode.best_sse, comp_pred);
-#if COLLECT_PICK_MODE_STAT
- aom_usec_timer_mark(&ms_stat.timer2);
- ms_stat.ifs_time[bsize][this_mode] +=
- aom_usec_timer_elapsed(&ms_stat.timer2);
-#endif
-#if !CONFIG_REALTIME_ONLY
- } else if (cpi->oxcf.motion_mode_cfg.allow_warped_motion &&
- this_mode == NEWMV) {
- search_motion_mode(cpi, x, &search_state.this_rdc, mi_row, mi_col, bsize,
- &this_early_term, use_model_yrd_large, &rate_mv,
- search_state.best_pickmode.best_sse);
- if (this_mode == NEWMV) {
- search_state.frame_mv[this_mode][ref_frame] = mi->mv[0];
- }
-#endif
- } else {
- mi->interp_filters =
- (filter_ref == SWITCHABLE)
- ? av1_broadcast_interp_filter(default_interp_filter)
- : av1_broadcast_interp_filter(filter_ref);
- if (force_mv_inter_layer)
- mi->interp_filters = av1_broadcast_interp_filter(EIGHTTAP_REGULAR);
-
- // If it is sub-pel motion and cb_pred_filter_search is enabled, select
- // the pre-decided filter
- if (is_mv_subpel && cb_pred_filter_search)
- mi->interp_filters = av1_broadcast_interp_filter(filt_select);
-
-#if COLLECT_PICK_MODE_STAT
- aom_usec_timer_start(&ms_stat.timer2);
-#endif
- if (!comp_pred) {
- SubpelParams subpel_params;
- // Initialize inter mode level params for single reference mode.
- init_inter_mode_params(&mi->mv[0].as_mv, &inter_pred_params_sr,
- &subpel_params, xd->block_ref_scale_factors[0],
- pd->pre->width, pd->pre->height);
- av1_enc_build_inter_predictor_y_nonrd(xd, &inter_pred_params_sr,
- &subpel_params);
- } else {
- av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, NULL, bsize, 0,
- 0);
- }
-
- if (use_model_yrd_large) {
- model_skip_for_sb_y_large(cpi, bsize, mi_row, mi_col, x, xd,
- &search_state.this_rdc, &this_early_term, 0,
- search_state.best_pickmode.best_sse, &var,
- var_threshold);
- } else {
- model_rd_for_sb_y(cpi, bsize, x, xd, &search_state.this_rdc, &var, 0,
- &this_early_term);
- }
-#if COLLECT_PICK_MODE_STAT
- aom_usec_timer_mark(&ms_stat.timer2);
- ms_stat.model_rd_time[bsize][this_mode] +=
- aom_usec_timer_elapsed(&ms_stat.timer2);
-#endif
- }
- // update variance for single mode
- if (!comp_pred) {
- search_state.vars[INTER_OFFSET(this_mode)][ref_frame] = var;
- if (search_state.frame_mv[this_mode][ref_frame].as_int == 0) {
- search_state.vars[INTER_OFFSET(GLOBALMV)][ref_frame] = var;
- }
- }
- // prune compound mode based on single mode var threshold
- if (comp_pred && var > var_threshold) {
- if (reuse_inter_pred) free_pred_buffer(this_mode_pred);
- continue;
- }
-
- if (ref_frame == LAST_FRAME &&
- search_state.frame_mv[this_mode][ref_frame].as_int == 0) {
- sse_zeromv_norm = (unsigned int)(search_state.this_rdc.sse >>
- (b_width_log2_lookup[bsize] +
- b_height_log2_lookup[bsize]));
- }
-
- if (cpi->sf.rt_sf.sse_early_term_inter_search &&
- early_term_inter_search_with_sse(
- cpi->sf.rt_sf.sse_early_term_inter_search, bsize,
- search_state.this_rdc.sse, search_state.best_pickmode.best_sse,
- this_mode)) {
- if (reuse_inter_pred) free_pred_buffer(this_mode_pred);
- continue;
- }
-
-#if COLLECT_PICK_MODE_STAT
- ms_stat.num_nonskipped_searches[bsize][this_mode]++;
-#endif
-
- const int skip_ctx = av1_get_skip_txfm_context(xd);
- const int skip_txfm_cost = mode_costs->skip_txfm_cost[skip_ctx][1];
- const int no_skip_txfm_cost = mode_costs->skip_txfm_cost[skip_ctx][0];
- const int64_t sse_y = search_state.this_rdc.sse;
- if (this_early_term) {
- search_state.this_rdc.skip_txfm = 1;
- search_state.this_rdc.rate = skip_txfm_cost;
- search_state.this_rdc.dist = search_state.this_rdc.sse << 4;
- } else {
-#if COLLECT_PICK_MODE_STAT
- aom_usec_timer_start(&ms_stat.timer2);
-#endif
- block_yrd(x, &search_state.this_rdc, &is_skippable, bsize, mi->tx_size,
- 1);
- if (search_state.this_rdc.skip_txfm ||
- RDCOST(x->rdmult, search_state.this_rdc.rate,
- search_state.this_rdc.dist) >=
- RDCOST(x->rdmult, 0, search_state.this_rdc.sse)) {
- if (!search_state.this_rdc.skip_txfm) {
- // Need to store "real" rdc for possible future use if UV rdc
- // disallows tx skip
- nonskip_rdc = search_state.this_rdc;
- nonskip_rdc.rate += no_skip_txfm_cost;
- }
- search_state.this_rdc.rate = skip_txfm_cost;
- search_state.this_rdc.skip_txfm = 1;
- search_state.this_rdc.dist = search_state.this_rdc.sse;
- } else {
- search_state.this_rdc.rate += no_skip_txfm_cost;
- }
- if ((x->color_sensitivity[0] || x->color_sensitivity[1])) {
- RD_STATS rdc_uv;
- const BLOCK_SIZE uv_bsize = get_plane_block_size(
- bsize, xd->plane[1].subsampling_x, xd->plane[1].subsampling_y);
- if (x->color_sensitivity[0]) {
- av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, NULL, bsize,
- AOM_PLANE_U, AOM_PLANE_U);
- }
- if (x->color_sensitivity[1]) {
- av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, NULL, bsize,
- AOM_PLANE_V, AOM_PLANE_V);
- }
- const int64_t sse_uv =
- model_rd_for_sb_uv(cpi, uv_bsize, x, xd, &rdc_uv, 1, 2);
- search_state.this_rdc.sse += sse_uv;
- // Restore Y rdc if UV rdc disallows txfm skip
- if (search_state.this_rdc.skip_txfm && !rdc_uv.skip_txfm &&
- nonskip_rdc.rate != INT_MAX)
- search_state.this_rdc = nonskip_rdc;
- if (!comp_pred) {
- search_state.uv_dist[INTER_OFFSET(this_mode)][ref_frame] =
- rdc_uv.dist;
- }
- search_state.this_rdc.rate += rdc_uv.rate;
- search_state.this_rdc.dist += rdc_uv.dist;
- search_state.this_rdc.skip_txfm =
- search_state.this_rdc.skip_txfm && rdc_uv.skip_txfm;
- }
-#if COLLECT_PICK_MODE_STAT
- aom_usec_timer_mark(&ms_stat.timer2);
- ms_stat.txfm_time[bsize][this_mode] +=
- aom_usec_timer_elapsed(&ms_stat.timer2);
-#endif
- }
- PREDICTION_MODE this_best_mode = this_mode;
-
- // TODO(kyslov) account for UV prediction cost
- search_state.this_rdc.rate += rate_mv;
- if (comp_pred) {
- const int16_t mode_ctx =
- av1_mode_context_analyzer(mbmi_ext->mode_context, mi->ref_frame);
- search_state.this_rdc.rate +=
- cost_mv_ref(mode_costs, this_mode, mode_ctx);
- } else {
- // If the current mode has zeromv but is not GLOBALMV, compare the rate
- // cost. If GLOBALMV is cheaper, use GLOBALMV instead.
- if (this_mode != GLOBALMV &&
- search_state.frame_mv[this_mode][ref_frame].as_int ==
- search_state.frame_mv[GLOBALMV][ref_frame].as_int) {
- if (is_globalmv_better(this_mode, ref_frame, rate_mv, mode_costs,
- search_state.single_inter_mode_costs,
- mbmi_ext)) {
- this_best_mode = GLOBALMV;
- }
- }
-
- search_state.this_rdc.rate +=
- search_state
- .single_inter_mode_costs[INTER_OFFSET(this_best_mode)][ref_frame];
- }
-
- if (!comp_pred && search_state.frame_mv[this_mode][ref_frame].as_int == 0 &&
- var < UINT_MAX) {
- search_state.vars[INTER_OFFSET(GLOBALMV)][ref_frame] = var;
- }
-
- search_state.this_rdc.rate += search_state.ref_costs_single[ref_frame];
-
- search_state.this_rdc.rdcost = RDCOST(x->rdmult, search_state.this_rdc.rate,
- search_state.this_rdc.dist);
- if (cpi->oxcf.rc_cfg.mode == AOM_CBR && !comp_pred) {
- newmv_diff_bias(
- xd, this_best_mode, &search_state.this_rdc, bsize,
- search_state.frame_mv[this_best_mode][ref_frame].as_mv.row,
- search_state.frame_mv[this_best_mode][ref_frame].as_mv.col,
- cpi->speed, x->source_variance, x->content_state_sb);
- }
+ // Perform inter mode evaluation for non-rd
+ if (!handle_inter_mode_nonrd(
+ cpi, x, &search_state, ctx, &this_mode_pred, tmp_buffer,
+ inter_pred_params_sr, &best_early_term, &sse_zeromv_norm,
+ &check_globalmv,
#if CONFIG_AV1_TEMPORAL_DENOISING
- if (cpi->oxcf.noise_sensitivity > 0 && denoise_svc_pickmode &&
- cpi->denoiser.denoising_level > kDenLowLow) {
- av1_denoiser_update_frame_stats(mi, sse_y, this_mode, ctx);
- // Keep track of zero_last cost.
- if (ref_frame == LAST_FRAME &&
- search_state.frame_mv[this_mode][ref_frame].as_int == 0)
- zero_last_cost_orig = search_state.this_rdc.rdcost;
- }
-#else
- (void)sse_y;
+ &zero_last_cost_orig, denoise_svc_pickmode,
#endif
-
- search_state.mode_checked[this_mode][ref_frame] = 1;
- search_state.mode_checked[this_best_mode][ref_frame] = 1;
-
- if (check_globalmv) {
- int32_t abs_mv =
- abs(search_state.frame_mv[this_best_mode][ref_frame].as_mv.row) +
- abs(search_state.frame_mv[this_best_mode][ref_frame].as_mv.col);
- // Early exit check: if the magnitude of this_best_mode's mv is small
- // enough, we skip GLOBALMV check in the next loop iteration.
- if (abs_mv < 2) {
- check_globalmv = false;
- }
- }
-#if COLLECT_PICK_MODE_STAT
- aom_usec_timer_mark(&ms_stat.timer1);
- ms_stat.nonskipped_search_times[bsize][this_mode] +=
- aom_usec_timer_elapsed(&ms_stat.timer1);
-#endif
- if (search_state.this_rdc.rdcost < search_state.best_rdc.rdcost) {
- search_state.best_rdc = search_state.this_rdc;
- best_early_term = this_early_term;
- search_state.best_pickmode.best_sse = sse_y;
- search_state.best_pickmode.best_mode = this_best_mode;
- search_state.best_pickmode.best_motion_mode = mi->motion_mode;
- search_state.best_pickmode.wm_params = mi->wm_params;
- search_state.best_pickmode.num_proj_ref = mi->num_proj_ref;
- search_state.best_pickmode.best_pred_filter = mi->interp_filters;
- search_state.best_pickmode.best_tx_size = mi->tx_size;
- search_state.best_pickmode.best_ref_frame = ref_frame;
- search_state.best_pickmode.best_second_ref_frame = ref_frame2;
- search_state.best_pickmode.best_mode_skip_txfm =
- search_state.this_rdc.skip_txfm;
- search_state.best_pickmode.best_mode_initial_skip_flag =
- (nonskip_rdc.rate == INT_MAX && search_state.this_rdc.skip_txfm);
- if (!search_state.best_pickmode.best_mode_skip_txfm) {
- memcpy(search_state.best_pickmode.blk_skip, txfm_info->blk_skip,
- sizeof(txfm_info->blk_skip[0]) * num_8x8_blocks);
- }
-
- // This is needed for the compound modes.
- search_state.frame_mv_best[this_best_mode][ref_frame].as_int =
- search_state.frame_mv[this_best_mode][ref_frame].as_int;
- if (ref_frame2 > NONE_FRAME) {
- search_state.frame_mv_best[this_best_mode][ref_frame2].as_int =
- search_state.frame_mv[this_best_mode][ref_frame2].as_int;
- }
-
- if (reuse_inter_pred) {
- free_pred_buffer(search_state.best_pickmode.best_pred);
- search_state.best_pickmode.best_pred = this_mode_pred;
- }
- } else {
- if (reuse_inter_pred) free_pred_buffer(this_mode_pred);
- }
- if (best_early_term && (idx > 0 || cpi->sf.rt_sf.nonrd_aggressive_skip)) {
- txfm_info->skip_txfm = 1;
+ idx, force_mv_inter_layer, is_single_pred, skip_pred_mv,
+ gf_temporal_ref, use_model_yrd_large, filter_search_enabled_blk,
+ bsize, this_mode, filt_select, cb_pred_filter_search,
+ reuse_inter_pred)) {
break;
}
}
- mi->mode = search_state.best_pickmode.best_mode;
- mi->motion_mode = search_state.best_pickmode.best_motion_mode;
- mi->wm_params = search_state.best_pickmode.wm_params;
- mi->num_proj_ref = search_state.best_pickmode.num_proj_ref;
- mi->interp_filters = search_state.best_pickmode.best_pred_filter;
- mi->tx_size = search_state.best_pickmode.best_tx_size;
+ // Restore mode data of best inter mode
+ mi->mode = best_pickmode->best_mode;
+ mi->motion_mode = best_pickmode->best_motion_mode;
+ mi->wm_params = best_pickmode->wm_params;
+ mi->num_proj_ref = best_pickmode->num_proj_ref;
+ mi->interp_filters = best_pickmode->best_pred_filter;
+ mi->tx_size = best_pickmode->best_tx_size;
memset(mi->inter_tx_size, mi->tx_size, sizeof(mi->inter_tx_size));
- mi->ref_frame[0] = search_state.best_pickmode.best_ref_frame;
- mi->mv[0].as_int =
- search_state
- .frame_mv_best[search_state.best_pickmode.best_mode]
- [search_state.best_pickmode.best_ref_frame]
- .as_int;
+ mi->ref_frame[0] = best_pickmode->best_ref_frame;
+ mi->mv[0].as_int = search_state
+ .frame_mv_best[best_pickmode->best_mode]
+ [best_pickmode->best_ref_frame]
+ .as_int;
mi->mv[1].as_int = 0;
- if (search_state.best_pickmode.best_second_ref_frame > INTRA_FRAME) {
- mi->ref_frame[1] = search_state.best_pickmode.best_second_ref_frame;
- mi->mv[1].as_int =
- search_state
- .frame_mv_best[search_state.best_pickmode.best_mode]
- [search_state.best_pickmode.best_second_ref_frame]
- .as_int;
+ if (best_pickmode->best_second_ref_frame > INTRA_FRAME) {
+ mi->ref_frame[1] = best_pickmode->best_second_ref_frame;
+ mi->mv[1].as_int = search_state
+ .frame_mv_best[best_pickmode->best_mode]
+ [best_pickmode->best_second_ref_frame]
+ .as_int;
}
  // Perform intra prediction search if the best SAD is above a certain
  // threshold.
@@ -4164,118 +3282,75 @@
mi->angle_delta[PLANE_TYPE_UV] = 0;
mi->filter_intra_mode_info.use_filter_intra = 0;
-#if COLLECT_PICK_MODE_STAT
- aom_usec_timer_start(&ms_stat.timer1);
- ms_stat.num_searches[bsize][DC_PRED]++;
- ms_stat.num_nonskipped_searches[bsize][DC_PRED]++;
+#if COLLECT_NONRD_PICK_MODE_STAT
+ aom_usec_timer_start(&x->ms_stat_nonrd.timer1);
+ x->ms_stat_nonrd.num_searches[bsize][DC_PRED]++;
+ x->ms_stat_nonrd.num_nonskipped_searches[bsize][DC_PRED]++;
#endif
- if (!x->force_zeromv_skip_for_blk)
- estimate_intra_mode(cpi, x, bsize, best_early_term,
- search_state.ref_costs_single[INTRA_FRAME],
- reuse_inter_pred, &orig_dst, tmp_buffer,
- &this_mode_pred, &search_state.best_rdc,
- &search_state.best_pickmode, ctx);
-
- int skip_idtx_palette =
- (x->color_sensitivity[0] || x->color_sensitivity[1]) &&
+ int force_palette_test = 0;
+ if (cpi->oxcf.tune_cfg.content == AOM_CONTENT_SCREEN &&
x->content_state_sb.source_sad_nonrd != kZeroSad &&
- !cpi->rc.high_source_sad;
-
- // Check for IDTX: based only on Y channel, so avoid when color_sensitivity
- // is set.
- if (cpi->oxcf.tune_cfg.content == AOM_CONTENT_SCREEN && !skip_idtx_palette &&
- !cpi->oxcf.txfm_cfg.use_inter_dct_only && !x->force_zeromv_skip_for_blk &&
- is_inter_mode(search_state.best_pickmode.best_mode) &&
- (!cpi->sf.rt_sf.prune_idtx_nonrd ||
- (cpi->sf.rt_sf.prune_idtx_nonrd && bsize <= BLOCK_32X32 &&
- search_state.best_pickmode.best_mode_skip_txfm != 1 &&
- x->source_variance > 200))) {
- RD_STATS idtx_rdc;
- av1_init_rd_stats(&idtx_rdc);
- int is_skippable;
- this_mode_pred = &tmp_buffer[get_pred_buffer(tmp_buffer, 3)];
- pd->dst.buf = this_mode_pred->data;
- pd->dst.stride = bw;
- av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, NULL, bsize, 0, 0);
- block_yrd_idtx(x, &idtx_rdc, &is_skippable, bsize, mi->tx_size);
- int64_t idx_rdcost = RDCOST(x->rdmult, idtx_rdc.rate, idtx_rdc.dist);
- if (idx_rdcost < search_state.best_rdc.rdcost) {
- // Keep the skip_txfm off if the color_sensitivity is set.
- if (x->color_sensitivity[0] || x->color_sensitivity[1])
- idtx_rdc.skip_txfm = 0;
- search_state.best_pickmode.tx_type = IDTX;
- search_state.best_rdc.rdcost = idx_rdcost;
- search_state.best_pickmode.best_mode_skip_txfm = idtx_rdc.skip_txfm;
- if (!idtx_rdc.skip_txfm) {
- memcpy(search_state.best_pickmode.blk_skip, txfm_info->blk_skip,
- sizeof(txfm_info->blk_skip[0]) * num_8x8_blocks);
- }
- xd->tx_type_map[0] = search_state.best_pickmode.tx_type;
- memset(ctx->tx_type_map, search_state.best_pickmode.tx_type,
- ctx->num_4x4_blk);
- memset(xd->tx_type_map, search_state.best_pickmode.tx_type,
- ctx->num_4x4_blk);
- }
- pd->dst = orig_dst;
+ bsize <= BLOCK_16X16) {
+ unsigned int thresh_sse = cpi->rc.high_source_sad ? 15000 : 250000;
+ unsigned int thresh_source_var = cpi->rc.high_source_sad ? 50 : 1000;
+ unsigned int best_sse_inter_motion =
+ (unsigned int)(search_state.best_rdc.sse >>
+ (b_width_log2_lookup[bsize] +
+ b_height_log2_lookup[bsize]));
+ if (best_sse_inter_motion > thresh_sse &&
+ x->source_variance > thresh_source_var)
+ force_palette_test = 1;
}
+ // Evaluate Intra modes in inter frame
+ if (!x->force_zeromv_skip_for_blk)
+ av1_estimate_intra_mode(cpi, x, bsize, best_early_term,
+ search_state.ref_costs_single[INTRA_FRAME],
+ reuse_inter_pred, &orig_dst, tmp_buffer,
+ &this_mode_pred, &search_state.best_rdc,
+ best_pickmode, ctx);
+
+ int skip_idtx_palette = (x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_U)] ||
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_V)]) &&
+ x->content_state_sb.source_sad_nonrd != kZeroSad &&
+ !cpi->rc.high_source_sad;
+
int try_palette =
!skip_idtx_palette && cpi->oxcf.tool_cfg.enable_palette &&
av1_allow_palette(cpi->common.features.allow_screen_content_tools,
mi->bsize);
- try_palette = try_palette &&
- is_mode_intra(search_state.best_pickmode.best_mode) &&
- x->source_variance > 0 && !x->force_zeromv_skip_for_blk &&
- (cpi->rc.high_source_sad || x->source_variance > 500);
+ try_palette =
+ try_palette &&
+ (is_mode_intra(best_pickmode->best_mode) || force_palette_test) &&
+ x->source_variance > 0 && !x->force_zeromv_skip_for_blk &&
+ (cpi->rc.high_source_sad || x->source_variance > 500);
- if (try_palette) {
- const unsigned int intra_ref_frame_cost =
- search_state.ref_costs_single[INTRA_FRAME];
+ if (rt_sf->prune_palette_nonrd && bsize > BLOCK_16X16) try_palette = 0;
- av1_search_palette_mode_luma(cpi, x, bsize, intra_ref_frame_cost, ctx,
- &search_state.this_rdc,
- search_state.best_rdc.rdcost);
- if (search_state.this_rdc.rdcost < search_state.best_rdc.rdcost) {
- search_state.best_pickmode.pmi = mi->palette_mode_info;
- search_state.best_pickmode.best_mode = DC_PRED;
- mi->mv[0].as_int = 0;
- search_state.best_rdc.rate = search_state.this_rdc.rate;
- search_state.best_rdc.dist = search_state.this_rdc.dist;
- search_state.best_rdc.rdcost = search_state.this_rdc.rdcost;
- search_state.best_pickmode.best_mode_skip_txfm =
- search_state.this_rdc.skip_txfm;
- // Keep the skip_txfm off if the color_sensitivity is set.
- if (x->color_sensitivity[0] || x->color_sensitivity[1])
- search_state.this_rdc.skip_txfm = 0;
- if (!search_state.this_rdc.skip_txfm) {
- memcpy(ctx->blk_skip, txfm_info->blk_skip,
- sizeof(txfm_info->blk_skip[0]) * ctx->num_4x4_blk);
- }
- if (xd->tx_type_map[0] != DCT_DCT)
- av1_copy_array(ctx->tx_type_map, xd->tx_type_map, ctx->num_4x4_blk);
- }
- }
+ // Perform screen content mode evaluation for non-rd
+ handle_screen_content_mode_nonrd(
+ cpi, x, &search_state, this_mode_pred, ctx, tmp_buffer, &orig_dst,
+ skip_idtx_palette, try_palette, bsize, reuse_inter_pred, mi_col, mi_row);
-#if COLLECT_PICK_MODE_STAT
- aom_usec_timer_mark(&ms_stat.timer1);
- ms_stat.nonskipped_search_times[bsize][DC_PRED] +=
- aom_usec_timer_elapsed(&ms_stat.timer1);
+#if COLLECT_NONRD_PICK_MODE_STAT
+ aom_usec_timer_mark(&x->ms_stat_nonrd.timer1);
+ x->ms_stat_nonrd.nonskipped_search_times[bsize][DC_PRED] +=
+ aom_usec_timer_elapsed(&x->ms_stat_nonrd.timer1);
#endif
pd->dst = orig_dst;
- if (try_palette) mi->palette_mode_info = search_state.best_pickmode.pmi;
- mi->mode = search_state.best_pickmode.best_mode;
- mi->ref_frame[0] = search_state.best_pickmode.best_ref_frame;
- mi->ref_frame[1] = search_state.best_pickmode.best_second_ref_frame;
- txfm_info->skip_txfm = search_state.best_pickmode.best_mode_skip_txfm;
- if (!txfm_info->skip_txfm) {
- // For inter modes: copy blk_skip from best_pickmode, which is
- // defined for 8x8 blocks. If palette or intra mode was selected
- // as best then blk_skip is already copied into the ctx.
- if (search_state.best_pickmode.best_mode >= INTRA_MODE_END)
- memcpy(ctx->blk_skip, search_state.best_pickmode.blk_skip,
- sizeof(search_state.best_pickmode.blk_skip[0]) * num_8x8_blocks);
+ // Best mode is finalized. Restore the mode data to mbmi
+ if (try_palette) mi->palette_mode_info = best_pickmode->pmi;
+ mi->mode = best_pickmode->best_mode;
+ mi->ref_frame[0] = best_pickmode->best_ref_frame;
+ mi->ref_frame[1] = best_pickmode->best_second_ref_frame;
+ // For lossless: always force the skip flags off.
+ if (is_lossless_requested(&cpi->oxcf.rc_cfg)) {
+ txfm_info->skip_txfm = 0;
+ memset(ctx->blk_skip, 0, sizeof(ctx->blk_skip[0]) * ctx->num_4x4_blk);
+ } else {
+ txfm_info->skip_txfm = best_pickmode->best_mode_skip_txfm;
}
if (has_second_ref(mi)) {
mi->comp_group_idx = 0;
@@ -4287,8 +3362,9 @@
mi->interp_filters = av1_broadcast_interp_filter(SWITCHABLE_FILTERS);
}
- if (reuse_inter_pred && search_state.best_pickmode.best_pred != NULL) {
- PRED_BUFFER *const best_pred = search_state.best_pickmode.best_pred;
+ // Restore the predicted samples of best mode to final buffer
+ if (reuse_inter_pred && best_pickmode->best_pred != NULL) {
+ PRED_BUFFER *const best_pred = best_pickmode->best_pred;
if (best_pred->data != orig_dst.buf && is_inter_mode(mi->mode)) {
aom_convolve_copy(best_pred->data, best_pred->stride, pd->dst.buf,
pd->dst.stride, bw, bh);
@@ -4303,52 +3379,50 @@
ctx->sb_skip_denoising = 0;
av1_pickmode_ctx_den_update(
&ctx_den, zero_last_cost_orig, search_state.ref_costs_single,
- search_state.frame_mv, reuse_inter_pred, &search_state.best_pickmode);
+ search_state.frame_mv, reuse_inter_pred, best_pickmode);
av1_denoiser_denoise(cpi, x, mi_row, mi_col, bsize, ctx, &decision,
gf_temporal_ref);
if (denoise_recheck_zeromv)
recheck_zeromv_after_denoising(
cpi, mi, x, xd, decision, &ctx_den, search_state.yv12_mb,
- &search_state.best_rdc, &search_state.best_pickmode, bsize, mi_row,
- mi_col);
- search_state.best_pickmode.best_ref_frame = ctx_den.best_ref_frame;
+ &search_state.best_rdc, best_pickmode, bsize, mi_row, mi_col);
+ best_pickmode->best_ref_frame = ctx_den.best_ref_frame;
}
#endif
+ // Update the factors used for RD thresholding for all modes.
if (cpi->sf.inter_sf.adaptive_rd_thresh && !has_second_ref(mi)) {
THR_MODES best_mode_idx =
- mode_idx[search_state.best_pickmode.best_ref_frame]
- [mode_offset(mi->mode)];
- if (search_state.best_pickmode.best_ref_frame == INTRA_FRAME) {
+ mode_idx[best_pickmode->best_ref_frame][mode_offset(mi->mode)];
+ if (best_pickmode->best_ref_frame == INTRA_FRAME) {
// Only consider the modes that are included in the intra_mode_list.
int intra_modes = sizeof(intra_mode_list) / sizeof(PREDICTION_MODE);
- for (int i = 0; i < intra_modes; i++) {
+ for (int mode_index = 0; mode_index < intra_modes; mode_index++) {
update_thresh_freq_fact(cpi, x, bsize, INTRA_FRAME, best_mode_idx,
- intra_mode_list[i]);
+ intra_mode_list[mode_index]);
}
} else {
PREDICTION_MODE this_mode;
for (this_mode = NEARESTMV; this_mode <= NEWMV; ++this_mode) {
- update_thresh_freq_fact(cpi, x, bsize,
- search_state.best_pickmode.best_ref_frame,
+ update_thresh_freq_fact(cpi, x, bsize, best_pickmode->best_ref_frame,
best_mode_idx, this_mode);
}
}
}
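/*
 * Editor's sketch (not part of the patch): the adaptive-RD-threshold loop
 * above biases future pruning toward recent winners. The decay/boost scheme
 * below is a hypothetical illustration of the idea only; it is not
 * update_thresh_freq_fact()'s actual arithmetic.
 */
#define MAX_FACT 256 /* hypothetical cap */

/* The winning mode becomes cheaper to re-test; losers drift toward the cap
 * and become progressively easier to prune. */
void update_factor(int *fact, int is_best_mode) {
  if (is_best_mode)
    *fact >>= 1;
  else if (*fact < MAX_FACT)
    *fact += 4;
}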
#if CONFIG_INTERNAL_STATS
- store_coding_context(x, ctx, mi->mode);
+ store_coding_context_nonrd(x, ctx, mi->mode);
#else
- store_coding_context(x, ctx);
+ store_coding_context_nonrd(x, ctx);
#endif // CONFIG_INTERNAL_STATS
-#if COLLECT_PICK_MODE_STAT
- aom_usec_timer_mark(&ms_stat.bsize_timer);
- ms_stat.total_block_times[bsize] +=
- aom_usec_timer_elapsed(&ms_stat.bsize_timer);
- print_time(&ms_stat, bsize, cm->mi_params.mi_rows, cm->mi_params.mi_cols,
- mi_row, mi_col);
-#endif // COLLECT_PICK_MODE_STAT
+#if COLLECT_NONRD_PICK_MODE_STAT
+ aom_usec_timer_mark(&x->ms_stat_nonrd.bsize_timer);
+ x->ms_stat_nonrd.total_block_times[bsize] +=
+ aom_usec_timer_elapsed(&x->ms_stat_nonrd.bsize_timer);
+ print_time(&x->ms_stat_nonrd, bsize, cm->mi_params.mi_rows,
+ cm->mi_params.mi_cols, mi_row, mi_col);
+#endif // COLLECT_NONRD_PICK_MODE_STAT
*rd_cost = search_state.best_rdc;
}
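/*
 * Editor's sketch (not part of the patch): both sse_zeromv_norm and the new
 * force_palette_test gate normalize SSE by block area in log2 units so one
 * threshold works across block sizes. The lookup convention assumed here is
 * that b_width_log2/b_height_log2 count 4-sample units, so BLOCK_16X16 gives
 * bwl = bhl = 2.
 */
#include <stdint.h>
#include <stdio.h>

int main(void) {
  const int bwl = 2, bhl = 2;         /* assumed values for BLOCK_16X16 */
  const int64_t sse = 4400000;        /* raw SSE of the best inter mode */
  const unsigned norm = (unsigned)(sse >> (bwl + bhl));
  const unsigned thresh_sse = 250000; /* patch value when !high_source_sad */
  const unsigned source_variance = 1500, thresh_var = 1000;
  printf("norm=%u force_palette_test=%d\n", norm,
         norm > thresh_sse && source_variance > thresh_var);
  return 0;
}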
diff --git a/av1/encoder/palette.c b/av1/encoder/palette.c
index 9c3d407..b1a73e4 100644
--- a/av1/encoder/palette.c
+++ b/av1/encoder/palette.c
@@ -733,6 +733,9 @@
if (best_mbmi->palette_mode_info.palette_size[0] > 0) {
memcpy(color_map, best_palette_color_map,
block_width * block_height * sizeof(best_palette_color_map[0]));
+ // Gather the stats to determine whether to use screen content tools in
+ // function av1_determine_sc_tools_with_encoding().
+ x->palette_pixels += (block_width * block_height);
}
*mbmi = *best_mbmi;
}
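/*
 * Editor's sketch (not part of the patch): how the counter bumped above can
 * feed a frame-level screen-content decision. The 10% threshold and the
 * helper name are hypothetical; the real check lives in
 * av1_determine_sc_tools_with_encoding().
 */
#include <stdint.h>

/* Hypothetical gate: enable SC tools once enough pixels picked palette. */
int sc_tools_heuristic(uint64_t palette_pixels, uint64_t frame_pixels) {
  return palette_pixels * 10 > frame_pixels; /* more than 10% of the frame */
}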
diff --git a/av1/encoder/partition_search.c b/av1/encoder/partition_search.c
index 8d06bf5..96567dd 100644
--- a/av1/encoder/partition_search.c
+++ b/av1/encoder/partition_search.c
@@ -603,8 +603,7 @@
}
#if !CONFIG_REALTIME_ONLY
- const AV1_COMMON *const cm = &cpi->common;
- if (cm->delta_q_info.delta_q_present_flag &&
+ if (cpi->common.delta_q_info.delta_q_present_flag &&
!cpi->sf.rt_sf.use_nonrd_pick_mode) {
x->rdmult = av1_get_cb_rdmult(cpi, x, bsize, mi_row, mi_col);
}
@@ -614,15 +613,22 @@
av1_set_ssim_rdmult(cpi, &x->errorperbit, bsize, mi_row, mi_col,
&x->rdmult);
}
+#if CONFIG_SALIENCY_MAP
+ else if (cpi->oxcf.tune_cfg.tuning == AOM_TUNE_VMAF_SALIENCY_MAP) {
+ av1_set_saliency_map_vmaf_rdmult(cpi, &x->errorperbit,
+ cpi->common.seq_params->sb_size, mi_row,
+ mi_col, &x->rdmult);
+ }
+#endif
#if CONFIG_TUNE_VMAF
- if (cpi->oxcf.tune_cfg.tuning == AOM_TUNE_VMAF_WITHOUT_PREPROCESSING ||
- cpi->oxcf.tune_cfg.tuning == AOM_TUNE_VMAF_MAX_GAIN ||
- cpi->oxcf.tune_cfg.tuning == AOM_TUNE_VMAF_NEG_MAX_GAIN) {
+ else if (cpi->oxcf.tune_cfg.tuning == AOM_TUNE_VMAF_WITHOUT_PREPROCESSING ||
+ cpi->oxcf.tune_cfg.tuning == AOM_TUNE_VMAF_MAX_GAIN ||
+ cpi->oxcf.tune_cfg.tuning == AOM_TUNE_VMAF_NEG_MAX_GAIN) {
av1_set_vmaf_rdmult(cpi, x, bsize, mi_row, mi_col, &x->rdmult);
}
#endif
#if CONFIG_TUNE_BUTTERAUGLI
- if (cpi->oxcf.tune_cfg.tuning == AOM_TUNE_BUTTERAUGLI) {
+ else if (cpi->oxcf.tune_cfg.tuning == AOM_TUNE_BUTTERAUGLI) {
av1_set_butteraugli_rdmult(cpi, x, bsize, mi_row, mi_col, &x->rdmult);
}
#endif
@@ -1294,8 +1300,7 @@
}
if (inter_block && cm->features.interp_filter == SWITCHABLE &&
- mbmi->motion_mode != WARPED_CAUSAL &&
- !is_nontrans_global_motion(xd, mbmi)) {
+ av1_is_interp_needed(xd)) {
update_filter_type_cdf(xd, mbmi, cm->seq_params->enable_dual_filter);
}
if (inter_block &&
@@ -2301,7 +2306,8 @@
  // here. Check to see if skipping cdef is allowed.
const int allow_cdef_skipping =
cpi->rc.frames_since_key > 10 && !cpi->rc.high_source_sad &&
- !(x->color_sensitivity[0] || x->color_sensitivity[1]);
+ !(x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_U)] ||
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_V)]);
// Find the corresponding 64x64 block. It'll be the 128x128 block if that's
// the block size.
@@ -2312,9 +2318,15 @@
get_mi_grid_idx(&cm->mi_params, mi_row_sb, mi_col_sb);
// Do not skip if intra or new mv is picked, or color sensitivity is set.
// Never skip on slide/scene change.
- mi_sb[0]->cdef_strength =
- mi_sb[0]->cdef_strength && allow_cdef_skipping &&
- !(mbmi->mode < INTRA_MODES || mbmi->mode == NEWMV);
+ if (cpi->sf.rt_sf.skip_cdef_sb >= 2) {
+ mi_sb[0]->cdef_strength =
+ mi_sb[0]->cdef_strength &&
+ (allow_cdef_skipping || x->source_variance == 0);
+ } else {
+ mi_sb[0]->cdef_strength =
+ mi_sb[0]->cdef_strength && allow_cdef_skipping &&
+ !(mbmi->mode < INTRA_MODES || mbmi->mode == NEWMV);
+ }
// Store in the pickmode context.
ctx->mic.cdef_strength = mi_sb[0]->cdef_strength;
}
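/*
 * Editor's sketch (not part of the patch): the two-tier CDEF skip rule added
 * above. At skip_cdef_sb >= 2, a zero-variance source block keeps the skip
 * even when the usual gating fails; otherwise skipping additionally requires
 * that neither an intra mode nor NEWMV was picked. Flags are simplified ints.
 */
int cdef_strength_after_block(int prev_strength, int allow_cdef_skipping,
                              int skip_cdef_sb, int source_variance_is_zero,
                              int picked_intra_or_newmv) {
  if (skip_cdef_sb >= 2)
    return prev_strength && (allow_cdef_skipping || source_variance_is_zero);
  return prev_strength && allow_cdef_skipping && !picked_intra_or_newmv;
}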
@@ -2538,6 +2550,7 @@
MACROBLOCK *const x = &td->mb;
MACROBLOCKD *const xd = &x->e_mbd;
const ModeCosts *mode_costs = &x->mode_costs;
+ const int num_planes = av1_num_planes(cm);
// Only square blocks from 8x8 to 128x128 are supported
assert(bsize >= BLOCK_8X8 && bsize <= BLOCK_128X128);
const int bs = mi_size_wide[bsize];
@@ -2547,7 +2560,7 @@
RD_STATS split_rdc, none_rdc;
av1_invalid_rd_stats(&split_rdc);
av1_invalid_rd_stats(&none_rdc);
- av1_save_context(x, &x_ctx, mi_row, mi_col, bsize, 3);
+ av1_save_context(x, &x_ctx, mi_row, mi_col, bsize, num_planes);
xd->above_txfm_context =
cm->above_contexts.txfm[tile_info->tile_row] + mi_col;
xd->left_txfm_context =
@@ -2562,7 +2575,7 @@
pc_tree->none);
none_rdc.rate += mode_costs->partition_cost[pl][PARTITION_NONE];
none_rdc.rdcost = RDCOST(x->rdmult, none_rdc.rate, none_rdc.dist);
- av1_restore_context(x, &x_ctx, mi_row, mi_col, bsize, 3);
+ av1_restore_context(x, &x_ctx, mi_row, mi_col, bsize, num_planes);
if (cpi->sf.rt_sf.nonrd_check_partition_merge_mode < 2 ||
none_rdc.skip_txfm != 1 || pc_tree->none->mic.mode == NEWMV) {
@@ -2608,7 +2621,7 @@
1, subsize, PARTITION_NONE, pc_tree->split[i]->none,
NULL);
}
- av1_restore_context(x, &x_ctx, mi_row, mi_col, bsize, 3);
+ av1_restore_context(x, &x_ctx, mi_row, mi_col, bsize, num_planes);
split_rdc.rdcost = RDCOST(x->rdmult, split_rdc.rate, split_rdc.dist);
}
}
@@ -2755,12 +2768,12 @@
frame_mv[i][j].as_int = INVALID_MV;
}
}
- x->color_sensitivity[0] = x->color_sensitivity_sb[0];
- x->color_sensitivity[1] = x->color_sensitivity_sb[1];
+ av1_copy(x->color_sensitivity, x->color_sensitivity_sb);
skip_pred_mv = (x->nonrd_prune_ref_frame_search > 2 &&
- x->color_sensitivity[0] != 2 && x->color_sensitivity[1] != 2);
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_U)] != 2 &&
+ x->color_sensitivity[COLOR_SENS_IDX(AOM_PLANE_V)] != 2);
- find_predictors(cpi, x, ref_frame, frame_mv, tile_data, yv12_mb, bsize,
+ find_predictors(cpi, x, ref_frame, frame_mv, yv12_mb, bsize,
force_skip_low_temp_var, skip_pred_mv);
int continue_merging = 1;
@@ -2776,8 +2789,8 @@
// calling find_predictors() again.
av1_set_offsets_without_segment_id(cpi, &tile_data->tile_info, x, mi_row,
mi_col, this_mi[0]->bsize);
- find_predictors(cpi, x, ref_frame, frame_mv, tile_data, yv12_mb,
- this_mi[0]->bsize, force_skip_low_temp_var, skip_pred_mv);
+ find_predictors(cpi, x, ref_frame, frame_mv, yv12_mb, this_mi[0]->bsize,
+ force_skip_low_temp_var, skip_pred_mv);
} else {
struct scale_factors *sf = get_ref_scale_factors(cm, ref_frame);
const int is_scaled = av1_is_scaled(sf);
@@ -4451,14 +4464,15 @@
static int read_partition_tree(AV1_COMP *const cpi, PC_TREE *const pc_tree,
const int config_id) {
+ const AV1_COMMON *const cm = &cpi->common;
const char *path = cpi->oxcf.partition_info_path;
char filename[256];
snprintf(filename, sizeof(filename), "%s/partition_tree_sb%d_c%d", path,
cpi->sb_counter, config_id);
FILE *pfile = fopen(filename, "r");
if (pfile == NULL) {
- printf("Can't find the file: %s\n", filename);
- exit(0);
+ aom_internal_error(cm->error, AOM_CODEC_ERROR, "Can't find input file: %s.",
+ filename);
}
int read_bsize;
@@ -4618,7 +4632,9 @@
best_rdc.dist = sum_subblock_dist;
best_rdc.rdcost = RDCOST(x->rdmult, best_rdc.rate, best_rdc.dist);
break;
- default: assert(0 && "invalid partition type."); exit(0);
+ default:
+ assert(0 && "invalid partition type.");
+ aom_internal_error(cm->error, AOM_CODEC_ERROR, "Invalid partition type.");
}
// Note: it is necessary to restore context information.
av1_restore_context(x, &x_ctx, mi_row, mi_col, bsize, num_planes);
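/*
 * Editor's sketch (not part of the patch): the exit(0) calls are replaced by
 * aom_internal_error() because a library must never terminate its caller's
 * process. The pattern is record-and-longjmp back to the API boundary; the
 * struct and function names below are simplified stand-ins, not libaom's.
 */
#include <setjmp.h>
#include <stdarg.h>
#include <stdio.h>

struct error_info { jmp_buf jmp; char detail[80]; };

/* Simplified stand-in for aom_internal_error(): record, then unwind. */
static void internal_error(struct error_info *e, const char *fmt, ...) {
  va_list ap;
  va_start(ap, fmt);
  vsnprintf(e->detail, sizeof(e->detail), fmt, ap);
  va_end(ap);
  longjmp(e->jmp, 1);
}

static void deep_encoder_work(struct error_info *e) {
  internal_error(e, "Can't find input file: %s.", "partition_tree_sb0_c0");
}

int main(void) {
  struct error_info e;
  if (setjmp(e.jmp)) { /* API boundary: report instead of exiting */
    fprintf(stderr, "codec error: %s\n", e.detail);
    return 1;
  }
  deep_encoder_work(&e);
  return 0;
}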
@@ -4721,7 +4737,8 @@
update_partition_stats(&this_rdcost, &stats);
av1_ext_part_send_partition_stats(ext_part_controller, &stats);
if (!partition_decision.is_final_decision) {
- av1_free_pc_tree_recursive(pc_tree, av1_num_planes(cm), 0, 0);
+ av1_free_pc_tree_recursive(pc_tree, av1_num_planes(cm), 0, 0,
+ cpi->sf.part_sf.partition_search_type);
}
} while (!partition_decision.is_final_decision);
@@ -4729,8 +4746,8 @@
set_cb_offsets(x->cb_offset, 0, 0);
encode_sb(cpi, td, tile_data, tp, mi_row, mi_col, OUTPUT_ENABLED, bsize,
pc_tree, NULL);
-
- av1_free_pc_tree_recursive(pc_tree, av1_num_planes(cm), 0, 0);
+ av1_free_pc_tree_recursive(pc_tree, av1_num_planes(cm), 0, 0,
+ cpi->sf.part_sf.partition_search_type);
return true;
}
@@ -5003,7 +5020,8 @@
for (int i = 0; i < 4; ++i) {
if (pc_tree->split[i] != NULL) {
av1_free_pc_tree_recursive(pc_tree->split[i], av1_num_planes(cm), 0,
- 0);
+ 0,
+ cpi->sf.part_sf.partition_search_type);
pc_tree->split[i] = NULL;
}
}
@@ -5047,8 +5065,8 @@
set_cb_offsets(x->cb_offset, 0, 0);
encode_sb(cpi, td, tile_data, tp, mi_row, mi_col, OUTPUT_ENABLED, bsize,
pc_tree, NULL);
-
- av1_free_pc_tree_recursive(pc_tree, av1_num_planes(cm), 0, 0);
+ av1_free_pc_tree_recursive(pc_tree, av1_num_planes(cm), 0, 0,
+ cpi->sf.part_sf.partition_search_type);
return true;
}
@@ -5058,6 +5076,7 @@
SIMPLE_MOTION_DATA_TREE *sms_root, int mi_row,
int mi_col, const BLOCK_SIZE bsize,
RD_STATS *best_rd_cost) {
+ AV1_COMMON *const cm = &cpi->common;
if (cpi->ext_part_controller.ready) {
bool valid_search = true;
const aom_ext_part_decision_mode_t decision_mode =
@@ -5073,13 +5092,13 @@
return false;
}
if (!valid_search) {
- assert(0 && "Invalid search from ML model, partition search failed.");
- exit(0);
+ aom_internal_error(
+ cm->error, AOM_CODEC_ERROR,
+ "Invalid search from ML model, partition search failed");
}
return true;
}
- AV1_COMMON *const cm = &cpi->common;
MACROBLOCK *const x = &td->mb;
int best_idx = 0;
int64_t min_rdcost = INT64_MAX;
@@ -5093,10 +5112,10 @@
CHECK_MEM_ERROR(cm, rdcost, aom_calloc(num_configs, sizeof(*rdcost)));
}
if (num_configs <= 0) {
- av1_free_pc_tree_recursive(pc_tree, av1_num_planes(cm), 0, 0);
+ av1_free_pc_tree_recursive(pc_tree, av1_num_planes(cm), 0, 0,
+ cpi->sf.part_sf.partition_search_type);
if (rdcost != NULL) aom_free(rdcost);
- exit(0);
- return false;
+ aom_internal_error(cm->error, AOM_CODEC_ERROR, "Invalid configs.");
}
verify_write_partition_tree(cpi, pc_tree, bsize, i, mi_row, mi_col);
// Encode the block with the given partition tree. Get rdcost and encoding
@@ -5109,7 +5128,8 @@
best_idx = i;
*best_rd_cost = rdcost[i];
}
- av1_free_pc_tree_recursive(pc_tree, av1_num_planes(cm), 0, 0);
+ av1_free_pc_tree_recursive(pc_tree, av1_num_planes(cm), 0, 0,
+ cpi->sf.part_sf.partition_search_type);
++i;
} while (i < num_configs);
@@ -5121,8 +5141,8 @@
set_cb_offsets(x->cb_offset, 0, 0);
encode_sb(cpi, td, tile_data, tp, mi_row, mi_col, OUTPUT_ENABLED, bsize,
pc_tree, NULL);
-
- av1_free_pc_tree_recursive(pc_tree, av1_num_planes(cm), 0, 0);
+ av1_free_pc_tree_recursive(pc_tree, av1_num_planes(cm), 0, 0,
+ cpi->sf.part_sf.partition_search_type);
aom_free(rdcost);
++cpi->sb_counter;
@@ -5177,8 +5197,8 @@
max_var_4x4 = AOMMAX(max_var_4x4, var);
}
}
- *var_min = log(1.0 + min_var_4x4 / 16.0);
- *var_max = log(1.0 + max_var_4x4 / 16.0);
+ *var_min = log1p(min_var_4x4 / 16.0);
+ *var_max = log1p(max_var_4x4 / 16.0);
}
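/*
 * Editor's note (not part of the patch): log1p(x) is preferred over
 * log(1.0 + x) because it stays accurate when x is tiny; adding 1.0 first
 * rounds x away entirely. A quick demonstration:
 */
#include <math.h>
#include <stdio.h>

int main(void) {
  const double x = 1e-18; /* e.g. a near-zero normalized 4x4 variance */
  printf("log(1+x) = %.20g\n", log(1.0 + x)); /* prints 0: x is lost */
  printf("log1p(x) = %.20g\n", log1p(x));     /* ~1e-18: correct */
  return 0;
}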
static AOM_INLINE void set_sms_tree_partitioning(
@@ -5359,6 +5379,7 @@
// partition types and intra cnn output.
if (x->must_find_valid_partition) {
reset_part_limitations(cpi, &part_search_state);
+ av1_prune_partitions_by_max_min_bsize(&x->sb_enc, &part_search_state);
// Invalidate intra cnn output for key frames.
if (frame_is_intra_only(cm) && bsize == BLOCK_64X64) {
part_search_state.intra_part_info->quad_tree_idx = 0;
@@ -5372,27 +5393,41 @@
start_timing(cpi, none_partition_search_time);
#endif
- // Further pruning or in some cases reverse pruning when allintra is set
- // This code helps visual and in some cases metrics quality where the current
- // block comprises at least one very low variance sub-block and at least one
- // where the variance is much higher.
- //
- // The idea is that in such cases there is danger of ringing and other visual
- // artifacts from a high variance feature such as an edge into a very low
- // variance region.
- //
- // The approach taken is to force break down / split to a smaller block size
- // to try and separate out the low variance and well predicted blocks from the
- // more complex ones and to prevent propagation of ringing over a large
- // region.
- if ((cpi->oxcf.mode == ALLINTRA) && (bsize >= BLOCK_16X16)) {
- double var_min, var_max;
- log_sub_block_var(cpi, x, bsize, &var_min, &var_max);
+ if (cpi->oxcf.mode == ALLINTRA) {
+ const bool bsize_at_least_16x16 = (bsize >= BLOCK_16X16);
+ const bool prune_rect_part_using_4x4_var_deviation =
+ (cpi->sf.part_sf.prune_rect_part_using_4x4_var_deviation &&
+ !x->must_find_valid_partition);
- if ((var_min < 0.272) && ((var_max - var_min) > 3.0)) {
- part_search_state.partition_none_allowed = 0;
- part_search_state.terminate_partition_search = 0;
- part_search_state.do_square_split = 1;
+ if (bsize_at_least_16x16 || prune_rect_part_using_4x4_var_deviation) {
+ double var_min, var_max;
+ log_sub_block_var(cpi, x, bsize, &var_min, &var_max);
+
+ // Further pruning or in some cases reverse pruning when allintra is set.
+ // This code helps visual and in some cases metrics quality where the
+ // current block comprises at least one very low variance sub-block and at
+ // least one where the variance is much higher.
+ //
+ // The idea is that in such cases there is danger of ringing and other
+ // visual artifacts from a high variance feature such as an edge into a
+ // very low variance region.
+ //
+ // The approach taken is to force break down / split to a smaller block
+ // size to try and separate out the low variance and well predicted blocks
+ // from the more complex ones and to prevent propagation of ringing over a
+ // large region.
+ if (bsize_at_least_16x16 && (var_min < 0.272) &&
+ ((var_max - var_min) > 3.0)) {
+ part_search_state.partition_none_allowed = 0;
+ part_search_state.terminate_partition_search = 0;
+ part_search_state.do_square_split = 1;
+ } else if (prune_rect_part_using_4x4_var_deviation &&
+ (var_max - var_min < 3.0)) {
+ // Prune rectangular partitions if the variance deviation of 4x4
+ // sub-blocks within the block is less than a threshold (derived
+ // empirically).
+ part_search_state.do_rectangular_split = 0;
+ }
}
}
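
For a concrete feel of the thresholds above, a small sketch with assumed raw 4x4 variances on the log1p(var / 16.0) scale that log_sub_block_var() produces:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
      // Assumed 4x4 variances: a flat sub-block (raw var 2) next to a busy
      // one (raw var 700), mapped onto the log1p(var / 16.0) scale.
      const double var_min = log1p(2.0 / 16.0);    // ~0.118 < 0.272
      const double var_max = log1p(700.0 / 16.0);  // ~3.80
      if (var_min < 0.272 && (var_max - var_min) > 3.0)
        printf("force split: low- and high-variance content mixed\n");
      return 0;
    }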
@@ -5584,7 +5619,8 @@
encode_sb(cpi, td, tile_data, tp, mi_row, mi_col, run_type, bsize,
pc_tree, NULL);
// Dealloc the whole PC_TREE after a superblock is done.
- av1_free_pc_tree_recursive(pc_tree, num_planes, 0, 0);
+ av1_free_pc_tree_recursive(pc_tree, num_planes, 0, 0,
+ cpi->sf.part_sf.partition_search_type);
pc_tree_dealloc = 1;
} else if (should_do_dry_run_encode_for_current_block(
cm->seq_params->sb_size, x->sb_enc.max_partition_size,
@@ -5601,7 +5637,8 @@
// If the tree still exists (non-superblock), dealloc most nodes, only keep
// nodes for the best partition and PARTITION_NONE.
if (pc_tree_dealloc == 0)
- av1_free_pc_tree_recursive(pc_tree, num_planes, 1, 1);
+ av1_free_pc_tree_recursive(pc_tree, num_planes, 1, 1,
+ cpi->sf.part_sf.partition_search_type);
if (bsize == cm->seq_params->sb_size) {
assert(best_rdc.rate < INT_MAX);
@@ -5659,7 +5696,7 @@
float score[LABELS];
features[feature_idx] =
- (logf((float)(dc_q * dc_q) / 256.0f + 1.0f) - means[feature_idx]) /
+ (log1pf((float)(dc_q * dc_q) / 256.0f) - means[feature_idx]) /
sqrtf(vars[feature_idx]);
feature_idx++;
av1_setup_src_planes(x, cpi->source, mi_row, mi_col, 1, bsize);
@@ -5679,8 +5716,8 @@
cpi->ppi->fn_ptr[bsize].vf(src, src_stride, pred, pred_stride, &sse);
const float factor = (var == 0) ? 1.0f : (1.0f / (float)var);
- features[feature_idx] = (logf((float)var + 1.0f) - means[feature_idx]) /
- sqrtf(vars[feature_idx]);
+ features[feature_idx] =
+ (log1pf((float)var) - means[feature_idx]) / sqrtf(vars[feature_idx]);
feature_idx++;
for (i = 0; i < 4; ++i) {
const int x_idx = (i & 1) * bs / 2;
@@ -5735,7 +5772,7 @@
cm->seq_params->bit_depth);
int feature_idx = 0;
- features[feature_idx++] = logf((float)(dc_q * dc_q) / 256.0f + 1.0f);
+ features[feature_idx++] = log1pf((float)(dc_q * dc_q) / 256.0f);
av1_setup_src_planes(x, cpi->source, mi_row, mi_col, 1, bsize);
{
const int bs = block_size_wide[bsize];
@@ -5768,7 +5805,7 @@
cpi->fn_ptr[bsize].vf(src, src_stride, pred, pred_stride, &sse);
const float factor = (var == 0) ? 1.0f : (1.0f / (float)var);
- features[feature_idx++] = logf((float)var + 1.0f);
+ features[feature_idx++] = log1pf((float)var);
fprintf(f, "%f,%f,", features[0], features[1]);
for (i = 0; i < 4; ++i) {
diff --git a/av1/encoder/partition_strategy.c b/av1/encoder/partition_strategy.c
index 89c1a79..080587b 100644
--- a/av1/encoder/partition_strategy.c
+++ b/av1/encoder/partition_strategy.c
@@ -187,7 +187,7 @@
const int bit_depth = xd->bd;
const int dc_q =
av1_dc_quant_QTX(x->qindex, 0, bit_depth) >> (bit_depth - 8);
- part_info->log_q = logf(1.0f + (float)(dc_q * dc_q) / 256.0f);
+ part_info->log_q = log1pf((float)(dc_q * dc_q) / 256.0f);
part_info->log_q =
(part_info->log_q - av1_intra_mode_cnn_partition_mean[0]) /
av1_intra_mode_cnn_partition_std[0];
@@ -602,21 +602,21 @@
int f_idx = 0;
if (features_to_get & FEATURE_SMS_NONE_FLAG) {
for (int sub_idx = 0; sub_idx < 2; sub_idx++) {
- features[f_idx++] = logf(1.0f + sms_tree->sms_none_feat[sub_idx]);
+ features[f_idx++] = log1pf((float)sms_tree->sms_none_feat[sub_idx]);
}
}
if (features_to_get & FEATURE_SMS_SPLIT_FLAG) {
for (int sub_idx = 0; sub_idx < SUB_PARTITIONS_SPLIT; sub_idx++) {
SIMPLE_MOTION_DATA_TREE *sub_tree = sms_tree->split[sub_idx];
- features[f_idx++] = logf(1.0f + sub_tree->sms_none_feat[0]);
- features[f_idx++] = logf(1.0f + sub_tree->sms_none_feat[1]);
+ features[f_idx++] = log1pf((float)sub_tree->sms_none_feat[0]);
+ features[f_idx++] = log1pf((float)sub_tree->sms_none_feat[1]);
}
}
if (features_to_get & FEATURE_SMS_RECT_FLAG) {
for (int sub_idx = 0; sub_idx < 8; sub_idx++) {
- features[f_idx++] = logf(1.0f + sms_tree->sms_rect_feat[sub_idx]);
+ features[f_idx++] = log1pf((float)sms_tree->sms_rect_feat[sub_idx]);
}
}
@@ -625,7 +625,7 @@
// Q_INDEX
const int dc_q = av1_dc_quant_QTX(x->qindex, 0, xd->bd) >> (xd->bd - 8);
- features[f_idx++] = logf(1.0f + (float)(dc_q * dc_q) / 256.0f);
+ features[f_idx++] = log1pf((float)(dc_q * dc_q) / 256.0f);
// Neighbor stuff
const int has_above = !!xd->above_mbmi;
@@ -742,9 +742,9 @@
FEATURE_SMS_PRUNE_PART_FLAG);
int f_idx = FEATURE_SIZE_SMS_PRUNE_PART;
- features[f_idx++] = logf(1.0f + (float)none_rdc->rate);
- features[f_idx++] = logf(1.0f + (float)none_rdc->dist);
- features[f_idx++] = logf(1.0f + (float)none_rdc->rdcost);
+ features[f_idx++] = log1pf((float)none_rdc->rate);
+ features[f_idx++] = log1pf((float)none_rdc->dist);
+ features[f_idx++] = log1pf((float)none_rdc->rdcost);
assert(f_idx == FEATURE_SIZE_SMS_TERM_NONE);
@@ -809,7 +809,7 @@
int f_idx = 0;
const int dc_q = av1_dc_quant_QTX(x->qindex, 0, xd->bd) >> (xd->bd - 8);
- const float log_q_sq = logf(1.0f + (float)(dc_q * dc_q) / 256.0f);
+ const float log_q_sq = log1pf((float)(dc_q * dc_q) / 256.0f);
// Perform full-pixel single motion search in Y plane of 16x16 mbs in the sb
float sum_mv_row_sq = 0;
@@ -845,7 +845,7 @@
const float mv_row = (float)(best_mv.as_mv.row / 8);
const float mv_col = (float)(best_mv.as_mv.col / 8);
- const float log_sse = logf(1.0f + (float)sse);
+ const float log_sse = log1pf((float)sse);
const float abs_mv_row = fabsf(mv_row);
const float abs_mv_col = fabsf(mv_col);
@@ -1056,8 +1056,8 @@
int f_idx = 0;
float features[FEATURES] = { 0.0f };
- features[f_idx++] = logf(1.0f + (float)dc_q / 4.0f);
- features[f_idx++] = logf(1.0f + (float)best_rd / bs / bs / 1024.0f);
+ features[f_idx++] = log1pf((float)dc_q / 4.0f);
+ features[f_idx++] = log1pf((float)best_rd / bs / bs / 1024.0f);
add_rd_feature(part_none_rd, best_rd, features, &f_idx);
add_rd_feature(part_split_rd, best_rd, features, &f_idx);
@@ -1075,17 +1075,17 @@
bsize, NULL,
FEATURE_SMS_PRUNE_PART_FLAG);
- features[f_idx++] = logf(1.0f + (float)sms_tree->sms_none_feat[1]);
+ features[f_idx++] = log1pf((float)sms_tree->sms_none_feat[1]);
- features[f_idx++] = logf(1.0f + (float)sms_tree->split[0]->sms_none_feat[1]);
- features[f_idx++] = logf(1.0f + (float)sms_tree->split[1]->sms_none_feat[1]);
- features[f_idx++] = logf(1.0f + (float)sms_tree->split[2]->sms_none_feat[1]);
- features[f_idx++] = logf(1.0f + (float)sms_tree->split[3]->sms_none_feat[1]);
+ features[f_idx++] = log1pf((float)sms_tree->split[0]->sms_none_feat[1]);
+ features[f_idx++] = log1pf((float)sms_tree->split[1]->sms_none_feat[1]);
+ features[f_idx++] = log1pf((float)sms_tree->split[2]->sms_none_feat[1]);
+ features[f_idx++] = log1pf((float)sms_tree->split[3]->sms_none_feat[1]);
- features[f_idx++] = logf(1.0f + (float)sms_tree->sms_rect_feat[1]);
- features[f_idx++] = logf(1.0f + (float)sms_tree->sms_rect_feat[3]);
- features[f_idx++] = logf(1.0f + (float)sms_tree->sms_rect_feat[5]);
- features[f_idx++] = logf(1.0f + (float)sms_tree->sms_rect_feat[7]);
+ features[f_idx++] = log1pf((float)sms_tree->sms_rect_feat[1]);
+ features[f_idx++] = log1pf((float)sms_tree->sms_rect_feat[3]);
+ features[f_idx++] = log1pf((float)sms_tree->sms_rect_feat[5]);
+ features[f_idx++] = log1pf((float)sms_tree->sms_rect_feat[7]);
assert(f_idx == FEATURES);
diff --git a/av1/encoder/pass2_strategy.c b/av1/encoder/pass2_strategy.c
index d8b96c5..46bc6b0 100644
--- a/av1/encoder/pass2_strategy.c
+++ b/av1/encoder/pass2_strategy.c
@@ -20,6 +20,7 @@
#include <assert.h>
#include <stdint.h>
+#include "aom_mem/aom_mem.h"
#include "config/aom_config.h"
#include "config/aom_scale_rtcd.h"
@@ -513,12 +514,12 @@
gf_stats->gf_group_inactive_zone_rows += stats->inactive_zone_rows;
}
-void av1_accumulate_next_frame_stats(const FIRSTPASS_STATS *stats,
- const int flash_detected,
- const int frames_since_key,
- const int cur_idx,
- GF_GROUP_STATS *gf_stats, int f_w,
- int f_h) {
+static void accumulate_next_frame_stats(const FIRSTPASS_STATS *stats,
+ const int flash_detected,
+ const int frames_since_key,
+ const int cur_idx,
+ GF_GROUP_STATS *gf_stats, int f_w,
+ int f_h) {
accumulate_frame_motion_stats(stats, gf_stats, f_w, f_h);
// sum up the metric values of current gf group
gf_stats->avg_sr_coded_error += stats->sr_coded_error;
@@ -1034,9 +1035,9 @@
return 0;
}
-static int is_shorter_gf_interval_better(AV1_COMP *cpi,
- EncodeFrameParams *frame_params) {
- RATE_CONTROL *const rc = &cpi->rc;
+static int is_shorter_gf_interval_better(
+ AV1_COMP *cpi, const EncodeFrameParams *frame_params) {
+ const RATE_CONTROL *const rc = &cpi->rc;
PRIMARY_RATE_CONTROL *const p_rc = &cpi->ppi->p_rc;
int gop_length_decision_method = cpi->sf.tpl_sf.gop_length_decision_method;
int shorten_gf_interval;
@@ -1938,9 +1939,8 @@
flash_detected = detect_flash(twopass, &cpi->twopass_frame, 0);
// TODO(bohanli): remove redundant accumulations here, or unify
// this and the ones in define_gf_group
- av1_accumulate_next_frame_stats(&next_frame, flash_detected,
- rc->frames_since_key, i, &gf_stats, f_w,
- f_h);
+ accumulate_next_frame_stats(&next_frame, flash_detected,
+ rc->frames_since_key, i, &gf_stats, f_w, f_h);
cut_here = detect_gf_cut(cpi, i, cur_start, flash_detected,
active_max_gf_interval, active_min_gf_interval,
@@ -2044,8 +2044,9 @@
temp_accu_coeff *= stats[n].cor_coeff;
this_score +=
temp_accu_coeff *
- (1 - stats[n].noise_var /
- AOMMAX(regions[this_reg].avg_intra_err, 0.001));
+ sqrt(AOMMAX(0.5,
+ 1 - stats[n].noise_var /
+ AOMMAX(stats[n].intra_error, 0.001)));
count_f++;
}
// preceding frames
@@ -2055,8 +2056,9 @@
temp_accu_coeff *= stats[n].cor_coeff;
this_score +=
temp_accu_coeff *
- (1 - stats[n].noise_var /
- AOMMAX(regions[this_reg].avg_intra_err, 0.001));
+ sqrt(AOMMAX(0.5,
+ 1 - stats[n].noise_var /
+ AOMMAX(stats[n].intra_error, 0.001)));
}
if (this_score > best_score) {
@@ -2276,9 +2278,8 @@
flash_detected = detect_flash(twopass, &cpi->twopass_frame, 0);
// accumulate stats for next frame
- av1_accumulate_next_frame_stats(next_frame, flash_detected,
- rc->frames_since_key, i, gf_stats, f_w,
- f_h);
+ accumulate_next_frame_stats(next_frame, flash_detected,
+ rc->frames_since_key, i, gf_stats, f_w, f_h);
++i;
}
@@ -3410,13 +3411,13 @@
TWO_PASS_FRAME *twopass_frame = &cpi->twopass_frame;
// The multiplication by 256 reverses a scaling factor of (>> 8)
// applied when combining MB error values for the frame.
- twopass_frame->mb_av_energy = log((this_frame_ptr->intra_error) + 1.0);
+ twopass_frame->mb_av_energy = log1p(this_frame_ptr->intra_error);
const FIRSTPASS_STATS *const total_stats =
cpi->ppi->twopass.stats_buf_ctx->total_stats;
if (is_fp_wavelet_energy_invalid(total_stats) == 0) {
twopass_frame->frame_avg_haar_energy =
- log((this_frame_ptr->frame_avg_wavelet_energy) + 1.0);
+ log1p(this_frame_ptr->frame_avg_wavelet_energy);
}
// Set the frame content type flag.
@@ -3510,6 +3511,39 @@
}
}
+// Smooth out the noise variance so it is more stable
+// TODO(bohanli): Use a better low-pass filter than averaging
+static void smooth_filter_noise(FIRSTPASS_STATS *first_stats,
+ FIRSTPASS_STATS *last_stats) {
+ int len = (int)(last_stats - first_stats);
+ double *smooth_noise = aom_malloc(len * sizeof(*smooth_noise));
+ if (!smooth_noise) return;
+
+ for (int i = 0; i < len; i++) {
+ double total_noise = 0;
+ double total_wt = 0;
+ for (int j = -HALF_FILT_LEN; j <= HALF_FILT_LEN; j++) {
+ int idx = AOMMIN(AOMMAX(i + j, 0), len - 1);
+ if (first_stats[idx].is_flash) continue;
+
+ total_noise += first_stats[idx].noise_var;
+ total_wt += 1.0;
+ }
+ if (total_wt > 0.01) {
+ total_noise /= total_wt;
+ } else {
+ total_noise = first_stats[i].noise_var;
+ }
+ smooth_noise[i] = total_noise;
+ }
+
+ for (int i = 0; i < len; i++) {
+ first_stats[i].noise_var = smooth_noise[i];
+ }
+
+ aom_free(smooth_noise);
+}
+
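smooth_filter_noise() is a clamped-window moving average: indices past either end are clamped to the border frame, and flash frames are skipped. A self-contained sketch of the same averaging on a plain array, with an assumed window half-length of 2 standing in for HALF_FILT_LEN:

    #include <stdio.h>

    #define HALF_FILT_LEN 2  // assumed window half-length for this sketch

    // Clamped-window moving average over a plain array, mirroring the filter
    // above minus the is_flash skipping.
    static void smooth(const double *in, double *out, int len) {
      for (int i = 0; i < len; i++) {
        double total = 0.0;
        int count = 0;
        for (int j = -HALF_FILT_LEN; j <= HALF_FILT_LEN; j++) {
          int idx = i + j;
          if (idx < 0) idx = 0;
          if (idx > len - 1) idx = len - 1;
          total += in[idx];
          count++;
        }
        out[i] = total / count;
      }
    }

    int main(void) {
      const double noise[] = { 1.0, 9.0, 1.0, 1.0, 9.0, 1.0 };
      double smoothed[6];
      smooth(noise, smoothed, 6);
      for (int i = 0; i < 6; i++) printf("%.2f ", smoothed[i]);  // spikes damped
      printf("\n");
      return 0;
    }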
// Estimate the noise variance of each frame from the first pass stats
void av1_estimate_noise(FIRSTPASS_STATS *first_stats,
FIRSTPASS_STATS *last_stats) {
@@ -3597,6 +3631,8 @@
this_stats++) {
this_stats->noise_var = (first_stats + 2)->noise_var;
}
+
+ smooth_filter_noise(first_stats, last_stats);
}
// Estimate correlation coefficient of each frame with its previous frame.
@@ -3638,6 +3674,10 @@
frame_params->show_frame =
!(gf_group->update_type[cpi->gf_frame_index] == ARF_UPDATE ||
gf_group->update_type[cpi->gf_frame_index] == INTNL_ARF_UPDATE);
+ if (cpi->gf_frame_index == 0) {
+ av1_tf_info_reset(&cpi->ppi->tf_info);
+ av1_tf_info_filtering(&cpi->ppi->tf_info, cpi, gf_group);
+ }
return;
}
@@ -3678,17 +3718,6 @@
if (oxcf->rc_cfg.mode == AOM_Q)
rc->active_worst_quality = oxcf->rc_cfg.cq_level;
- FIRSTPASS_STATS this_frame;
- av1_zero(this_frame);
- // call above fn
- if (is_stat_consumption_stage(cpi)) {
- if (cpi->gf_frame_index < gf_group->size || rc->frames_to_key == 0) {
- process_first_pass_stats(cpi, &this_frame);
- update_total_stats = 1;
- }
- } else {
- rc->active_worst_quality = oxcf->rc_cfg.cq_level;
- }
if (cpi->gf_frame_index == gf_group->size) {
if (cpi->ppi->lap_enabled && cpi->ppi->p_rc.enable_scenecut_detection) {
@@ -3701,6 +3730,18 @@
}
}
+ FIRSTPASS_STATS this_frame;
+ av1_zero(this_frame);
+ // call above fn
+ if (is_stat_consumption_stage(cpi)) {
+ if (cpi->gf_frame_index < gf_group->size || rc->frames_to_key == 0) {
+ process_first_pass_stats(cpi, &this_frame);
+ update_total_stats = 1;
+ }
+ } else {
+ rc->active_worst_quality = oxcf->rc_cfg.cq_level;
+ }
+
// Keyframe and section processing.
FIRSTPASS_STATS this_frame_copy;
this_frame_copy = this_frame;
@@ -4160,33 +4201,50 @@
// If the rate control is drifting consider adjustment to min or maxq.
if ((rc_cfg->mode != AOM_Q) && !cpi->rc.is_src_frame_alt_ref) {
- int maxq_adj_limit;
int minq_adj_limit;
- maxq_adj_limit = rc->worst_quality - rc->active_worst_quality;
+ int maxq_adj_limit;
minq_adj_limit =
(rc_cfg->mode == AOM_CQ ? MINQ_ADJ_LIMIT_CQ : MINQ_ADJ_LIMIT);
- // Undershoot.
- if (p_rc->rate_error_estimate > rc_cfg->under_shoot_pct) {
- --twopass->extend_maxq;
- if (p_rc->rolling_target_bits >= p_rc->rolling_actual_bits)
- ++twopass->extend_minq;
- // Overshoot.
- } else if (p_rc->rate_error_estimate < -rc_cfg->over_shoot_pct) {
- --twopass->extend_minq;
- if (p_rc->rolling_target_bits < p_rc->rolling_actual_bits)
- ++twopass->extend_maxq;
+ maxq_adj_limit = rc->worst_quality - rc->active_worst_quality;
+
+ // Undershoot
+ if ((rc_cfg->under_shoot_pct < 100) &&
+ (p_rc->rolling_actual_bits < p_rc->rolling_target_bits)) {
+ int pct_error =
+ ((p_rc->rolling_target_bits - p_rc->rolling_actual_bits) * 100) /
+ p_rc->rolling_target_bits;
+
+ if ((pct_error >= rc_cfg->under_shoot_pct) &&
+ (p_rc->rate_error_estimate > 0)) {
+ twopass->extend_minq += 1;
+ }
+ twopass->extend_maxq -= 1;
+ // Overshoot
+ } else if ((rc_cfg->over_shoot_pct < 100) &&
+ (p_rc->rolling_actual_bits > p_rc->rolling_target_bits)) {
+ int pct_error =
+ ((p_rc->rolling_actual_bits - p_rc->rolling_target_bits) * 100) /
+ p_rc->rolling_target_bits;
+
+ pct_error = clamp(pct_error, 0, 100);
+ if ((pct_error >= rc_cfg->over_shoot_pct) &&
+ (p_rc->rate_error_estimate < 0)) {
+ twopass->extend_maxq += 1;
+ }
+ twopass->extend_minq -= 1;
} else {
// Adjustment for extreme local overshoot.
+ // Only applies when normal adjustment above is not used (e.g.
+ // when threshold is set to 100).
if (rc->projected_frame_size > (2 * rc->base_frame_target) &&
rc->projected_frame_size > (2 * rc->avg_frame_bandwidth))
++twopass->extend_maxq;
- // Unwind undershoot or overshoot adjustment.
- if (p_rc->rolling_target_bits < p_rc->rolling_actual_bits)
- --twopass->extend_minq;
+ // Unwind extreme overshoot adjustment.
else if (p_rc->rolling_target_bits > p_rc->rolling_actual_bits)
--twopass->extend_maxq;
}
- twopass->extend_minq = clamp(twopass->extend_minq, 0, minq_adj_limit);
+ twopass->extend_minq =
+ clamp(twopass->extend_minq, -minq_adj_limit, minq_adj_limit);
twopass->extend_maxq = clamp(twopass->extend_maxq, 0, maxq_adj_limit);
// If there is a big and unexpected undershoot then feed the extra
@@ -4200,19 +4258,6 @@
fast_extra_thresh - rc->projected_frame_size;
p_rc->vbr_bits_off_target_fast = AOMMIN(p_rc->vbr_bits_off_target_fast,
(4 * rc->avg_frame_bandwidth));
-
- // Fast adaptation of minQ if necessary to use up the extra bits.
- if (rc->avg_frame_bandwidth) {
- twopass->extend_minq_fast = (int)(p_rc->vbr_bits_off_target_fast * 8 /
- rc->avg_frame_bandwidth);
- }
- twopass->extend_minq_fast = AOMMIN(
- twopass->extend_minq_fast, minq_adj_limit - twopass->extend_minq);
- } else if (p_rc->vbr_bits_off_target_fast) {
- twopass->extend_minq_fast = AOMMIN(
- twopass->extend_minq_fast, minq_adj_limit - twopass->extend_minq);
- } else {
- twopass->extend_minq_fast = 0;
}
}
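
The reworked logic above keys undershoot/overshoot off a windowed percentage error rather than rate_error_estimate alone. A worked sketch with assumed rolling-window totals:

    #include <stdio.h>

    int main(void) {
      // Assumed rolling-window totals for illustration.
      const long long target = 1000000, actual = 900000;
      const int under_shoot_pct = 8;  // assumed config threshold (< 100)
      const int pct_error = (int)((target - actual) * 100 / target);  // 10
      // With pct_error >= under_shoot_pct and a positive rate_error_estimate,
      // extend_minq is bumped by 1; extend_maxq is relaxed by 1 regardless.
      printf("pct_error = %d (%s)\n", pct_error,
             pct_error >= under_shoot_pct ? "extend minq" : "no minq change");
      return 0;
    }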
@@ -4223,7 +4268,6 @@
p_rc->vbr_bits_off_target_fast;
cpi->ppi->p_rc.temp_extend_minq = twopass->extend_minq;
cpi->ppi->p_rc.temp_extend_maxq = twopass->extend_maxq;
- cpi->ppi->p_rc.temp_extend_minq_fast = twopass->extend_minq_fast;
}
#endif
}
diff --git a/av1/encoder/pass2_strategy.h b/av1/encoder/pass2_strategy.h
index a75be1a..e34454e 100644
--- a/av1/encoder/pass2_strategy.h
+++ b/av1/encoder/pass2_strategy.h
@@ -134,12 +134,6 @@
int *num_fpstats_used, int *num_fpstats_required,
int project_gfu_boost);
-void av1_accumulate_next_frame_stats(const FIRSTPASS_STATS *stats,
- const int flash_detected,
- const int frames_since_key,
- const int cur_idx,
- GF_GROUP_STATS *gf_stats, int f_w,
- int f_h);
// Identify stable and unstable regions from first pass stats.
// stats_start points to the first frame to analyze.
// |offset| is the offset from the current frame to the frame stats_start is
diff --git a/av1/encoder/pickcdef.c b/av1/encoder/pickcdef.c
index 22a4557..293dafa 100644
--- a/av1/encoder/pickcdef.c
+++ b/av1/encoder/pickcdef.c
@@ -638,7 +638,7 @@
const int nvfb = cdef_search_ctx->nvfb;
const int nhfb = cdef_search_ctx->nhfb;
cdef_search_ctx->sb_index =
- aom_malloc(nvfb * nhfb * sizeof(cdef_search_ctx->sb_index));
+ aom_malloc(nvfb * nhfb * sizeof(cdef_search_ctx->sb_index[0]));
cdef_search_ctx->sb_count = 0;
cdef_search_ctx->mse[0] =
aom_malloc(sizeof(**cdef_search_ctx->mse) * nvfb * nhfb);
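
The sb_index allocation fix above swaps sizeof(pointer) for sizeof(element), the classic malloc-sizing pitfall. A generic illustration (hypothetical buffer, not the libaom type):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
      int *index;
      // sizeof(index) is the pointer size (typically 8 on 64-bit), not the
      // element size, so sizing an allocation with it over- or under-allocates
      // depending on how the element width compares to the pointer width.
      printf("sizeof(index)    = %zu\n", sizeof(index));     // pointer size
      printf("sizeof(index[0]) = %zu\n", sizeof(index[0]));  // element size
      index = malloc(100 * sizeof(index[0]));                // the fixed form
      free(index);
      return 0;
    }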
@@ -728,8 +728,8 @@
#endif
}
-static void pick_cdef_from_qp(AV1_COMMON *const cm, int skip_cdef,
- int is_screen_content) {
+void av1_pick_cdef_from_qp(AV1_COMMON *const cm, int skip_cdef,
+ int is_screen_content) {
const int bd = cm->seq_params->bit_depth;
const int q =
av1_ac_quant_QTX(cm->quant_params.base_qindex, 0, bd) >> (bd - 8);
@@ -807,6 +807,8 @@
const int nvfb = (mi_params->mi_rows + MI_SIZE_64X64 - 1) / MI_SIZE_64X64;
const int nhfb = (mi_params->mi_cols + MI_SIZE_64X64 - 1) / MI_SIZE_64X64;
MB_MODE_INFO **mbmi = mi_params->mi_grid_base;
+ // mbmi is NULL when real-time rate control library is used.
+ if (!mbmi) return;
for (int r = 0; r < nvfb; ++r) {
for (int c = 0; c < nhfb; ++c) {
MB_MODE_INFO *current_mbmi = mbmi[MI_SIZE_64X64 * c];
@@ -820,7 +822,8 @@
const YV12_BUFFER_CONFIG *ref, AV1_COMMON *cm,
MACROBLOCKD *xd, CDEF_PICK_METHOD pick_method, int rdmult,
int skip_cdef_feature, CDEF_CONTROL cdef_control,
- const int is_screen_content, int non_reference_frame) {
+ const int is_screen_content, int non_reference_frame,
+ int rtc_ext_rc) {
assert(cdef_control != CDEF_NONE);
if (cdef_control == CDEF_REFERENCE && non_reference_frame) {
CdefInfo *const cdef_info = &cm->cdef_info;
@@ -831,8 +834,12 @@
return;
}
+ if (rtc_ext_rc) {
+ av1_pick_cdef_from_qp(cm, 0, 0);
+ return;
+ }
if (pick_method == CDEF_PICK_FROM_Q) {
- pick_cdef_from_qp(cm, skip_cdef_feature, is_screen_content);
+ av1_pick_cdef_from_qp(cm, skip_cdef_feature, is_screen_content);
return;
}
const CommonModeInfoParams *const mi_params = &cm->mi_params;
diff --git a/av1/encoder/pickcdef.h b/av1/encoder/pickcdef.h
index 548a740..bdd8233 100644
--- a/av1/encoder/pickcdef.h
+++ b/av1/encoder/pickcdef.h
@@ -235,6 +235,7 @@
* \param[in] is_screen_content Whether it is screen content type
* \param[in] non_reference_frame Indicates if current frame is
* non-reference
+ * \param[in]    rtc_ext_rc        Indicates if external RC is used for testing
*
* \remark Nothing is returned. Instead, optimal CDEF parameters are stored
* in the \c cdef_info structure of type \ref CdefInfo inside \c cm:
@@ -252,7 +253,22 @@
const YV12_BUFFER_CONFIG *ref, AV1_COMMON *cm,
MACROBLOCKD *xd, CDEF_PICK_METHOD pick_method, int rdmult,
int skip_cdef_feature, CDEF_CONTROL cdef_control,
- const int is_screen_content, int non_reference_frame);
+ const int is_screen_content, int non_reference_frame,
+ int rtc_ext_rc);
+
+/*!\brief AV1 CDEF level from QP
+ *
+ * \ingroup in_loop_cdef
+ *
+ * Calculates CDEF levels from frame QP. Only used for speed 7+ with RT mode.
+ *
+ * \param[in,out] cm Pointer to top level common structure
+ * \param[in] skip_cdef Flag to skip CDEF filtering
+ * \param[in] is_screen_content Flag indicating screen content
+ *
+ */
+void av1_pick_cdef_from_qp(AV1_COMMON *const cm, int skip_cdef,
+ int is_screen_content);
#ifdef __cplusplus
} // extern "C"
diff --git a/av1/encoder/picklpf.c b/av1/encoder/picklpf.c
index 90c3c1a..9084d3f 100644
--- a/av1/encoder/picklpf.c
+++ b/av1/encoder/picklpf.c
@@ -234,6 +234,17 @@
cpi->common.width * cpi->common.height > 352 * 288))
? 12034
: 6017;
+ // Increase strength on base TL0 for temporal layers, for low resolutions,
+ // based on frame source_sad.
+ if (cpi->svc.number_temporal_layers > 1 &&
+ cpi->svc.temporal_layer_id == 0 &&
+ cpi->common.width * cpi->common.height <= 352 * 288 &&
+ cpi->sf.rt_sf.use_nonrd_pick_mode) {
+ if (cpi->rc.frame_source_sad > 100000)
+ inter_frame_multiplier = inter_frame_multiplier << 1;
+ else if (cpi->rc.frame_source_sad > 50000)
+ inter_frame_multiplier = 3 * (inter_frame_multiplier >> 1);
+ }
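
The multiplier adjustments above use cheap integer arithmetic: a left shift doubles, and 3 * (x >> 1) gives roughly 1.5x. With the low-resolution base value from the surrounding code:

    #include <stdio.h>

    int main(void) {
      // Assuming the low-resolution base multiplier of 6017 from above.
      const int m = 6017;
      printf("doubled: %d\n", m << 1);        // 12034, for source_sad > 100000
      printf("~1.5x:   %d\n", 3 * (m >> 1));  // 9024, for source_sad > 50000
      return 0;
    }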
// These values were determined by linear fitting the result of the
// searched level for 8 bit depth:
// Keyframes: filt_guess = q * 0.06699 - 1.60817
diff --git a/av1/encoder/pickrst.c b/av1/encoder/pickrst.c
index dc599a8..7212469 100644
--- a/av1/encoder/pickrst.c
+++ b/av1/encoder/pickrst.c
@@ -32,10 +32,6 @@
#include "av1/encoder/picklpf.h"
#include "av1/encoder/pickrst.h"
-// When set to RESTORE_WIENER or RESTORE_SGRPROJ only those are allowed.
-// When set to RESTORE_TYPES we allow switchable.
-static const RestorationType force_restore_type = RESTORE_TYPES;
-
// Number of Wiener iterations
#define NUM_WIENER_ITERS 5
@@ -149,6 +145,11 @@
SgrprojInfo sgrproj;
WienerInfo wiener;
PixelRect tile_rect;
+
+ // Buffers used to hold dgd-avg and src-avg data respectively during SIMD
+ // call of Wiener filter.
+ int16_t *dgd_avg;
+ int16_t *src_avg;
} RestSearchCtxt;
static AOM_INLINE void rsc_on_tile(void *priv) {
@@ -938,10 +939,11 @@
if (cost_sgr < cost_none) rsc->sgrproj = rusi->sgrproj;
}
-void acc_stat_one_line(const uint8_t *dgd, const uint8_t *src, int dgd_stride,
- int h_start, int h_end, uint8_t avg,
- const int wiener_halfwin, const int wiener_win2,
- int32_t *M_int32, int32_t *H_int32, int count) {
+static void acc_stat_one_line(const uint8_t *dgd, const uint8_t *src,
+ int dgd_stride, int h_start, int h_end,
+ uint8_t avg, const int wiener_halfwin,
+ const int wiener_win2, int32_t *M_int32,
+ int32_t *H_int32, int count) {
int j, k, l;
int16_t Y[WIENER_WIN2];
@@ -969,9 +971,12 @@
}
void av1_compute_stats_c(int wiener_win, const uint8_t *dgd, const uint8_t *src,
- int h_start, int h_end, int v_start, int v_end,
- int dgd_stride, int src_stride, int64_t *M, int64_t *H,
+ int16_t *dgd_avg, int16_t *src_avg, int h_start,
+ int h_end, int v_start, int v_end, int dgd_stride,
+ int src_stride, int64_t *M, int64_t *H,
int use_downsampled_wiener_stats) {
+ (void)dgd_avg;
+ (void)src_avg;
int i, k, l;
const int wiener_win2 = wiener_win * wiener_win;
const int wiener_halfwin = (wiener_win >> 1);
@@ -1096,15 +1101,41 @@
b[i - 1] = c;
}
}
+
+ // b/278065963: The multiplies
+ // c / 256 * A[k * stride + j] / cd * 256
+ // and
+ // c / 256 * b[k] / cd * 256
+ // within Gaussian elimination can cause a signed integer overflow. Rework
+ // the multiplies so that larger scaling is used without significantly
+ // impacting the overall precision.
+ //
+ // Precision guidance:
+ // scale_threshold: Pick as high as possible.
+ // For max_abs_akj >= scale_threshold scenario:
+ // scaler_A: Pick as low as possible. Needed for A[(i + 1) * stride + j].
+ // scaler_c: Pick as low as possible while maintaining scaler_c >=
+ // (1 << 7). Needed for A[(i + 1) * stride + j] and b[i + 1].
+ int64_t max_abs_akj = 0;
+ for (int j = 0; j < n; j++) {
+ const int64_t abs_akj = llabs(A[k * stride + j]);
+ if (abs_akj > max_abs_akj) max_abs_akj = abs_akj;
+ }
+ const int scale_threshold = 1 << 22;
+ const int scaler_A = max_abs_akj < scale_threshold ? 1 : (1 << 5);
+ const int scaler_c = max_abs_akj < scale_threshold ? 1 : (1 << 7);
+ const int scaler = scaler_c * scaler_A;
+
// Forward elimination (convert A to row-echelon form)
for (int i = k; i < n - 1; i++) {
if (A[k * stride + k] == 0) return 0;
- const int64_t c = A[(i + 1) * stride + k];
+ const int64_t c = A[(i + 1) * stride + k] / scaler_c;
const int64_t cd = A[k * stride + k];
for (int j = 0; j < n; j++) {
- A[(i + 1) * stride + j] -= c / 256 * A[k * stride + j] / cd * 256;
+ A[(i + 1) * stride + j] -=
+ A[k * stride + j] / scaler_A * c / cd * scaler;
}
- b[i + 1] -= c / 256 * b[k] / cd * 256;
+ b[i + 1] -= c * b[k] / cd * scaler_c;
}
}
// Back-substitution
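
The reordering works because each operand is divided down before the large multiply, extending the overflow-safe range while the final multiply by scaler restores the magnitude. A standalone sketch using the patch's large-magnitude scalers, with illustrative values chosen for the example (not taken from the bug report):

    #include <inttypes.h>
    #include <stdio.h>

    int main(void) {
      // Illustrative magnitudes only: with both operands near 2^36, the old
      // form (c / 256) * a needs ~2^64 and overflows int64_t, while dividing
      // each operand down first keeps the intermediate near 2^60 and
      // restores the scale at the end.
      const int64_t a = (int64_t)1 << 36;   // stands in for A[k * stride + j]
      const int64_t c = (int64_t)1 << 36;   // stands in for A[(i + 1) * stride + k]
      const int64_t cd = (int64_t)1 << 30;  // pivot A[k * stride + k]
      const int scaler_A = 1 << 5, scaler_c = 1 << 7;
      const int scaler = scaler_A * scaler_c;
      const int64_t term = a / scaler_A * (c / scaler_c) / cd * scaler;
      printf("term = %" PRId64 "\n", term);  // 2^42, exact here: a * c / cd
      return 0;
    }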
@@ -1137,16 +1168,28 @@
A[jj] += Mc[i][j] * b[i] / WIENER_TAP_SCALE_FACTOR;
}
}
+
+ // b/274668506: This is the dual branch for the issue in b/272139363. The fix
+ // is similar. See comments in update_b_sep_sym() below.
+ int32_t max_b_l = 0;
+ for (int l = 0; l < wiener_win; ++l) {
+ const int32_t abs_b_l = abs(b[l]);
+ if (abs_b_l > max_b_l) max_b_l = abs_b_l;
+ }
+ const int scale_threshold = 128 * WIENER_TAP_SCALE_FACTOR;
+ const int scaler = max_b_l < scale_threshold ? 1 : 4;
+
for (i = 0; i < wiener_win; i++) {
for (j = 0; j < wiener_win; j++) {
int k, l;
for (k = 0; k < wiener_win; ++k) {
+ const int kk = wrap_index(k, wiener_win);
for (l = 0; l < wiener_win; ++l) {
- const int kk = wrap_index(k, wiener_win);
const int ll = wrap_index(l, wiener_win);
B[ll * wiener_halfwin1 + kk] +=
Hc[j * wiener_win + i][k * wiener_win2 + l] * b[i] /
- WIENER_TAP_SCALE_FACTOR * b[j] / WIENER_TAP_SCALE_FACTOR;
+ (scaler * WIENER_TAP_SCALE_FACTOR) * b[j] /
+ (WIENER_TAP_SCALE_FACTOR / scaler);
}
}
}
@@ -1197,16 +1240,43 @@
}
}
+ // b/272139363: The computation,
+ // Hc[i * wiener_win + j][k * wiener_win2 + l] * a[k] /
+ // WIENER_TAP_SCALE_FACTOR * a[l] / WIENER_TAP_SCALE_FACTOR;
+ // may generate a signed-integer-overflow. Conditionally scale the terms to
+ // avoid a potential overflow.
+ //
+ // Hc contains accumulated correlation statistics and it is desired to leave
+ // as much room as possible for Hc. It was experimentally observed that the
+ // primary issue manifests itself with the second, a[l], multiply. For
+ // max_a_l < WIENER_TAP_SCALE_FACTOR the first multiply with a[k] should not
+ // increase dynamic range and the second multiply should hence be safe.
+ // Thereafter a safe scale_threshold depends on the actual operational range
+ // of Hc. The largest scale_threshold is expected to depend on bit-depth
+ // (av1_compute_stats_highbd_c() scales highbd to 8-bit) and maximum
+ // restoration-unit size (256), leading up to 32-bit positive numbers in Hc.
+ // Noting that the caller, wiener_decompose_sep_sym(), initializes a[...]
+ // to a range smaller than 16 bits, the scale_threshold is set as below for
+ // convenience.
+ int32_t max_a_l = 0;
+ for (int l = 0; l < wiener_win; ++l) {
+ const int32_t abs_a_l = abs(a[l]);
+ if (abs_a_l > max_a_l) max_a_l = abs_a_l;
+ }
+ const int scale_threshold = 128 * WIENER_TAP_SCALE_FACTOR;
+ const int scaler = max_a_l < scale_threshold ? 1 : 4;
+
for (i = 0; i < wiener_win; i++) {
+ const int ii = wrap_index(i, wiener_win);
for (j = 0; j < wiener_win; j++) {
- const int ii = wrap_index(i, wiener_win);
const int jj = wrap_index(j, wiener_win);
int k, l;
for (k = 0; k < wiener_win; ++k) {
for (l = 0; l < wiener_win; ++l) {
B[jj * wiener_halfwin1 + ii] +=
Hc[i * wiener_win + j][k * wiener_win2 + l] * a[k] /
- WIENER_TAP_SCALE_FACTOR * a[l] / WIENER_TAP_SCALE_FACTOR;
+ (scaler * WIENER_TAP_SCALE_FACTOR) * a[l] /
+ (WIENER_TAP_SCALE_FACTOR / scaler);
}
}
}
@@ -1385,7 +1455,6 @@
return bits;
}
-#define USE_WIENER_REFINEMENT_SEARCH 1
static int64_t finer_tile_search_wiener(const RestSearchCtxt *rsc,
const RestorationTileLimits *limits,
const PixelRect *tile,
@@ -1393,7 +1462,10 @@
int wiener_win) {
const int plane_off = (WIENER_WIN - wiener_win) >> 1;
int64_t err = try_restoration_unit(rsc, limits, tile, rui);
-#if USE_WIENER_REFINEMENT_SEARCH
+
+ if (rsc->lpf_sf->disable_wiener_coeff_refine_search) return err;
+
+ // Refinement search around the wiener filter coefficients.
int64_t err2;
int tap_min[] = { WIENER_FILT_TAP0_MINV, WIENER_FILT_TAP1_MINV,
WIENER_FILT_TAP2_MINV };
@@ -1489,7 +1561,6 @@
}
}
// printf("err post = %"PRId64"\n", err);
-#endif // USE_WIENER_REFINEMENT_SEARCH
return err;
}
@@ -1549,21 +1620,24 @@
const AV1_COMMON *const cm = rsc->cm;
if (cm->seq_params->use_highbitdepth) {
// TODO(any) : Add support for use_downsampled_wiener_stats SF in HBD
- // functions
+ // functions. Optimize the intrinsics of the HBD design similarly to LBD
+ // (i.e., pre-calculate the d and s buffers and avoid most of the C
+ // operations).
av1_compute_stats_highbd(reduced_wiener_win, rsc->dgd_buffer,
rsc->src_buffer, limits->h_start, limits->h_end,
limits->v_start, limits->v_end, rsc->dgd_stride,
rsc->src_stride, M, H, cm->seq_params->bit_depth);
} else {
av1_compute_stats(reduced_wiener_win, rsc->dgd_buffer, rsc->src_buffer,
- limits->h_start, limits->h_end, limits->v_start,
- limits->v_end, rsc->dgd_stride, rsc->src_stride, M, H,
+ rsc->dgd_avg, rsc->src_avg, limits->h_start,
+ limits->h_end, limits->v_start, limits->v_end,
+ rsc->dgd_stride, rsc->src_stride, M, H,
rsc->lpf_sf->use_downsampled_wiener_stats);
}
#else
av1_compute_stats(reduced_wiener_win, rsc->dgd_buffer, rsc->src_buffer,
- limits->h_start, limits->h_end, limits->v_start,
- limits->v_end, rsc->dgd_stride, rsc->src_stride, M, H,
+ rsc->dgd_avg, rsc->src_avg, limits->h_start, limits->h_end,
+ limits->v_start, limits->v_end, rsc->dgd_stride,
+ rsc->src_stride, M, H,
rsc->lpf_sf->use_downsampled_wiener_stats);
#endif
@@ -1741,6 +1815,24 @@
return rsi->units_per_tile;
}
+static INLINE void av1_derive_flags_for_lr_processing(
+ const LOOP_FILTER_SPEED_FEATURES *lpf_sf, bool *disable_lr_filter) {
+ const bool is_wiener_disabled = lpf_sf->disable_wiener_filter;
+ const bool is_sgr_disabled = lpf_sf->disable_sgr_filter;
+
+ // Keep the RESTORE_NONE type enabled if either Wiener or Self-guided is
+ // enabled; disable it only when both are disabled.
+ disable_lr_filter[RESTORE_NONE] = (is_wiener_disabled && is_sgr_disabled);
+
+ disable_lr_filter[RESTORE_WIENER] = is_wiener_disabled;
+ disable_lr_filter[RESTORE_SGRPROJ] = is_sgr_disabled;
+
+ // Enable the switchable loop restoration type only if both Wiener and
+ // Self-guided are enabled.
+ disable_lr_filter[RESTORE_SWITCHABLE] =
+ (is_wiener_disabled || is_sgr_disabled);
+}
+
void av1_pick_filter_restoration(const YV12_BUFFER_CONFIG *src, AV1_COMP *cpi) {
AV1_COMMON *const cm = &cpi->common;
MACROBLOCK *const x = &cpi->td.mb;
@@ -1780,11 +1872,50 @@
"Failed to allocate trial restored frame buffer");
RestSearchCtxt rsc;
+
+ // The buffers 'src_avg' and 'dgd_avg' are used to compute the H and M
+ // buffers. They are needed only by the AVX2 SIMD path, so they are
+ // allocated only when the AVX2 variant of av1_compute_stats() is enabled.
+ // The required buffer size is derived from the maximum LRU width and height
+ // allowed for Wiener filtering (1.5 times RESTORATION_UNITSIZE_MAX, per
+ // foreach_rest_unit_in_tile()), with the width and height aligned to a
+ // multiple of 16 for the intrinsics.
+ rsc.dgd_avg = NULL;
+ rsc.src_avg = NULL;
+#if HAVE_AVX2
+ // The buffers below are used only during Wiener filter processing in the
+ // low bitdepth path, so allocate them only when the Wiener filter is
+ // enabled for that path.
+ if (!cpi->sf.lpf_sf.disable_wiener_filter &&
+ !cm->seq_params->use_highbitdepth) {
+ const int buf_size = sizeof(*rsc.dgd_avg) * 6 * RESTORATION_UNITSIZE_MAX *
+ RESTORATION_UNITSIZE_MAX;
+ CHECK_MEM_ERROR(cm, rsc.dgd_avg, (int16_t *)aom_memalign(32, buf_size));
+
+ // When LRU width isn't multiple of 16, the 256 bits load instruction used
+ // in AVX2 intrinsic can read data beyond valid LRU. Hence, in order to
+ // silence Valgrind warning this buffer is initialized with zero. Overhead
+ // due to this initialization is negligible since it is done at frame level.
+ memset(rsc.dgd_avg, 0, buf_size);
+ rsc.src_avg =
+ rsc.dgd_avg + 3 * RESTORATION_UNITSIZE_MAX * RESTORATION_UNITSIZE_MAX;
+ // Asserts the starting address of src_avg is always 32-bytes aligned.
+ assert(!((intptr_t)rsc.src_avg % 32));
+ }
+#endif
+
const int plane_start = AOM_PLANE_Y;
const int plane_end = num_planes > 1 ? AOM_PLANE_V : AOM_PLANE_Y;
+
+ // Derive the flags to enable/disable Loop restoration filters based on the
+ // speed features 'disable_wiener_filter' and 'disable_sgr_filter'.
+ bool disable_lr_filter[RESTORE_TYPES] = { false };
+ const LOOP_FILTER_SPEED_FEATURES *lpf_sf = &cpi->sf.lpf_sf;
+ av1_derive_flags_for_lr_processing(lpf_sf, disable_lr_filter);
+
for (int plane = plane_start; plane <= plane_end; ++plane) {
- init_rsc(src, &cpi->common, x, &cpi->sf.lpf_sf, plane, rusi,
- &cpi->trial_frame_rst, &rsc);
+ init_rsc(src, &cpi->common, x, lpf_sf, plane, rusi, &cpi->trial_frame_rst,
+ &rsc);
const int plane_ntiles = ntiles[plane > 0];
const RestorationType num_rtypes =
@@ -1794,16 +1925,16 @@
RestorationType best_rtype = RESTORE_NONE;
const int highbd = rsc.cm->seq_params->use_highbitdepth;
- if ((plane && !cpi->sf.lpf_sf.disable_loop_restoration_chroma) ||
- (!plane && !cpi->sf.lpf_sf.disable_loop_restoration_luma)) {
+ if ((plane && !lpf_sf->disable_loop_restoration_chroma) ||
+ (!plane && !lpf_sf->disable_loop_restoration_luma)) {
av1_extend_frame(rsc.dgd_buffer, rsc.plane_width, rsc.plane_height,
rsc.dgd_stride, RESTORATION_BORDER, RESTORATION_BORDER,
highbd);
for (RestorationType r = 0; r < num_rtypes; ++r) {
- if ((force_restore_type != RESTORE_TYPES) && (r != RESTORE_NONE) &&
- (r != force_restore_type))
- continue;
+ // Disable Loop restoration filter based on the flags set using speed
+ // feature 'disable_wiener_filter' and 'disable_sgr_filter'.
+ if (disable_lr_filter[r]) continue;
double cost = search_rest_type(&rsc, r);
@@ -1815,15 +1946,17 @@
}
cm->rst_info[plane].frame_restoration_type = best_rtype;
- if (force_restore_type != RESTORE_TYPES)
- assert(best_rtype == force_restore_type || best_rtype == RESTORE_NONE);
-
if (best_rtype != RESTORE_NONE) {
for (int u = 0; u < plane_ntiles; ++u) {
copy_unit_info(best_rtype, &rusi[u], &cm->rst_info[plane].unit_info[u]);
}
}
}
-
+#if HAVE_AVX2
+ if (!cpi->sf.lpf_sf.disable_wiener_filter &&
+ !cm->seq_params->use_highbitdepth) {
+ aom_free(rsc.dgd_avg);
+ }
+#endif
aom_free(rusi);
}
diff --git a/av1/encoder/random.h b/av1/encoder/random.h
index 0bca391..efe909b 100644
--- a/av1/encoder/random.h
+++ b/av1/encoder/random.h
@@ -12,14 +12,70 @@
#ifndef AOM_AV1_ENCODER_RANDOM_H_
#define AOM_AV1_ENCODER_RANDOM_H_
+#include <assert.h>
+#include <stdint.h>
+
#ifdef __cplusplus
extern "C" {
#endif
+// Advance the generator to its next state, and generate the next 32-bit output.
+// Note that the low bits of this output are comparatively low-quality, so users
+// of this function should ensure that the high bits factor through to their
+// outputs.
+static INLINE uint32_t lcg_next(uint32_t *state) {
+ *state = (uint32_t)(*state * 1103515245ULL + 12345);
+ return *state;
+}
+
// Generate a random number in the range [0, 32768).
-static INLINE unsigned int lcg_rand16(unsigned int *state) {
- *state = (unsigned int)(*state * 1103515245ULL + 12345);
- return *state / 65536 % 32768;
+static INLINE uint32_t lcg_rand16(uint32_t *state) {
+ return (lcg_next(state) / 65536) % 32768;
+}
+
+// Generate a random number in the range [0, n)
+// This is implemented as (rand() * n) / <range of RNG> rather than
+// rand() % n, for a few reasons: this implementation is faster and less
+// biased, and if n is a power of 2, it uses the higher-quality top bits from
+// the RNG output rather than the lower-quality bottom bits.
+static INLINE uint32_t lcg_randint(uint32_t *state, uint32_t n) {
+ uint64_t v = ((uint64_t)lcg_next(state) * n) >> 32;
+ return (uint32_t)v;
+}
+
+// Generate a random number in the range [lo, hi)
+static INLINE uint32_t lcg_randrange(uint32_t *state, uint32_t lo,
+ uint32_t hi) {
+ assert(lo < hi);
+ return lo + lcg_randint(state, hi - lo);
+}
+
+// Pick k distinct numbers from the set {0, ..., n-1}
+// All possible sets of k numbers, and all possible orderings of those numbers,
+// are equally likely.
+//
+// Note: The algorithm here uses resampling to avoid choosing repeated
+// values. This works well as long as n >> k, but can potentially lead to many
+// resampling attempts if n is equal to or only slightly larger than k.
+static INLINE void lcg_pick(int n, int k, int *out, unsigned int *seed) {
+ assert(0 <= k && k <= n);
+ for (int i = 0; i < k; i++) {
+ int v;
+
+ // Inner resampling loop
+ // We have to use a goto here because C does not have a multi-level continue
+ // statement
+ resample:
+ v = (int)lcg_randint(seed, n);
+ for (int j = 0; j < i; j++) {
+ if (v == out[j]) {
+ // Repeated v, resample
+ goto resample;
+ }
+ }
+
+ // New v, accept
+ out[i] = v;
+ }
}
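
A minimal usage sketch for the new helpers, assuming the aom build environment (the INLINE macro and include path) is available:

    #include <stdio.h>

    #include "av1/encoder/random.h"  // assumed include path for the header above

    int main(void) {
      uint32_t state = 0x02f6e2b1;  // arbitrary seed
      // Uniform in [1, 7), i.e. a die roll; lcg_randint() scales the 32-bit
      // output by n and keeps the top 32 bits, so the higher-quality high
      // bits of the LCG dominate the result.
      const uint32_t die = lcg_randrange(&state, 1, 7);
      // 4 distinct values from {0, ..., 15}; repeats are rejected by resampling.
      int picks[4];
      unsigned int seed = 1;
      lcg_pick(16, 4, picks, &seed);
      printf("die=%u picks=%d,%d,%d,%d\n", (unsigned)die, picks[0], picks[1],
             picks[2], picks[3]);
      return 0;
    }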
#ifdef __cplusplus
diff --git a/av1/encoder/ratectrl.c b/av1/encoder/ratectrl.c
index 9518480..fdf1495 100644
--- a/av1/encoder/ratectrl.c
+++ b/av1/encoder/ratectrl.c
@@ -174,35 +174,31 @@
return enumerator;
}
+static int get_init_ratio(double sse) { return (int)(300000 / sse); }
+
int av1_rc_bits_per_mb(const AV1_COMP *cpi, FRAME_TYPE frame_type, int qindex,
double correction_factor, int accurate_estimate) {
const AV1_COMMON *const cm = &cpi->common;
const int is_screen_content_type = cpi->is_screen_content_type;
const aom_bit_depth_t bit_depth = cm->seq_params->bit_depth;
const double q = av1_convert_qindex_to_q(qindex, bit_depth);
+ int enumerator = av1_get_bpmb_enumerator(frame_type, is_screen_content_type);
- const int min_dim = AOMMIN(cm->width, cm->height);
+ assert(correction_factor <= MAX_BPB_FACTOR &&
+ correction_factor >= MIN_BPB_FACTOR);
if (frame_type != KEY_FRAME && accurate_estimate) {
assert(cpi->rec_sse != UINT64_MAX);
const int mbs = cm->mi_params.MBs;
- const int res = (min_dim < 480) ? 0 : ((min_dim < 720) ? 1 : 2);
- const double sse_over_q2 = (double)(cpi->rec_sse << BPER_MB_NORMBITS) /
- ((double)q * q) / (double)mbs;
- const double coef[3][2] = {
- { 0.535, 3000.0 }, // < 480
- { 0.590, 3000.0 }, // < 720
- { 0.485, 1000.0 } // 720
- };
- int bits = (int)(coef[res][0] * sse_over_q2 + coef[res][1]);
- return (int)(bits * correction_factor);
+ const double sse_sqrt =
+ (double)((int)sqrt((double)(cpi->rec_sse)) << BPER_MB_NORMBITS) /
+ (double)mbs;
+ const int ratio = (cpi->rc.bit_est_ratio == 0) ? get_init_ratio(sse_sqrt)
+ : cpi->rc.bit_est_ratio;
+ // Clamp the enumerator to lower the q fluctuations.
+ enumerator = AOMMIN(AOMMAX((int)(ratio * sse_sqrt), 20000), 170000);
}
- const int enumerator =
- av1_get_bpmb_enumerator(frame_type, is_screen_content_type);
- assert(correction_factor <= MAX_BPB_FACTOR &&
- correction_factor >= MIN_BPB_FACTOR);
-
// q based adjustment to baseline enumerator
return (int)(enumerator * correction_factor / q);
}
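
The accurate-estimate path replaces the piecewise sse_over_q2 fit with bits/MB proportional to ratio * sqrt(SSE) / q, where ratio is a running estimate (see the bit_est_ratio update later in this file). A worked sketch with assumed numbers, taking BPER_MB_NORMBITS as 9:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
      // Assumed values for illustration.
      const double rec_sse = 4.0e6;  // reconstruction SSE of the last frame
      const int mbs = 3600;          // 1280x720 in 16x16 macroblocks
      const double q = 40.0;         // quantizer step from qindex
      const int ratio = 150;         // stands in for rc.bit_est_ratio
      const double correction_factor = 1.0;
      const double sse_sqrt = (double)((int)sqrt(rec_sse) << 9) / (double)mbs;
      int enumerator = (int)(ratio * sse_sqrt);
      // Clamp as in the patch to damp q fluctuations.
      if (enumerator < 20000) enumerator = 20000;
      if (enumerator > 170000) enumerator = 170000;
      printf("bits/mb ~= %d\n", (int)(enumerator * correction_factor / q));
      return 0;
    }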
@@ -262,7 +258,8 @@
// Update the buffer level for higher temporal layers, given the encoded current
// temporal layer.
-static void update_layer_buffer_level(SVC *svc, int encoded_frame_size) {
+static void update_layer_buffer_level(SVC *svc, int encoded_frame_size,
+ bool is_screen) {
const int current_temporal_layer = svc->temporal_layer_id;
for (int i = current_temporal_layer + 1; i < svc->number_temporal_layers;
++i) {
@@ -276,6 +273,15 @@
lp_rc->bits_off_target =
AOMMIN(lp_rc->bits_off_target, lp_rc->maximum_buffer_size);
lp_rc->buffer_level = lp_rc->bits_off_target;
+
+ // For screen-content mode: don't let the buffer level go below a threshold,
+ // given here as -rc->maximum_buffer_size, to allow the buffer to come back
+ // up sooner after a slide change with big overshoot.
+ if (is_screen) {
+ lp_rc->bits_off_target =
+ AOMMAX(lp_rc->bits_off_target, -lp_rc->maximum_buffer_size);
+ lp_rc->buffer_level = lp_rc->bits_off_target;
+ }
}
}
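
In leaky-bucket terms, a slide change can overshoot by several times the bucket capacity; without a floor, the layer would then need that many extra frames of undershoot before its buffer level recovers. A sketch with assumed numbers:

    #include <stdio.h>

    int main(void) {
      // Assumed values: a 2 Mbit bucket and a slide-change frame that leaves
      // the layer 9 Mbits in debt.
      long long bits_off_target = -9000000;
      const long long maximum_buffer_size = 2000000;
      if (bits_off_target < -maximum_buffer_size)
        bits_off_target = -maximum_buffer_size;  // screen-content floor
      printf("clamped bits_off_target = %lld\n", bits_off_target);  // -2000000
      return 0;
    }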
// Update the buffer level: leaky bucket model.
@@ -302,7 +308,8 @@
p_rc->buffer_level = p_rc->bits_off_target;
if (cpi->ppi->use_svc)
- update_layer_buffer_level(&cpi->svc, encoded_frame_size);
+ update_layer_buffer_level(&cpi->svc, encoded_frame_size,
+ cpi->oxcf.tune_cfg.content == AOM_CONTENT_SCREEN);
#if CONFIG_FPMT_TEST
/* The variable temp_buffer_level is introduced for quality
@@ -430,6 +437,7 @@
rc->resize_count = 0;
rc->rtc_external_ratectrl = 0;
rc->frame_level_fast_extra_bits = 0;
+ rc->use_external_qp_one_pass = 0;
}
int av1_rc_drop_frame(AV1_COMP *cpi) {
@@ -483,14 +491,38 @@
const RATE_CONTROL *const rc = &cpi->rc;
const PRIMARY_RATE_CONTROL *const p_rc = &cpi->ppi->p_rc;
const AV1_COMMON *const cm = &cpi->common;
+ const SVC *const svc = &cpi->svc;
const RefreshFrameInfo *const refresh_frame = &cpi->refresh_frame;
- const int max_delta_down = (cpi->oxcf.tune_cfg.content == AOM_CONTENT_SCREEN)
- ? AOMMIN(8, AOMMAX(1, rc->q_1_frame / 16))
- : AOMMIN(16, AOMMAX(1, rc->q_1_frame / 8));
- const int max_delta_up = 20;
+ int max_delta_down;
+ int max_delta_up = 20;
const int change_avg_frame_bandwidth =
abs(rc->avg_frame_bandwidth - rc->prev_avg_frame_bandwidth) >
0.1 * (rc->avg_frame_bandwidth);
+
+ // Set the maximum adjustment down for Q for this frame.
+ if (cpi->oxcf.q_cfg.aq_mode == CYCLIC_REFRESH_AQ &&
+ cpi->cyclic_refresh->apply_cyclic_refresh) {
+ // For static screen type content limit the Q drop till the start of the
+ // next refresh cycle.
+ if (cpi->is_screen_content_type &&
+ (cpi->cyclic_refresh->sb_index > cpi->cyclic_refresh->last_sb_index)) {
+ max_delta_down = AOMMIN(8, AOMMAX(1, rc->q_1_frame / 32));
+ } else {
+ max_delta_down = AOMMIN(16, AOMMAX(1, rc->q_1_frame / 8));
+ }
+ if (!cpi->ppi->use_svc && cpi->is_screen_content_type) {
+ // Link max_delta_up to max_delta_down and buffer status.
+ if (p_rc->buffer_level > p_rc->optimal_buffer_level) {
+ max_delta_up = AOMMAX(4, max_delta_down);
+ } else {
+ max_delta_up = AOMMAX(8, max_delta_down);
+ }
+ }
+ } else {
+ max_delta_down = (cpi->is_screen_content_type)
+ ? AOMMIN(8, AOMMAX(1, rc->q_1_frame / 16))
+ : AOMMIN(16, AOMMAX(1, rc->q_1_frame / 8));
+ }
// If resolution changes or avg_frame_bandwidth significantly changed,
// then set this flag to indicate change in target bits per macroblock.
const int change_target_bits_mb =
@@ -498,13 +530,20 @@
(width != cm->prev_frame->width || height != cm->prev_frame->height ||
change_avg_frame_bandwidth);
// Apply some control/clamp to QP under certain conditions.
- if (cm->current_frame.frame_type != KEY_FRAME && !cpi->ppi->use_svc &&
- rc->frames_since_key > 1 && !change_target_bits_mb &&
+ // Delay the use of the clamping for svc until after num_temporal_layers,
+ // to make sure the Qs have been set for each temporal layer.
+ if (!frame_is_intra_only(cm) && rc->frames_since_key > 1 &&
+ (!cpi->ppi->use_svc ||
+ svc->current_superframe > (unsigned int)svc->number_temporal_layers) &&
+ !change_target_bits_mb && !cpi->rc.rtc_external_ratectrl &&
(!cpi->oxcf.rc_cfg.gf_cbr_boost_pct ||
!(refresh_frame->alt_ref_frame || refresh_frame->golden_frame))) {
- // Make sure q is between oscillating Qs to prevent resonance.
+ // If in the previous two frames we have seen both overshoot and undershoot
+ // clamp Q between the two. Check for rc->q_1/2_frame > 0 in case they have
+ // not been set due to dropped frames.
if (rc->rc_1_frame * rc->rc_2_frame == -1 &&
- rc->q_1_frame != rc->q_2_frame) {
+ rc->q_1_frame != rc->q_2_frame && rc->q_1_frame > 0 &&
+ rc->q_2_frame > 0) {
int qclamp = clamp(q, AOMMIN(rc->q_1_frame, rc->q_2_frame),
AOMMAX(rc->q_1_frame, rc->q_2_frame));
// If the previous frame had overshoot and the current q needs to
@@ -518,7 +557,7 @@
// Adjust Q base on source content change from scene detection.
if (cpi->sf.rt_sf.check_scene_detection && rc->prev_avg_source_sad > 0 &&
rc->frames_since_key > 10 && rc->frame_source_sad > 0 &&
- !cpi->ppi->use_svc) {
+ !cpi->rc.rtc_external_ratectrl) {
const int bit_depth = cm->seq_params->bit_depth;
double delta =
(double)rc->avg_source_sad / (double)rc->prev_avg_source_sad - 1.0;
@@ -542,15 +581,42 @@
// Limit the decrease in Q from previous frame.
if (rc->q_1_frame - q > max_delta_down) q = rc->q_1_frame - max_delta_down;
// Limit the increase in Q from previous frame.
- else if (q - rc->q_1_frame > max_delta_up &&
- cpi->oxcf.tune_cfg.content != AOM_CONTENT_SCREEN)
+ else if (q - rc->q_1_frame > max_delta_up)
q = rc->q_1_frame + max_delta_up;
}
- // For single spatial layer: if resolution has increased push q closer
+ // Adjustment for temporal layers.
+ if (svc->number_temporal_layers > 1 && svc->spatial_layer_id == 0 &&
+ !change_target_bits_mb && !cpi->rc.rtc_external_ratectrl &&
+ cpi->oxcf.resize_cfg.resize_mode != RESIZE_DYNAMIC) {
+ if (svc->temporal_layer_id > 0) {
+ // Constrain enhancement relative to the previous base TL0.
+ // Get base temporal layer TL0.
+ const int layer = LAYER_IDS_TO_IDX(0, 0, svc->number_temporal_layers);
+ LAYER_CONTEXT *lc = &svc->layer_context[layer];
+ // lc->rc.avg_frame_bandwidth and lc->p_rc.last_q correspond to the
+ // last TL0 frame.
+ if (rc->avg_frame_bandwidth < lc->rc.avg_frame_bandwidth &&
+ q < lc->p_rc.last_q[INTER_FRAME] - 4)
+ q = lc->p_rc.last_q[INTER_FRAME] - 4;
+ } else if (cpi->svc.temporal_layer_id == 0 &&
+ p_rc->buffer_level > (p_rc->optimal_buffer_level >> 2) &&
+ rc->frame_source_sad < 100000) {
+ // Push base TL0 Q down if buffer is stable and frame_source_sad
+ // is below threshold.
+ int delta = (svc->number_temporal_layers == 2) ? 4 : 10;
+ q = q - delta;
+ }
+ }
+ // For non-svc (single layer): if resolution has increased push q closer
// to the active_worst to avoid excess overshoot.
- if (cpi->svc.number_spatial_layers <= 1 && cm->prev_frame &&
+ if (!cpi->ppi->use_svc && cm->prev_frame &&
(width * height > 1.5 * cm->prev_frame->width * cm->prev_frame->height))
q = (q + active_worst_quality) >> 1;
+ // For single layer RPS: Bias Q based on distance of closest reference.
+ if (cpi->ppi->rtc_ref.bias_recovery_frame) {
+ const int min_dist = av1_svc_get_min_ref_dist(cpi);
+ q = q - AOMMIN(min_dist, 20);
+ }
return AOMMAX(AOMMIN(q, cpi->rc.worst_quality), cpi->rc.best_quality);
}
@@ -709,7 +775,7 @@
// recorded as INTRA only key frames.
if ((cpi->oxcf.q_cfg.aq_mode == CYCLIC_REFRESH_AQ) &&
(cpi->cyclic_refresh->counter_encode_maxq_scene_change == 0) &&
- (cm->current_frame.frame_type != KEY_FRAME) && (!cpi->ppi->use_svc)) {
+ !frame_is_intra_only(cm) && !cpi->ppi->use_svc) {
cpi->rc.q_2_frame = cm->quant_params.base_qindex;
cpi->rc.q_1_frame = cm->quant_params.base_qindex;
cpi->rc.rc_2_frame = 0;
@@ -762,8 +828,7 @@
// Adjustment to delta Q and number of blocks updated in cyclic refresh
// based on over or under shoot of target in current frame.
- if (cyclic_refresh_active && (cpi->rc.this_frame_target > 0) &&
- !cpi->ppi->use_svc) {
+ if (cyclic_refresh_active && cpi->rc.this_frame_target > 0) {
CYCLIC_REFRESH *const cr = cpi->cyclic_refresh;
if (correction_factor > 1.25) {
cr->percent_refresh_adjustment =
@@ -1012,19 +1077,27 @@
int layer = LAYER_IDS_TO_IDX(0, 0, svc->number_temporal_layers);
const LAYER_CONTEXT *lc = &svc->layer_context[layer];
const PRIMARY_RATE_CONTROL *const lp_rc = &lc->p_rc;
- avg_qindex_key = lp_rc->avg_frame_qindex[KEY_FRAME];
- if (svc->temporal_layer_id == 0)
- avg_qindex_key =
- AOMMIN(lp_rc->avg_frame_qindex[KEY_FRAME], lp_rc->last_q[KEY_FRAME]);
+ avg_qindex_key =
+ AOMMIN(lp_rc->avg_frame_qindex[KEY_FRAME], lp_rc->last_q[KEY_FRAME]);
}
ambient_qp = (cm->current_frame.frame_number < num_frames_weight_key)
? AOMMIN(p_rc->avg_frame_qindex[INTER_FRAME], avg_qindex_key)
: p_rc->avg_frame_qindex[INTER_FRAME];
- active_worst_quality = AOMMIN(rc->worst_quality, ambient_qp * 5 / 4);
+ ambient_qp = AOMMIN(rc->worst_quality, ambient_qp);
+
if (p_rc->buffer_level > p_rc->optimal_buffer_level) {
// Adjust down.
- // Maximum limit for down adjustment, ~30%.
- int max_adjustment_down = active_worst_quality / 3;
+ int max_adjustment_down; // Maximum adjustment down for Q
+
+ if (cpi->oxcf.q_cfg.aq_mode == CYCLIC_REFRESH_AQ && !cpi->ppi->use_svc &&
+ (cpi->oxcf.tune_cfg.content == AOM_CONTENT_SCREEN)) {
+ active_worst_quality = AOMMIN(rc->worst_quality, ambient_qp);
+ max_adjustment_down = AOMMIN(4, active_worst_quality / 16);
+ } else {
+ active_worst_quality = AOMMIN(rc->worst_quality, ambient_qp * 5 / 4);
+ max_adjustment_down = active_worst_quality / 3;
+ }
+
if (max_adjustment_down) {
buff_lvl_step =
((p_rc->maximum_buffer_size - p_rc->optimal_buffer_level) /
@@ -1036,6 +1109,7 @@
}
} else if (p_rc->buffer_level > critical_level) {
// Adjust up from ambient Q.
+ active_worst_quality = AOMMIN(rc->worst_quality, ambient_qp);
if (critical_level) {
buff_lvl_step = (p_rc->optimal_buffer_level - critical_level);
if (buff_lvl_step) {
@@ -1043,7 +1117,7 @@
(p_rc->optimal_buffer_level - p_rc->buffer_level) /
buff_lvl_step);
}
- active_worst_quality = ambient_qp + adjustment;
+ active_worst_quality += adjustment;
}
} else {
// Set to worst_quality if buffer is below critical level.
@@ -1204,15 +1278,6 @@
q = *top_index;
}
- // Special case: we force the first few frames to use low q such that
- // these frames are encoded at a high quality, which provides good
- // references for following frames.
- if (current_frame->frame_type != KEY_FRAME && !cpi->ppi->use_svc &&
- current_frame->frame_number >= 10 && current_frame->frame_number <= 15) {
- q = AOMMIN(p_rc->last_kf_qindex + 108, AOMMAX(5, q - 9));
- q = AOMMAX(q, rc->best_quality);
- }
-
assert(*top_index <= rc->worst_quality && *top_index >= rc->best_quality);
assert(*bottom_index <= rc->worst_quality &&
*bottom_index >= rc->best_quality);
@@ -1607,9 +1672,6 @@
const int simulate_parallel_frame =
cpi->ppi->gf_group.frame_parallel_level[cpi->gf_frame_index] > 0 &&
cpi->ppi->fpmt_unit_test_cfg == PARALLEL_SIMULATION_ENCODE;
- int extend_minq_fast = simulate_parallel_frame
- ? p_rc->temp_extend_minq_fast
- : cpi->ppi->twopass.extend_minq_fast;
int extend_minq = simulate_parallel_frame ? p_rc->temp_extend_minq
: cpi->ppi->twopass.extend_minq;
int extend_maxq = simulate_parallel_frame ? p_rc->temp_extend_maxq
@@ -1623,21 +1685,18 @@
(refresh_frame->golden_frame || is_intrl_arf_boost ||
refresh_frame->alt_ref_frame))) {
#if CONFIG_FPMT_TEST
- active_best_quality -= (extend_minq + extend_minq_fast);
+ active_best_quality -= extend_minq;
active_worst_quality += (extend_maxq / 2);
#else
- active_best_quality -=
- (cpi->ppi->twopass.extend_minq + cpi->ppi->twopass.extend_minq_fast);
+ active_best_quality -= cpi->ppi->twopass.extend_minq / 4;
active_worst_quality += (cpi->ppi->twopass.extend_maxq / 2);
#endif
} else {
#if CONFIG_FPMT_TEST
- active_best_quality -= (extend_minq + extend_minq_fast) / 2;
+ active_best_quality -= extend_minq / 2;
active_worst_quality += extend_maxq;
#else
- active_best_quality -=
- (cpi->ppi->twopass.extend_minq + cpi->ppi->twopass.extend_minq_fast) /
- 2;
+ active_best_quality -= cpi->ppi->twopass.extend_minq / 4;
active_worst_quality += cpi->ppi->twopass.extend_maxq;
#endif
}
@@ -1860,6 +1919,8 @@
get_active_best_quality(cpi, active_worst_quality, cq_level, gf_index);
}
+ if (cq_level > 0) active_best_quality = AOMMAX(1, active_best_quality);
+
*top_index = active_worst_quality;
*bottom_index = active_best_quality;
@@ -2048,7 +2109,8 @@
pre_y += (pre_ystride << 6) - (sb_cols << 6);
}
assert(num_samples > 0);
- if (num_samples > 0) cpi->rec_sse = fsse;
+ // Ensure rec_sse > 0
+ if (num_samples > 0) cpi->rec_sse = fsse > 0 ? fsse : 1;
}
int av1_rc_pick_q_and_bounds(AV1_COMP *cpi, int width, int height, int gf_index,
@@ -2168,6 +2230,19 @@
// Post encode loop adjustment of Q prediction.
av1_rc_update_rate_correction_factors(cpi, 0, cm->width, cm->height);
+ // Update bit estimation ratio.
+ if (cm->current_frame.frame_type != KEY_FRAME &&
+ cpi->sf.hl_sf.accurate_bit_estimate) {
+ const double q = av1_convert_qindex_to_q(cm->quant_params.base_qindex,
+ cm->seq_params->bit_depth);
+ const int this_bit_est_ratio =
+ (int)(rc->projected_frame_size * q / sqrt((double)cpi->rec_sse));
+ cpi->rc.bit_est_ratio =
+ cpi->rc.bit_est_ratio == 0
+ ? this_bit_est_ratio
+ : (7 * cpi->rc.bit_est_ratio + this_bit_est_ratio) / 8;
+ }
+
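bit_est_ratio is tracked with a first-order IIR (alpha = 1/8): each frame moves the running value one eighth of the way toward the new observation, damping per-frame noise while still tracking content changes. A tiny sketch with assumed values:

    #include <stdio.h>

    int main(void) {
      int ratio = 160;             // running bit_est_ratio (assumed)
      const int observation = 96;  // assumed projected_size * q / sqrt(rec_sse)
      ratio = (7 * ratio + observation) / 8;
      printf("updated ratio = %d\n", ratio);  // 152: 1/8 of the gap closed
      return 0;
    }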
// Keep a record of last Q and ambient average Q.
if (current_frame->frame_type == KEY_FRAME) {
p_rc->last_q[KEY_FRAME] = qindex;
@@ -2266,6 +2341,8 @@
rc->frame_num_last_gf_refresh = current_frame->frame_number;
rc->prev_coded_width = cm->width;
rc->prev_coded_height = cm->height;
+ rc->frame_number_encoded++;
+ rc->prev_frame_is_dropped = 0;
// if (current_frame->frame_number == 1 && cm->show_frame)
/*
rc->this_frame_target =
@@ -2286,6 +2363,11 @@
cpi->rc.prev_avg_frame_bandwidth = cpi->rc.avg_frame_bandwidth;
cpi->rc.prev_coded_width = cpi->common.width;
cpi->rc.prev_coded_height = cpi->common.height;
+ cpi->rc.prev_frame_is_dropped = 1;
+ // On a scene/slide change for dropped frame: reset the avg_source_sad to 0,
+ // otherwise the avg_source_sad can get too large and subsequent frames
+ // may miss the scene/slide detection.
+ if (cpi->rc.high_source_sad) cpi->rc.avg_source_sad = 0;
}
int av1_find_qindex(double desired_q, aom_bit_depth_t bit_depth,
@@ -2754,6 +2836,9 @@
ExtRefreshFrameFlagsInfo *const ext_refresh_frame_flags =
&ext_flags->refresh_frame;
RTC_REF *const rtc_ref = &cpi->ppi->rtc_ref;
+ unsigned int frame_number = (cpi->oxcf.rc_cfg.drop_frames_water_mark)
+ ? rc->frame_number_encoded
+ : cm->current_frame.frame_number;
unsigned int lag_alt = 4;
int last_idx = 0;
int last_idx_refresh = 0;
@@ -2799,19 +2884,16 @@
ext_flags->ref_frame_flags ^= AOM_LAST2_FLAG;
const int sh = 6;
// Moving index slot for last: 0 - (sh - 1).
- if (cm->current_frame.frame_number > 1)
- last_idx = ((cm->current_frame.frame_number - 1) % sh);
+ if (frame_number > 1) last_idx = ((frame_number - 1) % sh);
// Moving index for refresh of last: one ahead for next frame.
- last_idx_refresh = (cm->current_frame.frame_number % sh);
+ last_idx_refresh = (frame_number % sh);
gld_idx = 6;
// Moving index for alt_ref, lag behind LAST by lag_alt frames.
- if (cm->current_frame.frame_number > lag_alt)
- alt_ref_idx = ((cm->current_frame.frame_number - lag_alt) % sh);
+ if (frame_number > lag_alt) alt_ref_idx = ((frame_number - lag_alt) % sh);
if (cpi->sf.rt_sf.ref_frame_comp_nonrd[1]) {
// Moving index for LAST2, lag behind LAST by 2 frames.
- if (cm->current_frame.frame_number > 2)
- last2_idx = ((cm->current_frame.frame_number - 2) % sh);
+ if (frame_number > 2) last2_idx = ((frame_number - 2) % sh);
}
rtc_ref->ref_idx[0] = last_idx; // LAST
rtc_ref->ref_idx[1] = last_idx_refresh; // LAST2 (for refresh of last).
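As a worked example of the moving index scheme, take sh = 6, lag_alt = 4, and
frame_number = 10: last_idx = (10 - 1) % 6 = 3, last_idx_refresh = 10 % 6 = 4,
and alt_ref_idx = (10 - 4) % 6 = 0, so LAST is read from slot 3, the next
frame's LAST is refreshed into slot 4, and ALTREF lags LAST by 4 frames.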
@@ -2926,6 +3008,13 @@
int light_change = 0;
// Flag to check light change or not.
const int check_light_change = 0;
+  // TODO(marpan): There seems to be some difference along the bottom border
+  // when using source_last_tl0 for last_source (used for temporal layers or
+  // when the previous frame is dropped).
+  // Remove this border parameter when the issue is resolved: the difference
+  // is that non-zero SAD exists along the bottom border even though the
+  // source is static.
+ const int border =
+ rc->prev_frame_is_dropped || cpi->svc.number_temporal_layers > 1;
// Store blkwise SAD for later use
if (width == cm->render_width && height == cm->render_height) {
if (cpi->src_sad_blk_64x64 == NULL) {
@@ -2934,7 +3023,8 @@
sizeof(*cpi->src_sad_blk_64x64)));
}
}
- for (int sbi_row = 0; sbi_row < sb_rows; ++sbi_row) {
+ // Avoid bottom and right border.
+ for (int sbi_row = 0; sbi_row < sb_rows - border; ++sbi_row) {
for (int sbi_col = 0; sbi_col < sb_cols; ++sbi_col) {
tmp_sad = cpi->ppi->fn_ptr[bsize].sdf(src_y, src_ystride, last_src_y,
last_src_ystride);
@@ -3068,19 +3158,21 @@
if (qindex <= 120 * p_rc->last_q[INTER_FRAME] / 100)
p_rc->rate_correction_factors[INTER_NORMAL] *= 1.5;
}
- // Apply the same rate control reset to all temporal layers.
- for (int tl = 0; tl < svc->number_temporal_layers; tl++) {
- LAYER_CONTEXT *lc = NULL;
- lc = &svc->layer_context[svc->spatial_layer_id *
- svc->number_temporal_layers +
- tl];
- lc->rc.resize_state = rc->resize_state;
- lc->p_rc.buffer_level = lc->p_rc.optimal_buffer_level;
- lc->p_rc.bits_off_target = lc->p_rc.optimal_buffer_level;
- lc->p_rc.rate_correction_factors[INTER_NORMAL] =
- p_rc->rate_correction_factors[INTER_NORMAL];
- lc->p_rc.avg_frame_qindex[INTER_FRAME] =
- p_rc->avg_frame_qindex[INTER_FRAME];
+ if (svc->number_temporal_layers > 1) {
+ // Apply the same rate control reset to all temporal layers.
+ for (int tl = 0; tl < svc->number_temporal_layers; tl++) {
+ LAYER_CONTEXT *lc = NULL;
+ lc = &svc->layer_context[svc->spatial_layer_id *
+ svc->number_temporal_layers +
+ tl];
+ lc->rc.resize_state = rc->resize_state;
+ lc->p_rc.buffer_level = lc->p_rc.optimal_buffer_level;
+ lc->p_rc.bits_off_target = lc->p_rc.optimal_buffer_level;
+ lc->p_rc.rate_correction_factors[INTER_NORMAL] =
+ p_rc->rate_correction_factors[INTER_NORMAL];
+ lc->p_rc.avg_frame_qindex[INTER_FRAME] =
+ p_rc->avg_frame_qindex[INTER_FRAME];
+ }
}
}
@@ -3205,6 +3297,25 @@
return 0;
}
+// Returns true if this frame is a recovery frame, for 1 layer RPS,
+// in which case some boost may be applied (QP, adjusted speed features, etc.).
+// A recovery frame here means a frame whose closest reference suddenly
+// switched from the previous frame to one much further away.
+// TODO(marpan): Consider adding on/off flag to SVC_REF_FRAME_CONFIG to
+// allow more control for applications.
+static bool set_flag_rps_bias_recovery_frame(const AV1_COMP *const cpi) {
+ if (cpi->ppi->rtc_ref.set_ref_frame_config &&
+ cpi->svc.number_temporal_layers == 1 &&
+ cpi->svc.number_spatial_layers == 1 &&
+ cpi->ppi->rtc_ref.reference_was_previous_frame) {
+ int min_dist = av1_svc_get_min_ref_dist(cpi);
+ // Only consider boost for this frame if its closest reference is further
+ // than x frames away, using x = 4 for now.
+ if (min_dist != INT_MAX && min_dist > 4) return true;
+ }
+ return false;
+}
+
void av1_get_one_pass_rt_params(AV1_COMP *cpi, FRAME_TYPE *const frame_type,
const EncodeFrameInput *frame_input,
unsigned int frame_flags) {
@@ -3219,12 +3330,11 @@
const int layer =
LAYER_IDS_TO_IDX(svc->spatial_layer_id, svc->temporal_layer_id,
svc->number_temporal_layers);
- // Turn this on to explicitly set the reference structure rather than
- // relying on internal/default structure.
if (cpi->ppi->use_svc) {
av1_update_temporal_layer_framerate(cpi);
av1_restore_layer_context(cpi);
}
+ cpi->ppi->rtc_ref.bias_recovery_frame = set_flag_rps_bias_recovery_frame(cpi);
// Set frame type.
if (set_key_frame(cpi, frame_flags)) {
*frame_type = KEY_FRAME;
@@ -3240,6 +3350,7 @@
av1_svc_reset_temporal_layers(cpi, 1);
svc->layer_context[layer].is_key_frame = 1;
}
+ rc->frame_number_encoded = 0;
} else {
*frame_type = INTER_FRAME;
gf_group->update_type[cpi->gf_frame_index] = LF_UPDATE;
diff --git a/av1/encoder/ratectrl.h b/av1/encoder/ratectrl.h
index 114778d..4fb1179 100644
--- a/av1/encoder/ratectrl.h
+++ b/av1/encoder/ratectrl.h
@@ -204,6 +204,13 @@
int decimation_factor;
int decimation_count;
+ int prev_frame_is_dropped;
+
+ /*!
+ * Frame number for encoded frames (non-dropped).
+   * Used for setting the rtc reference structure.
+ */
+ unsigned int frame_number_encoded;
/*!\endcond */
/*!
@@ -261,6 +268,15 @@
int prev_coded_width;
int prev_coded_height;
+
+ // The ratio used for inter frames in bit estimation.
+ // TODO(yunqing): if golden frame is treated differently (e.g. gf_cbr_boost_
+  // pct > THR), consider adding bit_est_ratio_g for golden frames.
+ int bit_est_ratio;
+
+ // Whether to use a fixed qp for the frame, bypassing internal rate control.
+  // This flag is reset to 0 after every frame.
+ int use_external_qp_one_pass;
/*!\endcond */
} RATE_CONTROL;
@@ -461,11 +477,6 @@
*/
int temp_extend_maxq;
- /*!
- * Temporary variable used in simulating the delayed update of
- * extend_minq_fast.
- */
- int temp_extend_minq_fast;
#endif
/*!
* Proposed minimum allowed Q different layers in a coding pyramid
diff --git a/av1/encoder/rd.h b/av1/encoder/rd.h
index b1eb154..b38d9ca 100644
--- a/av1/encoder/rd.h
+++ b/av1/encoder/rd.h
@@ -56,6 +56,16 @@
// Factor to weigh the rate for switchable interp filters.
#define SWITCHABLE_INTERP_RATE_FACTOR 1
+// Macros for common video resolutions: width x height.
+// For example, 720p represents a video resolution of 1280x720 pixels.
+// The products are parenthesized so the macros expand safely in expressions.
+#define RESOLUTION_288P (352 * 288)
+#define RESOLUTION_360P (640 * 360)
+#define RESOLUTION_480P (640 * 480)
+#define RESOLUTION_720P (1280 * 720)
+#define RESOLUTION_1080P (1920 * 1080)
+#define RESOLUTION_1440P (2560 * 1440)
+#define RESOLUTION_4K (3840 * 2160)
+
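Because these macros expand to pixel counts (width times height), the typical
use is a comparison against a frame's width * height product. A minimal usage
sketch (is_at_most_720p is a hypothetical helper, not part of the patch):

  static int is_at_most_720p(int width, int height) {
    // RESOLUTION_720P expands to (1280 * 720) pixels.
    return width * height <= RESOLUTION_720P;
  }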
#define RTC_REFS 4
static const MV_REFERENCE_FRAME real_time_ref_combos[RTC_REFS][2] = {
{ LAST_FRAME, NONE_FRAME },
diff --git a/av1/encoder/rdopt.c b/av1/encoder/rdopt.c
index c25db61..8620087 100644
--- a/av1/encoder/rdopt.c
+++ b/av1/encoder/rdopt.c
@@ -1321,6 +1321,10 @@
const int mi_row = xd->mi_row;
const int mi_col = xd->mi_col;
int mode_index_start, mode_index_end;
+ const int txfm_rd_gate_level =
+ get_txfm_rd_gate_level(cpi->sf.inter_sf.txfm_rd_gate_level, bsize,
+ TX_SEARCH_MOTION_MODE, eval_motion_mode);
+
// Modify the start and end index according to speed features. For example,
// if SIMPLE_TRANSLATION has already been searched according to
// the motion_mode_for_winner_cand speed feature, update the mode_index_start
@@ -1429,7 +1433,8 @@
// Refine MV in a small range.
av1_refine_warped_mv(xd, cm, &ms_params, bsize, pts0, pts_inref0,
- total_samples);
+ total_samples, cpi->sf.mv_sf.warp_search_method,
+ cpi->sf.mv_sf.warp_search_iters);
if (mv0.as_int != mbmi->mv[0].as_int) {
// Keep the refined MV and WM parameters.
@@ -1523,7 +1528,7 @@
if (rd_stats->rdcost < *best_est_rd) {
*best_est_rd = rd_stats->rdcost;
assert(sse_y >= 0);
- ref_skip_rd[1] = cpi->sf.inter_sf.txfm_rd_gate_level
+ ref_skip_rd[1] = txfm_rd_gate_level
? RDCOST(x->rdmult, mode_rate, (sse_y << 4))
: INT64_MAX;
}
@@ -1545,14 +1550,14 @@
// Perform full transform search
int64_t skip_rd = INT64_MAX;
int64_t skip_rdy = INT64_MAX;
- if (cpi->sf.inter_sf.txfm_rd_gate_level) {
+ if (txfm_rd_gate_level) {
// Check if the mode is good enough based on skip RD
int64_t sse_y = INT64_MAX;
int64_t curr_sse = get_sse(cpi, x, &sse_y);
skip_rd = RDCOST(x->rdmult, rd_stats->rate, curr_sse);
skip_rdy = RDCOST(x->rdmult, rd_stats->rate, (sse_y << 4));
int eval_txfm = check_txfm_eval(x, bsize, ref_skip_rd[0], skip_rd,
- cpi->sf.inter_sf.txfm_rd_gate_level, 0);
+ txfm_rd_gate_level, 0);
if (!eval_txfm) continue;
}
@@ -1635,18 +1640,22 @@
static int64_t skip_mode_rd(RD_STATS *rd_stats, const AV1_COMP *const cpi,
MACROBLOCK *const x, BLOCK_SIZE bsize,
- const BUFFER_SET *const orig_dst) {
+ const BUFFER_SET *const orig_dst, int64_t best_rd) {
assert(bsize < BLOCK_SIZES_ALL);
const AV1_COMMON *cm = &cpi->common;
const int num_planes = av1_num_planes(cm);
MACROBLOCKD *const xd = &x->e_mbd;
const int mi_row = xd->mi_row;
const int mi_col = xd->mi_col;
- av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, orig_dst, bsize, 0,
- av1_num_planes(cm) - 1);
-
int64_t total_sse = 0;
+ int64_t this_rd = INT64_MAX;
+ const int skip_mode_ctx = av1_get_skip_mode_context(xd);
+ rd_stats->rate = x->mode_costs.skip_mode_cost[skip_mode_ctx][1];
+
for (int plane = 0; plane < num_planes; ++plane) {
+ // Call av1_enc_build_inter_predictor() for one plane at a time.
+ av1_enc_build_inter_predictor(cm, xd, mi_row, mi_col, orig_dst, bsize,
+ plane, plane);
const struct macroblock_plane *const p = &x->plane[plane];
const struct macroblockd_plane *const pd = &xd->plane[plane];
const BLOCK_SIZE plane_bsize =
@@ -1658,11 +1667,14 @@
int64_t sse = aom_sum_squares_2d_i16(p->src_diff, bw, bw, bh) << 4;
sse >>= ((cpi->frame_info.bit_depth - 8) * 2);
total_sse += sse;
+      // When the current rd cost exceeds the best rd, skip evaluation of the
+      // remaining planes.
+ this_rd = RDCOST(x->rdmult, rd_stats->rate, total_sse);
+ if (this_rd > best_rd) break;
}
- const int skip_mode_ctx = av1_get_skip_mode_context(xd);
+
rd_stats->dist = rd_stats->sse = total_sse;
- rd_stats->rate = x->mode_costs.skip_mode_cost[skip_mode_ctx][1];
- rd_stats->rdcost = RDCOST(x->rdmult, rd_stats->rate, rd_stats->dist);
+ rd_stats->rdcost = this_rd;
restore_dst_buf(xd, *orig_dst, num_planes);
return 0;
@@ -1670,6 +1682,10 @@
// Check NEARESTMV, NEARMV, GLOBALMV ref mvs for duplicate and skip the relevant
// mode
+// Note(rachelbarker): This speed feature currently does not interact correctly
+// with global motion. The issue is that, when global motion is used, GLOBALMV
+// produces a different prediction to NEARESTMV/NEARMV even if the motion
+// vectors are the same. Thus GLOBALMV should not be pruned in this case.
static INLINE int check_repeat_ref_mv(const MB_MODE_INFO_EXT *mbmi_ext,
int ref_idx,
const MV_REFERENCE_FRAME *ref_frame,
@@ -1748,9 +1764,16 @@
// population
static INLINE int skip_nearest_near_mv_using_refmv_weight(
const MACROBLOCK *const x, const PREDICTION_MODE this_mode,
- const int8_t ref_frame_type) {
+ const int8_t ref_frame_type, PREDICTION_MODE best_mode) {
if (this_mode != NEARESTMV && this_mode != NEARMV) return 0;
+ // Do not skip the mode if the current block has not yet obtained a valid
+ // inter mode.
+ if (!is_inter_mode(best_mode)) return 0;
+ const MACROBLOCKD *xd = &x->e_mbd;
+  // Do not skip the mode if either the top or the left neighboring block is
+  // not available.
+ if (!xd->left_available || !xd->up_available) return 0;
const MB_MODE_INFO_EXT *const mbmi_ext = &x->mbmi_ext;
const uint16_t *const ref_mv_weight = mbmi_ext->weight[ref_frame_type];
const int ref_mv_count =
@@ -2482,15 +2505,18 @@
const int is_comp_pred = has_second_ref(mbmi);
const MV_REFERENCE_FRAME *refs = mbmi->ref_frame;
- // Check that the global mv is the same as ZEROMV
- assert(mbmi->mv[0].as_int == 0);
- assert(IMPLIES(is_comp_pred, mbmi->mv[0].as_int == 0));
- assert(xd->global_motion[refs[0]].wmtype == TRANSLATION ||
- xd->global_motion[refs[0]].wmtype == IDENTITY);
-
- // Don't prune if we have invalid data
for (int idx = 0; idx < 1 + is_comp_pred; idx++) {
- assert(mbmi->mv[0].as_int == 0);
+ if (xd->global_motion[refs[idx]].wmtype != IDENTITY) {
+ // Pruning logic only works for IDENTITY type models
+ // Note: In theory we could apply similar logic for TRANSLATION
+ // type models, but we do not code these due to a spec bug
+ // (see comments in gm_get_motion_vector() in av1/common/mv.h)
+ assert(xd->global_motion[refs[idx]].wmtype != TRANSLATION);
+ return 0;
+ }
+
+ // Don't prune if we have invalid data
+ assert(mbmi->mv[idx].as_int == 0);
if (args->best_single_sse_in_refs[refs[idx]] == INT32_MAX) {
return 0;
}
@@ -2940,7 +2966,6 @@
continue;
if (cpi->sf.gm_sf.prune_zero_mv_with_sse &&
- cpi->sf.gm_sf.gm_search_type == GM_DISABLE_SEARCH &&
(this_mode == GLOBALMV || this_mode == GLOBAL_GLOBALMV)) {
if (prune_zero_mv_with_sse(cpi->ppi->fn_ptr, x, bsize, args,
cpi->sf.gm_sf.prune_zero_mv_with_sse)) {
@@ -3165,8 +3190,10 @@
FULLPEL_MOTION_SEARCH_PARAMS fullms_params;
const search_site_config *lookahead_search_sites =
cpi->mv_search_params.search_site_cfg[SS_CFG_LOOKAHEAD];
+ const FULLPEL_MV start_mv = get_fullmv_from_mv(&dv_ref.as_mv);
av1_make_default_fullpel_ms_params(&fullms_params, cpi, x, bsize,
- &dv_ref.as_mv, lookahead_search_sites,
+ &dv_ref.as_mv, start_mv,
+ lookahead_search_sites,
/*fine_search_interval=*/0);
const IntraBCMVCosts *const dv_costs = x->dv_costs;
av1_set_ms_to_intra_mode(&fullms_params, dv_costs);
@@ -3213,7 +3240,6 @@
}
const int step_param = cpi->mv_search_params.mv_step_param;
- const FULLPEL_MV start_mv = get_fullmv_from_mv(&dv_ref.as_mv);
IntraBCHashInfo *intrabc_hash_info = &x->intrabc_hash_info;
int_mv best_mv, best_hash_mv;
@@ -3446,9 +3472,6 @@
orig_dst.stride[i] = xd->plane[i].dst.stride;
}
- // Obtain the rdcost for skip_mode.
- skip_mode_rd(&skip_mode_rd_stats, cpi, x, bsize, &orig_dst);
-
// Compare the use of skip_mode with the best intra/inter mode obtained.
const int skip_mode_ctx = av1_get_skip_mode_context(xd);
int64_t best_intra_inter_mode_cost = INT64_MAX;
@@ -3462,6 +3485,10 @@
av1_rd_cost_update(x->rdmult, rd_cost);
}
+ // Obtain the rdcost for skip_mode.
+ skip_mode_rd(&skip_mode_rd_stats, cpi, x, bsize, &orig_dst,
+ best_intra_inter_mode_cost);
+
if (skip_mode_rd_stats.rdcost <= best_intra_inter_mode_cost &&
(!xd->lossless[mbmi->segment_id] || skip_mode_rd_stats.dist == 0)) {
assert(mode_index != THR_INVALID);
@@ -5029,8 +5056,14 @@
if (sf->inter_sf.prune_nearest_near_mv_using_refmv_weight && !comp_pred) {
const int8_t ref_frame_type = av1_ref_frame_type(ref_frames);
- if (skip_nearest_near_mv_using_refmv_weight(x, this_mode, ref_frame_type))
+ if (skip_nearest_near_mv_using_refmv_weight(
+ x, this_mode, ref_frame_type,
+ args->search_state->best_mbmode.mode)) {
+ // Ensure the mode is pruned only when the current block has obtained a
+ // valid inter mode.
+ assert(is_inter_mode(args->search_state->best_mbmode.mode));
return 1;
+ }
}
if (sf->rt_sf.prune_inter_modes_with_golden_ref &&
@@ -5169,13 +5202,15 @@
RD_STATS rd_stats_uv;
const int mode_rate = inter_modes_info->mode_rate_arr[data_idx];
int64_t skip_rd = INT64_MAX;
- if (cpi->sf.inter_sf.txfm_rd_gate_level) {
+ const int txfm_rd_gate_level = get_txfm_rd_gate_level(
+ cpi->sf.inter_sf.txfm_rd_gate_level, bsize, TX_SEARCH_DEFAULT,
+ /*eval_motion_mode=*/0);
+ if (txfm_rd_gate_level) {
// Check if the mode is good enough based on skip RD
int64_t curr_sse = inter_modes_info->sse_arr[data_idx];
skip_rd = RDCOST(x->rdmult, mode_rate, curr_sse);
- int eval_txfm =
- check_txfm_eval(x, bsize, search_state->best_skip_rd[0], skip_rd,
- cpi->sf.inter_sf.txfm_rd_gate_level, 0);
+ int eval_txfm = check_txfm_eval(x, bsize, search_state->best_skip_rd[0],
+ skip_rd, txfm_rd_gate_level, 0);
if (!eval_txfm) continue;
}
@@ -5695,6 +5730,7 @@
interintra_modes,
{ { { 0 }, { { 0 } }, { 0 }, 0, 0, 0, 0 } },
{ { 0, 0 } },
+ { 0 },
0,
0,
-1,
diff --git a/av1/encoder/rdopt.h b/av1/encoder/rdopt.h
index 78a23d6..efb797e 100644
--- a/av1/encoder/rdopt.h
+++ b/av1/encoder/rdopt.h
@@ -105,7 +105,7 @@
* based on calculated modelled RD cost. Only 4 intra modes are checked as
* specified in \c intra_mode_list. When calculating RD cost Hadamard transform
 * of residual is used to calculate rate. Estimation of RD cost is performed
- * in \c estimate_block_intra which is called from this function
+ * in \c av1_estimate_block_intra which is called from this function
*
* \param[in] cpi Top-level encoder structure
* \param[in] x Pointer to structure holding all the data for
diff --git a/av1/encoder/rdopt_utils.h b/av1/encoder/rdopt_utils.h
index 91823d8..1c5b3db 100644
--- a/av1/encoder/rdopt_utils.h
+++ b/av1/encoder/rdopt_utils.h
@@ -23,6 +23,7 @@
#endif
#define MAX_REF_MV_SEARCH 3
+#define MAX_TX_RD_GATE_LEVEL 5
#define INTER_INTRA_RD_THRESH_SCALE 9
#define INTER_INTRA_RD_THRESH_SHIFT 4
@@ -352,10 +353,12 @@
// Derive aggressiveness factor for gating the transform search
// Lower value indicates more aggressiveness. Be more conservative (high
// value) for (i) low quantizers (ii) regions where prediction is poor
- const int scale[5] = { INT_MAX, 4, 3, 2, 2 };
+ const int scale[MAX_TX_RD_GATE_LEVEL + 1] = { INT_MAX, 4, 3, 2, 2, 1 };
const int qslope = 2 * (!is_luma_only);
- const int level_to_qindex_map[5] = { 0, 0, 0, 80, 100 };
+ const int level_to_qindex_map[MAX_TX_RD_GATE_LEVEL + 1] = { 0, 0, 0,
+ 80, 100, 140 };
int aggr_factor = 4;
+ assert(level <= MAX_TX_RD_GATE_LEVEL);
const int pred_qindex_thresh = level_to_qindex_map[level];
if (!is_luma_only && level <= 2) {
aggr_factor = 4 * AOMMAX(1, ROUND_POWER_OF_TWO((MAXQ - x->qindex) * qslope,
@@ -374,7 +377,9 @@
// since best_skip_rd is computed after and skip_rd is computed (with 8-bit
// prediction signals blended for WEDGE/DIFFWTD rather than 16-bit) before
// interpolation filter search
- const int luma_mul[5] = { INT_MAX, 32, 29, 17, 17 };
+ const int luma_mul[MAX_TX_RD_GATE_LEVEL + 1] = {
+ INT_MAX, 32, 29, 17, 17, 17
+ };
int mul_factor = is_luma_only ? luma_mul[level] : 16;
int64_t rd_thresh =
(best_skip_rd == INT64_MAX)
@@ -767,6 +772,18 @@
USABLE_REF_MV_STACK_SIZE * sizeof(xd->ref_mv_stack[0][0]));
}
+// Get transform rd gate level for the given transform search case.
+static INLINE int get_txfm_rd_gate_level(
+ const int txfm_rd_gate_level[TX_SEARCH_CASES], BLOCK_SIZE bsize,
+ TX_SEARCH_CASE tx_search_case, int eval_motion_mode) {
+ assert(tx_search_case < TX_SEARCH_CASES);
+ if (tx_search_case == TX_SEARCH_MOTION_MODE && !eval_motion_mode &&
+ num_pels_log2_lookup[bsize] > 8)
+ return txfm_rd_gate_level[TX_SEARCH_MOTION_MODE];
+
+ return txfm_rd_gate_level[TX_SEARCH_DEFAULT];
+}
+
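In other words, the dedicated TX_SEARCH_MOTION_MODE gate level applies only
when eval_motion_mode is false and the block has more than 256 pixels
(num_pels_log2_lookup[bsize] > 8, i.e. larger than 16x16); every other case
falls back to the TX_SEARCH_DEFAULT level.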
#ifdef __cplusplus
} // extern "C"
#endif
diff --git a/av1/encoder/reconinter_enc.c b/av1/encoder/reconinter_enc.c
index ac7dc16..83e5d4f 100644
--- a/av1/encoder/reconinter_enc.c
+++ b/av1/encoder/reconinter_enc.c
@@ -515,23 +515,23 @@
}
}
-void aom_comp_mask_upsampled_pred_c(MACROBLOCKD *xd, const AV1_COMMON *const cm,
- int mi_row, int mi_col, const MV *const mv,
- uint8_t *comp_pred, const uint8_t *pred,
- int width, int height, int subpel_x_q3,
- int subpel_y_q3, const uint8_t *ref,
- int ref_stride, const uint8_t *mask,
- int mask_stride, int invert_mask,
- int subpel_search) {
+void aom_comp_mask_upsampled_pred(MACROBLOCKD *xd, const AV1_COMMON *const cm,
+ int mi_row, int mi_col, const MV *const mv,
+ uint8_t *comp_pred, const uint8_t *pred,
+ int width, int height, int subpel_x_q3,
+ int subpel_y_q3, const uint8_t *ref,
+ int ref_stride, const uint8_t *mask,
+ int mask_stride, int invert_mask,
+ int subpel_search) {
if (subpel_x_q3 | subpel_y_q3) {
- aom_upsampled_pred_c(xd, cm, mi_row, mi_col, mv, comp_pred, width, height,
- subpel_x_q3, subpel_y_q3, ref, ref_stride,
- subpel_search);
+ aom_upsampled_pred(xd, cm, mi_row, mi_col, mv, comp_pred, width, height,
+ subpel_x_q3, subpel_y_q3, ref, ref_stride,
+ subpel_search);
ref = comp_pred;
ref_stride = width;
}
- aom_comp_mask_pred_c(comp_pred, pred, width, height, ref, ref_stride, mask,
- mask_stride, invert_mask);
+ aom_comp_mask_pred(comp_pred, pred, width, height, ref, ref_stride, mask,
+ mask_stride, invert_mask);
}
void aom_dist_wtd_comp_avg_upsampled_pred_c(
diff --git a/av1/encoder/reconinter_enc.h b/av1/encoder/reconinter_enc.h
index e187a5f..16932f3 100644
--- a/av1/encoder/reconinter_enc.h
+++ b/av1/encoder/reconinter_enc.h
@@ -24,6 +24,15 @@
extern "C" {
#endif
+void aom_comp_mask_upsampled_pred(MACROBLOCKD *xd, const AV1_COMMON *const cm,
+ int mi_row, int mi_col, const MV *const mv,
+ uint8_t *comp_pred, const uint8_t *pred,
+ int width, int height, int subpel_x_q3,
+ int subpel_y_q3, const uint8_t *ref,
+ int ref_stride, const uint8_t *mask,
+ int mask_stride, int invert_mask,
+ int subpel_search);
+
void aom_highbd_comp_mask_upsampled_pred(
MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
const MV *const mv, uint8_t *comp_pred8, const uint8_t *pred8, int width,
diff --git a/av1/encoder/saliency_map.c b/av1/encoder/saliency_map.c
new file mode 100644
index 0000000..3376846
--- /dev/null
+++ b/av1/encoder/saliency_map.c
@@ -0,0 +1,1414 @@
+/*
+ * Copyright (c) 2023, Alliance for Open Media. All rights reserved
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+#include <assert.h>
+#include <float.h>
+#include <string.h>
+
+#include "av1/encoder/encoder.h"
+#include "av1/encoder/encoder_utils.h"
+#include "av1/encoder/firstpass.h"
+#include "av1/encoder/rdopt.h"
+#include "av1/encoder/saliency_map.h"
+
+// The Gabor filter is generated by setting the parameters as:
+// ksize = 9
+// sigma = 1
+// theta = y*np.pi/4, where y \in {0, 1, 2, 3}, i.e., 0, 45, 90, 135 degrees
+// lambda1 = 1
+// gamma = 0.8
+// phi = 0
+static const double kGaborFilter[4][9][9] = { // [angle: 0, 45, 90, 135
+ // degree][ksize][ksize]
+ { { 2.0047323e-06, 6.6387620e-05, 8.0876675e-04, 3.6246411e-03, 5.9760227e-03,
+ 3.6246411e-03, 8.0876675e-04, 6.6387620e-05, 2.0047323e-06 },
+ { 1.8831115e-05, 6.2360091e-04, 7.5970138e-03, 3.4047455e-02, 5.6134764e-02,
+ 3.4047455e-02, 7.5970138e-03, 6.2360091e-04, 1.8831115e-05 },
+ { 9.3271126e-05, 3.0887155e-03, 3.7628256e-02, 1.6863814e-01, 2.7803731e-01,
+ 1.6863814e-01, 3.7628256e-02, 3.0887155e-03, 9.3271126e-05 },
+ { 2.4359586e-04, 8.0667874e-03, 9.8273583e-02, 4.4043165e-01, 7.2614902e-01,
+ 4.4043165e-01, 9.8273583e-02, 8.0667874e-03, 2.4359586e-04 },
+ { 3.3546262e-04, 1.1108996e-02, 1.3533528e-01, 6.0653067e-01, 1.0000000e+00,
+ 6.0653067e-01, 1.3533528e-01, 1.1108996e-02, 3.3546262e-04 },
+ { 2.4359586e-04, 8.0667874e-03, 9.8273583e-02, 4.4043165e-01, 7.2614902e-01,
+ 4.4043165e-01, 9.8273583e-02, 8.0667874e-03, 2.4359586e-04 },
+ { 9.3271126e-05, 3.0887155e-03, 3.7628256e-02, 1.6863814e-01, 2.7803731e-01,
+ 1.6863814e-01, 3.7628256e-02, 3.0887155e-03, 9.3271126e-05 },
+ { 1.8831115e-05, 6.2360091e-04, 7.5970138e-03, 3.4047455e-02, 5.6134764e-02,
+ 3.4047455e-02, 7.5970138e-03, 6.2360091e-04, 1.8831115e-05 },
+ { 2.0047323e-06, 6.6387620e-05, 8.0876675e-04, 3.6246411e-03, 5.9760227e-03,
+ 3.6246411e-03, 8.0876675e-04, 6.6387620e-05, 2.0047323e-06 } },
+
+ { { -6.2165498e-08, 3.8760313e-06, 3.0079011e-06, -4.4602581e-04,
+ 6.6981313e-04, 1.3962291e-03, -9.9486928e-04, -8.1631159e-05,
+ 3.5712848e-05 },
+ { 3.8760313e-06, 5.7044272e-06, -1.6041942e-03, 4.5687673e-03,
+ 1.8061366e-02, -2.4406660e-02, -3.7979286e-03, 3.1511115e-03,
+ -8.1631159e-05 },
+ { 3.0079011e-06, -1.6041942e-03, 8.6645801e-03, 6.4960226e-02,
+ -1.6647682e-01, -4.9129307e-02, 7.7304743e-02, -3.7979286e-03,
+ -9.9486928e-04 },
+ { -4.4602581e-04, 4.5687673e-03, 6.4960226e-02, -3.1572008e-01,
+ -1.7670043e-01, 5.2729243e-01, -4.9129307e-02, -2.4406660e-02,
+ 1.3962291e-03 },
+ { 6.6981313e-04, 1.8061366e-02, -1.6647682e-01, -1.7670043e-01,
+ 1.0000000e+00, -1.7670043e-01, -1.6647682e-01, 1.8061366e-02,
+ 6.6981313e-04 },
+ { 1.3962291e-03, -2.4406660e-02, -4.9129307e-02, 5.2729243e-01,
+ -1.7670043e-01, -3.1572008e-01, 6.4960226e-02, 4.5687673e-03,
+ -4.4602581e-04 },
+ { -9.9486928e-04, -3.7979286e-03, 7.7304743e-02, -4.9129307e-02,
+ -1.6647682e-01, 6.4960226e-02, 8.6645801e-03, -1.6041942e-03,
+ 3.0079011e-06 },
+ { -8.1631159e-05, 3.1511115e-03, -3.7979286e-03, -2.4406660e-02,
+ 1.8061366e-02, 4.5687673e-03, -1.6041942e-03, 5.7044272e-06,
+ 3.8760313e-06 },
+ { 3.5712848e-05, -8.1631159e-05, -9.9486928e-04, 1.3962291e-03,
+ 6.6981313e-04, -4.4602581e-04, 3.0079011e-06, 3.8760313e-06,
+ -6.2165498e-08 } },
+
+ { { 2.0047323e-06, 1.8831115e-05, 9.3271126e-05, 2.4359586e-04, 3.3546262e-04,
+ 2.4359586e-04, 9.3271126e-05, 1.8831115e-05, 2.0047323e-06 },
+ { 6.6387620e-05, 6.2360091e-04, 3.0887155e-03, 8.0667874e-03, 1.1108996e-02,
+ 8.0667874e-03, 3.0887155e-03, 6.2360091e-04, 6.6387620e-05 },
+ { 8.0876675e-04, 7.5970138e-03, 3.7628256e-02, 9.8273583e-02, 1.3533528e-01,
+ 9.8273583e-02, 3.7628256e-02, 7.5970138e-03, 8.0876675e-04 },
+ { 3.6246411e-03, 3.4047455e-02, 1.6863814e-01, 4.4043165e-01, 6.0653067e-01,
+ 4.4043165e-01, 1.6863814e-01, 3.4047455e-02, 3.6246411e-03 },
+ { 5.9760227e-03, 5.6134764e-02, 2.7803731e-01, 7.2614902e-01, 1.0000000e+00,
+ 7.2614902e-01, 2.7803731e-01, 5.6134764e-02, 5.9760227e-03 },
+ { 3.6246411e-03, 3.4047455e-02, 1.6863814e-01, 4.4043165e-01, 6.0653067e-01,
+ 4.4043165e-01, 1.6863814e-01, 3.4047455e-02, 3.6246411e-03 },
+ { 8.0876675e-04, 7.5970138e-03, 3.7628256e-02, 9.8273583e-02, 1.3533528e-01,
+ 9.8273583e-02, 3.7628256e-02, 7.5970138e-03, 8.0876675e-04 },
+ { 6.6387620e-05, 6.2360091e-04, 3.0887155e-03, 8.0667874e-03, 1.1108996e-02,
+ 8.0667874e-03, 3.0887155e-03, 6.2360091e-04, 6.6387620e-05 },
+ { 2.0047323e-06, 1.8831115e-05, 9.3271126e-05, 2.4359586e-04, 3.3546262e-04,
+ 2.4359586e-04, 9.3271126e-05, 1.8831115e-05, 2.0047323e-06 } },
+
+ { { 3.5712848e-05, -8.1631159e-05, -9.9486928e-04, 1.3962291e-03,
+ 6.6981313e-04, -4.4602581e-04, 3.0079011e-06, 3.8760313e-06,
+ -6.2165498e-08 },
+ { -8.1631159e-05, 3.1511115e-03, -3.7979286e-03, -2.4406660e-02,
+ 1.8061366e-02, 4.5687673e-03, -1.6041942e-03, 5.7044272e-06,
+ 3.8760313e-06 },
+ { -9.9486928e-04, -3.7979286e-03, 7.7304743e-02, -4.9129307e-02,
+ -1.6647682e-01, 6.4960226e-02, 8.6645801e-03, -1.6041942e-03,
+ 3.0079011e-06 },
+ { 1.3962291e-03, -2.4406660e-02, -4.9129307e-02, 5.2729243e-01,
+ -1.7670043e-01, -3.1572008e-01, 6.4960226e-02, 4.5687673e-03,
+ -4.4602581e-04 },
+ { 6.6981313e-04, 1.8061366e-02, -1.6647682e-01, -1.7670043e-01,
+ 1.0000000e+00, -1.7670043e-01, -1.6647682e-01, 1.8061366e-02,
+ 6.6981313e-04 },
+ { -4.4602581e-04, 4.5687673e-03, 6.4960226e-02, -3.1572008e-01,
+ -1.7670043e-01, 5.2729243e-01, -4.9129307e-02, -2.4406660e-02,
+ 1.3962291e-03 },
+ { 3.0079011e-06, -1.6041942e-03, 8.6645801e-03, 6.4960226e-02,
+ -1.6647682e-01, -4.9129307e-02, 7.7304743e-02, -3.7979286e-03,
+ -9.9486928e-04 },
+ { 3.8760313e-06, 5.7044272e-06, -1.6041942e-03, 4.5687673e-03,
+ 1.8061366e-02, -2.4406660e-02, -3.7979286e-03, 3.1511115e-03,
+ -8.1631159e-05 },
+ { -6.2165498e-08, 3.8760313e-06, 3.0079011e-06, -4.4602581e-04,
+ 6.6981313e-04, 1.3962291e-03, -9.9486928e-04, -8.1631159e-05,
+ 3.5712848e-05 } }
+};
+
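For reference, the table values above match the standard real Gabor kernel
g(x, y) = exp(-(x'^2 + gamma^2 * y'^2) / (2 * sigma^2)) * cos(2*pi*x'/lambda
+ phi), with x' = x*cos(theta) + y*sin(theta) and y' = -x*sin(theta) +
y*cos(theta). As a spot check for theta = 0: the entry one sample right of
center is exp(-0.5) ~= 0.6065, and one sample above center is
exp(-0.8^2/2) ~= 0.7261, matching the first table.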
+// This function extracts the red/green/blue channels and calculates the
+// intensity = (r+g+b)/3. Note that it only handles the 8-bit case for now.
+// TODO(linzhen): add high bitdepth support.
+static void get_color_intensity(const YV12_BUFFER_CONFIG *src,
+ int subsampling_x, int subsampling_y,
+ double *cr, double *cg, double *cb,
+ double *intensity) {
+ const uint8_t *y = src->buffers[0];
+ const uint8_t *u = src->buffers[1];
+ const uint8_t *v = src->buffers[2];
+
+ const int y_height = src->crop_heights[0];
+ const int y_width = src->crop_widths[0];
+ const int y_stride = src->strides[0];
+ const int c_stride = src->strides[1];
+
+ for (int i = 0; i < y_height; ++i) {
+ for (int j = 0; j < y_width; ++j) {
+ cr[i * y_width + j] =
+ fclamp((double)y[i * y_stride + j] +
+ 1.370 * (double)(v[(i >> subsampling_y) * c_stride +
+ (j >> subsampling_x)] -
+ 128),
+ 0, 255);
+ cg[i * y_width + j] =
+ fclamp((double)y[i * y_stride + j] -
+ 0.698 * (double)(u[(i >> subsampling_y) * c_stride +
+ (j >> subsampling_x)] -
+ 128) -
+ 0.337 * (double)(v[(i >> subsampling_y) * c_stride +
+ (j >> subsampling_x)] -
+ 128),
+ 0, 255);
+ cb[i * y_width + j] =
+ fclamp((double)y[i * y_stride + j] +
+ 1.732 * (double)(u[(i >> subsampling_y) * c_stride +
+ (j >> subsampling_x)] -
+ 128),
+ 0, 255);
+
+ intensity[i * y_width + j] =
+ (cr[i * y_width + j] + cg[i * y_width + j] + cb[i * y_width + j]) /
+ 3.0;
+ assert(intensity[i * y_width + j] >= 0 &&
+ intensity[i * y_width + j] <= 255);
+
+ intensity[i * y_width + j] /= 256;
+ cr[i * y_width + j] /= 256;
+ cg[i * y_width + j] /= 256;
+ cb[i * y_width + j] /= 256;
+ }
+ }
+}
+
+static INLINE double convolve_map(const double *filter, const double *map,
+ const int size) {
+ double result = 0;
+ for (int i = 0; i < size; ++i) {
+ result += filter[i] * map[i]; // symmetric filter is used
+ }
+ return result;
+}
+
+// This function decimates the map by half and applies a Gaussian filter to
+// the downsampled map.
+static INLINE void decimate_map(const double *map, int height, int width,
+ int stride, double *downsampled_map) {
+ const int new_width = width / 2;
+ const int window_size = 5;
+ const double gaussian_filter[25] = {
+ 1. / 256, 1.0 / 64, 3. / 128, 1. / 64, 1. / 256, 1. / 64, 1. / 16,
+ 3. / 32, 1. / 16, 1. / 64, 3. / 128, 3. / 32, 9. / 64, 3. / 32,
+ 3. / 128, 1. / 64, 1. / 16, 3. / 32, 1. / 16, 1. / 64, 1. / 256,
+ 1. / 64, 3. / 128, 1. / 64, 1. / 256
+ };
+
+ double map_region[25];
+ for (int y = 0; y < height - 1; y += 2) {
+ for (int x = 0; x < width - 1; x += 2) {
+ int i = 0;
+ for (int yy = y - window_size / 2; yy <= y + window_size / 2; ++yy) {
+ for (int xx = x - window_size / 2; xx <= x + window_size / 2; ++xx) {
+ int yvalue = clamp(yy, 0, height - 1);
+ int xvalue = clamp(xx, 0, width - 1);
+ map_region[i++] = map[yvalue * stride + xvalue];
+ }
+ }
+ downsampled_map[(y / 2) * new_width + (x / 2)] =
+ convolve_map(gaussian_filter, map_region, window_size * window_size);
+ }
+ }
+}
+
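The 5x5 gaussian_filter above is the separable binomial kernel: each entry is
b[i] * b[j] / 256 with b = {1, 4, 6, 4, 1}, i.e. the outer product of the
5-tap filter {1, 4, 6, 4, 1} / 16 with itself, so the weights sum to 1 and
the decimation preserves the overall intensity level.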
+// This function upscales the map from in_level size to out_level size.
+// Note that the map at "level - 1" is a 2x upscale of the map at "level".
+static INLINE int upscale_map(const double *input, int in_level, int out_level,
+ int height[9], int width[9], double *output) {
+ for (int level = in_level; level > out_level; level--) {
+ const int cur_width = width[level];
+ const int cur_height = height[level];
+ const int cur_stride = width[level];
+
+ double *original = (level == in_level) ? (double *)input : output;
+
+ assert(level > 0);
+
+ const int h_upscale = height[level - 1];
+ const int w_upscale = width[level - 1];
+ const int s_upscale = width[level - 1];
+
+ double *upscale = aom_malloc(h_upscale * w_upscale * sizeof(*upscale));
+
+ if (!upscale) {
+ return 0;
+ }
+
+ for (int i = 0; i < h_upscale; ++i) {
+ for (int j = 0; j < w_upscale; ++j) {
+ const int ii = clamp((i >> 1), 0, cur_height - 1);
+ const int jj = clamp((j >> 1), 0, cur_width - 1);
+ upscale[j + i * s_upscale] = (double)original[jj + ii * cur_stride];
+ }
+ }
+ memcpy(output, upscale, h_upscale * w_upscale * sizeof(double));
+ aom_free(upscale);
+ }
+
+ return 1;
+}
+
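The upscaling above is nearest-neighbor: each output pixel (i, j) at the
finer level reads from (i >> 1, j >> 1) at the coarser level, so every coarse
pixel is duplicated into a 2x2 block per level of upscaling.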
+// This function calculates the differences between a fine scale c and a
+// coarser scale s yielding the feature maps. c \in {2, 3, 4}, and s = c +
+// delta, where delta \in {3, 4}.
+static int center_surround_diff(const double *input[9], int height[9],
+ int width[9], saliency_feature_map *output[6]) {
+ int j = 0;
+ for (int k = 2; k < 5; ++k) {
+ int cur_height = height[k];
+ int cur_width = width[k];
+
+ if (upscale_map(input[k + 3], k + 3, k, height, width, output[j]->buf) ==
+ 0) {
+ return 0;
+ }
+
+ for (int r = 0; r < cur_height; ++r) {
+ for (int c = 0; c < cur_width; ++c) {
+ output[j]->buf[r * cur_width + c] =
+ fabs((double)(input[k][r * cur_width + c] -
+ output[j]->buf[r * cur_width + c]));
+ }
+ }
+
+ if (upscale_map(input[k + 4], k + 4, k, height, width,
+ output[j + 1]->buf) == 0) {
+ return 0;
+ }
+
+ for (int r = 0; r < cur_height; ++r) {
+ for (int c = 0; c < cur_width; ++c) {
+ output[j + 1]->buf[r * cur_width + c] =
+ fabs(input[k][r * cur_width + c] -
+ output[j + 1]->buf[r * cur_width + c]);
+ }
+ }
+
+ j += 2;
+ }
+ return 1;
+}
+
+// For color channels, the differences are calculated based on "color
+// double-opponency". For example, the RG feature map is constructed between a
+// fine scale c of the R-G component and a coarser scale s of the G-R
+// component.
+static int center_surround_diff_rgb(const double *input_1[9],
+ const double *input_2[9], int height[9],
+ int width[9],
+ saliency_feature_map *output[6]) {
+ int j = 0;
+ for (int k = 2; k < 5; ++k) {
+ int cur_height = height[k];
+ int cur_width = width[k];
+
+ if (upscale_map(input_2[k + 3], k + 3, k, height, width, output[j]->buf) ==
+ 0) {
+ return 0;
+ }
+
+ for (int r = 0; r < cur_height; ++r) {
+ for (int c = 0; c < cur_width; ++c) {
+ output[j]->buf[r * cur_width + c] =
+ fabs((double)(input_1[k][r * cur_width + c] -
+ output[j]->buf[r * cur_width + c]));
+ }
+ }
+
+ if (upscale_map(input_2[k + 4], k + 4, k, height, width,
+ output[j + 1]->buf) == 0) {
+ return 0;
+ }
+
+ for (int r = 0; r < cur_height; ++r) {
+ for (int c = 0; c < cur_width; ++c) {
+ output[j + 1]->buf[r * cur_width + c] =
+ fabs(input_1[k][r * cur_width + c] -
+ output[j + 1]->buf[r * cur_width + c]);
+ }
+ }
+
+ j += 2;
+ }
+ return 1;
+}
+
+// This function generates Gaussian pyramid images with indices from 0 to
+// 8, and constructs the feature maps by calculating the center-surround
+// differences.
+static int gaussian_pyramid(const double *src, int width[9], int height[9],
+ saliency_feature_map *dst[6]) {
+ double *gaussian_map[9]; // scale = 9
+ gaussian_map[0] =
+ (double *)aom_malloc(width[0] * height[0] * sizeof(*gaussian_map[0]));
+ if (!gaussian_map[0]) {
+ return 0;
+ }
+
+ memcpy(gaussian_map[0], src, width[0] * height[0] * sizeof(double));
+
+ for (int i = 1; i < 9; ++i) {
+ int stride = width[i - 1];
+ int new_width = width[i];
+ int new_height = height[i];
+
+ gaussian_map[i] =
+ (double *)aom_malloc(new_width * new_height * sizeof(*gaussian_map[i]));
+
+ if (!gaussian_map[i]) {
+ for (int l = 0; l < i; ++l) {
+ aom_free(gaussian_map[l]);
+ }
+ return 0;
+ }
+
+ memset(gaussian_map[i], 0, new_width * new_height * sizeof(double));
+
+ decimate_map(gaussian_map[i - 1], height[i - 1], width[i - 1], stride,
+ gaussian_map[i]);
+ }
+
+ if (center_surround_diff((const double **)gaussian_map, height, width, dst) ==
+ 0) {
+ for (int l = 0; l < 9; ++l) {
+ aom_free(gaussian_map[l]);
+ }
+ return 0;
+ }
+
+ for (int i = 0; i < 9; ++i) {
+ aom_free(gaussian_map[i]);
+ }
+ return 1;
+}
+
+static int gaussian_pyramid_rgb(double *src_1, double *src_2, int width[9],
+ int height[9], saliency_feature_map *dst[6]) {
+  double *gaussian_map[2][9] = { { NULL } };  // scale = 9
+ double *src[2];
+
+ src[0] = src_1;
+ src[1] = src_2;
+
+ for (int k = 0; k < 2; ++k) {
+ gaussian_map[k][0] = (double *)aom_malloc(width[0] * height[0] *
+ sizeof(*gaussian_map[k][0]));
+ if (!gaussian_map[k][0]) {
+      // Free everything allocated so far; unallocated entries are NULL and
+      // aom_free(NULL) is a no-op.
+      for (int l = 0; l < 2; ++l) {
+        for (int m = 0; m < 9; ++m) {
+          aom_free(gaussian_map[l][m]);
+        }
+      }
+ return 0;
+ }
+ memcpy(gaussian_map[k][0], src[k], width[0] * height[0] * sizeof(double));
+
+ for (int i = 1; i < 9; ++i) {
+ int stride = width[i - 1];
+ int new_width = width[i];
+ int new_height = height[i];
+
+ gaussian_map[k][i] = (double *)aom_malloc(new_width * new_height *
+ sizeof(*gaussian_map[k][i]));
+ if (!gaussian_map[k][i]) {
+        // Free everything allocated so far; unallocated entries are NULL.
+        for (int l = 0; l < 2; ++l) {
+          for (int m = 0; m < 9; ++m) {
+            aom_free(gaussian_map[l][m]);
+          }
+        }
+ return 0;
+ }
+ memset(gaussian_map[k][i], 0, new_width * new_height * sizeof(double));
+ decimate_map(gaussian_map[k][i - 1], height[i - 1], width[i - 1], stride,
+ gaussian_map[k][i]);
+ }
+ }
+
+ if (center_surround_diff_rgb((const double **)gaussian_map[0],
+ (const double **)gaussian_map[1], height, width,
+ dst) == 0) {
+ for (int l = 0; l < 2; ++l) {
+ for (int i = 0; i < 9; ++i) {
+ aom_free(gaussian_map[l][i]);
+ }
+ }
+ return 0;
+ }
+
+ for (int l = 0; l < 2; ++l) {
+ for (int i = 0; i < 9; ++i) {
+ aom_free(gaussian_map[l][i]);
+ }
+ }
+ return 1;
+}
+
+static int get_feature_map_intensity(double *intensity, int width[9],
+ int height[9],
+ saliency_feature_map *i_map[6]) {
+ if (gaussian_pyramid(intensity, width, height, i_map) == 0) {
+ return 0;
+ }
+ return 1;
+}
+
+static int get_feature_map_rgb(double *cr, double *cg, double *cb, int width[9],
+ int height[9], saliency_feature_map *rg_map[6],
+ saliency_feature_map *by_map[6]) {
+ double *rg_mat = aom_malloc(height[0] * width[0] * sizeof(*rg_mat));
+ double *by_mat = aom_malloc(height[0] * width[0] * sizeof(*by_mat));
+ double *gr_mat = aom_malloc(height[0] * width[0] * sizeof(*gr_mat));
+ double *yb_mat = aom_malloc(height[0] * width[0] * sizeof(*yb_mat));
+
+ if (!rg_mat || !by_mat || !gr_mat || !yb_mat) {
+ aom_free(rg_mat);
+ aom_free(by_mat);
+ aom_free(gr_mat);
+ aom_free(yb_mat);
+ return 0;
+ }
+
+ double r, g, b, y;
+ for (int i = 0; i < height[0]; ++i) {
+ for (int j = 0; j < width[0]; ++j) {
+ r = AOMMAX(0, cr[i * width[0] + j] -
+ (cg[i * width[0] + j] + cb[i * width[0] + j]) / 2);
+ g = AOMMAX(0, cg[i * width[0] + j] -
+ (cr[i * width[0] + j] + cb[i * width[0] + j]) / 2);
+ b = AOMMAX(0, cb[i * width[0] + j] -
+ (cr[i * width[0] + j] + cg[i * width[0] + j]) / 2);
+ y = AOMMAX(0, (cr[i * width[0] + j] + cg[i * width[0] + j]) / 2 -
+ fabs(cr[i * width[0] + j] - cg[i * width[0] + j]) / 2 -
+ cb[i * width[0] + j]);
+
+ rg_mat[i * width[0] + j] = r - g;
+ by_mat[i * width[0] + j] = b - y;
+ gr_mat[i * width[0] + j] = g - r;
+ yb_mat[i * width[0] + j] = y - b;
+ }
+ }
+
+ if (gaussian_pyramid_rgb(rg_mat, gr_mat, width, height, rg_map) == 0 ||
+ gaussian_pyramid_rgb(by_mat, yb_mat, width, height, by_map) == 0) {
+ aom_free(rg_mat);
+ aom_free(by_mat);
+ aom_free(gr_mat);
+ aom_free(yb_mat);
+ return 0;
+ }
+
+ aom_free(rg_mat);
+ aom_free(by_mat);
+ aom_free(gr_mat);
+ aom_free(yb_mat);
+ return 1;
+}
+
+static INLINE void filter2d(const double *input, const double kernel[9][9],
+ int width, int height, double *output) {
+ const int window_size = 9;
+ double map_section[81];
+ for (int y = 0; y <= height - 1; ++y) {
+ for (int x = 0; x <= width - 1; ++x) {
+ int i = 0;
+ for (int yy = y - window_size / 2; yy <= y + window_size / 2; ++yy) {
+ for (int xx = x - window_size / 2; xx <= x + window_size / 2; ++xx) {
+ int yvalue = clamp(yy, 0, height - 1);
+ int xvalue = clamp(xx, 0, width - 1);
+ map_section[i++] = input[yvalue * width + xvalue];
+ }
+ }
+
+ output[y * width + x] = 0;
+ for (int k = 0; k < window_size; ++k) {
+ for (int l = 0; l < window_size; ++l) {
+ output[y * width + x] +=
+ kernel[k][l] * map_section[k * window_size + l];
+ }
+ }
+ }
+ }
+}
+
+static int get_feature_map_orientation(const double *intensity, int width[9],
+ int height[9],
+ saliency_feature_map *dst[24]) {
+ double *gaussian_map[9];
+
+ gaussian_map[0] =
+ (double *)aom_malloc(width[0] * height[0] * sizeof(*gaussian_map[0]));
+ if (!gaussian_map[0]) {
+ return 0;
+ }
+ memcpy(gaussian_map[0], intensity, width[0] * height[0] * sizeof(double));
+
+ for (int i = 1; i < 9; ++i) {
+ int stride = width[i - 1];
+ int new_width = width[i];
+ int new_height = height[i];
+
+ gaussian_map[i] =
+ (double *)aom_malloc(new_width * new_height * sizeof(*gaussian_map[i]));
+ if (!gaussian_map[i]) {
+ for (int l = 0; l < i; ++l) {
+ aom_free(gaussian_map[l]);
+ }
+ return 0;
+ }
+ memset(gaussian_map[i], 0, new_width * new_height * sizeof(double));
+ decimate_map(gaussian_map[i - 1], height[i - 1], width[i - 1], stride,
+ gaussian_map[i]);
+ }
+
+  // NULL-initialized so the error paths below can free all entries safely.
+  double *tempGaborOutput[4][9] = { { NULL } };  //[angle: 0, 45, 90, 135
+                                                 // degree][filter_size]
+
+ for (int i = 2; i < 9; ++i) {
+ const int cur_height = height[i];
+ const int cur_width = width[i];
+ for (int j = 0; j < 4; ++j) {
+ tempGaborOutput[j][i] = (double *)aom_malloc(
+ cur_height * cur_width * sizeof(*tempGaborOutput[j][i]));
+ if (!tempGaborOutput[j][i]) {
+ for (int l = 0; l < 9; ++l) {
+ aom_free(gaussian_map[l]);
+ }
+ for (int h = 0; h < 4; ++h) {
+ for (int g = 2; g < 9; ++g) {
+ aom_free(tempGaborOutput[h][g]);
+ }
+ }
+ return 0;
+ }
+ filter2d(gaussian_map[i], kGaborFilter[j], cur_width, cur_height,
+ tempGaborOutput[j][i]);
+ }
+ }
+
+ for (int i = 0; i < 9; ++i) {
+ aom_free(gaussian_map[i]);
+ }
+
+ saliency_feature_map
+ *tmp[4][6]; //[angle: 0, 45, 90, 135 degree][filter_size]
+
+ for (int i = 0; i < 6; ++i) {
+ for (int j = 0; j < 4; ++j) {
+ tmp[j][i] = dst[j * 6 + i];
+ }
+ }
+
+ for (int j = 0; j < 4; ++j) {
+ if (center_surround_diff((const double **)tempGaborOutput[j], height, width,
+ tmp[j]) == 0) {
+ for (int h = 0; h < 4; ++h) {
+ for (int g = 2; g < 9; ++g) {
+ aom_free(tempGaborOutput[h][g]);
+ }
+ }
+ return 0;
+ }
+ }
+
+ for (int i = 2; i < 9; ++i) {
+ for (int j = 0; j < 4; ++j) {
+ aom_free(tempGaborOutput[j][i]);
+ }
+ }
+
+ return 1;
+}
+
+static INLINE void find_min_max(const saliency_feature_map *input,
+ double *max_value, double *min_value) {
+ assert(input && input->buf);
+ *min_value = DBL_MAX;
+ *max_value = 0.0;
+
+ for (int i = 0; i < input->height; ++i) {
+ for (int j = 0; j < input->width; ++j) {
+ assert(input->buf[i * input->width + j] >= 0.0);
+ *min_value = fmin(input->buf[i * input->width + j], *min_value);
+ *max_value = fmax(input->buf[i * input->width + j], *max_value);
+ }
+ }
+}
+
+static INLINE double average_local_max(const saliency_feature_map *input,
+ int stepsize) {
+ int numlocal = 0;
+ double lmaxmean = 0, lmax = 0, dummy = 0;
+ saliency_feature_map local_map;
+ local_map.height = stepsize;
+ local_map.width = stepsize;
+ local_map.buf =
+ (double *)aom_malloc(stepsize * stepsize * sizeof(*local_map.buf));
+
+ if (!local_map.buf) {
+ return -1;
+ }
+
+ for (int y = 0; y < input->height - stepsize; y += stepsize) {
+ for (int x = 0; x < input->width - stepsize; x += stepsize) {
+ for (int i = 0; i < stepsize; ++i) {
+ for (int j = 0; j < stepsize; ++j) {
+ local_map.buf[i * stepsize + j] =
+ input->buf[(y + i) * input->width + x + j];
+ }
+ }
+
+ find_min_max(&local_map, &lmax, &dummy);
+ lmaxmean += lmax;
+ numlocal++;
+ }
+ }
+
+ aom_free(local_map.buf);
+
+  return (numlocal > 0) ? lmaxmean / numlocal : -1;
+}
+
+// Linearly normalize the values in the map to [0,1].
+static void minmax_normalize(saliency_feature_map *input) {
+ double max_value, min_value;
+ find_min_max(input, &max_value, &min_value);
+
+ for (int i = 0; i < input->height; ++i) {
+ for (int j = 0; j < input->width; ++j) {
+ if (max_value != min_value) {
+ input->buf[i * input->width + j] =
+ input->buf[i * input->width + j] / (max_value - min_value) +
+ min_value / (min_value - max_value);
+ } else {
+ input->buf[i * input->width + j] -= min_value;
+ }
+ }
+ }
+}
+
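The two-term expression above is an expanded form of the usual min-max
normalization, since x / (max - min) + min / (min - max) =
(x - min) / (max - min), which maps min to 0 and max to 1.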
+// This function promotes meaningful “activation spots” in the map and
+// suppresses homogeneous areas.
+static int normalization_operator(saliency_feature_map *input, int stepsize) {
+ minmax_normalize(input);
+ double lmaxmean = average_local_max(input, stepsize);
+ if (lmaxmean < 0) {
+ return 0;
+ }
+ double normCoeff = (1 - lmaxmean) * (1 - lmaxmean);
+
+ for (int i = 0; i < input->height; ++i) {
+ for (int j = 0; j < input->width; ++j) {
+ input->buf[i * input->width + j] *= normCoeff;
+ }
+ }
+
+ return 1;
+}
+
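After min-max normalization the global peak of the map is 1, so the factor
(1 - lmaxmean)^2 is large when the local maxima average well below the global
peak, i.e. when the map has one dominant activation spot; maps with many
comparable peaks have lmaxmean close to 1 and are scaled down. This is the
map normalization operator N(.) from the Itti-Koch saliency model.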
+// Normalize the values in feature maps to [0,1], and then upscale all maps to
+// the original frame size.
+static int normalize_fm(saliency_feature_map *input[6], int width[9],
+ int height[9], int num_fm,
+ saliency_feature_map *output[6]) {
+ // Feature maps (FM) are generated by function "center_surround_diff()". The
+ // difference is between a fine scale c and a coarser scale s, where c \in {2,
+ // 3, 4}, and s = c + delta, where delta \in {3, 4}, and the FM size is scale
+ // c. Specifically, i=0: c=2 and s=5, i=1: c=2 and s=6, i=2: c=3 and s=6, i=3:
+ // c=3 and s=7, i=4: c=4 and s=7, i=5: c=4 and s=8.
+ for (int i = 0; i < num_fm; ++i) {
+    if (normalization_operator(input[i], 8) == 0) {
+ return 0;
+ }
+
+ // Upscale FM to original frame size
+ if (upscale_map(input[i]->buf, (i / 2) + 2, 0, height, width,
+ output[i]->buf) == 0) {
+ return 0;
+ }
+ }
+ return 1;
+}
+
+// Combine feature maps with the same category (intensity, color, or
+// orientation) into one conspicuity map.
+static int normalized_map(saliency_feature_map *input[6], int width[9],
+ int height[9], saliency_feature_map *output) {
+ int num_fm = 6;
+
+ saliency_feature_map *n_input[6];
+ for (int i = 0; i < 6; ++i) {
+ n_input[i] = (saliency_feature_map *)aom_malloc(sizeof(*n_input[i]));
+    if (!n_input[i]) {
+      for (int l = 0; l < i; ++l) {
+        aom_free(n_input[l]->buf);
+        aom_free(n_input[l]);
+      }
+      return 0;
+    }
+ n_input[i]->buf =
+ (double *)aom_malloc(width[0] * height[0] * sizeof(*n_input[i]->buf));
+ if (!n_input[i]->buf) {
+      aom_free(n_input[i]);
+      for (int l = 0; l < i; ++l) {
+        aom_free(n_input[l]->buf);
+        aom_free(n_input[l]);
+      }
+ return 0;
+ }
+ n_input[i]->height = height[0];
+ n_input[i]->width = width[0];
+ }
+
+ if (normalize_fm(input, width, height, num_fm, n_input) == 0) {
+ for (int i = 0; i < num_fm; ++i) {
+ aom_free(n_input[i]->buf);
+ aom_free(n_input[i]);
+ }
+ return 0;
+ }
+
+ // Add up all normalized feature maps with the same category into one map.
+ for (int i = 0; i < num_fm; ++i) {
+ for (int r = 0; r < height[0]; ++r) {
+ for (int c = 0; c < width[0]; ++c) {
+ output->buf[r * width[0] + c] += n_input[i]->buf[r * width[0] + c];
+ }
+ }
+ }
+
+ for (int i = 0; i < num_fm; ++i) {
+ aom_free(n_input[i]->buf);
+ aom_free(n_input[i]);
+ }
+
+  normalization_operator(output, 8);
+ return 1;
+}
+
+static int normalized_map_rgb(saliency_feature_map *rg_map[6],
+ saliency_feature_map *by_map[6], int width[9],
+ int height[9], saliency_feature_map *output) {
+ saliency_feature_map *color_cm[2]; // 0: color_cm_rg, 1: color_cm_by
+ for (int i = 0; i < 2; ++i) {
+ color_cm[i] = aom_malloc(sizeof(*color_cm[i]));
+    if (!color_cm[i]) {
+      for (int l = 0; l < i; ++l) {
+        aom_free(color_cm[l]->buf);
+        aom_free(color_cm[l]);
+      }
+      return 0;
+    }
+ color_cm[i]->buf =
+ (double *)aom_malloc(width[0] * height[0] * sizeof(*color_cm[i]->buf));
+ if (!color_cm[i]->buf) {
+      for (int l = 0; l < i; ++l) {
+        aom_free(color_cm[l]->buf);
+        aom_free(color_cm[l]);
+      }
+ aom_free(color_cm[i]);
+ return 0;
+ }
+
+ color_cm[i]->width = width[0];
+ color_cm[i]->height = height[0];
+ memset(color_cm[i]->buf, 0,
+ width[0] * height[0] * sizeof(*color_cm[i]->buf));
+ }
+
+ if (normalized_map(rg_map, width, height, color_cm[0]) == 0 ||
+ normalized_map(by_map, width, height, color_cm[1]) == 0) {
+ for (int i = 0; i < 2; ++i) {
+ aom_free(color_cm[i]->buf);
+ aom_free(color_cm[i]);
+ }
+ return 0;
+ }
+
+ for (int r = 0; r < height[0]; ++r) {
+ for (int c = 0; c < width[0]; ++c) {
+ output->buf[r * width[0] + c] = color_cm[0]->buf[r * width[0] + c] +
+ color_cm[1]->buf[r * width[0] + c];
+ }
+ }
+
+ for (int i = 0; i < 2; ++i) {
+ aom_free(color_cm[i]->buf);
+ aom_free(color_cm[i]);
+ }
+
+  normalization_operator(output, 8);
+ return 1;
+}
+
+static int normalized_map_orientation(saliency_feature_map *orientation_map[24],
+ int width[9], int height[9],
+ saliency_feature_map *output) {
+ int num_fms_per_angle = 6;
+
+ saliency_feature_map *ofm[4][6];
+ for (int i = 0; i < num_fms_per_angle; ++i) {
+ for (int j = 0; j < 4; ++j) {
+ ofm[j][i] = orientation_map[j * num_fms_per_angle + i];
+ }
+ }
+
+ // extract conspicuity map for each angle
+ saliency_feature_map *nofm = aom_malloc(sizeof(*nofm));
+ if (!nofm) {
+ return 0;
+ }
+ nofm->buf = (double *)aom_malloc(width[0] * height[0] * sizeof(*nofm->buf));
+ if (!nofm->buf) {
+ aom_free(nofm);
+ return 0;
+ }
+ nofm->height = height[0];
+ nofm->width = width[0];
+
+ for (int i = 0; i < 4; ++i) {
+ memset(nofm->buf, 0, width[0] * height[0] * sizeof(*nofm->buf));
+ if (normalized_map(ofm[i], width, height, nofm) == 0) {
+ aom_free(nofm->buf);
+ aom_free(nofm);
+ return 0;
+ }
+
+ for (int r = 0; r < height[0]; ++r) {
+ for (int c = 0; c < width[0]; ++c) {
+ output->buf[r * width[0] + c] += nofm->buf[r * width[0] + c];
+ }
+ }
+ }
+
+ aom_free(nofm->buf);
+ aom_free(nofm);
+
+  normalization_operator(output, 8);
+ return 1;
+}
+
+// Set pixel level saliency mask based on Itti-Koch algorithm
+int av1_set_saliency_map(AV1_COMP *cpi) {
+ AV1_COMMON *const cm = &cpi->common;
+
+ int frm_width = cm->width;
+ int frm_height = cm->height;
+
+ int pyr_height[9];
+ int pyr_width[9];
+
+ pyr_height[0] = frm_height;
+ pyr_width[0] = frm_width;
+
+ for (int i = 1; i < 9; ++i) {
+ pyr_width[i] = pyr_width[i - 1] / 2;
+ pyr_height[i] = pyr_height[i - 1] / 2;
+ }
+
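For example, with a 1280x720 source the nine pyramid levels are 1280x720,
640x360, 320x180, 160x90, 80x45, 40x22, 20x11, 10x5, and 5x2 (integer
division halves each dimension per level).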
+ double *cr = aom_malloc(frm_width * frm_height * sizeof(*cr));
+ double *cg = aom_malloc(frm_width * frm_height * sizeof(*cg));
+ double *cb = aom_malloc(frm_width * frm_height * sizeof(*cb));
+ double *intensity = aom_malloc(frm_width * frm_height * sizeof(*intensity));
+
+ if (!cr || !cg || !cb || !intensity) {
+ aom_free(cr);
+ aom_free(cg);
+ aom_free(cb);
+ aom_free(intensity);
+ return 0;
+ }
+
+ // Extract red / green / blue channels and intensity component
+ get_color_intensity(cpi->source, cm->seq_params->subsampling_x,
+ cm->seq_params->subsampling_y, cr, cg, cb, intensity);
+
+ // Feature Map Extraction
+ // intensity map
+ saliency_feature_map *i_map[6];
+ for (int i = 0; i < 6; ++i) {
+ int cur_height = pyr_height[(i / 2) + 2];
+ int cur_width = pyr_width[(i / 2) + 2];
+
+ i_map[i] = (saliency_feature_map *)aom_malloc(sizeof(*i_map[i]));
+ if (!i_map[i]) {
+ aom_free(cr);
+ aom_free(cg);
+ aom_free(cb);
+ aom_free(intensity);
+      for (int l = 0; l < i; ++l) {
+        aom_free(i_map[l]->buf);
+        aom_free(i_map[l]);
+      }
+ return 0;
+ }
+ i_map[i]->buf =
+ (double *)aom_malloc(cur_height * cur_width * sizeof(*i_map[i]->buf));
+ if (!i_map[i]->buf) {
+ aom_free(cr);
+ aom_free(cg);
+ aom_free(cb);
+ aom_free(intensity);
+ for (int l = 0; l < i; ++l) {
+ aom_free(i_map[l]->buf);
+ aom_free(i_map[l]);
+ }
+ return 0;
+ }
+ i_map[i]->height = cur_height;
+ i_map[i]->width = cur_width;
+ }
+
+ if (get_feature_map_intensity(intensity, pyr_width, pyr_height, i_map) == 0) {
+ aom_free(cr);
+ aom_free(cg);
+ aom_free(cb);
+ aom_free(intensity);
+ for (int l = 0; l < 6; ++l) {
+ aom_free(i_map[l]->buf);
+ aom_free(i_map[l]);
+ }
+ return 0;
+ }
+
+ // RGB map
+ saliency_feature_map *rg_map[6], *by_map[6];
+ for (int i = 0; i < 6; ++i) {
+ int cur_height = pyr_height[(i / 2) + 2];
+ int cur_width = pyr_width[(i / 2) + 2];
+ rg_map[i] = (saliency_feature_map *)aom_malloc(sizeof(*rg_map[i]));
+ by_map[i] = (saliency_feature_map *)aom_malloc(sizeof(*by_map[i]));
+ if (!rg_map[i] || !by_map[i]) {
+ aom_free(cr);
+ aom_free(cg);
+ aom_free(cb);
+ aom_free(intensity);
+      for (int l = 0; l < 6; ++l) {
+        aom_free(i_map[l]->buf);
+        aom_free(i_map[l]);
+      }
+      // Only rg_map/by_map entries before index i are fully allocated here.
+      for (int l = 0; l < i; ++l) {
+        aom_free(rg_map[l]->buf);
+        aom_free(by_map[l]->buf);
+        aom_free(rg_map[l]);
+        aom_free(by_map[l]);
+      }
+      aom_free(rg_map[i]);
+      aom_free(by_map[i]);
+ return 0;
+ }
+ rg_map[i]->buf =
+ (double *)aom_malloc(cur_height * cur_width * sizeof(*rg_map[i]->buf));
+ by_map[i]->buf =
+ (double *)aom_malloc(cur_height * cur_width * sizeof(*by_map[i]->buf));
+ if (!by_map[i]->buf || !rg_map[i]->buf) {
+ aom_free(cr);
+ aom_free(cg);
+ aom_free(cb);
+ aom_free(intensity);
+ for (int l = 0; l < 6; ++l) {
+ aom_free(i_map[l]->buf);
+ aom_free(i_map[l]);
+ }
+ for (int l = 0; l < i; ++l) {
+ aom_free(rg_map[l]->buf);
+ aom_free(by_map[l]->buf);
+ aom_free(rg_map[l]);
+ aom_free(by_map[l]);
+ }
+      // One of the two buffers may still have been allocated; aom_free(NULL)
+      // is a no-op, so free both unconditionally.
+      aom_free(rg_map[i]->buf);
+      aom_free(by_map[i]->buf);
+      aom_free(rg_map[i]);
+      aom_free(by_map[i]);
+      return 0;
+ }
+ rg_map[i]->height = cur_height;
+ rg_map[i]->width = cur_width;
+ by_map[i]->height = cur_height;
+ by_map[i]->width = cur_width;
+ }
+
+ if (get_feature_map_rgb(cr, cg, cb, pyr_width, pyr_height, rg_map, by_map) ==
+ 0) {
+ aom_free(cr);
+ aom_free(cg);
+ aom_free(cb);
+ aom_free(intensity);
+ for (int l = 0; l < 6; ++l) {
+ aom_free(i_map[l]->buf);
+ aom_free(rg_map[l]->buf);
+ aom_free(by_map[l]->buf);
+ aom_free(i_map[l]);
+ aom_free(rg_map[l]);
+ aom_free(by_map[l]);
+ }
+ return 0;
+ }
+
+ // Orientation map
+ saliency_feature_map *orientation_map[24];
+ for (int i = 0; i < 24; ++i) {
+ int cur_height = pyr_height[((i % 6) / 2) + 2];
+ int cur_width = pyr_width[((i % 6) / 2) + 2];
+
+ orientation_map[i] =
+ (saliency_feature_map *)aom_malloc(sizeof(*orientation_map[i]));
+ if (!orientation_map[i]) {
+ aom_free(cr);
+ aom_free(cg);
+ aom_free(cb);
+ aom_free(intensity);
+ for (int l = 0; l < 6; ++l) {
+ aom_free(i_map[l]->buf);
+ aom_free(rg_map[l]->buf);
+ aom_free(by_map[l]->buf);
+ aom_free(i_map[l]);
+ aom_free(rg_map[l]);
+ aom_free(by_map[l]);
+ }
+      for (int h = 0; h < i; ++h) {
+        aom_free(orientation_map[h]->buf);
+        aom_free(orientation_map[h]);
+      }
+ return 0;
+ }
+
+ orientation_map[i]->buf = (double *)aom_malloc(
+ cur_height * cur_width * sizeof(*orientation_map[i]->buf));
+ if (!orientation_map[i]->buf) {
+ aom_free(cr);
+ aom_free(cg);
+ aom_free(cb);
+ aom_free(intensity);
+ for (int l = 0; l < 6; ++l) {
+ aom_free(i_map[l]->buf);
+ aom_free(rg_map[l]->buf);
+ aom_free(by_map[l]->buf);
+ aom_free(i_map[l]);
+ aom_free(rg_map[l]);
+ aom_free(by_map[l]);
+ }
+
+      for (int h = 0; h < i; ++h) {
+        aom_free(orientation_map[h]->buf);
+        aom_free(orientation_map[h]);
+      }
+      aom_free(orientation_map[i]);
+ return 0;
+ }
+
+ orientation_map[i]->height = cur_height;
+ orientation_map[i]->width = cur_width;
+ }
+
+ if (get_feature_map_orientation(intensity, pyr_width, pyr_height,
+ orientation_map) == 0) {
+ aom_free(cr);
+ aom_free(cg);
+ aom_free(cb);
+ aom_free(intensity);
+ for (int l = 0; l < 6; ++l) {
+ aom_free(i_map[l]->buf);
+ aom_free(rg_map[l]->buf);
+ aom_free(by_map[l]->buf);
+ aom_free(i_map[l]);
+ aom_free(rg_map[l]);
+ aom_free(by_map[l]);
+ }
+ for (int h = 0; h < 24; ++h) {
+ aom_free(orientation_map[h]->buf);
+ aom_free(orientation_map[h]);
+ }
+ return 0;
+ }
+
+ aom_free(cr);
+ aom_free(cg);
+ aom_free(cb);
+ aom_free(intensity);
+
+ saliency_feature_map
+ *normalized_maps[3]; // 0: intensity, 1: color, 2: orientation
+
+ for (int i = 0; i < 3; ++i) {
+ normalized_maps[i] = aom_malloc(sizeof(*normalized_maps[i]));
+ if (!normalized_maps[i]) {
+ for (int l = 0; l < 6; ++l) {
+ aom_free(i_map[l]->buf);
+ aom_free(rg_map[l]->buf);
+ aom_free(by_map[l]->buf);
+ aom_free(i_map[l]);
+ aom_free(rg_map[l]);
+ aom_free(by_map[l]);
+ }
+
+ for (int h = 0; h < 24; ++h) {
+ aom_free(orientation_map[h]->buf);
+ aom_free(orientation_map[h]);
+ }
+
+      for (int l = 0; l < i; ++l) {
+        aom_free(normalized_maps[l]->buf);
+        aom_free(normalized_maps[l]);
+      }
+ return 0;
+ }
+ normalized_maps[i]->buf = (double *)aom_malloc(
+ frm_width * frm_height * sizeof(*normalized_maps[i]->buf));
+ if (!normalized_maps[i]->buf) {
+ for (int l = 0; l < 6; ++l) {
+ aom_free(i_map[l]->buf);
+ aom_free(rg_map[l]->buf);
+ aom_free(by_map[l]->buf);
+ aom_free(i_map[l]);
+ aom_free(rg_map[l]);
+ aom_free(by_map[l]);
+ }
+ for (int h = 0; h < 24; ++h) {
+ aom_free(orientation_map[h]->buf);
+ aom_free(orientation_map[h]);
+ }
+ for (int l = 0; l < i; ++l) {
+ aom_free(normalized_maps[l]->buf);
+ aom_free(normalized_maps[l]);
+ }
+ return 0;
+ }
+ normalized_maps[i]->width = frm_width;
+ normalized_maps[i]->height = frm_height;
+ memset(normalized_maps[i]->buf, 0,
+ frm_width * frm_height * sizeof(*normalized_maps[i]->buf));
+ }
+
+ // Conspicuity map generation
+ if (normalized_map(i_map, pyr_width, pyr_height, normalized_maps[0]) == 0 ||
+ normalized_map_rgb(rg_map, by_map, pyr_width, pyr_height,
+ normalized_maps[1]) == 0 ||
+ normalized_map_orientation(orientation_map, pyr_width, pyr_height,
+ normalized_maps[2]) == 0) {
+ for (int i = 0; i < 6; ++i) {
+ aom_free(i_map[i]->buf);
+ aom_free(rg_map[i]->buf);
+ aom_free(by_map[i]->buf);
+ aom_free(i_map[i]);
+ aom_free(rg_map[i]);
+ aom_free(by_map[i]);
+ }
+
+ for (int i = 0; i < 24; ++i) {
+ aom_free(orientation_map[i]->buf);
+ aom_free(orientation_map[i]);
+ }
+
+ for (int i = 0; i < 3; ++i) {
+ aom_free(normalized_maps[i]->buf);
+ aom_free(normalized_maps[i]);
+ }
+ return 0;
+ }
+
+ for (int i = 0; i < 6; ++i) {
+ aom_free(i_map[i]->buf);
+ aom_free(rg_map[i]->buf);
+ aom_free(by_map[i]->buf);
+ aom_free(i_map[i]);
+ aom_free(rg_map[i]);
+ aom_free(by_map[i]);
+ }
+
+ for (int i = 0; i < 24; ++i) {
+ aom_free(orientation_map[i]->buf);
+ aom_free(orientation_map[i]);
+ }
+
+ // Pixel level saliency map
+ saliency_feature_map *combined_saliency_map =
+ aom_malloc(sizeof(*combined_saliency_map));
+ if (!combined_saliency_map) {
+ for (int i = 0; i < 3; ++i) {
+ aom_free(normalized_maps[i]->buf);
+ aom_free(normalized_maps[i]);
+ }
+ return 0;
+ }
+
+ combined_saliency_map->buf = (double *)aom_malloc(
+ frm_width * frm_height * sizeof(*combined_saliency_map->buf));
+ if (!combined_saliency_map->buf) {
+ for (int i = 0; i < 3; ++i) {
+ aom_free(normalized_maps[i]->buf);
+ aom_free(normalized_maps[i]);
+ }
+
+ aom_free(combined_saliency_map);
+ return 0;
+ }
+ combined_saliency_map->height = frm_height;
+ combined_saliency_map->width = frm_width;
+
+ double w_intensity, w_color, w_orient;
+
+ w_intensity = w_color = w_orient = (double)1 / 3;
+
+ for (int r = 0; r < frm_height; ++r) {
+ for (int c = 0; c < frm_width; ++c) {
+ combined_saliency_map->buf[r * frm_width + c] =
+ (w_intensity * normalized_maps[0]->buf[r * frm_width + c] +
+ w_color * normalized_maps[1]->buf[r * frm_width + c] +
+ w_orient * normalized_maps[2]->buf[r * frm_width + c]);
+ }
+ }
+
+ for (int r = 0; r < frm_height; ++r) {
+ for (int c = 0; c < frm_width; ++c) {
+ int index = r * frm_width + c;
+ cpi->saliency_map[index] =
+ (uint8_t)(combined_saliency_map->buf[index] * 255);
+ }
+ }
+
+ for (int i = 0; i < 3; ++i) {
+ aom_free(normalized_maps[i]->buf);
+ aom_free(normalized_maps[i]);
+ }
+
+ aom_free(combined_saliency_map->buf);
+ aom_free(combined_saliency_map);
+
+ return 1;
+}
+
+// Set superblock level saliency mask for rdmult scaling
+int av1_setup_sm_rdmult_scaling_factor(AV1_COMP *cpi, double motion_ratio) {
+ AV1_COMMON *cm = &cpi->common;
+
+ saliency_feature_map *sb_saliency_map =
+ aom_malloc(sizeof(saliency_feature_map));
+
+ if (sb_saliency_map == NULL) {
+ return 0;
+ }
+
+ const int bsize = cm->seq_params->sb_size;
+ const int num_mi_w = mi_size_wide[bsize];
+ const int num_mi_h = mi_size_high[bsize];
+ const int block_width = block_size_wide[bsize];
+ const int block_height = block_size_high[bsize];
+ const int num_sb_cols = (cm->mi_params.mi_cols + num_mi_w - 1) / num_mi_w;
+ const int num_sb_rows = (cm->mi_params.mi_rows + num_mi_h - 1) / num_mi_h;
+
+ sb_saliency_map->height = num_sb_rows;
+ sb_saliency_map->width = num_sb_cols;
+ sb_saliency_map->buf = (double *)aom_malloc(num_sb_rows * num_sb_cols *
+ sizeof(*sb_saliency_map->buf));
+
+ if (sb_saliency_map->buf == NULL) {
+ aom_free(sb_saliency_map);
+ return 0;
+ }
+
+ for (int row = 0; row < num_sb_rows; ++row) {
+ for (int col = 0; col < num_sb_cols; ++col) {
+ const int index = row * num_sb_cols + col;
+ double total_pixel = 0;
+ double total_weight = 0;
+
+ for (int i = 0; i < block_height; i++) {
+ for (int j = 0; j < block_width; j++) {
+ if ((row * block_height + i) >= cpi->common.height ||
+ (col * block_width + j) >= cpi->common.width)
+ continue;
+ total_pixel++;
+ total_weight +=
+ cpi->saliency_map[(row * block_height + i) * cpi->common.width +
+ col * block_width + j];
+ }
+ }
+
+ assert(total_pixel > 0);
+
+ // Calculate the superblock level saliency map from pixel level saliency
+ // map
+ sb_saliency_map->buf[index] = total_weight / total_pixel;
+
+ // Further lower the superblock saliency score for boundary superblocks.
+ if (row < 1 || row > num_sb_rows - 2 || col < 1 ||
+ col > num_sb_cols - 2) {
+ sb_saliency_map->buf[index] /= 5;
+ }
+ }
+ }
+
+ // superblock level saliency map finalization
+ minmax_normalize(sb_saliency_map);
+
+ double log_sum = 0.0;
+ double sum = 0.0;
+ int block_count = 0;
+
+ // Calculate the average superblock sm_scaling_factor for a frame, to be used
+ // for clamping later.
+ for (int row = 0; row < num_sb_rows; ++row) {
+ for (int col = 0; col < num_sb_cols; ++col) {
+ const int index = row * num_sb_cols + col;
+ const double saliency = sb_saliency_map->buf[index];
+
+ cpi->sm_scaling_factor[index] = 1 - saliency;
+ sum += cpi->sm_scaling_factor[index];
+ block_count++;
+ }
+ }
+ assert(block_count > 0);
+ sum /= block_count;
+
+ // Calculate the geometric mean of superblock sm_scaling_factor for a frame,
+ // to be used for normalization.
+ for (int row = 0; row < num_sb_rows; ++row) {
+ for (int col = 0; col < num_sb_cols; ++col) {
+ const int index = row * num_sb_cols + col;
+ log_sum += log(fmax(cpi->sm_scaling_factor[index], 0.001));
+ cpi->sm_scaling_factor[index] =
+ fmax(cpi->sm_scaling_factor[index], 0.8 * sum);
+ }
+ }
+
+ log_sum = exp(log_sum / block_count);
+
+ // Normalize the sm_scaling_factor by geometric mean.
+ for (int row = 0; row < num_sb_rows; ++row) {
+ for (int col = 0; col < num_sb_cols; ++col) {
+ const int index = row * num_sb_cols + col;
+ assert(log_sum > 0);
+ cpi->sm_scaling_factor[index] /= log_sum;
+
+ // Modulate the sm_scaling_factor by frame basis motion factor
+ cpi->sm_scaling_factor[index] =
+ cpi->sm_scaling_factor[index] * motion_ratio;
+ }
+ }
+
+ aom_free(sb_saliency_map->buf);
+ aom_free(sb_saliency_map);
+ return 1;
+}
+
+// av1_setup_motion_ratio() is only enabled when CONFIG_REALTIME_ONLY is 0,
+// because the computations need to access the first pass stats which are
+// only available when CONFIG_REALTIME_ONLY is equal to 0.
+#if !CONFIG_REALTIME_ONLY
+// Set motion_ratio to reflect the amount of motion between two consecutive
+// frames. motion_ratio is used to set up the saliency_map based rdmult
+// scaling factor, i.e., the less motion there is, the more bits will be
+// spent on this frame, and vice versa.
+double av1_setup_motion_ratio(AV1_COMP *cpi) {
+ AV1_COMMON *cm = &cpi->common;
+ int frames_since_key =
+ cm->current_frame.display_order_hint - cpi->rc.frames_since_key;
+ const FIRSTPASS_STATS *cur_stats = av1_firstpass_info_peek(
+ &cpi->ppi->twopass.firstpass_info, frames_since_key);
+ assert(cur_stats != NULL);
+ assert(cpi->ppi->twopass.firstpass_info.total_stats.count > 0);
+
+ const double avg_intra_error =
+ exp(cpi->ppi->twopass.firstpass_info.total_stats.log_intra_error /
+ cpi->ppi->twopass.firstpass_info.total_stats.count);
+ const double avg_inter_error =
+ exp(cpi->ppi->twopass.firstpass_info.total_stats.log_coded_error /
+ cpi->ppi->twopass.firstpass_info.total_stats.count);
+
+ double inter_error = cur_stats->coded_error;
+ double error_stdev = 0;
+ const double avg_error =
+ cpi->ppi->twopass.firstpass_info.total_stats.intra_error /
+ cpi->ppi->twopass.firstpass_info.total_stats.count;
+ for (int i = 0; i < cpi->ppi->twopass.firstpass_info.total_stats.count; i++) {
+ const FIRSTPASS_STATS *stats =
+ &cpi->ppi->twopass.firstpass_info.stats_buf[i];
+ error_stdev +=
+ (stats->intra_error - avg_error) * (stats->intra_error - avg_error);
+ }
+ error_stdev =
+ sqrt(error_stdev / cpi->ppi->twopass.firstpass_info.total_stats.count);
+
+ double motion_ratio = 1;
+ if (error_stdev / fmax(avg_intra_error, 1) > 0.1) {
+ motion_ratio = inter_error / fmax(1, avg_inter_error);
+ motion_ratio = AOMMIN(motion_ratio, 1.5);
+ motion_ratio = AOMMAX(motion_ratio, 0.8);
+ }
+
+ return motion_ratio;
+}
+#endif // !CONFIG_REALTIME_ONLY
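
Stripped of the allocation and cleanup code, the scaling-factor math in av1_setup_sm_rdmult_scaling_factor() above reduces to a short chain. The following is a minimal standalone sketch of just that math, assuming min-max-normalized superblock saliency values in [0, 1]; the function and argument names are illustrative, not part of the patch:

#include <math.h>

// Sketch of the normalization chain: factor = 1 - saliency, clamp low
// factors to 0.8x the frame mean, normalize by the pre-clamp geometric
// mean, then modulate by the frame-level motion ratio (clamped to
// [0.8, 1.5] by av1_setup_motion_ratio()).
static void sm_scaling_sketch(const double *saliency, double *factor,
                              int num_sb, double motion_ratio) {
  double sum = 0.0, log_sum = 0.0;
  for (int i = 0; i < num_sb; ++i) {
    factor[i] = 1.0 - saliency[i];
    sum += factor[i];
  }
  sum /= num_sb;  // arithmetic mean, used as a clamping floor
  for (int i = 0; i < num_sb; ++i) {
    log_sum += log(fmax(factor[i], 0.001));  // log taken before clamping
    factor[i] = fmax(factor[i], 0.8 * sum);
  }
  const double geo_mean = exp(log_sum / num_sb);
  for (int i = 0; i < num_sb; ++i)
    factor[i] = factor[i] / geo_mean * motion_ratio;
}
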
diff --git a/av1/encoder/saliency_map.h b/av1/encoder/saliency_map.h
new file mode 100644
index 0000000..0d27f83
--- /dev/null
+++ b/av1/encoder/saliency_map.h
@@ -0,0 +1,28 @@
+/*
+ * Copyright (c) 2023, Alliance for Open Media. All rights reserved
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#ifndef AOM_AV1_ENCODER_SALIENCY_MAP_H_
+#define AOM_AV1_ENCODER_SALIENCY_MAP_H_
+#include "av1/encoder/encoder.h"
+
+typedef struct saliency_feature_map {
+ double *buf; // stores values of the map in 1D array
+ int height;
+ int width;
+} saliency_feature_map;
+
+int av1_set_saliency_map(AV1_COMP *cpi);
+#if !CONFIG_REALTIME_ONLY
+double av1_setup_motion_ratio(AV1_COMP *cpi);
+#endif
+int av1_setup_sm_rdmult_scaling_factor(AV1_COMP *cpi, double motion_ratio);
+
+#endif // AOM_AV1_ENCODER_SALIENCY_MAP_H_
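
Taken together, the three declarations imply a pipeline: pixel-level saliency first, then an optional two-pass motion factor, then the superblock scaling factors. Below is a hedged sketch of that wiring, inferred only from the data dependencies visible in this patch (run_saliency_rdmult_setup is an illustrative name; the encoder's actual call site is not part of this diff):

// av1_set_saliency_map() must run first, since
// av1_setup_sm_rdmult_scaling_factor() reads cpi->saliency_map.
static int run_saliency_rdmult_setup(AV1_COMP *cpi) {
  if (!av1_set_saliency_map(cpi)) return 0;
  double motion_ratio = 1.0;
#if !CONFIG_REALTIME_ONLY
  motion_ratio = av1_setup_motion_ratio(cpi);
#endif
  return av1_setup_sm_rdmult_scaling_factor(cpi, motion_ratio);
}
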
diff --git a/av1/encoder/speed_features.c b/av1/encoder/speed_features.c
index 9fbfbd7..19723a4 100644
--- a/av1/encoder/speed_features.c
+++ b/av1/encoder/speed_features.c
@@ -168,6 +168,14 @@
return frame_is_kf_gf_arf(cpi);
}
+// Set transform rd gate level for all transform search cases.
+static AOM_INLINE void set_txfm_rd_gate_level(
+ int txfm_rd_gate_level[TX_SEARCH_CASES], int level) {
+ assert(level <= MAX_TX_RD_GATE_LEVEL);
+ for (int idx = 0; idx < TX_SEARCH_CASES; idx++)
+ txfm_rd_gate_level[idx] = level;
+}
+
static void set_allintra_speed_feature_framesize_dependent(
const AV1_COMP *const cpi, SPEED_FEATURES *const sf, int speed) {
const AV1_COMMON *const cm = &cpi->common;
@@ -206,7 +214,7 @@
if (is_720p_or_larger) {
// TODO([email protected]): make this speed feature adaptive based on
// current block's vertical texture instead of hardcoded with resolution
- sf->mv_sf.use_downsampled_sad = 1;
+ sf->mv_sf.use_downsampled_sad = 2;
}
if (speed >= 1) {
@@ -309,6 +317,11 @@
if (speed >= 9) {
// TODO(kyslov): add more speed features to control speed/quality
if (!is_4k_or_larger) {
+ // In av1_select_sb_size(), superblock size is set to 64x64 only for
+ // resolutions less than 4k in speed>=9, to improve the multithread
+ // performance. If cost update levels are set to INTERNAL_COST_UPD_OFF
+ // for resolutions >= 4k, the SB size setting can be modified for these
+ // resolutions as well.
sf->inter_sf.coeff_cost_upd_level = INTERNAL_COST_UPD_OFF;
sf->inter_sf.mode_cost_upd_level = INTERNAL_COST_UPD_OFF;
}
@@ -422,6 +435,7 @@
sf->tx_sf.adaptive_txb_search_level = 2;
sf->tx_sf.tx_type_search.use_skip_flag_prediction = 2;
+ sf->tx_sf.use_rd_based_breakout_for_intra_tx_search = true;
// TODO(any): evaluate if these lpf features can be moved to speed 2.
  // For screen content, "prune_sgr_based_on_wiener = 2" causes large quality
@@ -478,7 +492,9 @@
sf->intra_sf.chroma_intra_pruning_with_hog = 3;
sf->lpf_sf.use_coarse_filter_level_search = 0;
- sf->lpf_sf.disable_lr_filter = 1;
+ // Disable Wiener and Self-guided Loop restoration filters.
+ sf->lpf_sf.disable_wiener_filter = true;
+ sf->lpf_sf.disable_sgr_filter = true;
sf->mv_sf.prune_mesh_search = PRUNE_MESH_SEARCH_LVL_2;
@@ -497,6 +513,7 @@
sf->part_sf.prune_rectangular_split_based_on_qidx =
allow_screen_content_tools ? 0 : 2;
+ sf->part_sf.prune_rect_part_using_4x4_var_deviation = true;
sf->part_sf.prune_sub_8x8_partition_level =
allow_screen_content_tools ? 0 : 1;
sf->part_sf.prune_part4_search = 3;
@@ -555,6 +572,8 @@
sf->rt_sf.var_part_split_threshold_shift = 9;
sf->rt_sf.vbp_prune_16x16_split_using_min_max_sub_blk_var = true;
sf->rt_sf.prune_h_pred_using_best_mode_so_far = true;
+ sf->rt_sf.enable_intra_mode_pruning_using_neighbors = true;
+ sf->rt_sf.prune_intra_mode_using_best_sad_so_far = true;
}
// As the speed feature prune_chroma_modes_using_luma_winner already
@@ -576,12 +595,21 @@
const int is_1080p_or_larger = AOMMIN(cm->width, cm->height) >= 1080;
const int is_4k_or_larger = AOMMIN(cm->width, cm->height) >= 2160;
const bool use_hbd = cpi->oxcf.use_highbitdepth;
+ // Speed features applicable for temporal filtering and tpl modules may be
+ // changed based on frame type at places where the sf is applied (Example :
+ // use_downsampled_sad). This is because temporal filtering and tpl modules
+ // are called before this function (except for the first key frame).
+ // TODO([email protected]): For the speed features applicable to temporal
+ // filtering and tpl modules, modify the sf initialization appropriately
+ // before calling the modules.
const int boosted = frame_is_boosted(cpi);
const int is_boosted_arf2_bwd_type =
boosted ||
cpi->ppi->gf_group.update_type[cpi->gf_frame_index] == INTNL_ARF_UPDATE;
const int is_lf_frame =
cpi->ppi->gf_group.update_type[cpi->gf_frame_index] == LF_UPDATE;
+ const int allow_screen_content_tools =
+ cm->features.allow_screen_content_tools;
if (is_480p_or_larger) {
sf->part_sf.use_square_partition_only_threshold = BLOCK_128X128;
@@ -612,7 +640,7 @@
if (is_720p_or_larger) {
// TODO([email protected]): make this speed feature adaptive based on
// current block's vertical texture instead of hardcoded with resolution
- sf->mv_sf.use_downsampled_sad = 1;
+ sf->mv_sf.use_downsampled_sad = 2;
}
if (!is_720p_or_larger) {
@@ -766,6 +794,8 @@
if (is_480p_or_larger) {
sf->tx_sf.tx_type_search.prune_tx_type_using_stats = 2;
+ } else {
+ sf->mv_sf.skip_fullpel_search_using_startmv = boosted ? 0 : 1;
}
sf->inter_sf.disable_interinter_wedge_var_thresh = UINT_MAX;
@@ -799,14 +829,16 @@
sf->inter_sf.skip_newmv_in_drl = 4;
sf->inter_sf.prune_comp_ref_frames = 1;
+ sf->mv_sf.skip_fullpel_search_using_startmv = boosted ? 0 : 1;
if (!is_720p_or_larger) {
sf->inter_sf.mv_cost_upd_level = INTERNAL_COST_UPD_SBROW_SET;
+ sf->inter_sf.prune_nearest_near_mv_using_refmv_weight =
+ (boosted || allow_screen_content_tools) ? 0 : 1;
+ sf->mv_sf.use_downsampled_sad = 1;
}
if (!is_480p_or_larger) {
- sf->tx_sf.tx_type_search.fast_inter_tx_type_prob_thresh =
- boosted ? INT_MAX : 250;
sf->part_sf.partition_search_breakout_dist_thr = (1 << 26);
}
@@ -821,6 +853,10 @@
sf->tx_sf.tx_type_search.winner_mode_tx_type_pruning = 4;
sf->inter_sf.prune_nearmv_using_neighbors = PRUNE_NEARMV_LEVEL3;
sf->inter_sf.prune_comp_ref_frames = 2;
+ sf->inter_sf.prune_nearest_near_mv_using_refmv_weight =
+ (boosted || allow_screen_content_tools) ? 0 : 1;
+ sf->mv_sf.skip_fullpel_search_using_startmv = boosted ? 0 : 2;
+
if (is_720p_or_larger) {
sf->part_sf.auto_max_partition_based_on_simple_motion = NOT_IN_USE;
} else if (is_480p_or_larger) {
@@ -855,12 +891,13 @@
}
if (!is_720p_or_larger) {
- sf->tx_sf.tx_type_search.fast_inter_tx_type_prob_thresh = 150;
+ sf->tx_sf.tx_type_search.fast_inter_tx_type_prob_thresh =
+ is_boosted_arf2_bwd_type ? 450 : 150;
}
sf->lpf_sf.cdef_pick_method = CDEF_FAST_SEARCH_LVL4;
- if (!is_480p_or_larger) sf->hl_sf.num_frames_used_in_tf = 3;
+ sf->hl_sf.recode_tolerance = 55;
}
}
@@ -881,7 +918,10 @@
}
// Speed 0 for all speed features that give neutral coding performance change.
- sf->gm_sf.gm_search_type = GM_REDUCED_REF_SEARCH_SKIP_L2_L3;
+ sf->gm_sf.gm_search_type = boosted ? GM_REDUCED_REF_SEARCH_SKIP_L2_L3_ARF2
+ : GM_SEARCH_CLOSEST_REFS_ONLY;
+ sf->gm_sf.prune_ref_frame_for_gm_search = boosted ? 0 : 1;
+ sf->gm_sf.disable_gm_search_based_on_stats = 1;
sf->part_sf.less_rectangular_check_level = 1;
sf->part_sf.ml_prune_partition = 1;
@@ -932,9 +972,6 @@
sf->hl_sf.superres_auto_search_type = SUPERRES_AUTO_DUAL;
if (speed >= 1) {
- sf->gm_sf.gm_search_type = GM_REDUCED_REF_SEARCH_SKIP_L2_L3_ARF2;
- sf->gm_sf.prune_ref_frame_for_gm_search = boosted ? 0 : 1;
-
sf->part_sf.intra_cnn_based_part_prune_level =
allow_screen_content_tools ? 0 : 2;
sf->part_sf.simple_motion_search_early_term_none = 1;
@@ -990,7 +1027,7 @@
sf->fp_sf.skip_motion_search_threshold = 25;
- sf->gm_sf.disable_gm_search_based_on_stats = 1;
+ sf->gm_sf.num_refinement_steps = 2;
sf->part_sf.reuse_best_prediction_for_part_ab =
!frame_is_intra_only(&cpi->common);
@@ -1012,10 +1049,9 @@
sf->inter_sf.prune_comp_type_by_comp_avg = 2;
sf->inter_sf.selective_ref_frame = 3;
sf->inter_sf.use_dist_wtd_comp_flag = DIST_WTD_COMP_DISABLED;
- // Enable fast search only for COMPOUND_DIFFWTD type.
sf->inter_sf.enable_fast_compound_mode_search = 1;
sf->inter_sf.reuse_mask_search_results = 1;
- sf->inter_sf.txfm_rd_gate_level = boosted ? 0 : 1;
+ set_txfm_rd_gate_level(sf->inter_sf.txfm_rd_gate_level, boosted ? 0 : 1);
sf->inter_sf.inter_mode_txfm_breakout = boosted ? 0 : 1;
sf->inter_sf.alt_ref_search_fp = 1;
@@ -1047,8 +1083,9 @@
if (speed >= 3) {
sf->hl_sf.high_precision_mv_usage = CURRENT_Q;
- sf->gm_sf.gm_search_type = GM_DISABLE_SEARCH;
+ sf->gm_sf.prune_ref_frame_for_gm_search = 1;
sf->gm_sf.prune_zero_mv_with_sse = 1;
+ sf->gm_sf.num_refinement_steps = 0;
sf->part_sf.less_rectangular_check_level = 2;
sf->part_sf.simple_motion_search_prune_agg =
@@ -1074,10 +1111,9 @@
sf->inter_sf.prune_inter_modes_based_on_tpl = boosted ? 0 : 1;
sf->inter_sf.prune_comp_search_by_single_result = boosted ? 4 : 2;
sf->inter_sf.selective_ref_frame = 5;
- sf->inter_sf.skip_repeated_ref_mv = 1;
sf->inter_sf.reuse_compound_type_decision = 1;
- sf->inter_sf.txfm_rd_gate_level =
- boosted ? 0 : (is_boosted_arf2_bwd_type ? 1 : 2);
+ set_txfm_rd_gate_level(sf->inter_sf.txfm_rd_gate_level,
+ boosted ? 0 : (is_boosted_arf2_bwd_type ? 1 : 2));
sf->inter_sf.inter_mode_txfm_breakout = boosted ? 0 : 2;
sf->interp_sf.adaptive_interp_filter_search = 2;
@@ -1121,10 +1157,10 @@
}
if (speed >= 4) {
- sf->gm_sf.prune_zero_mv_with_sse = 2;
-
sf->mv_sf.subpel_search_method = SUBPEL_TREE_PRUNED_MORE;
+ sf->gm_sf.prune_zero_mv_with_sse = 2;
+
sf->part_sf.simple_motion_search_prune_agg =
allow_screen_content_tools ? SIMPLE_AGG_LVL0 : SIMPLE_AGG_LVL2;
sf->part_sf.simple_motion_search_reduce_search_steps = 4;
@@ -1135,7 +1171,8 @@
: 1;
sf->inter_sf.alt_ref_search_fp = 2;
- sf->inter_sf.txfm_rd_gate_level = boosted ? 0 : 3;
+ sf->inter_sf.txfm_rd_gate_level[TX_SEARCH_DEFAULT] = boosted ? 0 : 3;
+ sf->inter_sf.txfm_rd_gate_level[TX_SEARCH_MOTION_MODE] = boosted ? 0 : 5;
sf->inter_sf.prune_inter_modes_based_on_tpl = boosted ? 0 : 2;
sf->inter_sf.prune_ext_comp_using_neighbors = 2;
@@ -1175,8 +1212,12 @@
}
if (speed >= 5) {
+ sf->hl_sf.weight_calc_level_in_tf = 1;
+
sf->fp_sf.reduce_mv_step_param = 4;
+ sf->gm_sf.gm_search_type = GM_DISABLE_SEARCH;
+
sf->part_sf.simple_motion_search_prune_agg =
allow_screen_content_tools ? SIMPLE_AGG_LVL0 : SIMPLE_AGG_LVL3;
sf->part_sf.ext_partition_eval_thresh =
@@ -1185,9 +1226,10 @@
(allow_screen_content_tools || frame_is_intra_only(&cpi->common)) ? 0
: 2;
+ sf->mv_sf.warp_search_method = WARP_SEARCH_DIAMOND;
+
sf->inter_sf.prune_inter_modes_if_skippable = 1;
- sf->inter_sf.txfm_rd_gate_level = boosted ? 0 : 4;
- // Enable fast search for all valid compound modes.
+ sf->inter_sf.txfm_rd_gate_level[TX_SEARCH_DEFAULT] = boosted ? 0 : 4;
sf->inter_sf.enable_fast_compound_mode_search = 2;
sf->intra_sf.chroma_intra_pruning_with_hog = 3;
@@ -1197,7 +1239,9 @@
frame_is_intra_only(&cpi->common) ? MULTI_WINNER_MODE_FAST
: MULTI_WINNER_MODE_OFF;
- sf->lpf_sf.disable_lr_filter = 1;
+ // Disable Self-guided Loop restoration filter.
+ sf->lpf_sf.disable_sgr_filter = true;
+ sf->lpf_sf.disable_wiener_coeff_refine_search = true;
sf->tpl_sf.prune_starting_mv = 3;
sf->tpl_sf.use_y_only_rate_distortion = 1;
@@ -1212,7 +1256,8 @@
if (speed >= 6) {
sf->hl_sf.disable_extra_sc_testing = 1;
sf->hl_sf.second_alt_ref_filtering = 0;
- sf->hl_sf.recode_tolerance = 55;
+ sf->hl_sf.adjust_num_frames_for_arf_filtering =
+ allow_screen_content_tools ? 0 : 1;
sf->inter_sf.prune_inter_modes_based_on_tpl = boosted ? 0 : 3;
sf->inter_sf.selective_ref_frame = 6;
@@ -1236,10 +1281,8 @@
sf->mv_sf.simple_motion_subpel_force_stop = FULL_PEL;
sf->mv_sf.use_bsize_dependent_search_method = 1;
- sf->mv_sf.skip_fullpel_search_using_startmv = boosted ? 0 : 1;
sf->tpl_sf.gop_length_decision_method = 3;
- sf->tpl_sf.disable_filtered_key_tpl = 1;
sf->rd_sf.perform_coeff_opt = is_boosted_arf2_bwd_type ? 6 : 8;
@@ -1300,6 +1343,10 @@
sf->rt_sf.use_adaptive_subpel_search = false;
}
if (speed >= 10) {
+ // TODO([email protected]): To be conservative, disable
+ // sf->rt_sf.estimate_motion_for_var_based_partition = 3 for speed 10/qvga
+ // for now. May enable it in the future.
+ sf->rt_sf.estimate_motion_for_var_based_partition = 0;
sf->rt_sf.skip_intra_pred = 2;
sf->rt_sf.hybrid_intra_pickmode = 3;
sf->rt_sf.reduce_mv_pel_precision_lowcomplex = 1;
@@ -1352,12 +1399,6 @@
if (speed == 7) {
sf->rt_sf.nonrd_check_partition_merge_mode = 2;
}
- if (speed >= 8) {
- sf->rt_sf.estimate_motion_for_var_based_partition = 1;
- }
- if (speed >= 9) {
- sf->rt_sf.estimate_motion_for_var_based_partition = 0;
- }
}
if (!is_720p_or_larger) {
if (speed >= 9) {
@@ -1399,18 +1440,22 @@
// For SVC: for greater than 2 temporal layers, use better mv search on
// base temporal layers, and only on base spatial layer if highest
// resolution is above 640x360.
- if (cpi->svc.number_temporal_layers > 2 &&
+ if (cpi->svc.number_temporal_layers >= 2 &&
cpi->svc.temporal_layer_id == 0 &&
(cpi->svc.spatial_layer_id == 0 ||
cpi->oxcf.frm_dim_cfg.width * cpi->oxcf.frm_dim_cfg.height <=
640 * 360)) {
sf->mv_sf.search_method = NSTEP;
- sf->mv_sf.subpel_search_method = SUBPEL_TREE;
- sf->rt_sf.fullpel_search_step_param = 6;
+ sf->mv_sf.subpel_search_method = SUBPEL_TREE_PRUNED;
+ sf->rt_sf.fullpel_search_step_param = 10;
sf->rt_sf.reduce_mv_pel_precision_highmotion = 0;
+ if (cm->width * cm->height <= 352 * 288)
+ sf->rt_sf.nonrd_prune_ref_frame_search = 2;
+ sf->rt_sf.force_large_partition_blocks_intra = 0;
}
if (speed >= 8) {
- sf->rt_sf.disable_cdf_update_non_reference_frame = true;
+ if (cpi->svc.number_temporal_layers > 2)
+ sf->rt_sf.disable_cdf_update_non_reference_frame = true;
sf->rt_sf.reduce_mv_pel_precision_highmotion = 3;
if (rtc_ref->non_reference_frame) {
sf->rt_sf.nonrd_aggressive_skip = 1;
@@ -1422,6 +1467,8 @@
sf->rt_sf.check_only_zero_zeromv_on_large_blocks = false;
else
sf->rt_sf.check_only_zero_zeromv_on_large_blocks = true;
+ sf->rt_sf.frame_level_mode_cost_update = false;
+
// Compound mode enabling.
if (rtc_ref->ref_frame_comp[0] || rtc_ref->ref_frame_comp[1] ||
rtc_ref->ref_frame_comp[2]) {
@@ -1439,6 +1486,20 @@
if (cpi->svc.number_spatial_layers > 1 ||
cpi->svc.number_temporal_layers > 1)
sf->hl_sf.accurate_bit_estimate = 0;
+
+ // TODO([email protected]): test to see if
+ // estimate_motion_for_var_based_partition == 2 helps here.
+ if (sf->rt_sf.estimate_motion_for_var_based_partition == 2)
+ sf->rt_sf.estimate_motion_for_var_based_partition = 1;
+ if (speed >= 9) sf->rt_sf.estimate_motion_for_var_based_partition = 0;
+
+ // For single layers RPS: bias/adjustment for recovery frame.
+ if (cpi->ppi->rtc_ref.bias_recovery_frame) {
+ sf->mv_sf.search_method = NSTEP;
+ sf->mv_sf.subpel_search_method = SUBPEL_TREE;
+ sf->rt_sf.fullpel_search_step_param = 8;
+ sf->rt_sf.nonrd_aggressive_skip = 0;
+ }
}
// Screen settings.
if (cpi->oxcf.tune_cfg.content == AOM_CONTENT_SCREEN) {
@@ -1446,6 +1507,7 @@
if (speed >= 7) {
sf->rt_sf.reduce_mv_pel_precision_highmotion = 1;
sf->mv_sf.use_bsize_dependent_search_method = 0;
+ sf->rt_sf.skip_cdef_sb = 1;
}
if (speed >= 8) {
sf->rt_sf.nonrd_check_partition_merge_mode = 3;
@@ -1469,13 +1531,18 @@
if (speed >= 10) {
if (cm->width * cm->height > 1920 * 1080)
sf->part_sf.disable_8x8_part_based_on_qidx = 1;
- sf->rt_sf.set_zeromv_skip_based_on_source_sad = 2;
sf->rt_sf.screen_content_cdef_filter_qindex_thresh = 80;
sf->rt_sf.part_early_exit_zeromv = 1;
sf->rt_sf.nonrd_aggressive_skip = 1;
}
+ if (speed >= 11) {
+ sf->rt_sf.skip_lf_screen = 2;
+ sf->rt_sf.skip_cdef_sb = 2;
+ sf->rt_sf.part_early_exit_zeromv = 2;
+ sf->rt_sf.prune_palette_nonrd = 1;
+ sf->rt_sf.set_zeromv_skip_based_on_source_sad = 2;
+ }
sf->rt_sf.use_nonrd_altref_frame = 0;
- sf->rt_sf.skip_cdef_sb = 1;
sf->rt_sf.use_rtc_tf = 0;
sf->rt_sf.use_comp_ref_nonrd = 0;
sf->rt_sf.source_metrics_sb_nonrd = 1;
@@ -1497,6 +1564,18 @@
}
sf->rt_sf.partition_direct_merging = 0;
sf->hl_sf.accurate_bit_estimate = 0;
+
+ // "sf->rt_sf.estimate_motion_for_var_based_partition = 2" doesn't work well
+    // for screen content.
+ if (sf->rt_sf.estimate_motion_for_var_based_partition == 2)
+ sf->rt_sf.estimate_motion_for_var_based_partition = 1;
+ if (speed >= 9) sf->rt_sf.estimate_motion_for_var_based_partition = 0;
+ }
+ if (is_lossless_requested(&cpi->oxcf.rc_cfg)) {
+ sf->rt_sf.use_rtc_tf = 0;
+ // TODO(aomedia:3412): The setting accurate_bit_estimate = 0
+ // can be removed once it's fixed for lossless mode.
+ sf->hl_sf.accurate_bit_estimate = 0;
}
}
@@ -1532,7 +1611,9 @@
sf->intra_sf.dv_cost_upd_level = INTERNAL_COST_UPD_OFF;
sf->tx_sf.model_based_prune_tx_search_level = 0;
sf->lpf_sf.dual_sgr_penalty_level = 1;
- sf->lpf_sf.disable_lr_filter = 1;
+ // Disable Wiener and Self-guided Loop restoration filters.
+ sf->lpf_sf.disable_wiener_filter = true;
+ sf->lpf_sf.disable_sgr_filter = true;
sf->rt_sf.skip_interp_filter_search = 1;
sf->intra_sf.prune_palette_search_level = 2;
sf->intra_sf.prune_luma_palette_size_search_level = 2;
@@ -1557,7 +1638,7 @@
sf->inter_sf.disable_interintra_wedge_var_thresh = UINT_MAX;
sf->inter_sf.selective_ref_frame = 4;
sf->inter_sf.alt_ref_search_fp = 2;
- sf->inter_sf.txfm_rd_gate_level = boosted ? 0 : 4;
+ set_txfm_rd_gate_level(sf->inter_sf.txfm_rd_gate_level, boosted ? 0 : 4);
sf->inter_sf.limit_txfm_eval_per_mode = 3;
sf->inter_sf.adaptive_rd_thresh = 4;
@@ -1628,7 +1709,7 @@
sf->winner_mode_sf.winner_mode_ifs = 1;
sf->rt_sf.check_intra_pred_nonrd = 1;
- sf->rt_sf.estimate_motion_for_var_based_partition = 1;
+ sf->rt_sf.estimate_motion_for_var_based_partition = 2;
sf->rt_sf.hybrid_intra_pickmode = 1;
sf->rt_sf.use_comp_ref_nonrd = 0;
sf->rt_sf.ref_frame_comp_nonrd[0] = 0;
@@ -1754,7 +1835,6 @@
if (speed >= 8) {
sf->rt_sf.sse_early_term_inter_search = EARLY_TERM_IDX_2;
sf->intra_sf.intra_pruning_with_hog = 1;
- sf->rt_sf.estimate_motion_for_var_based_partition = 1;
sf->rt_sf.short_circuit_low_temp_var = 1;
sf->rt_sf.use_nonrd_altref_frame = 0;
sf->rt_sf.nonrd_prune_ref_frame_search = 2;
@@ -1768,7 +1848,7 @@
}
if (speed >= 9) {
sf->rt_sf.sse_early_term_inter_search = EARLY_TERM_IDX_3;
- sf->rt_sf.estimate_motion_for_var_based_partition = 0;
+ sf->rt_sf.estimate_motion_for_var_based_partition = 3;
sf->rt_sf.prefer_large_partition_blocks = 3;
sf->rt_sf.skip_intra_pred = 2;
sf->rt_sf.var_part_split_threshold_shift = 9;
@@ -1799,8 +1879,9 @@
hl_sf->superres_auto_search_type = SUPERRES_AUTO_ALL;
hl_sf->disable_extra_sc_testing = 0;
hl_sf->second_alt_ref_filtering = 1;
- hl_sf->num_frames_used_in_tf = INT_MAX;
+ hl_sf->adjust_num_frames_for_arf_filtering = 0;
hl_sf->accurate_bit_estimate = 0;
+ hl_sf->weight_calc_level_in_tf = 0;
}
static AOM_INLINE void init_fp_sf(FIRST_PASS_SPEED_FEATURES *fp_sf) {
@@ -1818,7 +1899,6 @@
tpl_sf->skip_alike_starting_mv = 0;
tpl_sf->subpel_force_stop = EIGHTH_PEL;
tpl_sf->search_method = NSTEP;
- tpl_sf->disable_filtered_key_tpl = 0;
tpl_sf->prune_ref_frames_in_tpl = 0;
tpl_sf->allow_compound_pred = 1;
tpl_sf->use_y_only_rate_distortion = 0;
@@ -1829,6 +1909,7 @@
gm_sf->prune_ref_frame_for_gm_search = 0;
gm_sf->prune_zero_mv_with_sse = 0;
gm_sf->disable_gm_search_based_on_stats = 0;
+ gm_sf->num_refinement_steps = GM_MAX_REFINEMENT_STEPS;
}
static AOM_INLINE void init_part_sf(PARTITION_SPEED_FEATURES *part_sf) {
@@ -1864,6 +1945,7 @@
part_sf->rect_partition_eval_thresh = BLOCK_128X128;
part_sf->prune_ext_part_using_split_info = 0;
part_sf->prune_rectangular_split_based_on_qidx = 0;
+ part_sf->prune_rect_part_using_4x4_var_deviation = false;
part_sf->early_term_after_none_split = 0;
part_sf->ml_predict_breakout_level = 0;
part_sf->prune_sub_8x8_partition_level = 0;
@@ -1894,6 +1976,8 @@
mv_sf->disable_extensive_joint_motion_search = 0;
mv_sf->disable_second_mv = 0;
mv_sf->skip_fullpel_search_using_startmv = 0;
+ mv_sf->warp_search_method = WARP_SEARCH_SQUARE;
+ mv_sf->warp_search_iters = 8;
}
static AOM_INLINE void init_inter_sf(INTER_MODE_SPEED_FEATURES *inter_sf) {
@@ -1934,7 +2018,6 @@
inter_sf->prune_ref_mv_idx_search = 0;
inter_sf->prune_warped_prob_thresh = 0;
inter_sf->reuse_compound_type_decision = 0;
- inter_sf->txfm_rd_gate_level = 0;
inter_sf->prune_inter_modes_if_skippable = 0;
inter_sf->disable_masked_comp = 0;
inter_sf->enable_fast_compound_mode_search = 0;
@@ -1944,6 +2027,7 @@
inter_sf->limit_inter_mode_cands = 0;
inter_sf->limit_txfm_eval_per_mode = 0;
inter_sf->skip_arf_compound = 0;
+ set_txfm_rd_gate_level(inter_sf->txfm_rd_gate_level, 0);
}
static AOM_INLINE void init_interp_sf(INTERP_FILTER_SPEED_FEATURES *interp_sf) {
@@ -2001,6 +2085,7 @@
tx_sf->refine_fast_tx_search_results = 1;
tx_sf->prune_tx_size_level = 0;
tx_sf->prune_intra_tx_depths_using_nn = false;
+ tx_sf->use_rd_based_breakout_for_intra_tx_search = false;
}
static AOM_INLINE void init_rd_sf(RD_CALC_SPEED_FEATURES *rd_sf,
@@ -2058,7 +2143,10 @@
lpf_sf->cdef_pick_method = CDEF_FULL_SEARCH;
// Set decoder side speed feature to use less dual sgr modes
lpf_sf->dual_sgr_penalty_level = 0;
- lpf_sf->disable_lr_filter = 0;
+ // Enable Wiener and Self-guided Loop restoration filters by default.
+ lpf_sf->disable_wiener_filter = false;
+ lpf_sf->disable_sgr_filter = false;
+ lpf_sf->disable_wiener_coeff_refine_search = false;
lpf_sf->use_downsampled_wiener_stats = 0;
}
@@ -2106,6 +2194,7 @@
rt_sf->gf_refresh_based_on_qp = 0;
rt_sf->use_rtc_tf = 0;
rt_sf->prune_idtx_nonrd = 0;
+ rt_sf->prune_palette_nonrd = 0;
rt_sf->part_early_exit_zeromv = 0;
rt_sf->sse_early_term_inter_search = EARLY_TERM_DISABLED;
rt_sf->skip_lf_screen = 0;
@@ -2117,6 +2206,8 @@
rt_sf->prune_compoundmode_with_singlecompound_var = false;
rt_sf->frame_level_mode_cost_update = false;
rt_sf->prune_h_pred_using_best_mode_so_far = false;
+ rt_sf->enable_intra_mode_pruning_using_neighbors = false;
+ rt_sf->prune_intra_mode_using_best_sad_so_far = false;
rt_sf->check_only_zero_zeromv_on_large_blocks = false;
rt_sf->disable_cdf_update_non_reference_frame = false;
rt_sf->prune_compoundmode_with_singlemode_var = false;
@@ -2128,23 +2219,22 @@
rt_sf->check_globalmv_on_single_ref = true;
}
+static fractional_mv_step_fp
+ *const fractional_mv_search[SUBPEL_SEARCH_METHODS] = {
+ av1_find_best_sub_pixel_tree, // SUBPEL_TREE = 0
+ av1_find_best_sub_pixel_tree_pruned, // SUBPEL_TREE_PRUNED = 1
+ av1_find_best_sub_pixel_tree_pruned_more // SUBPEL_TREE_PRUNED_MORE = 2
+ };
+
// Populate appropriate sub-pel search method based on speed feature and user
// specified settings
static void set_subpel_search_method(
MotionVectorSearchParams *mv_search_params,
unsigned int motion_vector_unit_test,
- SUBPEL_SEARCH_METHODS subpel_search_method) {
- if (subpel_search_method == SUBPEL_TREE) {
- mv_search_params->find_fractional_mv_step = av1_find_best_sub_pixel_tree;
- } else if (subpel_search_method == SUBPEL_TREE_PRUNED) {
- mv_search_params->find_fractional_mv_step =
- av1_find_best_sub_pixel_tree_pruned;
- } else if (subpel_search_method == SUBPEL_TREE_PRUNED_MORE) {
- mv_search_params->find_fractional_mv_step =
- av1_find_best_sub_pixel_tree_pruned_more;
- } else {
- assert(0);
- }
+ SUBPEL_SEARCH_METHOD subpel_search_method) {
+ assert(subpel_search_method <= SUBPEL_TREE_PRUNED_MORE);
+ mv_search_params->find_fractional_mv_step =
+ fractional_mv_search[subpel_search_method];
// This is only used in motion vector unit test.
if (motion_vector_unit_test == 1)
@@ -2232,12 +2322,30 @@
sf->winner_mode_sf.tx_size_search_level = 3;
}
+ if (cpi->mt_info.num_workers > 1) {
+ // Loop restoration stage is conditionally disabled for speed 5, 6 when
+ // num_workers > 1. Since av1_pick_filter_restoration() is not
+ // multi-threaded, enabling the Loop restoration stage will cause an
+    // increase in encode time (3% to 7% increase depending on frame
+    // resolution).
+ // TODO(aomedia:3446): Implement multi-threading of
+ // av1_pick_filter_restoration() and enable Wiener filter for speed 5, 6
+ // similar to single thread encoding path.
+ if (speed >= 5) {
+ sf->lpf_sf.disable_sgr_filter = true;
+ sf->lpf_sf.disable_wiener_filter = true;
+ }
+ }
+
if (!cpi->ppi->seq_params_locked) {
cpi->common.seq_params->order_hint_info.enable_dist_wtd_comp &=
(sf->inter_sf.use_dist_wtd_comp_flag != DIST_WTD_COMP_DISABLED);
cpi->common.seq_params->enable_dual_filter &=
!sf->interp_sf.disable_dual_filter;
- cpi->common.seq_params->enable_restoration &= !sf->lpf_sf.disable_lr_filter;
+    // Set the flag 'enable_restoration' if one of the Loop restoration filters
+ // (i.e., Wiener or Self-guided) is enabled.
+ cpi->common.seq_params->enable_restoration &=
+ (!sf->lpf_sf.disable_wiener_filter || !sf->lpf_sf.disable_sgr_filter);
cpi->common.seq_params->enable_interintra_compound &=
(sf->inter_sf.disable_interintra_wedge_var_thresh != UINT_MAX);
@@ -2469,6 +2577,17 @@
}
}
+ if (speed == 5) {
+ if (!(frame_is_intra_only(&cpi->common) ||
+ cm->features.allow_screen_content_tools)) {
+ const int qindex[2] = { 256, 128 };
+ // Set the sf value as 3 for low resolution and
+ // for higher resolutions with low quantizers.
+ if (cm->quant_params.base_qindex < qindex[is_480p_or_larger])
+ sf->tx_sf.tx_type_search.winner_mode_tx_type_pruning = 3;
+ }
+ }
+
set_subpel_search_method(&cpi->mv_search_params,
cpi->oxcf.unit_test_cfg.motion_vector_unit_test,
sf->mv_sf.subpel_search_method);
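
The scalar-to-array change for txfm_rd_gate_level is what lets the later speed levels gate motion-mode transform search harder than the default search, which the old scalar field could not express. For a non-boosted, non-arf2 frame, the per-case levels accumulate across speed steps as follows (constants collected from the hunks above; the helper name is illustrative and assumes the TX_SEARCH_CASE enum and set_txfm_rd_gate_level() from this patch are in scope):

static void txfm_rd_gate_example(int gate[TX_SEARCH_CASES], int speed) {
  if (speed >= 2) set_txfm_rd_gate_level(gate, 1);  // all cases: 1
  if (speed >= 3) set_txfm_rd_gate_level(gate, 2);  // all cases: 2
  if (speed >= 4) {
    gate[TX_SEARCH_DEFAULT] = 3;      // per-case overrides start here
    gate[TX_SEARCH_MOTION_MODE] = 5;  // motion mode rd gated harder
  }
  if (speed >= 5) {
    gate[TX_SEARCH_DEFAULT] = 4;  // MOTION_MODE keeps 5 from speed 4
  }
}
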
diff --git a/av1/encoder/speed_features.h b/av1/encoder/speed_features.h
index 00c21e5..27a07c5 100644
--- a/av1/encoder/speed_features.h
+++ b/av1/encoder/speed_features.h
@@ -35,6 +35,11 @@
GM_FULL_SEARCH,
GM_REDUCED_REF_SEARCH_SKIP_L2_L3,
GM_REDUCED_REF_SEARCH_SKIP_L2_L3_ARF2,
+
+ // Same as GM_REDUCED_REF_SEARCH_SKIP_L2_L3_ARF2 but with extra filtering
+ // to keep at most two ref frames
+ GM_SEARCH_CLOSEST_REFS_ONLY,
+
GM_DISABLE_SEARCH
} UENUM1BYTE(GM_SEARCH_TYPE);
@@ -134,7 +139,8 @@
SUBPEL_TREE = 0,
SUBPEL_TREE_PRUNED = 1, // Prunes 1/2-pel searches
SUBPEL_TREE_PRUNED_MORE = 2, // Prunes 1/2-pel searches more aggressively
-} UENUM1BYTE(SUBPEL_SEARCH_METHODS);
+ SUBPEL_SEARCH_METHODS
+} UENUM1BYTE(SUBPEL_SEARCH_METHOD);
enum {
// Try the full image with different values.
@@ -233,6 +239,16 @@
PRUNE_NEARMV_MAX = PRUNE_NEARMV_LEVEL3,
} UENUM1BYTE(PRUNE_NEARMV_LEVEL);
+enum {
+ // Default Transform search case - used in evaluation of compound type mode
+ // and best inter candidates
+ TX_SEARCH_DEFAULT = 0,
+ // Transform search in motion mode rd
+ TX_SEARCH_MOTION_MODE,
+ // All transform search cases
+ TX_SEARCH_CASES
+} UENUM1BYTE(TX_SEARCH_CASE);
+
typedef struct {
TX_TYPE_PRUNE_MODE prune_2d_txfm_mode;
int fast_intra_tx_type_search;
@@ -431,10 +447,10 @@
int second_alt_ref_filtering;
/*!
- * Number of frames to be used in temporal filtering controlled based on noise
- * levels and arf-q.
+ * The number of frames to be used during temporal filtering of an ARF frame
+   * is adjusted based on the noise level of the current frame.
*/
- int num_frames_used_in_tf;
+ int adjust_num_frames_for_arf_filtering;
/*!
* Decide the bit estimation approach used in qindex decision.
@@ -442,6 +458,13 @@
* 1: estimate bits more accurately based on the frame complexity.
*/
int accurate_bit_estimate;
+
+ /*!
+ * Decide the approach for weight calculation during temporal filtering.
+ * 0: Calculate weight using exp()
+ * 1: Calculate weight using a lookup table that approximates exp().
+ */
+ int weight_calc_level_in_tf;
} HIGH_LEVEL_SPEED_FEATURES;
/*!
@@ -505,9 +528,6 @@
// Prune starting mvs in TPL based on sad scores.
int prune_starting_mv;
- // Not run TPL for filtered Key frame.
- int disable_filtered_key_tpl;
-
// Prune reference frames in TPL.
int prune_ref_frames_in_tpl;
@@ -536,6 +556,9 @@
// Disable global motion estimation based on stats of previous frames in the
// GF group
int disable_gm_search_based_on_stats;
+
+ // Number of refinement steps to apply after initial model generation
+ int num_refinement_steps;
} GLOBAL_MOTION_SPEED_FEATURES;
typedef struct PARTITION_SPEED_FEATURES {
@@ -649,6 +672,18 @@
// 2 : prune all block size based on qindex
int prune_rectangular_split_based_on_qidx;
+ // Prune rectangular partitions based on 4x4 sub-block variance
+ // false : no pruning
+ // true : prune rectangular partitions based on 4x4 sub-block variance
+ // deviation
+ //
+ // For allintra encode, this speed feature reduces instruction count by 6.4%
+ // for speed=6 with coding performance change less than 0.24%. For AVIF image
+ // encode, this speed feature reduces encode time by 8.14% for speed 6 on a
+ // typical image dataset with coding performance change less than 0.16%. This
+ // speed feature is not applicable to speed >= 7.
+ bool prune_rect_part_using_4x4_var_deviation;
+
// Terminate partition search for child partition,
// when NONE and SPLIT partition rd_costs are INT64_MAX.
int early_term_after_none_split;
@@ -746,7 +781,7 @@
// logarithmic search that keeps stepping at 1/2 pixel units until
// you stop getting a gain, and then goes on to 1/4 and repeats
// the same process. Along the way it skips many diagonals.
- SUBPEL_SEARCH_METHODS subpel_search_method;
+ SUBPEL_SEARCH_METHOD subpel_search_method;
// Maximum number of steps in logarithmic subpel search before giving up.
int subpel_iters_per_step;
@@ -788,7 +823,16 @@
int full_pixel_search_level;
// Whether to downsample the rows in sad calculation during motion search.
- // This is only active when there are at least 16 rows.
+ // This is only active when there are at least 16 rows. When this sf is
+ // active, if there is a large discrepancy in the SAD values for the final
+ // motion vector between skipping vs not skipping, motion search is redone
+ // with skip row features off.
+ // 0: Disabled (do not downsample rows)
+ // 1: Skip SAD calculation of odd rows if the SAD deviation of the even and
+ // odd rows for the starting MV is small. Redo motion search with sf off
+ // when SAD deviation is high for the final motion vector.
+ // 2: Skip SAD calculation of odd rows. SAD deviation is not tested for the
+ // start MV and tested only for the final MV.
int use_downsampled_sad;
// Enable/disable extensive joint motion search.
@@ -801,7 +845,17 @@
int disable_second_mv;
// Skips full pixel search based on start mv of prior ref_mv_idx.
+ // 0: Disabled
+  // 1: Skips the full pixel search up to 4 neighbor full-pel MV positions.
+  // 2: Skips the full pixel search up to 8 neighbor full-pel MV positions.
int skip_fullpel_search_using_startmv;
+
+ // Method to use for refining WARPED_CAUSAL motion vectors
+ // TODO(rachelbarker): Can this be unified with OBMC in some way?
+ WARP_SEARCH_METHOD warp_search_method;
+
+ // Maximum number of iterations in WARPED_CAUSAL refinement search
+ int warp_search_iters;
} MV_SPEED_FEATURES;
typedef struct INTER_MODE_SPEED_FEATURES {
@@ -813,8 +867,11 @@
// 2: used with static rd model
int inter_mode_rd_model_estimation;
- // Bypass transform search based on skip rd
- int txfm_rd_gate_level;
+ // Bypass transform search based on skip rd at following stages
+ // i. Compound type mode search
+ // ii. Motion mode search (mode evaluation and winner motion mode stage)
+ // iii. Transform search for best inter candidates
+ int txfm_rd_gate_level[TX_SEARCH_CASES];
// Limit the inter mode tested in the RD loop
int reduce_inter_modes;
@@ -927,14 +984,9 @@
int prune_comp_using_best_single_mode_ref;
// Skip NEARESTMV and NEARMV using weight computed in ref mv list population
- // This speed feature sometimes leads to severe visual artifacts for
- // the overlay frame. It makes inter RD mode search skip NEARESTMV
- // and NEARMV, and no valid inter mode is evaluated when the NEWMV mode
- // is also early terminated due to the constraint that it does not handle
- // zero mv difference. In this cases, intra modes will be chosen, leading
- // to bad prediction and flickering artifacts.
- // Turn off this feature for now. Be careful to check visual quality if
- // anyone is going to turn it on.
+ //
+ // Pruning is enabled only when both the top and left neighbor blocks are
+ // available and when the current block already has a valid inter prediction.
int prune_nearest_near_mv_using_refmv_weight;
// Based on previous ref_mv_idx search result, prune the following search.
@@ -999,7 +1051,20 @@
// Enable/disable masked compound.
int disable_masked_comp;
- // Enable/disable the fast compound mode search.
+  // Enable/disable MV refinement for the compound modes corresponding to the
+  // compound types COMPOUND_AVERAGE, COMPOUND_DISTWTD (currently, this
+  // compound type is disabled for speeds >= 2 using the sf
+  // 'use_dist_wtd_comp_flag') and COMPOUND_DIFFWTD, based on availability.
+  // Levels 0 to 3 indicate increasing aggressiveness in disabling MV
+  // refinement.
+  // 0: MV refinement is enabled, and the NEW_NEWMV mode uses two iterations
+  // of refinement in av1_joint_motion_search().
+  // 1: MV refinement is disabled for COMPOUND_DIFFWTD and enabled for
+  // COMPOUND_AVERAGE & COMPOUND_DISTWTD.
+  // 2: MV refinement is enabled for COMPOUND_AVERAGE & COMPOUND_DISTWTD for
+  // the NEW_NEWMV mode, with one iteration of refinement in
+  // av1_joint_motion_search(); MV refinement is disabled for the other
+  // compound type modes.
+  // 3: MV refinement is disabled.
int enable_fast_compound_mode_search;
// Reuse masked compound type search results
@@ -1236,6 +1301,21 @@
// encode time by 4.65%, 9.16% and 10.45% for speed 6, 7 and 8 on a typical
// image dataset with coding performance change less than 0.19%.
bool prune_intra_tx_depths_using_nn;
+
+ // Enable/disable early breakout during transform search of intra modes, by
+ // using the minimum rd cost possible. By using this approach, the rd
+ // evaluation of applicable transform blocks (in the current block) can be
+ // avoided as
+ // 1) best_rd evolves during the search in choose_tx_size_type_from_rd()
+ // 2) appropriate ref_best_rd is passed in intra_block_yrd()
+ //
+ // For allintra encode, this speed feature reduces instruction count
+ // by 1.11%, 1.08%, 1.02% and 0.93% for speed 3, 6, 7 and 8 with coding
+ // performance change less than 0.02%. For AVIF image encode, this speed
+ // feature reduces encode time by 0.93%, 1.46%, 1.07%, 0.84%, 0.99% and 0.73%
+ // for speed 3, 4, 5, 6, 7 and 8 on a typical image dataset with coding
+ // performance change less than 0.004%.
+ bool use_rd_based_breakout_for_intra_tx_search;
} TX_SPEED_FEATURES;
typedef struct RD_CALC_SPEED_FEATURES {
@@ -1377,8 +1457,14 @@
// Reduce the wiener filter win size for luma
int reduce_wiener_window_size;
- // Disable loop restoration filter
- int disable_lr_filter;
+ // Flag to disable Wiener Loop restoration filter.
+ bool disable_wiener_filter;
+
+ // Flag to disable Self-guided Loop restoration filter.
+ bool disable_sgr_filter;
+
+ // Disable the refinement search around the wiener filter coefficients.
+ bool disable_wiener_coeff_refine_search;
// Whether to downsample the rows in computation of wiener stats.
int use_downsampled_wiener_stats;
@@ -1395,7 +1481,11 @@
// Skipping aggressiveness increases from level 1 to 2.
int skip_intra_pred;
- // Perform coarse ME before calculating variance in variance-based partition
+ // Estimate motion before calculating variance in variance-based partition
+ // 0 - Only use zero MV
+ // 1 - perform coarse ME
+ // 2 - perform coarse ME, and also use neighbours' MVs
+ // 3 - use neighbours' MVs without performing coarse ME
int estimate_motion_for_var_based_partition;
// For nonrd_use_partition: mode of extra check of leaf partition
@@ -1486,8 +1576,8 @@
// Bit mask to enable or disable intra modes for each prediction block size
// separately, for nonrd_pickmode. Currently, the sf is not respected when
- // 'force_intra_check' is true in 'estimate_intra_mode()' function. Also, H
- // and V pred modes allowed through this sf can be further pruned when
+ // 'force_intra_check' is true in 'av1_estimate_intra_mode()' function. Also,
+ // H and V pred modes allowed through this sf can be further pruned when
//'prune_hv_pred_modes_using_src_sad' sf is true.
int intra_y_mode_bsize_mask_nrd[BLOCK_SIZES];
@@ -1502,9 +1592,13 @@
// Skips mode checks more aggressively in nonRD mode
int nonrd_aggressive_skip;
- // Skip cdef on 64x64 blocks when NEWMV or INTRA is not picked or color
- // sensitivity is off. When color sensitivity is on for a superblock, all
- // 64x64 blocks within will not skip.
+  // Skip cdef on 64x64 blocks.
+ // 0: disabled
+ // 1: skip when NEWMV or INTRA is not picked or color sensitivity is off.
+ // When color sensitivity is on for a superblock, all 64x64 blocks within
+ // will not skip.
+ // 2: more aggressive mode where skip is done for all frames where
+  // rc->high_source_sad = 0 (no slide changes), and color sensitivity off.
int skip_cdef_sb;
// Forces larger partition blocks in variance based partitioning for intra
@@ -1565,6 +1659,7 @@
// Temporal filtering
// The value can be 1 or 2, which indicates the threshold to use.
+ // Must be off for lossless mode.
int use_rtc_tf;
// Prune the use of the identity transform in nonrd_pickmode,
@@ -1573,8 +1668,15 @@
// already set.
int prune_idtx_nonrd;
+  // Prune the use of palette mode in nonrd pickmode.
+ int prune_palette_nonrd;
+
// Skip loopfilter, for static content after slide change
// or key frame, once quality has ramped up.
+ // 0: disabled
+ // 1: skip only after quality is ramped up.
+  // 2: aggressive mode, where skip is done for all frames where
+  // rc->high_source_sad = 0 (no slide changes).
int skip_lf_screen;
// For nonrd: early exit out of variance partition that sets the
@@ -1640,6 +1742,26 @@
// 0.08%.
bool prune_h_pred_using_best_mode_so_far;
+ // Enable pruning of intra mode evaluations in nonrd path based on source
+ // variance and best mode so far. The pruning logic is enabled only if the
+ // mode is not a winner mode of both the neighboring blocks (left/top).
+ //
+ // For allintra encode, this speed feature reduces instruction count by 3.96%
+ // for speed 9 with coding performance change less than 0.38%.
+ // For AVIF image encode, this speed feature reduces encode time by 3.46% for
+ // speed 9 on a typical image dataset with coding performance change less than
+ // -0.06%.
+ bool enable_intra_mode_pruning_using_neighbors;
+
+ // Prune intra mode evaluations in nonrd path based on best sad so far.
+ //
+ // For allintra encode, this speed feature reduces instruction count by 3.05%
+ // for speed 9 with coding performance change less than 0.24%.
+ // For AVIF image encode, this speed feature reduces encode time by 1.87% for
+ // speed 9 on a typical image dataset with coding performance change less than
+ // 0.16%.
+ bool prune_intra_mode_using_best_sad_so_far;
+
// If compound is enabled, and the current block size is \geq BLOCK_16X16,
// limit the compound modes to GLOBAL_GLOBALMV. This does not apply to the
// base layer of svc.
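
The SUBPEL_SEARCH_METHODS rename above follows the trailing-count enum idiom: the old type name becomes the final enumerator, so tables such as fractional_mv_search[] in speed_features.c are sized directly by the enum, and the new TX_SEARCH_CASE enum uses the same idiom for txfm_rd_gate_level[]. A generic sketch of the pattern (all names below are placeholders, not libaom identifiers):

typedef enum {
  METHOD_A = 0,
  METHOD_B = 1,
  METHOD_C = 2,
  NUM_METHODS  // trailing enumerator doubles as the table size (3)
} METHOD;

static int method_impl_a(void);
static int method_impl_b(void);
static int method_impl_c(void);

// Sized by the enum; adding a method before NUM_METHODS grows the table.
static int (*const method_table[NUM_METHODS])(void) = { method_impl_a,
                                                        method_impl_b,
                                                        method_impl_c };
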
diff --git a/av1/encoder/superres_scale.c b/av1/encoder/superres_scale.c
index f439e70..5e1e289 100644
--- a/av1/encoder/superres_scale.c
+++ b/av1/encoder/superres_scale.c
@@ -403,7 +403,7 @@
assert(!is_lossless_requested(&cpi->oxcf.rc_cfg));
assert(!cm->features.all_lossless);
- av1_superres_upscale(cm, NULL);
+ av1_superres_upscale(cm, NULL, cpi->image_pyramid_levels);
// If regular resizing is occurring the source will need to be downscaled to
// match the upscaled superres resolution. Otherwise the original source is
diff --git a/av1/encoder/svc_layercontext.c b/av1/encoder/svc_layercontext.c
index d31f55d..85678dc 100644
--- a/av1/encoder/svc_layercontext.c
+++ b/av1/encoder/svc_layercontext.c
@@ -8,6 +8,7 @@
* be found in the AUTHORS file in the root of the source tree.
*/
+#include <assert.h>
#include <math.h>
#include "av1/encoder/encoder.h"
@@ -84,6 +85,7 @@
bool av1_alloc_layer_context(AV1_COMP *cpi, int num_layers) {
SVC *const svc = &cpi->svc;
if (svc->layer_context == NULL || svc->num_allocated_layers < num_layers) {
+ assert(num_layers > 1);
aom_free(svc->layer_context);
svc->num_allocated_layers = 0;
svc->layer_context =
@@ -99,10 +101,13 @@
const int64_t target_bandwidth) {
const RATE_CONTROL *const rc = &cpi->rc;
const PRIMARY_RATE_CONTROL *const p_rc = &cpi->ppi->p_rc;
+ AV1_COMMON *const cm = &cpi->common;
SVC *const svc = &cpi->svc;
int layer = 0;
int64_t spatial_layer_target = 0;
float bitrate_alloc = 1.0;
+ const int mi_rows = cm->mi_params.mi_rows;
+ const int mi_cols = cm->mi_params.mi_cols;
for (int sl = 0; sl < svc->number_spatial_layers; ++sl) {
for (int tl = 0; tl < svc->number_temporal_layers; ++tl) {
layer = LAYER_IDS_TO_IDX(sl, tl, svc->number_temporal_layers);
@@ -116,7 +121,9 @@
RATE_CONTROL *const lrc = &lc->rc;
PRIMARY_RATE_CONTROL *const lp_rc = &lc->p_rc;
lc->spatial_layer_target_bandwidth = spatial_layer_target;
- bitrate_alloc = (float)lc->target_bandwidth / target_bandwidth;
+ if (target_bandwidth != 0) {
+ bitrate_alloc = (float)lc->target_bandwidth / target_bandwidth;
+ }
lp_rc->starting_buffer_level =
(int64_t)(p_rc->starting_buffer_level * bitrate_alloc);
lp_rc->optimal_buffer_level =
@@ -134,6 +141,24 @@
lrc->rtc_external_ratectrl = rc->rtc_external_ratectrl;
lrc->worst_quality = av1_quantizer_to_qindex(lc->max_q);
lrc->best_quality = av1_quantizer_to_qindex(lc->min_q);
+ if (rc->use_external_qp_one_pass) {
+ lrc->worst_quality = rc->worst_quality;
+ lrc->best_quality = rc->best_quality;
+ }
+ // Reset the cyclic refresh parameters, if needed (map is NULL),
+ // or number of spatial layers has changed.
+ // Cyclic refresh is only applied on base temporal layer.
+ if (svc->number_spatial_layers > 1 && tl == 0 &&
+ (lc->map == NULL ||
+ svc->prev_number_spatial_layers != svc->number_spatial_layers)) {
+ lc->sb_index = 0;
+ lc->actual_num_seg1_blocks = 0;
+ lc->actual_num_seg2_blocks = 0;
+ lc->counter_encode_maxq_scene_change = 0;
+ if (lc->map) aom_free(lc->map);
+ CHECK_MEM_ERROR(cm, lc->map,
+ aom_calloc(mi_rows * mi_cols, sizeof(*lc->map)));
+ }
}
}
}
@@ -178,8 +203,9 @@
static AOM_INLINE bool check_ref_is_low_spatial_res_super_frame(
int ref_frame, const SVC *svc, const RTC_REF *rtc_ref) {
int ref_frame_idx = rtc_ref->ref_idx[ref_frame - 1];
- return svc->buffer_time_index[ref_frame_idx] == svc->current_superframe &&
- svc->buffer_spatial_layer[ref_frame_idx] <= svc->spatial_layer_id - 1;
+ return rtc_ref->buffer_time_index[ref_frame_idx] == svc->current_superframe &&
+ rtc_ref->buffer_spatial_layer[ref_frame_idx] <=
+ svc->spatial_layer_id - 1;
}
void av1_restore_layer_context(AV1_COMP *const cpi) {
@@ -232,6 +258,32 @@
}
}
+void av1_svc_update_buffer_slot_refreshed(AV1_COMP *const cpi) {
+ SVC *const svc = &cpi->svc;
+ RTC_REF *const rtc_ref = &cpi->ppi->rtc_ref;
+ const unsigned int current_frame =
+ cpi->ppi->use_svc ? svc->current_superframe
+ : cpi->common.current_frame.frame_number;
+ // For any buffer slot that is refreshed, update it with
+ // the spatial_layer_id and the current_superframe.
+ if (cpi->common.current_frame.frame_type == KEY_FRAME) {
+ // All slots are refreshed on KEY.
+ for (unsigned int i = 0; i < REF_FRAMES; i++) {
+ rtc_ref->buffer_time_index[i] = current_frame;
+ rtc_ref->buffer_spatial_layer[i] = svc->spatial_layer_id;
+ }
+ } else if (rtc_ref->set_ref_frame_config) {
+ for (unsigned int i = 0; i < INTER_REFS_PER_FRAME; i++) {
+ const int ref_frame_map_idx = rtc_ref->ref_idx[i];
+ if (cpi->ppi->rtc_ref.refresh[ref_frame_map_idx]) {
+ rtc_ref->buffer_time_index[ref_frame_map_idx] = current_frame;
+ rtc_ref->buffer_spatial_layer[ref_frame_map_idx] =
+ svc->spatial_layer_id;
+ }
+ }
+ }
+}
+
void av1_save_layer_context(AV1_COMP *const cpi) {
SVC *const svc = &cpi->svc;
const AV1_COMMON *const cm = &cpi->common;
@@ -255,23 +307,7 @@
lc->actual_num_seg2_blocks = cr->actual_num_seg2_blocks;
lc->counter_encode_maxq_scene_change = cr->counter_encode_maxq_scene_change;
}
- // For any buffer slot that is refreshed, update it with
- // the spatial_layer_id and the current_superframe.
- if (cpi->common.current_frame.frame_type == KEY_FRAME) {
- // All slots are refreshed on KEY.
- for (unsigned int i = 0; i < REF_FRAMES; i++) {
- svc->buffer_time_index[i] = svc->current_superframe;
- svc->buffer_spatial_layer[i] = svc->spatial_layer_id;
- }
- } else if (cpi->ppi->rtc_ref.set_ref_frame_config) {
- for (unsigned int i = 0; i < INTER_REFS_PER_FRAME; i++) {
- int ref_frame_map_idx = cpi->ppi->rtc_ref.ref_idx[i];
- if (cpi->ppi->rtc_ref.refresh[ref_frame_map_idx]) {
- svc->buffer_time_index[ref_frame_map_idx] = svc->current_superframe;
- svc->buffer_spatial_layer[ref_frame_map_idx] = svc->spatial_layer_id;
- }
- }
- }
+ av1_svc_update_buffer_slot_refreshed(cpi);
for (unsigned int i = 0; i < REF_FRAMES; i++) {
if (frame_is_intra_only(cm) ||
cm->current_frame.refresh_frame_flags & (1 << i)) {
@@ -524,12 +560,24 @@
SVC *const svc = &cpi->svc;
for (int sl = 0; sl < svc->number_spatial_layers; ++sl) {
// Check for reset based on avg_frame_bandwidth for spatial layer sl.
+ // If avg_frame_bandwidth for top temporal layer is not set
+ // (because enhancement layer was inactive), use the base TL0
int layer = LAYER_IDS_TO_IDX(sl, svc->number_temporal_layers - 1,
svc->number_temporal_layers);
LAYER_CONTEXT *lc = &svc->layer_context[layer];
RATE_CONTROL *lrc = &lc->rc;
- if (lrc->avg_frame_bandwidth > (3 * lrc->prev_avg_frame_bandwidth >> 1) ||
- lrc->avg_frame_bandwidth < (lrc->prev_avg_frame_bandwidth >> 1)) {
+ int avg_frame_bandwidth = lrc->avg_frame_bandwidth;
+ int prev_avg_frame_bandwidth = lrc->prev_avg_frame_bandwidth;
+ if (avg_frame_bandwidth == 0 || prev_avg_frame_bandwidth == 0) {
+ // Use base TL0.
+ layer = LAYER_IDS_TO_IDX(sl, 0, svc->number_temporal_layers);
+ lc = &svc->layer_context[layer];
+ lrc = &lc->rc;
+ avg_frame_bandwidth = lrc->avg_frame_bandwidth;
+ prev_avg_frame_bandwidth = lrc->prev_avg_frame_bandwidth;
+ }
+ if (avg_frame_bandwidth > (3 * prev_avg_frame_bandwidth >> 1) ||
+ avg_frame_bandwidth < (prev_avg_frame_bandwidth >> 1)) {
// Reset for all temporal layers with spatial layer sl.
for (int tl = 0; tl < svc->number_temporal_layers; ++tl) {
int layer2 = LAYER_IDS_TO_IDX(sl, tl, svc->number_temporal_layers);
@@ -548,25 +596,76 @@
void av1_svc_set_last_source(AV1_COMP *const cpi, EncodeFrameInput *frame_input,
YV12_BUFFER_CONFIG *prev_source) {
- if (cpi->svc.spatial_layer_id == 0) {
- // For base spatial layer: if the LAST reference (index 0) is not
- // the previous (super)frame set the last_source to the source corresponding
- // to the last TL0, otherwise keep it at prev_source.
- frame_input->last_source = prev_source != NULL ? prev_source : NULL;
- if (cpi->svc.current_superframe > 0) {
- const int buffslot_last = cpi->ppi->rtc_ref.ref_idx[0];
- if (cpi->svc.buffer_time_index[buffslot_last] <
- cpi->svc.current_superframe - 1)
+ frame_input->last_source = prev_source != NULL ? prev_source : NULL;
+ if (!cpi->ppi->use_svc && cpi->rc.prev_frame_is_dropped &&
+ cpi->rc.frame_number_encoded > 0) {
+ frame_input->last_source = &cpi->svc.source_last_TL0;
+ } else {
+ RTC_REF *const rtc_ref = &cpi->ppi->rtc_ref;
+ if (cpi->svc.spatial_layer_id == 0) {
+ // For base spatial layer: if the LAST reference (index 0) is not
+ // the previous (super)frame set the last_source to the source
+ // corresponding to the last TL0, otherwise keep it at prev_source.
+ // Always use source_last_TL0 if previous base TL0 was dropped.
+ if (cpi->svc.current_superframe > 0) {
+ const int buffslot_last = rtc_ref->ref_idx[0];
+ // Check if previous frame was dropped on base TL0 layer.
+ const int layer =
+ LAYER_IDS_TO_IDX(0, 0, cpi->svc.number_temporal_layers);
+ LAYER_CONTEXT *lc = &cpi->svc.layer_context[layer];
+ RATE_CONTROL *lrc = &lc->rc;
+ if (lrc->prev_frame_is_dropped ||
+ rtc_ref->buffer_time_index[buffslot_last] <
+ cpi->svc.current_superframe - 1) {
+ frame_input->last_source = &cpi->svc.source_last_TL0;
+ }
+ }
+ } else if (cpi->svc.spatial_layer_id > 0) {
+ // For spatial enhancement layers: the previous source (prev_source)
+ // corresponds to the lower spatial layer (which is the same source so
+ // we can't use that), so always set the last_source to the source of the
+ // last TL0.
+ if (cpi->svc.current_superframe > 0)
frame_input->last_source = &cpi->svc.source_last_TL0;
+ else
+ frame_input->last_source = NULL;
}
- } else if (cpi->svc.spatial_layer_id > 0) {
- // For spatial enhancement layers: the previous source (prev_source)
- // corresponds to the lower spatial layer (which is the same source so
- // we can't use that), so always set the last_source to the source of the
- // last TL0.
- if (cpi->svc.current_superframe > 0)
- frame_input->last_source = &cpi->svc.source_last_TL0;
- else
- frame_input->last_source = NULL;
+ }
+}
+
+int av1_svc_get_min_ref_dist(const AV1_COMP *cpi) {
+ RTC_REF *const rtc_ref = &cpi->ppi->rtc_ref;
+ int min_dist = INT_MAX;
+ const unsigned int current_frame_num =
+ cpi->ppi->use_svc ? cpi->svc.current_superframe
+ : cpi->common.current_frame.frame_number;
+ for (unsigned int i = 0; i < INTER_REFS_PER_FRAME; i++) {
+ if (cpi->ppi->rtc_ref.reference[i]) {
+ const int ref_frame_map_idx = rtc_ref->ref_idx[i];
+ const int dist =
+ current_frame_num - rtc_ref->buffer_time_index[ref_frame_map_idx];
+ if (dist < min_dist) min_dist = dist;
+ }
+ }
+ return min_dist;
+}
+
+void av1_svc_set_reference_was_previous(AV1_COMP *cpi) {
+ RTC_REF *const rtc_ref = &cpi->ppi->rtc_ref;
+ // Check if the encoded frame had some reference that was the
+ // previous frame.
+ const unsigned int current_frame =
+ cpi->ppi->use_svc ? cpi->svc.current_superframe
+ : cpi->common.current_frame.frame_number;
+ rtc_ref->reference_was_previous_frame = true;
+ if (current_frame > 0) {
+ rtc_ref->reference_was_previous_frame = false;
+ for (unsigned int i = 0; i < INTER_REFS_PER_FRAME; i++) {
+ if (rtc_ref->reference[i]) {
+ const int ref_frame_map_idx = rtc_ref->ref_idx[i];
+ if (rtc_ref->buffer_time_index[ref_frame_map_idx] == current_frame - 1)
+ rtc_ref->reference_was_previous_frame = true;
+ }
+ }
}
}
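
The new av1_svc_get_min_ref_dist() reduces to a scan over the enabled references' buffer timestamps. A standalone sketch with illustrative names, mirroring the logic above:

#include <limits.h>

static int min_ref_dist(unsigned int current_frame,
                        const unsigned int *buffer_time_index,
                        const int *enabled, int num_refs) {
  int min_dist = INT_MAX;
  for (int i = 0; i < num_refs; i++) {
    if (enabled[i]) {
      const int dist = (int)(current_frame - buffer_time_index[i]);
      if (dist < min_dist) min_dist = dist;
    }
  }
  // e.g. frame 10 vs. slots {9, 6, 2} -> distances {1, 4, 8} -> returns 1;
  // a result of 1 is exactly the condition that
  // av1_svc_set_reference_was_previous() records as a boolean.
  return min_dist;
}
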
diff --git a/av1/encoder/svc_layercontext.h b/av1/encoder/svc_layercontext.h
index 5e983f6..3a6e0fc 100644
--- a/av1/encoder/svc_layercontext.h
+++ b/av1/encoder/svc_layercontext.h
@@ -29,7 +29,7 @@
RATE_CONTROL rc;
PRIMARY_RATE_CONTROL p_rc;
int framerate_factor;
- int64_t layer_target_bitrate;
+ int64_t layer_target_bitrate; // In bits per second.
int scaling_factor_num;
int scaling_factor_den;
int64_t target_bandwidth;
@@ -91,6 +91,7 @@
int temporal_layer_id;
int number_spatial_layers;
int number_temporal_layers;
+ int prev_number_spatial_layers;
int use_flexible_mode;
int ksvc_fixed_mode;
/*!\endcond */
@@ -98,8 +99,6 @@
/*!\cond */
double base_framerate;
unsigned int current_superframe;
- unsigned int buffer_time_index[REF_FRAMES];
- unsigned char buffer_spatial_layer[REF_FRAMES];
int skip_mvsearch_last;
int skip_mvsearch_gf;
int skip_mvsearch_altref;
@@ -114,11 +113,14 @@
/*!
* Layer context used for rate control in CBR mode.
+   * This is an array; the entry for spatial layer `sl` and temporal layer
+   * `tl` is at index sl * number_temporal_layers + tl.
*/
LAYER_CONTEXT *layer_context;
/*!
- * Number of layers allocated for layer_context.
+ * Number of layers allocated for layer_context. If nonzero, must be greater
+ * than or equal to number_spatial_layers * number_temporal_layers.
*/
int num_allocated_layers;
@@ -286,6 +288,11 @@
struct EncodeFrameInput *frame_input,
YV12_BUFFER_CONFIG *prev_source);
+void av1_svc_update_buffer_slot_refreshed(struct AV1_COMP *const cpi);
+
+int av1_svc_get_min_ref_dist(const struct AV1_COMP *cpi);
+
+void av1_svc_set_reference_was_previous(struct AV1_COMP *cpi);
#ifdef __cplusplus
} // extern "C"
#endif
diff --git a/av1/encoder/temporal_filter.c b/av1/encoder/temporal_filter.c
index ad1cc64..91a0c78 100644
--- a/av1/encoder/temporal_filter.c
+++ b/av1/encoder/temporal_filter.c
@@ -16,6 +16,7 @@
#include "config/aom_scale_rtcd.h"
#include "aom_dsp/aom_dsp_common.h"
+#include "aom_dsp/mathutils.h"
#include "aom_dsp/odintrin.h"
#include "aom_mem/aom_mem.h"
#include "aom_ports/aom_timer.h"
@@ -145,7 +146,7 @@
const int q = av1_get_q(cpi);
av1_make_default_fullpel_ms_params(&full_ms_params, cpi, mb, block_size,
- &baseline_mv, search_site_cfg,
+ &baseline_mv, start_mv, search_site_cfg,
/*fine_search_interval=*/0);
av1_set_mv_search_method(&full_ms_params, search_site_cfg, search_method);
full_ms_params.run_mesh_search = 1;
@@ -204,7 +205,7 @@
mbd->plane[0].pre[0].buf = ref_frame->y_buffer + y_offset + offset;
av1_make_default_fullpel_ms_params(&full_ms_params, cpi, mb,
subblock_size, &baseline_mv,
- search_site_cfg,
+ start_mv, search_site_cfg,
/*fine_search_interval=*/0);
av1_set_mv_search_method(&full_ms_params, search_site_cfg,
search_method);
@@ -549,6 +550,8 @@
* defined in libaom, converted from `qindex`
* \param[in] filter_strength Filtering strength. This value lies in range
* [0, 6] where 6 is the maximum strength.
+ * \param[in]   tf_wgt_calc_lvl Controls the weight calculation method used
+ * during temporal filtering.
* \param[out] pred Pointer to the well-built predictors
* \param[out] accum Pointer to the pixel-wise accumulator for
* filtering
@@ -563,7 +566,8 @@
const BLOCK_SIZE block_size, const int mb_row, const int mb_col,
const int num_planes, const double *noise_levels, const MV *subblock_mvs,
const int *subblock_mses, const int q_factor, const int filter_strength,
- const uint8_t *pred, uint32_t *accum, uint16_t *count) {
+ int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum,
+ uint16_t *count) {
// Block information.
const int mb_height = block_size_high[block_size];
const int mb_width = block_size_wide[block_size];
@@ -693,7 +697,14 @@
double scaled_error =
combined_error * d_factor[subblock_idx] * decay_factor[plane];
scaled_error = AOMMIN(scaled_error, 7);
- const int weight = (int)(exp(-scaled_error) * TF_WEIGHT_SCALE);
+ int weight;
+ if (tf_wgt_calc_lvl == 0) {
+ weight = (int)(exp(-scaled_error) * TF_WEIGHT_SCALE);
+ } else {
+ const float fweight =
+ approx_exp((float)-scaled_error) * TF_WEIGHT_SCALE;
+ weight = iroundpf(fweight);
+ }
const int idx = plane_offset + pred_idx; // Index with plane shift.
const int pred_value = is_high_bitdepth ? pred16[idx] : pred[idx];
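
The new tf_wgt_calc_lvl != 0 path swaps the libm exp() call for approx_exp()
from aom_dsp/mathutils.h (included above). A self-contained sketch of the
general technique; the (1 + x/64)^64 form below is illustrative, not the
library's actual approximation:

    /* Illustrative fast exponential for x in [-7, 0], the range enforced
     * by the AOMMIN(scaled_error, 7) clamp above: exp(x) ~ (1 + x/64)^64,
     * evaluated with six squarings. Relative error grows toward x = -7,
     * but the absolute error stays tiny since exp(-7) is about 1e-3. */
    static float fast_exp_neg(float x) {
      float y = 1.0f + x * (1.0f / 64.0f);
      y *= y; y *= y; y *= y;  /* y^8 */
      y *= y; y *= y; y *= y;  /* y^64 */
      return y;
    }
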
@@ -716,11 +727,12 @@
const BLOCK_SIZE block_size, const int mb_row, const int mb_col,
const int num_planes, const double *noise_levels, const MV *subblock_mvs,
const int *subblock_mses, const int q_factor, const int filter_strength,
- const uint8_t *pred, uint32_t *accum, uint16_t *count) {
+ int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum,
+ uint16_t *count) {
av1_apply_temporal_filter_c(frame_to_filter, mbd, block_size, mb_row, mb_col,
num_planes, noise_levels, subblock_mvs,
- subblock_mses, q_factor, filter_strength, pred,
- accum, count);
+ subblock_mses, q_factor, filter_strength,
+ tf_wgt_calc_lvl, pred, accum, count);
}
#endif // CONFIG_AV1_HIGHBITDEPTH
/*!\brief Normalizes the accumulated filtering result to produce the filtered
@@ -809,6 +821,7 @@
const int mi_h = mi_size_high_log2[block_size];
const int mi_w = mi_size_wide_log2[block_size];
const int num_planes = av1_num_planes(&cpi->common);
+ const int weight_calc_level_in_tf = cpi->sf.hl_sf.weight_calc_level_in_tf;
uint32_t *accum = tf_data->accum;
uint16_t *count = tf_data->count;
uint8_t *pred = tf_data->pred;
@@ -865,27 +878,27 @@
av1_highbd_apply_temporal_filter(
frame_to_filter, mbd, block_size, mb_row, mb_col, num_planes,
noise_levels, subblock_mvs, subblock_mses, q_factor,
- filter_strength, pred, accum, count);
+ filter_strength, weight_calc_level_in_tf, pred, accum, count);
} else {
#endif // CONFIG_AV1_HIGHBITDEPTH
av1_apply_temporal_filter_c(
frame_to_filter, mbd, block_size, mb_row, mb_col, num_planes,
noise_levels, subblock_mvs, subblock_mses, q_factor,
- filter_strength, pred, accum, count);
+ filter_strength, weight_calc_level_in_tf, pred, accum, count);
#if CONFIG_AV1_HIGHBITDEPTH
}
#endif // CONFIG_AV1_HIGHBITDEPTH
} else { // for 8-bit
if (TF_BLOCK_SIZE == BLOCK_32X32 && TF_WINDOW_LENGTH == 5) {
- av1_apply_temporal_filter(frame_to_filter, mbd, block_size, mb_row,
- mb_col, num_planes, noise_levels,
- subblock_mvs, subblock_mses, q_factor,
- filter_strength, pred, accum, count);
+ av1_apply_temporal_filter(
+ frame_to_filter, mbd, block_size, mb_row, mb_col, num_planes,
+ noise_levels, subblock_mvs, subblock_mses, q_factor,
+ filter_strength, weight_calc_level_in_tf, pred, accum, count);
} else {
av1_apply_temporal_filter_c(
frame_to_filter, mbd, block_size, mb_row, mb_col, num_planes,
noise_levels, subblock_mvs, subblock_mses, q_factor,
- filter_strength, pred, accum, count);
+ filter_strength, weight_calc_level_in_tf, pred, accum, count);
}
}
}
@@ -995,11 +1008,9 @@
const YV12_BUFFER_CONFIG *to_filter_frame = &to_filter_buf->img;
const int num_planes = av1_num_planes(&cpi->common);
double *noise_levels = tf_ctx->noise_levels;
- for (int plane = 0; plane < num_planes; ++plane) {
- noise_levels[plane] = av1_estimate_noise_from_single_plane(
- to_filter_frame, plane, cpi->common.seq_params->bit_depth,
- NOISE_ESTIMATION_EDGE_THRESHOLD);
- }
+ av1_estimate_noise_level(to_filter_frame, noise_levels, AOM_PLANE_Y,
+ num_planes - 1, cpi->common.seq_params->bit_depth,
+ NOISE_ESTIMATION_EDGE_THRESHOLD);
// Get quantization factor.
const int q = av1_get_q(cpi);
// Get correlation estimates from first-pass;
@@ -1040,6 +1051,18 @@
adjust_num = 0;
} else if ((update_type == KF_UPDATE) && q <= 10) {
adjust_num = 0;
+ } else if (cpi->sf.hl_sf.adjust_num_frames_for_arf_filtering &&
+ update_type != KF_UPDATE) {
+    // Adjust the number of frames considered for filtering based on the
+    // noise level of the current frame. For a low-noise frame, use more
+    // frames so that the filtered frame provides better predictions for
+    // subsequent frames, and vice versa.
+ if (noise_levels[AOM_PLANE_Y] < 0.5)
+ adjust_num = 4;
+ else if (noise_levels[AOM_PLANE_Y] < 1.0)
+ adjust_num = 2;
+ else
+ adjust_num = 0;
}
num_frames = AOMMIN(num_frames + adjust_num, lookahead_depth);
@@ -1055,10 +1078,6 @@
num_frames = AOMMIN(num_frames, gfu_boost / 150);
num_frames += !(num_frames & 1); // Make the number odd.
- // Limit the number of frames if noise levels are low and high quantizers.
- if (noise_levels[AOM_PLANE_Y] < 1.9 && cpi->ppi->p_rc.arf_q > 40)
- num_frames = AOMMIN(num_frames, cpi->sf.hl_sf.num_frames_used_in_tf);
-
// Only use 2 neighbours for the second ARF.
if (update_type == INTNL_ARF_UPDATE) num_frames = AOMMIN(num_frames, 3);
if (AOMMIN(max_after, max_before) >= num_frames / 2) {
@@ -1108,21 +1127,50 @@
/*!\cond */
-// A constant number, sqrt(pi / 2), used for noise estimation.
-static const double SQRT_PI_BY_2 = 1.25331413732;
+double av1_estimate_noise_from_single_plane_c(const uint8_t *src, int height,
+ int width, int stride,
+ int edge_thresh) {
+ int64_t accum = 0;
+ int count = 0;
-double av1_estimate_noise_from_single_plane(const YV12_BUFFER_CONFIG *frame,
- const int plane,
- const int bit_depth,
- const int edge_thresh) {
- const int is_y_plane = (plane == 0);
- const int height = frame->crop_heights[is_y_plane ? 0 : 1];
- const int width = frame->crop_widths[is_y_plane ? 0 : 1];
- const int stride = frame->strides[is_y_plane ? 0 : 1];
- const uint8_t *src = frame->buffers[plane];
- const uint16_t *src16 = CONVERT_TO_SHORTPTR(src);
- const int is_high_bitdepth = is_frame_high_bitdepth(frame);
+ for (int i = 1; i < height - 1; ++i) {
+ for (int j = 1; j < width - 1; ++j) {
+      // Set up a small 3x3 matrix.
+ const int center_idx = i * stride + j;
+ int mat[3][3];
+ for (int ii = -1; ii <= 1; ++ii) {
+ for (int jj = -1; jj <= 1; ++jj) {
+ const int idx = center_idx + ii * stride + jj;
+ mat[ii + 1][jj + 1] = src[idx];
+ }
+ }
+      // Compute Sobel gradients.
+ const int Gx = (mat[0][0] - mat[0][2]) + (mat[2][0] - mat[2][2]) +
+ 2 * (mat[1][0] - mat[1][2]);
+ const int Gy = (mat[0][0] - mat[2][0]) + (mat[0][2] - mat[2][2]) +
+ 2 * (mat[0][1] - mat[2][1]);
+ const int Ga = ROUND_POWER_OF_TWO(abs(Gx) + abs(Gy), 0);
+ // Accumulate Laplacian.
+ if (Ga < edge_thresh) { // Only count smooth pixels.
+ const int v = 4 * mat[1][1] -
+ 2 * (mat[0][1] + mat[2][1] + mat[1][0] + mat[1][2]) +
+ (mat[0][0] + mat[0][2] + mat[2][0] + mat[2][2]);
+ accum += ROUND_POWER_OF_TWO(abs(v), 0);
+ ++count;
+ }
+ }
+ }
+ // Return -1.0 (unreliable estimation) if there are too few smooth pixels.
+ return (count < 16) ? -1.0 : (double)accum / (6 * count) * SQRT_PI_BY_2;
+}
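
For reference, the closing line is the classic Laplacian-operator noise
estimate: the 3x3 mask computed above has an L2 norm of 6, so its response
to i.i.d. Gaussian noise of standard deviation sigma is zero-mean with
standard deviation 6*sigma, and E|X| = sigma_X * sqrt(2/pi) for a zero-mean
Gaussian gives

    \hat{\sigma} = \sqrt{\frac{\pi}{2}} \cdot \frac{1}{6N}
                   \sum_{p\ \mathrm{smooth}} \bigl| (I * L)(p) \bigr|,
    \qquad L = \begin{pmatrix} 1 & -2 & 1 \\ -2 & 4 & -2 \\
                               1 & -2 & 1 \end{pmatrix}

which is exactly accum / (6 * count) * SQRT_PI_BY_2 over the N smooth pixels.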
+
+#if CONFIG_AV1_HIGHBITDEPTH
+double av1_highbd_estimate_noise_from_single_plane(const uint16_t *src16,
+ int height, int width,
+ const int stride,
+ int bit_depth,
+ int edge_thresh) {
int64_t accum = 0;
int count = 0;
for (int i = 1; i < height - 1; ++i) {
@@ -1133,7 +1181,7 @@
for (int ii = -1; ii <= 1; ++ii) {
for (int jj = -1; jj <= 1; ++jj) {
const int idx = center_idx + ii * stride + jj;
- mat[ii + 1][jj + 1] = is_high_bitdepth ? src16[idx] : src[idx];
+ mat[ii + 1][jj + 1] = src16[idx];
}
}
// Compute sobel gradients.
@@ -1156,6 +1204,35 @@
// Return -1.0 (unreliable estimation) if there are too few smooth pixels.
return (count < 16) ? -1.0 : (double)accum / (6 * count) * SQRT_PI_BY_2;
}
+#endif
+
+void av1_estimate_noise_level(const YV12_BUFFER_CONFIG *frame,
+ double *noise_level, int plane_from, int plane_to,
+ int bit_depth, int edge_thresh) {
+ for (int plane = plane_from; plane <= plane_to; plane++) {
+ const bool is_uv_plane = (plane != AOM_PLANE_Y);
+ const int height = frame->crop_heights[is_uv_plane];
+ const int width = frame->crop_widths[is_uv_plane];
+ const int stride = frame->strides[is_uv_plane];
+ const uint8_t *src = frame->buffers[plane];
+
+#if CONFIG_AV1_HIGHBITDEPTH
+ const uint16_t *src16 = CONVERT_TO_SHORTPTR(src);
+ const int is_high_bitdepth = is_frame_high_bitdepth(frame);
+ if (is_high_bitdepth) {
+ noise_level[plane] = av1_highbd_estimate_noise_from_single_plane(
+ src16, height, width, stride, bit_depth, edge_thresh);
+ } else {
+ noise_level[plane] = av1_estimate_noise_from_single_plane(
+ src, height, width, stride, edge_thresh);
+ }
+#else
+ (void)bit_depth;
+ noise_level[plane] = av1_estimate_noise_from_single_plane(
+ src, height, width, stride, edge_thresh);
+#endif
+ }
+}
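
A minimal usage sketch of the new multi-plane wrapper; the wrapper function
and its argument values are illustrative, while the constants and types are
libaom's:

    /* Sketch: per-plane noise estimation with the new wrapper. */
    static void estimate_noise_example(const YV12_BUFFER_CONFIG *frame,
                                       int num_planes, int bit_depth) {
      double noise_levels[MAX_MB_PLANE] = { 0 };
      // Luma only:
      av1_estimate_noise_level(frame, noise_levels, AOM_PLANE_Y,
                               AOM_PLANE_Y, bit_depth,
                               NOISE_ESTIMATION_EDGE_THRESHOLD);
      // All planes, as the temporal-filter setup above does:
      av1_estimate_noise_level(frame, noise_levels, AOM_PLANE_Y,
                               num_planes - 1, bit_depth,
                               NOISE_ESTIMATION_EDGE_THRESHOLD);
    }
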
// Initializes the members of TemporalFilterCtx
// Inputs:
@@ -1293,7 +1370,7 @@
seq_params->subsampling_x, seq_params->subsampling_y,
seq_params->use_highbitdepth, cpi->oxcf.border_in_pixels,
cm->features.byte_alignment, NULL, NULL, NULL,
- cpi->oxcf.tool_cfg.enable_global_motion, 0);
+ cpi->image_pyramid_levels, 0);
if (ret) {
aom_internal_error(cm->error, AOM_CODEC_MEM_ERROR,
"Failed to allocate tf_info");
diff --git a/av1/encoder/temporal_filter.h b/av1/encoder/temporal_filter.h
index 725bd86..8aa4731 100644
--- a/av1/encoder/temporal_filter.h
+++ b/av1/encoder/temporal_filter.h
@@ -33,6 +33,9 @@
// Window size for temporal filtering.
#define TF_WINDOW_LENGTH 5
+// A constant number, sqrt(pi / 2), used for noise estimation.
+static const double SQRT_PI_BY_2 = 1.25331413732;
+
// Hyper-parameters used to compute filtering weight. These hyper-parameters can
// be tuned for a better performance.
// 0. A scale factor used in temporal filtering to raise the filter weight from
@@ -268,15 +271,15 @@
// Signal Processing, 2008, St Julians, Malta.
// Inputs:
// frame: Pointer to the frame to estimate noise level from.
-// plane: Index of the plane used for noise estimation. Commonly, 0 for
-// Y-plane, 1 for U-plane, and 2 for V-plane.
+// noise_level: Pointer to the array that receives the estimated noise level
+//              of each plane.
+// plane_from: Index of the first plane used for noise estimation.
+//             Commonly, 0 for Y-plane, 1 for U-plane, and 2 for V-plane.
+// plane_to: Index of the last plane used for noise estimation.
// bit_depth: Actual bit-depth instead of the encoding bit-depth of the frame.
-// Returns:
-// The estimated noise, or -1.0 if there are too few smooth pixels.
-double av1_estimate_noise_from_single_plane(const YV12_BUFFER_CONFIG *frame,
- const int plane,
- const int bit_depth,
- const int edge_thresh);
+// edge_thresh: Edge threshold; pixels whose gradient magnitude reaches it
+//              are treated as edges and excluded from the estimate.
+void av1_estimate_noise_level(const YV12_BUFFER_CONFIG *frame,
+ double *noise_level, int plane_from, int plane_to,
+ int bit_depth, int edge_thresh);
/*!\endcond */
/*!\brief Does temporal filter for a given macroblock row.
diff --git a/av1/encoder/tpl_model.c b/av1/encoder/tpl_model.c
index ef59c99..3aeb511 100644
--- a/av1/encoder/tpl_model.c
+++ b/av1/encoder/tpl_model.c
@@ -9,8 +9,9 @@
* PATENTS file, you can obtain it at www.aomedia.org/license/patent.
*/
-#include <stdint.h>
+#include <assert.h>
#include <float.h>
+#include <stdint.h>
#include "av1/encoder/thirdpass.h"
#include "config/aom_config.h"
@@ -57,6 +58,7 @@
sizeof(tpl_txfm_stats->abs_coeff_mean[0]) * tpl_txfm_stats->coeff_num);
}
+#if CONFIG_BITRATE_ACCURACY
void av1_accumulate_tpl_txfm_stats(const TplTxfmStats *sub_stats,
TplTxfmStats *accumulated_stats) {
accumulated_stats->txfm_block_count += sub_stats->txfm_block_count;
@@ -93,6 +95,7 @@
const int frame_index) {
tpl_data->txfm_stats_list[frame_index] = *tpl_txfm_stats;
}
+#endif // CONFIG_BITRATE_ACCURACY
static AOM_INLINE void get_quantize_error(const MACROBLOCK *x, int plane,
const tran_low_t *coeff,
@@ -190,7 +193,7 @@
&tpl_data->tpl_rec_pool[frame], width, height,
seq_params->subsampling_x, seq_params->subsampling_y,
seq_params->use_highbitdepth, tpl_data->border_in_pixels,
- byte_alignment, alloc_y_plane_only))
+ byte_alignment, 0, alloc_y_plane_only))
aom_internal_error(&ppi->error, AOM_CODEC_MEM_ERROR,
"Failed to allocate frame buffer");
}
@@ -217,8 +220,8 @@
int rate_cost = 1;
for (int idx = 0; idx < eob; ++idx) {
- int abs_level = abs(qcoeff[scan_order->scan[idx]]);
- rate_cost += (int)(log(abs_level + 1.0) / log(2.0)) + 1 + (abs_level > 0);
+ unsigned int abs_level = abs(qcoeff[scan_order->scan[idx]]);
+ rate_cost += get_msb(abs_level + 1) + 1 + (abs_level > 0);
}
return (rate_cost << AV1_PROB_COST_SHIFT);
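
The rewritten loop replaces two log() calls per coefficient with get_msb(),
whose contract is floor(log2(v)) for v > 0; aside from floating-point
rounding at exact powers of two, (int)(log(x + 1.0) / log(2.0)) and
get_msb(x + 1) agree. A portable sketch of that contract (get_msb itself is
assumed to be libaom's bit-scan helper):

    /* floor(log2(v)) for v > 0, the behavior assumed of get_msb(). */
    static int msb(unsigned int v) {
      int n = -1;
      while (v) {
        v >>= 1;
        ++n;
      }
      return n;  /* e.g. msb(1) = 0, msb(2) = msb(3) = 1, msb(8) = 3 */
    }
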
@@ -228,7 +231,7 @@
const MACROBLOCK *x, int16_t *src_diff, int diff_stride, uint8_t *src,
int src_stride, uint8_t *dst, int dst_stride, tran_low_t *coeff,
tran_low_t *qcoeff, tran_low_t *dqcoeff, int bw, int bh, TX_SIZE tx_size,
- int *rate_cost, int64_t *recon_error, int64_t *sse) {
+ int do_recon, int *rate_cost, int64_t *recon_error, int64_t *sse) {
const MACROBLOCKD *xd = &x->e_mbd;
const BitDepthInfo bd_info = get_bit_depth_info(xd);
uint16_t eob;
@@ -241,8 +244,9 @@
*rate_cost = rate_estimator(qcoeff, eob, tx_size);
- av1_inverse_transform_block(xd, dqcoeff, 0, DCT_DCT, tx_size, dst, dst_stride,
- eob, 0);
+ if (do_recon)
+ av1_inverse_transform_block(xd, dqcoeff, 0, DCT_DCT, tx_size, dst,
+ dst_stride, eob, 0);
}
static uint32_t motion_estimation(AV1_COMP *cpi, MACROBLOCK *x,
@@ -277,14 +281,21 @@
FULLPEL_MOTION_SEARCH_PARAMS full_ms_params;
av1_make_default_fullpel_ms_params(&full_ms_params, cpi, x, bsize, ¢er_mv,
- search_site_cfg,
+ start_mv, search_site_cfg,
/*fine_search_interval=*/0);
av1_set_mv_search_method(&full_ms_params, search_site_cfg,
tpl_sf->search_method);
- av1_full_pixel_search(start_mv, &full_ms_params, step_param,
- cond_cost_list(cpi, cost_list), &best_mv->as_fullmv,
- NULL);
+ bestsme = av1_full_pixel_search(start_mv, &full_ms_params, step_param,
+ cond_cost_list(cpi, cost_list),
+ &best_mv->as_fullmv, NULL);
+
+ // When sub-pel motion search is skipped, populate sub-pel precision MV and
+ // return.
+ if (tpl_sf->subpel_force_stop == FULL_PEL) {
+ best_mv->as_mv = get_mv_from_fullmv(&best_mv->as_fullmv);
+ return bestsme;
+ }
SUBPEL_MOTION_SEARCH_PARAMS ms_params;
av1_make_default_subpel_ms_params(&ms_params, cpi, x, bsize, ¢er_mv,
@@ -337,13 +348,15 @@
tran_low_t *dqcoeff, AV1_COMMON *cm, MACROBLOCK *x,
const YV12_BUFFER_CONFIG *ref_frame_ptr[2], uint8_t *rec_buffer_pool[3],
const int rec_stride_pool[3], TX_SIZE tx_size, PREDICTION_MODE best_mode,
- int mi_row, int mi_col, int use_y_only_rate_distortion,
+ int mi_row, int mi_col, int use_y_only_rate_distortion, int do_recon,
TplTxfmStats *tpl_txfm_stats) {
const SequenceHeader *seq_params = cm->seq_params;
*rate_cost = 0;
*recon_error = 1;
*pred_error = 1;
+ (void)tpl_txfm_stats;
+
MACROBLOCKD *xd = &x->e_mbd;
int is_compound = (best_mode == NEW_NEWMV);
int num_planes = use_y_only_rate_distortion ? 1 : MAX_MB_PLANE;
@@ -423,12 +436,14 @@
src_buffer_pool[plane] + src_mb_offset, src_stride, dst_buffer,
dst_buffer_stride, coeff, qcoeff, dqcoeff, block_size_wide[bsize_plane],
block_size_high[bsize_plane], max_txsize_rect_lookup[bsize_plane],
- &this_rate, &this_recon_error, &sse);
+ do_recon, &this_rate, &this_recon_error, &sse);
+#if CONFIG_BITRATE_ACCURACY
if (plane == 0 && tpl_txfm_stats) {
// We only collect Y plane's transform coefficient
av1_record_tpl_txfm_block(tpl_txfm_stats, coeff);
}
+#endif // CONFIG_BITRATE_ACCURACY
*recon_error += this_recon_error;
*pred_error += sse;
@@ -443,6 +458,7 @@
TplDepStats *tpl_stats) {
AV1_COMMON *cm = &cpi->common;
const GF_GROUP *gf_group = &cpi->ppi->gf_group;
+ TPL_SPEED_FEATURES *tpl_sf = &cpi->sf.tpl_sf;
(void)gf_group;
@@ -471,7 +487,7 @@
mi_row * MI_SIZE * tpl_frame->rec_picture->y_stride + mi_col * MI_SIZE;
uint8_t *dst_buffer = tpl_frame->rec_picture->y_buffer + dst_mb_offset;
int dst_buffer_stride = tpl_frame->rec_picture->y_stride;
- int use_y_only_rate_distortion = cpi->sf.tpl_sf.use_y_only_rate_distortion;
+ int use_y_only_rate_distortion = tpl_sf->use_y_only_rate_distortion;
uint8_t *rec_buffer_pool[3] = {
tpl_frame->rec_picture->y_buffer,
@@ -550,7 +566,7 @@
// if cpi->sf.tpl_sf.prune_intra_modes is on, then search only DC_PRED,
// H_PRED, and V_PRED
const PREDICTION_MODE last_intra_mode =
- cpi->sf.tpl_sf.prune_intra_modes ? D45_PRED : INTRA_MODE_END;
+ tpl_sf->prune_intra_modes ? D45_PRED : INTRA_MODE_END;
const SequenceHeader *seq_params = cm->seq_params;
for (PREDICTION_MODE mode = INTRA_MODE_START; mode < last_intra_mode;
++mode) {
@@ -576,7 +592,7 @@
get_rate_distortion(&rate_cost, &recon_error, &pred_error, src_diff, coeff,
qcoeff, dqcoeff, cm, x, NULL, rec_buffer_pool,
rec_stride_pool, tx_size, best_mode, mi_row, mi_col,
- use_y_only_rate_distortion, NULL);
+ use_y_only_rate_distortion, 1 /*do_recon*/, NULL);
tpl_stats->intra_dist = recon_error << TPL_DEP_COST_SCALE_LOG2;
tpl_stats->intra_sse = pred_error << TPL_DEP_COST_SCALE_LOG2;
@@ -656,7 +672,7 @@
TplDepStats *ref_tpl_stats = &tpl_frame->tpl_stats_ptr[av1_tpl_ptr_pos(
mi_row - mi_height, mi_col, tpl_frame->stride, block_mis_log2)];
if (!is_alike_mv(ref_tpl_stats->mv[rf_idx], center_mvs, refmv_count,
- cpi->sf.tpl_sf.skip_alike_starting_mv)) {
+ tpl_sf->skip_alike_starting_mv)) {
center_mvs[refmv_count].mv.as_int = ref_tpl_stats->mv[rf_idx].as_int;
++refmv_count;
}
@@ -666,7 +682,7 @@
TplDepStats *ref_tpl_stats = &tpl_frame->tpl_stats_ptr[av1_tpl_ptr_pos(
mi_row, mi_col - mi_width, tpl_frame->stride, block_mis_log2)];
if (!is_alike_mv(ref_tpl_stats->mv[rf_idx], center_mvs, refmv_count,
- cpi->sf.tpl_sf.skip_alike_starting_mv)) {
+ tpl_sf->skip_alike_starting_mv)) {
center_mvs[refmv_count].mv.as_int = ref_tpl_stats->mv[rf_idx].as_int;
++refmv_count;
}
@@ -677,7 +693,7 @@
mi_row - mi_height, mi_col + mi_width, tpl_frame->stride,
block_mis_log2)];
if (!is_alike_mv(ref_tpl_stats->mv[rf_idx], center_mvs, refmv_count,
- cpi->sf.tpl_sf.skip_alike_starting_mv)) {
+ tpl_sf->skip_alike_starting_mv)) {
center_mvs[refmv_count].mv.as_int = ref_tpl_stats->mv[rf_idx].as_int;
++refmv_count;
}
@@ -696,13 +712,13 @@
rf_idx + LAST_FRAME);
if (tp_mv.as_int != INVALID_MV &&
!is_alike_mv(tp_mv, center_mvs + 1, refmv_count - 1,
- cpi->sf.tpl_sf.skip_alike_starting_mv)) {
+ tpl_sf->skip_alike_starting_mv)) {
center_mvs[0].mv = tp_mv;
}
}
// Prune starting mvs
- if (cpi->sf.tpl_sf.prune_starting_mv) {
+ if (tpl_sf->prune_starting_mv && refmv_count > 1) {
// Get each center mv's sad.
for (idx = 0; idx < refmv_count; ++idx) {
FULLPEL_MV mv = get_fullmv_from_mv(¢er_mvs[idx].mv.as_mv);
@@ -713,10 +729,9 @@
}
// Rank center_mv using sad.
- if (refmv_count > 1) {
- qsort(center_mvs, refmv_count, sizeof(center_mvs[0]), compare_sad);
- }
- refmv_count = AOMMIN(4 - cpi->sf.tpl_sf.prune_starting_mv, refmv_count);
+ qsort(center_mvs, refmv_count, sizeof(center_mvs[0]), compare_sad);
+
+ refmv_count = AOMMIN(4 - tpl_sf->prune_starting_mv, refmv_count);
// Further reduce number of refmv based on sad difference.
if (refmv_count > 1) {
int last_sad = center_mvs[refmv_count - 1].sad;
@@ -741,21 +756,31 @@
tpl_stats->mv[rf_idx].as_int = best_rfidx_mv.as_int;
single_mv[rf_idx] = best_rfidx_mv;
- struct buf_2d ref_buf = { NULL, ref_frame_ptr->y_buffer,
- ref_frame_ptr->y_width, ref_frame_ptr->y_height,
- ref_frame_ptr->y_stride };
- InterPredParams inter_pred_params;
- av1_init_inter_params(&inter_pred_params, bw, bh, mi_row * MI_SIZE,
- mi_col * MI_SIZE, 0, 0, xd->bd, is_cur_buf_hbd(xd), 0,
- &tpl_data->sf, &ref_buf, kernel);
- inter_pred_params.conv_params = get_conv_params(0, 0, xd->bd);
+ if (tpl_sf->subpel_force_stop != FULL_PEL) {
+ struct buf_2d ref_buf = { NULL, ref_frame_ptr->y_buffer,
+ ref_frame_ptr->y_width, ref_frame_ptr->y_height,
+ ref_frame_ptr->y_stride };
+ InterPredParams inter_pred_params;
+ av1_init_inter_params(&inter_pred_params, bw, bh, mi_row * MI_SIZE,
+ mi_col * MI_SIZE, 0, 0, xd->bd, is_cur_buf_hbd(xd),
+ 0, &tpl_data->sf, &ref_buf, kernel);
+ inter_pred_params.conv_params = get_conv_params(0, 0, xd->bd);
- av1_enc_build_one_inter_predictor(predictor, bw, &best_rfidx_mv.as_mv,
- &inter_pred_params);
+ av1_enc_build_one_inter_predictor(predictor, bw, &best_rfidx_mv.as_mv,
+ &inter_pred_params);
- inter_cost =
- tpl_get_satd_cost(bd_info, src_diff, bw, src_mb_buffer, src_stride,
- predictor, bw, coeff, bw, bh, tx_size);
+ inter_cost =
+ tpl_get_satd_cost(bd_info, src_diff, bw, src_mb_buffer, src_stride,
+ predictor, bw, coeff, bw, bh, tx_size);
+ } else {
+ const FULLPEL_MV best_fullmv = get_fullmv_from_mv(&best_rfidx_mv.as_mv);
+      // Since sub-pel motion search is not performed, use the prediction
+      // pixels directly from the reference block ref_mb.
+ inter_cost = tpl_get_satd_cost(
+ bd_info, src_diff, bw, src_mb_buffer, src_stride,
+ &ref_mb[best_fullmv.row * ref_stride + best_fullmv.col], ref_stride,
+ coeff, bw, bh, tx_size);
+ }
// Store inter cost for each ref frame
tpl_stats->pred_error[rf_idx] = AOMMAX(1, inter_cost);
@@ -782,7 +807,7 @@
int start_rf = 0;
int end_rf = 3;
- if (!cpi->sf.tpl_sf.allow_compound_pred) end_rf = 0;
+ if (!tpl_sf->allow_compound_pred) end_rf = 0;
if (cpi->third_pass_ctx &&
frame_offset < cpi->third_pass_ctx->frame_info_count &&
tpl_data->frame_idx < gf_group->size) {
@@ -802,10 +827,10 @@
break;
}
}
- if (!found || !cpi->sf.tpl_sf.allow_compound_pred) {
+ if (!found || !tpl_sf->allow_compound_pred) {
comp_ref_frames[2][0] = this_mi->ref_frame[0] - LAST_FRAME;
comp_ref_frames[2][1] = this_mi->ref_frame[1] - LAST_FRAME;
- if (!cpi->sf.tpl_sf.allow_compound_pred) {
+ if (!tpl_sf->allow_compound_pred) {
start_rf = 2;
end_rf = 3;
}
@@ -854,7 +879,8 @@
int_mv tmp_mv[2] = { single_mv[rf_idx0], single_mv[rf_idx1] };
int rate_mv;
av1_joint_motion_search(cpi, x, bsize, tmp_mv, NULL, 0, &rate_mv,
- !cpi->sf.mv_sf.disable_second_mv);
+ !cpi->sf.mv_sf.disable_second_mv,
+ NUM_JOINT_ME_REFINE_ITER);
for (int ref = 0; ref < 2; ++ref) {
struct buf_2d ref_buf = { NULL, ref_frame_ptr[ref]->y_buffer,
@@ -892,7 +918,7 @@
xd->mi[0]->ref_frame[1] = best_rf_idx1 + LAST_FRAME;
}
- if (best_inter_cost < INT32_MAX) {
+ if (best_inter_cost < INT32_MAX && is_inter_mode(best_mode)) {
xd->mi[0]->mv[0].as_int = best_mv[0].as_int;
xd->mi[0]->mv[1].as_int = best_mv[1].as_int;
const YV12_BUFFER_CONFIG *ref_frame_ptr[2] = {
@@ -907,7 +933,7 @@
get_rate_distortion(&rate_cost, &recon_error, &pred_error, src_diff, coeff,
qcoeff, dqcoeff, cm, x, ref_frame_ptr, rec_buffer_pool,
rec_stride_pool, tx_size, best_mode, mi_row, mi_col,
- use_y_only_rate_distortion, NULL);
+ use_y_only_rate_distortion, 0 /*do_recon*/, NULL);
tpl_stats->srcrf_rate = rate_cost;
}
@@ -935,7 +961,8 @@
get_rate_distortion(&rate_cost, &recon_error, &pred_error, src_diff, coeff,
qcoeff, dqcoeff, cm, x, ref_frame_ptr, rec_buffer_pool,
rec_stride_pool, tx_size, best_mode, mi_row, mi_col,
- use_y_only_rate_distortion, tpl_txfm_stats);
+ use_y_only_rate_distortion, 1 /*do_recon*/,
+ tpl_txfm_stats);
tpl_stats->recrf_dist = recon_error << TPL_DEP_COST_SCALE_LOG2;
tpl_stats->recrf_sse = pred_error << TPL_DEP_COST_SCALE_LOG2;
@@ -957,7 +984,7 @@
get_rate_distortion(&rate_cost, &recon_error, &pred_error, src_diff, coeff,
qcoeff, dqcoeff, cm, x, ref_frame_ptr, rec_buffer_pool,
rec_stride_pool, tx_size, best_mode, mi_row, mi_col,
- use_y_only_rate_distortion, NULL);
+ use_y_only_rate_distortion, 1 /*do_recon*/, NULL);
tpl_stats->cmp_recrf_dist[0] = recon_error << TPL_DEP_COST_SCALE_LOG2;
tpl_stats->cmp_recrf_rate[0] = rate_cost;
@@ -978,7 +1005,7 @@
get_rate_distortion(&rate_cost, &recon_error, &pred_error, src_diff, coeff,
qcoeff, dqcoeff, cm, x, ref_frame_ptr, rec_buffer_pool,
rec_stride_pool, tx_size, best_mode, mi_row, mi_col,
- use_y_only_rate_distortion, NULL);
+ use_y_only_rate_distortion, 1 /*do_recon*/, NULL);
tpl_stats->cmp_recrf_dist[1] = recon_error << TPL_DEP_COST_SCALE_LOG2;
tpl_stats->cmp_recrf_rate[1] = rate_cost;
@@ -1315,6 +1342,10 @@
// Initialize x->mbmi_ext when compound predictions are enabled.
if (cpi->sf.tpl_sf.allow_compound_pred) av1_zero(x->mbmi_ext);
+
+  // Set xd->mi to null since mbmi is allocated only within this function.
+ assert(xd->mi == &mbmi_ptr);
+ xd->mi = NULL;
}
// This function stores the motion estimation dependencies of all the blocks in
@@ -1756,8 +1787,10 @@
} else {
mc_flow_dispenser(cpi);
}
+#if CONFIG_BITRATE_ACCURACY
av1_tpl_txfm_stats_update_abs_coeff_mean(&cpi->td.tpl_txfm_stats);
av1_tpl_store_txfm_stats(tpl_data, &cpi->td.tpl_txfm_stats, frame_idx);
+#endif // CONFIG_BITRATE_ACCURACY
#if CONFIG_RATECTRL_LOG && CONFIG_THREE_PASS && CONFIG_BITRATE_ACCURACY
if (cpi->oxcf.pass == AOM_RC_THIRD_PASS) {
int frame_coding_idx =
@@ -2057,6 +2090,7 @@
RDCOST(tpl_frame->base_rdmult, this_stats->mc_dep_rate,
this_stats->mc_dep_dist);
double dist_scaled = (double)(this_stats->recrf_dist << RDDIV_BITS);
+ dist_scaled = AOMMAX(dist_scaled, 1);
intra_cost_base += log(dist_scaled) * cbcmp;
mc_dep_cost_base += log(dist_scaled + mc_dep_delta) * cbcmp;
cbcmp_base += cbcmp;
diff --git a/av1/encoder/tpl_model.h b/av1/encoder/tpl_model.h
index 71cc320..36c3ae0 100644
--- a/av1/encoder/tpl_model.h
+++ b/av1/encoder/tpl_model.h
@@ -485,6 +485,7 @@
*/
void av1_init_tpl_txfm_stats(TplTxfmStats *tpl_txfm_stats);
+#if CONFIG_BITRATE_ACCURACY
/*
*!\brief Accumulate TplTxfmStats
*
@@ -516,6 +517,7 @@
* \param[in] txfm_stats A structure for storing transform stats
*/
void av1_tpl_txfm_stats_update_abs_coeff_mean(TplTxfmStats *txfm_stats);
+#endif // CONFIG_BITRATE_ACCURACY
/*!\brief Estimate coefficient entropy using Laplace distribution
*
diff --git a/av1/encoder/tune_butteraugli.c b/av1/encoder/tune_butteraugli.c
index 2f057e1..8f59373 100644
--- a/av1/encoder/tune_butteraugli.c
+++ b/av1/encoder/tune_butteraugli.c
@@ -209,7 +209,7 @@
if (dst->buffer_alloc_sz == 0) {
aom_alloc_frame_buffer(
dst, width, height, ss_x, ss_y, cm->seq_params->use_highbitdepth,
- cpi->oxcf.border_in_pixels, cm->features.byte_alignment, 0);
+ cpi->oxcf.border_in_pixels, cm->features.byte_alignment, 0, 0);
}
av1_copy_and_extend_frame(cpi->source, dst);
@@ -218,7 +218,7 @@
aom_alloc_frame_buffer(
resized_dst, width / resize_factor, height / resize_factor, ss_x, ss_y,
cm->seq_params->use_highbitdepth, cpi->oxcf.border_in_pixels,
- cm->features.byte_alignment, 0);
+ cm->features.byte_alignment, 0, 0);
}
av1_resize_and_extend_frame_nonnormative(cpi->source, resized_dst, bit_depth,
av1_num_planes(cm));
@@ -241,7 +241,7 @@
aom_alloc_frame_buffer(
&resized_recon, width / resize_factor, height / resize_factor, ss_x, ss_y,
cm->seq_params->use_highbitdepth, cpi->oxcf.border_in_pixels,
- cm->features.byte_alignment, 0);
+ cm->features.byte_alignment, 0, 0);
copy_img(&cpi->common.cur_frame->buf, &resized_recon, width / resize_factor,
height / resize_factor);
@@ -264,13 +264,12 @@
cpi->source = av1_realloc_and_scale_if_required(
cm, cpi->unscaled_source, &cpi->scaled_source, cm->features.interp_filter,
- 0, false, false, cpi->oxcf.border_in_pixels,
- cpi->oxcf.tool_cfg.enable_global_motion);
+ 0, false, false, cpi->oxcf.border_in_pixels, cpi->image_pyramid_levels);
if (cpi->unscaled_last_source != NULL) {
cpi->last_source = av1_realloc_and_scale_if_required(
cm, cpi->unscaled_last_source, &cpi->scaled_last_source,
cm->features.interp_filter, 0, false, false, cpi->oxcf.border_in_pixels,
- cpi->oxcf.tool_cfg.enable_global_motion);
+ cpi->image_pyramid_levels);
}
av1_setup_butteraugli_source(cpi);
@@ -299,9 +298,8 @@
av1_set_quantizer(cm, q_cfg->qm_minlevel, q_cfg->qm_maxlevel, q_index,
q_cfg->enable_chroma_deltaq, q_cfg->enable_hdr_deltaq);
av1_set_speed_features_qindex_dependent(cpi, oxcf->speed);
- if (q_cfg->deltaq_mode != NO_DELTA_Q || q_cfg->enable_chroma_deltaq)
- av1_init_quantizer(&cpi->enc_quant_dequant_params, &cm->quant_params,
- cm->seq_params->bit_depth);
+ av1_init_quantizer(&cpi->enc_quant_dequant_params, &cm->quant_params,
+ cm->seq_params->bit_depth);
av1_set_variance_partition_thresholds(cpi, q_index, 0);
av1_encode_frame(cpi);
diff --git a/av1/encoder/tune_vmaf.c b/av1/encoder/tune_vmaf.c
index 46260a6..9c7c112 100644
--- a/av1/encoder/tune_vmaf.c
+++ b/av1/encoder/tune_vmaf.c
@@ -63,7 +63,7 @@
// Do motion search.
// Only do full search on the entire block.
av1_make_default_fullpel_ms_params(&full_ms_params, cpi, mb, block_size,
- &baseline_mv, search_site_cfg,
+ &baseline_mv, *ref_mv, search_site_cfg,
/*fine_search_interval=*/0);
av1_set_mv_search_method(&full_ms_params, search_site_cfg, search_method);
av1_full_pixel_search(*ref_mv, &full_ms_params, step_param,
@@ -341,7 +341,7 @@
aom_alloc_frame_buffer(
&sharpened, width, height, source->subsampling_x, source->subsampling_y,
cm->seq_params->use_highbitdepth, cpi->oxcf.border_in_pixels,
- cm->features.byte_alignment, 0);
+ cm->features.byte_alignment, 0, 0);
const double baseline_variance = frame_average_variance(cpi, source);
double unsharp_amount;
@@ -393,7 +393,7 @@
aom_alloc_frame_buffer(
&blurred, width, height, source->subsampling_x, source->subsampling_y,
cm->seq_params->use_highbitdepth, cpi->oxcf.border_in_pixels,
- cm->features.byte_alignment, 0);
+ cm->features.byte_alignment, 0, 0);
gaussian_blur(bit_depth, source, &blurred);
unsharp(cpi, source, &blurred, source, best_frame_unsharp_amount);
@@ -413,11 +413,11 @@
aom_alloc_frame_buffer(
&source_extended, width, height, source->subsampling_x,
source->subsampling_y, cm->seq_params->use_highbitdepth,
- cpi->oxcf.border_in_pixels, cm->features.byte_alignment, 0);
+ cpi->oxcf.border_in_pixels, cm->features.byte_alignment, 0, 0);
aom_alloc_frame_buffer(
&blurred, width, height, source->subsampling_x, source->subsampling_y,
cm->seq_params->use_highbitdepth, cpi->oxcf.border_in_pixels,
- cm->features.byte_alignment, 0);
+ cm->features.byte_alignment, 0, 0);
av1_copy_and_extend_frame(source, &source_extended);
gaussian_blur(bit_depth, &source_extended, &blurred);
@@ -453,11 +453,11 @@
memset(&source_extended, 0, sizeof(source_extended));
aom_alloc_frame_buffer(
&blurred, width, height, ss_x, ss_y, cm->seq_params->use_highbitdepth,
- cpi->oxcf.border_in_pixels, cm->features.byte_alignment, 0);
+ cpi->oxcf.border_in_pixels, cm->features.byte_alignment, 0, 0);
aom_alloc_frame_buffer(&source_extended, width, height, ss_x, ss_y,
cm->seq_params->use_highbitdepth,
cpi->oxcf.border_in_pixels,
- cm->features.byte_alignment, 0);
+ cm->features.byte_alignment, 0, 0);
av1_copy_and_extend_frame(source, &source_extended);
gaussian_blur(bit_depth, &source_extended, &blurred);
@@ -493,11 +493,11 @@
aom_alloc_frame_buffer(&source_block, block_w, block_h, ss_x, ss_y,
cm->seq_params->use_highbitdepth,
cpi->oxcf.border_in_pixels,
- cm->features.byte_alignment, 0);
+ cm->features.byte_alignment, 0, 0);
aom_alloc_frame_buffer(&blurred_block, block_w, block_h, ss_x, ss_y,
cm->seq_params->use_highbitdepth,
cpi->oxcf.border_in_pixels,
- cm->features.byte_alignment, 0);
+ cm->features.byte_alignment, 0, 0);
for (int row = 0; row < num_rows; ++row) {
for (int col = 0; col < num_cols; ++col) {
@@ -620,7 +620,7 @@
aom_alloc_frame_buffer(
&resized_source, y_width / resize_factor, y_height / resize_factor, ss_x,
ss_y, cm->seq_params->use_highbitdepth, cpi->oxcf.border_in_pixels,
- cm->features.byte_alignment, 0);
+ cm->features.byte_alignment, 0, 0);
av1_resize_and_extend_frame_nonnormative(cpi->source, &resized_source,
bit_depth, av1_num_planes(cm));
@@ -638,7 +638,7 @@
aom_alloc_frame_buffer(&blurred, resized_y_width, resized_y_height, ss_x,
ss_y, cm->seq_params->use_highbitdepth,
cpi->oxcf.border_in_pixels,
- cm->features.byte_alignment, 0);
+ cm->features.byte_alignment, 0, 0);
gaussian_blur(bit_depth, &resized_source, &blurred);
YV12_BUFFER_CONFIG recon;
@@ -646,7 +646,7 @@
aom_alloc_frame_buffer(&recon, resized_y_width, resized_y_height, ss_x, ss_y,
cm->seq_params->use_highbitdepth,
cpi->oxcf.border_in_pixels,
- cm->features.byte_alignment, 0);
+ cm->features.byte_alignment, 0, 0);
aom_yv12_copy_frame(&resized_source, &recon, 1);
VmafContext *vmaf_context;
@@ -825,15 +825,15 @@
aom_alloc_frame_buffer(&blurred_cur, y_width, y_height, ss_x, ss_y,
cm->seq_params->use_highbitdepth,
cpi->oxcf.border_in_pixels,
- cm->features.byte_alignment, 0);
+ cm->features.byte_alignment, 0, 0);
aom_alloc_frame_buffer(&blurred_last, y_width, y_height, ss_x, ss_y,
cm->seq_params->use_highbitdepth,
cpi->oxcf.border_in_pixels,
- cm->features.byte_alignment, 0);
+ cm->features.byte_alignment, 0, 0);
aom_alloc_frame_buffer(&blurred_next, y_width, y_height, ss_x, ss_y,
cm->seq_params->use_highbitdepth,
cpi->oxcf.border_in_pixels,
- cm->features.byte_alignment, 0);
+ cm->features.byte_alignment, 0, 0);
gaussian_blur(bit_depth, cur, &blurred_cur);
gaussian_blur(bit_depth, last, &blurred_last);
@@ -935,7 +935,8 @@
const double dvmaf = 26.11 * (1.0 - exp(-0.06 * motion));
const double dsse = dvmaf * approx_sse / approx_dvmaf;
- const double beta = approx_sse / (dsse + approx_sse);
+ // Clamping beta to address VQ issue (aomedia:3170).
+ const double beta = AOMMAX(approx_sse / (dsse + approx_sse), 0.5);
const int offset =
av1_get_deltaq_offset(cm->seq_params->bit_depth, current_qindex, beta);
int qindex = current_qindex + offset;
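
The clamp bounds how far the motion term can push the delta-q. Assuming
av1_get_deltaq_offset() scales the quantization step by 1/sqrt(beta), as its
use elsewhere in the delta-q code suggests, the step-size increase for
high-motion frames is now capped at sqrt(2):

    \beta = \max\!\left( \frac{\mathrm{SSE}}{\Delta\mathrm{SSE} +
            \mathrm{SSE}},\ \tfrac{1}{2} \right), \qquad
    q_{\mathrm{step}}' \approx \frac{q_{\mathrm{step}}}{\sqrt{\beta}}
            \le \sqrt{2}\, q_{\mathrm{step}}
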
@@ -1017,18 +1018,18 @@
aom_alloc_frame_buffer(&recon_sharpened, width, height, ss_x, ss_y,
cm->seq_params->use_highbitdepth,
cpi->oxcf.border_in_pixels,
- cm->features.byte_alignment, 0);
+ cm->features.byte_alignment, 0, 0);
aom_alloc_frame_buffer(&src_sharpened, width, height, ss_x, ss_y,
cm->seq_params->use_highbitdepth,
cpi->oxcf.border_in_pixels,
- cm->features.byte_alignment, 0);
+ cm->features.byte_alignment, 0, 0);
aom_alloc_frame_buffer(&recon_blurred, width, height, ss_x, ss_y,
cm->seq_params->use_highbitdepth,
cpi->oxcf.border_in_pixels,
- cm->features.byte_alignment, 0);
+ cm->features.byte_alignment, 0, 0);
aom_alloc_frame_buffer(
&src_blurred, width, height, ss_x, ss_y, cm->seq_params->use_highbitdepth,
- cpi->oxcf.border_in_pixels, cm->features.byte_alignment, 0);
+ cpi->oxcf.border_in_pixels, cm->features.byte_alignment, 0, 0);
gaussian_blur(bit_depth, recon, &recon_blurred);
gaussian_blur(bit_depth, src, &src_blurred);
diff --git a/av1/encoder/tx_search.c b/av1/encoder/tx_search.c
index 74c9de2..d6217b7 100644
--- a/av1/encoder/tx_search.c
+++ b/av1/encoder/tx_search.c
@@ -2809,10 +2809,10 @@
int feature_idx = get_mean_dev_features(diff, diff_stride, bw, bh, features);
- features[feature_idx++] = logf(1.0f + (float)x->source_variance);
+ features[feature_idx++] = log1pf((float)x->source_variance);
const int dc_q = av1_dc_quant_QTX(x->qindex, 0, xd->bd) >> (xd->bd - 8);
- const float log_dc_q_square = logf(1.0f + (float)(dc_q * dc_q) / 256.0f);
+ const float log_dc_q_square = log1pf((float)(dc_q * dc_q) / 256.0f);
features[feature_idx++] = log_dc_q_square;
assert(feature_idx == NUM_INTRA_TX_SPLIT_FEATURES);
for (int i = 0; i < NUM_INTRA_TX_SPLIT_FEATURES; i++) {
@@ -2895,7 +2895,13 @@
#endif
RD_STATS this_rd_stats;
- rd[depth] = av1_uniform_txfm_yrd(cpi, x, &this_rd_stats, ref_best_rd, bs,
+ // When the speed feature use_rd_based_breakout_for_intra_tx_search is
+ // enabled, use the known minimum best_rd for early termination.
+ const int64_t rd_thresh =
+ cpi->sf.tx_sf.use_rd_based_breakout_for_intra_tx_search
+ ? AOMMIN(ref_best_rd, best_rd)
+ : ref_best_rd;
+ rd[depth] = av1_uniform_txfm_yrd(cpi, x, &this_rd_stats, rd_thresh, bs,
tx_size, FTXS_NONE, skip_trellis);
if (rd[depth] < best_rd) {
av1_copy_array(best_blk_skip, txfm_info->blk_skip, num_blks);
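
The breakout follows the usual monotone-threshold pattern: each depth is
searched against the tightest bound known so far, so later searches can
terminate as soon as they exceed it. A self-contained sketch with
hypothetical names (search_tx_size() stands in for av1_uniform_txfm_yrd()):

    #include <stdint.h>
    #define AOMMIN(a, b) ((a) < (b) ? (a) : (b))

    /* Stub standing in for the real search; a real implementation would
     * stop early once its running cost exceeds rd_thresh. */
    static int64_t search_tx_size(int depth, int64_t rd_thresh) {
      (void)rd_thresh;
      return 1000 - 100 * (int64_t)depth;  /* dummy costs for illustration */
    }

    static int64_t search_all_depths(int max_depth, int64_t ref_best_rd,
                                     int use_rd_based_breakout) {
      int64_t best_rd = INT64_MAX;
      for (int depth = 0; depth < max_depth; ++depth) {
        const int64_t rd_thresh = use_rd_based_breakout
                                      ? AOMMIN(ref_best_rd, best_rd)
                                      : ref_best_rd;
        const int64_t this_rd = search_tx_size(depth, rd_thresh);
        if (this_rd < best_rd) best_rd = this_rd;  /* tighten the bound */
      }
      return best_rd;
    }
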
diff --git a/av1/encoder/txb_rdopt.c b/av1/encoder/txb_rdopt.c
index 2f2b8fd..e551e8a 100644
--- a/av1/encoder/txb_rdopt.c
+++ b/av1/encoder/txb_rdopt.c
@@ -16,7 +16,7 @@
static INLINE void update_coeff_general(
int *accu_rate, int64_t *accu_dist, int si, int eob, TX_SIZE tx_size,
- TX_CLASS tx_class, int bwl, int height, int64_t rdmult, int shift,
+ TX_CLASS tx_class, int bhl, int width, int64_t rdmult, int shift,
int dc_sign_ctx, const int16_t *dequant, const int16_t *scan,
const LV_MAP_COEFF_COST *txb_costs, const tran_low_t *tcoeff,
tran_low_t *qcoeff, tran_low_t *dqcoeff, uint8_t *levels,
@@ -26,7 +26,7 @@
const tran_low_t qc = qcoeff[ci];
const int is_last = si == (eob - 1);
const int coeff_ctx = get_lower_levels_ctx_general(
- is_last, si, bwl, height, levels, ci, tx_size, tx_class);
+ is_last, si, bhl, width, levels, ci, tx_size, tx_class);
if (qc == 0) {
*accu_rate += txb_costs->base_cost[coeff_ctx][0];
} else {
@@ -38,7 +38,7 @@
const int64_t dist0 = get_coeff_dist(tqc, 0, shift, qmatrix, ci);
const int rate =
get_coeff_cost_general(is_last, ci, abs_qc, sign, coeff_ctx,
- dc_sign_ctx, txb_costs, bwl, tx_class, levels);
+ dc_sign_ctx, txb_costs, bhl, tx_class, levels);
const int64_t rd = RDCOST(rdmult, rate, dist);
tran_low_t qc_low, dqc_low;
@@ -55,14 +55,14 @@
dist_low = get_coeff_dist(tqc, dqc_low, shift, qmatrix, ci);
rate_low =
get_coeff_cost_general(is_last, ci, abs_qc_low, sign, coeff_ctx,
- dc_sign_ctx, txb_costs, bwl, tx_class, levels);
+ dc_sign_ctx, txb_costs, bhl, tx_class, levels);
}
rd_low = RDCOST(rdmult, rate_low, dist_low);
if (rd_low < rd) {
qcoeff[ci] = qc_low;
dqcoeff[ci] = dqc_low;
- levels[get_padded_idx(ci, bwl)] = AOMMIN(abs_qc_low, INT8_MAX);
+ levels[get_padded_idx(ci, bhl)] = AOMMIN(abs_qc_low, INT8_MAX);
*accu_rate += rate_low;
*accu_dist += dist_low - dist0;
} else {
@@ -74,7 +74,7 @@
static AOM_FORCE_INLINE void update_coeff_simple(
int *accu_rate, int si, int eob, TX_SIZE tx_size, TX_CLASS tx_class,
- int bwl, int64_t rdmult, int shift, const int16_t *dequant,
+ int bhl, int64_t rdmult, int shift, const int16_t *dequant,
const int16_t *scan, const LV_MAP_COEFF_COST *txb_costs,
const tran_low_t *tcoeff, tran_low_t *qcoeff, tran_low_t *dqcoeff,
uint8_t *levels, const qm_val_t *iqmatrix, const qm_val_t *qmatrix) {
@@ -87,7 +87,7 @@
const int ci = scan[si];
const tran_low_t qc = qcoeff[ci];
const int coeff_ctx =
- get_lower_levels_ctx(levels, ci, bwl, tx_size, tx_class);
+ get_lower_levels_ctx(levels, ci, bhl, tx_size, tx_class);
if (qc == 0) {
*accu_rate += txb_costs->base_cost[coeff_ctx][0];
} else {
@@ -96,7 +96,7 @@
const tran_low_t abs_dqc = abs(dqcoeff[ci]);
int rate_low = 0;
const int rate = get_two_coeff_cost_simple(
- ci, abs_qc, coeff_ctx, txb_costs, bwl, tx_class, levels, &rate_low);
+ ci, abs_qc, coeff_ctx, txb_costs, bhl, tx_class, levels, &rate_low);
if (abs_dqc < abs_tqc) {
*accu_rate += rate;
return;
@@ -115,7 +115,7 @@
const int sign = (qc < 0) ? 1 : 0;
qcoeff[ci] = (-sign ^ abs_qc_low) + sign;
dqcoeff[ci] = (-sign ^ abs_dqc_low) + sign;
- levels[get_padded_idx(ci, bwl)] = AOMMIN(abs_qc_low, INT8_MAX);
+ levels[get_padded_idx(ci, bhl)] = AOMMIN(abs_qc_low, INT8_MAX);
*accu_rate += rate_low;
} else {
*accu_rate += rate;
@@ -125,7 +125,7 @@
static AOM_FORCE_INLINE void update_coeff_eob(
int *accu_rate, int64_t *accu_dist, int *eob, int *nz_num, int *nz_ci,
- int si, TX_SIZE tx_size, TX_CLASS tx_class, int bwl, int height,
+ int si, TX_SIZE tx_size, TX_CLASS tx_class, int bhl, int width,
int dc_sign_ctx, int64_t rdmult, int shift, const int16_t *dequant,
const int16_t *scan, const LV_MAP_EOB_COST *txb_eob_costs,
const LV_MAP_COEFF_COST *txb_costs, const tran_low_t *tcoeff,
@@ -136,7 +136,7 @@
const int ci = scan[si];
const tran_low_t qc = qcoeff[ci];
const int coeff_ctx =
- get_lower_levels_ctx(levels, ci, bwl, tx_size, tx_class);
+ get_lower_levels_ctx(levels, ci, bhl, tx_size, tx_class);
if (qc == 0) {
*accu_rate += txb_costs->base_cost[coeff_ctx][0];
} else {
@@ -149,7 +149,7 @@
int64_t dist = get_coeff_dist(tqc, dqc, shift, qmatrix, ci) - dist0;
int rate =
get_coeff_cost_general(0, ci, abs_qc, sign, coeff_ctx, dc_sign_ctx,
- txb_costs, bwl, tx_class, levels);
+ txb_costs, bhl, tx_class, levels);
int64_t rd = RDCOST(rdmult, *accu_rate + rate, *accu_dist + dist);
tran_low_t qc_low, dqc_low;
@@ -169,18 +169,18 @@
dist_low = get_coeff_dist(tqc, dqc_low, shift, qmatrix, ci) - dist0;
rate_low =
get_coeff_cost_general(0, ci, abs_qc_low, sign, coeff_ctx,
- dc_sign_ctx, txb_costs, bwl, tx_class, levels);
+ dc_sign_ctx, txb_costs, bhl, tx_class, levels);
rd_low = RDCOST(rdmult, *accu_rate + rate_low, *accu_dist + dist_low);
}
int lower_level_new_eob = 0;
const int new_eob = si + 1;
- const int coeff_ctx_new_eob = get_lower_levels_ctx_eob(bwl, height, si);
+ const int coeff_ctx_new_eob = get_lower_levels_ctx_eob(bhl, width, si);
const int new_eob_cost =
get_eob_cost(new_eob, txb_eob_costs, txb_costs, tx_class);
int rate_coeff_eob =
new_eob_cost + get_coeff_cost_eob(ci, abs_qc, sign, coeff_ctx_new_eob,
- dc_sign_ctx, txb_costs, bwl,
+ dc_sign_ctx, txb_costs, bhl,
tx_class);
int64_t dist_new_eob = dist;
int64_t rd_new_eob = RDCOST(rdmult, rate_coeff_eob, dist_new_eob);
@@ -189,7 +189,7 @@
const int rate_coeff_eob_low =
new_eob_cost + get_coeff_cost_eob(ci, abs_qc_low, sign,
coeff_ctx_new_eob, dc_sign_ctx,
- txb_costs, bwl, tx_class);
+ txb_costs, bhl, tx_class);
const int64_t dist_new_eob_low = dist_low;
const int64_t rd_new_eob_low =
RDCOST(rdmult, rate_coeff_eob_low, dist_new_eob_low);
@@ -213,7 +213,7 @@
if (sharpness == 0 && rd_new_eob < rd) {
for (int ni = 0; ni < *nz_num; ++ni) {
int last_ci = nz_ci[ni];
- levels[get_padded_idx(last_ci, bwl)] = 0;
+ levels[get_padded_idx(last_ci, bhl)] = 0;
qcoeff[last_ci] = 0;
dqcoeff[last_ci] = 0;
}
@@ -230,7 +230,7 @@
if (lower_level) {
qcoeff[ci] = qc_low;
dqcoeff[ci] = dqc_low;
- levels[get_padded_idx(ci, bwl)] = AOMMIN(abs_qc_low, INT8_MAX);
+ levels[get_padded_idx(ci, bhl)] = AOMMIN(abs_qc_low, INT8_MAX);
}
if (qcoeff[ci]) {
nz_ci[*nz_num] = ci;
@@ -251,7 +251,7 @@
qcoeff[ci] = 0;
dqcoeff[ci] = 0;
// no need to set up levels because this is the last step
- // levels[get_padded_idx(ci, bwl)] = 0;
+ // levels[get_padded_idx(ci, bhl)] = 0;
}
*accu_rate = 0;
*eob = 0;
@@ -324,10 +324,10 @@
const TX_SIZE txs_ctx = get_txsize_entropy_ctx(tx_size);
const TX_CLASS tx_class = tx_type_to_class[tx_type];
const MB_MODE_INFO *mbmi = xd->mi[0];
- const int bwl = get_txb_bwl(tx_size);
+ const int bhl = get_txb_bhl(tx_size);
const int width = get_txb_wide(tx_size);
const int height = get_txb_high(tx_size);
- assert(width == (1 << bwl));
+ assert(height == (1 << bhl));
const int is_inter = is_inter_block(mbmi);
const LV_MAP_COEFF_COST *txb_costs =
&coeff_costs->coeff_costs[txs_ctx][plane_type];
@@ -344,7 +344,7 @@
rshift;
uint8_t levels_buf[TX_PAD_2D];
- uint8_t *const levels = set_levels(levels_buf, width);
+ uint8_t *const levels = set_levels(levels_buf, height);
if (eob > 1) av1_txb_init_levels(qcoeff, width, height, levels);
@@ -365,16 +365,16 @@
int nz_ci[3] = { ci, 0, 0 };
if (abs_qc >= 2) {
update_coeff_general(&accu_rate, &accu_dist, si, eob, tx_size, tx_class,
- bwl, height, rdmult, shift, txb_ctx->dc_sign_ctx,
+ bhl, width, rdmult, shift, txb_ctx->dc_sign_ctx,
dequant, scan, txb_costs, tcoeff, qcoeff, dqcoeff,
levels, iqmatrix, qmatrix);
--si;
} else {
assert(abs_qc == 1);
- const int coeff_ctx = get_lower_levels_ctx_eob(bwl, height, si);
+ const int coeff_ctx = get_lower_levels_ctx_eob(bhl, width, si);
accu_rate +=
get_coeff_cost_eob(ci, abs_qc, sign, coeff_ctx, txb_ctx->dc_sign_ctx,
- txb_costs, bwl, tx_class);
+ txb_costs, bhl, tx_class);
const tran_low_t tqc = tcoeff[ci];
const tran_low_t dqc = dqcoeff[ci];
const int64_t dist = get_coeff_dist(tqc, dqc, shift, qmatrix, ci);
@@ -387,7 +387,7 @@
case tx_class_literal: \
for (; si >= 0 && nz_num <= max_nz_num; --si) { \
update_coeff_eob(&accu_rate, &accu_dist, &eob, &nz_num, nz_ci, si, \
- tx_size, tx_class_literal, bwl, height, \
+ tx_size, tx_class_literal, bhl, width, \
txb_ctx->dc_sign_ctx, rdmult, shift, dequant, scan, \
txb_eob_costs, txb_costs, tcoeff, qcoeff, dqcoeff, \
levels, sharpness, iqmatrix, qmatrix); \
@@ -409,7 +409,7 @@
#define UPDATE_COEFF_SIMPLE_CASE(tx_class_literal) \
case tx_class_literal: \
for (; si >= 1; --si) { \
- update_coeff_simple(&accu_rate, si, eob, tx_size, tx_class_literal, bwl, \
+ update_coeff_simple(&accu_rate, si, eob, tx_size, tx_class_literal, bhl, \
rdmult, shift, dequant, scan, txb_costs, tcoeff, \
qcoeff, dqcoeff, levels, iqmatrix, qmatrix); \
} \
@@ -427,7 +427,7 @@
// no need to update accu_dist because it's not used after this point
int64_t dummy_dist = 0;
update_coeff_general(&accu_rate, &dummy_dist, si, eob, tx_size, tx_class,
- bwl, height, rdmult, shift, txb_ctx->dc_sign_ctx,
+ bhl, width, rdmult, shift, txb_ctx->dc_sign_ctx,
dequant, scan, txb_costs, tcoeff, qcoeff, dqcoeff,
levels, iqmatrix, qmatrix);
}
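
The bwl-to-bhl rename across this file tracks the levels[] scratch buffer
switching from a width-derived stride to a height-derived one, i.e. a
transposed layout (note set_levels() now takes height and the assert checks
height == (1 << bhl)). An illustration of padded 2D indexing under such a
scheme; PAD is made up for the example, libaom has its own TX_PAD_*
constants:

    #define PAD 4 /* hypothetical; stands in for libaom's padding constant */

    /* Row `idx >> log2_stride` of the scratch buffer is followed by PAD
     * padding entries, so skip PAD extra slots per completed row. */
    static int padded_idx(int idx, int log2_stride) {
      return idx + (idx >> log2_stride) * PAD;
    }
    /* With bhl = log2(height) as the stride, scratch rows now run along
     * the transform block's height rather than its width. */
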
@@ -456,13 +456,13 @@
int reduced_tx_set_used) {
const tran_low_t *const qcoeff = p->qcoeff + BLOCK_OFFSET(block);
const int txb_skip_ctx = txb_ctx->txb_skip_ctx;
- const int bwl = get_txb_bwl(tx_size);
+ const int bhl = get_txb_bhl(tx_size);
const int width = get_txb_wide(tx_size);
const int height = get_txb_high(tx_size);
const SCAN_ORDER *const scan_order = get_scan(tx_size, tx_type);
const int16_t *const scan = scan_order->scan;
uint8_t levels_buf[TX_PAD_2D];
- uint8_t *const levels = set_levels(levels_buf, width);
+ uint8_t *const levels = set_levels(levels_buf, height);
DECLARE_ALIGNED(16, int8_t, coeff_contexts[MAX_TX_SQUARE]);
const int eob_multi_size = txsize_log2_minus4[tx_size];
const LV_MAP_EOB_COST *const eob_costs =
@@ -491,7 +491,7 @@
if (v) {
// sign bit cost
if (level > NUM_BASE_LEVELS) {
- const int ctx = get_br_ctx_eob(pos, bwl, tx_class);
+ const int ctx = get_br_ctx_eob(pos, bhl, tx_class);
cost += get_br_cost(level, lps_cost[ctx]);
}
if (c) {
@@ -515,7 +515,7 @@
// sign bit cost
cost += av1_cost_literal(1);
if (level > NUM_BASE_LEVELS) {
- const int ctx = get_br_ctx(levels, pos, bwl, tx_class);
+ const int ctx = get_br_ctx(levels, pos, bhl, tx_class);
cost += get_br_cost(level, lps_cost[ctx]);
}
}
@@ -535,7 +535,7 @@
const int dc_sign_ctx = txb_ctx->dc_sign_ctx;
cost += coeff_costs->dc_sign_cost[dc_sign_ctx][sign01];
if (level > NUM_BASE_LEVELS) {
- const int ctx = get_br_ctx(levels, pos, bwl, tx_class);
+ const int ctx = get_br_ctx(levels, pos, bhl, tx_class);
cost += get_br_cost(level, lps_cost[ctx]);
}
}
diff --git a/av1/encoder/txb_rdopt_utils.h b/av1/encoder/txb_rdopt_utils.h
index d8158fd..b9f08aa 100644
--- a/av1/encoder/txb_rdopt_utils.h
+++ b/av1/encoder/txb_rdopt_utils.h
@@ -119,7 +119,7 @@
static AOM_FORCE_INLINE int get_two_coeff_cost_simple(
int ci, tran_low_t abs_qc, int coeff_ctx,
- const LV_MAP_COEFF_COST *txb_costs, int bwl, TX_CLASS tx_class,
+ const LV_MAP_COEFF_COST *txb_costs, int bhl, TX_CLASS tx_class,
const uint8_t *levels, int *cost_low) {
// this simple version assumes the coeff's scan_idx is not DC (scan_idx != 0)
// and not the last (scan_idx != eob - 1)
@@ -130,7 +130,7 @@
if (abs_qc) {
cost += av1_cost_literal(1);
if (abs_qc > NUM_BASE_LEVELS) {
- const int br_ctx = get_br_ctx(levels, ci, bwl, tx_class);
+ const int br_ctx = get_br_ctx(levels, ci, bhl, tx_class);
int brcost_diff = 0;
cost += get_br_cost_with_diff(abs_qc, txb_costs->lps_cost[br_ctx],
&brcost_diff);
@@ -145,7 +145,7 @@
static INLINE int get_coeff_cost_eob(int ci, tran_low_t abs_qc, int sign,
int coeff_ctx, int dc_sign_ctx,
const LV_MAP_COEFF_COST *txb_costs,
- int bwl, TX_CLASS tx_class) {
+ int bhl, TX_CLASS tx_class) {
int cost = 0;
cost += txb_costs->base_eob_cost[coeff_ctx][AOMMIN(abs_qc, 3) - 1];
if (abs_qc != 0) {
@@ -156,7 +156,7 @@
}
if (abs_qc > NUM_BASE_LEVELS) {
int br_ctx;
- br_ctx = get_br_ctx_eob(ci, bwl, tx_class);
+ br_ctx = get_br_ctx_eob(ci, bhl, tx_class);
cost += get_br_cost(abs_qc, txb_costs->lps_cost[br_ctx]);
}
}
@@ -167,7 +167,7 @@
int sign, int coeff_ctx,
int dc_sign_ctx,
const LV_MAP_COEFF_COST *txb_costs,
- int bwl, TX_CLASS tx_class,
+ int bhl, TX_CLASS tx_class,
const uint8_t *levels) {
int cost = 0;
if (is_last) {
@@ -184,9 +184,9 @@
if (abs_qc > NUM_BASE_LEVELS) {
int br_ctx;
if (is_last)
- br_ctx = get_br_ctx_eob(ci, bwl, tx_class);
+ br_ctx = get_br_ctx_eob(ci, bhl, tx_class);
else
- br_ctx = get_br_ctx(levels, ci, bwl, tx_class);
+ br_ctx = get_br_ctx(levels, ci, bhl, tx_class);
cost += get_br_cost(abs_qc, txb_costs->lps_cost[br_ctx]);
}
}
diff --git a/av1/encoder/var_based_part.c b/av1/encoder/var_based_part.c
index 3d47a28..5b8f598 100644
--- a/av1/encoder/var_based_part.c
+++ b/av1/encoder/var_based_part.c
@@ -30,8 +30,6 @@
#include "av1/encoder/var_based_part.h"
#include "av1/encoder/reconinter_enc.h"
-extern const uint8_t AV1_VAR_OFFS[];
-
// Possible values for the force_split variable while evaluating variance based
// partitioning.
enum {
@@ -50,49 +48,49 @@
static AOM_INLINE void tree_to_node(void *data, BLOCK_SIZE bsize,
variance_node *node) {
- int i;
node->part_variances = NULL;
switch (bsize) {
case BLOCK_128X128: {
VP128x128 *vt = (VP128x128 *)data;
node->part_variances = &vt->part_variances;
- for (i = 0; i < 4; i++)
- node->split[i] = &vt->split[i].part_variances.none;
+ for (int split_idx = 0; split_idx < 4; split_idx++)
+ node->split[split_idx] = &vt->split[split_idx].part_variances.none;
break;
}
case BLOCK_64X64: {
VP64x64 *vt = (VP64x64 *)data;
node->part_variances = &vt->part_variances;
- for (i = 0; i < 4; i++)
- node->split[i] = &vt->split[i].part_variances.none;
+ for (int split_idx = 0; split_idx < 4; split_idx++)
+ node->split[split_idx] = &vt->split[split_idx].part_variances.none;
break;
}
case BLOCK_32X32: {
VP32x32 *vt = (VP32x32 *)data;
node->part_variances = &vt->part_variances;
- for (i = 0; i < 4; i++)
- node->split[i] = &vt->split[i].part_variances.none;
+ for (int split_idx = 0; split_idx < 4; split_idx++)
+ node->split[split_idx] = &vt->split[split_idx].part_variances.none;
break;
}
case BLOCK_16X16: {
VP16x16 *vt = (VP16x16 *)data;
node->part_variances = &vt->part_variances;
- for (i = 0; i < 4; i++)
- node->split[i] = &vt->split[i].part_variances.none;
+ for (int split_idx = 0; split_idx < 4; split_idx++)
+ node->split[split_idx] = &vt->split[split_idx].part_variances.none;
break;
}
case BLOCK_8X8: {
VP8x8 *vt = (VP8x8 *)data;
node->part_variances = &vt->part_variances;
- for (i = 0; i < 4; i++)
- node->split[i] = &vt->split[i].part_variances.none;
+ for (int split_idx = 0; split_idx < 4; split_idx++)
+ node->split[split_idx] = &vt->split[split_idx].part_variances.none;
break;
}
default: {
VP4x4 *vt = (VP4x4 *)data;
assert(bsize == BLOCK_4X4);
node->part_variances = &vt->part_variances;
- for (i = 0; i < 4; i++) node->split[i] = &vt->split[i];
+ for (int split_idx = 0; split_idx < 4; split_idx++)
+ node->split[split_idx] = &vt->split[split_idx];
break;
}
}
@@ -217,12 +215,14 @@
if (mi_row + bs_height_check <= tile->mi_row_end &&
mi_col + bs_width_vert_check <= tile->mi_col_end) {
BLOCK_SIZE subsize = get_partition_subsize(bsize, PARTITION_VERT);
+ BLOCK_SIZE plane_bsize =
+ get_plane_block_size(subsize, xd->plane[AOM_PLANE_U].subsampling_x,
+ xd->plane[AOM_PLANE_U].subsampling_y);
get_variance(&vt.part_variances->vert[0]);
get_variance(&vt.part_variances->vert[1]);
if (vt.part_variances->vert[0].variance < threshold &&
vt.part_variances->vert[1].variance < threshold &&
- get_plane_block_size(subsize, xd->plane[1].subsampling_x,
- xd->plane[1].subsampling_y) < BLOCK_INVALID) {
+ plane_bsize < BLOCK_INVALID) {
set_block_size(cpi, mi_row, mi_col, subsize);
set_block_size(cpi, mi_row, mi_col + block_width / 2, subsize);
return 1;
@@ -232,12 +232,14 @@
if (mi_col + bs_width_check <= tile->mi_col_end &&
mi_row + bs_height_horiz_check <= tile->mi_row_end) {
BLOCK_SIZE subsize = get_partition_subsize(bsize, PARTITION_HORZ);
+ BLOCK_SIZE plane_bsize =
+ get_plane_block_size(subsize, xd->plane[AOM_PLANE_U].subsampling_x,
+ xd->plane[AOM_PLANE_U].subsampling_y);
get_variance(&vt.part_variances->horz[0]);
get_variance(&vt.part_variances->horz[1]);
if (vt.part_variances->horz[0].variance < threshold &&
vt.part_variances->horz[1].variance < threshold &&
- get_plane_block_size(subsize, xd->plane[1].subsampling_x,
- xd->plane[1].subsampling_y) < BLOCK_INVALID) {
+ plane_bsize < BLOCK_INVALID) {
set_block_size(cpi, mi_row, mi_col, subsize);
set_block_size(cpi, mi_row + block_height / 2, mi_col, subsize);
return 1;
@@ -251,9 +253,9 @@
static AOM_INLINE int all_blks_inside(int x16_idx, int y16_idx, int pixels_wide,
int pixels_high) {
int all_inside = 1;
- for (int k = 0; k < 4; k++) {
- all_inside &= ((x16_idx + ((k & 1) << 3)) < pixels_wide);
- all_inside &= ((y16_idx + ((k >> 1) << 3)) < pixels_high);
+ for (int idx = 0; idx < 4; idx++) {
+ all_inside &= ((x16_idx + GET_BLK_IDX_X(idx, 3)) < pixels_wide);
+ all_inside &= ((y16_idx + GET_BLK_IDX_Y(idx, 3)) < pixels_high);
}
return all_inside;
}
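
GET_BLK_IDX_X / GET_BLK_IDX_Y replace the repeated bit-twiddling for
addressing the four 8x8 sub-blocks of a 16x16 block. Presumed definitions,
matching the expressions they replace:

    /* Assumed expansions, mirroring ((idx & 1) << sft) and
     * ((idx >> 1) << sft): idx picks one of four sub-blocks in raster
     * order, and sft converts the 0/1 offset to pixels (3 for 8-pixel
     * steps). */
    #define GET_BLK_IDX_X(idx, sft) (((idx) & 1) << (sft))
    #define GET_BLK_IDX_Y(idx, sft) (((idx) >> 1) << (sft))
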
@@ -261,113 +263,116 @@
#if CONFIG_AV1_HIGHBITDEPTH
// TODO(yunqingwang): Perform average of four 8x8 blocks similar to lowbd
static AOM_INLINE void fill_variance_8x8avg_highbd(
- const uint8_t *s, int sp, const uint8_t *d, int dp, int x16_idx,
- int y16_idx, VP16x16 *vst, int pixels_wide, int pixels_high,
- int is_key_frame) {
- for (int k = 0; k < 4; k++) {
- const int x8_idx = x16_idx + ((k & 1) << 3);
- const int y8_idx = y16_idx + ((k >> 1) << 3);
+ const uint8_t *src_buf, int src_stride, const uint8_t *dst_buf,
+ int dst_stride, int x16_idx, int y16_idx, VP16x16 *vst, int pixels_wide,
+ int pixels_high) {
+ for (int idx = 0; idx < 4; idx++) {
+ const int x8_idx = x16_idx + GET_BLK_IDX_X(idx, 3);
+ const int y8_idx = y16_idx + GET_BLK_IDX_Y(idx, 3);
unsigned int sse = 0;
int sum = 0;
if (x8_idx < pixels_wide && y8_idx < pixels_high) {
- int s_avg;
- int d_avg = 128;
- s_avg = aom_highbd_avg_8x8(s + y8_idx * sp + x8_idx, sp);
- if (!is_key_frame)
- d_avg = aom_highbd_avg_8x8(d + y8_idx * dp + x8_idx, dp);
+ int src_avg = aom_highbd_avg_8x8(src_buf + y8_idx * src_stride + x8_idx,
+ src_stride);
+ int dst_avg = aom_highbd_avg_8x8(dst_buf + y8_idx * dst_stride + x8_idx,
+ dst_stride);
- sum = s_avg - d_avg;
+ sum = src_avg - dst_avg;
sse = sum * sum;
}
- fill_variance(sse, sum, 0, &vst->split[k].part_variances.none);
+ fill_variance(sse, sum, 0, &vst->split[idx].part_variances.none);
}
}
#endif
-static AOM_INLINE void fill_variance_8x8avg_lowbd(const uint8_t *s, int sp,
- const uint8_t *d, int dp,
- int x16_idx, int y16_idx,
- VP16x16 *vst, int pixels_wide,
- int pixels_high,
- int is_key_frame) {
+static AOM_INLINE void fill_variance_8x8avg_lowbd(
+ const uint8_t *src_buf, int src_stride, const uint8_t *dst_buf,
+ int dst_stride, int x16_idx, int y16_idx, VP16x16 *vst, int pixels_wide,
+ int pixels_high) {
unsigned int sse[4] = { 0 };
int sum[4] = { 0 };
- int d_avg[4] = { 128, 128, 128, 128 };
- int s_avg[4];
if (all_blks_inside(x16_idx, y16_idx, pixels_wide, pixels_high)) {
- aom_avg_8x8_quad(s, sp, x16_idx, y16_idx, s_avg);
- if (!is_key_frame) aom_avg_8x8_quad(d, dp, x16_idx, y16_idx, d_avg);
- for (int k = 0; k < 4; k++) {
- sum[k] = s_avg[k] - d_avg[k];
- sse[k] = sum[k] * sum[k];
+ int src_avg[4];
+ int dst_avg[4];
+ aom_avg_8x8_quad(src_buf, src_stride, x16_idx, y16_idx, src_avg);
+ aom_avg_8x8_quad(dst_buf, dst_stride, x16_idx, y16_idx, dst_avg);
+ for (int idx = 0; idx < 4; idx++) {
+ sum[idx] = src_avg[idx] - dst_avg[idx];
+ sse[idx] = sum[idx] * sum[idx];
}
} else {
- for (int k = 0; k < 4; k++) {
- const int x8_idx = x16_idx + ((k & 1) << 3);
- const int y8_idx = y16_idx + ((k >> 1) << 3);
+ for (int idx = 0; idx < 4; idx++) {
+ const int x8_idx = x16_idx + GET_BLK_IDX_X(idx, 3);
+ const int y8_idx = y16_idx + GET_BLK_IDX_Y(idx, 3);
if (x8_idx < pixels_wide && y8_idx < pixels_high) {
- s_avg[k] = aom_avg_8x8(s + y8_idx * sp + x8_idx, sp);
- if (!is_key_frame) d_avg[k] = aom_avg_8x8(d + y8_idx * dp + x8_idx, dp);
- sum[k] = s_avg[k] - d_avg[k];
- sse[k] = sum[k] * sum[k];
+ int src_avg =
+ aom_avg_8x8(src_buf + y8_idx * src_stride + x8_idx, src_stride);
+ int dst_avg =
+ aom_avg_8x8(dst_buf + y8_idx * dst_stride + x8_idx, dst_stride);
+ sum[idx] = src_avg - dst_avg;
+ sse[idx] = sum[idx] * sum[idx];
}
}
}
- for (int k = 0; k < 4; k++) {
- fill_variance(sse[k], sum[k], 0, &vst->split[k].part_variances.none);
+ for (int idx = 0; idx < 4; idx++) {
+ fill_variance(sse[idx], sum[idx], 0, &vst->split[idx].part_variances.none);
}
}
// Obtain parameters required to calculate variance (such as sum, sse, etc.)
// at 8x8 sub-block level for a given 16x16 block.
-static AOM_INLINE void fill_variance_8x8avg(const uint8_t *s, int sp,
- const uint8_t *d, int dp,
- int x16_idx, int y16_idx,
- VP16x16 *vst, int highbd_flag,
- int pixels_wide, int pixels_high,
- int is_key_frame) {
+// This function should only be called when is_key_frame is false, since the
+// sum is computed between the source and the reference frames.
+static AOM_INLINE void fill_variance_8x8avg(
+ const uint8_t *src_buf, int src_stride, const uint8_t *dst_buf,
+ int dst_stride, int x16_idx, int y16_idx, VP16x16 *vst, int highbd_flag,
+ int pixels_wide, int pixels_high) {
#if CONFIG_AV1_HIGHBITDEPTH
if (highbd_flag) {
- fill_variance_8x8avg_highbd(s, sp, d, dp, x16_idx, y16_idx, vst,
- pixels_wide, pixels_high, is_key_frame);
+ fill_variance_8x8avg_highbd(src_buf, src_stride, dst_buf, dst_stride,
+ x16_idx, y16_idx, vst, pixels_wide,
+ pixels_high);
return;
}
#else
(void)highbd_flag;
#endif // CONFIG_AV1_HIGHBITDEPTH
- fill_variance_8x8avg_lowbd(s, sp, d, dp, x16_idx, y16_idx, vst, pixels_wide,
- pixels_high, is_key_frame);
+ fill_variance_8x8avg_lowbd(src_buf, src_stride, dst_buf, dst_stride, x16_idx,
+ y16_idx, vst, pixels_wide, pixels_high);
}
-static int compute_minmax_8x8(const uint8_t *s, int sp, const uint8_t *d,
- int dp, int x16_idx, int y16_idx,
+static int compute_minmax_8x8(const uint8_t *src_buf, int src_stride,
+ const uint8_t *dst_buf, int dst_stride,
+ int x16_idx, int y16_idx,
#if CONFIG_AV1_HIGHBITDEPTH
int highbd_flag,
#endif
int pixels_wide, int pixels_high) {
- int k;
int minmax_max = 0;
int minmax_min = 255;
// Loop over the 4 8x8 subblocks.
- for (k = 0; k < 4; k++) {
- int x8_idx = x16_idx + ((k & 1) << 3);
- int y8_idx = y16_idx + ((k >> 1) << 3);
+ for (int idx = 0; idx < 4; idx++) {
+ const int x8_idx = x16_idx + GET_BLK_IDX_X(idx, 3);
+ const int y8_idx = y16_idx + GET_BLK_IDX_Y(idx, 3);
int min = 0;
int max = 0;
if (x8_idx < pixels_wide && y8_idx < pixels_high) {
#if CONFIG_AV1_HIGHBITDEPTH
if (highbd_flag & YV12_FLAG_HIGHBITDEPTH) {
- aom_highbd_minmax_8x8(s + y8_idx * sp + x8_idx, sp,
- d + y8_idx * dp + x8_idx, dp, &min, &max);
+ aom_highbd_minmax_8x8(
+ src_buf + y8_idx * src_stride + x8_idx, src_stride,
+ dst_buf + y8_idx * dst_stride + x8_idx, dst_stride, &min, &max);
} else {
- aom_minmax_8x8(s + y8_idx * sp + x8_idx, sp, d + y8_idx * dp + x8_idx,
- dp, &min, &max);
+ aom_minmax_8x8(src_buf + y8_idx * src_stride + x8_idx, src_stride,
+ dst_buf + y8_idx * dst_stride + x8_idx, dst_stride, &min,
+ &max);
}
#else
- aom_minmax_8x8(s + y8_idx * sp + x8_idx, sp, d + y8_idx * dp + x8_idx, dp,
- &min, &max);
+ aom_minmax_8x8(src_buf + y8_idx * src_stride + x8_idx, src_stride,
+ dst_buf + y8_idx * dst_stride + x8_idx, dst_stride, &min,
+ &max);
#endif
if ((max - min) > minmax_max) minmax_max = (max - min);
if ((max - min) < minmax_min) minmax_min = (max - min);
@@ -376,43 +381,42 @@
return (minmax_max - minmax_min);
}
-static AOM_INLINE void fill_variance_4x4avg(const uint8_t *s, int sp,
- const uint8_t *d, int dp,
- int x8_idx, int y8_idx, VP8x8 *vst,
+// Computes the average and variance of a 4x4 sub-block.
+// This function should only be called when is_key_frame is true, since the
+// sum is computed using the source frame only.
+static AOM_INLINE void fill_variance_4x4avg(const uint8_t *src_buf,
+ int src_stride, int x8_idx,
+ int y8_idx, VP8x8 *vst,
#if CONFIG_AV1_HIGHBITDEPTH
int highbd_flag,
#endif
int pixels_wide, int pixels_high,
- int is_key_frame,
int border_offset_4x4) {
- int k;
- for (k = 0; k < 4; k++) {
- int x4_idx = x8_idx + ((k & 1) << 2);
- int y4_idx = y8_idx + ((k >> 1) << 2);
+ for (int idx = 0; idx < 4; idx++) {
+ const int x4_idx = x8_idx + GET_BLK_IDX_X(idx, 2);
+ const int y4_idx = y8_idx + GET_BLK_IDX_Y(idx, 2);
unsigned int sse = 0;
int sum = 0;
if (x4_idx < pixels_wide - border_offset_4x4 &&
y4_idx < pixels_high - border_offset_4x4) {
- int s_avg;
- int d_avg = 128;
+ int src_avg;
+ int dst_avg = 128;
#if CONFIG_AV1_HIGHBITDEPTH
if (highbd_flag & YV12_FLAG_HIGHBITDEPTH) {
- s_avg = aom_highbd_avg_4x4(s + y4_idx * sp + x4_idx, sp);
- if (!is_key_frame)
- d_avg = aom_highbd_avg_4x4(d + y4_idx * dp + x4_idx, dp);
+ src_avg = aom_highbd_avg_4x4(src_buf + y4_idx * src_stride + x4_idx,
+ src_stride);
} else {
- s_avg = aom_avg_4x4(s + y4_idx * sp + x4_idx, sp);
- if (!is_key_frame) d_avg = aom_avg_4x4(d + y4_idx * dp + x4_idx, dp);
+ src_avg =
+ aom_avg_4x4(src_buf + y4_idx * src_stride + x4_idx, src_stride);
}
#else
- s_avg = aom_avg_4x4(s + y4_idx * sp + x4_idx, sp);
- if (!is_key_frame) d_avg = aom_avg_4x4(d + y4_idx * dp + x4_idx, dp);
+ src_avg = aom_avg_4x4(src_buf + y4_idx * src_stride + x4_idx, src_stride);
#endif
- sum = s_avg - d_avg;
+ sum = src_avg - dst_avg;
sse = sum * sum;
}
- fill_variance(sse, sum, 0, &vst->split[k].part_variances.none);
+ fill_variance(sse, sum, 0, &vst->split[idx].part_variances.none);
}
}
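
In the key-frame path above, each 4x4 source average is compared against a fixed mid-level of 128 (dst_avg), so the resulting sum/sse pair measures deviation from flat content rather than a temporal difference. A self-contained sketch of that convention, with a plain-C stand-in for the SIMD-accelerated aom_avg_4x4:

#include <stdio.h>

/* Plain-C stand-in for aom_avg_4x4: rounded mean of a 4x4 block. */
static int avg_4x4(const unsigned char *buf, int stride) {
  int total = 0;
  for (int r = 0; r < 4; r++)
    for (int c = 0; c < 4; c++) total += buf[r * stride + c];
  return (total + 8) >> 4;
}

int main(void) {
  unsigned char blk[16] = { 200, 200, 200, 200, 200, 200, 200, 200,
                            60,  60,  60,  60,  60,  60,  60,  60 };
  /* Key-frame convention from fill_variance_4x4avg: dst_avg is fixed
   * at mid-grey 128, so sum measures deviation from flat content. */
  const int src_avg = avg_4x4(blk, 4);
  const int dst_avg = 128;
  const int sum = src_avg - dst_avg;
  const unsigned int sse = (unsigned int)(sum * sum);
  printf("src_avg=%d sum=%d sse=%u\n", src_avg, sum, sse);
  return 0;
}
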
@@ -430,101 +434,137 @@
return threshold;
}
-static AOM_INLINE void tune_thresh_based_on_qindex_window(
- int qindex, int th, int win, int fac, int64_t thresholds[]) {
+// Tune thresholds more or less aggressively to prefer larger partitions.
+static AOM_INLINE void tune_thresh_based_on_qindex(
+ AV1_COMP *cpi, int64_t thresholds[], uint64_t block_sad, int current_qindex,
+ int num_pixels, bool is_segment_id_boosted, int source_sad_nonrd,
+ int lighting_change) {
double weight;
-
- if (qindex < th - win)
- weight = 1.0;
- else if (qindex > th + win)
- weight = 0.0;
- else
- weight = 1.0 - (qindex - th + win) / (2 * win);
- thresholds[1] =
- (int)((1 - weight) * (thresholds[1] << 1) + weight * thresholds[1]);
- thresholds[2] =
- (int)((1 - weight) * (thresholds[2] << 1) + weight * thresholds[2]);
- thresholds[3] =
- (int)((1 - weight) * (thresholds[3] << fac) + weight * thresholds[3]);
+ if (cpi->sf.rt_sf.prefer_large_partition_blocks >= 3) {
+ const int win = 20;
+ if (current_qindex < QINDEX_LARGE_BLOCK_THR - win)
+ weight = 1.0;
+ else if (current_qindex > QINDEX_LARGE_BLOCK_THR + win)
+ weight = 0.0;
+ else
+ weight =
+ 1.0 - (current_qindex - QINDEX_LARGE_BLOCK_THR + win) / (2.0 * win);
+ if (num_pixels > RESOLUTION_480P) {
+ for (int i = 0; i < 4; i++) {
+ thresholds[i] <<= 1;
+ }
+ }
+ if (num_pixels <= RESOLUTION_288P) {
+ thresholds[3] = INT64_MAX;
+ if (is_segment_id_boosted == false) {
+ thresholds[1] <<= 2;
+ thresholds[2] <<= (source_sad_nonrd <= kLowSad) ? 5 : 4;
+ } else {
+ thresholds[1] <<= 1;
+ thresholds[2] <<= 3;
+ }
+ // Allow split to 8x8 for superblocks where part of the block has a
+ // moving boundary: permit superblocks with source_sad above a
+ // threshold, but avoid very large source_sad or high-motion content,
+ // so that not too many 8x8 blocks are forced within a superblock.
+ uint64_t avg_source_sad_thresh = 25000;
+ uint64_t block_sad_low = 25000;
+ uint64_t block_sad_high = 50000;
+ if (cpi->svc.temporal_layer_id == 0 &&
+ cpi->svc.number_temporal_layers > 1) {
+ // Increase the sad thresholds for base TL0, as reference/LAST is
+ // 2/4 frames behind (for 2/3 #TL).
+ avg_source_sad_thresh = 40000;
+ block_sad_high = 70000;
+ }
+ if (is_segment_id_boosted == false &&
+ cpi->rc.avg_source_sad < avg_source_sad_thresh &&
+ block_sad > block_sad_low && block_sad < block_sad_high &&
+ !lighting_change) {
+ thresholds[2] = (3 * thresholds[2]) >> 2;
+ thresholds[3] = thresholds[2] << 3;
+ }
+ // Condition the increase of partition thresholds on the segment
+ // and the content. Avoid the increase for superblocks which have
+ // high source sad, unless the whole frame has very high motion
+ // (i.e., cpi->rc.avg_source_sad is very large, in which case all blocks
+ // have high source sad).
+ } else if (num_pixels > RESOLUTION_480P && is_segment_id_boosted == false &&
+ (source_sad_nonrd != kHighSad ||
+ cpi->rc.avg_source_sad > 50000)) {
+ thresholds[0] = (3 * thresholds[0]) >> 1;
+ thresholds[3] = INT64_MAX;
+ if (current_qindex > QINDEX_LARGE_BLOCK_THR) {
+ thresholds[1] =
+ (int)((1 - weight) * (thresholds[1] << 1) + weight * thresholds[1]);
+ thresholds[2] =
+ (int)((1 - weight) * (thresholds[2] << 1) + weight * thresholds[2]);
+ }
+ } else if (current_qindex > QINDEX_LARGE_BLOCK_THR &&
+ is_segment_id_boosted == false &&
+ (source_sad_nonrd != kHighSad ||
+ cpi->rc.avg_source_sad > 50000)) {
+ thresholds[1] =
+ (int)((1 - weight) * (thresholds[1] << 2) + weight * thresholds[1]);
+ thresholds[2] =
+ (int)((1 - weight) * (thresholds[2] << 4) + weight * thresholds[2]);
+ thresholds[3] = INT64_MAX;
+ }
+ } else if (cpi->sf.rt_sf.prefer_large_partition_blocks >= 2) {
+ thresholds[1] <<= (source_sad_nonrd <= kLowSad) ? 2 : 0;
+ thresholds[2] =
+ (source_sad_nonrd <= kLowSad) ? (3 * thresholds[2]) : thresholds[2];
+ } else if (cpi->sf.rt_sf.prefer_large_partition_blocks >= 1) {
+ const int fac = (source_sad_nonrd <= kLowSad) ? 2 : 1;
+ if (current_qindex < QINDEX_LARGE_BLOCK_THR - 45)
+ weight = 1.0;
+ else if (current_qindex > QINDEX_LARGE_BLOCK_THR + 45)
+ weight = 0.0;
+ else
+ weight = 1.0 - (current_qindex - QINDEX_LARGE_BLOCK_THR + 45) / (2.0 * 45);
+ thresholds[1] =
+ (int)((1 - weight) * (thresholds[1] << 1) + weight * thresholds[1]);
+ thresholds[2] =
+ (int)((1 - weight) * (thresholds[2] << 1) + weight * thresholds[2]);
+ thresholds[3] =
+ (int)((1 - weight) * (thresholds[3] << fac) + weight * thresholds[3]);
+ }
+ if (cpi->sf.part_sf.disable_8x8_part_based_on_qidx && (current_qindex < 128))
+ thresholds[3] = INT64_MAX;
}
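
Each (int)((1 - weight) * (thresholds[i] << s) + weight * thresholds[i]) statement above is a linear interpolation between the unchanged threshold (weight 1, low qindex) and a left-shifted one (weight 0, high qindex), with the weight ramping across a +/-win window around QINDEX_LARGE_BLOCK_THR. A standalone sketch of that ramp; the constant value below is illustrative only, and the divide is done in floating point so the ramp is continuous:

#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-in; the real constant lives in the encoder headers. */
#define QINDEX_LARGE_BLOCK_THR 100

static int64_t blend_thresh(int qindex, int win, int shift, int64_t t) {
  double weight;
  if (qindex < QINDEX_LARGE_BLOCK_THR - win)
    weight = 1.0;
  else if (qindex > QINDEX_LARGE_BLOCK_THR + win)
    weight = 0.0;
  else
    weight = 1.0 - (qindex - QINDEX_LARGE_BLOCK_THR + win) / (2.0 * win);
  /* weight = 1 keeps t unchanged; weight = 0 yields t << shift. */
  return (int64_t)((1 - weight) * (double)(t << shift) + weight * (double)t);
}

int main(void) {
  const int64_t t = 1000;
  for (int q = 40; q <= 160; q += 30)
    printf("qindex=%3d -> threshold=%lld\n", q,
           (long long)blend_thresh(q, 45, 1, t));
  return 0;
}
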
-static AOM_INLINE void set_vbp_thresholds(AV1_COMP *cpi, int64_t thresholds[],
- int q, int content_lowsumdiff,
- int source_sad_nonrd,
- int source_sad_rd, int segment_id,
- uint64_t blk_sad,
- int lighting_change) {
- AV1_COMMON *const cm = &cpi->common;
- const int is_key_frame = frame_is_intra_only(cm);
- const int threshold_multiplier = is_key_frame ? 120 : 1;
- const int ac_q = av1_ac_quant_QTX(q, 0, cm->seq_params->bit_depth);
- int64_t threshold_base = (int64_t)(threshold_multiplier * ac_q);
- const int current_qindex = cm->quant_params.base_qindex;
- const int threshold_left_shift = cpi->sf.rt_sf.var_part_split_threshold_shift;
-
- if (is_key_frame) {
- if (cpi->sf.rt_sf.force_large_partition_blocks_intra) {
- const int shift_steps =
- threshold_left_shift - (cpi->oxcf.mode == ALLINTRA ? 7 : 8);
- assert(shift_steps >= 0);
- threshold_base <<= shift_steps;
- }
- thresholds[0] = threshold_base;
- thresholds[1] = threshold_base;
- if (cm->width * cm->height < 1280 * 720) {
- thresholds[2] = threshold_base / 3;
- thresholds[3] = threshold_base >> 1;
- } else {
- int shift_val = 2;
- if (cpi->sf.rt_sf.force_large_partition_blocks_intra) {
- shift_val = 0;
- }
-
- thresholds[2] = threshold_base >> shift_val;
- thresholds[3] = threshold_base >> shift_val;
- }
- thresholds[4] = threshold_base << 2;
- return;
+static void set_vbp_thresholds_key_frame(AV1_COMP *cpi, int64_t thresholds[],
+ int64_t threshold_base,
+ int threshold_left_shift,
+ int num_pixels) {
+ if (cpi->sf.rt_sf.force_large_partition_blocks_intra) {
+ const int shift_steps =
+ threshold_left_shift - (cpi->oxcf.mode == ALLINTRA ? 7 : 8);
+ assert(shift_steps >= 0);
+ threshold_base <<= shift_steps;
}
-
- // Increase partition thresholds for noisy content. Apply it only for
- // superblocks where sumdiff is low, as we assume the sumdiff of superblock
- // whose only change is due to noise will be low (i.e, noise will average
- // out over large block).
- if (cpi->noise_estimate.enabled && content_lowsumdiff &&
- (cm->width * cm->height > 640 * 480) &&
- cm->current_frame.frame_number > 60) {
- NOISE_LEVEL noise_level =
- av1_noise_estimate_extract_level(&cpi->noise_estimate);
- if (noise_level == kHigh)
- threshold_base = (5 * threshold_base) >> 1;
- else if (noise_level == kMedium &&
- !cpi->sf.rt_sf.prefer_large_partition_blocks)
- threshold_base = (5 * threshold_base) >> 2;
- }
- // TODO(kyslov) Enable var based partition adjusment on temporal denoising
-#if 0 // CONFIG_AV1_TEMPORAL_DENOISING
- if (cpi->oxcf.noise_sensitivity > 0 && denoise_svc(cpi) &&
- cpi->oxcf.speed > 5 && cpi->denoiser.denoising_level >= kDenLow)
- threshold_base =
- av1_scale_part_thresh(threshold_base, cpi->denoiser.denoising_level,
- content_state, cpi->svc.temporal_layer_id);
- else
- threshold_base =
- scale_part_thresh_content(threshold_base, cpi->oxcf.speed, cm->width,
- cm->height, cpi->ppi->rtc_ref.non_reference_frame);
-#else
- // Increase base variance threshold based on content_state/sum_diff level.
- threshold_base = scale_part_thresh_content(
- threshold_base, cpi->oxcf.speed, cm->width, cm->height,
- cpi->ppi->rtc_ref.non_reference_frame);
-#endif
- thresholds[0] = threshold_base >> 1;
+ thresholds[0] = threshold_base;
thresholds[1] = threshold_base;
- thresholds[3] = threshold_base << threshold_left_shift;
- if (cm->width >= 1280 && cm->height >= 720)
- thresholds[3] = thresholds[3] << 1;
- if (cm->width * cm->height <= 352 * 288) {
+ if (num_pixels < RESOLUTION_720P) {
+ thresholds[2] = threshold_base / 3;
+ thresholds[3] = threshold_base >> 1;
+ } else {
+ int shift_val = 2;
+ if (cpi->sf.rt_sf.force_large_partition_blocks_intra) {
+ shift_val = 0;
+ }
+
+ thresholds[2] = threshold_base >> shift_val;
+ thresholds[3] = threshold_base >> shift_val;
+ }
+ thresholds[4] = threshold_base << 2;
+}
+
+static AOM_INLINE void tune_thresh_based_on_resolution(
+ AV1_COMP *cpi, int64_t thresholds[], int64_t threshold_base,
+ int current_qindex, int source_sad_rd, int num_pixels) {
+ if (num_pixels >= RESOLUTION_720P) thresholds[3] = thresholds[3] << 1;
+ if (num_pixels <= RESOLUTION_288P) {
const int qindex_thr[5][2] = {
{ 200, 220 }, { 140, 170 }, { 120, 150 }, { 200, 210 }, { 170, 220 },
};
@@ -563,85 +603,99 @@
qi_diff_high * (threshold_base << 3)) /
threshold_diff;
}
- } else if (cm->width < 1280 && cm->height < 720) {
+ } else if (num_pixels < RESOLUTION_720P) {
thresholds[2] = (5 * threshold_base) >> 2;
- } else if (cm->width < 1920 && cm->height < 1080) {
+ } else if (num_pixels < RESOLUTION_1080P) {
thresholds[2] = threshold_base << 1;
- } else if (cm->width < 2560 && cm->height < 1440) {
- thresholds[2] = (5 * threshold_base) >> 1;
} else {
- thresholds[2] = (7 * threshold_base) >> 1;
- }
- // Tune thresholds less or more aggressively to prefer larger partitions
- if (cpi->sf.rt_sf.prefer_large_partition_blocks >= 3) {
- double weight;
- const int win = 20;
- if (current_qindex < QINDEX_LARGE_BLOCK_THR - win)
- weight = 1.0;
- else if (current_qindex > QINDEX_LARGE_BLOCK_THR + win)
- weight = 0.0;
- else
- weight =
- 1.0 - (current_qindex - QINDEX_LARGE_BLOCK_THR + win) / (2 * win);
- if (cm->width * cm->height > 640 * 480) {
- for (int i = 0; i < 4; i++) {
- thresholds[i] <<= 1;
- }
- }
- if (cm->width * cm->height <= 352 * 288) {
- thresholds[3] = INT64_MAX;
- if (segment_id == 0) {
- thresholds[1] <<= 2;
- thresholds[2] <<= (source_sad_nonrd <= kLowSad) ? 5 : 4;
+ // num_pixels >= RESOLUTION_1080P
+ if (cpi->oxcf.tune_cfg.content == AOM_CONTENT_SCREEN) {
+ if (num_pixels < RESOLUTION_1440P) {
+ thresholds[2] = (5 * threshold_base) >> 1;
} else {
- thresholds[1] <<= 1;
- thresholds[2] <<= 3;
+ thresholds[2] = (7 * threshold_base) >> 1;
}
- // Allow for split to 8x8 for superblocks where part of it has
- // moving boundary. So allow for sb with source_sad above threshold,
- // and avoid very large source_sad or high source content, to avoid
- // too many 8x8 within superblock.
- if (segment_id == 0 && cpi->rc.avg_source_sad < 25000 &&
- blk_sad > 25000 && blk_sad < 50000 && !lighting_change) {
- thresholds[2] = (3 * thresholds[2]) >> 2;
- thresholds[3] = thresholds[2] << 3;
+ } else {
+ if (cpi->oxcf.speed > 7) {
+ thresholds[2] = 6 * threshold_base;
+ } else {
+ thresholds[2] = 3 * threshold_base;
}
- // Condition the increase of partition thresholds on the segment
- // and the content. Avoid the increase for superblocks which have
- // high source sad, unless the whole frame has very high motion
- // (i.e, cpi->rc.avg_source_sad is very large, in which case all blocks
- // have high source sad).
- } else if (cm->width * cm->height > 640 * 480 && segment_id == 0 &&
- (source_sad_nonrd != kHighSad ||
- cpi->rc.avg_source_sad > 50000)) {
- thresholds[0] = (3 * thresholds[0]) >> 1;
- thresholds[3] = INT64_MAX;
- if (current_qindex > QINDEX_LARGE_BLOCK_THR) {
- thresholds[1] =
- (int)((1 - weight) * (thresholds[1] << 1) + weight * thresholds[1]);
- thresholds[2] =
- (int)((1 - weight) * (thresholds[2] << 1) + weight * thresholds[2]);
- }
- } else if (current_qindex > QINDEX_LARGE_BLOCK_THR && segment_id == 0 &&
- (source_sad_nonrd != kHighSad ||
- cpi->rc.avg_source_sad > 50000)) {
- thresholds[1] =
- (int)((1 - weight) * (thresholds[1] << 2) + weight * thresholds[1]);
- thresholds[2] =
- (int)((1 - weight) * (thresholds[2] << 4) + weight * thresholds[2]);
- thresholds[3] = INT64_MAX;
}
- } else if (cpi->sf.rt_sf.prefer_large_partition_blocks >= 2) {
- thresholds[1] <<= (source_sad_nonrd <= kLowSad) ? 2 : 0;
- thresholds[2] =
- (source_sad_nonrd <= kLowSad) ? (3 * thresholds[2]) : thresholds[2];
- } else if (cpi->sf.rt_sf.prefer_large_partition_blocks >= 1) {
- const int fac = (source_sad_nonrd <= kLowSad) ? 2 : 1;
- tune_thresh_based_on_qindex_window(current_qindex, QINDEX_LARGE_BLOCK_THR,
- 45, fac, thresholds);
}
- if (cpi->sf.part_sf.disable_8x8_part_based_on_qidx && (current_qindex < 128))
- thresholds[3] = INT64_MAX;
+}
+
+// Increase partition thresholds for noisy content. Apply it only for
+// superblocks where sumdiff is low, as we assume the sumdiff of a superblock
+// whose only change is due to noise will be low (i.e., the noise will average
+// out over a large block).
+static AOM_INLINE int64_t tune_thresh_noisy_content(AV1_COMP *cpi,
+ int64_t threshold_base,
+ int content_lowsumdiff,
+ int num_pixels) {
+ AV1_COMMON *const cm = &cpi->common;
+ int64_t updated_thresh_base = threshold_base;
+ if (cpi->noise_estimate.enabled && content_lowsumdiff &&
+ num_pixels > RESOLUTION_480P && cm->current_frame.frame_number > 60) {
+ NOISE_LEVEL noise_level =
+ av1_noise_estimate_extract_level(&cpi->noise_estimate);
+ if (noise_level == kHigh)
+ updated_thresh_base = (5 * updated_thresh_base) >> 1;
+ else if (noise_level == kMedium &&
+ !cpi->sf.rt_sf.prefer_large_partition_blocks)
+ updated_thresh_base = (5 * updated_thresh_base) >> 2;
+ }
+ // TODO(kyslov) Enable var based partition adjustment on temporal denoising
+#if 0 // CONFIG_AV1_TEMPORAL_DENOISING
+ if (cpi->oxcf.noise_sensitivity > 0 && denoise_svc(cpi) &&
+ cpi->oxcf.speed > 5 && cpi->denoiser.denoising_level >= kDenLow)
+ updated_thresh_base =
+ av1_scale_part_thresh(updated_thresh_base, cpi->denoiser.denoising_level,
+ content_state, cpi->svc.temporal_layer_id);
+ else
+ updated_thresh_base =
+ scale_part_thresh_content(updated_thresh_base, cpi->oxcf.speed, cm->width,
+ cm->height, cpi->ppi->rtc_ref.non_reference_frame);
+#else
+ // Increase base variance threshold based on content_state/sum_diff level.
+ updated_thresh_base = scale_part_thresh_content(
+ updated_thresh_base, cpi->oxcf.speed, cm->width, cm->height,
+ cpi->ppi->rtc_ref.non_reference_frame);
+#endif
+ return updated_thresh_base;
+}
+
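
tune_thresh_noisy_content scales the base threshold by 5/2 for kHigh noise and by 5/4 for kMedium noise (the latter only when large partitions are not already preferred), using shift-based fixed-point arithmetic. A tiny sketch of just those factors, with illustrative stand-in enum values:

#include <stdint.h>
#include <stdio.h>

typedef enum { kLowNoise, kMediumNoise, kHighNoise } NoiseLevelSketch;

static int64_t scale_for_noise(int64_t base, NoiseLevelSketch level,
                               int prefer_large_partitions) {
  if (level == kHighNoise) return (5 * base) >> 1; /* x2.5 */
  if (level == kMediumNoise && !prefer_large_partitions)
    return (5 * base) >> 2;                        /* x1.25 */
  return base;
}

int main(void) {
  printf("%lld %lld %lld\n",
         (long long)scale_for_noise(1000, kHighNoise, 0),   /* 2500 */
         (long long)scale_for_noise(1000, kMediumNoise, 0), /* 1250 */
         (long long)scale_for_noise(1000, kLowNoise, 0));   /* 1000 */
  return 0;
}
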
+static AOM_INLINE void set_vbp_thresholds(
+ AV1_COMP *cpi, int64_t thresholds[], uint64_t blk_sad, int qindex,
+ int content_lowsumdiff, int source_sad_nonrd, int source_sad_rd,
+ bool is_segment_id_boosted, int lighting_change) {
+ AV1_COMMON *const cm = &cpi->common;
+ const int is_key_frame = frame_is_intra_only(cm);
+ const int threshold_multiplier = is_key_frame ? 120 : 1;
+ const int ac_q = av1_ac_quant_QTX(qindex, 0, cm->seq_params->bit_depth);
+ int64_t threshold_base = (int64_t)(threshold_multiplier * ac_q);
+ const int current_qindex = cm->quant_params.base_qindex;
+ const int threshold_left_shift = cpi->sf.rt_sf.var_part_split_threshold_shift;
+ const int num_pixels = cm->width * cm->height;
+
+ if (is_key_frame) {
+ set_vbp_thresholds_key_frame(cpi, thresholds, threshold_base,
+ threshold_left_shift, num_pixels);
+ return;
+ }
+
+ threshold_base = tune_thresh_noisy_content(cpi, threshold_base,
+ content_lowsumdiff, num_pixels);
+ thresholds[0] = threshold_base >> 1;
+ thresholds[1] = threshold_base;
+ thresholds[3] = threshold_base << threshold_left_shift;
+
+ tune_thresh_based_on_resolution(cpi, thresholds, threshold_base,
+ current_qindex, source_sad_rd, num_pixels);
+
+ tune_thresh_based_on_qindex(cpi, thresholds, blk_sad, current_qindex,
+ num_pixels, is_segment_id_boosted,
+ source_sad_nonrd, lighting_change);
}
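
After this refactor, threshold setup reads as a fixed pipeline: the key-frame branch delegates to set_vbp_thresholds_key_frame, while delta frames scale the base for noise, seed thresholds[0], [1] and [3], then apply the resolution and qindex tuning passes. A heavily condensed sketch of that control flow (the helper bodies here are stubs; the key-frame values mirror the sub-720p case in the diff):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Condensed stand-ins for the helpers factored out in this change; the
 * real versions take the full AV1_COMP context and many more inputs. */
static int64_t tune_noisy(int64_t base) { return base; }
static void tune_resolution(int64_t th[5], int64_t base) { th[2] = base << 1; }
static void tune_qindex(int64_t th[5]) { (void)th; }

static void set_thresholds(int64_t th[5], int64_t base, bool key_frame,
                           int left_shift) {
  if (key_frame) { /* mirrors set_vbp_thresholds_key_frame(), sub-720p case */
    th[0] = th[1] = base;
    th[2] = base / 3;
    th[3] = base >> 1;
    th[4] = base << 2;
    return;
  }
  base = tune_noisy(base);   /* tune_thresh_noisy_content() */
  th[0] = base >> 1;
  th[1] = base;
  th[3] = base << left_shift;
  tune_resolution(th, base); /* tune_thresh_based_on_resolution() */
  tune_qindex(th);           /* tune_thresh_based_on_qindex() */
}

int main(void) {
  int64_t th[5] = { 0 };
  set_thresholds(th, 1200, false, 2);
  for (int i = 0; i < 5; i++) printf("th[%d]=%lld\n", i, (long long)th[i]);
  return 0;
}
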
// Set temporal variance low flag for superblock 64x64.
@@ -654,42 +708,43 @@
if ((vt->part_variances).none.variance < (thresholds[0] >> 1))
part_info->variance_low[0] = 1;
} else if (xd->mi[0]->bsize == BLOCK_64X32) {
- for (int i = 0; i < 2; i++) {
- if (vt->part_variances.horz[i].variance < (thresholds[0] >> 2))
- part_info->variance_low[i + 1] = 1;
+ for (int part_idx = 0; part_idx < 2; part_idx++) {
+ if (vt->part_variances.horz[part_idx].variance < (thresholds[0] >> 2))
+ part_info->variance_low[part_idx + 1] = 1;
}
} else if (xd->mi[0]->bsize == BLOCK_32X64) {
- for (int i = 0; i < 2; i++) {
- if (vt->part_variances.vert[i].variance < (thresholds[0] >> 2))
- part_info->variance_low[i + 3] = 1;
+ for (int part_idx = 0; part_idx < 2; part_idx++) {
+ if (vt->part_variances.vert[part_idx].variance < (thresholds[0] >> 2))
+ part_info->variance_low[part_idx + 3] = 1;
}
} else {
static const int idx[4][2] = { { 0, 0 }, { 0, 8 }, { 8, 0 }, { 8, 8 } };
- for (int i = 0; i < 4; i++) {
- const int idx_str =
- mi_params->mi_stride * (mi_row + idx[i][0]) + mi_col + idx[i][1];
+ for (int lvl1_idx = 0; lvl1_idx < 4; lvl1_idx++) {
+ const int idx_str = mi_params->mi_stride * (mi_row + idx[lvl1_idx][0]) +
+ mi_col + idx[lvl1_idx][1];
MB_MODE_INFO **this_mi = mi_params->mi_grid_base + idx_str;
- if (mi_params->mi_cols <= mi_col + idx[i][1] ||
- mi_params->mi_rows <= mi_row + idx[i][0])
+ if (mi_params->mi_cols <= mi_col + idx[lvl1_idx][1] ||
+ mi_params->mi_rows <= mi_row + idx[lvl1_idx][0])
continue;
if (*this_mi == NULL) continue;
if ((*this_mi)->bsize == BLOCK_32X32) {
int64_t threshold_32x32 = (5 * thresholds[1]) >> 3;
- if (vt->split[i].part_variances.none.variance < threshold_32x32)
- part_info->variance_low[i + 5] = 1;
+ if (vt->split[lvl1_idx].part_variances.none.variance < threshold_32x32)
+ part_info->variance_low[lvl1_idx + 5] = 1;
} else {
// For 32x16 and 16x32 blocks, the flag is set on each 16x16 block
// inside.
if ((*this_mi)->bsize == BLOCK_16X16 ||
(*this_mi)->bsize == BLOCK_32X16 ||
(*this_mi)->bsize == BLOCK_16X32) {
- for (int j = 0; j < 4; j++) {
- if (vt->split[i].split[j].part_variances.none.variance <
- (thresholds[2] >> 8))
- part_info->variance_low[(i << 2) + j + 9] = 1;
+ for (int lvl2_idx = 0; lvl2_idx < 4; lvl2_idx++) {
+ if (vt->split[lvl1_idx]
+ .split[lvl2_idx]
+ .part_variances.none.variance < (thresholds[2] >> 8))
+ part_info->variance_low[(lvl1_idx << 2) + lvl2_idx + 9] = 1;
}
}
}
@@ -705,68 +760,74 @@
if (vt->part_variances.none.variance < (thresholds[0] >> 1))
part_info->variance_low[0] = 1;
} else if (xd->mi[0]->bsize == BLOCK_128X64) {
- for (int i = 0; i < 2; i++) {
- if (vt->part_variances.horz[i].variance < (thresholds[0] >> 2))
- part_info->variance_low[i + 1] = 1;
+ for (int part_idx = 0; part_idx < 2; part_idx++) {
+ if (vt->part_variances.horz[part_idx].variance < (thresholds[0] >> 2))
+ part_info->variance_low[part_idx + 1] = 1;
}
} else if (xd->mi[0]->bsize == BLOCK_64X128) {
- for (int i = 0; i < 2; i++) {
- if (vt->part_variances.vert[i].variance < (thresholds[0] >> 2))
- part_info->variance_low[i + 3] = 1;
+ for (int part_idx = 0; part_idx < 2; part_idx++) {
+ if (vt->part_variances.vert[part_idx].variance < (thresholds[0] >> 2))
+ part_info->variance_low[part_idx + 3] = 1;
}
} else {
static const int idx64[4][2] = {
{ 0, 0 }, { 0, 16 }, { 16, 0 }, { 16, 16 }
};
static const int idx32[4][2] = { { 0, 0 }, { 0, 8 }, { 8, 0 }, { 8, 8 } };
- for (int i = 0; i < 4; i++) {
- const int idx_str =
- mi_params->mi_stride * (mi_row + idx64[i][0]) + mi_col + idx64[i][1];
+ for (int lvl1_idx = 0; lvl1_idx < 4; lvl1_idx++) {
+ const int idx_str = mi_params->mi_stride * (mi_row + idx64[lvl1_idx][0]) +
+ mi_col + idx64[lvl1_idx][1];
MB_MODE_INFO **mi_64 = mi_params->mi_grid_base + idx_str;
if (*mi_64 == NULL) continue;
- if (mi_params->mi_cols <= mi_col + idx64[i][1] ||
- mi_params->mi_rows <= mi_row + idx64[i][0])
+ if (mi_params->mi_cols <= mi_col + idx64[lvl1_idx][1] ||
+ mi_params->mi_rows <= mi_row + idx64[lvl1_idx][0])
continue;
const int64_t threshold_64x64 = (5 * thresholds[1]) >> 3;
if ((*mi_64)->bsize == BLOCK_64X64) {
- if (vt->split[i].part_variances.none.variance < threshold_64x64)
- part_info->variance_low[5 + i] = 1;
+ if (vt->split[lvl1_idx].part_variances.none.variance < threshold_64x64)
+ part_info->variance_low[5 + lvl1_idx] = 1;
} else if ((*mi_64)->bsize == BLOCK_64X32) {
- for (int j = 0; j < 2; j++)
- if (vt->split[i].part_variances.horz[j].variance <
+ for (int part_idx = 0; part_idx < 2; part_idx++)
+ if (vt->split[lvl1_idx].part_variances.horz[part_idx].variance <
(threshold_64x64 >> 1))
- part_info->variance_low[9 + (i << 1) + j] = 1;
+ part_info->variance_low[9 + (lvl1_idx << 1) + part_idx] = 1;
} else if ((*mi_64)->bsize == BLOCK_32X64) {
- for (int j = 0; j < 2; j++)
- if (vt->split[i].part_variances.vert[j].variance <
+ for (int part_idx = 0; part_idx < 2; part_idx++)
+ if (vt->split[lvl1_idx].part_variances.vert[part_idx].variance <
(threshold_64x64 >> 1))
- part_info->variance_low[17 + (i << 1) + j] = 1;
+ part_info->variance_low[17 + (lvl1_idx << 1) + part_idx] = 1;
} else {
- for (int k = 0; k < 4; k++) {
- const int idx_str1 = mi_params->mi_stride * idx32[k][0] + idx32[k][1];
+ for (int lvl2_idx = 0; lvl2_idx < 4; lvl2_idx++) {
+ const int idx_str1 =
+ mi_params->mi_stride * idx32[lvl2_idx][0] + idx32[lvl2_idx][1];
MB_MODE_INFO **mi_32 = mi_params->mi_grid_base + idx_str + idx_str1;
if (*mi_32 == NULL) continue;
- if (mi_params->mi_cols <= mi_col + idx64[i][1] + idx32[k][1] ||
- mi_params->mi_rows <= mi_row + idx64[i][0] + idx32[k][0])
+ if (mi_params->mi_cols <=
+ mi_col + idx64[lvl1_idx][1] + idx32[lvl2_idx][1] ||
+ mi_params->mi_rows <=
+ mi_row + idx64[lvl1_idx][0] + idx32[lvl2_idx][0])
continue;
const int64_t threshold_32x32 = (5 * thresholds[2]) >> 3;
if ((*mi_32)->bsize == BLOCK_32X32) {
- if (vt->split[i].split[k].part_variances.none.variance <
- threshold_32x32)
- part_info->variance_low[25 + (i << 2) + k] = 1;
+ if (vt->split[lvl1_idx]
+ .split[lvl2_idx]
+ .part_variances.none.variance < threshold_32x32)
+ part_info->variance_low[25 + (lvl1_idx << 2) + lvl2_idx] = 1;
} else {
// For 32x16 and 16x32 blocks, the flag is set on each 16x16 block
// inside.
if ((*mi_32)->bsize == BLOCK_16X16 ||
(*mi_32)->bsize == BLOCK_32X16 ||
(*mi_32)->bsize == BLOCK_16X32) {
- for (int j = 0; j < 4; j++) {
- if (vt->split[i]
- .split[k]
- .split[j]
- .part_variances.none.variance < (thresholds[3] >> 8))
- part_info->variance_low[41 + (i << 4) + (k << 2) + j] = 1;
+ for (int lvl3_idx = 0; lvl3_idx < 4; lvl3_idx++) {
+ VPartVar *none_var = &vt->split[lvl1_idx]
+ .split[lvl2_idx]
+ .split[lvl3_idx]
+ .part_variances.none;
+ if (none_var->variance < (thresholds[3] >> 8))
+ part_info->variance_low[41 + (lvl1_idx << 4) +
+ (lvl2_idx << 2) + lvl3_idx] = 1;
}
}
}
@@ -779,14 +840,13 @@
static AOM_INLINE void set_low_temp_var_flag(
AV1_COMP *cpi, PartitionSearchInfo *part_info, MACROBLOCKD *xd,
VP128x128 *vt, int64_t thresholds[], MV_REFERENCE_FRAME ref_frame_partition,
- int mi_col, int mi_row) {
+ int mi_col, int mi_row, const bool is_small_sb) {
AV1_COMMON *const cm = &cpi->common;
// Check temporal variance for bsize >= 16x16, if LAST_FRAME was selected.
// If the temporal variance is small, set the variance_low flag
// for the block. The variance threshold can be adjusted; the
// higher it is, the more aggressive the flagging.
if (ref_frame_partition == LAST_FRAME) {
- const int is_small_sb = (cm->seq_params->sb_size == BLOCK_64X64);
if (is_small_sb)
set_low_temp_var_flag_64x64(&cm->mi_params, part_info, xd,
&(vt->split[0]), thresholds, mi_col, mi_row);
@@ -922,37 +982,48 @@
return force_skip_low_temp_var;
}
-void av1_set_variance_partition_thresholds(AV1_COMP *cpi, int q,
+void av1_set_variance_partition_thresholds(AV1_COMP *cpi, int qindex,
int content_lowsumdiff) {
SPEED_FEATURES *const sf = &cpi->sf;
if (sf->part_sf.partition_search_type != VAR_BASED_PARTITION) {
return;
} else {
- set_vbp_thresholds(cpi, cpi->vbp_info.thresholds, q, content_lowsumdiff, 0,
- 0, 0, 0, 0);
+ set_vbp_thresholds(cpi, cpi->vbp_info.thresholds, 0, qindex,
+ content_lowsumdiff, 0, 0, 0, 0);
// The threshold below is not changed locally.
- cpi->vbp_info.threshold_minmax = 15 + (q >> 3);
+ cpi->vbp_info.threshold_minmax = 15 + (qindex >> 3);
}
}
static AOM_INLINE void chroma_check(AV1_COMP *cpi, MACROBLOCK *x,
BLOCK_SIZE bsize, unsigned int y_sad,
- unsigned int y_sad_g, int is_key_frame,
- int zero_motion, unsigned int *uv_sad) {
- int i;
+ unsigned int y_sad_g,
+ unsigned int y_sad_alt, bool is_key_frame,
+ bool zero_motion, unsigned int *uv_sad) {
MACROBLOCKD *xd = &x->e_mbd;
const int source_sad_nonrd = x->content_state_sb.source_sad_nonrd;
int shift_upper_limit = 1;
int shift_lower_limit = 3;
+ int fac_uv = 6;
if (is_key_frame || cpi->oxcf.tool_cfg.enable_monochrome) return;
+ // Use lower threshold (more conservative in setting color flag) for
+ // higher resolutions non-screen, which tend to have more camera noise.
+ // Since this may be used to skip compound mode in nonrd pickmode, which
+ // is generally more effective for higher resolutions, it is better to be
+ // more conservative.
+ if (cpi->oxcf.tune_cfg.content != AOM_CONTENT_SCREEN) {
+ if (cpi->common.width * cpi->common.height >= RESOLUTION_1080P)
+ fac_uv = 3;
+ else
+ fac_uv = 5;
+ }
if (cpi->oxcf.tune_cfg.content == AOM_CONTENT_SCREEN &&
- cpi->rc.high_source_sad)
- shift_lower_limit = 5;
- else if (source_sad_nonrd >= kMedSad &&
- cpi->oxcf.tune_cfg.content != AOM_CONTENT_SCREEN &&
- (int64_t) cpi->common.width * (int64_t) cpi->common.height >=
- (int64_t) 640 * 360) {
+ cpi->rc.high_source_sad) {
+ shift_lower_limit = 7;
+ } else if (source_sad_nonrd >= kMedSad &&
+ cpi->oxcf.tune_cfg.content != AOM_CONTENT_SCREEN &&
+ cpi->common.width * cpi->common.height >= 640 * 360) {
shift_upper_limit = 2;
shift_lower_limit = source_sad_nonrd > kMedSad ? 5 : 4;
}
@@ -961,14 +1032,16 @@
const AV1_COMMON *const cm = &cpi->common;
const YV12_BUFFER_CONFIG *yv12 = get_ref_frame_yv12_buf(cm, LAST_FRAME);
const YV12_BUFFER_CONFIG *yv12_g = get_ref_frame_yv12_buf(cm, GOLDEN_FRAME);
+ const YV12_BUFFER_CONFIG *yv12_alt = get_ref_frame_yv12_buf(cm, ALTREF_FRAME);
const struct scale_factors *const sf =
get_ref_scale_factors_const(cm, LAST_FRAME);
struct buf_2d dst;
unsigned int uv_sad_g = 0;
+ unsigned int uv_sad_alt = 0;
- for (i = 1; i <= 2; ++i) {
- struct macroblock_plane *p = &x->plane[i];
- struct macroblockd_plane *pd = &xd->plane[i];
+ for (int plane = AOM_PLANE_U; plane < MAX_MB_PLANE; ++plane) {
+ struct macroblock_plane *p = &x->plane[plane];
+ struct macroblockd_plane *pd = &xd->plane[plane];
const BLOCK_SIZE bs =
get_plane_block_size(bsize, pd->subsampling_x, pd->subsampling_y);
@@ -976,57 +1049,70 @@
// For last:
if (zero_motion) {
if (mi->ref_frame[0] == LAST_FRAME) {
- uv_sad[i - 1] = cpi->ppi->fn_ptr[bs].sdf(
+ uv_sad[plane - 1] = cpi->ppi->fn_ptr[bs].sdf(
p->src.buf, p->src.stride, pd->pre[0].buf, pd->pre[0].stride);
} else {
- uint8_t *src = (i == 1) ? yv12->u_buffer : yv12->v_buffer;
+ uint8_t *src = (plane == 1) ? yv12->u_buffer : yv12->v_buffer;
setup_pred_plane(&dst, xd->mi[0]->bsize, src, yv12->uv_crop_width,
yv12->uv_crop_height, yv12->uv_stride, xd->mi_row,
- xd->mi_col, sf, xd->plane[i].subsampling_x,
- xd->plane[i].subsampling_y);
+ xd->mi_col, sf, xd->plane[plane].subsampling_x,
+ xd->plane[plane].subsampling_y);
- uv_sad[i - 1] = cpi->ppi->fn_ptr[bs].sdf(p->src.buf, p->src.stride,
- dst.buf, dst.stride);
+ uv_sad[plane - 1] = cpi->ppi->fn_ptr[bs].sdf(
+ p->src.buf, p->src.stride, dst.buf, dst.stride);
}
} else {
- uv_sad[i - 1] = cpi->ppi->fn_ptr[bs].sdf(p->src.buf, p->src.stride,
- pd->dst.buf, pd->dst.stride);
+ uv_sad[plane - 1] = cpi->ppi->fn_ptr[bs].sdf(
+ p->src.buf, p->src.stride, pd->dst.buf, pd->dst.stride);
}
// For golden:
if (y_sad_g != UINT_MAX) {
- uint8_t *src = (i == 1) ? yv12_g->u_buffer : yv12_g->v_buffer;
+ uint8_t *src = (plane == 1) ? yv12_g->u_buffer : yv12_g->v_buffer;
setup_pred_plane(&dst, xd->mi[0]->bsize, src, yv12_g->uv_crop_width,
yv12_g->uv_crop_height, yv12_g->uv_stride, xd->mi_row,
- xd->mi_col, sf, xd->plane[i].subsampling_x,
- xd->plane[i].subsampling_y);
+ xd->mi_col, sf, xd->plane[plane].subsampling_x,
+ xd->plane[plane].subsampling_y);
uv_sad_g = cpi->ppi->fn_ptr[bs].sdf(p->src.buf, p->src.stride, dst.buf,
dst.stride);
}
+
+ // For altref:
+ if (y_sad_alt != UINT_MAX) {
+ uint8_t *src = (plane == 1) ? yv12_alt->u_buffer : yv12_alt->v_buffer;
+ setup_pred_plane(&dst, xd->mi[0]->bsize, src, yv12_alt->uv_crop_width,
+ yv12_alt->uv_crop_height, yv12_alt->uv_stride,
+ xd->mi_row, xd->mi_col, sf,
+ xd->plane[plane].subsampling_x,
+ xd->plane[plane].subsampling_y);
+ uv_sad_alt = cpi->ppi->fn_ptr[bs].sdf(p->src.buf, p->src.stride,
+ dst.buf, dst.stride);
+ }
}
- if (uv_sad[i - 1] > (y_sad >> shift_upper_limit))
- x->color_sensitivity_sb[i - 1] = 1;
- else if (uv_sad[i - 1] < (y_sad >> shift_lower_limit))
- x->color_sensitivity_sb[i - 1] = 0;
+ if (uv_sad[plane - 1] > (y_sad >> shift_upper_limit))
+ x->color_sensitivity_sb[COLOR_SENS_IDX(plane)] = 1;
+ else if (uv_sad[plane - 1] < (y_sad >> shift_lower_limit))
+ x->color_sensitivity_sb[COLOR_SENS_IDX(plane)] = 0;
// Borderline case: to be refined at coding block level in nonrd_pickmode,
// for coding block size < sb_size.
else
- x->color_sensitivity_sb[i - 1] = 2;
+ x->color_sensitivity_sb[COLOR_SENS_IDX(plane)] = 2;
- x->color_sensitivity_sb_g[i - 1] = uv_sad_g > y_sad_g / 6;
+ x->color_sensitivity_sb_g[COLOR_SENS_IDX(plane)] =
+ uv_sad_g > y_sad_g / fac_uv;
+ x->color_sensitivity_sb_alt[COLOR_SENS_IDX(plane)] =
+ uv_sad_alt > y_sad_alt / fac_uv;
}
}
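
The per-plane decision in chroma_check is a tri-state comparison of chroma SAD against scaled luma SAD: above y_sad >> shift_upper_limit marks the plane color-sensitive (1), below y_sad >> shift_lower_limit marks it insensitive (0), and anything in between is left as borderline (2) for refinement in nonrd pickmode. A standalone sketch using the default shifts from the diff:

#include <stdio.h>

/* Tri-state decision from chroma_check(): 1 = color sensitive,
 * 0 = not sensitive, 2 = borderline (refined later per coding block). */
static int color_sensitivity(unsigned int uv_sad, unsigned int y_sad,
                             int shift_upper, int shift_lower) {
  if (uv_sad > (y_sad >> shift_upper)) return 1;
  if (uv_sad < (y_sad >> shift_lower)) return 0;
  return 2;
}

int main(void) {
  const unsigned int y_sad = 8000;
  /* Defaults from the diff: upper shift 1, lower shift 3. */
  printf("%d\n", color_sensitivity(5000, y_sad, 1, 3)); /* 1: > 4000  */
  printf("%d\n", color_sensitivity(500, y_sad, 1, 3));  /* 0: < 1000  */
  printf("%d\n", color_sensitivity(2000, y_sad, 1, 3)); /* 2: between */
  return 0;
}
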
static void fill_variance_tree_leaves(
- AV1_COMP *cpi, MACROBLOCK *x, VP128x128 *vt, VP16x16 *vt2,
- PART_EVAL_STATUS *force_split, int avg_16x16[][4], int maxvar_16x16[][4],
- int minvar_16x16[][4], int *variance4x4downsample, int64_t *thresholds,
- uint8_t *src, int src_stride, const uint8_t *dst, int dst_stride,
- bool is_key_frame) {
- AV1_COMMON *cm = &cpi->common;
+ AV1_COMP *cpi, MACROBLOCK *x, VP128x128 *vt, PART_EVAL_STATUS *force_split,
+ int avg_16x16[][4], int maxvar_16x16[][4], int minvar_16x16[][4],
+ int *variance4x4downsample, int64_t *thresholds, const uint8_t *src_buf,
+ int src_stride, const uint8_t *dst_buf, int dst_stride, bool is_key_frame,
+ const bool is_small_sb) {
MACROBLOCKD *xd = &x->e_mbd;
- const int is_small_sb = (cm->seq_params->sb_size == BLOCK_64X64);
const int num_64x64_blocks = is_small_sb ? 1 : 4;
// TODO(kyslov) Bring back compute_minmax_variance with content type detection
const int compute_minmax_variance = 0;
@@ -1034,6 +1120,8 @@
int pixels_wide = 128, pixels_high = 128;
int border_offset_4x4 = 0;
int temporal_denoising = cpi->sf.rt_sf.use_rtc_tf;
+ // The dst_buf pointer is not used when is_key_frame is true, so it must
+ // be NULL.
+ assert(IMPLIES(is_key_frame, dst_buf == NULL));
if (is_small_sb) {
pixels_wide = 64;
pixels_high = 64;
@@ -1049,121 +1137,236 @@
// data outside the superblock (while it is being modified by temporal filter).
// Temporal filtering is never done on key frames.
if (!is_key_frame && temporal_denoising) border_offset_4x4 = 4;
- for (int m = 0; m < num_64x64_blocks; m++) {
- const int x64_idx = ((m & 1) << 6);
- const int y64_idx = ((m >> 1) << 6);
- const int m2 = m << 2;
- force_split[m + 1] = PART_EVAL_ALL;
+ for (int blk64_idx = 0; blk64_idx < num_64x64_blocks; blk64_idx++) {
+ const int x64_idx = GET_BLK_IDX_X(blk64_idx, 6);
+ const int y64_idx = GET_BLK_IDX_Y(blk64_idx, 6);
+ const int blk64_scale_idx = blk64_idx << 2;
+ force_split[blk64_idx + 1] = PART_EVAL_ALL;
- for (int i = 0; i < 4; i++) {
- const int x32_idx = x64_idx + ((i & 1) << 5);
- const int y32_idx = y64_idx + ((i >> 1) << 5);
- const int i2 = (m2 + i) << 2;
- force_split[5 + m2 + i] = PART_EVAL_ALL;
- avg_16x16[m][i] = 0;
- maxvar_16x16[m][i] = 0;
- minvar_16x16[m][i] = INT_MAX;
- for (int j = 0; j < 4; j++) {
- const int x16_idx = x32_idx + ((j & 1) << 4);
- const int y16_idx = y32_idx + ((j >> 1) << 4);
- const int split_index = 21 + i2 + j;
- VP16x16 *vst = &vt->split[m].split[i].split[j];
+ for (int lvl1_idx = 0; lvl1_idx < 4; lvl1_idx++) {
+ const int x32_idx = x64_idx + GET_BLK_IDX_X(lvl1_idx, 5);
+ const int y32_idx = y64_idx + GET_BLK_IDX_Y(lvl1_idx, 5);
+ const int lvl1_scale_idx = (blk64_scale_idx + lvl1_idx) << 2;
+ force_split[5 + blk64_scale_idx + lvl1_idx] = PART_EVAL_ALL;
+ avg_16x16[blk64_idx][lvl1_idx] = 0;
+ maxvar_16x16[blk64_idx][lvl1_idx] = 0;
+ minvar_16x16[blk64_idx][lvl1_idx] = INT_MAX;
+ for (int lvl2_idx = 0; lvl2_idx < 4; lvl2_idx++) {
+ const int x16_idx = x32_idx + GET_BLK_IDX_X(lvl2_idx, 4);
+ const int y16_idx = y32_idx + GET_BLK_IDX_Y(lvl2_idx, 4);
+ const int split_index = 21 + lvl1_scale_idx + lvl2_idx;
+ VP16x16 *vst = &vt->split[blk64_idx].split[lvl1_idx].split[lvl2_idx];
force_split[split_index] = PART_EVAL_ALL;
- variance4x4downsample[i2 + j] = 0;
- if (!is_key_frame) {
- fill_variance_8x8avg(src, src_stride, dst, dst_stride, x16_idx,
- y16_idx, vst, is_cur_buf_hbd(xd), pixels_wide,
- pixels_high, is_key_frame);
+ variance4x4downsample[lvl1_scale_idx + lvl2_idx] = 0;
+ if (is_key_frame) {
+ force_split[split_index] = PART_EVAL_ALL;
+ // Go down to 4x4 down-sampling for variance.
+ variance4x4downsample[lvl1_scale_idx + lvl2_idx] = 1;
+ for (int lvl3_idx = 0; lvl3_idx < 4; lvl3_idx++) {
+ const int x8_idx = x16_idx + GET_BLK_IDX_X(lvl3_idx, 3);
+ const int y8_idx = y16_idx + GET_BLK_IDX_Y(lvl3_idx, 3);
+ VP8x8 *vst2 = &vst->split[lvl3_idx];
+ fill_variance_4x4avg(src_buf, src_stride, x8_idx, y8_idx, vst2,
+#if CONFIG_AV1_HIGHBITDEPTH
+ xd->cur_buf->flags,
+#endif
+ pixels_wide, pixels_high, border_offset_4x4);
+ }
+ } else {
+ fill_variance_8x8avg(src_buf, src_stride, dst_buf, dst_stride,
+ x16_idx, y16_idx, vst, is_cur_buf_hbd(xd),
+ pixels_wide, pixels_high);
- fill_variance_tree(&vt->split[m].split[i].split[j], BLOCK_16X16);
- get_variance(&vt->split[m].split[i].split[j].part_variances.none);
- avg_16x16[m][i] +=
- vt->split[m].split[i].split[j].part_variances.none.variance;
- if (vt->split[m].split[i].split[j].part_variances.none.variance <
- minvar_16x16[m][i])
- minvar_16x16[m][i] =
- vt->split[m].split[i].split[j].part_variances.none.variance;
- if (vt->split[m].split[i].split[j].part_variances.none.variance >
- maxvar_16x16[m][i])
- maxvar_16x16[m][i] =
- vt->split[m].split[i].split[j].part_variances.none.variance;
- if (vt->split[m].split[i].split[j].part_variances.none.variance >
- thresholds[3]) {
+ fill_variance_tree(vst, BLOCK_16X16);
+ VPartVar *none_var = &vt->split[blk64_idx]
+ .split[lvl1_idx]
+ .split[lvl2_idx]
+ .part_variances.none;
+ get_variance(none_var);
+ const int val_none_var = none_var->variance;
+ avg_16x16[blk64_idx][lvl1_idx] += val_none_var;
+ minvar_16x16[blk64_idx][lvl1_idx] =
+ AOMMIN(minvar_16x16[blk64_idx][lvl1_idx], val_none_var);
+ maxvar_16x16[blk64_idx][lvl1_idx] =
+ AOMMAX(maxvar_16x16[blk64_idx][lvl1_idx], val_none_var);
+ if (val_none_var > thresholds[3]) {
// 16X16 variance is above threshold for split, so force split to
// 8x8 for this 16x16 block (this also forces splits for upper
// levels).
force_split[split_index] = PART_EVAL_ONLY_SPLIT;
- force_split[5 + m2 + i] = PART_EVAL_ONLY_SPLIT;
- force_split[m + 1] = PART_EVAL_ONLY_SPLIT;
+ force_split[5 + blk64_scale_idx + lvl1_idx] = PART_EVAL_ONLY_SPLIT;
+ force_split[blk64_idx + 1] = PART_EVAL_ONLY_SPLIT;
force_split[0] = PART_EVAL_ONLY_SPLIT;
} else if (!cyclic_refresh_segment_id_boosted(segment_id) &&
- compute_minmax_variance &&
- vt->split[m]
- .split[i]
- .split[j]
- .part_variances.none.variance > thresholds[2]) {
+ compute_minmax_variance && val_none_var > thresholds[2]) {
// We have some nominal amount of 16x16 variance (based on average),
// compute the minmax over the 8x8 sub-blocks, and if above
// threshold, force split to 8x8 block for this 16x16 block.
- int minmax = compute_minmax_8x8(src, src_stride, dst, dst_stride,
- x16_idx, y16_idx,
+ int minmax = compute_minmax_8x8(src_buf, src_stride, dst_buf,
+ dst_stride, x16_idx, y16_idx,
#if CONFIG_AV1_HIGHBITDEPTH
xd->cur_buf->flags,
#endif
pixels_wide, pixels_high);
- int thresh_minmax = (int)cpi->vbp_info.threshold_minmax;
+ const int thresh_minmax = (int)cpi->vbp_info.threshold_minmax;
if (minmax > thresh_minmax) {
force_split[split_index] = PART_EVAL_ONLY_SPLIT;
- force_split[5 + m2 + i] = PART_EVAL_ONLY_SPLIT;
- force_split[m + 1] = PART_EVAL_ONLY_SPLIT;
+ force_split[5 + blk64_scale_idx + lvl1_idx] =
+ PART_EVAL_ONLY_SPLIT;
+ force_split[blk64_idx + 1] = PART_EVAL_ONLY_SPLIT;
force_split[0] = PART_EVAL_ONLY_SPLIT;
}
}
}
- if (is_key_frame) {
- force_split[split_index] = PART_EVAL_ALL;
- // Go down to 4x4 down-sampling for variance.
- variance4x4downsample[i2 + j] = 1;
- for (int k = 0; k < 4; k++) {
- int x8_idx = x16_idx + ((k & 1) << 3);
- int y8_idx = y16_idx + ((k >> 1) << 3);
- VP8x8 *vst2 = is_key_frame ? &vst->split[k] : &vt2[i2 + j].split[k];
- fill_variance_4x4avg(
- src, src_stride, dst, dst_stride, x8_idx, y8_idx, vst2,
-#if CONFIG_AV1_HIGHBITDEPTH
- xd->cur_buf->flags,
-#endif
- pixels_wide, pixels_high, is_key_frame, border_offset_4x4);
- }
- }
}
}
}
}
+static AOM_INLINE void set_ref_frame_for_partition(
+ AV1_COMP *cpi, MACROBLOCK *x, MACROBLOCKD *xd,
+ MV_REFERENCE_FRAME *ref_frame_partition, MB_MODE_INFO *mi,
+ unsigned int *y_sad, unsigned int *y_sad_g, unsigned int *y_sad_alt,
+ const YV12_BUFFER_CONFIG *yv12_g, const YV12_BUFFER_CONFIG *yv12_alt,
+ int mi_row, int mi_col, int num_planes) {
+ AV1_COMMON *const cm = &cpi->common;
+ const bool is_set_golden_ref_frame =
+ *y_sad_g < 0.9 * *y_sad && *y_sad_g < *y_sad_alt;
+ const bool is_set_altref_ref_frame =
+ *y_sad_alt < 0.9 * *y_sad && *y_sad_alt < *y_sad_g;
+
+ if (is_set_golden_ref_frame) {
+ av1_setup_pre_planes(xd, 0, yv12_g, mi_row, mi_col,
+ get_ref_scale_factors(cm, GOLDEN_FRAME), num_planes);
+ mi->ref_frame[0] = GOLDEN_FRAME;
+ mi->mv[0].as_int = 0;
+ *y_sad = *y_sad_g;
+ *ref_frame_partition = GOLDEN_FRAME;
+ x->nonrd_prune_ref_frame_search = 0;
+ } else if (is_set_altref_ref_frame) {
+ av1_setup_pre_planes(xd, 0, yv12_alt, mi_row, mi_col,
+ get_ref_scale_factors(cm, ALTREF_FRAME), num_planes);
+ mi->ref_frame[0] = ALTREF_FRAME;
+ mi->mv[0].as_int = 0;
+ *y_sad = *y_sad_alt;
+ *ref_frame_partition = ALTREF_FRAME;
+ x->nonrd_prune_ref_frame_search = 0;
+ } else {
+ *ref_frame_partition = LAST_FRAME;
+ x->nonrd_prune_ref_frame_search =
+ cpi->sf.rt_sf.nonrd_prune_ref_frame_search;
+ }
+}
+
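
The selection predicates above keep LAST unless GOLDEN or ALTREF beats it by more than 10% and also beats the other candidate. A minimal sketch of just that decision rule (the enum tags are illustrative stand-ins for the real MV_REFERENCE_FRAME values):

#include <stdio.h>

typedef enum { REF_LAST, REF_GOLDEN, REF_ALTREF } RefSketch;

/* Mirrors the is_set_golden/altref predicates: a non-LAST reference is
 * chosen only if its SAD is below 0.9 * LAST's and below the other
 * candidate's. */
static RefSketch pick_partition_ref(unsigned int sad_last,
                                    unsigned int sad_g,
                                    unsigned int sad_alt) {
  if (sad_g < 0.9 * sad_last && sad_g < sad_alt) return REF_GOLDEN;
  if (sad_alt < 0.9 * sad_last && sad_alt < sad_g) return REF_ALTREF;
  return REF_LAST;
}

int main(void) {
  printf("%d\n", pick_partition_ref(1000, 850, 900)); /* REF_GOLDEN (1) */
  printf("%d\n", pick_partition_ref(1000, 950, 800)); /* REF_ALTREF (2) */
  printf("%d\n", pick_partition_ref(1000, 950, 980)); /* REF_LAST   (0) */
  return 0;
}
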
+static AOM_FORCE_INLINE int mv_distance(const FULLPEL_MV *mv0,
+ const FULLPEL_MV *mv1) {
+ return abs(mv0->row - mv1->row) + abs(mv0->col - mv1->col);
+}
+
+static AOM_INLINE void evaluate_neighbour_mvs(AV1_COMP *cpi, MACROBLOCK *x,
+ unsigned int *y_sad,
+ bool is_small_sb,
+ int est_motion) {
+ const int source_sad_nonrd = x->content_state_sb.source_sad_nonrd;
+ // TODO([email protected]): test if this condition works with other
+ // speeds.
+ if (est_motion > 2 && source_sad_nonrd > kMedSad) return;
+
+ MACROBLOCKD *xd = &x->e_mbd;
+ BLOCK_SIZE bsize = is_small_sb ? BLOCK_64X64 : BLOCK_128X128;
+ MB_MODE_INFO *mi = xd->mi[0];
+
+ unsigned int above_y_sad = UINT_MAX;
+ unsigned int left_y_sad = UINT_MAX;
+ FULLPEL_MV above_mv = kZeroFullMv;
+ FULLPEL_MV left_mv = kZeroFullMv;
+ SubpelMvLimits subpel_mv_limits;
+ const MV dummy_mv = { 0, 0 };
+ av1_set_subpel_mv_search_range(&subpel_mv_limits, &x->mv_limits, &dummy_mv);
+
+ // Current best MV
+ FULLPEL_MV best_mv = get_fullmv_from_mv(&mi->mv[0].as_mv);
+ const int multi = (est_motion > 2 && source_sad_nonrd > kLowSad) ? 7 : 8;
+
+ if (xd->up_available) {
+ const MB_MODE_INFO *above_mbmi = xd->above_mbmi;
+ if (above_mbmi->mode >= INTRA_MODE_END &&
+ above_mbmi->ref_frame[0] == LAST_FRAME) {
+ MV temp = above_mbmi->mv[0].as_mv;
+ clamp_mv(&temp, &subpel_mv_limits);
+ above_mv = get_fullmv_from_mv(&temp);
+
+ if (mv_distance(&best_mv, &above_mv) > 0) {
+ uint8_t const *ref_buf =
+ get_buf_from_fullmv(&xd->plane[0].pre[0], &above_mv);
+ above_y_sad = cpi->ppi->fn_ptr[bsize].sdf(
+ x->plane[0].src.buf, x->plane[0].src.stride, ref_buf,
+ xd->plane[0].pre[0].stride);
+ }
+ }
+ }
+ if (xd->left_available) {
+ const MB_MODE_INFO *left_mbmi = xd->left_mbmi;
+ if (left_mbmi->mode >= INTRA_MODE_END &&
+ left_mbmi->ref_frame[0] == LAST_FRAME) {
+ MV temp = left_mbmi->mv[0].as_mv;
+ clamp_mv(&temp, &subpel_mv_limits);
+ left_mv = get_fullmv_from_mv(&temp);
+
+ if (mv_distance(&best_mv, &left_mv) > 0 &&
+ mv_distance(&above_mv, &left_mv) > 0) {
+ uint8_t const *ref_buf =
+ get_buf_from_fullmv(&xd->plane[0].pre[0], &left_mv);
+ left_y_sad = cpi->ppi->fn_ptr[bsize].sdf(
+ x->plane[0].src.buf, x->plane[0].src.stride, ref_buf,
+ xd->plane[0].pre[0].stride);
+ }
+ }
+ }
+
+ if (above_y_sad < ((multi * *y_sad) >> 3) && above_y_sad < left_y_sad) {
+ *y_sad = above_y_sad;
+ mi->mv[0].as_mv = get_mv_from_fullmv(&above_mv);
+ clamp_mv(&mi->mv[0].as_mv, &subpel_mv_limits);
+ }
+ if (left_y_sad < ((multi * *y_sad) >> 3) && left_y_sad < above_y_sad) {
+ *y_sad = left_y_sad;
+ mi->mv[0].as_mv = get_mv_from_fullmv(&left_mv);
+ clamp_mv(&mi->mv[0].as_mv, &subpel_mv_limits);
+ }
+}
+
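
A neighbour's MV replaces the current best in evaluate_neighbour_mvs only if its SAD falls below multi/8 of the current SAD: multi is 7 (require a ~12.5% improvement) when est_motion > 2 with non-low source SAD, and 8 (accept any improvement) otherwise. A sketch of the acceptance test:

#include <stdbool.h>
#include <stdio.h>

/* Acceptance rule from evaluate_neighbour_mvs(): the neighbour SAD must
 * be under (multi/8) of the current SAD, where multi is 7 or 8. */
static bool accept_neighbour_mv(unsigned int neigh_sad, unsigned int cur_sad,
                                int multi) {
  return neigh_sad < ((multi * cur_sad) >> 3);
}

int main(void) {
  /* multi = 7: require a ~12.5% improvement; multi = 8: any improvement. */
  printf("%d\n", accept_neighbour_mv(880, 1000, 7)); /* 0: 880 >= 875 */
  printf("%d\n", accept_neighbour_mv(860, 1000, 7)); /* 1: 860 < 875  */
  printf("%d\n", accept_neighbour_mv(999, 1000, 8)); /* 1: 999 < 1000 */
  return 0;
}
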
static void setup_planes(AV1_COMP *cpi, MACROBLOCK *x, unsigned int *y_sad,
unsigned int *y_sad_g, unsigned int *y_sad_alt,
unsigned int *y_sad_last,
MV_REFERENCE_FRAME *ref_frame_partition, int mi_row,
- int mi_col) {
+ int mi_col, bool is_small_sb, bool scaled_ref_last) {
AV1_COMMON *const cm = &cpi->common;
MACROBLOCKD *xd = &x->e_mbd;
const int num_planes = av1_num_planes(cm);
- const int is_small_sb = (cm->seq_params->sb_size == BLOCK_64X64);
BLOCK_SIZE bsize = is_small_sb ? BLOCK_64X64 : BLOCK_128X128;
MB_MODE_INFO *mi = xd->mi[0];
- const YV12_BUFFER_CONFIG *yv12 = get_ref_frame_yv12_buf(cm, LAST_FRAME);
+ const YV12_BUFFER_CONFIG *yv12 =
+ scaled_ref_last ? av1_get_scaled_ref_frame(cpi, LAST_FRAME)
+ : get_ref_frame_yv12_buf(cm, LAST_FRAME);
assert(yv12 != NULL);
const YV12_BUFFER_CONFIG *yv12_g = NULL;
const YV12_BUFFER_CONFIG *yv12_alt = NULL;
// Check if LAST is a reference. For spatial layers always use it as
- // reference scaling (golden or altref being lower resolution) is not
- // handled/check here.
+ // reference.
int use_last_ref = (cpi->ref_frame_flags & AOM_LAST_FLAG) ||
cpi->svc.number_spatial_layers > 1;
int use_golden_ref = cpi->ref_frame_flags & AOM_GOLD_FLAG;
int use_alt_ref = cpi->ppi->rtc_ref.set_ref_frame_config ||
- cpi->sf.rt_sf.use_nonrd_altref_frame;
+ cpi->sf.rt_sf.use_nonrd_altref_frame ||
+ (cpi->sf.rt_sf.use_comp_ref_nonrd &&
+ cpi->sf.rt_sf.ref_frame_comp_nonrd[2] == 1);
+ // On a resized frame (reference has different scale) only use
+ // LAST as reference for partitioning for now.
+ if (scaled_ref_last) {
+ use_golden_ref = 0;
+ use_alt_ref = 0;
+ }
// For 1 spatial layer: GOLDEN is another temporal reference.
// Check if it should be used as reference for partitioning.
@@ -1174,8 +1377,9 @@
av1_setup_pre_planes(xd, 0, yv12_g, mi_row, mi_col,
get_ref_scale_factors(cm, GOLDEN_FRAME), num_planes);
*y_sad_g = cpi->ppi->fn_ptr[bsize].sdf(
- x->plane[0].src.buf, x->plane[0].src.stride, xd->plane[0].pre[0].buf,
- xd->plane[0].pre[0].stride);
+ x->plane[AOM_PLANE_Y].src.buf, x->plane[AOM_PLANE_Y].src.stride,
+ xd->plane[AOM_PLANE_Y].pre[0].buf,
+ xd->plane[AOM_PLANE_Y].pre[0].stride);
}
}
@@ -1189,57 +1393,60 @@
av1_setup_pre_planes(xd, 0, yv12_alt, mi_row, mi_col,
get_ref_scale_factors(cm, ALTREF_FRAME), num_planes);
*y_sad_alt = cpi->ppi->fn_ptr[bsize].sdf(
- x->plane[0].src.buf, x->plane[0].src.stride, xd->plane[0].pre[0].buf,
- xd->plane[0].pre[0].stride);
+ x->plane[AOM_PLANE_Y].src.buf, x->plane[AOM_PLANE_Y].src.stride,
+ xd->plane[AOM_PLANE_Y].pre[0].buf,
+ xd->plane[AOM_PLANE_Y].pre[0].stride);
}
}
if (use_last_ref) {
- av1_setup_pre_planes(xd, 0, yv12, mi_row, mi_col,
- get_ref_scale_factors(cm, LAST_FRAME), num_planes);
+ const int source_sad_nonrd = x->content_state_sb.source_sad_nonrd;
+ av1_setup_pre_planes(
+ xd, 0, yv12, mi_row, mi_col,
+ scaled_ref_last ? NULL : get_ref_scale_factors(cm, LAST_FRAME),
+ num_planes);
mi->ref_frame[0] = LAST_FRAME;
mi->ref_frame[1] = NONE_FRAME;
mi->bsize = cm->seq_params->sb_size;
mi->mv[0].as_int = 0;
mi->interp_filters = av1_broadcast_interp_filter(BILINEAR);
- if (cpi->sf.rt_sf.estimate_motion_for_var_based_partition) {
+
+ int est_motion = cpi->sf.rt_sf.estimate_motion_for_var_based_partition;
+ // TODO(b/290596301): Look into adjusting this condition.
+ // There is a regression on color content when
+ // estimate_motion_for_var_based_partition = 3 and motion is high,
+ // so for now force it to 2 based on the superblock sad.
+ if (est_motion > 2 && source_sad_nonrd > kMedSad) est_motion = 2;
+
+ if (est_motion == 1 || est_motion == 2) {
if (xd->mb_to_right_edge >= 0 && xd->mb_to_bottom_edge >= 0) {
const MV dummy_mv = { 0, 0 };
*y_sad = av1_int_pro_motion_estimation(cpi, x, cm->seq_params->sb_size,
mi_row, mi_col, &dummy_mv);
}
}
+
if (*y_sad == UINT_MAX) {
*y_sad = cpi->ppi->fn_ptr[bsize].sdf(
- x->plane[0].src.buf, x->plane[0].src.stride, xd->plane[0].pre[0].buf,
- xd->plane[0].pre[0].stride);
+ x->plane[AOM_PLANE_Y].src.buf, x->plane[AOM_PLANE_Y].src.stride,
+ xd->plane[AOM_PLANE_Y].pre[0].buf,
+ xd->plane[AOM_PLANE_Y].pre[0].stride);
}
+
+ // Evaluate if neighbours' MVs give better predictions. Zero MV is tested
+ // already, so only non-zero MVs are tested here. The neighbour blocks
+ // considered are the first blocks above and to the left of this superblock.
+ if (est_motion >= 2 && (xd->up_available || xd->left_available))
+ evaluate_neighbour_mvs(cpi, x, y_sad, is_small_sb, est_motion);
+
*y_sad_last = *y_sad;
}
// Pick the ref frame for partitioning; use the golden or altref frame only
// if it has lower sad, with a bias to LAST by a factor of 0.9.
- if (*y_sad_g < 0.9 * *y_sad && *y_sad_g < *y_sad_alt) {
- av1_setup_pre_planes(xd, 0, yv12_g, mi_row, mi_col,
- get_ref_scale_factors(cm, GOLDEN_FRAME), num_planes);
- mi->ref_frame[0] = GOLDEN_FRAME;
- mi->mv[0].as_int = 0;
- *y_sad = *y_sad_g;
- *ref_frame_partition = GOLDEN_FRAME;
- x->nonrd_prune_ref_frame_search = 0;
- } else if (*y_sad_alt < 0.9 * *y_sad && *y_sad_alt < *y_sad_g) {
- av1_setup_pre_planes(xd, 0, yv12_alt, mi_row, mi_col,
- get_ref_scale_factors(cm, ALTREF_FRAME), num_planes);
- mi->ref_frame[0] = ALTREF_FRAME;
- mi->mv[0].as_int = 0;
- *y_sad = *y_sad_alt;
- *ref_frame_partition = ALTREF_FRAME;
- x->nonrd_prune_ref_frame_search = 0;
- } else {
- *ref_frame_partition = LAST_FRAME;
- x->nonrd_prune_ref_frame_search =
- cpi->sf.rt_sf.nonrd_prune_ref_frame_search;
- }
+ set_ref_frame_for_partition(cpi, x, xd, ref_frame_partition, mi, y_sad,
+ y_sad_g, y_sad_alt, yv12_g, yv12_alt, mi_row,
+ mi_col, num_planes);
// Only calculate the predictor for non-zero MV.
if (mi->mv[0].as_int != 0) {
@@ -1255,9 +1462,10 @@
static AOM_INLINE PART_EVAL_STATUS get_part_eval_based_on_sub_blk_var(
VP16x16 *var_16x16_info, int64_t threshold16) {
int max_8x8_var = 0, min_8x8_var = INT_MAX;
- for (int k = 0; k < 4; k++) {
- get_variance(&var_16x16_info->split[k].part_variances.none);
- int this_8x8_var = var_16x16_info->split[k].part_variances.none.variance;
+ for (int split_idx = 0; split_idx < 4; split_idx++) {
+ get_variance(&var_16x16_info->split[split_idx].part_variances.none);
+ int this_8x8_var =
+ var_16x16_info->split[split_idx].part_variances.none.variance;
max_8x8_var = AOMMAX(this_8x8_var, max_8x8_var);
min_8x8_var = AOMMIN(this_8x8_var, min_8x8_var);
}
@@ -1282,6 +1490,44 @@
return false;
}
+static AOM_INLINE bool set_force_zeromv_skip_for_sb(
+ AV1_COMP *cpi, MACROBLOCK *x, const TileInfo *const tile, VP16x16 *vt2,
+ VP128x128 *vt, unsigned int *uv_sad, int mi_row, int mi_col,
+ unsigned int y_sad, BLOCK_SIZE bsize) {
+ AV1_COMMON *const cm = &cpi->common;
+ if (!is_set_force_zeromv_skip_based_on_src_sad(
+ cpi->sf.rt_sf.set_zeromv_skip_based_on_source_sad,
+ x->content_state_sb.source_sad_nonrd))
+ return false;
+ const int block_width = mi_size_wide[cm->seq_params->sb_size];
+ const int block_height = mi_size_high[cm->seq_params->sb_size];
+ const unsigned int thresh_exit_part_y =
+ cpi->zeromv_skip_thresh_exit_part[bsize];
+ unsigned int thresh_exit_part_uv =
+ CALC_CHROMA_THRESH_FOR_ZEROMV_SKIP(thresh_exit_part_y);
+ // Be more aggressive with the UV threshold if source_sad >= kVeryLowSad,
+ // to suppress visual artifacts caused by the speed feature
+ // set_zeromv_skip_based_on_source_sad = 2. For now apply this only when
+ // part_early_exit_zeromv = 1.
+ if (x->content_state_sb.source_sad_nonrd >= kVeryLowSad &&
+ cpi->sf.rt_sf.part_early_exit_zeromv == 1)
+ thresh_exit_part_uv = thresh_exit_part_uv >> 3;
+ if (mi_col + block_width <= tile->mi_col_end &&
+ mi_row + block_height <= tile->mi_row_end && y_sad < thresh_exit_part_y &&
+ uv_sad[0] < thresh_exit_part_uv && uv_sad[1] < thresh_exit_part_uv) {
+ set_block_size(cpi, mi_row, mi_col, bsize);
+ x->force_zeromv_skip_for_sb = 1;
+ if (vt2) aom_free(vt2);
+ if (vt) aom_free(vt);
+ // Partition shape is set here at SB level.
+ // Exit needs to happen from av1_choose_var_based_partitioning().
+ return true;
+ } else if (x->content_state_sb.source_sad_nonrd == kZeroSad &&
+ cpi->sf.rt_sf.part_early_exit_zeromv >= 2)
+ x->force_zeromv_skip_for_sb = 2;
+ return false;
+}
+
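
The early exit above commits the whole superblock to zero-MV skip only when the block fits inside the tile and the luma and both chroma SADs are under their thresholds; the chroma threshold is derived from the luma one via CALC_CHROMA_THRESH_FOR_ZEROMV_SKIP, whose definition is not shown in this hunk. A sketch of the SAD gate alone, with purely illustrative threshold values:

#include <stdbool.h>
#include <stdio.h>

/* Gate from set_force_zeromv_skip_for_sb(): all three SADs must be under
 * their thresholds (thresh_uv is derived from thresh_y in the encoder;
 * the derivation macro is not shown here, so the values below are
 * purely illustrative). */
static bool zeromv_skip_ok(unsigned int y_sad, const unsigned int uv_sad[2],
                           unsigned int thresh_y, unsigned int thresh_uv) {
  return y_sad < thresh_y && uv_sad[0] < thresh_uv && uv_sad[1] < thresh_uv;
}

int main(void) {
  const unsigned int uv_ok[2] = { 90, 110 };
  const unsigned int uv_bad[2] = { 90, 600 };
  printf("%d\n", zeromv_skip_ok(300, uv_ok, 400, 200));  /* 1 */
  printf("%d\n", zeromv_skip_ok(300, uv_bad, 400, 200)); /* 0 */
  return 0;
}
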
int av1_choose_var_based_partitioning(AV1_COMP *cpi, const TileInfo *const tile,
ThreadData *td, MACROBLOCK *x, int mi_row,
int mi_col) {
@@ -1291,8 +1537,6 @@
AV1_COMMON *const cm = &cpi->common;
MACROBLOCKD *xd = &x->e_mbd;
const int64_t *const vbp_thresholds = cpi->vbp_info.thresholds;
-
- int i, j, k, m;
VP128x128 *vt;
VP16x16 *vt2 = NULL;
PART_EVAL_STATUS force_split[85];
@@ -1307,22 +1551,22 @@
int maxvar_16x16[4][4];
int minvar_16x16[4][4];
int64_t threshold_4x4avg;
- uint8_t *s;
- const uint8_t *d;
- int sp;
- int dp;
- unsigned int uv_sad[2];
+ const uint8_t *src_buf;
+ const uint8_t *dst_buf;
+ int dst_stride;
+ unsigned int uv_sad[MAX_MB_PLANE - 1];
NOISE_LEVEL noise_level = kLow;
- int zero_motion = 1;
+ bool is_zero_motion = true;
+ bool scaled_ref_last = false;
- int is_key_frame =
+ bool is_key_frame =
(frame_is_intra_only(cm) ||
(cpi->ppi->use_svc &&
cpi->svc.layer_context[cpi->svc.temporal_layer_id].is_key_frame));
assert(cm->seq_params->sb_size == BLOCK_64X64 ||
cm->seq_params->sb_size == BLOCK_128X128);
- const int is_small_sb = (cm->seq_params->sb_size == BLOCK_64X64);
+ const bool is_small_sb = (cm->seq_params->sb_size == BLOCK_64X64);
const int num_64x64_blocks = is_small_sb ? 1 : 4;
unsigned int y_sad = UINT_MAX;
@@ -1346,7 +1590,8 @@
int variance4x4downsample[64];
const int segment_id = xd->mi[0]->segment_id;
uint64_t blk_sad = 0;
- if (cpi->src_sad_blk_64x64 != NULL && !cpi->ppi->use_svc) {
+ if (cpi->src_sad_blk_64x64 != NULL &&
+ cpi->svc.spatial_layer_id == cpi->svc.number_spatial_layers - 1) {
const int sb_size_by_mb = (cm->seq_params->sb_size == BLOCK_128X128)
? (cm->seq_params->mib_size >> 1)
: cm->seq_params->mib_size;
@@ -1357,27 +1602,23 @@
blk_sad = cpi->src_sad_blk_64x64[sbi_col + sbi_row * sb_cols];
}
- if (cpi->oxcf.q_cfg.aq_mode == CYCLIC_REFRESH_AQ && cm->seg.enabled &&
- cyclic_refresh_segment_id_boosted(segment_id)) {
- const int q =
- av1_get_qindex(&cm->seg, segment_id, cm->quant_params.base_qindex);
- set_vbp_thresholds(cpi, thresholds, q, x->content_state_sb.low_sumdiff,
- x->content_state_sb.source_sad_nonrd,
- x->content_state_sb.source_sad_rd, 1, blk_sad,
- x->content_state_sb.lighting_change);
- } else {
- set_vbp_thresholds(cpi, thresholds, cm->quant_params.base_qindex,
- x->content_state_sb.low_sumdiff,
- x->content_state_sb.source_sad_nonrd,
- x->content_state_sb.source_sad_rd, 0, blk_sad,
- x->content_state_sb.lighting_change);
- }
+ const bool is_segment_id_boosted =
+ cpi->oxcf.q_cfg.aq_mode == CYCLIC_REFRESH_AQ && cm->seg.enabled &&
+ cyclic_refresh_segment_id_boosted(segment_id);
+ const int qindex =
+ is_segment_id_boosted
+ ? av1_get_qindex(&cm->seg, segment_id, cm->quant_params.base_qindex)
+ : cm->quant_params.base_qindex;
+ set_vbp_thresholds(
+ cpi, thresholds, blk_sad, qindex, x->content_state_sb.low_sumdiff,
+ x->content_state_sb.source_sad_nonrd, x->content_state_sb.source_sad_rd,
+ is_segment_id_boosted, x->content_state_sb.lighting_change);
  // For non-key frames, disable 4x4 average for low resolution when speed = 8.
threshold_4x4avg = INT64_MAX;
- s = x->plane[0].src.buf;
- sp = x->plane[0].src.stride;
+ src_buf = x->plane[AOM_PLANE_Y].src.buf;
+ int src_stride = x->plane[AOM_PLANE_Y].src.stride;
  // Index for force_split: 0 for the whole superblock, 1-4 for the 64x64
  // blocks, 5-20 for the 32x32 blocks, and 21-84 for the 16x16 blocks.
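
The index arithmetic behind this layout recurs throughout the loops below; the following standalone sketch reproduces just the force_split[] index math for the 32x32 and 16x16 entries (the helper names are illustrative, not libaom API):

#include <stdio.h>

static int force_split_idx_32x32(int blk64_idx, int lvl1_idx) {
  return 5 + (blk64_idx << 2) + lvl1_idx;  // 5..20
}

static int force_split_idx_16x16(int blk64_idx, int lvl1_idx, int lvl2_idx) {
  const int lvl1_scale_idx = ((blk64_idx << 2) + lvl1_idx) << 2;
  return 21 + lvl1_scale_idx + lvl2_idx;  // 21..84
}

int main(void) {
  // Last block at each level: prints "20 84".
  printf("%d %d\n", force_split_idx_32x32(3, 3),
         force_split_idx_16x16(3, 3, 3));
  return 0;
}
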
@@ -1385,50 +1626,51 @@
memset(x->part_search_info.variance_low, 0,
sizeof(x->part_search_info.variance_low));
- // Check if LAST frame is NULL or if the resolution of LAST is
- // different than the current frame resolution, and if so, treat this frame
+ // Check if LAST frame is NULL, and if so, treat this frame
// as a key frame, for the purpose of the superblock partitioning.
// LAST == NULL can happen in cases where enhancement spatial layers are
  // enabled dynamically and the only reference is the spatial one (GOLDEN).
- // TODO(marpan): Check se of scaled references for the different resoln.
+  // If the LAST frame has a different resolution, set the scaled_ref_last
+  // flag and check whether the scaled reference (ref_scaled) is NULL.
if (!frame_is_intra_only(cm)) {
- const YV12_BUFFER_CONFIG *const ref =
- get_ref_frame_yv12_buf(cm, LAST_FRAME);
- if (ref == NULL || ref->y_crop_height != cm->height ||
- ref->y_crop_width != cm->width) {
- is_key_frame = 1;
+ const YV12_BUFFER_CONFIG *ref = get_ref_frame_yv12_buf(cm, LAST_FRAME);
+ if (ref == NULL) {
+ is_key_frame = true;
+ } else if (ref->y_crop_height != cm->height ||
+ ref->y_crop_width != cm->width) {
+ scaled_ref_last = true;
+ const YV12_BUFFER_CONFIG *ref_scaled =
+ av1_get_scaled_ref_frame(cpi, LAST_FRAME);
+ if (ref_scaled == NULL) is_key_frame = true;
}
}
if (!is_key_frame) {
setup_planes(cpi, x, &y_sad, &y_sad_g, &y_sad_alt, &y_sad_last,
- &ref_frame_partition, mi_row, mi_col);
+ &ref_frame_partition, mi_row, mi_col, is_small_sb,
+ scaled_ref_last);
MB_MODE_INFO *mi = xd->mi[0];
// Use reference SB directly for zero mv.
if (mi->mv[0].as_int != 0) {
- d = xd->plane[0].dst.buf;
- dp = xd->plane[0].dst.stride;
- zero_motion = 0;
+ dst_buf = xd->plane[AOM_PLANE_Y].dst.buf;
+ dst_stride = xd->plane[AOM_PLANE_Y].dst.stride;
+ is_zero_motion = false;
} else {
- d = xd->plane[0].pre[0].buf;
- dp = xd->plane[0].pre[0].stride;
+ dst_buf = xd->plane[AOM_PLANE_Y].pre[0].buf;
+ dst_stride = xd->plane[AOM_PLANE_Y].pre[0].stride;
}
} else {
- d = AV1_VAR_OFFS;
- dp = 0;
+ dst_buf = NULL;
+ dst_stride = 0;
}
- uv_sad[0] = 0;
- uv_sad[1] = 0;
- chroma_check(cpi, x, bsize, y_sad_last, y_sad_g, is_key_frame, zero_motion,
- uv_sad);
+  // Check and set the color sensitivity of the SB.
+ av1_zero(uv_sad);
+ chroma_check(cpi, x, bsize, y_sad_last, y_sad_g, y_sad_alt, is_key_frame,
+ is_zero_motion, uv_sad);
x->force_zeromv_skip_for_sb = 0;
- const bool is_set_force_zeromv_skip =
- is_set_force_zeromv_skip_based_on_src_sad(
- cpi->sf.rt_sf.set_zeromv_skip_based_on_source_sad,
- x->content_state_sb.source_sad_nonrd);
// If the superblock is completely static (zero source sad) and
// the y_sad (relative to LAST ref) is very small, take the sb_size partition
@@ -1438,27 +1680,11 @@
// Condition on color uv_sad is also added.
if (!is_key_frame && cpi->sf.rt_sf.part_early_exit_zeromv &&
cpi->rc.frames_since_key > 30 && segment_id == CR_SEGMENT_ID_BASE &&
- is_set_force_zeromv_skip && ref_frame_partition == LAST_FRAME &&
- xd->mi[0]->mv[0].as_int == 0) {
- const int block_width = mi_size_wide[cm->seq_params->sb_size];
- const int block_height = mi_size_high[cm->seq_params->sb_size];
- const unsigned int thresh_exit_part_y =
- cpi->zeromv_skip_thresh_exit_part[bsize];
- const unsigned int thresh_exit_part_uv =
- CALC_CHROMA_THRESH_FOR_ZEROMV_SKIP(thresh_exit_part_y);
- if (mi_col + block_width <= tile->mi_col_end &&
- mi_row + block_height <= tile->mi_row_end &&
- y_sad < thresh_exit_part_y && uv_sad[0] < thresh_exit_part_uv &&
- uv_sad[1] < thresh_exit_part_uv) {
- set_block_size(cpi, mi_row, mi_col, bsize);
- x->force_zeromv_skip_for_sb = 1;
- if (vt2) aom_free(vt2);
- if (vt) aom_free(vt);
+ ref_frame_partition == LAST_FRAME && xd->mi[0]->mv[0].as_int == 0) {
+    // Exit here if the zero-mv skip flag is set at the SB level.
+ if (set_force_zeromv_skip_for_sb(cpi, x, tile, vt2, vt, uv_sad, mi_row,
+ mi_col, y_sad, bsize))
return 0;
- } else if (x->content_state_sb.source_sad_nonrd == kZeroSad &&
- cpi->sf.rt_sf.part_early_exit_zeromv >= 2) {
- x->force_zeromv_skip_for_sb = 2;
- }
}
if (cpi->noise_estimate.enabled)
@@ -1468,95 +1694,102 @@
CHECK_MEM_ERROR(cm, vt2, aom_malloc(sizeof(*vt2)));
// Fill in the entire tree of 8x8 (or 4x4 under some conditions) variances
// for splits.
- fill_variance_tree_leaves(cpi, x, vt, vt2, force_split, avg_16x16,
- maxvar_16x16, minvar_16x16, variance4x4downsample,
- thresholds, s, sp, d, dp, is_key_frame);
+ fill_variance_tree_leaves(cpi, x, vt, force_split, avg_16x16, maxvar_16x16,
+ minvar_16x16, variance4x4downsample, thresholds,
+ src_buf, src_stride, dst_buf, dst_stride,
+ is_key_frame, is_small_sb);
avg_64x64 = 0;
- for (m = 0; m < num_64x64_blocks; ++m) {
- max_var_32x32[m] = 0;
- min_var_32x32[m] = INT_MAX;
- const int m2 = m << 2;
- for (i = 0; i < 4; i++) {
- const int i2 = (m2 + i) << 2;
- for (j = 0; j < 4; j++) {
- const int split_index = 21 + i2 + j;
- if (variance4x4downsample[i2 + j] == 1) {
- VP16x16 *vtemp =
- (!is_key_frame) ? &vt2[i2 + j] : &vt->split[m].split[i].split[j];
- for (k = 0; k < 4; k++)
- fill_variance_tree(&vtemp->split[k], BLOCK_8X8);
- fill_variance_tree(vtemp, BLOCK_16X16);
- // If variance of this 16x16 block is above the threshold, force block
- // to split. This also forces a split on the upper levels.
- get_variance(&vtemp->part_variances.none);
- if (vtemp->part_variances.none.variance > thresholds[3]) {
- force_split[split_index] =
- cpi->sf.rt_sf.vbp_prune_16x16_split_using_min_max_sub_blk_var
- ? get_part_eval_based_on_sub_blk_var(vtemp, thresholds[3])
- : PART_EVAL_ONLY_SPLIT;
- force_split[5 + m2 + i] = PART_EVAL_ONLY_SPLIT;
- force_split[m + 1] = PART_EVAL_ONLY_SPLIT;
- force_split[0] = PART_EVAL_ONLY_SPLIT;
- }
+ for (int blk64_idx = 0; blk64_idx < num_64x64_blocks; ++blk64_idx) {
+ max_var_32x32[blk64_idx] = 0;
+ min_var_32x32[blk64_idx] = INT_MAX;
+ const int blk64_scale_idx = blk64_idx << 2;
+ for (int lvl1_idx = 0; lvl1_idx < 4; lvl1_idx++) {
+ const int lvl1_scale_idx = (blk64_scale_idx + lvl1_idx) << 2;
+ for (int lvl2_idx = 0; lvl2_idx < 4; lvl2_idx++) {
+ if (variance4x4downsample[lvl1_scale_idx + lvl2_idx] != 1) continue;
+ VP16x16 *vtemp =
+ (!is_key_frame)
+ ? &vt2[lvl1_scale_idx + lvl2_idx]
+ : &vt->split[blk64_idx].split[lvl1_idx].split[lvl2_idx];
+ for (int lvl3_idx = 0; lvl3_idx < 4; lvl3_idx++)
+ fill_variance_tree(&vtemp->split[lvl3_idx], BLOCK_8X8);
+ fill_variance_tree(vtemp, BLOCK_16X16);
+ // If variance of this 16x16 block is above the threshold, force block
+ // to split. This also forces a split on the upper levels.
+ get_variance(&vtemp->part_variances.none);
+ if (vtemp->part_variances.none.variance > thresholds[3]) {
+ const int split_index = 21 + lvl1_scale_idx + lvl2_idx;
+ force_split[split_index] =
+ cpi->sf.rt_sf.vbp_prune_16x16_split_using_min_max_sub_blk_var
+ ? get_part_eval_based_on_sub_blk_var(vtemp, thresholds[3])
+ : PART_EVAL_ONLY_SPLIT;
+ force_split[5 + blk64_scale_idx + lvl1_idx] = PART_EVAL_ONLY_SPLIT;
+ force_split[blk64_idx + 1] = PART_EVAL_ONLY_SPLIT;
+ force_split[0] = PART_EVAL_ONLY_SPLIT;
}
}
- fill_variance_tree(&vt->split[m].split[i], BLOCK_32X32);
+ fill_variance_tree(&vt->split[blk64_idx].split[lvl1_idx], BLOCK_32X32);
      // If the variance of this 32x32 block is above the threshold, or if it is
      // above some fraction of the average variance over the sub-16x16 blocks,
// then force this block to split. This also forces a split on the upper
// (64x64) level.
uint64_t frame_sad_thresh = 20000;
+ const int is_360p_or_smaller = cm->width * cm->height <= RESOLUTION_360P;
if (cpi->svc.number_temporal_layers > 2 &&
cpi->svc.temporal_layer_id == 0)
frame_sad_thresh = frame_sad_thresh << 1;
- if (force_split[5 + m2 + i] == PART_EVAL_ALL) {
- get_variance(&vt->split[m].split[i].part_variances.none);
- var_32x32 = vt->split[m].split[i].part_variances.none.variance;
- max_var_32x32[m] = AOMMAX(var_32x32, max_var_32x32[m]);
- min_var_32x32[m] = AOMMIN(var_32x32, min_var_32x32[m]);
- if (vt->split[m].split[i].part_variances.none.variance >
- thresholds[2] ||
- (!is_key_frame &&
- vt->split[m].split[i].part_variances.none.variance >
- (thresholds[2] >> 1) &&
- vt->split[m].split[i].part_variances.none.variance >
- (avg_16x16[m][i] >> 1))) {
- force_split[5 + m2 + i] = PART_EVAL_ONLY_SPLIT;
- force_split[m + 1] = PART_EVAL_ONLY_SPLIT;
+ if (force_split[5 + blk64_scale_idx + lvl1_idx] == PART_EVAL_ALL) {
+ get_variance(&vt->split[blk64_idx].split[lvl1_idx].part_variances.none);
+ var_32x32 =
+ vt->split[blk64_idx].split[lvl1_idx].part_variances.none.variance;
+ max_var_32x32[blk64_idx] = AOMMAX(var_32x32, max_var_32x32[blk64_idx]);
+ min_var_32x32[blk64_idx] = AOMMIN(var_32x32, min_var_32x32[blk64_idx]);
+ const int max_min_var_16X16_diff = (maxvar_16x16[blk64_idx][lvl1_idx] -
+ minvar_16x16[blk64_idx][lvl1_idx]);
+
+ if (var_32x32 > thresholds[2] ||
+ (!is_key_frame && var_32x32 > (thresholds[2] >> 1) &&
+ var_32x32 > (avg_16x16[blk64_idx][lvl1_idx] >> 1))) {
+ force_split[5 + blk64_scale_idx + lvl1_idx] = PART_EVAL_ONLY_SPLIT;
+ force_split[blk64_idx + 1] = PART_EVAL_ONLY_SPLIT;
force_split[0] = PART_EVAL_ONLY_SPLIT;
- } else if (!is_key_frame && (cm->width * cm->height <= 640 * 360) &&
- (((maxvar_16x16[m][i] - minvar_16x16[m][i]) >
- (thresholds[2] >> 1) &&
- maxvar_16x16[m][i] > thresholds[2]) ||
+ } else if (!is_key_frame && is_360p_or_smaller &&
+ ((max_min_var_16X16_diff > (thresholds[2] >> 1) &&
+ maxvar_16x16[blk64_idx][lvl1_idx] > thresholds[2]) ||
(cpi->sf.rt_sf.prefer_large_partition_blocks &&
x->content_state_sb.source_sad_nonrd > kLowSad &&
cpi->rc.frame_source_sad < frame_sad_thresh &&
- maxvar_16x16[m][i] > (thresholds[2] >> 4) &&
- maxvar_16x16[m][i] > (minvar_16x16[m][i] << 2)))) {
- force_split[5 + m2 + i] = PART_EVAL_ONLY_SPLIT;
- force_split[m + 1] = PART_EVAL_ONLY_SPLIT;
+ maxvar_16x16[blk64_idx][lvl1_idx] > (thresholds[2] >> 4) &&
+ maxvar_16x16[blk64_idx][lvl1_idx] >
+ (minvar_16x16[blk64_idx][lvl1_idx] << 2)))) {
+ force_split[5 + blk64_scale_idx + lvl1_idx] = PART_EVAL_ONLY_SPLIT;
+ force_split[blk64_idx + 1] = PART_EVAL_ONLY_SPLIT;
force_split[0] = PART_EVAL_ONLY_SPLIT;
}
}
}
- if (force_split[1 + m] == PART_EVAL_ALL) {
- fill_variance_tree(&vt->split[m], BLOCK_64X64);
- get_variance(&vt->split[m].part_variances.none);
- var_64x64 = vt->split[m].part_variances.none.variance;
+ if (force_split[1 + blk64_idx] == PART_EVAL_ALL) {
+ fill_variance_tree(&vt->split[blk64_idx], BLOCK_64X64);
+ get_variance(&vt->split[blk64_idx].part_variances.none);
+ var_64x64 = vt->split[blk64_idx].part_variances.none.variance;
max_var_64x64 = AOMMAX(var_64x64, max_var_64x64);
min_var_64x64 = AOMMIN(var_64x64, min_var_64x64);
      // If the difference of the max-min variances of the sub-blocks, or the
      // max variance of a sub-block, is above some threshold, then force this
      // block to split. Only check this for noise level >= medium, when the
      // encoder is in SVC, or if we already forced large blocks.
+ const int max_min_var_32x32_diff =
+ max_var_32x32[blk64_idx] - min_var_32x32[blk64_idx];
+ const int check_max_var = max_var_32x32[blk64_idx] > thresholds[1] >> 1;
+ const bool check_noise_lvl = noise_level >= kMedium ||
+ cpi->ppi->use_svc ||
+ cpi->sf.rt_sf.prefer_large_partition_blocks;
+ const int64_t set_threshold = 3 * (thresholds[1] >> 3);
- if (!is_key_frame &&
- (max_var_32x32[m] - min_var_32x32[m]) > 3 * (thresholds[1] >> 3) &&
- max_var_32x32[m] > thresholds[1] >> 1 &&
- (noise_level >= kMedium || cpi->ppi->use_svc ||
- cpi->sf.rt_sf.prefer_large_partition_blocks)) {
- force_split[1 + m] = PART_EVAL_ONLY_SPLIT;
+ if (!is_key_frame && max_min_var_32x32_diff > set_threshold &&
+ check_max_var && check_noise_lvl) {
+ force_split[1 + blk64_idx] = PART_EVAL_ONLY_SPLIT;
force_split[0] = PART_EVAL_ONLY_SPLIT;
}
avg_64x64 += var_64x64;
@@ -1567,8 +1800,8 @@
if (force_split[0] == PART_EVAL_ALL) {
fill_variance_tree(vt, BLOCK_128X128);
get_variance(&vt->part_variances.none);
- if (!is_key_frame &&
- vt->part_variances.none.variance > (9 * avg_64x64) >> 5)
+ const int set_avg_64x64 = (9 * avg_64x64) >> 5;
+ if (!is_key_frame && vt->part_variances.none.variance > set_avg_64x64)
force_split[0] = PART_EVAL_ONLY_SPLIT;
if (!is_key_frame &&
@@ -1580,51 +1813,51 @@
if (mi_col + 32 > tile->mi_col_end || mi_row + 32 > tile->mi_row_end ||
!set_vt_partitioning(cpi, xd, tile, vt, BLOCK_128X128, mi_row, mi_col,
thresholds[0], BLOCK_16X16, force_split[0])) {
- for (m = 0; m < num_64x64_blocks; ++m) {
- const int x64_idx = ((m & 1) << 4);
- const int y64_idx = ((m >> 1) << 4);
- const int m2 = m << 2;
+ for (int blk64_idx = 0; blk64_idx < num_64x64_blocks; ++blk64_idx) {
+ const int x64_idx = GET_BLK_IDX_X(blk64_idx, 4);
+ const int y64_idx = GET_BLK_IDX_Y(blk64_idx, 4);
+ const int blk64_scale_idx = blk64_idx << 2;
// Now go through the entire structure, splitting every block size until
// we get to one that's got a variance lower than our threshold.
- if (!set_vt_partitioning(cpi, xd, tile, &vt->split[m], BLOCK_64X64,
- mi_row + y64_idx, mi_col + x64_idx,
- thresholds[1], BLOCK_16X16,
- force_split[1 + m])) {
- for (i = 0; i < 4; ++i) {
- const int x32_idx = ((i & 1) << 3);
- const int y32_idx = ((i >> 1) << 3);
- const int i2 = (m2 + i) << 2;
- if (!set_vt_partitioning(cpi, xd, tile, &vt->split[m].split[i],
- BLOCK_32X32, (mi_row + y64_idx + y32_idx),
- (mi_col + x64_idx + x32_idx), thresholds[2],
- BLOCK_16X16, force_split[5 + m2 + i])) {
- for (j = 0; j < 4; ++j) {
- const int x16_idx = ((j & 1) << 2);
- const int y16_idx = ((j >> 1) << 2);
- const int split_index = 21 + i2 + j;
- // For inter frames: if variance4x4downsample[] == 1 for this
- // 16x16 block, then the variance is based on 4x4 down-sampling,
- // so use vt2 in set_vt_partioning(), otherwise use vt.
- VP16x16 *vtemp =
- (!is_key_frame && variance4x4downsample[i2 + j] == 1)
- ? &vt2[i2 + j]
- : &vt->split[m].split[i].split[j];
- if (!set_vt_partitioning(cpi, xd, tile, vtemp, BLOCK_16X16,
- mi_row + y64_idx + y32_idx + y16_idx,
- mi_col + x64_idx + x32_idx + x16_idx,
- thresholds[3], BLOCK_8X8,
- force_split[split_index])) {
- for (k = 0; k < 4; ++k) {
- const int x8_idx = (k & 1) << 1;
- const int y8_idx = (k >> 1) << 1;
- set_block_size(
- cpi, (mi_row + y64_idx + y32_idx + y16_idx + y8_idx),
- (mi_col + x64_idx + x32_idx + x16_idx + x8_idx),
- BLOCK_8X8);
- }
- }
- }
+ if (set_vt_partitioning(cpi, xd, tile, &vt->split[blk64_idx], BLOCK_64X64,
+ mi_row + y64_idx, mi_col + x64_idx, thresholds[1],
+ BLOCK_16X16, force_split[1 + blk64_idx]))
+ continue;
+ for (int lvl1_idx = 0; lvl1_idx < 4; ++lvl1_idx) {
+ const int x32_idx = GET_BLK_IDX_X(lvl1_idx, 3);
+ const int y32_idx = GET_BLK_IDX_Y(lvl1_idx, 3);
+ const int lvl1_scale_idx = (blk64_scale_idx + lvl1_idx) << 2;
+ if (set_vt_partitioning(
+ cpi, xd, tile, &vt->split[blk64_idx].split[lvl1_idx],
+ BLOCK_32X32, (mi_row + y64_idx + y32_idx),
+ (mi_col + x64_idx + x32_idx), thresholds[2], BLOCK_16X16,
+ force_split[5 + blk64_scale_idx + lvl1_idx]))
+ continue;
+ for (int lvl2_idx = 0; lvl2_idx < 4; ++lvl2_idx) {
+ const int x16_idx = GET_BLK_IDX_X(lvl2_idx, 2);
+ const int y16_idx = GET_BLK_IDX_Y(lvl2_idx, 2);
+ const int split_index = 21 + lvl1_scale_idx + lvl2_idx;
+ // For inter frames: if variance4x4downsample[] == 1 for this
+ // 16x16 block, then the variance is based on 4x4 down-sampling,
+        // so use vt2 in set_vt_partitioning(), otherwise use vt.
+ VP16x16 *vtemp =
+ (!is_key_frame &&
+ variance4x4downsample[lvl1_scale_idx + lvl2_idx] == 1)
+ ? &vt2[lvl1_scale_idx + lvl2_idx]
+ : &vt->split[blk64_idx].split[lvl1_idx].split[lvl2_idx];
+ if (set_vt_partitioning(cpi, xd, tile, vtemp, BLOCK_16X16,
+ mi_row + y64_idx + y32_idx + y16_idx,
+ mi_col + x64_idx + x32_idx + x16_idx,
+ thresholds[3], BLOCK_8X8,
+ force_split[split_index]))
+ continue;
+ for (int lvl3_idx = 0; lvl3_idx < 4; ++lvl3_idx) {
+ const int x8_idx = GET_BLK_IDX_X(lvl3_idx, 1);
+ const int y8_idx = GET_BLK_IDX_Y(lvl3_idx, 1);
+ set_block_size(cpi, (mi_row + y64_idx + y32_idx + y16_idx + y8_idx),
+ (mi_col + x64_idx + x32_idx + x16_idx + x8_idx),
+ BLOCK_8X8);
}
}
}
@@ -1633,7 +1866,7 @@
if (cpi->sf.rt_sf.short_circuit_low_temp_var) {
set_low_temp_var_flag(cpi, &x->part_search_info, xd, vt, thresholds,
- ref_frame_partition, mi_col, mi_row);
+ ref_frame_partition, mi_col, mi_row, is_small_sb);
}
if (vt2) aom_free(vt2);
diff --git a/av1/encoder/var_based_part.h b/av1/encoder/var_based_part.h
index 7febc0e..f912458 100644
--- a/av1/encoder/var_based_part.h
+++ b/av1/encoder/var_based_part.h
@@ -20,6 +20,10 @@
#include "av1/encoder/encoder.h"
+// Calculate the x and y block indices from the split level and index.
+#define GET_BLK_IDX_X(idx, level) (((idx) & (0x01)) << (level))
+#define GET_BLK_IDX_Y(idx, level) (((idx) >> (0x01)) << (level))
+
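
As a quick sanity check of the two macros: at split level 4 (the 64x64 sub-blocks of a 128x128 superblock, offsets in mi units), indices 0-3 map to (0,0), (16,0), (0,16), (16,16). A standalone sketch:

#include <stdio.h>

#define GET_BLK_IDX_X(idx, level) (((idx) & (0x01)) << (level))
#define GET_BLK_IDX_Y(idx, level) (((idx) >> (0x01)) << (level))

int main(void) {
  for (int idx = 0; idx < 4; ++idx)
    printf("idx %d -> (%d, %d)\n", idx, GET_BLK_IDX_X(idx, 4),
           GET_BLK_IDX_Y(idx, 4));
  return 0;
}
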
#ifdef __cplusplus
extern "C" {
#endif
diff --git a/av1/encoder/x86/av1_fwd_txfm2d_avx2.c b/av1/encoder/x86/av1_fwd_txfm2d_avx2.c
index b898fc6..b143df3 100644
--- a/av1/encoder/x86/av1_fwd_txfm2d_avx2.c
+++ b/av1/encoder/x86/av1_fwd_txfm2d_avx2.c
@@ -1430,34 +1430,15 @@
}
}
-static INLINE void transpose_32_8x8_avx2(int stride, const __m256i *inputA,
- __m256i *output) {
- __m256i temp0 = _mm256_unpacklo_epi32(inputA[0], inputA[2]);
- __m256i temp1 = _mm256_unpackhi_epi32(inputA[0], inputA[2]);
- __m256i temp2 = _mm256_unpacklo_epi32(inputA[1], inputA[3]);
- __m256i temp3 = _mm256_unpackhi_epi32(inputA[1], inputA[3]);
- __m256i temp4 = _mm256_unpacklo_epi32(inputA[4], inputA[6]);
- __m256i temp5 = _mm256_unpackhi_epi32(inputA[4], inputA[6]);
- __m256i temp6 = _mm256_unpacklo_epi32(inputA[5], inputA[7]);
- __m256i temp7 = _mm256_unpackhi_epi32(inputA[5], inputA[7]);
-
- __m256i t0 = _mm256_unpacklo_epi32(temp0, temp2);
- __m256i t1 = _mm256_unpackhi_epi32(temp0, temp2);
- __m256i t2 = _mm256_unpacklo_epi32(temp1, temp3);
- __m256i t3 = _mm256_unpackhi_epi32(temp1, temp3);
- __m256i t4 = _mm256_unpacklo_epi32(temp4, temp6);
- __m256i t5 = _mm256_unpackhi_epi32(temp4, temp6);
- __m256i t6 = _mm256_unpacklo_epi32(temp5, temp7);
- __m256i t7 = _mm256_unpackhi_epi32(temp5, temp7);
-
- output[0 * stride] = _mm256_permute2x128_si256(t0, t4, 0x20);
- output[1 * stride] = _mm256_permute2x128_si256(t1, t5, 0x20);
- output[2 * stride] = _mm256_permute2x128_si256(t2, t6, 0x20);
- output[3 * stride] = _mm256_permute2x128_si256(t3, t7, 0x20);
- output[4 * stride] = _mm256_permute2x128_si256(t0, t4, 0x31);
- output[5 * stride] = _mm256_permute2x128_si256(t1, t5, 0x31);
- output[6 * stride] = _mm256_permute2x128_si256(t2, t6, 0x31);
- output[7 * stride] = _mm256_permute2x128_si256(t3, t7, 0x31);
+static INLINE void store_output_32bit_w16(int32_t *const out,
+ const __m256i *const in1,
+ const __m256i *const in2,
+ const int stride,
+ const int out_size) {
+ for (int i = 0; i < out_size; ++i) {
+ _mm256_store_si256((__m256i *)(out + stride * i), in1[i]);
+ _mm256_store_si256((__m256i *)(out + stride * i + 8), in2[i]);
+ }
}
// Store 8 16-bit values. Sign extend the values.
@@ -1779,83 +1760,14 @@
out[7] = _mm256_extractf128_si256(c3, 1);
}
-static INLINE void transpose_16bit_and_store_8x8(const __m128i *const in,
- int32_t *output) {
- // in[0]: 00 01 02 03 04 05 06 07
- // in[1]: 10 11 12 13 14 15 16 17
- // in[2]: 20 21 22 23 24 25 26 27
- // in[3]: 30 31 32 33 34 35 36 37
- // in[4]: 40 41 42 43 44 45 46 47
- // in[5]: 50 51 52 53 54 55 56 57
- // in[6]: 60 61 62 63 64 65 66 67
- // in[7]: 70 71 72 73 74 75 76 77
- // to:
- // s04: 00 01 02 03 04 05 06 07 | 40 41 42 43 44 45 46 47
- // s15: 10 11 12 13 14 15 16 17 | 50 51 52 53 54 55 56 57
- // s26: 20 21 22 23 24 25 26 27 | 60 61 62 63 64 65 66 67
- // s37: 30 31 32 33 34 35 36 37 | 70 71 72 73 74 75 76 77
- const __m256i s04 =
- _mm256_insertf128_si256(_mm256_castsi128_si256(in[0]), in[4], 0x1);
- const __m256i s15 =
- _mm256_insertf128_si256(_mm256_castsi128_si256(in[1]), in[5], 0x1);
- const __m256i s26 =
- _mm256_insertf128_si256(_mm256_castsi128_si256(in[2]), in[6], 0x1);
- const __m256i s37 =
- _mm256_insertf128_si256(_mm256_castsi128_si256(in[3]), in[7], 0x1);
-
- // a0: 00 10 01 11 02 12 03 13 | 40 50 41 51 42 52 43 53
- // a1: 04 14 05 15 06 16 07 17 | 44 54 45 55 46 56 47 57
- // a2: 20 30 21 31 22 32 23 33 | 60 70 61 71 62 72 63 73
- // a3: 24 34 25 35 26 36 27 37 | 64 74 65 75 66 76 67 77
- const __m256i a0 = _mm256_unpacklo_epi16(s04, s15);
- const __m256i a1 = _mm256_unpackhi_epi16(s04, s15);
- const __m256i a2 = _mm256_unpacklo_epi16(s26, s37);
- const __m256i a3 = _mm256_unpackhi_epi16(s26, s37);
-
- // Unpack 32 bit elements resulting in:
- // b0: 00 10 20 30 01 11 21 31 | 40 50 60 70 41 51 61 71
- // b1: 02 12 22 32 03 13 23 33 | 42 52 62 72 43 53 63 73
- // b2: 04 14 24 34 05 15 25 35 | 44 54 64 74 45 55 65 75
- // b2: 06 16 26 36 07 17 27 37 | 46 56 66 76 47 57 67 77
- const __m256i b0 = _mm256_unpacklo_epi32(a0, a2);
- const __m256i b1 = _mm256_unpackhi_epi32(a0, a2);
- const __m256i b2 = _mm256_unpacklo_epi32(a1, a3);
- const __m256i b3 = _mm256_unpackhi_epi32(a1, a3);
-
- // 00 10 20 30 40 50 60 70
- // 01 11 21 31 41 51 61 71
- // 02 12 22 32 42 52 62 72
- // 03 13 23 33 43 53 63 73
- // 04 14 24 34 44 54 64 74
- // 05 15 25 35 45 55 65 75
- // 06 16 26 36 46 56 66 76
- // 07 17 27 37 47 57 67 77
- const __m256i a_lo = _mm256_unpacklo_epi16(b0, b0);
- const __m256i a_hi = _mm256_unpackhi_epi16(b0, b0);
- const __m256i b_lo = _mm256_unpacklo_epi16(b1, b1);
- const __m256i b_hi = _mm256_unpackhi_epi16(b1, b1);
- const __m256i c_lo = _mm256_unpacklo_epi16(b2, b2);
- const __m256i c_hi = _mm256_unpackhi_epi16(b2, b2);
- const __m256i d_lo = _mm256_unpacklo_epi16(b3, b3);
- const __m256i d_hi = _mm256_unpackhi_epi16(b3, b3);
-
- const __m256i a_1 = _mm256_srai_epi32(a_lo, 16);
- const __m256i a_2 = _mm256_srai_epi32(a_hi, 16);
- const __m256i a_3 = _mm256_srai_epi32(b_lo, 16);
- const __m256i a_4 = _mm256_srai_epi32(b_hi, 16);
- const __m256i a_5 = _mm256_srai_epi32(c_lo, 16);
- const __m256i a_6 = _mm256_srai_epi32(c_hi, 16);
- const __m256i a_7 = _mm256_srai_epi32(d_lo, 16);
- const __m256i a_8 = _mm256_srai_epi32(d_hi, 16);
-
- _mm256_store_si256((__m256i *)output, a_1);
- _mm256_store_si256((__m256i *)(output + 8), a_2);
- _mm256_store_si256((__m256i *)(output + 16), a_3);
- _mm256_store_si256((__m256i *)(output + 24), a_4);
- _mm256_store_si256((__m256i *)(output + 32), a_5);
- _mm256_store_si256((__m256i *)(output + 40), a_6);
- _mm256_store_si256((__m256i *)(output + 48), a_7);
- _mm256_store_si256((__m256i *)(output + 56), a_8);
+static INLINE void store_buffer_16bit_to_32bit_w8_avx2(const __m128i *const in,
+ int32_t *const out,
+ const int stride,
+ const int out_size) {
+ for (int i = 0; i < out_size; ++i) {
+ _mm256_store_si256((__m256i *)(out + i * stride),
+ _mm256_cvtepi16_epi32(in[i]));
+ }
}
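
The helper relies on _mm256_cvtepi16_epi32, which sign-extends eight 16-bit lanes to 32 bits before the store. A scalar model for reference, assuming (as in the AVX2 signature above) that the last two parameters are the row stride and row count:

#include <stdint.h>

static void store_16bit_to_32bit_w8_scalar(const int16_t in[][8], int32_t *out,
                                           int stride, int out_size) {
  for (int i = 0; i < out_size; ++i)
    for (int j = 0; j < 8; ++j)
      out[i * stride + j] = (int32_t)in[i][j];  // sign extension
}
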
static void av1_lowbd_fwd_txfm2d_8x8_avx2(const int16_t *input, int32_t *output,
@@ -1897,7 +1809,7 @@
  // The round and shift operation is avoided here as the shift bit is assumed
  // to always be zero.
assert(shift[2] == 0);
- transpose_16bit_and_store_8x8(buf, output);
+ store_buffer_16bit_to_32bit_w8_avx2(buf, output, 8, 8);
}
static void lowbd_fwd_txfm2d_16x16_avx2(const int16_t *input, int32_t *output,
@@ -1937,8 +1849,7 @@
}
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit_w16_avx2(buf, width, shift[2]);
- transpose_16bit_16x16_avx2(buf, buf);
- store_buffer_16bit_to_32bit_w16_avx2(buf, output + 16 * width * i, width, 16);
+ store_buffer_16bit_to_32bit_w16_avx2(buf, output + i * 16, height, width);
}
static void lowbd_fwd_txfm2d_32x32_avx2(const int16_t *input, int32_t *output,
@@ -1983,12 +1894,7 @@
}
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit_w16_avx2(buf, width, shift[2]);
- transpose_16bit_16x16_avx2(buf, buf);
- store_buffer_16bit_to_32bit_w16_avx2(buf, output + 16 * width * i, width,
- 16);
- transpose_16bit_16x16_avx2(buf + 16, buf + 16);
- store_buffer_16bit_to_32bit_w16_avx2(buf + 16, output + 16 * width * i + 16,
- width, 16);
+ store_buffer_16bit_to_32bit_w16_avx2(buf, output + i * 16, height, width);
}
}
@@ -2032,13 +1938,7 @@
fdct64_new_avx2(bufB, bufB, cos_bit_row);
round_shift_array_32_avx2(bufA, bufA, 32, -shift[2]);
round_shift_array_32_avx2(bufB, bufB, 32, -shift[2]);
-
- int32_t *output8 = output + 16 * 32 * i;
- for (int j = 0; j < 4; ++j) {
- __m256i *out = (__m256i *)(output8 + 8 * j);
- transpose_32_8x8_avx2(4, bufA + 8 * j, out);
- transpose_32_8x8_avx2(4, bufB + 8 * j, out + 8 * 4);
- }
+ store_output_32bit_w16(output + i * 16, bufA, bufB, 32, 32);
}
}
@@ -2081,9 +1981,8 @@
}
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit_w16_avx2(buf, width, shift[2]);
- transpose_16bit_16x16_avx2(buf, buf);
- store_rect_buffer_16bit_to_32bit_w16_avx2(buf, output + 16 * width * i,
- width, 16);
+ store_rect_buffer_16bit_to_32bit_w16_avx2(buf, output + i * 16, height,
+ width);
}
}
@@ -2126,11 +2025,7 @@
}
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit_w16_avx2(buf, width, shift[2]);
- transpose_16bit_16x16_avx2(buf, buf);
- store_rect_buffer_16bit_to_32bit_w16_avx2(buf, output, width, 16);
-
- transpose_16bit_16x16_avx2(buf + 16, buf + 16);
- store_rect_buffer_16bit_to_32bit_w16_avx2(buf + 16, output + 16, width, 16);
+ store_rect_buffer_16bit_to_32bit_w16_avx2(buf, output, height, width);
}
static void lowbd_fwd_txfm2d_64x32_avx2(const int16_t *input, int32_t *output,
@@ -2172,12 +2067,7 @@
round_shift_rect_array_32_avx2(bufA, bufA, 32, -shift[2], NewSqrt2);
round_shift_rect_array_32_avx2(bufB, bufB, 32, -shift[2], NewSqrt2);
- int32_t *output8 = output + 16 * 32 * i;
- for (int j = 0; j < 4; ++j) {
- __m256i *out = (__m256i *)(output8 + 8 * j);
- transpose_32_8x8_avx2(4, bufA + 8 * j, out);
- transpose_32_8x8_avx2(4, bufB + 8 * j, out + 8 * 4);
- }
+ store_output_32bit_w16(output + i * 16, bufA, bufB, 32, 32);
}
}
@@ -2222,12 +2112,7 @@
round_shift_rect_array_32_avx2(bufA, bufA, 32, -shift[2], NewSqrt2);
round_shift_rect_array_32_avx2(bufB, bufB, 32, -shift[2], NewSqrt2);
- int32_t *output8 = output + 16 * 32 * i;
- for (int j = 0; j < 4; ++j) {
- __m256i *out = (__m256i *)(output8 + 8 * j);
- transpose_32_8x8_avx2(4, bufA + 8 * j, out);
- transpose_32_8x8_avx2(4, bufB + 8 * j, out + 8 * 4);
- }
+ store_output_32bit_w16(output + i * 16, bufA, bufB, 32, 32);
}
}
@@ -2260,19 +2145,12 @@
}
}
- for (int i = 0; i < AOMMIN(4, height_div16); i++) {
+ for (int i = 0; i < AOMMIN(2, height_div16); i++) {
__m256i *buf = buf1 + width * i;
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit_w16_avx2(buf, width, shift[2]);
- int32_t *output16 = output + 16 * width * i;
- for (int j = 0; j < width_div16; ++j) {
- __m256i *buf16 = buf + 16 * j;
- transpose_16bit_16x16_avx2(buf16, buf16);
- store_buffer_16bit_to_32bit_w16_avx2(buf16, output16 + 16 * j, width, 16);
- }
+ store_buffer_16bit_to_32bit_w16_avx2(buf, output + width * i, 32, width);
}
- // Zero out the bottom 16x32 area.
- memset(output + 16 * 32, 0, 16 * 32 * sizeof(*output));
}
static void lowbd_fwd_txfm2d_64x16_avx2(const int16_t *input, int32_t *output,
@@ -2308,13 +2186,10 @@
__m256i *buf = buf1 + width * i;
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit_w16_avx2(buf, width, shift[2]);
- int32_t *output16 = output + 16 * 32 * i;
- for (int j = 0; j < 2; ++j) {
- __m256i *buf16 = buf + 16 * j;
- transpose_16bit_16x16_avx2(buf16, buf16);
- store_buffer_16bit_to_32bit_w16_avx2(buf16, output16 + 16 * j, 32, 16);
- }
+ store_buffer_16bit_to_32bit_w16_avx2(buf, output + 16 * i, 16, 32);
}
+ // Zero out the bottom 16x32 area.
+ memset(output + 16 * 32, 0, 16 * 32 * sizeof(*output));
}
static INLINE void btf_16_avx2(__m256i *w0, __m256i *w1, __m256i *in0,
@@ -3054,8 +2929,7 @@
pack_reg(bufl, bufu, buf2);
row_txfm(buf2, buf2, cos_bit_row);
round_shift_16bit_w16_avx2(buf2, width, shift[2]);
- transpose_16bit_16x8_avx2(buf2, buf2);
- store_rect_buffer_16bit_to_32bit_w8_avx2(buf2, output, width, 8);
+ store_rect_buffer_16bit_to_32bit_w16_avx2(buf2, output, height, width);
}
static void lowbd_fwd_txfm2d_16x8_avx2(const int16_t *input, int32_t *output,
@@ -3099,10 +2973,7 @@
}
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit(buf, width, shift[2]);
- transpose_16bit_8x8(buf, buf);
- store_rect_buffer_16bit_to_32bit_w8(buf, output, width, height);
- transpose_16bit_8x8(buf + 8, buf + 8);
- store_rect_buffer_16bit_to_32bit_w8(buf + 8, output + 8, width, height);
+ store_rect_buffer_16bit_to_32bit_w8(buf, output, height, width);
}
static FwdTxfm2dFunc fwd_txfm2d_func_ls[TX_SIZES_ALL] = {
diff --git a/av1/encoder/x86/av1_fwd_txfm2d_sse4.c b/av1/encoder/x86/av1_fwd_txfm2d_sse4.c
index db554c4..825da8d 100644
--- a/av1/encoder/x86/av1_fwd_txfm2d_sse4.c
+++ b/av1/encoder/x86/av1_fwd_txfm2d_sse4.c
@@ -29,6 +29,16 @@
}
}
+static INLINE void store_output_32bit_w8(int32_t *const out,
+ const __m128i *const in1,
+ const __m128i *const in2,
+ const int stride, const int out_size) {
+ for (int i = 0; i < out_size; ++i) {
+ _mm_store_si128((__m128i *)(out + stride * i), in1[i]);
+ _mm_store_si128((__m128i *)(out + stride * i + 4), in2[i]);
+ }
+}
+
typedef void (*TxfmFuncSSE2)(__m128i *input, __m128i *output,
const int8_t cos_bit, const int8_t *stage_range);
@@ -65,9 +75,9 @@
static INLINE TxfmFuncSSE2 fwd_txfm_type_to_func(TXFM_TYPE txfm_type) {
switch (txfm_type) {
- case TXFM_TYPE_DCT32: return fdct32_sse4_1; break;
- case TXFM_TYPE_DCT64: return fdct64_new_sse4_1; break;
- case TXFM_TYPE_IDENTITY32: return idtx32x32_sse4_1; break;
+ case TXFM_TYPE_DCT32: return fdct32_sse4_1;
+ case TXFM_TYPE_DCT64: return fdct64_new_sse4_1;
+ case TXFM_TYPE_IDENTITY32: return idtx32x32_sse4_1;
default: assert(0);
}
return NULL;
@@ -104,8 +114,7 @@
av1_round_shift_array_32_sse4_1(buf_128, out_128, txfm2d_size_128, -shift[1]);
transpose_32(txfm_size, out_128, buf_128);
txfm_func_row(buf_128, out_128, cos_bit_row, stage_range_row);
- av1_round_shift_array_32_sse4_1(out_128, buf_128, txfm2d_size_128, -shift[2]);
- transpose_32(txfm_size, buf_128, out_128);
+ av1_round_shift_array_32_sse4_1(out_128, out_128, txfm2d_size_128, -shift[2]);
}
static INLINE void fwd_txfm2d_64x64_sse4_1(const int16_t *input,
@@ -140,8 +149,7 @@
}
txfm2d_size_128 = (col_num >> 1) * (txfm_size >> 1);
- av1_round_shift_array_32_sse4_1(out_128, buf_128, txfm2d_size_128, -shift[2]);
- transpose_8nx8n(buf_128, out_128, 32, 32);
+ av1_round_shift_array_32_sse4_1(out_128, out_128, txfm2d_size_128, -shift[2]);
}
void av1_fwd_txfm2d_32x32_sse4_1(const int16_t *input, int32_t *output,
@@ -162,29 +170,6 @@
fwd_txfm2d_64x64_sse4_1(input, output, stride, &cfg, txfm_buf);
}
-static INLINE void transpose_32_4x4x2(int stride, const __m128i *inputA,
- const __m128i *inputB, __m128i *output) {
- __m128i temp0 = _mm_unpacklo_epi32(inputA[0], inputA[2]);
- __m128i temp1 = _mm_unpackhi_epi32(inputA[0], inputA[2]);
- __m128i temp2 = _mm_unpacklo_epi32(inputA[1], inputA[3]);
- __m128i temp3 = _mm_unpackhi_epi32(inputA[1], inputA[3]);
-
- output[0 * stride] = _mm_unpacklo_epi32(temp0, temp2);
- output[1 * stride] = _mm_unpackhi_epi32(temp0, temp2);
- output[2 * stride] = _mm_unpacklo_epi32(temp1, temp3);
- output[3 * stride] = _mm_unpackhi_epi32(temp1, temp3);
-
- temp0 = _mm_unpacklo_epi32(inputB[0], inputB[2]);
- temp1 = _mm_unpackhi_epi32(inputB[0], inputB[2]);
- temp2 = _mm_unpacklo_epi32(inputB[1], inputB[3]);
- temp3 = _mm_unpackhi_epi32(inputB[1], inputB[3]);
-
- output[4 * stride] = _mm_unpacklo_epi32(temp0, temp2);
- output[5 * stride] = _mm_unpackhi_epi32(temp0, temp2);
- output[6 * stride] = _mm_unpacklo_epi32(temp1, temp3);
- output[7 * stride] = _mm_unpackhi_epi32(temp1, temp3);
-}
-
static void lowbd_fwd_txfm2d_64x64_sse4_1(const int16_t *input, int32_t *output,
int stride, TX_TYPE tx_type, int bd) {
(void)bd;
@@ -225,11 +210,7 @@
av1_round_shift_array_32_sse4_1(bufA, bufA, 32, -shift[2]);
av1_round_shift_array_32_sse4_1(bufB, bufB, 32, -shift[2]);
- int32_t *output8 = output + 8 * 32 * i;
- for (int j = 0; j < width_div8; ++j) {
- __m128i *out = (__m128i *)(output8 + 4 * j);
- transpose_32_4x4x2(8, bufA + 4 * j, bufB + 4 * j, out);
- }
+ store_output_32bit_w8(output + i * 8, bufA, bufB, 32, 32);
}
}
@@ -272,11 +253,7 @@
av1_round_shift_rect_array_32_sse4_1(bufA, bufA, 32, -shift[2], NewSqrt2);
av1_round_shift_rect_array_32_sse4_1(bufB, bufB, 32, -shift[2], NewSqrt2);
- int32_t *output8 = output + 8 * 32 * i;
- for (int j = 0; j < width_div8; ++j) {
- __m128i *out = (__m128i *)(output8 + 4 * j);
- transpose_32_4x4x2(8, bufA + 4 * j, bufB + 4 * j, out);
- }
+ store_output_32bit_w8(output + i * 8, bufA, bufB, 32, 32);
}
}
@@ -321,11 +298,7 @@
av1_round_shift_rect_array_32_sse4_1(bufA, bufA, 32, -shift[2], NewSqrt2);
av1_round_shift_rect_array_32_sse4_1(bufB, bufB, 32, -shift[2], NewSqrt2);
- int32_t *output8 = output + 8 * 32 * i;
- for (int j = 0; j < (32 / 4); ++j) {
- __m128i *out = (__m128i *)(output8 + 4 * j);
- transpose_32_4x4x2(8, bufA + 4 * j, bufB + 4 * j, out);
- }
+ store_output_32bit_w8(output + i * 8, bufA, bufB, 32, 32);
}
}
diff --git a/av1/encoder/x86/av1_fwd_txfm_sse2.c b/av1/encoder/x86/av1_fwd_txfm_sse2.c
index 748ef4d..a4def75 100644
--- a/av1/encoder/x86/av1_fwd_txfm_sse2.c
+++ b/av1/encoder/x86/av1_fwd_txfm_sse2.c
@@ -2006,8 +2006,7 @@
}
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit(buf, width, shift[2]);
- transpose_16bit_4x4(buf, buf);
- store_buffer_16bit_to_32bit_w4(buf, output, width, height);
+ store_buffer_16bit_to_32bit_w4(buf, output, height, width);
}
void av1_lowbd_fwd_txfm2d_4x8_sse2(const int16_t *input, int32_t *output,
@@ -2045,8 +2044,7 @@
}
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit(buf, width, shift[2]);
- transpose_16bit_8x4(buf, buf);
- store_rect_buffer_16bit_to_32bit_w4(buf, output, width, height);
+ store_rect_buffer_16bit_to_32bit_w8(buf, output, height, width);
}
void av1_lowbd_fwd_txfm2d_4x16_sse2(const int16_t *input, int32_t *output,
@@ -2086,8 +2084,7 @@
}
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit(buf, width, shift[2]);
- transpose_16bit_8x4(buf, buf);
- store_buffer_16bit_to_32bit_w4(buf, output + 8 * width * i, width, 8);
+ store_buffer_16bit_to_32bit_w8(buf, output + 8 * i, height, width);
}
}
@@ -2124,8 +2121,7 @@
}
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit(buf, width, shift[2]);
- transpose_16bit_8x8(buf, buf);
- store_rect_buffer_16bit_to_32bit_w8(buf, output, width, height);
+ store_rect_buffer_16bit_to_32bit_w4(buf, output, height, width);
}
void av1_lowbd_fwd_txfm2d_8x8_sse2(const int16_t *input, int32_t *output,
@@ -2161,8 +2157,7 @@
}
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit(buf, width, shift[2]);
- transpose_16bit_8x8(buf, buf);
- store_buffer_16bit_to_32bit_w8(buf, output, width, height);
+ store_buffer_16bit_to_32bit_w8(buf, output, height, width);
}
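
One pattern covers all of the forward-transform hunks in this file (and in the AVX2 file above): the explicit transpose_16bit_* steps are dropped and the store helpers are called with the last two arguments swapped from (width, height) to (height, width), folding the transpose into the store. A hypothetical scalar model of the net effect, not a libaom function:

#include <stdint.h>

// buf holds the row-transform output in row-major order (height rows of
// width values); storing with stride = height leaves out[] transposed.
static void store_transposed_sketch(const int16_t *buf, int width, int height,
                                    int32_t *out) {
  for (int r = 0; r < height; ++r)
    for (int c = 0; c < width; ++c)
      out[c * height + r] = (int32_t)buf[r * width + c];
}
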
void av1_lowbd_fwd_txfm2d_8x16_sse2(const int16_t *input, int32_t *output,
@@ -2202,8 +2197,7 @@
}
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit(buf, width, shift[2]);
- transpose_16bit_8x8(buf, buf);
- store_rect_buffer_16bit_to_32bit_w8(buf, output + 8 * width * i, width, 8);
+ store_rect_buffer_16bit_to_32bit_w8(buf, output + 8 * i, height, width);
}
}
@@ -2246,8 +2240,7 @@
}
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit(buf, width, shift[2]);
- transpose_16bit_8x8(buf, buf);
- store_buffer_16bit_to_32bit_w8(buf, output + 8 * width * i, width, 8);
+ store_buffer_16bit_to_32bit_w8(buf, output + 8 * i, height, width);
}
}
@@ -2288,10 +2281,7 @@
}
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit(buf, width, shift[2]);
- transpose_16bit_4x8(buf, buf);
- store_buffer_16bit_to_32bit_w8(buf, output, width, height);
- transpose_16bit_4x8(buf + 8, buf + 8);
- store_buffer_16bit_to_32bit_w8(buf + 8, output + 8, width, height);
+ store_buffer_16bit_to_32bit_w4(buf, output, height, width);
}
void av1_lowbd_fwd_txfm2d_16x8_sse2(const int16_t *input, int32_t *output,
@@ -2331,10 +2321,7 @@
}
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit(buf, width, shift[2]);
- transpose_16bit_8x8(buf, buf);
- store_rect_buffer_16bit_to_32bit_w8(buf, output, width, height);
- transpose_16bit_8x8(buf + 8, buf + 8);
- store_rect_buffer_16bit_to_32bit_w8(buf + 8, output + 8, width, height);
+ store_rect_buffer_16bit_to_32bit_w8(buf, output, height, width);
}
void av1_lowbd_fwd_txfm2d_16x16_sse2(const int16_t *input, int32_t *output,
@@ -2376,11 +2363,7 @@
}
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit(buf, width, shift[2]);
- transpose_16bit_8x8(buf, buf);
- store_buffer_16bit_to_32bit_w8(buf, output + 8 * width * i, width, 8);
- transpose_16bit_8x8(buf + 8, buf + 8);
- store_buffer_16bit_to_32bit_w8(buf + 8, output + 8 * width * i + 8, width,
- 8);
+ store_buffer_16bit_to_32bit_w8(buf, output + 8 * i, height, width);
}
}
@@ -2427,12 +2410,7 @@
}
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit(buf, width, shift[2]);
- transpose_16bit_8x8(buf, buf);
- store_rect_buffer_16bit_to_32bit_w8(buf, output + 8 * width * i, width,
- 8);
- transpose_16bit_8x8(buf + 8, buf + 8);
- store_rect_buffer_16bit_to_32bit_w8(buf + 8, output + 8 * width * i + 8,
- width, 8);
+ store_rect_buffer_16bit_to_32bit_w8(buf, output + 8 * i, height, width);
}
} else {
av1_fwd_txfm2d_16x32_c(input, output, stride, tx_type, bd);
@@ -2479,18 +2457,7 @@
}
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit(buf, width, shift[2]);
- transpose_16bit_8x8(buf, buf);
- store_buffer_16bit_to_32bit_w8(buf, output + 8 * width * i, width,
- height);
- transpose_16bit_8x8(buf + 8, buf + 8);
- store_buffer_16bit_to_32bit_w8(buf + 8, output + 8 * width * i + 8, width,
- height);
- transpose_16bit_8x8(buf + 16, buf + 16);
- store_buffer_16bit_to_32bit_w8(buf + 16, output + 8 * width * i + 16,
- width, height);
- transpose_16bit_8x8(buf + 24, buf + 24);
- store_buffer_16bit_to_32bit_w8(buf + 24, output + 8 * width * i + 24,
- width, height);
+ store_buffer_16bit_to_32bit_w8(buf, output + 8 * i, height, width);
}
} else {
av1_fwd_txfm2d_32x16_c(input, output, stride, tx_type, bd);
@@ -2538,18 +2505,7 @@
}
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit(buf, width, shift[2]);
- transpose_16bit_8x8(buf, buf);
- store_rect_buffer_16bit_to_32bit_w8(buf, output + 8 * width * i, width,
- 8);
- transpose_16bit_8x8(buf + 8, buf + 8);
- store_rect_buffer_16bit_to_32bit_w8(buf + 8, output + 8 * width * i + 8,
- width, 8);
- transpose_16bit_8x8(buf + 16, buf + 16);
- store_rect_buffer_16bit_to_32bit_w8(buf + 16, output + 8 * width * i + 16,
- width, 8);
- transpose_16bit_8x8(buf + 24, buf + 24);
- store_rect_buffer_16bit_to_32bit_w8(buf + 24, output + 8 * width * i + 24,
- width, 8);
+ store_rect_buffer_16bit_to_32bit_w8(buf, output + 8 * i, height, width);
}
} else {
av1_fwd_txfm2d_32x16_c(input, output, stride, tx_type, bd);
@@ -2599,17 +2555,7 @@
}
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit(buf, width, shift[2]);
- transpose_16bit_8x8(buf, buf);
- store_buffer_16bit_to_32bit_w8(buf, output + 8 * width * i, width, 8);
- transpose_16bit_8x8(buf + 8, buf + 8);
- store_buffer_16bit_to_32bit_w8(buf + 8, output + 8 * width * i + 8, width,
- 8);
- transpose_16bit_8x8(buf + 16, buf + 16);
- store_buffer_16bit_to_32bit_w8(buf + 16, output + 8 * width * i + 16,
- width, 8);
- transpose_16bit_8x8(buf + 24, buf + 24);
- store_buffer_16bit_to_32bit_w8(buf + 24, output + 8 * width * i + 24,
- width, 8);
+ store_buffer_16bit_to_32bit_w8(buf, output + 8 * i, height, width);
}
} else {
av1_fwd_txfm2d_32x32_c(input, output, stride, tx_type, bd);
@@ -2649,13 +2595,10 @@
__m128i *buf = buf1 + width * i;
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit(buf, width, shift[2]);
- int32_t *output8 = output + 8 * 32 * i;
- for (int j = 0; j < 4; ++j) {
- __m128i *buf8 = buf + 8 * j;
- transpose_16bit_8x8(buf8, buf8);
- store_buffer_16bit_to_32bit_w8(buf8, output8 + 8 * j, 32, 8);
- }
+ store_buffer_16bit_to_32bit_w8(buf, output + 8 * i, 16, 32);
}
+ // Zero out the bottom 16x32 area.
+ memset(output + 16 * 32, 0, 16 * 32 * sizeof(*output));
}
void av1_lowbd_fwd_txfm2d_16x64_sse2(const int16_t *input, int32_t *output,
@@ -2691,15 +2634,8 @@
__m128i *buf = buf1 + width * i;
row_txfm(buf, buf, cos_bit_row);
round_shift_16bit(buf, width, shift[2]);
- int32_t *output8 = output + 8 * width * i;
- for (int j = 0; j < width_div8; ++j) {
- __m128i *buf8 = buf + 8 * j;
- transpose_16bit_8x8(buf8, buf8);
- store_buffer_16bit_to_32bit_w8(buf8, output8 + 8 * j, width, 8);
- }
+ store_buffer_16bit_to_32bit_w8(buf, output + 8 * i, 32, 16);
}
- // Zero out the bottom 16x32 area.
- memset(output + 16 * 32, 0, 16 * 32 * sizeof(*output));
}
static FwdTxfm2dFunc fwd_txfm2d_func_ls[TX_SIZES_ALL] = {
diff --git a/av1/encoder/x86/av1_k_means_avx2.c b/av1/encoder/x86/av1_k_means_avx2.c
index a2db222..ad0b374 100644
--- a/av1/encoder/x86/av1_k_means_avx2.c
+++ b/av1/encoder/x86/av1_k_means_avx2.c
@@ -26,39 +26,44 @@
void av1_calc_indices_dim1_avx2(const int16_t *data, const int16_t *centroids,
uint8_t *indices, int64_t *total_dist, int n,
int k) {
- __m256i dist[PALETTE_MAX_SIZE];
const __m256i v_zero = _mm256_setzero_si256();
__m256i sum = _mm256_setzero_si256();
+ __m256i cents[PALETTE_MAX_SIZE];
+ for (int j = 0; j < k; ++j) {
+ cents[j] = _mm256_set1_epi16(centroids[j]);
+ }
for (int i = 0; i < n; i += 16) {
- __m256i ind = _mm256_loadu_si256((__m256i *)data);
- for (int j = 0; j < k; j++) {
- __m256i cent = _mm256_set1_epi16(centroids[j]);
- __m256i d1 = _mm256_sub_epi16(ind, cent);
- dist[j] = _mm256_abs_epi16(d1);
- }
+ const __m256i in = _mm256_loadu_si256((__m256i *)data);
+ __m256i ind = _mm256_setzero_si256();
+ // Compute the distance to the first centroid.
+ __m256i d1 = _mm256_sub_epi16(in, cents[0]);
+ __m256i dist_min = _mm256_abs_epi16(d1);
- ind = _mm256_setzero_si256();
- for (int j = 1; j < k; j++) {
- __m256i cmp = _mm256_cmpgt_epi16(dist[0], dist[j]);
- dist[0] = _mm256_min_epi16(dist[0], dist[j]);
- __m256i ind1 = _mm256_set1_epi16(j);
+ for (int j = 1; j < k; ++j) {
+ // Compute the distance to the centroid.
+ d1 = _mm256_sub_epi16(in, cents[j]);
+ const __m256i dist = _mm256_abs_epi16(d1);
+ // Compare to the minimal one.
+ const __m256i cmp = _mm256_cmpgt_epi16(dist_min, dist);
+ dist_min = _mm256_min_epi16(dist_min, dist);
+ const __m256i ind1 = _mm256_set1_epi16(j);
ind = _mm256_or_si256(_mm256_andnot_si256(cmp, ind),
_mm256_and_si256(cmp, ind1));
}
- __m256i p1 = _mm256_packus_epi16(ind, v_zero);
- __m256i px = _mm256_permute4x64_epi64(p1, 0x58);
- __m128i d1 = _mm256_extracti128_si256(px, 0);
+ const __m256i p1 = _mm256_packus_epi16(ind, v_zero);
+ const __m256i px = _mm256_permute4x64_epi64(p1, 0x58);
+ const __m128i d2 = _mm256_extracti128_si256(px, 0);
- _mm_storeu_si128((__m128i *)indices, d1);
+ _mm_storeu_si128((__m128i *)indices, d2);
if (total_dist) {
// Square, convert to 32 bit and add together.
- dist[0] = _mm256_madd_epi16(dist[0], dist[0]);
+ dist_min = _mm256_madd_epi16(dist_min, dist_min);
// Convert to 64 bit and add to sum.
- const __m256i dist1 = _mm256_unpacklo_epi32(dist[0], v_zero);
- const __m256i dist2 = _mm256_unpackhi_epi32(dist[0], v_zero);
+ const __m256i dist1 = _mm256_unpacklo_epi32(dist_min, v_zero);
+ const __m256i dist2 = _mm256_unpackhi_epi32(dist_min, v_zero);
sum = _mm256_add_epi64(sum, dist1);
sum = _mm256_add_epi64(sum, dist2);
}
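
For contrast with the vector loop above, a hedged scalar sketch of the dim1 computation it implements: the nearest 1-D centroid by absolute distance, with the strict '>' comparison keeping the lowest index on ties, and squared minimum distances accumulated into *total_dist.

#include <stdint.h>
#include <stdlib.h>

static void calc_indices_dim1_scalar(const int16_t *data,
                                     const int16_t *centroids,
                                     uint8_t *indices, int64_t *total_dist,
                                     int n, int k) {
  int64_t sum = 0;
  for (int i = 0; i < n; ++i) {
    int best = 0;
    int dist_min = abs(data[i] - centroids[0]);
    for (int j = 1; j < k; ++j) {
      const int dist = abs(data[i] - centroids[j]);
      if (dist_min > dist) {  // as in _mm256_cmpgt_epi16(dist_min, dist)
        dist_min = dist;
        best = j;
      }
    }
    indices[i] = (uint8_t)best;
    sum += (int64_t)dist_min * dist_min;
  }
  if (total_dist) *total_dist = sum;
}
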
@@ -74,46 +79,52 @@
void av1_calc_indices_dim2_avx2(const int16_t *data, const int16_t *centroids,
uint8_t *indices, int64_t *total_dist, int n,
int k) {
- __m256i dist[PALETTE_MAX_SIZE];
const __m256i v_zero = _mm256_setzero_si256();
+ const __m256i permute = _mm256_set_epi32(0, 0, 0, 0, 5, 1, 4, 0);
__m256i sum = _mm256_setzero_si256();
+ __m256i ind[2];
+ __m256i cents[PALETTE_MAX_SIZE];
+ for (int j = 0; j < k; ++j) {
+ const int16_t cx = centroids[2 * j], cy = centroids[2 * j + 1];
+ cents[j] = _mm256_set_epi16(cy, cx, cy, cx, cy, cx, cy, cx, cy, cx, cy, cx,
+ cy, cx, cy, cx);
+ }
- for (int i = 0; i < n; i += 8) {
- __m256i ind = _mm256_loadu_si256((__m256i *)data);
- for (int j = 0; j < k; j++) {
- const int16_t cx = centroids[2 * j], cy = centroids[2 * j + 1];
- const __m256i cent = _mm256_set_epi16(cy, cx, cy, cx, cy, cx, cy, cx, cy,
- cx, cy, cx, cy, cx, cy, cx);
- const __m256i d1 = _mm256_sub_epi16(ind, cent);
- dist[j] = _mm256_madd_epi16(d1, d1);
+ for (int i = 0; i < n; i += 16) {
+ for (int l = 0; l < 2; ++l) {
+ const __m256i in = _mm256_loadu_si256((__m256i *)data);
+ ind[l] = _mm256_setzero_si256();
+ // Compute the distance to the first centroid.
+ __m256i d1 = _mm256_sub_epi16(in, cents[0]);
+ __m256i dist_min = _mm256_madd_epi16(d1, d1);
+
+ for (int j = 1; j < k; ++j) {
+ // Compute the distance to the centroid.
+ d1 = _mm256_sub_epi16(in, cents[j]);
+ const __m256i dist = _mm256_madd_epi16(d1, d1);
+ // Compare to the minimal one.
+ const __m256i cmp = _mm256_cmpgt_epi32(dist_min, dist);
+ dist_min = _mm256_min_epi32(dist_min, dist);
+ const __m256i ind1 = _mm256_set1_epi32(j);
+ ind[l] = _mm256_or_si256(_mm256_andnot_si256(cmp, ind[l]),
+ _mm256_and_si256(cmp, ind1));
+ }
+ if (total_dist) {
+ // Convert to 64 bit and add to sum.
+ const __m256i dist1 = _mm256_unpacklo_epi32(dist_min, v_zero);
+ const __m256i dist2 = _mm256_unpackhi_epi32(dist_min, v_zero);
+ sum = _mm256_add_epi64(sum, dist1);
+ sum = _mm256_add_epi64(sum, dist2);
+ }
+ data += 16;
}
-
- ind = _mm256_setzero_si256();
- for (int j = 1; j < k; j++) {
- __m256i cmp = _mm256_cmpgt_epi32(dist[0], dist[j]);
- dist[0] = _mm256_min_epi32(dist[0], dist[j]);
- const __m256i ind1 = _mm256_set1_epi32(j);
- ind = _mm256_or_si256(_mm256_andnot_si256(cmp, ind),
- _mm256_and_si256(cmp, ind1));
- }
-
- __m256i p1 = _mm256_packus_epi32(ind, v_zero);
- __m256i px = _mm256_permute4x64_epi64(p1, 0x58);
- __m256i p2 = _mm256_packus_epi16(px, v_zero);
- __m128i d1 = _mm256_extracti128_si256(p2, 0);
-
- _mm_storel_epi64((__m128i *)indices, d1);
-
- if (total_dist) {
- // Convert to 64 bit and add to sum.
- const __m256i dist1 = _mm256_unpacklo_epi32(dist[0], v_zero);
- const __m256i dist2 = _mm256_unpackhi_epi32(dist[0], v_zero);
- sum = _mm256_add_epi64(sum, dist1);
- sum = _mm256_add_epi64(sum, dist2);
- }
-
- indices += 8;
- data += 16;
+ // Cast to 8 bit and store.
+ const __m256i d2 = _mm256_packus_epi32(ind[0], ind[1]);
+ const __m256i d3 = _mm256_packus_epi16(d2, v_zero);
+ const __m256i d4 = _mm256_permutevar8x32_epi32(d3, permute);
+ const __m128i d5 = _mm256_extracti128_si256(d4, 0);
+ _mm_storeu_si128((__m128i *)indices, d5);
+ indices += 16;
}
if (total_dist) {
*total_dist = k_means_horizontal_sum_avx2(sum);
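
And the matching scalar sketch for the dim2 variant, where each sample is an (x, y) pair and the distance is squared Euclidean, as in the madd-based loop above (illustrative only; assumes pixel-range inputs so dx * dx + dy * dy fits in 32 bits, as the vector code does):

#include <stdint.h>

static void calc_indices_dim2_scalar(const int16_t *data,
                                     const int16_t *centroids,
                                     uint8_t *indices, int64_t *total_dist,
                                     int n, int k) {
  int64_t sum = 0;
  for (int i = 0; i < n; ++i) {
    int best = 0;
    int32_t dist_min = INT32_MAX;
    for (int j = 0; j < k; ++j) {
      const int32_t dx = data[2 * i] - centroids[2 * j];
      const int32_t dy = data[2 * i + 1] - centroids[2 * j + 1];
      const int32_t dist = dx * dx + dy * dy;
      if (dist < dist_min) {  // strict '<' keeps the lowest index on ties
        dist_min = dist;
        best = j;
      }
    }
    indices[i] = (uint8_t)best;
    sum += dist_min;
  }
  if (total_dist) *total_dist = sum;
}
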
diff --git a/av1/encoder/x86/av1_k_means_sse2.c b/av1/encoder/x86/av1_k_means_sse2.c
index a284fa9..4338bf7 100644
--- a/av1/encoder/x86/av1_k_means_sse2.c
+++ b/av1/encoder/x86/av1_k_means_sse2.c
@@ -26,31 +26,37 @@
uint8_t *indices, int64_t *total_dist, int n,
int k) {
const __m128i v_zero = _mm_setzero_si128();
- __m128i dist[PALETTE_MAX_SIZE];
__m128i sum = _mm_setzero_si128();
+ __m128i cents[PALETTE_MAX_SIZE];
+ for (int j = 0; j < k; ++j) {
+ cents[j] = _mm_set1_epi16(centroids[j]);
+ }
for (int i = 0; i < n; i += 8) {
- __m128i in = _mm_loadu_si128((__m128i *)data);
- for (int j = 0; j < k; j++) {
- __m128i cent = _mm_set1_epi16(centroids[j]);
- __m128i d1 = _mm_sub_epi16(in, cent);
- __m128i d2 = _mm_sub_epi16(cent, in);
- dist[j] = _mm_max_epi16(d1, d2);
- }
-
+ const __m128i in = _mm_loadu_si128((__m128i *)data);
__m128i ind = _mm_setzero_si128();
- for (int j = 1; j < k; j++) {
- __m128i cmp = _mm_cmpgt_epi16(dist[0], dist[j]);
- dist[0] = _mm_min_epi16(dist[0], dist[j]);
- __m128i ind1 = _mm_set1_epi16(j);
+ // Compute the distance to the first centroid.
+ __m128i d1 = _mm_sub_epi16(in, cents[0]);
+ __m128i d2 = _mm_sub_epi16(cents[0], in);
+ __m128i dist_min = _mm_max_epi16(d1, d2);
+
+ for (int j = 1; j < k; ++j) {
+ // Compute the distance to the centroid.
+ d1 = _mm_sub_epi16(in, cents[j]);
+ d2 = _mm_sub_epi16(cents[j], in);
+ const __m128i dist = _mm_max_epi16(d1, d2);
+ // Compare to the minimal one.
+ const __m128i cmp = _mm_cmpgt_epi16(dist_min, dist);
+ dist_min = _mm_min_epi16(dist_min, dist);
+ const __m128i ind1 = _mm_set1_epi16(j);
ind = _mm_or_si128(_mm_andnot_si128(cmp, ind), _mm_and_si128(cmp, ind1));
}
if (total_dist) {
// Square, convert to 32 bit and add together.
- dist[0] = _mm_madd_epi16(dist[0], dist[0]);
+ dist_min = _mm_madd_epi16(dist_min, dist_min);
// Convert to 64 bit and add to sum.
- const __m128i dist1 = _mm_unpacklo_epi32(dist[0], v_zero);
- const __m128i dist2 = _mm_unpackhi_epi32(dist[0], v_zero);
+ const __m128i dist1 = _mm_unpacklo_epi32(dist_min, v_zero);
+ const __m128i dist2 = _mm_unpackhi_epi32(dist_min, v_zero);
sum = _mm_add_epi64(sum, dist1);
sum = _mm_add_epi64(sum, dist2);
}
@@ -68,45 +74,49 @@
uint8_t *indices, int64_t *total_dist, int n,
int k) {
const __m128i v_zero = _mm_setzero_si128();
- int l = 1;
- __m128i dist[PALETTE_MAX_SIZE];
- __m128i ind[2];
__m128i sum = _mm_setzero_si128();
+ __m128i ind[2];
+ __m128i cents[PALETTE_MAX_SIZE];
+ for (int j = 0; j < k; ++j) {
+ const int16_t cx = centroids[2 * j], cy = centroids[2 * j + 1];
+ cents[j] = _mm_set_epi16(cy, cx, cy, cx, cy, cx, cy, cx);
+ }
- for (int i = 0; i < n; i += 4) {
- l = (l == 0) ? 1 : 0;
- __m128i ind1 = _mm_loadu_si128((__m128i *)data);
- for (int j = 0; j < k; j++) {
- const int16_t cx = centroids[2 * j], cy = centroids[2 * j + 1];
- const __m128i cent = _mm_set_epi16(cy, cx, cy, cx, cy, cx, cy, cx);
- const __m128i d1 = _mm_sub_epi16(ind1, cent);
- dist[j] = _mm_madd_epi16(d1, d1);
- }
+ for (int i = 0; i < n; i += 8) {
+ for (int l = 0; l < 2; ++l) {
+ const __m128i in = _mm_loadu_si128((__m128i *)data);
+ ind[l] = _mm_setzero_si128();
+ // Compute the distance to the first centroid.
+ __m128i d1 = _mm_sub_epi16(in, cents[0]);
+ __m128i dist_min = _mm_madd_epi16(d1, d1);
- ind[l] = _mm_setzero_si128();
- for (int j = 1; j < k; j++) {
- __m128i cmp = _mm_cmpgt_epi32(dist[0], dist[j]);
- __m128i dist1 = _mm_andnot_si128(cmp, dist[0]);
- __m128i dist2 = _mm_and_si128(cmp, dist[j]);
- dist[0] = _mm_or_si128(dist1, dist2);
- ind1 = _mm_set1_epi32(j);
- ind[l] =
- _mm_or_si128(_mm_andnot_si128(cmp, ind[l]), _mm_and_si128(cmp, ind1));
+ for (int j = 1; j < k; ++j) {
+ // Compute the distance to the centroid.
+ d1 = _mm_sub_epi16(in, cents[j]);
+ const __m128i dist = _mm_madd_epi16(d1, d1);
+ // Compare to the minimal one.
+ const __m128i cmp = _mm_cmpgt_epi32(dist_min, dist);
+ const __m128i dist1 = _mm_andnot_si128(cmp, dist_min);
+ const __m128i dist2 = _mm_and_si128(cmp, dist);
+ dist_min = _mm_or_si128(dist1, dist2);
+ const __m128i ind1 = _mm_set1_epi32(j);
+ ind[l] = _mm_or_si128(_mm_andnot_si128(cmp, ind[l]),
+ _mm_and_si128(cmp, ind1));
+ }
+ if (total_dist) {
+ // Convert to 64 bit and add to sum.
+ const __m128i dist1 = _mm_unpacklo_epi32(dist_min, v_zero);
+ const __m128i dist2 = _mm_unpackhi_epi32(dist_min, v_zero);
+ sum = _mm_add_epi64(sum, dist1);
+ sum = _mm_add_epi64(sum, dist2);
+ }
+ data += 8;
}
- ind[l] = _mm_packus_epi16(ind[l], v_zero);
- if (total_dist) {
- // Convert to 64 bit and add to sum.
- const __m128i dist1 = _mm_unpacklo_epi32(dist[0], v_zero);
- const __m128i dist2 = _mm_unpackhi_epi32(dist[0], v_zero);
- sum = _mm_add_epi64(sum, dist1);
- sum = _mm_add_epi64(sum, dist2);
- }
- if (l == 1) {
- __m128i p2 = _mm_packus_epi16(_mm_unpacklo_epi64(ind[0], ind[1]), v_zero);
- _mm_storel_epi64((__m128i *)indices, p2);
- indices += 8;
- }
- data += 8;
+ // Cast to 8 bit and store.
+ const __m128i d2 = _mm_packus_epi16(ind[0], ind[1]);
+ const __m128i d3 = _mm_packus_epi16(d2, v_zero);
+ _mm_storel_epi64((__m128i *)indices, d3);
+ indices += 8;
}
if (total_dist) {
*total_dist = k_means_horizontal_sum_sse2(sum);
diff --git a/av1/encoder/x86/encodetxb_avx2.c b/av1/encoder/x86/encodetxb_avx2.c
index 30a4129..9627f75 100644
--- a/av1/encoder/x86/encodetxb_avx2.c
+++ b/av1/encoder/x86/encodetxb_avx2.c
@@ -23,11 +23,11 @@
void av1_txb_init_levels_avx2(const tran_low_t *const coeff, const int width,
const int height, uint8_t *const levels) {
- const int stride = width + TX_PAD_HOR;
+ const int stride = height + TX_PAD_HOR;
const __m256i y_zeros = _mm256_setzero_si256();
const int32_t bottom_len = sizeof(*levels) * (TX_PAD_BOTTOM * stride);
- uint8_t *bottom_buf_end = levels + (height + TX_PAD_BOTTOM) * stride;
+ uint8_t *bottom_buf_end = levels + (width + TX_PAD_BOTTOM) * stride;
uint8_t *bottom_buf = bottom_buf_end - ((bottom_len + 31) & (~31));
do {
@@ -38,7 +38,7 @@
int i = 0;
uint8_t *ls = levels;
const tran_low_t *cf = coeff;
- if (width == 4) {
+ if (height == 4) {
do {
const __m256i c0 = yy_loadu_256(cf);
const __m256i c1 = yy_loadu_256(cf + 8);
@@ -50,8 +50,8 @@
ls += 32;
cf += 16;
i += 4;
- } while (i < height);
- } else if (width == 8) {
+ } while (i < width);
+ } else if (height == 8) {
do {
const __m256i coeffA = yy_loadu_256(cf);
const __m256i coeffB = yy_loadu_256(cf + 8);
@@ -67,18 +67,18 @@
const __m128i res0 = _mm256_castsi256_si128(res);
const __m128i res1 = _mm256_extracti128_si256(res, 1);
xx_storel_64(ls, res0);
- *(int32_t *)(ls + width) = 0;
+ *(int32_t *)(ls + height) = 0;
xx_storel_64(ls + stride, _mm_srli_si128(res0, 8));
- *(int32_t *)(ls + width + stride) = 0;
+ *(int32_t *)(ls + height + stride) = 0;
xx_storel_64(ls + stride * 2, res1);
- *(int32_t *)(ls + width + stride * 2) = 0;
+ *(int32_t *)(ls + height + stride * 2) = 0;
xx_storel_64(ls + stride * 3, _mm_srli_si128(res1, 8));
- *(int32_t *)(ls + width + stride * 3) = 0;
+ *(int32_t *)(ls + height + stride * 3) = 0;
cf += 32;
ls += stride << 2;
i += 4;
- } while (i < height);
- } else if (width == 16) {
+ } while (i < width);
+ } else if (height == 16) {
do {
const __m256i coeffA = yy_loadu_256(cf);
const __m256i coeffB = yy_loadu_256(cf + 8);
@@ -94,11 +94,11 @@
xx_storeu_128(ls, _mm256_castsi256_si128(res));
xx_storeu_128(ls + stride, _mm256_extracti128_si256(res, 1));
cf += 32;
- *(int32_t *)(ls + width) = 0;
- *(int32_t *)(ls + stride + width) = 0;
+ *(int32_t *)(ls + height) = 0;
+ *(int32_t *)(ls + stride + height) = 0;
ls += stride << 1;
i += 2;
- } while (i < height);
+ } while (i < width);
} else {
do {
const __m256i coeffA = yy_loadu_256(cf);
@@ -114,9 +114,9 @@
const __m256i res = _mm256_shuffle_epi32(res_, 0xd8);
yy_storeu_256(ls, res);
cf += 32;
- *(int32_t *)(ls + width) = 0;
+ *(int32_t *)(ls + height) = 0;
ls += stride;
i += 1;
- } while (i < height);
+ } while (i < width);
}
}
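
The width/height swap in this function follows the transposed coefficient order introduced by the transform changes: the fast dimension of the levels buffer is now `height` plus TX_PAD_HOR of padding, and TX_PAD_BOTTOM zeroed rows follow the `width` data rows. A hedged scalar model of that layout; the padding constants and the 127 clamp are assumptions for illustration, not restated from txb_common.h:

#include <stdint.h>
#include <string.h>

static void txb_init_levels_sketch(const int32_t *coeff, int width, int height,
                                   uint8_t *levels) {
  const int tx_pad_hor = 4, tx_pad_bottom = 4;  // illustrative values
  const int stride = height + tx_pad_hor;
  uint8_t *ls = levels;
  for (int i = 0; i < width; ++i) {
    for (int j = 0; j < height; ++j) {
      const int32_t c = *coeff++;
      const int32_t a = c < 0 ? -c : c;
      *ls++ = (uint8_t)(a > 127 ? 127 : a);  // saturate to the level range
    }
    for (int j = 0; j < tx_pad_hor; ++j) *ls++ = 0;  // horizontal padding
  }
  memset(ls, 0, (size_t)(tx_pad_bottom * stride));  // bottom padding rows
}
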
diff --git a/av1/encoder/x86/encodetxb_sse2.c b/av1/encoder/x86/encodetxb_sse2.c
index 394befb..d23a688 100644
--- a/av1/encoder/x86/encodetxb_sse2.c
+++ b/av1/encoder/x86/encodetxb_sse2.c
@@ -70,22 +70,22 @@
}
static INLINE void get_4_nz_map_contexts_2d(const uint8_t *levels,
- const int height,
+ const int width,
const ptrdiff_t *const offsets,
int8_t *const coeff_contexts) {
const int stride = 4 + TX_PAD_HOR;
const __m128i pos_to_offset_large = _mm_set1_epi8(21);
__m128i pos_to_offset =
- (height == 4)
+ (width == 4)
? _mm_setr_epi8(0, 1, 6, 6, 1, 6, 6, 21, 6, 6, 21, 21, 6, 21, 21, 21)
- : _mm_setr_epi8(0, 11, 11, 11, 11, 11, 11, 11, 6, 6, 21, 21, 6, 21,
+ : _mm_setr_epi8(0, 16, 16, 16, 16, 16, 16, 16, 6, 6, 21, 21, 6, 21,
21, 21);
__m128i count;
__m128i level[5];
int8_t *cc = coeff_contexts;
- int row = height;
+ int col = width;
- assert(!(height % 4));
+ assert(!(width % 4));
do {
load_levels_4x4x5_sse2(levels, stride, offsets, level);
@@ -95,14 +95,14 @@
pos_to_offset = pos_to_offset_large;
levels += 4 * stride;
cc += 16;
- row -= 4;
- } while (row);
+ col -= 4;
+ } while (col);
coeff_contexts[0] = 0;
}
-static INLINE void get_4_nz_map_contexts_hor(const uint8_t *levels,
- const int height,
+static INLINE void get_4_nz_map_contexts_ver(const uint8_t *levels,
+ const int width,
const ptrdiff_t *const offsets,
int8_t *coeff_contexts) {
const int stride = 4 + TX_PAD_HOR;
@@ -117,9 +117,9 @@
SIG_COEF_CONTEXTS_2D + 10, SIG_COEF_CONTEXTS_2D + 10);
__m128i count;
__m128i level[5];
- int row = height;
+ int col = width;
- assert(!(height % 4));
+ assert(!(width % 4));
do {
load_levels_4x4x5_sse2(levels, stride, offsets, level);
@@ -128,12 +128,12 @@
_mm_store_si128((__m128i *)coeff_contexts, count);
levels += 4 * stride;
coeff_contexts += 16;
- row -= 4;
- } while (row);
+ col -= 4;
+ } while (col);
}
-static INLINE void get_4_nz_map_contexts_ver(const uint8_t *levels,
- const int height,
+static INLINE void get_4_nz_map_contexts_hor(const uint8_t *levels,
+ const int width,
const ptrdiff_t *const offsets,
int8_t *coeff_contexts) {
const int stride = 4 + TX_PAD_HOR;
@@ -149,9 +149,9 @@
SIG_COEF_CONTEXTS_2D + 10, SIG_COEF_CONTEXTS_2D + 10);
__m128i count;
__m128i level[5];
- int row = height;
+ int col = width;
- assert(!(height % 4));
+ assert(!(width % 4));
do {
load_levels_4x4x5_sse2(levels, stride, offsets, level);
@@ -161,36 +161,36 @@
pos_to_offset = pos_to_offset_large;
levels += 4 * stride;
coeff_contexts += 16;
- row -= 4;
- } while (row);
+ col -= 4;
+ } while (col);
}
static INLINE void get_8_coeff_contexts_2d(const uint8_t *levels,
- const int height,
+ const int width,
const ptrdiff_t *const offsets,
int8_t *coeff_contexts) {
const int stride = 8 + TX_PAD_HOR;
int8_t *cc = coeff_contexts;
- int row = height;
+ int col = width;
__m128i count;
__m128i level[5];
__m128i pos_to_offset[3];
- assert(!(height % 2));
+ assert(!(width % 2));
- if (height == 8) {
+ if (width == 8) {
pos_to_offset[0] =
_mm_setr_epi8(0, 1, 6, 6, 21, 21, 21, 21, 1, 6, 6, 21, 21, 21, 21, 21);
pos_to_offset[1] = _mm_setr_epi8(6, 6, 21, 21, 21, 21, 21, 21, 6, 21, 21,
21, 21, 21, 21, 21);
- } else if (height < 8) {
- pos_to_offset[0] = _mm_setr_epi8(0, 16, 6, 6, 21, 21, 21, 21, 16, 16, 6, 21,
+ } else if (width < 8) {
+ pos_to_offset[0] = _mm_setr_epi8(0, 11, 6, 6, 21, 21, 21, 21, 11, 11, 6, 21,
21, 21, 21, 21);
- pos_to_offset[1] = _mm_setr_epi8(16, 16, 21, 21, 21, 21, 21, 21, 16, 16, 21,
+ pos_to_offset[1] = _mm_setr_epi8(11, 11, 21, 21, 21, 21, 21, 21, 11, 11, 21,
21, 21, 21, 21, 21);
} else {
- pos_to_offset[0] = _mm_setr_epi8(0, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
- 11, 11, 11, 11, 11);
+ pos_to_offset[0] = _mm_setr_epi8(0, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
+ 16, 16, 16, 16, 16);
pos_to_offset[1] = _mm_setr_epi8(6, 6, 21, 21, 21, 21, 21, 21, 6, 21, 21,
21, 21, 21, 21, 21);
}
@@ -205,14 +205,14 @@
pos_to_offset[1] = pos_to_offset[2];
levels += 2 * stride;
cc += 16;
- row -= 2;
- } while (row);
+ col -= 2;
+ } while (col);
coeff_contexts[0] = 0;
}
-static INLINE void get_8_coeff_contexts_hor(const uint8_t *levels,
- const int height,
+static INLINE void get_8_coeff_contexts_ver(const uint8_t *levels,
+ const int width,
const ptrdiff_t *const offsets,
int8_t *coeff_contexts) {
const int stride = 8 + TX_PAD_HOR;
@@ -225,11 +225,11 @@
SIG_COEF_CONTEXTS_2D + 10, SIG_COEF_CONTEXTS_2D + 10,
SIG_COEF_CONTEXTS_2D + 10, SIG_COEF_CONTEXTS_2D + 10,
SIG_COEF_CONTEXTS_2D + 10, SIG_COEF_CONTEXTS_2D + 10);
- int row = height;
+ int col = width;
__m128i count;
__m128i level[5];
- assert(!(height % 2));
+ assert(!(width % 2));
do {
load_levels_8x2x5_sse2(levels, stride, offsets, level);
@@ -238,12 +238,12 @@
_mm_store_si128((__m128i *)coeff_contexts, count);
levels += 2 * stride;
coeff_contexts += 16;
- row -= 2;
- } while (row);
+ col -= 2;
+ } while (col);
}
-static INLINE void get_8_coeff_contexts_ver(const uint8_t *levels,
- const int height,
+static INLINE void get_8_coeff_contexts_hor(const uint8_t *levels,
+ const int width,
const ptrdiff_t *const offsets,
int8_t *coeff_contexts) {
const int stride = 8 + TX_PAD_HOR;
@@ -257,11 +257,11 @@
SIG_COEF_CONTEXTS_2D + 5, SIG_COEF_CONTEXTS_2D + 5,
SIG_COEF_CONTEXTS_2D + 5, SIG_COEF_CONTEXTS_2D + 5,
SIG_COEF_CONTEXTS_2D + 5, SIG_COEF_CONTEXTS_2D + 5);
- int row = height;
+ int col = width;
__m128i count;
__m128i level[5];
- assert(!(height % 2));
+ assert(!(width % 2));
do {
load_levels_8x2x5_sse2(levels, stride, offsets, level);
@@ -271,8 +271,8 @@
pos_to_offset = pos_to_offset_large;
levels += 2 * stride;
coeff_contexts += 16;
- row -= 2;
- } while (row);
+ col -= 2;
+ } while (col);
}
static INLINE void get_16n_coeff_contexts_2d(const uint8_t *levels,
@@ -281,15 +281,15 @@
const int width, const int height,
const ptrdiff_t *const offsets,
int8_t *coeff_contexts) {
- const int stride = width + TX_PAD_HOR;
+ const int stride = height + TX_PAD_HOR;
int8_t *cc = coeff_contexts;
- int row = height;
+ int col = width;
__m128i pos_to_offset[5];
__m128i pos_to_offset_large[3];
__m128i count;
__m128i level[5];
- assert(!(width % 16));
+ assert(!(height % 16));
pos_to_offset_large[2] = _mm_set1_epi8(21);
if (real_width == real_height) {
@@ -303,27 +303,27 @@
21, 21, 21, 21, 21);
pos_to_offset[4] = pos_to_offset_large[0] = pos_to_offset_large[1] =
pos_to_offset_large[2];
- } else if (real_width > real_height) {
- pos_to_offset[0] = _mm_setr_epi8(0, 16, 6, 6, 21, 21, 21, 21, 21, 21, 21,
+ } else if (real_width < real_height) {
+ pos_to_offset[0] = _mm_setr_epi8(0, 11, 6, 6, 21, 21, 21, 21, 21, 21, 21,
21, 21, 21, 21, 21);
- pos_to_offset[1] = _mm_setr_epi8(16, 16, 6, 21, 21, 21, 21, 21, 21, 21, 21,
+ pos_to_offset[1] = _mm_setr_epi8(11, 11, 6, 21, 21, 21, 21, 21, 21, 21, 21,
21, 21, 21, 21, 21);
pos_to_offset[2] = pos_to_offset[3] = pos_to_offset[4] = _mm_setr_epi8(
- 16, 16, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21);
+ 11, 11, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21);
pos_to_offset_large[0] = pos_to_offset_large[1] = pos_to_offset_large[2];
- } else { // real_width < real_height
+ } else { // real_width > real_height
pos_to_offset[0] = pos_to_offset[1] = _mm_setr_epi8(
- 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11);
+ 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16);
pos_to_offset[2] = _mm_setr_epi8(6, 6, 21, 21, 21, 21, 21, 21, 21, 21, 21,
21, 21, 21, 21, 21);
pos_to_offset[3] = _mm_setr_epi8(6, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
21, 21, 21, 21, 21);
pos_to_offset[4] = pos_to_offset_large[2];
- pos_to_offset_large[0] = pos_to_offset_large[1] = _mm_set1_epi8(11);
+ pos_to_offset_large[0] = pos_to_offset_large[1] = _mm_set1_epi8(16);
}
do {
- int w = width;
+ int h = height;
do {
load_levels_16x1x5_sse2(levels, stride, offsets, level);
@@ -332,9 +332,9 @@
_mm_store_si128((__m128i *)cc, count);
levels += 16;
cc += 16;
- w -= 16;
+ h -= 16;
pos_to_offset[0] = pos_to_offset_large[0];
- } while (w);
+ } while (h);
pos_to_offset[0] = pos_to_offset[1];
pos_to_offset[1] = pos_to_offset[2];
@@ -343,16 +343,16 @@
pos_to_offset_large[0] = pos_to_offset_large[1];
pos_to_offset_large[1] = pos_to_offset_large[2];
levels += TX_PAD_HOR;
- } while (--row);
+ } while (--col);
coeff_contexts[0] = 0;
}
-static INLINE void get_16n_coeff_contexts_hor(const uint8_t *levels,
+static INLINE void get_16n_coeff_contexts_ver(const uint8_t *levels,
const int width, const int height,
const ptrdiff_t *const offsets,
int8_t *coeff_contexts) {
- const int stride = width + TX_PAD_HOR;
+ const int stride = height + TX_PAD_HOR;
const __m128i pos_to_offset_large =
_mm_setr_epi8(SIG_COEF_CONTEXTS_2D + 10, SIG_COEF_CONTEXTS_2D + 10,
SIG_COEF_CONTEXTS_2D + 10, SIG_COEF_CONTEXTS_2D + 10,
@@ -364,9 +364,9 @@
SIG_COEF_CONTEXTS_2D + 10, SIG_COEF_CONTEXTS_2D + 10);
__m128i count;
__m128i level[5];
- int row = height;
+ int col = width;
- assert(!(width % 16));
+ assert(!(height % 16));
do {
__m128i pos_to_offset =
@@ -378,7 +378,7 @@
SIG_COEF_CONTEXTS_2D + 10, SIG_COEF_CONTEXTS_2D + 10,
SIG_COEF_CONTEXTS_2D + 10, SIG_COEF_CONTEXTS_2D + 10,
SIG_COEF_CONTEXTS_2D + 10, SIG_COEF_CONTEXTS_2D + 10);
- int w = width;
+ int h = height;
do {
load_levels_16x1x5_sse2(levels, stride, offsets, level);
@@ -388,31 +388,31 @@
pos_to_offset = pos_to_offset_large;
levels += 16;
coeff_contexts += 16;
- w -= 16;
- } while (w);
+ h -= 16;
+ } while (h);
levels += TX_PAD_HOR;
- } while (--row);
+ } while (--col);
}
-static INLINE void get_16n_coeff_contexts_ver(const uint8_t *levels,
+static INLINE void get_16n_coeff_contexts_hor(const uint8_t *levels,
const int width, const int height,
const ptrdiff_t *const offsets,
int8_t *coeff_contexts) {
- const int stride = width + TX_PAD_HOR;
+ const int stride = height + TX_PAD_HOR;
__m128i pos_to_offset[3];
__m128i count;
__m128i level[5];
- int row = height;
+ int col = width;
- assert(!(width % 16));
+ assert(!(height % 16));
pos_to_offset[0] = _mm_set1_epi8(SIG_COEF_CONTEXTS_2D + 0);
pos_to_offset[1] = _mm_set1_epi8(SIG_COEF_CONTEXTS_2D + 5);
pos_to_offset[2] = _mm_set1_epi8(SIG_COEF_CONTEXTS_2D + 10);
do {
- int w = width;
+ int h = height;
do {
load_levels_16x1x5_sse2(levels, stride, offsets, level);
@@ -421,13 +421,13 @@
_mm_store_si128((__m128i *)coeff_contexts, count);
levels += 16;
coeff_contexts += 16;
- w -= 16;
- } while (w);
+ h -= 16;
+ } while (h);
pos_to_offset[0] = pos_to_offset[1];
pos_to_offset[1] = pos_to_offset[2];
levels += TX_PAD_HOR;
- } while (--row);
+ } while (--col);
}
// Note: levels[] must be in the range [0, 127], inclusive.
@@ -446,7 +446,7 @@
const int real_height = tx_size_high[tx_size];
const int width = get_txb_wide(tx_size);
const int height = get_txb_high(tx_size);
- const int stride = width + TX_PAD_HOR;
+ const int stride = height + TX_PAD_HOR;
ptrdiff_t offsets[3];
/* coeff_contexts must be 16 byte aligned. */
@@ -457,11 +457,11 @@
offsets[1] = 1 * stride + 1;
offsets[2] = 2 * stride + 0;
- if (width == 4) {
- get_4_nz_map_contexts_2d(levels, height, offsets, coeff_contexts);
- } else if (width == 8) {
- get_8_coeff_contexts_2d(levels, height, offsets, coeff_contexts);
- } else if (width == 16) {
+ if (height == 4) {
+ get_4_nz_map_contexts_2d(levels, width, offsets, coeff_contexts);
+ } else if (height == 8) {
+ get_8_coeff_contexts_2d(levels, width, offsets, coeff_contexts);
+ } else if (height == 16) {
get_16n_coeff_contexts_2d(levels, real_width, real_height, width, height,
offsets, coeff_contexts);
} else {
@@ -469,36 +469,36 @@
offsets, coeff_contexts);
}
} else if (tx_class == TX_CLASS_HORIZ) {
- offsets[0] = 2;
- offsets[1] = 3;
- offsets[2] = 4;
- if (width == 4) {
- get_4_nz_map_contexts_hor(levels, height, offsets, coeff_contexts);
- } else if (width == 8) {
- get_8_coeff_contexts_hor(levels, height, offsets, coeff_contexts);
+ offsets[0] = 2 * stride;
+ offsets[1] = 3 * stride;
+ offsets[2] = 4 * stride;
+ if (height == 4) {
+ get_4_nz_map_contexts_hor(levels, width, offsets, coeff_contexts);
+ } else if (height == 8) {
+ get_8_coeff_contexts_hor(levels, width, offsets, coeff_contexts);
} else {
get_16n_coeff_contexts_hor(levels, width, height, offsets,
coeff_contexts);
}
} else { // TX_CLASS_VERT
- offsets[0] = 2 * stride;
- offsets[1] = 3 * stride;
- offsets[2] = 4 * stride;
- if (width == 4) {
- get_4_nz_map_contexts_ver(levels, height, offsets, coeff_contexts);
- } else if (width == 8) {
- get_8_coeff_contexts_ver(levels, height, offsets, coeff_contexts);
+ offsets[0] = 2;
+ offsets[1] = 3;
+ offsets[2] = 4;
+ if (height == 4) {
+ get_4_nz_map_contexts_ver(levels, width, offsets, coeff_contexts);
+ } else if (height == 8) {
+ get_8_coeff_contexts_ver(levels, width, offsets, coeff_contexts);
} else {
get_16n_coeff_contexts_ver(levels, width, height, offsets,
coeff_contexts);
}
}
- const int bwl = get_txb_bwl(tx_size);
+ const int bhl = get_txb_bhl(tx_size);
const int pos = scan[last_idx];
- if (last_idx <= (height << bwl) / 8)
+ if (last_idx <= (width << bhl) / 8)
coeff_contexts[pos] = 1;
- else if (last_idx <= (height << bwl) / 4)
+ else if (last_idx <= (width << bhl) / 4)
coeff_contexts[pos] = 2;
else
coeff_contexts[pos] = 3;
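
The tail of the function above assigns a special context to the last nonzero coefficient based on how early it falls in scan order. A scalar sketch, under the assumption that get_txb_bhl() returns the log2 block height so that width << bhl is the coefficient count:

#include <stdint.h>

// Earlier last positions get smaller contexts.
static int8_t last_coeff_context(int last_idx, int width, int bhl) {
  const int n = width << bhl;       // total coefficients in the block
  if (last_idx <= n / 8) return 1;  // last coeff within the first eighth
  if (last_idx <= n / 4) return 2;  // within the first quarter
  return 3;                         // anywhere later in scan order
}
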
diff --git a/av1/encoder/x86/encodetxb_sse4.c b/av1/encoder/x86/encodetxb_sse4.c
index aeb57f2..72bd8e3 100644
--- a/av1/encoder/x86/encodetxb_sse4.c
+++ b/av1/encoder/x86/encodetxb_sse4.c
@@ -20,11 +20,11 @@
void av1_txb_init_levels_sse4_1(const tran_low_t *const coeff, const int width,
const int height, uint8_t *const levels) {
- const int stride = width + TX_PAD_HOR;
+ const int stride = height + TX_PAD_HOR;
const __m128i zeros = _mm_setzero_si128();
const int32_t bottom_len = sizeof(*levels) * (TX_PAD_BOTTOM * stride);
- uint8_t *bottom_buf = levels + stride * height;
+ uint8_t *bottom_buf = levels + stride * width;
uint8_t *bottom_buf_end = bottom_buf + bottom_len;
do {
_mm_storeu_si128((__m128i *)(bottom_buf), zeros);
@@ -34,7 +34,7 @@
int i = 0;
uint8_t *ls = levels;
const tran_low_t *cf = coeff;
- if (width == 4) {
+ if (height == 4) {
do {
const __m128i coeffA = xx_loadu_128(cf);
const __m128i coeffB = xx_loadu_128(cf + 4);
@@ -44,10 +44,10 @@
const __m128i lsAB = _mm_unpacklo_epi32(absAB8, zeros);
xx_storeu_128(ls, lsAB);
ls += (stride << 1);
- cf += (width << 1);
+ cf += (height << 1);
i += 2;
- } while (i < height);
- } else if (width == 8) {
+ } while (i < width);
+ } else if (height == 8) {
do {
const __m128i coeffA = xx_loadu_128(cf);
const __m128i coeffB = xx_loadu_128(cf + 4);
@@ -56,9 +56,9 @@
const __m128i absAB8 = _mm_packs_epi16(absAB, zeros);
xx_storeu_128(ls, absAB8);
ls += stride;
- cf += width;
+ cf += height;
i += 1;
- } while (i < height);
+ } while (i < width);
} else {
do {
int j = 0;
@@ -75,10 +75,10 @@
xx_storeu_128(ls + j, absABCD);
j += 16;
cf += 16;
- } while (j < width);
- *(int32_t *)(ls + width) = 0;
+ } while (j < height);
+ *(int32_t *)(ls + height) = 0;
ls += stride;
i += 1;
- } while (i < height);
+ } while (i < width);
}
}
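
The per-row narrowing in the SSE4.1 kernel above (32-bit coefficients to absolute values clipped to [0, 127]) can be sketched in isolation as follows; this is an illustration, not the library function:

#include <smmintrin.h>  // SSE4.1 (also provides the SSSE3 _mm_abs_epi16)
#include <stdint.h>

// Eight 32-bit coefficients to absolute values clipped to [0, 127].
// _mm_packs_epi16 saturates signed, and the inputs are non-negative after
// _mm_abs_epi16, so the output bytes land in [0, 127].
static void abs_clip_8_coeffs(const int32_t *coeff, uint8_t *dst) {
  const __m128i zero = _mm_setzero_si128();
  const __m128i a = _mm_loadu_si128((const __m128i *)coeff);
  const __m128i b = _mm_loadu_si128((const __m128i *)(coeff + 4));
  const __m128i ab16 = _mm_packs_epi32(a, b);
  const __m128i abs16 = _mm_abs_epi16(ab16);
  const __m128i out8 = _mm_packs_epi16(abs16, zero);
  _mm_storel_epi64((__m128i *)dst, out8);
}
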
diff --git a/av1/encoder/x86/error_intrin_sse2.c b/av1/encoder/x86/error_intrin_sse2.c
index e876db1..61f65c6 100644
--- a/av1/encoder/x86/error_intrin_sse2.c
+++ b/av1/encoder/x86/error_intrin_sse2.c
@@ -65,11 +65,11 @@
accum = reduce_sum_epi64(accum);
// Store the results.
-#if ARCH_X86_64
+#if AOM_ARCH_X86_64
return _mm_cvtsi128_si64(accum);
#else
int64_t result;
_mm_storel_epi64((__m128i *)&result, accum);
return result;
-#endif // ARCH_X86_64
+#endif // AOM_ARCH_X86_64
}
diff --git a/av1/encoder/x86/error_sse2.asm b/av1/encoder/x86/error_sse2.asm
index f4b4968..6407c10 100644
--- a/av1/encoder/x86/error_sse2.asm
+++ b/av1/encoder/x86/error_sse2.asm
@@ -75,7 +75,7 @@
movhlps m7, m6
paddq m4, m5
paddq m6, m7
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
movq rax, m4
movq [sszq], m6
%else
diff --git a/av1/encoder/x86/highbd_fwd_txfm_avx2.c b/av1/encoder/x86/highbd_fwd_txfm_avx2.c
index 1faa412..9cdf21f 100644
--- a/av1/encoder/x86/highbd_fwd_txfm_avx2.c
+++ b/av1/encoder/x86/highbd_fwd_txfm_avx2.c
@@ -561,8 +561,7 @@
fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
fdct8_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
- store_buffer_avx2(in, coeff, 8, 8);
+ store_buffer_avx2(out, coeff, 8, 8);
break;
case ADST_DCT:
load_buffer_8x8_avx2(input, in, stride, 0, 0, shift[0]);
@@ -572,8 +571,7 @@
fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
fdct8_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
- store_buffer_avx2(in, coeff, 8, 8);
+ store_buffer_avx2(out, coeff, 8, 8);
break;
case DCT_ADST:
load_buffer_8x8_avx2(input, in, stride, 0, 0, shift[0]);
@@ -583,8 +581,7 @@
fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
fadst8_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
- store_buffer_avx2(in, coeff, 8, 8);
+ store_buffer_avx2(out, coeff, 8, 8);
break;
case ADST_ADST:
load_buffer_8x8_avx2(input, in, stride, 0, 0, shift[0]);
@@ -594,8 +591,7 @@
fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
fadst8_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
- store_buffer_avx2(in, coeff, 8, 8);
+ store_buffer_avx2(out, coeff, 8, 8);
break;
case FLIPADST_DCT:
load_buffer_8x8_avx2(input, in, stride, 1, 0, shift[0]);
@@ -605,8 +601,7 @@
fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
fdct8_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
- store_buffer_avx2(in, coeff, 8, 8);
+ store_buffer_avx2(out, coeff, 8, 8);
break;
case DCT_FLIPADST:
load_buffer_8x8_avx2(input, in, stride, 0, 1, shift[0]);
@@ -616,8 +611,7 @@
fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
fadst8_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
- store_buffer_avx2(in, coeff, 8, 8);
+ store_buffer_avx2(out, coeff, 8, 8);
break;
case FLIPADST_FLIPADST:
load_buffer_8x8_avx2(input, in, stride, 1, 1, shift[0]);
@@ -627,8 +621,7 @@
fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
fadst8_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
- store_buffer_avx2(in, coeff, 8, 8);
+ store_buffer_avx2(out, coeff, 8, 8);
break;
case ADST_FLIPADST:
load_buffer_8x8_avx2(input, in, stride, 0, 1, shift[0]);
@@ -638,8 +631,7 @@
fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
fadst8_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
- store_buffer_avx2(in, coeff, 8, 8);
+ store_buffer_avx2(out, coeff, 8, 8);
break;
case FLIPADST_ADST:
load_buffer_8x8_avx2(input, in, stride, 1, 0, shift[0]);
@@ -649,26 +641,27 @@
fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
fadst8_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
- store_buffer_avx2(in, coeff, 8, 8);
+ store_buffer_avx2(out, coeff, 8, 8);
break;
case IDTX:
load_buffer_8x8_avx2(input, in, stride, 0, 0, shift[0]);
idtx8_avx2(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], width_div8,
width_div8);
col_txfm_8x8_rounding(out, -shift[1]);
- idtx8_avx2(out, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], width_div8,
+ fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
+ idtx8_avx2(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], width_div8,
width_div8);
- store_buffer_avx2(in, coeff, 8, 8);
+ store_buffer_avx2(out, coeff, 8, 8);
break;
case V_DCT:
load_buffer_8x8_avx2(input, in, stride, 0, 0, shift[0]);
fdct8_avx2(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], width_div8,
width_div8);
col_txfm_8x8_rounding(out, -shift[1]);
- idtx8_avx2(out, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], width_div8,
+ fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
+ idtx8_avx2(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], width_div8,
width_div8);
- store_buffer_avx2(in, coeff, 8, 8);
+ store_buffer_avx2(out, coeff, 8, 8);
break;
case H_DCT:
load_buffer_8x8_avx2(input, in, stride, 0, 0, shift[0]);
@@ -678,17 +671,17 @@
fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
fdct8_avx2(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
- store_buffer_avx2(in, coeff, 8, 8);
+ store_buffer_avx2(out, coeff, 8, 8);
break;
case V_ADST:
load_buffer_8x8_avx2(input, in, stride, 0, 0, shift[0]);
fadst8_avx2(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], width_div8,
width_div8);
col_txfm_8x8_rounding(out, -shift[1]);
- idtx8_avx2(out, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], width_div8,
+ fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
+ idtx8_avx2(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], width_div8,
width_div8);
- store_buffer_avx2(in, coeff, 8, 8);
+ store_buffer_avx2(out, coeff, 8, 8);
break;
case H_ADST:
load_buffer_8x8_avx2(input, in, stride, 0, 0, shift[0]);
@@ -698,17 +691,17 @@
fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
fadst8_avx2(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
- store_buffer_avx2(in, coeff, 8, 8);
+ store_buffer_avx2(out, coeff, 8, 8);
break;
case V_FLIPADST:
load_buffer_8x8_avx2(input, in, stride, 1, 0, shift[0]);
fadst8_avx2(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], width_div8,
width_div8);
col_txfm_8x8_rounding(out, -shift[1]);
- idtx8_avx2(out, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], width_div8,
+ fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
+ idtx8_avx2(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], width_div8,
width_div8);
- store_buffer_avx2(in, coeff, 8, 8);
+ store_buffer_avx2(out, coeff, 8, 8);
break;
case H_FLIPADST:
load_buffer_8x8_avx2(input, in, stride, 0, 1, shift[0]);
@@ -718,8 +711,7 @@
fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
fadst8_avx2(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_8x8_avx2(out, in, width_div8, width_div8);
- store_buffer_avx2(in, coeff, 8, 8);
+ store_buffer_avx2(out, coeff, 8, 8);
break;
default: assert(0);
}
@@ -1333,9 +1325,7 @@
fwd_txfm_transpose_8x8_avx2(out, in, 1, 2);
fwd_txfm_transpose_8x8_avx2(&out[8], &in[1], 1, 2);
row_txfm(in, out, bit, 2, 2);
- fwd_txfm_transpose_8x8_avx2(out, in, 2, 1);
- fwd_txfm_transpose_8x8_avx2(&out[1], &in[8], 2, 1);
- round_shift_rect_array_32_avx2(in, in, 16, -shift[2], NewSqrt2);
+ round_shift_rect_array_32_avx2(out, in, 16, -shift[2], NewSqrt2);
store_buffer_avx2(in, coeff, 8, 16);
(void)bd;
}
@@ -1394,10 +1384,8 @@
fwd_txfm_transpose_8x8_avx2(out, in, 2, 1);
fwd_txfm_transpose_8x8_avx2(&out[1], &in[8], 2, 1);
row_txfm(in, out, bit, 1, 1);
- fwd_txfm_transpose_8x8_avx2(out, in, 1, 2);
- fwd_txfm_transpose_8x8_avx2(&out[8], &in[1], 1, 2);
- round_shift_rect_array_32_avx2(in, in, 16, -shift[2], NewSqrt2);
- store_buffer_avx2(in, coeff, 8, 16);
+ round_shift_rect_array_32_avx2(out, out, 16, -shift[2], NewSqrt2);
+ store_buffer_avx2(out, coeff, 8, 16);
(void)bd;
}
void av1_fwd_txfm2d_16x16_avx2(const int16_t *input, int32_t *coeff, int stride,
@@ -1422,8 +1410,7 @@
fwd_txfm_transpose_16x16_avx2(out, in);
fdct16_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_16x16_avx2(out, in);
- store_buffer_avx2(in, coeff, 8, 32);
+ store_buffer_avx2(out, coeff, 8, 32);
break;
case ADST_DCT:
load_buffer_16xn_avx2(input, in, stride, height, width_div8, 0, 0);
@@ -1434,8 +1421,7 @@
fwd_txfm_transpose_16x16_avx2(out, in);
fdct16_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_16x16_avx2(out, in);
- store_buffer_avx2(in, coeff, 8, 32);
+ store_buffer_avx2(out, coeff, 8, 32);
break;
case DCT_ADST:
load_buffer_16xn_avx2(input, in, stride, height, width_div8, 0, 0);
@@ -1446,8 +1432,7 @@
fwd_txfm_transpose_16x16_avx2(out, in);
fadst16_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_16x16_avx2(out, in);
- store_buffer_avx2(in, coeff, 8, 32);
+ store_buffer_avx2(out, coeff, 8, 32);
break;
case ADST_ADST:
load_buffer_16xn_avx2(input, in, stride, height, width_div8, 0, 0);
@@ -1458,8 +1443,7 @@
fwd_txfm_transpose_16x16_avx2(out, in);
fadst16_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_16x16_avx2(out, in);
- store_buffer_avx2(in, coeff, 8, 32);
+ store_buffer_avx2(out, coeff, 8, 32);
break;
case FLIPADST_DCT:
load_buffer_16xn_avx2(input, in, stride, height, width_div8, 1, 0);
@@ -1470,8 +1454,7 @@
fwd_txfm_transpose_16x16_avx2(out, in);
fdct16_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_16x16_avx2(out, in);
- store_buffer_avx2(in, coeff, 8, 32);
+ store_buffer_avx2(out, coeff, 8, 32);
break;
case DCT_FLIPADST:
load_buffer_16xn_avx2(input, in, stride, height, width_div8, 0, 1);
@@ -1482,8 +1465,7 @@
fwd_txfm_transpose_16x16_avx2(out, in);
fadst16_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_16x16_avx2(out, in);
- store_buffer_avx2(in, coeff, 8, 32);
+ store_buffer_avx2(out, coeff, 8, 32);
break;
case FLIPADST_FLIPADST:
load_buffer_16xn_avx2(input, in, stride, height, width_div8, 1, 1);
@@ -1494,8 +1476,7 @@
fwd_txfm_transpose_16x16_avx2(out, in);
fadst16_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_16x16_avx2(out, in);
- store_buffer_avx2(in, coeff, 8, 32);
+ store_buffer_avx2(out, coeff, 8, 32);
break;
case ADST_FLIPADST:
load_buffer_16xn_avx2(input, in, stride, height, width_div8, 0, 1);
@@ -1506,8 +1487,7 @@
fwd_txfm_transpose_16x16_avx2(out, in);
fadst16_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_16x16_avx2(out, in);
- store_buffer_avx2(in, coeff, 8, 32);
+ store_buffer_avx2(out, coeff, 8, 32);
break;
case FLIPADST_ADST:
load_buffer_16xn_avx2(input, in, stride, height, width_div8, 1, 0);
@@ -1518,8 +1498,7 @@
fwd_txfm_transpose_16x16_avx2(out, in);
fadst16_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_16x16_avx2(out, in);
- store_buffer_avx2(in, coeff, 8, 32);
+ store_buffer_avx2(out, coeff, 8, 32);
break;
case IDTX:
load_buffer_16xn_avx2(input, in, stride, height, width_div8, 0, 0);
@@ -1527,9 +1506,10 @@
idtx16_avx2(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], width_div8,
width_div8);
round_shift_32_8xn_avx2(out, size, shift[1], width_div16);
- idtx16_avx2(out, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
+ fwd_txfm_transpose_16x16_avx2(out, in);
+ idtx16_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- store_buffer_avx2(in, coeff, 8, 32);
+ store_buffer_avx2(out, coeff, 8, 32);
break;
case V_DCT:
load_buffer_16xn_avx2(input, in, stride, height, width_div8, 0, 0);
@@ -1537,9 +1517,10 @@
fdct16_avx2(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], width_div8,
width_div8);
round_shift_32_8xn_avx2(out, size, shift[1], width_div16);
- idtx16_avx2(out, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
+ fwd_txfm_transpose_16x16_avx2(out, in);
+ idtx16_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- store_buffer_avx2(in, coeff, 8, 32);
+ store_buffer_avx2(out, coeff, 8, 32);
break;
case H_DCT:
load_buffer_16xn_avx2(input, in, stride, height, width_div8, 0, 0);
@@ -1550,8 +1531,7 @@
fwd_txfm_transpose_16x16_avx2(out, in);
fdct16_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_16x16_avx2(out, in);
- store_buffer_avx2(in, coeff, 8, 32);
+ store_buffer_avx2(out, coeff, 8, 32);
break;
case V_ADST:
load_buffer_16xn_avx2(input, in, stride, height, width_div8, 0, 0);
@@ -1559,9 +1539,10 @@
fadst16_avx2(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], width_div8,
width_div8);
round_shift_32_8xn_avx2(out, size, shift[1], width_div16);
- idtx16_avx2(out, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
+ fwd_txfm_transpose_16x16_avx2(out, in);
+ idtx16_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- store_buffer_avx2(in, coeff, 8, 32);
+ store_buffer_avx2(out, coeff, 8, 32);
break;
case H_ADST:
load_buffer_16xn_avx2(input, in, stride, height, width_div8, 0, 0);
@@ -1572,8 +1553,7 @@
fwd_txfm_transpose_16x16_avx2(out, in);
fadst16_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_16x16_avx2(out, in);
- store_buffer_avx2(in, coeff, 8, 32);
+ store_buffer_avx2(out, coeff, 8, 32);
break;
case V_FLIPADST:
load_buffer_16xn_avx2(input, in, stride, height, width_div8, 1, 0);
@@ -1581,9 +1561,10 @@
fadst16_avx2(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], width_div8,
width_div8);
round_shift_32_8xn_avx2(out, size, shift[1], width_div16);
- idtx16_avx2(out, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
+ fwd_txfm_transpose_16x16_avx2(out, in);
+ idtx16_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- store_buffer_avx2(in, coeff, 8, 32);
+ store_buffer_avx2(out, coeff, 8, 32);
break;
case H_FLIPADST:
load_buffer_16xn_avx2(input, in, stride, height, width_div8, 0, 1);
@@ -1594,8 +1575,7 @@
fwd_txfm_transpose_16x16_avx2(out, in);
fadst16_avx2(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], width_div8,
width_div8);
- fwd_txfm_transpose_16x16_avx2(out, in);
- store_buffer_avx2(in, coeff, 8, 32);
+ store_buffer_avx2(out, coeff, 8, 32);
break;
default: assert(0);
}
@@ -2091,15 +2071,7 @@
round_shift_32_8xn_avx2(&buf1[(i << 1) + 1], height, shift[2], width_div8);
}
- for (r = 0; r < height; r += 8) {
- for (c = 0; c < width_div8; c++) {
- fwd_txfm_transpose_8x8_avx2(&buf1[r * width_div8 + c],
- &buf0[c * 8 * width_div8 + (r >> 3)],
- width_div8, width_div8);
- }
- }
-
- store_buffer_avx2(buf0, output, 8, 128);
+ store_buffer_avx2(buf1, output, 8, 128);
}
static INLINE void fdct64_stage2_avx2(__m256i *x1, __m256i *x2,
__m256i *cospi_m32, __m256i *cospi_p32,
@@ -3156,12 +3128,5 @@
width_div16);
}
- for (r = 0; r < (height >> 1); r += 8) {
- for (c = 0; c < width_div16; c++) {
- fwd_txfm_transpose_8x8_avx2(&buf0[r * width_div16 + c],
- &buf1[c * 8 * width_div16 + (r >> 3)],
- width_div16, width_div16);
- }
- }
- store_buffer_avx2(buf1, output, 8, 128);
+ store_buffer_avx2(buf0, output, 8, 128);
}
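
All of the transform cases above follow one separable pattern: a column pass, an intermediate rounding shift, a transpose so the row kernel sees contiguous data, then a row pass. A scalar model of that flow, with placeholder 1D kernels, shift1 > 0 assumed, and the final coefficient-ordering convention left to the caller:

#include <stdint.h>
#include <string.h>

typedef void (*tx1d_fn)(const int32_t *in, int32_t *out);

static void transpose_8x8_scalar(const int32_t *in, int32_t *out) {
  for (int r = 0; r < 8; ++r)
    for (int c = 0; c < 8; ++c) out[c * 8 + r] = in[r * 8 + c];
}

static void round_shift_block(int32_t *buf, int n, int shift) {
  for (int i = 0; i < n; ++i)  // shift > 0 assumed
    buf[i] = (buf[i] + (1 << (shift - 1))) >> shift;
}

// Column pass, rounding, transpose, row pass. The transposes exist so the
// 1D kernels always operate on contiguous 8-element rows.
static void fwd_txfm2d_8x8_model(const int32_t *input, int32_t *coeff,
                                 tx1d_fn col_txfm, tx1d_fn row_txfm,
                                 int shift1) {
  int32_t a[64], b[64];
  transpose_8x8_scalar(input, a);
  for (int i = 0; i < 8; ++i) col_txfm(a + 8 * i, b + 8 * i);
  round_shift_block(b, 64, shift1);
  transpose_8x8_scalar(b, a);
  for (int i = 0; i < 8; ++i) row_txfm(a + 8 * i, b + 8 * i);
  memcpy(coeff, b, sizeof(b));  // output ordering left to the caller
}
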
diff --git a/av1/encoder/x86/highbd_fwd_txfm_sse4.c b/av1/encoder/x86/highbd_fwd_txfm_sse4.c
index 73f9b44..158b4ae 100644
--- a/av1/encoder/x86/highbd_fwd_txfm_sse4.c
+++ b/av1/encoder/x86/highbd_fwd_txfm_sse4.c
@@ -22,6 +22,13 @@
#include "config/aom_config.h"
#include "config/av1_rtcd.h"
+static INLINE void store_output_w4(int32_t *const out, const __m128i *const in,
+ const int stride, const int out_size) {
+ for (int i = 0; i < out_size; ++i) {
+ _mm_store_si128((__m128i *)(out + i * stride), in[i]);
+ }
+}
+
void av1_fwht4x4_sse4_1(const int16_t *input, tran_low_t *output, int stride) {
__m128i in[4];
in[0] = _mm_loadl_epi64((const __m128i *)(input + 0 * stride));
@@ -57,7 +64,9 @@
op[2] = d1;
op[3] = b1;
- transpose_32bit_4x4(op, op);
+ if (i == 0) {
+ transpose_32bit_4x4(op, op);
+ }
}
op[0] = _mm_slli_epi32(op[0], UNIT_QUANT_SHIFT);
@@ -71,11 +80,6 @@
_mm_storeu_si128((__m128i *)(output + 12), op[3]);
}
-void av1_highbd_fwht4x4_sse4_1(const int16_t *input, tran_low_t *output,
- int stride) {
- av1_fwht4x4_sse4_1(input, output, stride);
-}
-
static INLINE void load_buffer_4x4(const int16_t *input, __m128i *in,
int stride, int flipud, int fliplr,
int shift) {
@@ -160,16 +164,10 @@
// Note: shift[1] and shift[2] are zeros
- // Transpose 4x4 32-bit
- v0 = _mm_unpacklo_epi32(u0, u1);
- v1 = _mm_unpackhi_epi32(u0, u1);
- v2 = _mm_unpacklo_epi32(u2, u3);
- v3 = _mm_unpackhi_epi32(u2, u3);
-
- out[0] = _mm_unpacklo_epi64(v0, v2);
- out[1] = _mm_unpackhi_epi64(v0, v2);
- out[2] = _mm_unpacklo_epi64(v1, v3);
- out[3] = _mm_unpackhi_epi64(v1, v3);
+ out[0] = u0;
+ out[1] = u1;
+ out[2] = u2;
+ out[3] = u3;
}
static INLINE void write_buffer_4x4(__m128i *res, int32_t *output) {
@@ -191,7 +189,6 @@
__m128i s0, s1, s2, s3, s4, s5, s6, s7;
__m128i x0, x1, x2, x3;
__m128i u0, u1, u2, u3;
- __m128i v0, v1, v2, v3;
int idx = 0 * num_col;
s0 = _mm_mullo_epi32(in[idx], sinpi1);
@@ -232,39 +229,22 @@
u3 = _mm_add_epi32(s3, rnding);
u3 = _mm_srai_epi32(u3, bit);
- v0 = _mm_unpacklo_epi32(u0, u1);
- v1 = _mm_unpackhi_epi32(u0, u1);
- v2 = _mm_unpacklo_epi32(u2, u3);
- v3 = _mm_unpackhi_epi32(u2, u3);
-
- out[0] = _mm_unpacklo_epi64(v0, v2);
- out[1] = _mm_unpackhi_epi64(v0, v2);
- out[2] = _mm_unpacklo_epi64(v1, v3);
- out[3] = _mm_unpackhi_epi64(v1, v3);
+ out[0] = u0;
+ out[1] = u1;
+ out[2] = u2;
+ out[3] = u3;
}
static void idtx4x4_sse4_1(__m128i *in, __m128i *out, int bit, int col_num) {
(void)bit;
__m128i fact = _mm_set1_epi32(NewSqrt2);
__m128i offset = _mm_set1_epi32(1 << (NewSqrt2Bits - 1));
__m128i a_low;
- __m128i v[4];
for (int i = 0; i < 4; i++) {
a_low = _mm_mullo_epi32(in[i * col_num], fact);
a_low = _mm_add_epi32(a_low, offset);
out[i] = _mm_srai_epi32(a_low, NewSqrt2Bits);
}
-
- // Transpose for 4x4
- v[0] = _mm_unpacklo_epi32(out[0], out[1]);
- v[1] = _mm_unpackhi_epi32(out[0], out[1]);
- v[2] = _mm_unpacklo_epi32(out[2], out[3]);
- v[3] = _mm_unpackhi_epi32(out[2], out[3]);
-
- out[0] = _mm_unpacklo_epi64(v[0], v[2]);
- out[1] = _mm_unpackhi_epi64(v[0], v[2]);
- out[2] = _mm_unpacklo_epi64(v[1], v[3]);
- out[3] = _mm_unpackhi_epi64(v[1], v[3]);
}
void av1_fwd_txfm2d_4x4_sse4_1(const int16_t *input, int32_t *coeff,
int input_stride, TX_TYPE tx_type, int bd) {
@@ -277,96 +257,112 @@
case DCT_DCT:
load_buffer_4x4(input, in, input_stride, 0, 0, shift[0]);
fdct4x4_sse4_1(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_32bit_4x4(in, in);
fdct4x4_sse4_1(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case ADST_DCT:
load_buffer_4x4(input, in, input_stride, 0, 0, shift[0]);
fadst4x4_sse4_1(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_32bit_4x4(in, in);
fdct4x4_sse4_1(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case DCT_ADST:
load_buffer_4x4(input, in, input_stride, 0, 0, shift[0]);
fdct4x4_sse4_1(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_32bit_4x4(in, in);
fadst4x4_sse4_1(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case ADST_ADST:
load_buffer_4x4(input, in, input_stride, 0, 0, shift[0]);
fadst4x4_sse4_1(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_32bit_4x4(in, in);
fadst4x4_sse4_1(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case FLIPADST_DCT:
load_buffer_4x4(input, in, input_stride, 1, 0, shift[0]);
fadst4x4_sse4_1(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_32bit_4x4(in, in);
fdct4x4_sse4_1(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case DCT_FLIPADST:
load_buffer_4x4(input, in, input_stride, 0, 1, shift[0]);
fdct4x4_sse4_1(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_32bit_4x4(in, in);
fadst4x4_sse4_1(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case FLIPADST_FLIPADST:
load_buffer_4x4(input, in, input_stride, 1, 1, shift[0]);
fadst4x4_sse4_1(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_32bit_4x4(in, in);
fadst4x4_sse4_1(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case ADST_FLIPADST:
load_buffer_4x4(input, in, input_stride, 0, 1, shift[0]);
fadst4x4_sse4_1(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_32bit_4x4(in, in);
fadst4x4_sse4_1(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case FLIPADST_ADST:
load_buffer_4x4(input, in, input_stride, 1, 0, shift[0]);
fadst4x4_sse4_1(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_32bit_4x4(in, in);
fadst4x4_sse4_1(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case IDTX:
load_buffer_4x4(input, in, input_stride, 0, 0, shift[0]);
idtx4x4_sse4_1(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_32bit_4x4(in, in);
idtx4x4_sse4_1(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case V_DCT:
load_buffer_4x4(input, in, input_stride, 0, 0, shift[0]);
fdct4x4_sse4_1(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_32bit_4x4(in, in);
idtx4x4_sse4_1(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case H_DCT:
load_buffer_4x4(input, in, input_stride, 0, 0, shift[0]);
idtx4x4_sse4_1(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
+ transpose_32bit_4x4(in, in);
fdct4x4_sse4_1(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case V_ADST:
load_buffer_4x4(input, in, input_stride, 0, 0, shift[0]);
fadst4x4_sse4_1(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
+ transpose_32bit_4x4(in, in);
idtx4x4_sse4_1(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case H_ADST:
load_buffer_4x4(input, in, input_stride, 0, 0, shift[0]);
idtx4x4_sse4_1(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
+ transpose_32bit_4x4(in, in);
fadst4x4_sse4_1(in, in, av1_fwd_cos_bit_col[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case V_FLIPADST:
load_buffer_4x4(input, in, input_stride, 1, 0, shift[0]);
fadst4x4_sse4_1(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
+ transpose_32bit_4x4(in, in);
idtx4x4_sse4_1(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
case H_FLIPADST:
load_buffer_4x4(input, in, input_stride, 0, 1, shift[0]);
idtx4x4_sse4_1(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
+ transpose_32bit_4x4(in, in);
fadst4x4_sse4_1(in, in, av1_fwd_cos_bit_row[txw_idx][txh_idx], 1);
write_buffer_4x4(in, coeff);
break;
@@ -911,8 +907,7 @@
col_txfm_8x8_rounding(out, -shift[1]);
transpose_8x8(out, in);
fdct8x8_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case ADST_DCT:
load_buffer_8x8(input, in, stride, 0, 0, shift[0]);
@@ -920,8 +915,7 @@
col_txfm_8x8_rounding(out, -shift[1]);
transpose_8x8(out, in);
fdct8x8_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case DCT_ADST:
load_buffer_8x8(input, in, stride, 0, 0, shift[0]);
@@ -929,8 +923,7 @@
col_txfm_8x8_rounding(out, -shift[1]);
transpose_8x8(out, in);
fadst8x8_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case ADST_ADST:
load_buffer_8x8(input, in, stride, 0, 0, shift[0]);
@@ -938,8 +931,7 @@
col_txfm_8x8_rounding(out, -shift[1]);
transpose_8x8(out, in);
fadst8x8_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case FLIPADST_DCT:
load_buffer_8x8(input, in, stride, 1, 0, shift[0]);
@@ -947,8 +939,7 @@
col_txfm_8x8_rounding(out, -shift[1]);
transpose_8x8(out, in);
fdct8x8_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case DCT_FLIPADST:
load_buffer_8x8(input, in, stride, 0, 1, shift[0]);
@@ -956,8 +947,7 @@
col_txfm_8x8_rounding(out, -shift[1]);
transpose_8x8(out, in);
fadst8x8_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case FLIPADST_FLIPADST:
load_buffer_8x8(input, in, stride, 1, 1, shift[0]);
@@ -965,8 +955,7 @@
col_txfm_8x8_rounding(out, -shift[1]);
transpose_8x8(out, in);
fadst8x8_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case ADST_FLIPADST:
load_buffer_8x8(input, in, stride, 0, 1, shift[0]);
@@ -974,8 +963,7 @@
col_txfm_8x8_rounding(out, -shift[1]);
transpose_8x8(out, in);
fadst8x8_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case FLIPADST_ADST:
load_buffer_8x8(input, in, stride, 1, 0, shift[0]);
@@ -983,8 +971,7 @@
col_txfm_8x8_rounding(out, -shift[1]);
transpose_8x8(out, in);
fadst8x8_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case IDTX:
load_buffer_8x8(input, in, stride, 0, 0, shift[0]);
@@ -992,8 +979,7 @@
col_txfm_8x8_rounding(out, -shift[1]);
transpose_8x8(out, in);
idtx8x8_sse4_1(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case V_DCT:
load_buffer_8x8(input, in, stride, 0, 0, shift[0]);
@@ -1001,8 +987,7 @@
col_txfm_8x8_rounding(out, -shift[1]);
transpose_8x8(out, in);
idtx8x8_sse4_1(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case H_DCT:
load_buffer_8x8(input, in, stride, 0, 0, shift[0]);
@@ -1010,8 +995,7 @@
col_txfm_8x8_rounding(out, -shift[1]);
transpose_8x8(out, in);
fdct8x8_sse4_1(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case V_ADST:
load_buffer_8x8(input, in, stride, 0, 0, shift[0]);
@@ -1019,8 +1003,7 @@
col_txfm_8x8_rounding(out, -shift[1]);
transpose_8x8(out, in);
idtx8x8_sse4_1(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case H_ADST:
load_buffer_8x8(input, in, stride, 0, 0, shift[0]);
@@ -1028,8 +1011,7 @@
col_txfm_8x8_rounding(out, -shift[1]);
transpose_8x8(out, in);
fadst8x8_sse4_1(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case V_FLIPADST:
load_buffer_8x8(input, in, stride, 1, 0, shift[0]);
@@ -1037,8 +1019,7 @@
col_txfm_8x8_rounding(out, -shift[1]);
transpose_8x8(out, in);
idtx8x8_sse4_1(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
case H_FLIPADST:
load_buffer_8x8(input, in, stride, 0, 1, shift[0]);
@@ -1046,8 +1027,7 @@
col_txfm_8x8_rounding(out, -shift[1]);
transpose_8x8(out, in);
fadst8x8_sse4_1(in, out, av1_fwd_cos_bit_col[txw_idx][txh_idx], 2);
- transpose_8x8(out, in);
- write_buffer_8x8(in, coeff);
+ write_buffer_8x8(out, coeff);
break;
default: assert(0);
}
@@ -1819,8 +1799,7 @@
col_txfm_16x16_rounding(out, -shift[1]);
transpose_16x16(out, in);
fdct16x16_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case ADST_DCT:
load_buffer_16x16(input, in, stride, 0, 0, shift[0]);
@@ -1829,8 +1808,7 @@
col_txfm_16x16_rounding(out, -shift[1]);
transpose_16x16(out, in);
fdct16x16_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case DCT_ADST:
load_buffer_16x16(input, in, stride, 0, 0, shift[0]);
@@ -1839,8 +1817,7 @@
transpose_16x16(out, in);
fadst16x16_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx],
col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case ADST_ADST:
load_buffer_16x16(input, in, stride, 0, 0, shift[0]);
@@ -1850,8 +1827,7 @@
transpose_16x16(out, in);
fadst16x16_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx],
col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case FLIPADST_DCT:
load_buffer_16x16(input, in, stride, 1, 0, shift[0]);
@@ -1860,8 +1836,7 @@
col_txfm_16x16_rounding(out, -shift[1]);
transpose_16x16(out, in);
fdct16x16_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case DCT_FLIPADST:
load_buffer_16x16(input, in, stride, 0, 1, shift[0]);
@@ -1870,8 +1845,7 @@
transpose_16x16(out, in);
fadst16x16_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx],
col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case FLIPADST_FLIPADST:
load_buffer_16x16(input, in, stride, 1, 1, shift[0]);
@@ -1881,8 +1855,7 @@
transpose_16x16(out, in);
fadst16x16_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx],
col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case ADST_FLIPADST:
load_buffer_16x16(input, in, stride, 0, 1, shift[0]);
@@ -1892,8 +1865,7 @@
transpose_16x16(out, in);
fadst16x16_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx],
col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case FLIPADST_ADST:
load_buffer_16x16(input, in, stride, 1, 0, shift[0]);
@@ -1903,8 +1875,7 @@
transpose_16x16(out, in);
fadst16x16_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx],
col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case IDTX:
load_buffer_16x16(input, in, stride, 0, 0, shift[0]);
@@ -1912,8 +1883,7 @@
col_txfm_16x16_rounding(out, -shift[1]);
transpose_16x16(out, in);
idtx16x16_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case V_DCT:
load_buffer_16x16(input, in, stride, 0, 0, shift[0]);
@@ -1921,8 +1891,7 @@
col_txfm_16x16_rounding(out, -shift[1]);
transpose_16x16(out, in);
idtx16x16_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case H_DCT:
load_buffer_16x16(input, in, stride, 0, 0, shift[0]);
@@ -1930,8 +1899,7 @@
col_txfm_16x16_rounding(out, -shift[1]);
transpose_16x16(out, in);
fdct16x16_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case V_ADST:
load_buffer_16x16(input, in, stride, 0, 0, shift[0]);
@@ -1940,8 +1908,7 @@
col_txfm_16x16_rounding(out, -shift[1]);
transpose_16x16(out, in);
idtx16x16_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case H_ADST:
load_buffer_16x16(input, in, stride, 0, 0, shift[0]);
@@ -1950,8 +1917,7 @@
transpose_16x16(out, in);
fadst16x16_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx],
col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case V_FLIPADST:
load_buffer_16x16(input, in, stride, 1, 0, shift[0]);
@@ -1960,8 +1926,7 @@
col_txfm_16x16_rounding(out, -shift[1]);
transpose_16x16(out, in);
idtx16x16_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx], col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
case H_FLIPADST:
load_buffer_16x16(input, in, stride, 0, 1, shift[0]);
@@ -1970,8 +1935,7 @@
transpose_16x16(out, in);
fadst16x16_sse4_1(in, out, av1_fwd_cos_bit_row[txw_idx][txh_idx],
col_num);
- transpose_16x16(out, in);
- write_buffer_16x16(in, coeff);
+ write_buffer_16x16(out, coeff);
break;
default: assert(0);
}
@@ -2218,11 +2182,10 @@
}
for (int i = 0; i < 2; i++) {
- transpose_8x8(out + i * 16, in);
- av1_round_shift_rect_array_32_sse4_1(in, in, 16, -shift[2], NewSqrt2);
- write_buffer_16x8(in, coeff + i * 8, 16);
+ av1_round_shift_rect_array_32_sse4_1(out + i * 16, in, 16, -shift[2],
+ NewSqrt2);
+ write_buffer_8x8(in, coeff + i * 64);
}
-
(void)bd;
}
@@ -2246,11 +2209,9 @@
for (int i = 0; i < 2; i++) {
row_txfm(out + i * 16, out, bit, 2);
- transpose_8x8(out, in);
- av1_round_shift_rect_array_32_sse4_1(in, in, 16, -shift[2], NewSqrt2);
- write_buffer_8x8(in, coeff + i * 64);
+ av1_round_shift_rect_array_32_sse4_1(out, out, 16, -shift[2], NewSqrt2);
+ write_buffer_16x8(out, coeff + i * 8, 16);
}
-
(void)bd;
}
@@ -2278,8 +2239,10 @@
transpose_8nx8n(outcoeff128, in, txfm_size_col, txfm_size_row);
// row transform
- for (int i = 0; i < txfm_size_col; i++) {
- row_txfm(in + i, outcoeff128 + i * txfm_size_col, bitrow, txfm_size_col);
+ for (int i = 0; i < 4; i++) {
+ __m128i tmp[4];
+ row_txfm(in + i, tmp, bitrow, txfm_size_row >> 2);
+ store_output_w4(coeff + i * 4, tmp, txfm_size_row, txfm_size_col);
}
(void)bd;
}
@@ -2304,15 +2267,15 @@
// col transform
load_buffer_16x4(input, in, stride, ud_flip, lr_flip, shift[0]);
- for (int i = 0; i < txfm_size_row; i++) {
- col_txfm(in + i * txfm_size_row, outcoeff128 + i * txfm_size_row, bitcol,
- 1);
+ for (int i = 0; i < (txfm_size_col >> 2); i++) {
+ __m128i *cur_in = &in[i * txfm_size_row];
+ col_txfm(cur_in, cur_in, bitcol, 1);
+ transpose_32bit_4x4(cur_in, cur_in);
}
- col_txfm_8x8_rounding(outcoeff128, -shift[1]);
+ col_txfm_8x8_rounding(in, -shift[1]);
// row transform
- row_txfm(outcoeff128, in, bitrow, 1);
- transpose_8nx8n(in, outcoeff128, txfm_size_row, txfm_size_col);
+ row_txfm(in, outcoeff128, bitrow, 1);
(void)bd;
}
@@ -2341,8 +2304,7 @@
// row transform
row_txfm(outcoef128, in, bitrow, 8);
- transpose_8nx8n(in, outcoef128, 32, 16);
- av1_round_shift_rect_array_32_sse4_1(outcoef128, outcoef128, 128, -shift[2],
+ av1_round_shift_rect_array_32_sse4_1(in, outcoef128, 128, -shift[2],
NewSqrt2);
(void)bd;
}
@@ -2376,9 +2338,10 @@
for (int i = 0; i < num_row; i++) {
av1_fdct32_sse4_1((outcoef128 + i), (in + i), bitrow, num_row);
}
- transpose_8nx8n(in, outcoef128, txfm_size_row, txfm_size_col);
- av1_round_shift_rect_array_32_sse4_1(outcoef128, outcoef128, 512, -shift[2],
- NewSqrt2);
+ for (int i = 0; i < txfm_size_col; i++) {
+ av1_round_shift_rect_array_32_sse4_1(in + i * 16, outcoef128 + i * 8, 8,
+ -shift[2], NewSqrt2);
+ }
(void)bd;
}
@@ -2421,9 +2384,8 @@
for (int i = 0; i < num_row; i++) {
av1_fdct64_sse4_1((outcoef128 + i), (in + i), bitrow, num_row, num_row);
}
- transpose_8nx8n(in, outcoef128, txfm_size_row, txfm_size_col >> 1);
- av1_round_shift_rect_array_32_sse4_1(outcoef128, outcoef128, 512 >> 1,
- -shift[2], NewSqrt2);
+ av1_round_shift_rect_array_32_sse4_1(in, outcoef128, 512, -shift[2],
+ NewSqrt2);
(void)bd;
}
@@ -2450,8 +2412,7 @@
for (int i = 0; i < 4; i++) {
row_txfm((outcoef128 + i), (in + i), bitrow, 4);
}
- transpose_8nx8n(in, outcoef128, 16, 32);
- av1_round_shift_rect_array_32_sse4_1(outcoef128, outcoef128, 128, -shift[2],
+ av1_round_shift_rect_array_32_sse4_1(in, outcoef128, 128, -shift[2],
NewSqrt2);
(void)bd;
}
@@ -2486,9 +2447,8 @@
// row transform
for (int i = 0; i < txfm_size_col; i += 2) {
- row_txfm((outcoef128 + i), (in + i), bitrow, txfm_size_col);
+ row_txfm((outcoef128 + i), (outcoef128 + i), bitrow, txfm_size_col);
}
- transpose_8nx8n(in, outcoef128, txfm_size_row, txfm_size_col);
(void)bd;
}
@@ -2519,9 +2479,8 @@
// row transform
for (int i = 0; i < num_col; i++) {
- row_txfm((outcoef128 + i), (in + i), bitrow, num_col);
+ row_txfm((outcoef128 + i), (outcoef128 + i), bitrow, num_col);
}
- transpose_8nx8n(in, outcoef128, txfm_size_row, txfm_size_col);
(void)bd;
}
#endif
@@ -2529,7 +2488,6 @@
void av1_fwd_txfm2d_4x8_sse4_1(const int16_t *input, int32_t *coeff, int stride,
TX_TYPE tx_type, int bd) {
__m128i in[8];
- __m128i *outcoeff128 = (__m128i *)coeff;
const int8_t *shift = av1_fwd_txfm_shift_ls[TX_4X8];
const int txw_idx = get_txw_idx(TX_4X8);
const int txh_idx = get_txh_idx(TX_4X8);
@@ -2546,13 +2504,15 @@
load_buffer_4x8(input, in, stride, ud_flip, lr_flip, shift[0]);
col_txfm(in, in, bitcol, 1);
col_txfm_4x8_rounding(in, -shift[1]);
- transpose_8nx8n(in, outcoeff128, txfm_size_col, txfm_size_row);
for (int i = 0; i < 2; i++) {
- row_txfm(outcoeff128 + i, in + i * txfm_size_col, bitrow, 2);
+ __m128i *cur_in = &in[i * 4];
+ transpose_32bit_4x4(cur_in, cur_in);
+ row_txfm(cur_in, cur_in, bitrow, 1);
+ av1_round_shift_rect_array_32_sse4_1(cur_in, cur_in, txfm_size_col,
+ -shift[2], NewSqrt2);
+ store_output_w4(coeff + i * 4, cur_in, txfm_size_row, 4);
}
- av1_round_shift_rect_array_32_sse4_1(in, outcoeff128, txfm_size_row,
- -shift[2], NewSqrt2);
(void)bd;
}
@@ -2574,15 +2534,16 @@
// col transform

load_buffer_8x4(input, in, stride, ud_flip, lr_flip, shift[0]);
for (int i = 0; i < 2; i++) {
- col_txfm(in + i * txfm_size_row, in + i * txfm_size_row, bitcol, 1);
+ __m128i *cur_in = &in[i * txfm_size_row];
+ col_txfm(cur_in, cur_in, bitcol, 1);
+ transpose_32bit_4x4(cur_in, cur_in);
}
col_txfm_4x8_rounding(in, -shift[1]);
// row transform
row_txfm(in, outcoeff128, bitrow, 1);
- av1_round_shift_rect_array_32_sse4_1(outcoeff128, in, txfm_size_col,
+ av1_round_shift_rect_array_32_sse4_1(outcoeff128, outcoeff128, txfm_size_col,
-shift[2], NewSqrt2);
- transpose_8nx8n(in, outcoeff128, txfm_size_row, txfm_size_col);
(void)bd;
}
@@ -2623,9 +2584,7 @@
col_txfm_16x16_rounding(outcoeff128 + 192, -shift[1]);
transpose_8nx8n(outcoeff128, in, txfm_size_col, 32);
- fdct16x16_sse4_1(in, in, bitrow, 8);
- transpose_8nx8n(in, outcoeff128, 32, txfm_size_col);
- memset(coeff + txfm_size_col * 32, 0, txfm_size_col * 32 * sizeof(*coeff));
+ fdct16x16_sse4_1(in, outcoeff128, bitrow, 8);
(void)bd;
}
@@ -2662,9 +2621,9 @@
transpose_8nx8n(outcoeff128, in, txfm_size_col, txfm_size_row);
for (int i = 0; i < 4; i++) {
- av1_fdct64_sse4_1(in + i, in + i, bitrow, 4, 4);
+ av1_fdct64_sse4_1(in + i, outcoeff128 + i, bitrow, 4, 4);
}
- transpose_8nx8n(in, outcoeff128, txfm_size_row, 32);
+ memset(coeff + txfm_size_row * 32, 0, txfm_size_row * 32 * sizeof(*coeff));
(void)bd;
}
#endif
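
Several of the rectangular transforms above end with av1_round_shift_rect_array_32_sse4_1, which compensates for the 2:1 block shape by scaling with sqrt(2) in fixed point. A scalar sketch, assuming the usual libaom constants (NewSqrt2 = 5793, i.e. sqrt(2) in Q12):

#include <stdint.h>

#define NewSqrt2Bits 12  // assumed libaom value
#define NewSqrt2 5793    // round(sqrt(2) * 2^12)

static int32_t round_shift32(int64_t x, int s) {
  if (s <= 0) return (int32_t)(x << -s);
  return (int32_t)((x + (1LL << (s - 1))) >> s);
}

// Rounding shift by `bit`, then scale by sqrt(2) in Q12 so a rectangular
// transform keeps the same overall gain as a square one.
static void round_shift_rect_model(const int32_t *in, int32_t *out, int size,
                                   int bit) {
  for (int i = 0; i < size; ++i) {
    const int32_t r = round_shift32(in[i], bit);
    out[i] = round_shift32((int64_t)r * NewSqrt2, NewSqrt2Bits);
  }
}
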
diff --git a/av1/encoder/x86/highbd_temporal_filter_avx2.c b/av1/encoder/x86/highbd_temporal_filter_avx2.c
index 68509fa..ca448ca 100644
--- a/av1/encoder/x86/highbd_temporal_filter_avx2.c
+++ b/av1/encoder/x86/highbd_temporal_filter_avx2.c
@@ -13,6 +13,7 @@
#include <immintrin.h>
#include "config/av1_rtcd.h"
+#include "aom_dsp/mathutils.h"
#include "av1/encoder/encoder.h"
#include "av1/encoder/temporal_filter.h"
@@ -147,7 +148,8 @@
const int *subblock_mses, unsigned int *accumulator, uint16_t *count,
uint32_t *frame_sse, uint32_t *luma_sse_sum, int bd,
const double inv_num_ref_pixels, const double decay_factor,
- const double inv_factor, const double weight_factor, double *d_factor) {
+ const double inv_factor, const double weight_factor, double *d_factor,
+ int tf_wgt_calc_lvl) {
assert(((block_width == 16) || (block_width == 32)) &&
((block_height == 16) || (block_height == 32)));
@@ -304,28 +306,61 @@
acc_5x5_sse[row][col + 3] = xx_mask_and_hadd(vsum, 3);
}
- for (int i = 0, k = 0; i < block_height; i++) {
- for (int j = 0; j < block_width; j++, k++) {
- const int pixel_value = frame2[i * stride2 + j];
- uint32_t diff_sse = acc_5x5_sse[i][j] + luma_sse_sum[i * BW + j];
+ double subblock_mses_scaled[4];
+ double d_factor_decayed[4];
+ for (int idx = 0; idx < 4; idx++) {
+ subblock_mses_scaled[idx] = subblock_mses[idx] * inv_factor;
+ d_factor_decayed[idx] = d_factor[idx] * decay_factor;
+ }
+ if (tf_wgt_calc_lvl == 0) {
+ for (int i = 0, k = 0; i < block_height; i++) {
+ const int y_blk_raster_offset = (i >= block_height / 2) * 2;
+ for (int j = 0; j < block_width; j++, k++) {
+ const int pixel_value = frame2[i * stride2 + j];
+ uint32_t diff_sse = acc_5x5_sse[i][j] + luma_sse_sum[i * BW + j];
- // Scale down the difference for high bit depth input.
- diff_sse >>= ((bd - 8) * 2);
+ // Scale down the difference for high bit depth input.
+ diff_sse >>= ((bd - 8) * 2);
- const double window_error = diff_sse * inv_num_ref_pixels;
- const int subblock_idx =
- (i >= block_height / 2) * 2 + (j >= block_width / 2);
- const double block_error = (double)subblock_mses[subblock_idx];
- const double combined_error =
- weight_factor * window_error + block_error * inv_factor;
+ const double window_error = diff_sse * inv_num_ref_pixels;
+ const int subblock_idx = y_blk_raster_offset + (j >= block_width / 2);
- double scaled_error =
- combined_error * d_factor[subblock_idx] * decay_factor;
- scaled_error = AOMMIN(scaled_error, 7);
- const int weight = (int)(exp(-scaled_error) * TF_WEIGHT_SCALE);
+ const double combined_error =
+ weight_factor * window_error + subblock_mses_scaled[subblock_idx];
- count[k] += weight;
- accumulator[k] += weight * pixel_value;
+ double scaled_error = combined_error * d_factor_decayed[subblock_idx];
+ scaled_error = AOMMIN(scaled_error, 7);
+ const int weight = (int)(exp(-scaled_error) * TF_WEIGHT_SCALE);
+
+ count[k] += weight;
+ accumulator[k] += weight * pixel_value;
+ }
+ }
+ } else {
+ for (int i = 0, k = 0; i < block_height; i++) {
+ const int y_blk_raster_offset = (i >= block_height / 2) * 2;
+ for (int j = 0; j < block_width; j++, k++) {
+ const int pixel_value = frame2[i * stride2 + j];
+ uint32_t diff_sse = acc_5x5_sse[i][j] + luma_sse_sum[i * BW + j];
+
+ // Scale down the difference for high bit depth input.
+ diff_sse >>= ((bd - 8) * 2);
+
+ const double window_error = diff_sse * inv_num_ref_pixels;
+ const int subblock_idx = y_blk_raster_offset + (j >= block_width / 2);
+
+ const double combined_error =
+ weight_factor * window_error + subblock_mses_scaled[subblock_idx];
+
+ double scaled_error = combined_error * d_factor_decayed[subblock_idx];
+ scaled_error = AOMMIN(scaled_error, 7);
+ const float fweight =
+ approx_exp((float)-scaled_error) * TF_WEIGHT_SCALE;
+ const int weight = iroundpf(fweight);
+
+ count[k] += weight;
+ accumulator[k] += weight * pixel_value;
+ }
}
}
}
@@ -335,7 +370,8 @@
const BLOCK_SIZE block_size, const int mb_row, const int mb_col,
const int num_planes, const double *noise_levels, const MV *subblock_mvs,
const int *subblock_mses, const int q_factor, const int filter_strength,
- const uint8_t *pred, uint32_t *accum, uint16_t *count) {
+ int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum,
+ uint16_t *count) {
const int is_high_bitdepth = frame_to_filter->flags & YV12_FLAG_HIGHBITDEPTH;
assert(block_size == BLOCK_32X32 && "Only support 32x32 block with sse2!");
assert(TF_WINDOW_LENGTH == 5 && "Only support window length 5 with sse2!");
@@ -424,7 +460,7 @@
ref, frame_stride, pred1 + plane_offset, plane_w, plane_w, plane_h,
subblock_mses, accum + plane_offset, count + plane_offset, frame_sse,
luma_sse_sum, mbd->bd, inv_num_ref_pixels, decay_factor, inv_factor,
- weight_factor, d_factor);
+ weight_factor, d_factor, tf_wgt_calc_lvl);
plane_offset += plane_h * plane_w;
}
}
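
The hunk above hoists the loop-invariant per-subblock products (subblock_mses[idx] * inv_factor and d_factor[idx] * decay_factor) out of the per-pixel loops, and splits the loop on tf_wgt_calc_lvl: level 0 keeps the exact exp() weight, other levels use the approx_exp()/iroundpf() fast path. A scalar sketch of the hoisted weight computation, with simplified names (not the libaom API):

#include <math.h>

#define TF_WEIGHT_SCALE_SKETCH 1000 /* stand-in for TF_WEIGHT_SCALE */

/* Per-block invariants are computed once; per pixel, only two multiplies
 * and one exp() remain. */
static int tf_pixel_weight_sketch(double window_error, int subblock_idx,
                                  const double *subblock_mses_scaled,
                                  const double *d_factor_decayed,
                                  double weight_factor) {
  const double combined_error =
      weight_factor * window_error + subblock_mses_scaled[subblock_idx];
  double scaled_error = combined_error * d_factor_decayed[subblock_idx];
  if (scaled_error > 7) scaled_error = 7;
  return (int)(exp(-scaled_error) * TF_WEIGHT_SCALE_SKETCH);
}
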
diff --git a/av1/encoder/x86/highbd_temporal_filter_sse2.c b/av1/encoder/x86/highbd_temporal_filter_sse2.c
index 1bfdaf7..2032847 100644
--- a/av1/encoder/x86/highbd_temporal_filter_sse2.c
+++ b/av1/encoder/x86/highbd_temporal_filter_sse2.c
@@ -13,6 +13,7 @@
#include <emmintrin.h>
#include "config/av1_rtcd.h"
+#include "aom_dsp/mathutils.h"
#include "av1/encoder/encoder.h"
#include "av1/encoder/temporal_filter.h"
@@ -95,7 +96,8 @@
const int *subblock_mses, unsigned int *accumulator, uint16_t *count,
uint32_t *frame_sse, uint32_t *luma_sse_sum, int bd,
const double inv_num_ref_pixels, const double decay_factor,
- const double inv_factor, const double weight_factor, double *d_factor) {
+ const double inv_factor, const double weight_factor, double *d_factor,
+ int tf_wgt_calc_lvl) {
assert(((block_width == 16) || (block_width == 32)) &&
((block_height == 16) || (block_height == 32)));
@@ -179,28 +181,61 @@
}
}
- for (int i = 0, k = 0; i < block_height; i++) {
- for (int j = 0; j < block_width; j++, k++) {
- const int pixel_value = frame2[i * stride2 + j];
- uint32_t diff_sse = acc_5x5_sse[i][j] + luma_sse_sum[i * BW + j];
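+ // Hoist the per-subblock terms that are invariant across the pixel loops.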
+ double subblock_mses_scaled[4];
+ double d_factor_decayed[4];
+ for (int idx = 0; idx < 4; idx++) {
+ subblock_mses_scaled[idx] = subblock_mses[idx] * inv_factor;
+ d_factor_decayed[idx] = d_factor[idx] * decay_factor;
+ }
+ if (tf_wgt_calc_lvl == 0) {
+ for (int i = 0, k = 0; i < block_height; i++) {
+ const int y_blk_raster_offset = (i >= block_height / 2) * 2;
+ for (int j = 0; j < block_width; j++, k++) {
+ const int pixel_value = frame2[i * stride2 + j];
+ uint32_t diff_sse = acc_5x5_sse[i][j] + luma_sse_sum[i * BW + j];
- // Scale down the difference for high bit depth input.
- diff_sse >>= ((bd - 8) * 2);
+ // Scale down the difference for high bit depth input.
+ diff_sse >>= ((bd - 8) * 2);
- const double window_error = diff_sse * inv_num_ref_pixels;
- const int subblock_idx =
- (i >= block_height / 2) * 2 + (j >= block_width / 2);
- const double block_error = (double)subblock_mses[subblock_idx];
- const double combined_error =
- weight_factor * window_error + block_error * inv_factor;
+ const double window_error = diff_sse * inv_num_ref_pixels;
+ const int subblock_idx = y_blk_raster_offset + (j >= block_width / 2);
- double scaled_error =
- combined_error * d_factor[subblock_idx] * decay_factor;
- scaled_error = AOMMIN(scaled_error, 7);
- const int weight = (int)(exp(-scaled_error) * TF_WEIGHT_SCALE);
+ const double combined_error =
+ weight_factor * window_error + subblock_mses_scaled[subblock_idx];
- count[k] += weight;
- accumulator[k] += weight * pixel_value;
+ double scaled_error = combined_error * d_factor_decayed[subblock_idx];
+ scaled_error = AOMMIN(scaled_error, 7);
+ const int weight = (int)(exp(-scaled_error) * TF_WEIGHT_SCALE);
+
+ count[k] += weight;
+ accumulator[k] += weight * pixel_value;
+ }
+ }
+ } else {
+ for (int i = 0, k = 0; i < block_height; i++) {
+ const int y_blk_raster_offset = (i >= block_height / 2) * 2;
+ for (int j = 0; j < block_width; j++, k++) {
+ const int pixel_value = frame2[i * stride2 + j];
+ uint32_t diff_sse = acc_5x5_sse[i][j] + luma_sse_sum[i * BW + j];
+
+ // Scale down the difference for high bit depth input.
+ diff_sse >>= ((bd - 8) * 2);
+
+ const double window_error = diff_sse * inv_num_ref_pixels;
+ const int subblock_idx = y_blk_raster_offset + (j >= block_width / 2);
+
+ const double combined_error =
+ weight_factor * window_error + subblock_mses_scaled[subblock_idx];
+
+ double scaled_error = combined_error * d_factor_decayed[subblock_idx];
+ scaled_error = AOMMIN(scaled_error, 7);
+ const float fweight =
+ approx_exp((float)-scaled_error) * TF_WEIGHT_SCALE;
+ const int weight = iroundpf(fweight);
+
+ count[k] += weight;
+ accumulator[k] += weight * pixel_value;
+ }
}
}
}
@@ -210,7 +245,8 @@
const BLOCK_SIZE block_size, const int mb_row, const int mb_col,
const int num_planes, const double *noise_levels, const MV *subblock_mvs,
const int *subblock_mses, const int q_factor, const int filter_strength,
- const uint8_t *pred, uint32_t *accum, uint16_t *count) {
+ int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum,
+ uint16_t *count) {
const int is_high_bitdepth = frame_to_filter->flags & YV12_FLAG_HIGHBITDEPTH;
assert(block_size == BLOCK_32X32 && "Only support 32x32 block with sse2!");
assert(TF_WINDOW_LENGTH == 5 && "Only support window length 5 with sse2!");
@@ -299,7 +335,7 @@
ref, frame_stride, pred1 + plane_offset, plane_w, plane_w, plane_h,
subblock_mses, accum + plane_offset, count + plane_offset, frame_sse,
luma_sse_sum, mbd->bd, inv_num_ref_pixels, decay_factor, inv_factor,
- weight_factor, d_factor);
+ weight_factor, d_factor, tf_wgt_calc_lvl);
plane_offset += plane_h * plane_w;
}
}
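
Both temporal-filter files now include aom_dsp/mathutils.h for approx_exp() and iroundpf(), used when tf_wgt_calc_lvl != 0. Purely as an illustration, a Schraudolph-style bit trick is one common way such a fast exp() can be built; libaom's actual approx_exp() implementation may differ:

#include <stdint.h>
#include <string.h>

/* Sketch: exp(x) = 2^(x / ln 2); write the scaled argument into the
 * exponent field of an IEEE-754 float. Accurate to a few percent, which
 * is adequate for filter-weight computation. */
static float fast_exp_sketch(float x) {
  const int32_t i = (int32_t)(12102203.0f * x + 1064866805.0f);
  float f;
  memcpy(&f, &i, sizeof(f)); /* type-pun without breaking aliasing rules */
  return f;
}
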
diff --git a/av1/encoder/x86/pickrst_avx2.c b/av1/encoder/x86/pickrst_avx2.c
index 3452f73..6658ed3 100644
--- a/av1/encoder/x86/pickrst_avx2.c
+++ b/av1/encoder/x86/pickrst_avx2.c
@@ -19,179 +19,6 @@
#include "av1/common/restoration.h"
#include "av1/encoder/pickrst.h"
-static INLINE void acc_stat_avx2(int32_t *dst, const uint8_t *src,
- const __m128i *shuffle, const __m256i *kl) {
- const __m128i s = _mm_shuffle_epi8(xx_loadu_128(src), *shuffle);
- const __m256i d0 = _mm256_madd_epi16(*kl, _mm256_cvtepu8_epi16(s));
- const __m256i dst0 = yy_load_256(dst);
- const __m256i r0 = _mm256_add_epi32(dst0, d0);
- yy_store_256(dst, r0);
-}
-
-static INLINE void acc_stat_win7_one_line_avx2(
- const uint8_t *dgd, const uint8_t *src, int h_start, int h_end,
- int dgd_stride, const __m128i *shuffle, int32_t *sumX,
- int32_t sumY[WIENER_WIN][WIENER_WIN], int32_t M_int[WIENER_WIN][WIENER_WIN],
- int32_t H_int[WIENER_WIN2][WIENER_WIN * 8]) {
- int j, k, l;
- const int wiener_win = WIENER_WIN;
- // Main loop handles two pixels at a time
- // We can assume that h_start is even, since it will always be aligned to
- // a tile edge + some number of restoration units, and both of those will
- // be 64-pixel aligned.
- // However, at the edge of the image, h_end may be odd, so we need to handle
- // that case correctly.
- assert(h_start % 2 == 0);
- const int h_end_even = h_end & ~1;
- const int has_odd_pixel = h_end & 1;
- for (j = h_start; j < h_end_even; j += 2) {
- const uint8_t X1 = src[j];
- const uint8_t X2 = src[j + 1];
- *sumX += X1 + X2;
- const uint8_t *dgd_ij = dgd + j;
- for (k = 0; k < wiener_win; k++) {
- const uint8_t *dgd_ijk = dgd_ij + k * dgd_stride;
- for (l = 0; l < wiener_win; l++) {
- int32_t *H_ = &H_int[(l * wiener_win + k)][0];
- const uint8_t D1 = dgd_ijk[l];
- const uint8_t D2 = dgd_ijk[l + 1];
- sumY[k][l] += D1 + D2;
- M_int[k][l] += D1 * X1 + D2 * X2;
-
- const __m256i kl =
- _mm256_cvtepu8_epi16(_mm_set1_epi16(loadu_int16(dgd_ijk + l)));
- acc_stat_avx2(H_ + 0 * 8, dgd_ij + 0 * dgd_stride, shuffle, &kl);
- acc_stat_avx2(H_ + 1 * 8, dgd_ij + 1 * dgd_stride, shuffle, &kl);
- acc_stat_avx2(H_ + 2 * 8, dgd_ij + 2 * dgd_stride, shuffle, &kl);
- acc_stat_avx2(H_ + 3 * 8, dgd_ij + 3 * dgd_stride, shuffle, &kl);
- acc_stat_avx2(H_ + 4 * 8, dgd_ij + 4 * dgd_stride, shuffle, &kl);
- acc_stat_avx2(H_ + 5 * 8, dgd_ij + 5 * dgd_stride, shuffle, &kl);
- acc_stat_avx2(H_ + 6 * 8, dgd_ij + 6 * dgd_stride, shuffle, &kl);
- }
- }
- }
- // If the width is odd, add in the final pixel
- if (has_odd_pixel) {
- const uint8_t X1 = src[j];
- *sumX += X1;
- const uint8_t *dgd_ij = dgd + j;
- for (k = 0; k < wiener_win; k++) {
- const uint8_t *dgd_ijk = dgd_ij + k * dgd_stride;
- for (l = 0; l < wiener_win; l++) {
- int32_t *H_ = &H_int[(l * wiener_win + k)][0];
- const uint8_t D1 = dgd_ijk[l];
- sumY[k][l] += D1;
- M_int[k][l] += D1 * X1;
-
- // The `acc_stat_avx2` function wants its input to have interleaved
- // copies of two pixels, but we only have one. However, the pixels
- // are (effectively) used as inputs to a multiply-accumulate.
- // So if we set the extra pixel slot to 0, then it is effectively
- // ignored.
- const __m256i kl = _mm256_cvtepu8_epi16(_mm_set1_epi16((int16_t)D1));
- acc_stat_avx2(H_ + 0 * 8, dgd_ij + 0 * dgd_stride, shuffle, &kl);
- acc_stat_avx2(H_ + 1 * 8, dgd_ij + 1 * dgd_stride, shuffle, &kl);
- acc_stat_avx2(H_ + 2 * 8, dgd_ij + 2 * dgd_stride, shuffle, &kl);
- acc_stat_avx2(H_ + 3 * 8, dgd_ij + 3 * dgd_stride, shuffle, &kl);
- acc_stat_avx2(H_ + 4 * 8, dgd_ij + 4 * dgd_stride, shuffle, &kl);
- acc_stat_avx2(H_ + 5 * 8, dgd_ij + 5 * dgd_stride, shuffle, &kl);
- acc_stat_avx2(H_ + 6 * 8, dgd_ij + 6 * dgd_stride, shuffle, &kl);
- }
- }
- }
-}
-
-static INLINE void compute_stats_win7_opt_avx2(
- const uint8_t *dgd, const uint8_t *src, int h_start, int h_end, int v_start,
- int v_end, int dgd_stride, int src_stride, int64_t *M, int64_t *H,
- int use_downsampled_wiener_stats) {
- int i, j, k, l, m, n;
- const int wiener_win = WIENER_WIN;
- const int pixel_count = (h_end - h_start) * (v_end - v_start);
- const int wiener_win2 = wiener_win * wiener_win;
- const int wiener_halfwin = (wiener_win >> 1);
- uint8_t avg = find_average(dgd, h_start, h_end, v_start, v_end, dgd_stride);
-
- int32_t M_int32[WIENER_WIN][WIENER_WIN] = { { 0 } };
- int64_t M_int64[WIENER_WIN][WIENER_WIN] = { { 0 } };
- int32_t M_int32_row[WIENER_WIN][WIENER_WIN] = { { 0 } };
-
- DECLARE_ALIGNED(32, int32_t,
- H_int32[WIENER_WIN2][WIENER_WIN * 8]) = { { 0 } };
- DECLARE_ALIGNED(32, int32_t,
- H_int32_row[WIENER_WIN2][WIENER_WIN * 8]) = { { 0 } };
- int64_t H_int64[WIENER_WIN2][WIENER_WIN * 8] = { { 0 } };
- int32_t sumY[WIENER_WIN][WIENER_WIN] = { { 0 } };
- int32_t sumX = 0;
- const uint8_t *dgd_win = dgd - wiener_halfwin * dgd_stride - wiener_halfwin;
- int downsample_factor =
- use_downsampled_wiener_stats ? WIENER_STATS_DOWNSAMPLE_FACTOR : 1;
- int32_t sumX_row = 0;
- int32_t sumY_row[WIENER_WIN][WIENER_WIN] = { { 0 } };
-
- const __m128i shuffle = xx_loadu_128(g_shuffle_stats_data);
- for (j = v_start; j < v_end; j += 64) {
- const int vert_end = AOMMIN(64, v_end - j) + j;
- for (i = j; i < vert_end; i = i + downsample_factor) {
- if (use_downsampled_wiener_stats &&
- (vert_end - i < WIENER_STATS_DOWNSAMPLE_FACTOR)) {
- downsample_factor = vert_end - i;
- }
- sumX_row = 0;
- memset(sumY_row, 0, sizeof(int32_t) * WIENER_WIN * WIENER_WIN);
- memset(M_int32_row, 0, sizeof(int32_t) * WIENER_WIN * WIENER_WIN);
- memset(H_int32_row, 0, sizeof(int32_t) * WIENER_WIN2 * (WIENER_WIN * 8));
- acc_stat_win7_one_line_avx2(
- dgd_win + i * dgd_stride, src + i * src_stride, h_start, h_end,
- dgd_stride, &shuffle, &sumX_row, sumY_row, M_int32_row, H_int32_row);
- sumX += sumX_row * downsample_factor;
-
- // Scale M matrix based on the downsampling factor
- for (k = 0; k < wiener_win; ++k) {
- for (l = 0; l < wiener_win; ++l) {
- sumY[k][l] += (sumY_row[k][l] * downsample_factor);
- M_int32[k][l] += (M_int32_row[k][l] * downsample_factor);
- }
- }
- // Scale H matrix based on the downsampling factor
- for (k = 0; k < WIENER_WIN2; ++k) {
- for (l = 0; l < WIENER_WIN * 8; ++l) {
- H_int32[k][l] += (H_int32_row[k][l] * downsample_factor);
- }
- }
- }
- for (k = 0; k < wiener_win; ++k) {
- for (l = 0; l < wiener_win; ++l) {
- M_int64[k][l] += M_int32[k][l];
- M_int32[k][l] = 0;
- }
- }
- for (k = 0; k < WIENER_WIN2; ++k) {
- for (l = 0; l < WIENER_WIN * 8; ++l) {
- H_int64[k][l] += H_int32[k][l];
- H_int32[k][l] = 0;
- }
- }
- }
-
- const int64_t avg_square_sum = (int64_t)avg * (int64_t)avg * pixel_count;
- for (k = 0; k < wiener_win; k++) {
- for (l = 0; l < wiener_win; l++) {
- const int32_t idx0 = l * wiener_win + k;
- M[idx0] =
- M_int64[k][l] + (avg_square_sum - (int64_t)avg * (sumX + sumY[k][l]));
- int64_t *H_ = H + idx0 * wiener_win2;
- int64_t *H_int_ = &H_int64[idx0][0];
- for (m = 0; m < wiener_win; m++) {
- for (n = 0; n < wiener_win; n++) {
- H_[m * wiener_win + n] = H_int_[n * 8 + m] + avg_square_sum -
- (int64_t)avg * (sumY[k][l] + sumY[n][m]);
- }
- }
- }
- }
-}
-
#if CONFIG_AV1_HIGHBITDEPTH
static INLINE void acc_stat_highbd_avx2(int64_t *dst, const uint16_t *dgd,
const __m256i *shuffle,
@@ -537,188 +364,1173 @@
}
#endif // CONFIG_AV1_HIGHBITDEPTH
-static INLINE void acc_stat_win5_one_line_avx2(
- const uint8_t *dgd, const uint8_t *src, int h_start, int h_end,
- int dgd_stride, const __m128i *shuffle, int32_t *sumX,
- int32_t sumY[WIENER_WIN_CHROMA][WIENER_WIN_CHROMA],
- int32_t M_int[WIENER_WIN_CHROMA][WIENER_WIN_CHROMA],
- int32_t H_int[WIENER_WIN2_CHROMA][WIENER_WIN_CHROMA * 8]) {
- int j, k, l;
- const int wiener_win = WIENER_WIN_CHROMA;
- // Main loop handles two pixels at a time
- // We can assume that h_start is even, since it will always be aligned to
- // a tile edge + some number of restoration units, and both of those will
- // be 64-pixel aligned.
- // However, at the edge of the image, h_end may be odd, so we need to handle
- // that case correctly.
- assert(h_start % 2 == 0);
- const int h_end_even = h_end & ~1;
- const int has_odd_pixel = h_end & 1;
- for (j = h_start; j < h_end_even; j += 2) {
- const uint8_t X1 = src[j];
- const uint8_t X2 = src[j + 1];
- *sumX += X1 + X2;
- const uint8_t *dgd_ij = dgd + j;
- for (k = 0; k < wiener_win; k++) {
- const uint8_t *dgd_ijk = dgd_ij + k * dgd_stride;
- for (l = 0; l < wiener_win; l++) {
- int32_t *H_ = &H_int[(l * wiener_win + k)][0];
- const uint8_t D1 = dgd_ijk[l];
- const uint8_t D2 = dgd_ijk[l + 1];
- sumY[k][l] += D1 + D2;
- M_int[k][l] += D1 * X1 + D2 * X2;
+static INLINE void madd_and_accum_avx2(__m256i src, __m256i dgd, __m256i *sum) {
+ *sum = _mm256_add_epi32(*sum, _mm256_madd_epi16(src, dgd));
+}
- const __m256i kl =
- _mm256_cvtepu8_epi16(_mm_set1_epi16(loadu_int16(dgd_ijk + l)));
- acc_stat_avx2(H_ + 0 * 8, dgd_ij + 0 * dgd_stride, shuffle, &kl);
- acc_stat_avx2(H_ + 1 * 8, dgd_ij + 1 * dgd_stride, shuffle, &kl);
- acc_stat_avx2(H_ + 2 * 8, dgd_ij + 2 * dgd_stride, shuffle, &kl);
- acc_stat_avx2(H_ + 3 * 8, dgd_ij + 3 * dgd_stride, shuffle, &kl);
- acc_stat_avx2(H_ + 4 * 8, dgd_ij + 4 * dgd_stride, shuffle, &kl);
- }
- }
+static INLINE __m256i convert_and_add_avx2(__m256i src) {
+ const __m256i s0 = _mm256_cvtepi32_epi64(_mm256_castsi256_si128(src));
+ const __m256i s1 = _mm256_cvtepi32_epi64(_mm256_extracti128_si256(src, 1));
+ return _mm256_add_epi64(s0, s1);
+}
+
+static INLINE __m256i hadd_four_32_to_64_avx2(__m256i src0, __m256i src1,
+ __m256i *src2, __m256i *src3) {
+ // 00 01 10 11 02 03 12 13
+ const __m256i s_0 = _mm256_hadd_epi32(src0, src1);
+ // 20 21 30 31 22 23 32 33
+ const __m256i s_1 = _mm256_hadd_epi32(*src2, *src3);
+ // 00+01 10+11 20+21 30+31 02+03 12+13 22+23 32+33
+ const __m256i s_2 = _mm256_hadd_epi32(s_0, s_1);
+ return convert_and_add_avx2(s_2);
+}
+
+static INLINE __m128i add_64bit_lvl_avx2(__m256i src0, __m256i src1) {
+ // 00 10 02 12
+ const __m256i t0 = _mm256_unpacklo_epi64(src0, src1);
+ // 01 11 03 13
+ const __m256i t1 = _mm256_unpackhi_epi64(src0, src1);
+ // 00+01 10+11 02+03 12+13
+ const __m256i sum = _mm256_add_epi64(t0, t1);
+ // 00+01 10+11
+ const __m128i sum0 = _mm256_castsi256_si128(sum);
+ // 02+03 12+13
+ const __m128i sum1 = _mm256_extracti128_si256(sum, 1);
+ // 00+01+02+03 10+11+12+13
+ return _mm_add_epi64(sum0, sum1);
+}
+
+static INLINE __m128i convert_32_to_64_add_avx2(__m256i src0, __m256i src1) {
+ // 00 01 02 03
+ const __m256i s0 = convert_and_add_avx2(src0);
+ // 10 11 12 13
+ const __m256i s1 = convert_and_add_avx2(src1);
+ return add_64bit_lvl_avx2(s0, s1);
+}
+
+static INLINE int32_t calc_sum_of_register(__m256i src) {
+ const __m128i src_l = _mm256_castsi256_si128(src);
+ const __m128i src_h = _mm256_extracti128_si256(src, 1);
+ const __m128i sum = _mm_add_epi32(src_l, src_h);
+ const __m128i dst0 = _mm_add_epi32(sum, _mm_srli_si128(sum, 8));
+ const __m128i dst1 = _mm_add_epi32(dst0, _mm_srli_si128(dst0, 4));
+ return _mm_cvtsi128_si32(dst1);
+}
+
+static INLINE void transpose_64bit_4x4_avx2(const __m256i *const src,
+ __m256i *const dst) {
+ // Unpack 64 bit elements. Goes from:
+ // src[0]: 00 01 02 03
+ // src[1]: 10 11 12 13
+ // src[2]: 20 21 22 23
+ // src[3]: 30 31 32 33
+ // to:
+ // reg0: 00 10 02 12
+ // reg1: 20 30 22 32
+ // reg2: 01 11 03 13
+ // reg3: 21 31 23 33
+ const __m256i reg0 = _mm256_unpacklo_epi64(src[0], src[1]);
+ const __m256i reg1 = _mm256_unpacklo_epi64(src[2], src[3]);
+ const __m256i reg2 = _mm256_unpackhi_epi64(src[0], src[1]);
+ const __m256i reg3 = _mm256_unpackhi_epi64(src[2], src[3]);
+
+ // Unpack 64 bit elements resulting in:
+ // dst[0]: 00 10 20 30
+ // dst[1]: 01 11 21 31
+ // dst[2]: 02 12 22 32
+ // dst[3]: 03 13 23 33
+ dst[0] = _mm256_inserti128_si256(reg0, _mm256_castsi256_si128(reg1), 1);
+ dst[1] = _mm256_inserti128_si256(reg2, _mm256_castsi256_si128(reg3), 1);
+ dst[2] = _mm256_inserti128_si256(reg1, _mm256_extracti128_si256(reg0, 1), 0);
+ dst[3] = _mm256_inserti128_si256(reg3, _mm256_extracti128_si256(reg2, 1), 0);
+}
+
+// When we load 32 values of int8_t type and need fewer than 32 values for
+// processing, the below mask is used to zero out the extra values.
+static const int8_t mask_8bit[32] = {
+  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  // 16 bytes
+  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,   // 16 bytes
+};
+
+// When we load 16 values of int16_t type and need fewer than 16 values for
+// processing, the below mask is used to zero out the extra values.
+static const int16_t mask_16bit[32] = {
+  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  // 16 elements (32 bytes)
+  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,   // 16 elements (32 bytes)
+};
+
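
The tables above implement masked tail loads: a full vector is always loaded, then ANDed with a mask taken at an offset into the table so that lanes past the row width contribute zero. A minimal sketch of the idea for 16-bit data, mirroring how mask_16bit is indexed above (helper name hypothetical):

#include <immintrin.h>
#include <stdint.h>

static const int16_t kMask16[32] = {
  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
};

/* rem = width % 16, assumed 1..15: the first rem lanes survive the AND. */
static __m256i masked_tail_load16(const int16_t *src, int rem) {
  const __m256i m = _mm256_loadu_si256((const __m256i *)(kMask16 + 16 - rem));
  const __m256i v = _mm256_loadu_si256((const __m256i *)src);
  return _mm256_and_si256(v, m);
}
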
+static INLINE uint8_t calc_dgd_buf_avg_avx2(const uint8_t *src, int32_t h_start,
+ int32_t h_end, int32_t v_start,
+ int32_t v_end, int32_t stride) {
+ const uint8_t *src_temp = src + v_start * stride + h_start;
+ const __m256i zero = _mm256_setzero_si256();
+ const int32_t width = h_end - h_start;
+ const int32_t height = v_end - v_start;
+ const int32_t wd_beyond_mul32 = width & 31;
+ const int32_t wd_mul32 = width - wd_beyond_mul32;
+ __m128i mask_low, mask_high;
+ __m256i ss = zero;
+
+ // When the width is not a multiple of 32, a full 32 values are still loaded
+ // and the extra (beyond required) data is zeroed out using the below mask.
+ if (wd_beyond_mul32 >= 16) {
+ mask_low = _mm_set1_epi8(-1);
+ mask_high = _mm_loadu_si128((__m128i *)(&mask_8bit[32 - wd_beyond_mul32]));
+ } else {
+ mask_low = _mm_loadu_si128((__m128i *)(&mask_8bit[16 - wd_beyond_mul32]));
+ mask_high = _mm_setzero_si128();
}
- // If the width is odd, add in the final pixel
- if (has_odd_pixel) {
- const uint8_t X1 = src[j];
- *sumX += X1;
- const uint8_t *dgd_ij = dgd + j;
- for (k = 0; k < wiener_win; k++) {
- const uint8_t *dgd_ijk = dgd_ij + k * dgd_stride;
- for (l = 0; l < wiener_win; l++) {
- int32_t *H_ = &H_int[(l * wiener_win + k)][0];
- const uint8_t D1 = dgd_ijk[l];
- sumY[k][l] += D1;
- M_int[k][l] += D1 * X1;
+ const __m256i mask =
+ _mm256_inserti128_si256(_mm256_castsi128_si256(mask_low), mask_high, 1);
- // The `acc_stat_avx2` function wants its input to have interleaved
- // copies of two pixels, but we only have one. However, the pixels
- // are (effectively) used as inputs to a multiply-accumulate.
- // So if we set the extra pixel slot to 0, then it is effectively
- // ignored.
- const __m256i kl = _mm256_cvtepu8_epi16(_mm_set1_epi16((int16_t)D1));
- acc_stat_avx2(H_ + 0 * 8, dgd_ij + 0 * dgd_stride, shuffle, &kl);
- acc_stat_avx2(H_ + 1 * 8, dgd_ij + 1 * dgd_stride, shuffle, &kl);
- acc_stat_avx2(H_ + 2 * 8, dgd_ij + 2 * dgd_stride, shuffle, &kl);
- acc_stat_avx2(H_ + 3 * 8, dgd_ij + 3 * dgd_stride, shuffle, &kl);
- acc_stat_avx2(H_ + 4 * 8, dgd_ij + 4 * dgd_stride, shuffle, &kl);
- }
+ int32_t proc_ht = 0;
+ do {
+ // Process width in multiple of 32.
+ int32_t proc_wd = 0;
+ while (proc_wd < wd_mul32) {
+ const __m256i s_0 = _mm256_loadu_si256((__m256i *)(src_temp + proc_wd));
+ const __m256i sad_0 = _mm256_sad_epu8(s_0, zero);
+ ss = _mm256_add_epi32(ss, sad_0);
+ proc_wd += 32;
+ }
+
+ // Process the remaining width.
+ if (wd_beyond_mul32) {
+ const __m256i s_0 = _mm256_loadu_si256((__m256i *)(src_temp + proc_wd));
+ const __m256i s_m_0 = _mm256_and_si256(s_0, mask);
+ const __m256i sad_0 = _mm256_sad_epu8(s_m_0, zero);
+ ss = _mm256_add_epi32(ss, sad_0);
+ }
+ src_temp += stride;
+ proc_ht++;
+ } while (proc_ht < height);
+
+ const uint32_t sum = calc_sum_of_register(ss);
+ const uint8_t avg = sum / (width * height);
+ return avg;
+}
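
The averaging loop above relies on _mm256_sad_epu8 against zero to sum 32 unsigned bytes into four 64-bit partial sums in one instruction. A standalone sketch of that reduction (helper name hypothetical):

#include <immintrin.h>
#include <stdint.h>

static uint32_t sum32_u8_sketch(const uint8_t *p) {
  const __m256i v = _mm256_loadu_si256((const __m256i *)p);
  /* SAD against zero: |x - 0| summed per 8-byte group. */
  const __m256i s = _mm256_sad_epu8(v, _mm256_setzero_si256());
  const __m128i lo = _mm256_castsi256_si128(s);
  const __m128i hi = _mm256_extracti128_si256(s, 1);
  __m128i t = _mm_add_epi64(lo, hi);
  t = _mm_add_epi64(t, _mm_srli_si128(t, 8));
  return (uint32_t)_mm_cvtsi128_si32(t);
}
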
+
+// Fill the (src-avg) or (dgd-avg) buffers. Note that when n = (width % 16) is
+// not 0, (16 - n) more values than required are written.
+static INLINE void sub_avg_block_avx2(const uint8_t *src, int32_t src_stride,
+ uint8_t avg, int32_t width,
+ int32_t height, int16_t *dst,
+ int32_t dst_stride,
+ int use_downsampled_wiener_stats) {
+ const __m256i avg_reg = _mm256_set1_epi16(avg);
+
+ int32_t proc_ht = 0;
+ do {
+ int ds_factor =
+ use_downsampled_wiener_stats ? WIENER_STATS_DOWNSAMPLE_FACTOR : 1;
+ if (use_downsampled_wiener_stats &&
+ (height - proc_ht < WIENER_STATS_DOWNSAMPLE_FACTOR)) {
+ ds_factor = height - proc_ht;
+ }
+
+ int32_t proc_wd = 0;
+ while (proc_wd < width) {
+ const __m128i s = _mm_loadu_si128((__m128i *)(src + proc_wd));
+ const __m256i ss = _mm256_cvtepu8_epi16(s);
+ const __m256i d = _mm256_sub_epi16(ss, avg_reg);
+ _mm256_storeu_si256((__m256i *)(dst + proc_wd), d);
+ proc_wd += 16;
+ }
+
+ src += ds_factor * src_stride;
+ dst += ds_factor * dst_stride;
+ proc_ht += ds_factor;
+ } while (proc_ht < height);
+}
+
+// Fills the lower-triangular elements of the H buffer from its
+// upper-triangular elements.
+static INLINE void fill_lower_triag_elements_avx2(const int32_t wiener_win2,
+ int64_t *const H) {
+ for (int32_t i = 0; i < wiener_win2 - 1; i += 4) {
+ __m256i in[4], out[4];
+
+ in[0] = _mm256_loadu_si256((__m256i *)(H + (i + 0) * wiener_win2 + i + 1));
+ in[1] = _mm256_loadu_si256((__m256i *)(H + (i + 1) * wiener_win2 + i + 1));
+ in[2] = _mm256_loadu_si256((__m256i *)(H + (i + 2) * wiener_win2 + i + 1));
+ in[3] = _mm256_loadu_si256((__m256i *)(H + (i + 3) * wiener_win2 + i + 1));
+
+ transpose_64bit_4x4_avx2(in, out);
+
+ _mm_storel_epi64((__m128i *)(H + (i + 1) * wiener_win2 + i),
+ _mm256_castsi256_si128(out[0]));
+ _mm_storeu_si128((__m128i *)(H + (i + 2) * wiener_win2 + i),
+ _mm256_castsi256_si128(out[1]));
+ _mm256_storeu_si256((__m256i *)(H + (i + 3) * wiener_win2 + i), out[2]);
+ _mm256_storeu_si256((__m256i *)(H + (i + 4) * wiener_win2 + i), out[3]);
+
+ for (int32_t j = i + 5; j < wiener_win2; j += 4) {
+ in[0] = _mm256_loadu_si256((__m256i *)(H + (i + 0) * wiener_win2 + j));
+ in[1] = _mm256_loadu_si256((__m256i *)(H + (i + 1) * wiener_win2 + j));
+ in[2] = _mm256_loadu_si256((__m256i *)(H + (i + 2) * wiener_win2 + j));
+ in[3] = _mm256_loadu_si256((__m256i *)(H + (i + 3) * wiener_win2 + j));
+
+ transpose_64bit_4x4_avx2(in, out);
+
+ _mm256_storeu_si256((__m256i *)(H + (j + 0) * wiener_win2 + i), out[0]);
+ _mm256_storeu_si256((__m256i *)(H + (j + 1) * wiener_win2 + i), out[1]);
+ _mm256_storeu_si256((__m256i *)(H + (j + 2) * wiener_win2 + i), out[2]);
+ _mm256_storeu_si256((__m256i *)(H + (j + 3) * wiener_win2 + i), out[3]);
}
}
}
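
For reference, the symmetric fill that fill_lower_triag_elements_avx2() performs with 4x4 64-bit transposes reduces to this scalar loop (a sketch, not the libaom API):

#include <stdint.h>

/* H is an n x n matrix stored row-major; mirror the upper triangle. */
static void fill_lower_triangle_sketch(int n, int64_t *H) {
  for (int i = 0; i < n; i++)
    for (int j = i + 1; j < n; j++)
      H[j * n + i] = H[i * n + j];
}
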
-static INLINE void compute_stats_win5_opt_avx2(
- const uint8_t *dgd, const uint8_t *src, int h_start, int h_end, int v_start,
- int v_end, int dgd_stride, int src_stride, int64_t *M, int64_t *H,
- int use_downsampled_wiener_stats) {
- int i, j, k, l, m, n;
- const int wiener_win = WIENER_WIN_CHROMA;
- const int pixel_count = (h_end - h_start) * (v_end - v_start);
- const int wiener_win2 = wiener_win * wiener_win;
- const int wiener_halfwin = (wiener_win >> 1);
- uint8_t avg = find_average(dgd, h_start, h_end, v_start, v_end, dgd_stride);
-
- int32_t M_int32[WIENER_WIN_CHROMA][WIENER_WIN_CHROMA] = { { 0 } };
- int32_t M_int32_row[WIENER_WIN_CHROMA][WIENER_WIN_CHROMA] = { { 0 } };
- int64_t M_int64[WIENER_WIN_CHROMA][WIENER_WIN_CHROMA] = { { 0 } };
- DECLARE_ALIGNED(
- 32, int32_t,
- H_int32[WIENER_WIN2_CHROMA][WIENER_WIN_CHROMA * 8]) = { { 0 } };
- DECLARE_ALIGNED(
- 32, int32_t,
- H_int32_row[WIENER_WIN2_CHROMA][WIENER_WIN_CHROMA * 8]) = { { 0 } };
- int64_t H_int64[WIENER_WIN2_CHROMA][WIENER_WIN_CHROMA * 8] = { { 0 } };
- int32_t sumY[WIENER_WIN_CHROMA][WIENER_WIN_CHROMA] = { { 0 } };
- int32_t sumX = 0;
- const uint8_t *dgd_win = dgd - wiener_halfwin * dgd_stride - wiener_halfwin;
- int downsample_factor =
- use_downsampled_wiener_stats ? WIENER_STATS_DOWNSAMPLE_FACTOR : 1;
- int32_t sumX_row = 0;
- int32_t sumY_row[WIENER_WIN_CHROMA][WIENER_WIN_CHROMA] = { { 0 } };
-
- const __m128i shuffle = xx_loadu_128(g_shuffle_stats_data);
- for (j = v_start; j < v_end; j += 64) {
- const int vert_end = AOMMIN(64, v_end - j) + j;
- for (i = j; i < vert_end; i = i + downsample_factor) {
- if (use_downsampled_wiener_stats &&
- (vert_end - i < WIENER_STATS_DOWNSAMPLE_FACTOR)) {
- downsample_factor = vert_end - i;
- }
- sumX_row = 0;
- memset(sumY_row, 0,
- sizeof(int32_t) * WIENER_WIN_CHROMA * WIENER_WIN_CHROMA);
- memset(M_int32_row, 0,
- sizeof(int32_t) * WIENER_WIN_CHROMA * WIENER_WIN_CHROMA);
- memset(H_int32_row, 0,
- sizeof(int32_t) * WIENER_WIN2_CHROMA * (WIENER_WIN_CHROMA * 8));
- acc_stat_win5_one_line_avx2(
- dgd_win + i * dgd_stride, src + i * src_stride, h_start, h_end,
- dgd_stride, &shuffle, &sumX_row, sumY_row, M_int32_row, H_int32_row);
- sumX += sumX_row * downsample_factor;
-
- // Scale M matrix based on the downsampling factor
- for (k = 0; k < wiener_win; ++k) {
- for (l = 0; l < wiener_win; ++l) {
- sumY[k][l] += (sumY_row[k][l] * downsample_factor);
- M_int32[k][l] += (M_int32_row[k][l] * downsample_factor);
- }
- }
- // Scale H matrix based on the downsampling factor
- for (k = 0; k < WIENER_WIN2_CHROMA; ++k) {
- for (l = 0; l < WIENER_WIN_CHROMA * 8; ++l) {
- H_int32[k][l] += (H_int32_row[k][l] * downsample_factor);
- }
- }
- }
- for (k = 0; k < wiener_win; ++k) {
- for (l = 0; l < wiener_win; ++l) {
- M_int64[k][l] += M_int32[k][l];
- M_int32[k][l] = 0;
- }
- }
- for (k = 0; k < WIENER_WIN2_CHROMA; ++k) {
- for (l = 0; l < WIENER_WIN_CHROMA * 8; ++l) {
- H_int64[k][l] += H_int32[k][l];
- H_int32[k][l] = 0;
- }
- }
+// Fill H buffer based on loop_count.
+#define INIT_H_VALUES(d, loop_count) \
+ for (int g = 0; g < (loop_count); g++) { \
+ const __m256i dgd0 = \
+ _mm256_loadu_si256((__m256i *)((d) + (g * d_stride))); \
+ madd_and_accum_avx2(dgd_mul_df, dgd0, &sum_h[g]); \
}
- const int64_t avg_square_sum = (int64_t)avg * (int64_t)avg * pixel_count;
- for (k = 0; k < wiener_win; k++) {
- for (l = 0; l < wiener_win; l++) {
- const int32_t idx0 = l * wiener_win + k;
- M[idx0] =
- M_int64[k][l] + (avg_square_sum - (int64_t)avg * (sumX + sumY[k][l]));
- int64_t *H_ = H + idx0 * wiener_win2;
- int64_t *H_int_ = &H_int64[idx0][0];
- for (m = 0; m < wiener_win; m++) {
- for (n = 0; n < wiener_win; n++) {
- H_[m * wiener_win + n] = H_int_[n * 8 + m] + avg_square_sum -
- (int64_t)avg * (sumY[k][l] + sumY[n][m]);
- }
- }
- }
+// Fill M & H buffer.
+#define INIT_MH_VALUES(d) \
+ for (int g = 0; g < wiener_win; g++) { \
+ const __m256i dgds_0 = \
+ _mm256_loadu_si256((__m256i *)((d) + (g * d_stride))); \
+ madd_and_accum_avx2(src_mul_df, dgds_0, &sum_m[g]); \
+ madd_and_accum_avx2(dgd_mul_df, dgds_0, &sum_h[g]); \
}
+
+// Derive the 'j' iteration index, update the dgd pointers appropriately, and
+// zero the sum_h accumulators.
+#define INITIALIZATION(wiener_window_sz) \
+ j = i / (wiener_window_sz); \
+ const int16_t *d_window = d + j; \
+ const int16_t *d_current_row = \
+ d + j + ((i % (wiener_window_sz)) * d_stride); \
+ int proc_ht = v_start; \
+ downsample_factor = \
+ use_downsampled_wiener_stats ? WIENER_STATS_DOWNSAMPLE_FACTOR : 1; \
+ __m256i sum_h[wiener_window_sz]; \
+ memset(sum_h, 0, sizeof(sum_h));
+
+// Update the downsample factor appropriately.
+#define UPDATE_DOWNSAMPLE_FACTOR \
+ int proc_wd = 0; \
+ if (use_downsampled_wiener_stats && \
+ ((v_end - proc_ht) < WIENER_STATS_DOWNSAMPLE_FACTOR)) { \
+ downsample_factor = v_end - proc_ht; \
+ } \
+ const __m256i df_reg = _mm256_set1_epi16(downsample_factor);
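
UPDATE_DOWNSAMPLE_FACTOR implements the row subsampling used for downsampled Wiener stats: one row out of every WIENER_STATS_DOWNSAMPLE_FACTOR is visited and its contribution is scaled by the factor (via df_reg), with the factor shrunk near v_end so the tail rows are not over-weighted. A scalar sketch of the traversal (helper name hypothetical):

#include <stdint.h>

static int64_t downsampled_sum_sketch(const int16_t *p, int stride,
                                      int v_start, int v_end, int df0) {
  int64_t sum = 0;
  int df = df0;
  for (int r = v_start; r < v_end; r += df) {
    if (v_end - r < df0) df = v_end - r; /* partial tail group */
    sum += (int64_t)p[r * stride] * df;  /* weight the sampled row by df */
  }
  return sum;
}
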
+
+#define CALCULATE_REMAINING_H_WIN5 \
+ while (j < wiener_win) { \
+ d_window = d; \
+ d_current_row = d + (i / wiener_win) + ((i % wiener_win) * d_stride); \
+ const __m256i zero = _mm256_setzero_si256(); \
+ sum_h[0] = zero; \
+ sum_h[1] = zero; \
+ sum_h[2] = zero; \
+ sum_h[3] = zero; \
+ sum_h[4] = zero; \
+ \
+ proc_ht = v_start; \
+ downsample_factor = \
+ use_downsampled_wiener_stats ? WIENER_STATS_DOWNSAMPLE_FACTOR : 1; \
+ do { \
+ UPDATE_DOWNSAMPLE_FACTOR; \
+ \
+ /* Process the amount of width multiple of 16.*/ \
+ while (proc_wd < wd_mul16) { \
+ const __m256i dgd = \
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd)); \
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd, df_reg); \
+ INIT_H_VALUES(d_window + j + proc_wd, 5) \
+ \
+ proc_wd += 16; \
+ }; \
+ \
+ /* Process the remaining width here. */ \
+ if (wd_beyond_mul16) { \
+ const __m256i dgd = \
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd)); \
+ const __m256i dgd_mask = _mm256_and_si256(dgd, mask); \
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd_mask, df_reg); \
+ INIT_H_VALUES(d_window + j + proc_wd, 5) \
+ } \
+ proc_ht += downsample_factor; \
+ d_window += downsample_factor * d_stride; \
+ d_current_row += downsample_factor * d_stride; \
+ } while (proc_ht < v_end); \
+ const __m256i s_h0 = \
+ hadd_four_32_to_64_avx2(sum_h[0], sum_h[1], &sum_h[2], &sum_h[3]); \
+ _mm256_storeu_si256((__m256i *)(H + (i * wiener_win2) + (wiener_win * j)), \
+ s_h0); \
+ const __m256i s_m_h = convert_and_add_avx2(sum_h[4]); \
+ const __m128i s_m_h0 = add_64bit_lvl_avx2(s_m_h, s_m_h); \
+ _mm_storel_epi64( \
+ (__m128i *)(H + (i * wiener_win2) + (wiener_win * j) + 4), s_m_h0); \
+ j++; \
+ }
+
+#define CALCULATE_REMAINING_H_WIN7 \
+ while (j < wiener_win) { \
+ d_window = d; \
+ d_current_row = d + (i / wiener_win) + ((i % wiener_win) * d_stride); \
+ const __m256i zero = _mm256_setzero_si256(); \
+ sum_h[0] = zero; \
+ sum_h[1] = zero; \
+ sum_h[2] = zero; \
+ sum_h[3] = zero; \
+ sum_h[4] = zero; \
+ sum_h[5] = zero; \
+ sum_h[6] = zero; \
+ \
+ proc_ht = v_start; \
+ downsample_factor = \
+ use_downsampled_wiener_stats ? WIENER_STATS_DOWNSAMPLE_FACTOR : 1; \
+ do { \
+ UPDATE_DOWNSAMPLE_FACTOR; \
+ \
+ /* Process the amount of width multiple of 16.*/ \
+ while (proc_wd < wd_mul16) { \
+ const __m256i dgd = \
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd)); \
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd, df_reg); \
+ INIT_H_VALUES(d_window + j + proc_wd, 7) \
+ \
+ proc_wd += 16; \
+ }; \
+ \
+ /* Process the remaining width here. */ \
+ if (wd_beyond_mul16) { \
+ const __m256i dgd = \
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd)); \
+ const __m256i dgd_mask = _mm256_and_si256(dgd, mask); \
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd_mask, df_reg); \
+ INIT_H_VALUES(d_window + j + proc_wd, 7) \
+ } \
+ proc_ht += downsample_factor; \
+ d_window += downsample_factor * d_stride; \
+ d_current_row += downsample_factor * d_stride; \
+ } while (proc_ht < v_end); \
+ const __m256i s_h1 = \
+ hadd_four_32_to_64_avx2(sum_h[0], sum_h[1], &sum_h[2], &sum_h[3]); \
+ _mm256_storeu_si256((__m256i *)(H + (i * wiener_win2) + (wiener_win * j)), \
+ s_h1); \
+ const __m256i s_h2 = \
+ hadd_four_32_to_64_avx2(sum_h[4], sum_h[5], &sum_h[6], &sum_h[6]); \
+ _mm256_storeu_si256( \
+ (__m256i *)(H + (i * wiener_win2) + (wiener_win * j) + 4), s_h2); \
+ j++; \
+ }
+
+// The buffers H (auto-covariance) and M (cross-correlation) are used to
+// estimate the filter tap values required for wiener filtering. Here, the
+// buffer H is of size ((wiener_window_size^2)*(wiener_window_size^2)) and M is
+// of size (wiener_window_size*wiener_window_size). H is a symmetric matrix
+// where the values above the diagonal (upper triangle) are equal to the values
+// below the diagonal (lower triangle). The calculation of the elements/stats
+// of H (upper triangle) and M is done in steps as described below, where each
+// step fills specific values of H and M.
+// Once the upper-triangular elements of the H matrix are derived, they are
+// copied to the lower triangle using the function
+// fill_lower_triag_elements_avx2().
+// Example:
+// Wiener window size = WIENER_WIN_CHROMA (5)
+// M buffer = [M0 M1 M2 ---- M23 M24]
+// H buffer = Hxy (x-row, y-column)
+//            [H00  H01  H02  ---- H023  H024]
+//            [H10  H11  H12  ---- H123  H124]
+//            [H20  H21  H22  ---- H223  H224]
+//            [H30  H31  H32  ---- H323  H324]
+//            [H40  H41  H42  ---- H423  H424]
+//            [H50  H51  H52  ---- H523  H524]
+//            [H60  H61  H62  ---- H623  H624]
+//            ||
+//            ||
+//            [H230 H231 H232 ---- H2323 H2324]
+//            [H240 H241 H242 ---- H2423 H2424]
+// In Step 1, the whole M buffer (i.e., M0 to M24) and the first row of H
+// (i.e., H00 to H024) are filled. The remaining rows of the H buffer are
+// filled through steps 2 to 6.
+static void compute_stats_win5_avx2(const int16_t *const d, int32_t d_stride,
+ const int16_t *const s, int32_t s_stride,
+ int32_t width, int v_start, int v_end,
+ int64_t *const M, int64_t *const H,
+ int use_downsampled_wiener_stats) {
+ const int32_t wiener_win = WIENER_WIN_CHROMA;
+ const int32_t wiener_win2 = wiener_win * wiener_win;
+ // Amount of width which is beyond a multiple of 16. This remainder is
+ // handled separately so that only the required width is processed towards
+ // the end.
+ const int32_t wd_mul16 = width & ~15;
+ const int32_t wd_beyond_mul16 = width - wd_mul16;
+ const __m256i mask =
+ _mm256_loadu_si256((__m256i *)(&mask_16bit[16 - wd_beyond_mul16]));
+ int downsample_factor;
+
+ // Step 1: Full M (i.e., M0 to M24) and first-row H (i.e., H00 to H024)
+ // values are filled here. The loop over 'j' is executed for values 0 to 4
+ // (wiener_win - 1). When the loop executes for a specific 'j', 5 values of
+ // M and H are filled as shown below.
+ // j=0: M0-M4 and H00-H04, j=1: M5-M9 and H05-H09 are filled, etc.
+ int j = 0;
+ do {
+ const int16_t *s_t = s;
+ const int16_t *d_t = d;
+ __m256i sum_m[WIENER_WIN_CHROMA] = { _mm256_setzero_si256() };
+ __m256i sum_h[WIENER_WIN_CHROMA] = { _mm256_setzero_si256() };
+ downsample_factor =
+ use_downsampled_wiener_stats ? WIENER_STATS_DOWNSAMPLE_FACTOR : 1;
+ int proc_ht = v_start;
+ do {
+ UPDATE_DOWNSAMPLE_FACTOR
+
+ // Process the amount of width multiple of 16.
+ while (proc_wd < wd_mul16) {
+ const __m256i src = _mm256_loadu_si256((__m256i *)(s_t + proc_wd));
+ const __m256i dgd = _mm256_loadu_si256((__m256i *)(d_t + proc_wd));
+ const __m256i src_mul_df = _mm256_mullo_epi16(src, df_reg);
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd, df_reg);
+ INIT_MH_VALUES(d_t + j + proc_wd)
+
+ proc_wd += 16;
+ }
+
+ // Process the remaining width here.
+ if (wd_beyond_mul16) {
+ const __m256i src = _mm256_loadu_si256((__m256i *)(s_t + proc_wd));
+ const __m256i dgd = _mm256_loadu_si256((__m256i *)(d_t + proc_wd));
+ const __m256i src_mask = _mm256_and_si256(src, mask);
+ const __m256i dgd_mask = _mm256_and_si256(dgd, mask);
+ const __m256i src_mul_df = _mm256_mullo_epi16(src_mask, df_reg);
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd_mask, df_reg);
+ INIT_MH_VALUES(d_t + j + proc_wd)
+ }
+ proc_ht += downsample_factor;
+ s_t += downsample_factor * s_stride;
+ d_t += downsample_factor * d_stride;
+ } while (proc_ht < v_end);
+
+ const __m256i s_m =
+ hadd_four_32_to_64_avx2(sum_m[0], sum_m[1], &sum_m[2], &sum_m[3]);
+ const __m128i s_m_h = convert_32_to_64_add_avx2(sum_m[4], sum_h[4]);
+ _mm256_storeu_si256((__m256i *)(M + wiener_win * j), s_m);
+ _mm_storel_epi64((__m128i *)&M[wiener_win * j + 4], s_m_h);
+
+ const __m256i s_h =
+ hadd_four_32_to_64_avx2(sum_h[0], sum_h[1], &sum_h[2], &sum_h[3]);
+ _mm256_storeu_si256((__m256i *)(H + wiener_win * j), s_h);
+ _mm_storeh_epi64((__m128i *)&H[wiener_win * j + 4], s_m_h);
+ } while (++j < wiener_win);
+
+ // The below steps are designed to fill the remaining rows of the H buffer.
+ // The aim is to fill only the upper-triangle elements corresponding to each
+ // row; the lower-triangle elements are then copied from the upper-triangle
+ // elements. Also, as mentioned in Step 1, the core function is designed to
+ // fill 5 elements/stats/values of the H buffer at a time.
+ //
+ // Step 2: Here, the rows 1, 6, 11, 16 and 21 are filled. As we need to fill
+ // only upper-triangle elements, H10 from row1, and H60-H64 and H65 from
+ // row6, etc., need not be filled. As the core function processes 5 values,
+ // in the first iteration of 'j' only 4 values are to be filled, i.e.,
+ // H11-H14 from row1, H66-H69 from row6, etc.
+ for (int i = 1; i < wiener_win2; i += wiener_win) {
+ // Update the dgd pointers appropriately and also derive the 'j'th iteration
+ // from where the H buffer filling needs to be started.
+ INITIALIZATION(WIENER_WIN_CHROMA)
+
+ do {
+ UPDATE_DOWNSAMPLE_FACTOR
+
+ // Process the amount of width multiple of 16.
+ while (proc_wd < wd_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd, df_reg);
+ INIT_H_VALUES(d_window + proc_wd + (1 * d_stride), 4)
+
+ proc_wd += 16;
+ }
+
+ // Process the remaining width here.
+ if (wd_beyond_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mask = _mm256_and_si256(dgd, mask);
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd_mask, df_reg);
+ INIT_H_VALUES(d_window + proc_wd + (1 * d_stride), 4)
+ }
+ proc_ht += downsample_factor;
+ d_window += downsample_factor * d_stride;
+ d_current_row += downsample_factor * d_stride;
+ } while (proc_ht < v_end);
+ const __m256i s_h =
+ hadd_four_32_to_64_avx2(sum_h[0], sum_h[1], &sum_h[2], &sum_h[3]);
+ _mm256_storeu_si256((__m256i *)(H + (i * wiener_win2) + i), s_h);
+
+ // process the remaining 'j' iterations.
+ j++;
+ CALCULATE_REMAINING_H_WIN5
+ }
+
+ // Step 3: Here, the rows 2, 7, 12, 17 and 22 are filled. As we need to fill
+ // only upper-triangle elements, H20-H21 from row2, and H70-H74 and H75-H76
+ // from row7, etc., need not be filled. As the core function processes 5
+ // values, in the first iteration of 'j' only 3 values are to be filled,
+ // i.e., H22-H24 from row2, H77-H79 from row7, etc.
+ for (int i = 2; i < wiener_win2; i += wiener_win) {
+ // Update the dgd pointers appropriately and also derive the 'j'th iteration
+ // from where the H buffer filling needs to be started.
+ INITIALIZATION(WIENER_WIN_CHROMA)
+
+ do {
+ UPDATE_DOWNSAMPLE_FACTOR
+
+ // Process the amount of width multiple of 16.
+ while (proc_wd < wd_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd, df_reg);
+ INIT_H_VALUES(d_window + proc_wd + (2 * d_stride), 3)
+
+ proc_wd += 16;
+ }
+
+ // Process the remaining width here.
+ if (wd_beyond_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mask = _mm256_and_si256(dgd, mask);
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd_mask, df_reg);
+ INIT_H_VALUES(d_window + proc_wd + (2 * d_stride), 3)
+ }
+ proc_ht += downsample_factor;
+ d_window += downsample_factor * d_stride;
+ d_current_row += downsample_factor * d_stride;
+ } while (proc_ht < v_end);
+ const __m256i s_h =
+ hadd_four_32_to_64_avx2(sum_h[0], sum_h[1], &sum_h[2], &sum_h[3]);
+ _mm256_storeu_si256((__m256i *)(H + (i * wiener_win2) + i), s_h);
+
+ // process the remaining 'j' iterations.
+ j++;
+ CALCULATE_REMAINING_H_WIN5
+ }
+
+ // Step 4: Here, the rows 3, 8, 13, 18 and 23 are filled. As we need to fill
+ // only upper-triangle elements, H30-H32 from row3, and H80-H84 and H85-H87
+ // from row8, etc., need not be filled. As the core function processes 5
+ // values, in the first iteration of 'j' only 2 values are to be filled,
+ // i.e., H33-H34 from row3, H88-H89 from row8, etc.
+ for (int i = 3; i < wiener_win2; i += wiener_win) {
+ // Update the dgd pointers appropriately and also derive the 'j'th iteration
+ // from where the H buffer filling needs to be started.
+ INITIALIZATION(WIENER_WIN_CHROMA)
+
+ do {
+ UPDATE_DOWNSAMPLE_FACTOR
+
+ // Process the amount of width multiple of 16.
+ while (proc_wd < wd_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd, df_reg);
+ INIT_H_VALUES(d_window + proc_wd + (3 * d_stride), 2)
+
+ proc_wd += 16;
+ }
+
+ // Process the remaining width here.
+ if (wd_beyond_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mask = _mm256_and_si256(dgd, mask);
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd_mask, df_reg);
+ INIT_H_VALUES(d_window + proc_wd + (3 * d_stride), 2)
+ }
+ proc_ht += downsample_factor;
+ d_window += downsample_factor * d_stride;
+ d_current_row += downsample_factor * d_stride;
+ } while (proc_ht < v_end);
+ const __m128i s_h = convert_32_to_64_add_avx2(sum_h[0], sum_h[1]);
+ _mm_storeu_si128((__m128i *)(H + (i * wiener_win2) + i), s_h);
+
+ // process the remaining 'j' iterations.
+ j++;
+ CALCULATE_REMAINING_H_WIN5
+ }
+
+ // Step 5: Here, the rows 4, 9, 14, 19 and 24 are filled. As we need to fill
+ // only upper-triangle elements, H40-H43 from row4, and H90-H94 and H95-H98
+ // from row9, etc., need not be filled. As the core function processes 5
+ // values, in the first iteration of 'j' only 1 value is to be filled, i.e.,
+ // H44 from row4, H99 from row9, etc.
+ for (int i = 4; i < wiener_win2; i += wiener_win) {
+ // Update the dgd pointers appropriately and also derive the 'j'th iteration
+ // from where the H buffer filling needs to be started.
+ INITIALIZATION(WIENER_WIN_CHROMA)
+ do {
+ UPDATE_DOWNSAMPLE_FACTOR
+
+ // Process the amount of width multiple of 16.
+ while (proc_wd < wd_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd, df_reg);
+ INIT_H_VALUES(d_window + proc_wd + (4 * d_stride), 1)
+
+ proc_wd += 16;
+ }
+
+ // Process the remaining width here.
+ if (wd_beyond_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mask = _mm256_and_si256(dgd, mask);
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd_mask, df_reg);
+ INIT_H_VALUES(d_window + proc_wd + (4 * d_stride), 1)
+ }
+ proc_ht += downsample_factor;
+ d_window += downsample_factor * d_stride;
+ d_current_row += downsample_factor * d_stride;
+ } while (proc_ht < v_end);
+ const __m128i s_h = convert_32_to_64_add_avx2(sum_h[0], sum_h[1]);
+ _mm_storeu_si128((__m128i *)(H + (i * wiener_win2) + i), s_h);
+
+ // process the remaining 'j' iterations.
+ j++;
+ CALCULATE_REMAINING_H_WIN5
+ }
+
+ // Step 6: Here, the rows 5, 10, 15 and 20 are filled. As we need to fill
+ // only upper-triangle elements, H50-H54 from row5, and H100-H104 and
+ // H105-H109 from row10, etc., need not be filled. The first iteration of 'j'
+ // fills H55-H59 from row5, H1010-H1014 from row10, etc.
+ for (int i = 5; i < wiener_win2; i += wiener_win) {
+ // Derive j'th iteration from where the H buffer filling needs to be
+ // started.
+ j = i / wiener_win;
+ int shift = 0;
+ do {
+ // Update the dgd pointers appropriately.
+ int proc_ht = v_start;
+ const int16_t *d_window = d + (i / wiener_win);
+ const int16_t *d_current_row =
+ d + (i / wiener_win) + ((i % wiener_win) * d_stride);
+ downsample_factor =
+ use_downsampled_wiener_stats ? WIENER_STATS_DOWNSAMPLE_FACTOR : 1;
+ __m256i sum_h[WIENER_WIN_CHROMA] = { _mm256_setzero_si256() };
+ do {
+ UPDATE_DOWNSAMPLE_FACTOR
+
+ // Process the amount of width multiple of 16.
+ while (proc_wd < wd_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd, df_reg);
+ INIT_H_VALUES(d_window + shift + proc_wd, 5)
+
+ proc_wd += 16;
+ }
+
+ // Process the remaining width here.
+ if (wd_beyond_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mask = _mm256_and_si256(dgd, mask);
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd_mask, df_reg);
+ INIT_H_VALUES(d_window + shift + proc_wd, 5)
+ }
+ proc_ht += downsample_factor;
+ d_window += downsample_factor * d_stride;
+ d_current_row += downsample_factor * d_stride;
+ } while (proc_ht < v_end);
+
+ const __m256i s_h =
+ hadd_four_32_to_64_avx2(sum_h[0], sum_h[1], &sum_h[2], &sum_h[3]);
+ _mm256_storeu_si256((__m256i *)(H + (i * wiener_win2) + (wiener_win * j)),
+ s_h);
+ const __m256i s_m_h = convert_and_add_avx2(sum_h[4]);
+ const __m128i s_m_h0 = add_64bit_lvl_avx2(s_m_h, s_m_h);
+ _mm_storel_epi64(
+ (__m128i *)(H + (i * wiener_win2) + (wiener_win * j) + 4), s_m_h0);
+ shift++;
+ } while (++j < wiener_win);
+ }
+
+ fill_lower_triag_elements_avx2(wiener_win2, H);
+}
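
The statistics that compute_stats_win5_avx2() (and compute_stats_win7_avx2() below) accumulate reduce to the following scalar form: for each source pixel, M gathers window*src products and H gathers the pairwise window*window products, upper triangle only. A sketch with simplified index ordering (d and s are the avg-subtracted dgd/src buffers; not the exact libaom ordering):

#include <stdint.h>

static void stats_sketch(const int16_t *d, int d_stride, const int16_t *s,
                         int s_stride, int width, int height, int win,
                         int64_t *M, int64_t *H) {
  const int win2 = win * win; /* win <= 7, so win2 <= 49 */
  for (int r = 0; r < height; r++) {
    for (int c = 0; c < width; c++) {
      int16_t w[49];
      int idx = 0;
      for (int k = 0; k < win; k++)
        for (int l = 0; l < win; l++)
          w[idx++] = d[(r + k) * d_stride + c + l];
      const int64_t X = s[r * s_stride + c];
      for (int i = 0; i < win2; i++) {
        M[i] += w[i] * X;
        for (int j = i; j < win2; j++) /* upper triangle only */
          H[i * win2 + j] += (int64_t)w[i] * w[j];
      }
    }
  }
}
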
+
+// The buffers H (auto-covariance) and M (cross-correlation) are used to
+// estimate the filter tap values required for wiener filtering. Here, the
+// buffer H is of size ((wiener_window_size^2)*(wiener_window_size^2)) and M is
+// of size (wiener_window_size*wiener_window_size). H is a symmetric matrix
+// where the values above the diagonal (upper triangle) are equal to the values
+// below the diagonal (lower triangle). The calculation of the elements/stats
+// of H (upper triangle) and M is done in steps as described below, where each
+// step fills specific values of H and M.
+// Example:
+// Wiener window size = WIENER_WIN (7)
+// M buffer = [M0 M1 M2 ---- M47 M48]
+// H buffer = Hxy (x-row, y-column)
+//            [H00  H01  H02  ---- H047  H048]
+//            [H10  H11  H12  ---- H147  H148]
+//            [H20  H21  H22  ---- H247  H248]
+//            [H30  H31  H32  ---- H347  H348]
+//            [H40  H41  H42  ---- H447  H448]
+//            [H50  H51  H52  ---- H547  H548]
+//            [H60  H61  H62  ---- H647  H648]
+//            ||
+//            ||
+//            [H470 H471 H472 ---- H4747 H4748]
+//            [H480 H481 H482 ---- H4847 H4848]
+// In Step 1, the whole M buffer (i.e., M0 to M48) and the first row of H
+// (i.e., H00 to H048) are filled. The remaining rows of the H buffer are
+// filled through steps 2 to 8.
+static void compute_stats_win7_avx2(const int16_t *const d, int32_t d_stride,
+ const int16_t *const s, int32_t s_stride,
+ int32_t width, int v_start, int v_end,
+ int64_t *const M, int64_t *const H,
+ int use_downsampled_wiener_stats) {
+ const int32_t wiener_win = WIENER_WIN;
+ const int32_t wiener_win2 = wiener_win * wiener_win;
+ // Amount of width which is beyond a multiple of 16. This remainder is
+ // handled separately so that only the required width is processed towards
+ // the end.
+ const int32_t wd_mul16 = width & ~15;
+ const int32_t wd_beyond_mul16 = width - wd_mul16;
+ const __m256i mask =
+ _mm256_loadu_si256((__m256i *)(&mask_16bit[16 - wd_beyond_mul16]));
+ int downsample_factor;
+
+ // Step 1: Full M (i.e., M0 to M48) and first-row H (i.e., H00 to H048)
+ // values are filled here. The loop over 'j' is executed for values 0 to 6.
+ // When the loop executes for a specific 'j', 7 values of M and H are filled
+ // as shown below.
+ // j=0: M0-M6 and H00-H06, j=1: M7-M13 and H07-H013 are filled, etc.
+ int j = 0;
+ do {
+ const int16_t *s_t = s;
+ const int16_t *d_t = d;
+ __m256i sum_m[WIENER_WIN] = { _mm256_setzero_si256() };
+ __m256i sum_h[WIENER_WIN] = { _mm256_setzero_si256() };
+ downsample_factor =
+ use_downsampled_wiener_stats ? WIENER_STATS_DOWNSAMPLE_FACTOR : 1;
+ int proc_ht = v_start;
+ do {
+ UPDATE_DOWNSAMPLE_FACTOR
+
+ // Process the amount of width multiple of 16.
+ while (proc_wd < wd_mul16) {
+ const __m256i src = _mm256_loadu_si256((__m256i *)(s_t + proc_wd));
+ const __m256i dgd = _mm256_loadu_si256((__m256i *)(d_t + proc_wd));
+ const __m256i src_mul_df = _mm256_mullo_epi16(src, df_reg);
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd, df_reg);
+ INIT_MH_VALUES(d_t + j + proc_wd)
+
+ proc_wd += 16;
+ }
+
+ if (wd_beyond_mul16) {
+ const __m256i src = _mm256_loadu_si256((__m256i *)(s_t + proc_wd));
+ const __m256i dgd = _mm256_loadu_si256((__m256i *)(d_t + proc_wd));
+ const __m256i src_mask = _mm256_and_si256(src, mask);
+ const __m256i dgd_mask = _mm256_and_si256(dgd, mask);
+ const __m256i src_mul_df = _mm256_mullo_epi16(src_mask, df_reg);
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd_mask, df_reg);
+ INIT_MH_VALUES(d_t + j + proc_wd)
+ }
+ proc_ht += downsample_factor;
+ s_t += downsample_factor * s_stride;
+ d_t += downsample_factor * d_stride;
+ } while (proc_ht < v_end);
+
+ const __m256i s_m0 =
+ hadd_four_32_to_64_avx2(sum_m[0], sum_m[1], &sum_m[2], &sum_m[3]);
+ const __m256i s_m1 =
+ hadd_four_32_to_64_avx2(sum_m[4], sum_m[5], &sum_m[6], &sum_m[6]);
+ _mm256_storeu_si256((__m256i *)(M + wiener_win * j + 0), s_m0);
+ _mm_storeu_si128((__m128i *)(M + wiener_win * j + 4),
+ _mm256_castsi256_si128(s_m1));
+ _mm_storel_epi64((__m128i *)&M[wiener_win * j + 6],
+ _mm256_extracti128_si256(s_m1, 1));
+
+ const __m256i sh_0 =
+ hadd_four_32_to_64_avx2(sum_h[0], sum_h[1], &sum_h[2], &sum_h[3]);
+ const __m256i sh_1 =
+ hadd_four_32_to_64_avx2(sum_h[4], sum_h[5], &sum_h[6], &sum_h[6]);
+ _mm256_storeu_si256((__m256i *)(H + wiener_win * j + 0), sh_0);
+ _mm_storeu_si128((__m128i *)(H + wiener_win * j + 4),
+ _mm256_castsi256_si128(sh_1));
+ _mm_storel_epi64((__m128i *)&H[wiener_win * j + 6],
+ _mm256_extracti128_si256(sh_1, 1));
+ } while (++j < wiener_win);
+
+ // The below steps are designed to fill the remaining rows of the H buffer.
+ // The aim is to fill only the upper-triangle elements corresponding to each
+ // row; the lower-triangle elements are then copied from the upper-triangle
+ // elements. Also, as mentioned in Step 1, the core function is designed to
+ // fill 7 elements/stats/values of the H buffer at a time.
+ //
+ // Step 2: Here, the rows 1, 8, 15, 22, 29, 36 and 43 are filled. As we need
+ // to fill only upper-triangle elements, H10 from row1, and H80-H86 and H87
+ // from row8, etc., need not be filled. As the core function processes 7
+ // values, in the first iteration of 'j' only 6 values are to be filled,
+ // i.e., H11-H16 from row1 and H88-H813 from row8, etc.
+ for (int i = 1; i < wiener_win2; i += wiener_win) {
+ // Update the dgd pointers appropriately and also derive the 'j'th iteration
+ // from where the H buffer filling needs to be started.
+ INITIALIZATION(WIENER_WIN)
+
+ do {
+ UPDATE_DOWNSAMPLE_FACTOR
+
+ // Process the amount of width multiple of 16.
+ while (proc_wd < wd_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd, df_reg);
+ INIT_H_VALUES(d_window + proc_wd + (1 * d_stride), 6)
+
+ proc_wd += 16;
+ }
+
+ // Process the remaining width here.
+ if (wd_beyond_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mask = _mm256_and_si256(dgd, mask);
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd_mask, df_reg);
+ INIT_H_VALUES(d_window + proc_wd + (1 * d_stride), 6)
+ }
+ proc_ht += downsample_factor;
+ d_window += downsample_factor * d_stride;
+ d_current_row += downsample_factor * d_stride;
+ } while (proc_ht < v_end);
+ const __m256i s_h =
+ hadd_four_32_to_64_avx2(sum_h[0], sum_h[1], &sum_h[2], &sum_h[3]);
+ _mm256_storeu_si256((__m256i *)(H + (i * wiener_win2) + i), s_h);
+ const __m128i s_h0 = convert_32_to_64_add_avx2(sum_h[4], sum_h[5]);
+ _mm_storeu_si128((__m128i *)(H + (i * wiener_win2) + i + 4), s_h0);
+
+ // process the remaining 'j' iterations.
+ j++;
+ CALCULATE_REMAINING_H_WIN7
+ }
+
+ // Step 3: Here, the rows 2, 9, 16, 23, 30, 37 and 44 are filled. As we need
+ // to fill only upper-triangle elements, H20-H21 from row2, and H90-H96 and
+ // H97-H98 from row9, etc., need not be filled. As the core function
+ // processes 7 values, in the first iteration of 'j' only 5 values are to be
+ // filled, i.e., H22-H26 from row2 and H99-H913 from row9, etc.
+ for (int i = 2; i < wiener_win2; i += wiener_win) {
+ // Update the dgd pointers appropriately and also derive the 'j'th iteration
+ // from where the H buffer filling needs to be started.
+ INITIALIZATION(WIENER_WIN)
+ do {
+ UPDATE_DOWNSAMPLE_FACTOR
+
+ // Process the amount of width multiple of 16.
+ while (proc_wd < wd_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd, df_reg);
+ INIT_H_VALUES(d_window + proc_wd + (2 * d_stride), 5)
+
+ proc_wd += 16;
+ }
+
+ // Process the remaining width here.
+ if (wd_beyond_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mask = _mm256_and_si256(dgd, mask);
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd_mask, df_reg);
+ INIT_H_VALUES(d_window + proc_wd + (2 * d_stride), 5)
+ }
+ proc_ht += downsample_factor;
+ d_window += downsample_factor * d_stride;
+ d_current_row += downsample_factor * d_stride;
+ } while (proc_ht < v_end);
+ const __m256i s_h =
+ hadd_four_32_to_64_avx2(sum_h[0], sum_h[1], &sum_h[2], &sum_h[3]);
+ _mm256_storeu_si256((__m256i *)(H + (i * wiener_win2) + i), s_h);
+ const __m256i s_m_h = convert_and_add_avx2(sum_h[4]);
+ const __m128i s_m_h0 = add_64bit_lvl_avx2(s_m_h, s_m_h);
+ _mm_storel_epi64((__m128i *)(H + (i * wiener_win2) + i + 4), s_m_h0);
+
+ // process the remaining 'j' iterations.
+ j++;
+ CALCULATE_REMAINING_H_WIN7
+ }
+
+ // Step 4: Here, the rows 3, 10, 17, 24, 31, 38 and 45 are filled. As we need
+ // to fill only upper-triangle elements, H30-H32 from row3, and H100-H106 and
+ // H107-H109 from row10, etc., need not be filled. As the core function
+ // processes 7 values, in the first iteration of 'j' only 4 values are to be
+ // filled, i.e., H33-H36 from row3 and H1010-H1013 from row10, etc.
+ for (int i = 3; i < wiener_win2; i += wiener_win) {
+ // Update the dgd pointers appropriately and also derive the 'j'th iteration
+ // from where the H buffer filling needs to be started.
+ INITIALIZATION(WIENER_WIN)
+
+ do {
+ UPDATE_DOWNSAMPLE_FACTOR
+
+ // Process the amount of width multiple of 16.
+ while (proc_wd < wd_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd, df_reg);
+ INIT_H_VALUES(d_window + proc_wd + (3 * d_stride), 4)
+
+ proc_wd += 16;
+ }
+
+ // Process the remaining width here.
+ if (wd_beyond_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mask = _mm256_and_si256(dgd, mask);
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd_mask, df_reg);
+ INIT_H_VALUES(d_window + proc_wd + (3 * d_stride), 4)
+ }
+ proc_ht += downsample_factor;
+ d_window += downsample_factor * d_stride;
+ d_current_row += downsample_factor * d_stride;
+ } while (proc_ht < v_end);
+ const __m256i s_h =
+ hadd_four_32_to_64_avx2(sum_h[0], sum_h[1], &sum_h[2], &sum_h[3]);
+ _mm256_storeu_si256((__m256i *)(H + (i * wiener_win2) + i), s_h);
+
+ // process the remaining 'j' iterations.
+ j++;
+ CALCULATE_REMAINING_H_WIN7
+ }
+
+ // Step 5: Here, the rows 4, 11, 18, 25, 32, 39 and 46 are filled. As we need
+ // to fill only upper-triangle elements, H40-H43 from row4, and H110-H116 and
+ // H117-H1110 from row11, etc., need not be filled. As the core function
+ // processes 7 values, in the first iteration of 'j' only 3 values are to be
+ // filled, i.e., H44-H46 from row4 and H1111-H1113 from row11, etc.
+ for (int i = 4; i < wiener_win2; i += wiener_win) {
+ // Update the dgd pointers appropriately and derive the 'j' iteration at
+ // which the H buffer filling needs to start.
+ INITIALIZATION(WIENER_WIN)
+
+ do {
+ UPDATE_DOWNSAMPLE_FACTOR
+
+ // Process the width that is a multiple of 16.
+ while (proc_wd < wd_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd, df_reg);
+ INIT_H_VALUES(d_window + proc_wd + (4 * d_stride), 3)
+
+ proc_wd += 16;
+ }
+
+ // Process the remaining width here.
+ if (wd_beyond_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mask = _mm256_and_si256(dgd, mask);
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd_mask, df_reg);
+ INIT_H_VALUES(d_window + proc_wd + (4 * d_stride), 3)
+ }
+ proc_ht += downsample_factor;
+ d_window += downsample_factor * d_stride;
+ d_current_row += downsample_factor * d_stride;
+ } while (proc_ht < v_end);
+ const __m256i s_h =
+ hadd_four_32_to_64_avx2(sum_h[0], sum_h[1], &sum_h[2], &sum_h[3]);
+ _mm256_storeu_si256((__m256i *)(H + (i * wiener_win2) + i), s_h);
+
+ // process the remaining 'j' iterations.
+ j++;
+ CALCULATE_REMAINING_H_WIN7
+ }
+
+ // Step 6: Here, the rows 5, 12, 19, 26, 33, 40 and 47 are filled. As we need
+ // to fill only upper-triangle elements, H50-H54 from row5, H120-H126 and
+ // H127-H1211 from row12, etc. need not be filled. As the core function
+ // processes 7 values, in the first iteration of 'j' only 2 values need to be
+ // filled, i.e., H55-H56 from row5 and H1212-H1213 from row12, etc.
+ for (int i = 5; i < wiener_win2; i += wiener_win) {
+ // Update the dgd pointers appropriately and derive the 'j' iteration at
+ // which the H buffer filling needs to start.
+ INITIALIZATION(WIENER_WIN)
+ do {
+ UPDATE_DOWNSAMPLE_FACTOR
+
+ // Process the width that is a multiple of 16.
+ while (proc_wd < wd_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd, df_reg);
+ INIT_H_VALUES(d_window + proc_wd + (5 * d_stride), 2)
+
+ proc_wd += 16;
+ }
+
+ // Process the remaining width here.
+ if (wd_beyond_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mask = _mm256_and_si256(dgd, mask);
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd_mask, df_reg);
+ INIT_H_VALUES(d_window + proc_wd + (5 * d_stride), 2)
+ }
+ proc_ht += downsample_factor;
+ d_window += downsample_factor * d_stride;
+ d_current_row += downsample_factor * d_stride;
+ } while (proc_ht < v_end);
+ const __m256i s_h =
+ hadd_four_32_to_64_avx2(sum_h[0], sum_h[1], &sum_h[2], &sum_h[3]);
+ _mm256_storeu_si256((__m256i *)(H + (i * wiener_win2) + i), s_h);
+
+ // process the remaining 'j' iterations.
+ j++;
+ CALCULATE_REMAINING_H_WIN7
+ }
+
+ // Step 7: Here, the rows 6, 13, 20, 27, 34, 41 and 48 are filled. As we need
+ // to fill only upper-triangle elements, H60-H65 from row6, H130-H136 and
+ // H137-H1312 from row13, etc. need not be filled. As the core function
+ // processes 7 values, in the first iteration of 'j' only 1 value needs to be
+ // filled, i.e., H66 from row6 and H1313 from row13, etc.
+ for (int i = 6; i < wiener_win2; i += wiener_win) {
+ // Update the dgd pointers appropriately and derive the 'j' iteration at
+ // which the H buffer filling needs to start.
+ INITIALIZATION(WIENER_WIN)
+ do {
+ UPDATE_DOWNSAMPLE_FACTOR
+
+ // Process the width that is a multiple of 16.
+ while (proc_wd < wd_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd, df_reg);
+ INIT_H_VALUES(d_window + proc_wd + (6 * d_stride), 1)
+
+ proc_wd += 16;
+ }
+
+ // Process the remaining width here.
+ if (wd_beyond_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mask = _mm256_and_si256(dgd, mask);
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd_mask, df_reg);
+ INIT_H_VALUES(d_window + proc_wd + (6 * d_stride), 1)
+ }
+ proc_ht += downsample_factor;
+ d_window += downsample_factor * d_stride;
+ d_current_row += downsample_factor * d_stride;
+ } while (proc_ht < v_end);
+ const __m256i s_h =
+ hadd_four_32_to_64_avx2(sum_h[0], sum_h[1], &sum_h[2], &sum_h[3]);
+ xx_storel_64(&H[(i * wiener_win2) + i], _mm256_castsi256_si128(s_h));
+
+ // process the remaining 'j' iterations.
+ j++;
+ CALCULATE_REMAINING_H_WIN7
+ }
+
+ // Step 8: Here, the rows 7, 14, 21, 28, 35 and 42 are filled. As we need
+ // to fill only upper-triangle elements, H70-H76 from row7, H140-H146 and
+ // H147-H1413 from row14, etc. need not be filled. The first iteration of
+ // 'j' fills H77-H713 from row7 and H1414-H1420 from row14, etc.
+ for (int i = 7; i < wiener_win2; i += wiener_win) {
+ // Derive the 'j' iteration at which the H buffer filling needs to
+ // start.
+ j = i / wiener_win;
+ int shift = 0;
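+ // 'shift' is the horizontal offset of the window column that corresponds
+ // to the current 7-wide group of H entries.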
+ do {
+ // Update the dgd pointers appropriately.
+ int proc_ht = v_start;
+ const int16_t *d_window = d + (i / WIENER_WIN);
+ const int16_t *d_current_row =
+ d + (i / WIENER_WIN) + ((i % WIENER_WIN) * d_stride);
+ downsample_factor =
+ use_downsampled_wiener_stats ? WIENER_STATS_DOWNSAMPLE_FACTOR : 1;
+ __m256i sum_h[WIENER_WIN] = { _mm256_setzero_si256() };
+ do {
+ UPDATE_DOWNSAMPLE_FACTOR
+
+ // Process the width that is a multiple of 16.
+ while (proc_wd < wd_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd, df_reg);
+ INIT_H_VALUES(d_window + shift + proc_wd, 7)
+
+ proc_wd += 16;
+ }
+
+ // Process the remaining width here.
+ if (wd_beyond_mul16) {
+ const __m256i dgd =
+ _mm256_loadu_si256((__m256i *)(d_current_row + proc_wd));
+ const __m256i dgd_mask = _mm256_and_si256(dgd, mask);
+ const __m256i dgd_mul_df = _mm256_mullo_epi16(dgd_mask, df_reg);
+ INIT_H_VALUES(d_window + shift + proc_wd, 7)
+ }
+ proc_ht += downsample_factor;
+ d_window += downsample_factor * d_stride;
+ d_current_row += downsample_factor * d_stride;
+ } while (proc_ht < v_end);
+
+ const __m256i sh_0 =
+ hadd_four_32_to_64_avx2(sum_h[0], sum_h[1], &sum_h[2], &sum_h[3]);
+ const __m256i sh_1 =
+ hadd_four_32_to_64_avx2(sum_h[4], sum_h[5], &sum_h[6], &sum_h[6]);
+ _mm256_storeu_si256((__m256i *)(H + (i * wiener_win2) + (wiener_win * j)),
+ sh_0);
+ _mm_storeu_si128(
+ (__m128i *)(H + (i * wiener_win2) + (wiener_win * j) + 4),
+ _mm256_castsi256_si128(sh_1));
+ _mm_storel_epi64((__m128i *)&H[(i * wiener_win2) + (wiener_win * j) + 6],
+ _mm256_extracti128_si256(sh_1, 1));
+ shift++;
+ } while (++j < wiener_win);
+ }
+
+ fill_lower_triag_elements_avx2(wiener_win2, H);
}
void av1_compute_stats_avx2(int wiener_win, const uint8_t *dgd,
- const uint8_t *src, int h_start, int h_end,
+ const uint8_t *src, int16_t *dgd_avg,
+ int16_t *src_avg, int h_start, int h_end,
int v_start, int v_end, int dgd_stride,
int src_stride, int64_t *M, int64_t *H,
int use_downsampled_wiener_stats) {
- if (wiener_win == WIENER_WIN) {
- compute_stats_win7_opt_avx2(dgd, src, h_start, h_end, v_start, v_end,
- dgd_stride, src_stride, M, H,
- use_downsampled_wiener_stats);
- } else if (wiener_win == WIENER_WIN_CHROMA) {
- compute_stats_win5_opt_avx2(dgd, src, h_start, h_end, v_start, v_end,
- dgd_stride, src_stride, M, H,
- use_downsampled_wiener_stats);
- } else {
- av1_compute_stats_c(wiener_win, dgd, src, h_start, h_end, v_start, v_end,
- dgd_stride, src_stride, M, H,
+ if (wiener_win != WIENER_WIN && wiener_win != WIENER_WIN_CHROMA) {
+ // Currently, libaom supports Wiener filter processing only for window sizes
+ // WIENER_WIN_CHROMA(5) and WIENER_WIN(7). SIMD support is not available
+ // for any other window size, so invoke the C function instead.
+ av1_compute_stats_c(wiener_win, dgd, src, dgd_avg, src_avg, h_start, h_end,
+ v_start, v_end, dgd_stride, src_stride, M, H,
use_downsampled_wiener_stats);
+ return;
+ }
+
+ const int32_t wiener_halfwin = wiener_win >> 1;
+ const uint8_t avg =
+ calc_dgd_buf_avg_avx2(dgd, h_start, h_end, v_start, v_end, dgd_stride);
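+ // 'avg' is the mean of the degraded frame over the restoration unit; both
+ // the src and dgd buffers are centred on it below before M and H are
+ // accumulated.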
+ const int32_t width = h_end - h_start;
+ const int32_t height = v_end - v_start;
+ const int32_t d_stride = (width + 2 * wiener_halfwin + 15) & ~15;
+ const int32_t s_stride = (width + 15) & ~15;
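+ // Round the stats-buffer strides up to a multiple of 16 so each row can be
+ // processed with whole 16-lane (256-bit) int16 loads.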
+
+ // Based on the speed feature 'use_downsampled_wiener_stats', the (src-avg)
+ // buffer is computed either once per downsampling interval
+ // (WIENER_STATS_DOWNSAMPLE_FACTOR rows) or for each row.
+ sub_avg_block_avx2(src + v_start * src_stride + h_start, src_stride, avg,
+ width, height, src_avg, s_stride,
+ use_downsampled_wiener_stats);
+
+ // Compute the (dgd-avg) buffer here, which is used to fill the H buffer.
+ sub_avg_block_avx2(
+ dgd + (v_start - wiener_halfwin) * dgd_stride + h_start - wiener_halfwin,
+ dgd_stride, avg, width + 2 * wiener_halfwin, height + 2 * wiener_halfwin,
+ dgd_avg, d_stride, 0);
+ if (wiener_win == WIENER_WIN) {
+ compute_stats_win7_avx2(dgd_avg, d_stride, src_avg, s_stride, width,
+ v_start, v_end, M, H, use_downsampled_wiener_stats);
+ } else if (wiener_win == WIENER_WIN_CHROMA) {
+ compute_stats_win5_avx2(dgd_avg, d_stride, src_avg, s_stride, width,
+ v_start, v_end, M, H, use_downsampled_wiener_stats);
}
}
diff --git a/av1/encoder/x86/pickrst_sse4.c b/av1/encoder/x86/pickrst_sse4.c
index cdfcac9..50db305 100644
--- a/av1/encoder/x86/pickrst_sse4.c
+++ b/av1/encoder/x86/pickrst_sse4.c
@@ -11,6 +11,7 @@
#include <assert.h>
#include <emmintrin.h>
+#include "aom_dsp/x86/mem_sse2.h"
#include "aom_dsp/x86/synonyms.h"
#include "config/av1_rtcd.h"
@@ -62,7 +63,7 @@
M_int[k][l] += D1 * X1 + D2 * X2;
const __m128i kl =
- _mm_cvtepu8_epi16(_mm_set1_epi16(*((int16_t *)(dgd_ijk + l))));
+ _mm_cvtepu8_epi16(_mm_set1_epi16(loadu_int16(dgd_ijk + l)));
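+ // loadu_int16() (from aom_dsp/x86/mem_sse2.h, included above) avoids the
+ // unaligned type-punned dereference of the old code.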
acc_stat_sse41(H_ + 0 * 8, dgd_ij + 0 * dgd_stride, shuffle, &kl);
acc_stat_sse41(H_ + 1 * 8, dgd_ij + 1 * dgd_stride, shuffle, &kl);
acc_stat_sse41(H_ + 2 * 8, dgd_ij + 2 * dgd_stride, shuffle, &kl);
@@ -265,7 +266,7 @@
// Load two u16 values from dgd as a single u32
// Then broadcast to 4x u32 slots of a 128
- const __m128i dgd_ijkl = _mm_set1_epi32(*((int *)(dgd_ijk + l)));
+ const __m128i dgd_ijkl = _mm_set1_epi32(loadu_int32(dgd_ijk + l));
// dgd_ijkl = [y x y x y x y x] as u16
acc_stat_highbd_sse41(H_ + 0 * 8, dgd_ij + 0 * dgd_stride, shuffle,
@@ -414,7 +415,7 @@
// Load two u16 values from dgd as a single u32
// then broadcast to 4x u32 slots of a 128
- const __m128i dgd_ijkl = _mm_set1_epi32(*((int *)(dgd_ijk + l)));
+ const __m128i dgd_ijkl = _mm_set1_epi32(loadu_int32(dgd_ijk + l));
// dgd_ijkl = [y x y x y x y x] as u16
acc_stat_highbd_sse41(H_ + 0 * 8, dgd_ij + 0 * dgd_stride, shuffle,
@@ -574,7 +575,7 @@
M_int[k][l] += D1 * X1 + D2 * X2;
const __m128i kl =
- _mm_cvtepu8_epi16(_mm_set1_epi16(*((int16_t *)(dgd_ijk + l))));
+ _mm_cvtepu8_epi16(_mm_set1_epi16(loadu_int16(dgd_ijk + l)));
acc_stat_sse41(H_ + 0 * 8, dgd_ij + 0 * dgd_stride, shuffle, &kl);
acc_stat_sse41(H_ + 1 * 8, dgd_ij + 1 * dgd_stride, shuffle, &kl);
acc_stat_sse41(H_ + 2 * 8, dgd_ij + 2 * dgd_stride, shuffle, &kl);
@@ -703,7 +704,8 @@
}
}
void av1_compute_stats_sse4_1(int wiener_win, const uint8_t *dgd,
- const uint8_t *src, int h_start, int h_end,
+ const uint8_t *src, int16_t *dgd_avg,
+ int16_t *src_avg, int h_start, int h_end,
int v_start, int v_end, int dgd_stride,
int src_stride, int64_t *M, int64_t *H,
int use_downsampled_wiener_stats) {
@@ -716,8 +718,8 @@
dgd_stride, src_stride, M, H,
use_downsampled_wiener_stats);
} else {
- av1_compute_stats_c(wiener_win, dgd, src, h_start, h_end, v_start, v_end,
- dgd_stride, src_stride, M, H,
+ av1_compute_stats_c(wiener_win, dgd, src, dgd_avg, src_avg, h_start, h_end,
+ v_start, v_end, dgd_stride, src_stride, M, H,
use_downsampled_wiener_stats);
}
}
diff --git a/av1/encoder/x86/reconinter_enc_sse2.c b/av1/encoder/x86/reconinter_enc_sse2.c
index d33fec7..a492483 100644
--- a/av1/encoder/x86/reconinter_enc_sse2.c
+++ b/av1/encoder/x86/reconinter_enc_sse2.c
@@ -345,20 +345,3 @@
pred += 16;
}
}
-
-void aom_comp_mask_upsampled_pred_sse2(
- MACROBLOCKD *xd, const AV1_COMMON *const cm, int mi_row, int mi_col,
- const MV *const mv, uint8_t *comp_pred, const uint8_t *pred, int width,
- int height, int subpel_x_q3, int subpel_y_q3, const uint8_t *ref,
- int ref_stride, const uint8_t *mask, int mask_stride, int invert_mask,
- int subpel_search) {
- if (subpel_x_q3 | subpel_y_q3) {
- aom_upsampled_pred(xd, cm, mi_row, mi_col, mv, comp_pred, width, height,
- subpel_x_q3, subpel_y_q3, ref, ref_stride,
- subpel_search);
- ref = comp_pred;
- ref_stride = width;
- }
- aom_comp_mask_pred(comp_pred, pred, width, height, ref, ref_stride, mask,
- mask_stride, invert_mask);
-}
diff --git a/av1/encoder/x86/temporal_filter_avx2.c b/av1/encoder/x86/temporal_filter_avx2.c
index a9c8004..752d6f3 100644
--- a/av1/encoder/x86/temporal_filter_avx2.c
+++ b/av1/encoder/x86/temporal_filter_avx2.c
@@ -30,6 +30,205 @@
{ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 10, 11, 10, 11 }
};
+#define CALC_X_GRADIENT(AC, GI, DF, out) \
+ out = _mm256_abs_epi16( \
+ _mm256_add_epi16(_mm256_add_epi16(AC, GI), _mm256_slli_epi16(DF, 1)));
+
+#define CALC_Y_GRADIENT(AC, GI, BH, out) \
+ out = _mm256_abs_epi16( \
+ _mm256_add_epi16(_mm256_sub_epi16(AC, GI), _mm256_slli_epi16(BH, 1)));
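+// The macro arguments hold per-lane row sums/differences of the 3x3
+// neighbourhood (e.g. AC is A - C for the x gradient and A + C for the y
+// gradient), so each macro yields |g_x| or |g_y| for 16 pixels at once.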
+
+double av1_estimate_noise_from_single_plane_avx2(const uint8_t *src, int height,
+ int width, int stride,
+ int edge_thresh) {
+ int count = 0;
+ int64_t accum = 0;
+ // w32 is (width - 1) rounded down to a multiple of 32.
+ const int w32 = (width - 1) & ~0x1f;
+ const __m256i zero = _mm256_setzero_si256();
+ const __m256i edge_threshold = _mm256_set1_epi16(edge_thresh);
+ __m256i num_accumulator = zero;
+ __m256i sum_accumulator = zero;
+
+ // A | B | C
+ // D | E | F
+ // G | H | I
+ // g_x = (A - C) + (G - I) + 2*(D - F)
+ // g_y = (A + C) - (G + I) + 2*(B - H)
+ // v = 4*E - 2*(D+F+B+H) + (A+C+G+I)
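+ // Pixels are widened to 16 bits by unpacking against zero (two 16-lane
+ // groups per 32-pixel strip), so all gradient arithmetic below is done on
+ // 16-bit lanes.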
+
+ // Process the width in chunks of 32 pixels here.
+ for (int w = 1; w < w32; w += 32) {
+ int h = 1;
+ const int start_idx = h * stride + w;
+ const int stride_0 = start_idx - stride;
+
+ __m256i num_accum_row_lvl = zero;
+ const __m256i A = _mm256_loadu_si256((__m256i *)(&src[stride_0 - 1]));
+ const __m256i C = _mm256_loadu_si256((__m256i *)(&src[stride_0 + 1]));
+ const __m256i D = _mm256_loadu_si256((__m256i *)(&src[start_idx - 1]));
+ const __m256i F = _mm256_loadu_si256((__m256i *)(&src[start_idx + 1]));
+ __m256i B = _mm256_loadu_si256((__m256i *)(&src[stride_0]));
+ __m256i E = _mm256_loadu_si256((__m256i *)(&src[start_idx]));
+
+ const __m256i A_lo = _mm256_unpacklo_epi8(A, zero);
+ const __m256i A_hi = _mm256_unpackhi_epi8(A, zero);
+ const __m256i C_lo = _mm256_unpacklo_epi8(C, zero);
+ const __m256i C_hi = _mm256_unpackhi_epi8(C, zero);
+ const __m256i D_lo = _mm256_unpacklo_epi8(D, zero);
+ const __m256i D_hi = _mm256_unpackhi_epi8(D, zero);
+ const __m256i F_lo = _mm256_unpacklo_epi8(F, zero);
+ const __m256i F_hi = _mm256_unpackhi_epi8(F, zero);
+
+ __m256i sub_AC_lo = _mm256_sub_epi16(A_lo, C_lo);
+ __m256i sub_AC_hi = _mm256_sub_epi16(A_hi, C_hi);
+ __m256i sum_AC_lo = _mm256_add_epi16(A_lo, C_lo);
+ __m256i sum_AC_hi = _mm256_add_epi16(A_hi, C_hi);
+ __m256i sub_DF_lo = _mm256_sub_epi16(D_lo, F_lo);
+ __m256i sub_DF_hi = _mm256_sub_epi16(D_hi, F_hi);
+ __m256i sum_DF_lo = _mm256_add_epi16(D_lo, F_lo);
+ __m256i sum_DF_hi = _mm256_add_epi16(D_hi, F_hi);
+
+ for (; h < height - 1; h++) {
+ __m256i sum_GI_lo, sub_GI_lo, sum_GI_hi, sub_GI_hi, gx_lo, gy_lo, gx_hi,
+ gy_hi;
+ const int k = h * stride + w;
+ const __m256i G = _mm256_loadu_si256((__m256i *)(&src[k + stride - 1]));
+ const __m256i H = _mm256_loadu_si256((__m256i *)(&src[k + stride]));
+ const __m256i I = _mm256_loadu_si256((__m256i *)(&src[k + stride + 1]));
+
+ const __m256i B_lo = _mm256_unpacklo_epi8(B, zero);
+ const __m256i B_hi = _mm256_unpackhi_epi8(B, zero);
+ const __m256i G_lo = _mm256_unpacklo_epi8(G, zero);
+ const __m256i G_hi = _mm256_unpackhi_epi8(G, zero);
+ const __m256i I_lo = _mm256_unpacklo_epi8(I, zero);
+ const __m256i I_hi = _mm256_unpackhi_epi8(I, zero);
+ const __m256i H_lo = _mm256_unpacklo_epi8(H, zero);
+ const __m256i H_hi = _mm256_unpackhi_epi8(H, zero);
+
+ sub_GI_lo = _mm256_sub_epi16(G_lo, I_lo);
+ sub_GI_hi = _mm256_sub_epi16(G_hi, I_hi);
+ sum_GI_lo = _mm256_add_epi16(G_lo, I_lo);
+ sum_GI_hi = _mm256_add_epi16(G_hi, I_hi);
+ const __m256i sub_BH_lo = _mm256_sub_epi16(B_lo, H_lo);
+ const __m256i sub_BH_hi = _mm256_sub_epi16(B_hi, H_hi);
+
+ CALC_X_GRADIENT(sub_AC_lo, sub_GI_lo, sub_DF_lo, gx_lo)
+ CALC_Y_GRADIENT(sum_AC_lo, sum_GI_lo, sub_BH_lo, gy_lo)
+
+ const __m256i ga_lo = _mm256_add_epi16(gx_lo, gy_lo);
+
+ CALC_X_GRADIENT(sub_AC_hi, sub_GI_hi, sub_DF_hi, gx_hi)
+ CALC_Y_GRADIENT(sum_AC_hi, sum_GI_hi, sub_BH_hi, gy_hi)
+
+ const __m256i ga_hi = _mm256_add_epi16(gx_hi, gy_hi);
+
+ __m256i cmp_lo = _mm256_cmpgt_epi16(edge_threshold, ga_lo);
+ __m256i cmp_hi = _mm256_cmpgt_epi16(edge_threshold, ga_hi);
+ const __m256i comp_reg = _mm256_add_epi16(cmp_lo, cmp_hi);
+
+ // v = 4*E -2*(D+F+B+H) + (A+C+G+I)
+ if (_mm256_movemask_epi8(comp_reg) != 0) {
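+ // At least one lane in this strip is smooth (gradient sum below the edge
+ // threshold): compute the Laplacian for all lanes, then zero out the edge
+ // lanes using the 0/1 comparison masks.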
+ const __m256i sum_BH_lo = _mm256_add_epi16(B_lo, H_lo);
+ const __m256i sum_BH_hi = _mm256_add_epi16(B_hi, H_hi);
+
+ // 2*(D+F+B+H)
+ const __m256i sum_DFBH_lo =
+ _mm256_slli_epi16(_mm256_add_epi16(sum_DF_lo, sum_BH_lo), 1);
+ // (A+C+G+I)
+ const __m256i sum_ACGI_lo = _mm256_add_epi16(sum_AC_lo, sum_GI_lo);
+ const __m256i sum_DFBH_hi =
+ _mm256_slli_epi16(_mm256_add_epi16(sum_DF_hi, sum_BH_hi), 1);
+ const __m256i sum_ACGI_hi = _mm256_add_epi16(sum_AC_hi, sum_GI_hi);
+
+ // Convert E register values from 8bit to 16bit
+ const __m256i E_lo = _mm256_unpacklo_epi8(E, zero);
+ const __m256i E_hi = _mm256_unpackhi_epi8(E, zero);
+
+ // 4*E - 2*(D+F+B+H)+ (A+C+G+I)
+ const __m256i var_lo_0 = _mm256_abs_epi16(_mm256_add_epi16(
+ _mm256_sub_epi16(_mm256_slli_epi16(E_lo, 2), sum_DFBH_lo),
+ sum_ACGI_lo));
+ const __m256i var_hi_0 = _mm256_abs_epi16(_mm256_add_epi16(
+ _mm256_sub_epi16(_mm256_slli_epi16(E_hi, 2), sum_DFBH_hi),
+ sum_ACGI_hi));
+ cmp_lo = _mm256_srli_epi16(cmp_lo, 15);
+ cmp_hi = _mm256_srli_epi16(cmp_hi, 15);
+ const __m256i var_lo = _mm256_mullo_epi16(var_lo_0, cmp_lo);
+ const __m256i var_hi = _mm256_mullo_epi16(var_hi_0, cmp_hi);
+
+ num_accum_row_lvl = _mm256_add_epi16(num_accum_row_lvl, cmp_lo);
+ num_accum_row_lvl = _mm256_add_epi16(num_accum_row_lvl, cmp_hi);
+
+ sum_accumulator = _mm256_add_epi32(sum_accumulator,
+ _mm256_unpacklo_epi16(var_lo, zero));
+ sum_accumulator = _mm256_add_epi32(sum_accumulator,
+ _mm256_unpackhi_epi16(var_lo, zero));
+ sum_accumulator = _mm256_add_epi32(sum_accumulator,
+ _mm256_unpacklo_epi16(var_hi, zero));
+ sum_accumulator = _mm256_add_epi32(sum_accumulator,
+ _mm256_unpackhi_epi16(var_hi, zero));
+ }
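+ // Slide the 3x3 window down one row: the middle row's sums/differences
+ // become the new top row's, and the bottom row's become the new middle
+ // row's.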
+ sub_AC_lo = sub_DF_lo;
+ sub_AC_hi = sub_DF_hi;
+ sub_DF_lo = sub_GI_lo;
+ sub_DF_hi = sub_GI_hi;
+ sum_AC_lo = sum_DF_lo;
+ sum_AC_hi = sum_DF_hi;
+ sum_DF_lo = sum_GI_lo;
+ sum_DF_hi = sum_GI_hi;
+ B = E;
+ E = H;
+ }
+ const __m256i num_0 = _mm256_unpacklo_epi16(num_accum_row_lvl, zero);
+ const __m256i num_1 = _mm256_unpackhi_epi16(num_accum_row_lvl, zero);
+ num_accumulator =
+ _mm256_add_epi32(num_accumulator, _mm256_add_epi32(num_0, num_1));
+ }
+
+ // Process the remaining width here.
+ for (int h = 1; h < height - 1; ++h) {
+ for (int w = w32 + 1; w < width - 1; ++w) {
+ const int k = h * stride + w;
+
+ // Compute Sobel gradients.
+ const int g_x = (src[k - stride - 1] - src[k - stride + 1]) +
+ (src[k + stride - 1] - src[k + stride + 1]) +
+ 2 * (src[k - 1] - src[k + 1]);
+ const int g_y = (src[k - stride - 1] - src[k + stride - 1]) +
+ (src[k - stride + 1] - src[k + stride + 1]) +
+ 2 * (src[k - stride] - src[k + stride]);
+ const int ga = abs(g_x) + abs(g_y);
+
+ if (ga < edge_thresh) {
+ // Find Laplacian
+ const int v =
+ 4 * src[k] -
+ 2 * (src[k - 1] + src[k + 1] + src[k - stride] + src[k + stride]) +
+ (src[k - stride - 1] + src[k - stride + 1] + src[k + stride - 1] +
+ src[k + stride + 1]);
+ accum += abs(v);
+ ++count;
+ }
+ }
+ }
+
+ // s0 s1 n0 n1 s2 s3 n2 n3
+ __m256i sum_avx = _mm256_hadd_epi32(sum_accumulator, num_accumulator);
+ __m128i sum_avx_lo = _mm256_castsi256_si128(sum_avx);
+ __m128i sum_avx_hi = _mm256_extractf128_si256(sum_avx, 1);
+ // s0+s2 s1+s3 n0+n2 n1+n3
+ __m128i sum_avx_1 = _mm_add_epi32(sum_avx_lo, sum_avx_hi);
+ // s0+s2+s1+s3 n0+n2+n1+n3
+ __m128i result = _mm_add_epi32(_mm_srli_si128(sum_avx_1, 4), sum_avx_1);
+
+ accum += _mm_cvtsi128_si32(result);
+ count += _mm_extract_epi32(result, 2);
+
+ // If very few smooth pels, return -1 since the estimate is unreliable.
+ return (count < 16) ? -1.0 : (double)accum / (6 * count) * SQRT_PI_BY_2;
+}
+
static AOM_FORCE_INLINE void get_squared_error_16x16_avx2(
const uint8_t *frame1, const unsigned int stride, const uint8_t *frame2,
const unsigned int stride2, const int block_width, const int block_height,
@@ -127,13 +326,31 @@
return _mm_extract_epi32(v128a, 0);
}
+// AVX2 implementation of approx_exp()
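+// The constants implement Schraudolph's fast exp: scaling y by
+// (1 << 23) / ln(2) places it in the exponent field of an IEEE-754 float,
+// adding the bias B completes the bit pattern, and C is an empirically
+// chosen correction term. A scalar sketch of the same idea (illustrative
+// only):
+//   int32_t i = (int32_t)(y * A) + (B * (1 << 23) - C);
+//   float r;
+//   memcpy(&r, &i, sizeof(r));  // r ~= exp(y)
+// This mirrors the scalar approx_exp() used by the SSE2 path.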
+static AOM_INLINE __m256 approx_exp_avx2(__m256 y) {
+#define A ((1 << 23) / 0.69314718056f) // (1 << 23) / ln(2)
+#define B \
+ 127 // Offset for the exponent according to the IEEE floating-point standard.
+#define C 60801 // Magic number that controls the accuracy of the approximation
+ const __m256 multiplier = _mm256_set1_ps(A);
+ const __m256i offset = _mm256_set1_epi32(B * (1 << 23) - C);
+
+ y = _mm256_mul_ps(y, multiplier);
+ y = _mm256_castsi256_ps(_mm256_add_epi32(_mm256_cvttps_epi32(y), offset));
+ return y;
+#undef A
+#undef B
+#undef C
+}
+
static void apply_temporal_filter(
const uint8_t *frame1, const unsigned int stride, const uint8_t *frame2,
const unsigned int stride2, const int block_width, const int block_height,
const int *subblock_mses, unsigned int *accumulator, uint16_t *count,
uint16_t *frame_sse, uint32_t *luma_sse_sum,
const double inv_num_ref_pixels, const double decay_factor,
- const double inv_factor, const double weight_factor, double *d_factor) {
+ const double inv_factor, const double weight_factor, double *d_factor,
+ int tf_wgt_calc_lvl) {
assert(((block_width == 16) || (block_width == 32)) &&
((block_height == 16) || (block_height == 32)));
@@ -192,25 +409,140 @@
}
}
- for (int i = 0, k = 0; i < block_height; i++) {
- for (int j = 0; j < block_width; j++, k++) {
- const int pixel_value = frame2[i * stride2 + j];
- uint32_t diff_sse = acc_5x5_sse[i][j] + luma_sse_sum[i * BW + j];
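+ // Hoist the per-subblock scaling out of the pixel loops: the scaled MSE
+ // and the d_factor * decay_factor product are invariant across all pixels
+ // of a subblock.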
+ double subblock_mses_scaled[4];
+ double d_factor_decayed[4];
+ for (int idx = 0; idx < 4; idx++) {
+ subblock_mses_scaled[idx] = subblock_mses[idx] * inv_factor;
+ d_factor_decayed[idx] = d_factor[idx] * decay_factor;
+ }
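+ // tf_wgt_calc_lvl == 0 keeps the exact exp() based weight computation;
+ // otherwise the weights are computed with the vectorized approx_exp()
+ // path below.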
+ if (tf_wgt_calc_lvl == 0) {
+ for (int i = 0, k = 0; i < block_height; i++) {
+ const int y_blk_raster_offset = (i >= block_height / 2) * 2;
+ for (int j = 0; j < block_width; j++, k++) {
+ const int pixel_value = frame2[i * stride2 + j];
+ uint32_t diff_sse = acc_5x5_sse[i][j] + luma_sse_sum[i * BW + j];
- const double window_error = diff_sse * inv_num_ref_pixels;
- const int subblock_idx =
- (i >= block_height / 2) * 2 + (j >= block_width / 2);
- const double block_error = (double)subblock_mses[subblock_idx];
- const double combined_error =
- weight_factor * window_error + block_error * inv_factor;
+ const double window_error = diff_sse * inv_num_ref_pixels;
+ const int subblock_idx = y_blk_raster_offset + (j >= block_width / 2);
+ const double combined_error =
+ weight_factor * window_error + subblock_mses_scaled[subblock_idx];
- double scaled_error =
- combined_error * d_factor[subblock_idx] * decay_factor;
- scaled_error = AOMMIN(scaled_error, 7);
- const int weight = (int)(exp(-scaled_error) * TF_WEIGHT_SCALE);
+ double scaled_error = combined_error * d_factor_decayed[subblock_idx];
+ scaled_error = AOMMIN(scaled_error, 7);
+ const int weight = (int)(exp(-scaled_error) * TF_WEIGHT_SCALE);
- count[k] += weight;
- accumulator[k] += weight * pixel_value;
+ count[k] += weight;
+ accumulator[k] += weight * pixel_value;
+ }
+ }
+ } else {
+ __m256d subblock_mses_reg[4];
+ __m256d d_factor_mul_n_decay_qr_invs[4];
+ const __m256 zero = _mm256_set1_ps(0.0f);
+ const __m256 point_five = _mm256_set1_ps(0.5f);
+ const __m256 seven = _mm256_set1_ps(7.0f);
+ const __m256d inv_num_ref_pixel_256bit = _mm256_set1_pd(inv_num_ref_pixels);
+ const __m256d weight_factor_256bit = _mm256_set1_pd(weight_factor);
+ const __m256 tf_weight_scale = _mm256_set1_ps((float)TF_WEIGHT_SCALE);
+ // Maintain registers to hold mse and d_factor at subblock level.
+ subblock_mses_reg[0] = _mm256_set1_pd(subblock_mses_scaled[0]);
+ subblock_mses_reg[1] = _mm256_set1_pd(subblock_mses_scaled[1]);
+ subblock_mses_reg[2] = _mm256_set1_pd(subblock_mses_scaled[2]);
+ subblock_mses_reg[3] = _mm256_set1_pd(subblock_mses_scaled[3]);
+ d_factor_mul_n_decay_qr_invs[0] = _mm256_set1_pd(d_factor_decayed[0]);
+ d_factor_mul_n_decay_qr_invs[1] = _mm256_set1_pd(d_factor_decayed[1]);
+ d_factor_mul_n_decay_qr_invs[2] = _mm256_set1_pd(d_factor_decayed[2]);
+ d_factor_mul_n_decay_qr_invs[3] = _mm256_set1_pd(d_factor_decayed[3]);
+
+ for (int i = 0; i < block_height; i++) {
+ const int y_blk_raster_offset = (i >= block_height / 2) * 2;
+ uint32_t *luma_sse_sum_temp = luma_sse_sum + i * BW;
+ for (int j = 0; j < block_width; j += 8) {
+ const __m256i acc_sse =
+ _mm256_lddqu_si256((__m256i *)(acc_5x5_sse[i] + j));
+ const __m256i luma_sse =
+ _mm256_lddqu_si256((__m256i *)((luma_sse_sum_temp + j)));
+
+ // uint32_t diff_sse = acc_5x5_sse[i][j] + luma_sse_sum[i * BW + j];
+ const __m256i diff_sse = _mm256_add_epi32(acc_sse, luma_sse);
+
+ const __m256d diff_sse_pd_1 =
+ _mm256_cvtepi32_pd(_mm256_castsi256_si128(diff_sse));
+ const __m256d diff_sse_pd_2 =
+ _mm256_cvtepi32_pd(_mm256_extracti128_si256(diff_sse, 1));
+
+ // const double window_error = diff_sse * inv_num_ref_pixels;
+ const __m256d window_error_1 =
+ _mm256_mul_pd(diff_sse_pd_1, inv_num_ref_pixel_256bit);
+ const __m256d window_error_2 =
+ _mm256_mul_pd(diff_sse_pd_2, inv_num_ref_pixel_256bit);
+
+ // const int subblock_idx = y_blk_raster_offset + (j >= block_width /
+ // 2);
+ const int subblock_idx = y_blk_raster_offset + (j >= block_width / 2);
+ const __m256d blk_error = subblock_mses_reg[subblock_idx];
+
+ // const double combined_error =
+ // weight_factor *window_error + subblock_mses_scaled[subblock_idx];
+ const __m256d combined_error_1 = _mm256_add_pd(
+ _mm256_mul_pd(window_error_1, weight_factor_256bit), blk_error);
+
+ const __m256d combined_error_2 = _mm256_add_pd(
+ _mm256_mul_pd(window_error_2, weight_factor_256bit), blk_error);
+
+ // d_factor_decayed[subblock_idx]
+ const __m256d d_fact_mul_n_decay =
+ d_factor_mul_n_decay_qr_invs[subblock_idx];
+
+ // double scaled_error = combined_error *
+ // d_factor_decayed[subblock_idx];
+ const __m256d scaled_error_1 =
+ _mm256_mul_pd(combined_error_1, d_fact_mul_n_decay);
+ const __m256d scaled_error_2 =
+ _mm256_mul_pd(combined_error_2, d_fact_mul_n_decay);
+
+ const __m128 scaled_error_ps_1 = _mm256_cvtpd_ps(scaled_error_1);
+ const __m128 scaled_error_ps_2 = _mm256_cvtpd_ps(scaled_error_2);
+
+ const __m256 scaled_error_ps = _mm256_insertf128_ps(
+ _mm256_castps128_ps256(scaled_error_ps_1), scaled_error_ps_2, 0x1);
+
+ // scaled_error = AOMMIN(scaled_error, 7);
+ const __m256 scaled_diff_ps = _mm256_min_ps(scaled_error_ps, seven);
+ const __m256 minus_scaled_diff_ps = _mm256_sub_ps(zero, scaled_diff_ps);
+ // const int weight =
+ //(int)(approx_exp((float)-scaled_error) * TF_WEIGHT_SCALE + 0.5f);
+ const __m256 exp_result = approx_exp_avx2(minus_scaled_diff_ps);
+ const __m256 scale_weight_exp_result =
+ _mm256_mul_ps(exp_result, tf_weight_scale);
+ const __m256 round_result =
+ _mm256_add_ps(scale_weight_exp_result, point_five);
+ __m256i weights_in_32bit = _mm256_cvttps_epi32(round_result);
+
+ __m128i weights_in_16bit =
+ _mm_packus_epi32(_mm256_castsi256_si128(weights_in_32bit),
+ _mm256_extractf128_si256(weights_in_32bit, 0x1));
+
+ // count[k] += weight;
+ // accumulator[k] += weight * pixel_value;
+ const int stride_idx = i * stride2 + j;
+ const __m128i count_array =
+ _mm_loadu_si128((__m128i *)(count + stride_idx));
+ _mm_storeu_si128((__m128i *)(count + stride_idx),
+ _mm_add_epi16(count_array, weights_in_16bit));
+
+ const __m256i accumulator_array =
+ _mm256_loadu_si256((__m256i *)(accumulator + stride_idx));
+ const __m128i pred_values =
+ _mm_loadl_epi64((__m128i *)(frame2 + stride_idx));
+
+ const __m256i pred_values_u32 = _mm256_cvtepu8_epi32(pred_values);
+ const __m256i mull_frame2_weight_u32 =
+ _mm256_mullo_epi32(pred_values_u32, weights_in_32bit);
+ _mm256_storeu_si256(
+ (__m256i *)(accumulator + stride_idx),
+ _mm256_add_epi32(accumulator_array, mull_frame2_weight_u32));
+ }
}
}
}
@@ -220,7 +552,8 @@
const BLOCK_SIZE block_size, const int mb_row, const int mb_col,
const int num_planes, const double *noise_levels, const MV *subblock_mvs,
const int *subblock_mses, const int q_factor, const int filter_strength,
- const uint8_t *pred, uint32_t *accum, uint16_t *count) {
+ int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum,
+ uint16_t *count) {
const int is_high_bitdepth = frame_to_filter->flags & YV12_FLAG_HIGHBITDEPTH;
assert(block_size == BLOCK_32X32 && "Only support 32x32 block with avx2!");
assert(TF_WINDOW_LENGTH == 5 && "Only support window length 5 with avx2!");
@@ -308,7 +641,7 @@
plane_w, plane_h, subblock_mses, accum + plane_offset,
count + plane_offset, frame_sse, luma_sse_sum,
inv_num_ref_pixels, decay_factor, inv_factor,
- weight_factor, d_factor);
+ weight_factor, d_factor, tf_wgt_calc_lvl);
plane_offset += plane_h * plane_w;
}
}
diff --git a/av1/encoder/x86/temporal_filter_sse2.c b/av1/encoder/x86/temporal_filter_sse2.c
index 8be7164..842d3b1 100644
--- a/av1/encoder/x86/temporal_filter_sse2.c
+++ b/av1/encoder/x86/temporal_filter_sse2.c
@@ -13,6 +13,7 @@
#include <emmintrin.h>
#include "config/av1_rtcd.h"
+#include "aom_dsp/mathutils.h"
#include "av1/encoder/encoder.h"
#include "av1/encoder/temporal_filter.h"
@@ -107,7 +108,8 @@
const int *subblock_mses, unsigned int *accumulator, uint16_t *count,
uint16_t *frame_sse, uint32_t *luma_sse_sum,
const double inv_num_ref_pixels, const double decay_factor,
- const double inv_factor, const double weight_factor, double *d_factor) {
+ const double inv_factor, const double weight_factor, double *d_factor,
+ int tf_wgt_calc_lvl) {
assert(((block_width == 16) || (block_width == 32)) &&
((block_height == 16) || (block_height == 32)));
@@ -168,25 +170,52 @@
}
}
- for (int i = 0, k = 0; i < block_height; i++) {
- for (int j = 0; j < block_width; j++, k++) {
- const int pixel_value = frame2[i * stride2 + j];
- uint32_t diff_sse = acc_5x5_sse[i][j] + luma_sse_sum[i * BW + j];
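+ // As in the AVX2 path, hoist the per-subblock scaling out of the pixel
+ // loops since it is invariant across all pixels of a subblock.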
+ double subblock_mses_scaled[4];
+ double d_factor_decayed[4];
+ for (int idx = 0; idx < 4; idx++) {
+ subblock_mses_scaled[idx] = subblock_mses[idx] * inv_factor;
+ d_factor_decayed[idx] = d_factor[idx] * decay_factor;
+ }
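+ // tf_wgt_calc_lvl == 0 keeps the exact exp() based weights; otherwise use
+ // the faster approx_exp() with iroundpf() rounding.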
+ if (tf_wgt_calc_lvl == 0) {
+ for (int i = 0, k = 0; i < block_height; i++) {
+ const int y_blk_raster_offset = (i >= block_height / 2) * 2;
+ for (int j = 0; j < block_width; j++, k++) {
+ const int pixel_value = frame2[i * stride2 + j];
+ uint32_t diff_sse = acc_5x5_sse[i][j] + luma_sse_sum[i * BW + j];
- const double window_error = diff_sse * inv_num_ref_pixels;
- const int subblock_idx =
- (i >= block_height / 2) * 2 + (j >= block_width / 2);
- const double block_error = (double)subblock_mses[subblock_idx];
- const double combined_error =
- weight_factor * window_error + block_error * inv_factor;
+ const double window_error = diff_sse * inv_num_ref_pixels;
+ const int subblock_idx = y_blk_raster_offset + (j >= block_width / 2);
+ const double combined_error =
+ weight_factor * window_error + subblock_mses_scaled[subblock_idx];
- double scaled_error =
- combined_error * d_factor[subblock_idx] * decay_factor;
- scaled_error = AOMMIN(scaled_error, 7);
- const int weight = (int)(exp(-scaled_error) * TF_WEIGHT_SCALE);
+ double scaled_error = combined_error * d_factor_decayed[subblock_idx];
+ scaled_error = AOMMIN(scaled_error, 7);
+ const int weight = (int)(exp(-scaled_error) * TF_WEIGHT_SCALE);
- count[k] += weight;
- accumulator[k] += weight * pixel_value;
+ count[k] += weight;
+ accumulator[k] += weight * pixel_value;
+ }
+ }
+ } else {
+ for (int i = 0, k = 0; i < block_height; i++) {
+ const int y_blk_raster_offset = (i >= block_height / 2) * 2;
+ for (int j = 0; j < block_width; j++, k++) {
+ const int pixel_value = frame2[i * stride2 + j];
+ uint32_t diff_sse = acc_5x5_sse[i][j] + luma_sse_sum[i * BW + j];
+
+ const double window_error = diff_sse * inv_num_ref_pixels;
+ const int subblock_idx = y_blk_raster_offset + (j >= block_width / 2);
+ const double combined_error =
+ weight_factor * window_error + subblock_mses_scaled[subblock_idx];
+
+ double scaled_error = combined_error * d_factor_decayed[subblock_idx];
+ scaled_error = AOMMIN(scaled_error, 7);
+ const float fweight =
+ approx_exp((float)-scaled_error) * TF_WEIGHT_SCALE;
+ const int weight = iroundpf(fweight);
+ count[k] += weight;
+ accumulator[k] += weight * pixel_value;
+ }
}
}
}
@@ -196,7 +225,8 @@
const BLOCK_SIZE block_size, const int mb_row, const int mb_col,
const int num_planes, const double *noise_levels, const MV *subblock_mvs,
const int *subblock_mses, const int q_factor, const int filter_strength,
- const uint8_t *pred, uint32_t *accum, uint16_t *count) {
+ int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum,
+ uint16_t *count) {
const int is_high_bitdepth = frame_to_filter->flags & YV12_FLAG_HIGHBITDEPTH;
assert(block_size == BLOCK_32X32 && "Only support 32x32 block with sse2!");
assert(TF_WINDOW_LENGTH == 5 && "Only support window length 5 with sse2!");
@@ -284,7 +314,7 @@
plane_w, plane_h, subblock_mses, accum + plane_offset,
count + plane_offset, frame_sse, luma_sse_sum,
inv_num_ref_pixels, decay_factor, inv_factor,
- weight_factor, d_factor);
+ weight_factor, d_factor, tf_wgt_calc_lvl);
plane_offset += plane_h * plane_w;
}
}
diff --git a/av1/encoder/x86/wedge_utils_avx2.c b/av1/encoder/x86/wedge_utils_avx2.c
index bbc62d5..9cde860 100644
--- a/av1/encoder/x86/wedge_utils_avx2.c
+++ b/av1/encoder/x86/wedge_utils_avx2.c
@@ -72,7 +72,7 @@
__m128i v_acc_q_0 = _mm256_castsi256_si128(v_acc0_q);
__m128i v_acc_q_1 = _mm256_extracti128_si256(v_acc0_q, 1);
v_acc_q_0 = _mm_add_epi64(v_acc_q_0, v_acc_q_1);
-#if ARCH_X86_64
+#if AOM_ARCH_X86_64
csse = (uint64_t)_mm_extract_epi64(v_acc_q_0, 0);
#else
xx_storel_64(&csse, v_acc_q_0);
@@ -141,7 +141,7 @@
__m128i v_acc_q_1 = _mm256_extracti128_si256(v_acc_q, 1);
v_acc_q_0 = _mm_add_epi64(v_acc_q_0, v_acc_q_1);
-#if ARCH_X86_64
+#if AOM_ARCH_X86_64
acc = _mm_extract_epi64(v_acc_q_0, 0);
#else
xx_storel_64(&acc, v_acc_q_0);
diff --git a/av1/encoder/x86/wedge_utils_sse2.c b/av1/encoder/x86/wedge_utils_sse2.c
index e665b2e..d7ac222 100644
--- a/av1/encoder/x86/wedge_utils_sse2.c
+++ b/av1/encoder/x86/wedge_utils_sse2.c
@@ -85,7 +85,7 @@
v_acc0_q = _mm_add_epi64(v_acc0_q, _mm_srli_si128(v_acc0_q, 8));
-#if ARCH_X86_64
+#if AOM_ARCH_X86_64
csse = (uint64_t)_mm_cvtsi128_si64(v_acc0_q);
#else
xx_storel_64(&csse, v_acc0_q);
@@ -174,7 +174,7 @@
v_acc_q = _mm_add_epi64(v_acc_q, _mm_srli_si128(v_acc_q, 8));
-#if ARCH_X86_64
+#if AOM_ARCH_X86_64
acc = _mm_cvtsi128_si64(v_acc_q);
#else
xx_storel_64(&acc, v_acc_q);
diff --git a/av1/qmode_rc/ducky_encode.cc b/av1/qmode_rc/ducky_encode.cc
deleted file mode 100644
index bd4b766..0000000
--- a/av1/qmode_rc/ducky_encode.cc
+++ /dev/null
@@ -1,718 +0,0 @@
-/*
- * Copyright (c) 2022, Alliance for Open Media. All rights reserved
- *
- * This source code is subject to the terms of the BSD 2 Clause License and
- * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
- * was not distributed with this source code in the LICENSE file, you can
- * obtain it at www.aomedia.org/license/software. If the Alliance for Open
- * Media Patent License 1.0 was not distributed with this source code in the
- * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
- */
-#include <stdlib.h>
-#include <string.h>
-#include <algorithm>
-#include <memory>
-#include <numeric>
-#include <vector>
-
-#include "av1/common/enums.h"
-#include "av1/encoder/rd.h"
-#include "config/aom_config.h"
-
-#include "aom/aom_encoder.h"
-
-#include "av1/av1_cx_iface.h"
-#include "av1/av1_iface_common.h"
-#include "av1/encoder/encoder.h"
-#include "av1/encoder/ethread.h"
-#include "av1/encoder/firstpass.h"
-#include "av1/encoder/temporal_filter.h"
-#include "av1/qmode_rc/ducky_encode.h"
-
-#include "common/tools_common.h"
-
-namespace aom {
-struct EncoderResource {
- STATS_BUFFER_CTX *stats_buf_ctx;
- FIRSTPASS_STATS *stats_buffer;
- aom_image_t img;
- AV1_PRIMARY *ppi;
- int lookahead_push_count;
- int encode_frame_count; // Use in second pass only
-};
-
-class DuckyEncode::EncodeImpl {
- public:
- VideoInfo video_info;
- int g_usage;
- int max_ref_frames;
- int speed;
- int base_qindex;
- BLOCK_SIZE sb_size;
- enum aom_rc_mode rc_end_usage;
- aom_rational64_t timestamp_ratio;
- std::vector<FIRSTPASS_STATS> stats_list;
- EncoderResource enc_resource;
- struct AvxInputContext input;
-};
-
-DuckyEncode::DuckyEncode(const VideoInfo &video_info, BLOCK_SIZE sb_size,
- int max_ref_frames, int speed, int base_qindex) {
- impl_ptr_ = std::unique_ptr<EncodeImpl>(new EncodeImpl());
- impl_ptr_->video_info = video_info;
- impl_ptr_->g_usage = GOOD;
- impl_ptr_->max_ref_frames = max_ref_frames;
- impl_ptr_->speed = speed;
- impl_ptr_->base_qindex = base_qindex;
- impl_ptr_->sb_size = sb_size;
- impl_ptr_->rc_end_usage = AOM_Q;
- // TODO(angiebird): Set timestamp_ratio properly
- // timestamp_ratio.den = cfg->g_timebase.den;
- // timestamp_ratio.num = (int64_t)cfg->g_timebase.num * TICKS_PER_SEC;
- impl_ptr_->timestamp_ratio = { 1, 1 };
- // TODO(angiebird): How to set ptsvol and duration?
- impl_ptr_->input.filename = impl_ptr_->video_info.file_path.c_str();
-}
-
-DuckyEncode::~DuckyEncode() {}
-
-static AV1EncoderConfig GetEncoderConfig(const VideoInfo &video_info,
- int g_usage, aom_enc_pass pass) {
- const aom_codec_iface *codec = aom_codec_av1_cx();
- aom_codec_enc_cfg_t cfg;
- aom_codec_enc_config_default(codec, &cfg, g_usage);
- cfg.g_w = video_info.frame_width;
- cfg.g_h = video_info.frame_height;
- cfg.g_pass = pass;
- // g_timebase is the inverse of frame_rate
- cfg.g_timebase.num = video_info.frame_rate.den;
- cfg.g_timebase.den = video_info.frame_rate.num;
- if (pass == AOM_RC_SECOND_PASS) {
- cfg.rc_twopass_stats_in.sz =
- (video_info.frame_count + 1) * sizeof(FIRSTPASS_STATS);
- }
- AV1EncoderConfig oxcf = av1_get_encoder_config(&cfg);
- // TODO(angiebird): Why didn't we init use_highbitdepth in
- // av1_get_encoder_config()?
- oxcf.use_highbitdepth = 0;
-
- // TODO(jingning): Change this to 35 when the baseline rate control
- // logic is in place.
- // Force maximum look ahead buffer to be 19. This will disable the use
- // of maximum 32 GOP length.
- oxcf.gf_cfg.lag_in_frames = 19;
-
- return oxcf;
-}
-
-static STATS_BUFFER_CTX *CreateStatsBufferCtx(int frame_count,
- FIRSTPASS_STATS **stats_buffer) {
- STATS_BUFFER_CTX *stats_buf_ctx = new STATS_BUFFER_CTX;
- // +2 is for total_stats and total_left_stats
- *stats_buffer = new FIRSTPASS_STATS[frame_count + 2];
- stats_buf_ctx->stats_in_start = *stats_buffer;
- stats_buf_ctx->stats_in_end = stats_buf_ctx->stats_in_start;
- stats_buf_ctx->stats_in_buf_end = stats_buf_ctx->stats_in_start + frame_count;
- stats_buf_ctx->total_stats = stats_buf_ctx->stats_in_buf_end;
- stats_buf_ctx->total_left_stats =
- stats_buf_ctx->stats_in_start + frame_count + 1;
- for (FIRSTPASS_STATS *buffer = stats_buf_ctx->stats_in_start;
- buffer <= stats_buf_ctx->total_left_stats; ++buffer) {
- av1_twopass_zero_stats(buffer);
- }
- return stats_buf_ctx;
-}
-
-static void DestroyStatsBufferCtx(STATS_BUFFER_CTX **stats_buf_context,
- FIRSTPASS_STATS **stats_buffer) {
- (*stats_buf_context)->stats_in_start = nullptr;
- (*stats_buf_context)->stats_in_end = nullptr;
- (*stats_buf_context)->stats_in_buf_end = nullptr;
- (*stats_buf_context)->total_stats = nullptr;
- (*stats_buf_context)->total_left_stats = nullptr;
- delete *stats_buf_context;
- *stats_buf_context = nullptr;
- delete[](*stats_buffer);
- *stats_buffer = nullptr;
-}
-
-static FIRSTPASS_STATS ComputeTotalStats(
- const std::vector<FIRSTPASS_STATS> &stats_list) {
- FIRSTPASS_STATS total_stats = {};
- for (size_t i = 0; i < stats_list.size(); ++i) {
- av1_accumulate_stats(&total_stats, &stats_list[i]);
- }
- return total_stats;
-}
-
-static bool FileIsY4m(const char detect[4]) {
- return memcmp(detect, "YUV4", 4) == 0;
-}
-
-static bool FourccIsIvf(const char detect[4]) {
- return memcmp(detect, "DKIF", 4) == 0;
-}
-
-static void OpenInputFile(struct AvxInputContext *input) {
- input->file = fopen(input->filename, "rb");
- /* For RAW input sources, these bytes will applied on the first frame
- * in read_frame().
- */
- input->detect.buf_read = fread(input->detect.buf, 1, 4, input->file);
- input->detect.position = 0;
- aom_chroma_sample_position_t const csp = AOM_CSP_UNKNOWN;
- if (input->detect.buf_read == 4 && FileIsY4m(input->detect.buf)) {
- if (y4m_input_open(&input->y4m, input->file, input->detect.buf, 4, csp,
- input->only_i420) >= 0) {
- input->file_type = FILE_TYPE_Y4M;
- input->width = input->y4m.pic_w;
- input->height = input->y4m.pic_h;
- input->pixel_aspect_ratio.numerator = input->y4m.par_n;
- input->pixel_aspect_ratio.denominator = input->y4m.par_d;
- input->framerate.numerator = input->y4m.fps_n;
- input->framerate.denominator = input->y4m.fps_d;
- input->fmt = input->y4m.aom_fmt;
- input->bit_depth = static_cast<aom_bit_depth_t>(input->y4m.bit_depth);
- input->color_range = input->y4m.color_range;
- } else
- fatal("Unsupported Y4M stream.");
- } else if (input->detect.buf_read == 4 && FourccIsIvf(input->detect.buf)) {
- fatal("IVF is not supported as input.");
- } else {
- input->file_type = FILE_TYPE_RAW;
- }
-}
-
-void DuckyEncode::InitEncoder(aom_enc_pass pass,
- const std::vector<FIRSTPASS_STATS> *stats_list) {
- EncoderResource enc_resource = {};
- enc_resource.lookahead_push_count = 0;
- OpenInputFile(&impl_ptr_->input);
- if (impl_ptr_->input.file_type != FILE_TYPE_Y4M) {
- aom_img_alloc(&enc_resource.img, impl_ptr_->video_info.img_fmt,
- impl_ptr_->video_info.frame_width,
- impl_ptr_->video_info.frame_height, /*align=*/1);
- }
- AV1EncoderConfig oxcf =
- GetEncoderConfig(impl_ptr_->video_info, impl_ptr_->g_usage, pass);
- oxcf.dec_model_cfg.decoder_model_info_present_flag = 0;
- oxcf.dec_model_cfg.display_model_info_present_flag = 0;
- oxcf.ref_frm_cfg.max_reference_frames = impl_ptr_->max_ref_frames;
- oxcf.speed = impl_ptr_->speed;
- if (impl_ptr_->sb_size == BLOCK_64X64)
- oxcf.tool_cfg.superblock_size = AOM_SUPERBLOCK_SIZE_64X64;
- else
- oxcf.tool_cfg.superblock_size = AOM_SUPERBLOCK_SIZE_128X128;
-
- av1_initialize_enc(impl_ptr_->g_usage, impl_ptr_->rc_end_usage);
- AV1_PRIMARY *ppi =
- av1_create_primary_compressor(nullptr,
- /*num_lap_buffers=*/0, &oxcf);
- enc_resource.ppi = ppi;
-
- assert(ppi != nullptr);
- // Turn off ppi->b_calculate_psnr to avoid calling generate_psnr_packet() in
- // av1_post_encode_updates().
- // TODO(angiebird): Modify generate_psnr_packet() to handle the case that
- // cpi->ppi->output_pkt_list = nullptr.
- ppi->b_calculate_psnr = 0;
-
- aom_codec_err_t res = AOM_CODEC_OK;
- (void)res;
- enc_resource.stats_buf_ctx = CreateStatsBufferCtx(
- impl_ptr_->video_info.frame_count, &enc_resource.stats_buffer);
- if (pass == AOM_RC_SECOND_PASS) {
- assert(stats_list != nullptr);
- std::copy(stats_list->begin(), stats_list->end(),
- enc_resource.stats_buffer);
- *enc_resource.stats_buf_ctx->total_stats = ComputeTotalStats(*stats_list);
- oxcf.twopass_stats_in.buf = enc_resource.stats_buffer;
- // We need +1 here because av1 encoder assumes
- // oxcf.twopass_stats_in.buf[video_info.frame_count] has the total_stats
- oxcf.twopass_stats_in.sz = (impl_ptr_->video_info.frame_count + 1) *
- sizeof(enc_resource.stats_buffer[0]);
- } else {
- assert(pass == AOM_RC_FIRST_PASS);
- // We don't use stats_list for AOM_RC_FIRST_PASS.
- assert(stats_list == nullptr);
- }
- ppi->twopass.stats_buf_ctx = enc_resource.stats_buf_ctx;
- BufferPool *buffer_pool = nullptr;
- res = av1_create_context_and_bufferpool(ppi, &ppi->cpi, &buffer_pool, &oxcf,
- ENCODE_STAGE, -1);
- // TODO(angiebird): Why didn't we set initial_dimensions in
- // av1_create_compressor()?
- ppi->cpi->initial_dimensions.width = oxcf.frm_dim_cfg.width;
- ppi->cpi->initial_dimensions.height = oxcf.frm_dim_cfg.height;
- // use_ducky_encode is the flag we use to change AV1 behavior
- // slightly based on DuckyEncode's need. We should minimize this kind of
- // change unless it's necessary.
- ppi->cpi->use_ducky_encode = 1;
- assert(res == AOM_CODEC_OK);
- assert(ppi->cpi != nullptr);
- assert(buffer_pool != nullptr);
- const AV1_COMP *cpi = ppi->cpi;
- SequenceHeader *seq_params = ppi->cpi->common.seq_params;
- set_sb_size(seq_params, impl_ptr_->sb_size);
- ppi->seq_params_locked = 1;
- assert(ppi->lookahead == nullptr);
-
- int lag_in_frames = cpi->oxcf.gf_cfg.lag_in_frames;
- ppi->lookahead = av1_lookahead_init(
- cpi->oxcf.frm_dim_cfg.width, cpi->oxcf.frm_dim_cfg.height,
- seq_params->subsampling_x, seq_params->subsampling_y,
- seq_params->use_highbitdepth, lag_in_frames, cpi->oxcf.border_in_pixels,
- cpi->common.features.byte_alignment,
- /*num_lap_buffers=*/0, /*is_all_intra=*/0,
- cpi->oxcf.tool_cfg.enable_global_motion);
-
- av1_tf_info_alloc(&cpi->ppi->tf_info, cpi);
- assert(ppi->lookahead != nullptr);
-
- impl_ptr_->enc_resource = enc_resource;
-}
-
-static void CloseInputFile(struct AvxInputContext *input) {
- fclose(input->file);
- if (input->file_type == FILE_TYPE_Y4M) y4m_input_close(&input->y4m);
-}
-
-void DuckyEncode::FreeEncoder() {
- EncoderResource *enc_resource = &impl_ptr_->enc_resource;
- CloseInputFile(&impl_ptr_->input);
- aom_img_free(&enc_resource->img);
- DestroyStatsBufferCtx(&enc_resource->stats_buf_ctx,
- &enc_resource->stats_buffer);
- BufferPool *buffer_pool = enc_resource->ppi->cpi->common.buffer_pool;
- av1_destroy_context_and_bufferpool(enc_resource->ppi->cpi, &buffer_pool);
- av1_remove_primary_compressor(enc_resource->ppi);
- enc_resource->ppi = nullptr;
-}
-
-static int ReadFrame(struct AvxInputContext *input_ctx, aom_image_t *img) {
- FILE *f = input_ctx->file;
- y4m_input *y4m = &input_ctx->y4m;
- int shortread = 0;
-
- if (input_ctx->file_type == FILE_TYPE_Y4M) {
- if (y4m_input_fetch_frame(y4m, f, img) < 1) return 0;
- } else {
- shortread = read_yuv_frame(input_ctx, img);
- }
-
- return !shortread;
-}
-
-std::vector<FIRSTPASS_STATS> DuckyEncode::ComputeFirstPassStats() {
- aom_enc_pass pass = AOM_RC_FIRST_PASS;
- InitEncoder(pass, nullptr);
- AV1_PRIMARY *ppi = impl_ptr_->enc_resource.ppi;
- EncoderResource *enc_resource = &impl_ptr_->enc_resource;
- struct lookahead_ctx *lookahead = ppi->lookahead;
- int frame_count = impl_ptr_->video_info.frame_count;
- aom_rational64_t timestamp_ratio = impl_ptr_->timestamp_ratio;
- // TODO(angiebird): Ideally, ComputeFirstPassStats() doesn't output
- // bitstream. Do we need bitstream buffer here?
- std::vector<uint8_t> buf(1000);
- std::vector<FIRSTPASS_STATS> stats_list;
- for (int i = 0; i < frame_count; ++i) {
- if (ReadFrame(&impl_ptr_->input, &impl_ptr_->enc_resource.img)) {
- // TODO(angiebird): Set ts_start/ts_end properly
- int64_t ts_start = enc_resource->lookahead_push_count;
- int64_t ts_end = ts_start + 1;
- YV12_BUFFER_CONFIG sd;
- image2yuvconfig(&enc_resource->img, &sd);
- av1_lookahead_push(lookahead, &sd, ts_start, ts_end,
- /*use_highbitdepth=*/0, /*flags=*/0);
- ++enc_resource->lookahead_push_count;
- AV1_COMP_DATA cpi_data = {};
- cpi_data.cx_data = buf.data();
- cpi_data.cx_data_sz = buf.size();
- cpi_data.frame_size = 0;
- cpi_data.flush = 1; // Makes av1_get_compressed_data process a frame
- cpi_data.ts_frame_start = ts_start;
- cpi_data.ts_frame_end = ts_end;
- cpi_data.pop_lookahead = 1;
- cpi_data.timestamp_ratio = &timestamp_ratio;
- // av1_get_compressed_data only generates first pass stats not
- // compresses data
- int res = av1_get_compressed_data(ppi->cpi, &cpi_data);
- (void)res;
- assert(res == static_cast<int>(AOM_CODEC_OK));
- stats_list.push_back(*(ppi->twopass.stats_buf_ctx->stats_in_end - 1));
- av1_post_encode_updates(ppi->cpi, &cpi_data);
- }
- }
- av1_end_first_pass(ppi->cpi);
-
- FreeEncoder();
- return stats_list;
-}
-
-void DuckyEncode::StartEncode(const std::vector<FIRSTPASS_STATS> &stats_list) {
- aom_enc_pass pass = AOM_RC_SECOND_PASS;
- impl_ptr_->stats_list = stats_list;
- InitEncoder(pass, &stats_list);
- write_temp_delimiter_ = true;
-}
-
-static void DuckyEncodeInfoSetGopStruct(AV1_PRIMARY *ppi,
- const GopStruct &gop_struct,
- const GopEncodeInfo &gop_encode_info) {
- GF_GROUP *gf_group = &ppi->gf_group;
- ppi->p_rc.baseline_gf_interval = gop_struct.show_frame_count;
- ppi->internal_altref_allowed = 1;
-
- gf_group->size = static_cast<int>(gop_struct.gop_frame_list.size());
- gf_group->max_layer_depth = 0;
-
- int i = 0;
- for (const auto &frame : gop_struct.gop_frame_list) {
- gf_group->update_type[i] = (int)frame.update_type;
- if (frame.update_type == GopFrameType::kRegularArf) gf_group->arf_index = i;
-
- gf_group->frame_type[i] = !frame.is_key_frame;
-
- gf_group->q_val[i] = gop_encode_info.param_list[i].q_index;
- gf_group->rdmult_val[i] = gop_encode_info.param_list[i].rdmult;
-
- gf_group->cur_frame_idx[i] = 0;
- gf_group->arf_src_offset[i] = frame.order_idx - frame.display_idx;
- gf_group->cur_frame_idx[i] = frame.display_idx;
- gf_group->src_offset[i] = 0;
-
- // TODO(jingning): Placeholder - update the arf boost.
- gf_group->arf_boost[i] = 500;
- gf_group->layer_depth[i] = frame.layer_depth;
- gf_group->max_layer_depth =
- AOMMAX(frame.layer_depth, gf_group->max_layer_depth);
- gf_group->refbuf_state[i] =
- frame.is_key_frame ? REFBUF_RESET : REFBUF_UPDATE;
-
- std::fill_n(gf_group->ref_frame_list[i], REF_FRAMES, -1);
- gf_group->update_ref_idx[i] = -1;
- for (int ref_idx = 0;
- ref_idx < static_cast<int>(frame.ref_frame_list.size()); ++ref_idx) {
- int ref_frame = static_cast<int>(frame.ref_frame_list[ref_idx].name);
- gf_group->ref_frame_list[i][ref_frame] =
- static_cast<int8_t>(frame.ref_frame_list[ref_idx].index);
- }
- gf_group->update_ref_idx[i] = frame.update_ref_idx;
- gf_group->primary_ref_idx[i] = frame.primary_ref_frame.index;
- ++i;
- }
- ppi->cpi->gf_frame_index = 0;
-}
-
-static void DuckyEncodeInfoSetEncodeFrameDecision(
- DuckyEncodeInfo *ducky_encode_info, const EncodeFrameDecision &decision) {
- DuckyEncodeFrameInfo *frame_info = &ducky_encode_info->frame_info;
- *frame_info = {};
- frame_info->qp_mode = static_cast<DUCKY_ENCODE_FRAME_MODE>(decision.qp_mode);
- frame_info->gop_mode = static_cast<DUCKY_ENCODE_GOP_MODE>(decision.gop_mode);
- frame_info->q_index = decision.parameters.q_index;
- frame_info->rdmult = decision.parameters.rdmult;
- const size_t num_superblocks =
- decision.parameters.superblock_encode_params.size();
- frame_info->delta_q_enabled = 0;
- if (num_superblocks > 1) {
- frame_info->delta_q_enabled = 1;
- frame_info->superblock_encode_qindex = new int[num_superblocks];
- frame_info->superblock_encode_rdmult = new int[num_superblocks];
- for (size_t i = 0; i < num_superblocks; ++i) {
- frame_info->superblock_encode_qindex[i] =
- decision.parameters.superblock_encode_params[i].q_index;
- frame_info->superblock_encode_rdmult[i] =
- decision.parameters.superblock_encode_params[i].rdmult;
- }
- }
-}
-
-static void DuckyEncodeInfoGetEncodeFrameResult(
- const DuckyEncodeInfo *ducky_encode_info, EncodeFrameResult *result) {
- const DuckyEncodeFrameResult &frame_result = ducky_encode_info->frame_result;
- result->global_order_idx = frame_result.global_order_idx;
- result->q_index = frame_result.q_index;
- result->rdmult = frame_result.rdmult;
- result->rate = frame_result.rate;
- result->dist = frame_result.dist;
- result->psnr = frame_result.psnr;
-}
-
-static void WriteObu(AV1_PRIMARY *ppi, AV1_COMP_DATA *cpi_data) {
- AV1_COMP *const cpi = ppi->cpi;
- uint32_t obu_header_size = 1;
- const uint32_t obu_payload_size = 0;
- const size_t length_field_size = aom_uleb_size_in_bytes(obu_payload_size);
-
- const size_t move_offset = obu_header_size + length_field_size;
- memmove(cpi_data->cx_data + move_offset, cpi_data->cx_data,
- cpi_data->frame_size);
- obu_header_size =
- av1_write_obu_header(&ppi->level_params, &cpi->frame_header_count,
- OBU_TEMPORAL_DELIMITER, 0, cpi_data->cx_data);
-
- // OBUs are preceded/succeeded by an unsigned leb128 coded integer.
- if (av1_write_uleb_obu_size(obu_header_size, obu_payload_size,
- cpi_data->cx_data) != AOM_CODEC_OK) {
- aom_internal_error(&ppi->error, AOM_CODEC_ERROR, NULL);
- }
-
- cpi_data->frame_size +=
- obu_header_size + obu_payload_size + length_field_size;
-}
-
-TplGopStats DuckyEncode::ObtainTplStats(const GopStruct gop_struct,
- bool rate_dist_present) {
- TplGopStats tpl_gop_stats;
-
- AV1_PRIMARY *ppi = impl_ptr_->enc_resource.ppi;
- const uint8_t block_mis_log2 = ppi->tpl_data.tpl_stats_block_mis_log2;
-
- for (size_t idx = 0; idx < gop_struct.gop_frame_list.size(); ++idx) {
- TplFrameStats tpl_frame_stats = {};
- tpl_frame_stats.rate_dist_present = rate_dist_present;
-
- TplDepFrame *tpl_frame = &ppi->tpl_data.tpl_frame[idx];
- if (gop_struct.gop_frame_list[idx].update_type == GopFrameType::kOverlay ||
- gop_struct.gop_frame_list[idx].update_type ==
- GopFrameType::kIntermediateOverlay) {
- tpl_gop_stats.frame_stats_list.push_back(tpl_frame_stats);
- continue;
- }
-
- int ref_frame_index_mapping[REF_FRAMES] = { 0 };
- const GopFrame &gop_frame = gop_struct.gop_frame_list[idx];
-
- for (auto &rf : gop_frame.ref_frame_list) {
- ref_frame_index_mapping[static_cast<int>(rf.name)] = rf.index;
- }
-
- const int mi_rows = tpl_frame->mi_rows;
- const int mi_cols = tpl_frame->mi_cols;
- const int tpl_frame_stride = tpl_frame->stride;
- tpl_frame_stats.frame_height = mi_rows * MI_SIZE;
- tpl_frame_stats.frame_width = mi_cols * MI_SIZE;
- tpl_frame_stats.min_block_size = (1 << block_mis_log2) * MI_SIZE;
-
- const int mi_step = 1 << block_mis_log2;
- for (int mi_row = 0; mi_row < mi_rows; mi_row += mi_step) {
- for (int mi_col = 0; mi_col < mi_cols; mi_col += mi_step) {
- int tpl_blk_pos = (mi_row >> block_mis_log2) * tpl_frame_stride +
- (mi_col >> block_mis_log2);
- TplDepStats *tpl_stats_ptr = &tpl_frame->tpl_stats_ptr[tpl_blk_pos];
-
- TplBlockStats block_stats;
- block_stats.row = mi_row * MI_SIZE;
- block_stats.col = mi_col * MI_SIZE;
- block_stats.height = (1 << block_mis_log2) * MI_SIZE;
- block_stats.width = (1 << block_mis_log2) * MI_SIZE;
-
- block_stats.inter_cost =
- RDCOST(tpl_frame->base_rdmult, tpl_stats_ptr->recrf_rate,
- tpl_stats_ptr->recrf_dist);
- block_stats.intra_cost =
- RDCOST(tpl_frame->base_rdmult, tpl_stats_ptr->intra_rate,
- tpl_stats_ptr->intra_dist);
-
- if (tpl_frame_stats.rate_dist_present) {
- block_stats.recrf_dist = tpl_stats_ptr->recrf_dist;
- block_stats.recrf_rate = tpl_stats_ptr->recrf_rate;
- block_stats.intra_pred_err = tpl_stats_ptr->intra_sse;
- block_stats.inter_pred_err = tpl_stats_ptr->recrf_sse;
- }
-
- block_stats.ref_frame_index = { -1, -1 };
-
- for (int i = 0; i < kBlockRefCount; ++i) {
- if (tpl_stats_ptr->ref_frame_index[i] >= 0) {
- block_stats.ref_frame_index[i] =
- ref_frame_index_mapping[tpl_stats_ptr->ref_frame_index[i] + 1];
- block_stats.mv[i] = {
- tpl_stats_ptr->mv[tpl_stats_ptr->ref_frame_index[i]].as_mv.row,
- tpl_stats_ptr->mv[tpl_stats_ptr->ref_frame_index[i]].as_mv.col, 3
- };
- }
- }
- tpl_frame_stats.block_stats_list.push_back(block_stats);
- }
- }
-
- tpl_gop_stats.frame_stats_list.push_back(tpl_frame_stats);
- }
-
- return tpl_gop_stats;
-}
-
-// Obtain TPL stats through ducky_encode.
-// TODO(jianj): Populate rate_dist_present flag through qmode_rc_encoder
-std::vector<TplGopStats> DuckyEncode::ComputeTplStats(
- const std::vector<FIRSTPASS_STATS> &stats_list,
- const GopStructList &gop_list,
- const GopEncodeInfoList &gop_encode_info_list) {
- StartEncode(stats_list);
- std::vector<TplGopStats> tpl_gop_stats_list;
- AV1_PRIMARY *ppi = impl_ptr_->enc_resource.ppi;
- const VideoInfo &video_info = impl_ptr_->video_info;
- write_temp_delimiter_ = true;
- AllocateBitstreamBuffer(video_info);
-
- // Go through each gop and encode each frame in the gop
- for (size_t i = 0; i < gop_list.size(); ++i) {
- const aom::GopStruct &gop_struct = gop_list[i];
- const aom::GopEncodeInfo &gop_encode_info = gop_encode_info_list[i];
-
- DuckyEncodeInfoSetGopStruct(ppi, gop_struct, gop_encode_info);
-
- aom::TplGopStats tpl_gop_stats;
- for (auto &frame_param : gop_encode_info.param_list) {
- // Encode the frame specified by frame_param.
- aom::EncodeFrameDecision frame_decision = { aom::EncodeFrameMode::kQindex,
- aom::EncodeGopMode::kGopRcl,
- frame_param };
- EncodeFrame(frame_decision);
- if (ppi->cpi->common.show_frame) pending_ctx_size_ = 0;
- write_temp_delimiter_ = ppi->cpi->common.show_frame;
- }
- // rate_dist_present is hard-coded to false here and still needs to be populated (see the TODO above).
- tpl_gop_stats = ObtainTplStats(gop_struct, 0);
- tpl_gop_stats_list.push_back(tpl_gop_stats);
- }
- EndEncode();
- return tpl_gop_stats_list;
-}
-
-std::vector<TplGopStats> DuckyEncode::ComputeTwoPassTplStats(
- const std::vector<FIRSTPASS_STATS> &stats_list,
- const GopStructList &gop_list,
- const GopEncodeInfoList &gop_encode_info_list,
- const GopEncodeInfoList &alt_gop_encode_info_list) {
- std::vector<TplGopStats> first_tpl_gop_stats_list =
- ComputeTplStats(stats_list, gop_list, gop_encode_info_list);
- const std::vector<TplGopStats> second_tpl_gop_stats_list =
- ComputeTplStats(stats_list, gop_list, alt_gop_encode_info_list);
- assert(first_tpl_gop_stats_list.size() == second_tpl_gop_stats_list.size());
-
- // Set alternate_block_stats_list in first_tpl_gop_stats_list
- // and return first_tpl_gop_stats_list
- for (size_t i = 0; i < first_tpl_gop_stats_list.size(); ++i) {
- for (size_t j = 0; j < first_tpl_gop_stats_list[i].frame_stats_list.size();
- ++j) {
- first_tpl_gop_stats_list[i]
- .frame_stats_list[j]
- .alternate_block_stats_list =
- second_tpl_gop_stats_list[i].frame_stats_list[j].block_stats_list;
- }
- }
- return first_tpl_gop_stats_list;
-}
-
-// Conduct final encoding process.
-std::vector<EncodeFrameResult> DuckyEncode::EncodeVideo(
- const GopStructList &gop_list,
- const GopEncodeInfoList &gop_encode_info_list) {
- AV1_PRIMARY *ppi = impl_ptr_->enc_resource.ppi;
- std::vector<EncodeFrameResult> encoded_frame_list;
- const VideoInfo &video_info = impl_ptr_->video_info;
-
- write_temp_delimiter_ = true;
- AllocateBitstreamBuffer(video_info);
-
- // Go through each gop and encode each frame in the gop
- for (size_t i = 0; i < gop_list.size(); ++i) {
- const aom::GopStruct &gop_struct = gop_list[i];
- const aom::GopEncodeInfo &gop_encode_info = gop_encode_info_list[i];
- DuckyEncodeInfoSetGopStruct(ppi, gop_struct, gop_encode_info);
-
- for (auto &frame_param : gop_encode_info.param_list) {
- aom::EncodeFrameDecision frame_decision = { aom::EncodeFrameMode::kQindex,
- aom::EncodeGopMode::kGopRcl,
- frame_param };
- EncodeFrameResult temp_result = EncodeFrame(frame_decision);
- if (ppi->cpi->common.show_frame) {
- bitstream_buf_.resize(pending_ctx_size_);
- EncodeFrameResult encode_frame_result = temp_result;
- encode_frame_result.bitstream_buf = bitstream_buf_;
- encoded_frame_list.push_back(encode_frame_result);
-
- AllocateBitstreamBuffer(video_info);
- }
- write_temp_delimiter_ = ppi->cpi->common.show_frame;
- }
- }
-
- return encoded_frame_list;
-}
-
-EncodeFrameResult DuckyEncode::EncodeFrame(
- const EncodeFrameDecision &decision) {
- EncodeFrameResult encode_frame_result = {};
- encode_frame_result.bitstream_buf = bitstream_buf_;
- AV1_PRIMARY *ppi = impl_ptr_->enc_resource.ppi;
- aom_image_t *img = &impl_ptr_->enc_resource.img;
- AV1_COMP *const cpi = ppi->cpi;
- struct lookahead_ctx *lookahead = ppi->lookahead;
-
- while (!av1_lookahead_full(lookahead)) {
- if (ReadFrame(&impl_ptr_->input, img)) {
- YV12_BUFFER_CONFIG sd;
- image2yuvconfig(img, &sd);
- int64_t ts_start = impl_ptr_->enc_resource.lookahead_push_count;
- int64_t ts_end = ts_start + 1;
- av1_lookahead_push(lookahead, &sd, ts_start, ts_end,
- /*use_highbitdepth=*/0, /*flags=*/0);
- ++impl_ptr_->enc_resource.lookahead_push_count;
- } else {
- break;
- }
- }
-
- AV1_COMP_DATA cpi_data = {};
- cpi_data.cx_data = bitstream_buf_.data() + pending_ctx_size_;
- cpi_data.cx_data_sz = bitstream_buf_.size() - pending_ctx_size_;
- cpi_data.frame_size = 0;
- cpi_data.flush = 1;
- // ts_frame_start and ts_frame_end are not as important since we are focusing
- // on q mode
- cpi_data.ts_frame_start = impl_ptr_->enc_resource.encode_frame_count;
- cpi_data.ts_frame_end = cpi_data.ts_frame_start + 1;
- cpi_data.pop_lookahead = 1;
- cpi_data.timestamp_ratio = &impl_ptr_->timestamp_ratio;
- ++impl_ptr_->enc_resource.encode_frame_count;
-
- av1_compute_num_workers_for_mt(cpi);
- av1_init_frame_mt(ppi, cpi);
-
- DuckyEncodeInfoSetEncodeFrameDecision(&cpi->ducky_encode_info, decision);
- const int status = av1_get_compressed_data(cpi, &cpi_data);
-
- if (write_temp_delimiter_) WriteObu(ppi, &cpi_data);
- (void)status;
- assert(status == static_cast<int>(AOM_CODEC_OK));
- DuckyEncodeInfoGetEncodeFrameResult(&cpi->ducky_encode_info,
- &encode_frame_result);
- av1_post_encode_updates(cpi, &cpi_data);
- if (cpi->common.show_frame) {
- // decrement frames_left counter
- ppi->frames_left = AOMMAX(0, ppi->frames_left - 1);
- }
-
- pending_ctx_size_ += cpi_data.frame_size;
-
- fprintf(stderr, "frame %d, qp = %d, size %d, PSNR %f\n",
- encode_frame_result.global_order_idx, encode_frame_result.q_index,
- encode_frame_result.rate, encode_frame_result.psnr);
- delete[] cpi->ducky_encode_info.frame_info.superblock_encode_qindex;
- delete[] cpi->ducky_encode_info.frame_info.superblock_encode_rdmult;
- return encode_frame_result;
-}
-
-void DuckyEncode::EndEncode() { FreeEncoder(); }
-
-void DuckyEncode::AllocateBitstreamBuffer(const VideoInfo &video_info) {
- pending_ctx_size_ = 0;
- // TODO(angiebird): Set the bitstream_buf size to a conservative upper bound.
- bitstream_buf_.assign(
- video_info.frame_width * video_info.frame_height * 3 * 8, 0);
-}
-} // namespace aom
diff --git a/av1/qmode_rc/ducky_encode.h b/av1/qmode_rc/ducky_encode.h
deleted file mode 100644
index 5dee2a5..0000000
--- a/av1/qmode_rc/ducky_encode.h
+++ /dev/null
@@ -1,117 +0,0 @@
-/*
- * Copyright (c) 2022, Alliance for Open Media. All rights reserved
- *
- * This source code is subject to the terms of the BSD 2 Clause License and
- * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
- * was not distributed with this source code in the LICENSE file, you can
- * obtain it at www.aomedia.org/license/software. If the Alliance for Open
- * Media Patent License 1.0 was not distributed with this source code in the
- * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
- */
-
-#ifndef AOM_AV1_QMODE_RC_DUCKY_ENCODE_H_
-#define AOM_AV1_QMODE_RC_DUCKY_ENCODE_H_
-
-#include <cstddef>
-#include <cstdint>
-#include <memory>
-#include <string>
-#include <vector>
-
-#include "aom/aom_encoder.h"
-#include "av1/encoder/firstpass.h"
-#include "av1/qmode_rc/ratectrl_qmode_interface.h"
-
-namespace aom {
-struct VideoInfo {
- int frame_width;
- int frame_height;
- aom_rational_t frame_rate;
- aom_img_fmt_t img_fmt;
- int frame_count;
- std::string file_path;
-};
-
-struct EncodeFrameResult {
- std::vector<uint8_t> bitstream_buf;
- // TODO(angiebird): update global_coding_idx and global_order_idx properly.
- int global_coding_idx;
- int global_order_idx;
- int q_index;
- int rdmult;
- int rate;
- int64_t dist;
- double psnr;
-};
-
-enum class EncodeFrameMode {
- kNone, // Let native AV1 determine q index and rdmult
- kQindex, // DuckyEncode determines q index and AV1 determines rdmult
- kQindexRdmult, // DuckyEncode determines q index and rdmult
-};
-
-enum class EncodeGopMode {
- kNone, // native AV1 decides GOP
- kGopRcl, // rate control lib decides GOP
-};
-
-struct EncodeFrameDecision {
- EncodeFrameMode qp_mode;
- EncodeGopMode gop_mode;
- FrameEncodeParameters parameters;
-};
-
-using GopEncodeInfoList = std::vector<GopEncodeInfo>;
-
-// DuckyEncode is an experimental C++ encoder interface for two-pass mode.
-// This object can be used to do zero or more encode passes, where each encode
-// pass consists of:
-// - StartEncode()
-// - Zero or more calls to EncodeFrame()
-// - EndEncode()
-// Encode passes may not overlap, and any other sequence of these calls is
-// invalid.
-class DuckyEncode {
- public:
- explicit DuckyEncode(const VideoInfo &video_info, BLOCK_SIZE sb_size,
- int max_ref_frames, int speed, int base_qindex);
- ~DuckyEncode();
- std::vector<FIRSTPASS_STATS> ComputeFirstPassStats();
- void StartEncode(const std::vector<FIRSTPASS_STATS> &stats_list);
-
- TplGopStats ObtainTplStats(const GopStruct gop_struct,
- bool rate_dist_present);
-
- std::vector<TplGopStats> ComputeTplStats(
- const std::vector<FIRSTPASS_STATS> &stats_list,
- const GopStructList &gop_list,
- const GopEncodeInfoList &gop_encode_info_list);
-
- std::vector<TplGopStats> ComputeTwoPassTplStats(
- const std::vector<FIRSTPASS_STATS> &stats_list,
- const GopStructList &gop_list,
- const GopEncodeInfoList &gop_encode_info_list,
- const GopEncodeInfoList &alt_gop_encode_info_list);
-
- std::vector<EncodeFrameResult> EncodeVideo(
- const GopStructList &gop_list,
- const GopEncodeInfoList &gop_encode_info_list);
- EncodeFrameResult EncodeFrame(const EncodeFrameDecision &decision);
- void EndEncode();
- void AllocateBitstreamBuffer(const VideoInfo &video_info);
-
- private:
- void InitEncoder(aom_enc_pass pass,
- const std::vector<FIRSTPASS_STATS> *stats_list);
- void FreeEncoder();
-
- private:
- class EncodeImpl;
- std::unique_ptr<EncodeImpl> impl_ptr_;
- bool write_temp_delimiter_;
- std::vector<uint8_t> bitstream_buf_;
- size_t pending_ctx_size_;
-};
-} // namespace aom
-
-#endif // AOM_AV1_QMODE_RC_DUCKY_ENCODE_H_
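A hedged usage sketch of the pass protocol documented in the class comment above. It mirrors the per-frame loop in ComputeTplStats()/EncodeVideo(); the real callers also set up GOP structures and manage bitstream buffers, which this sketch omits.

#include <vector>

#include "av1/qmode_rc/ducky_encode.h"

// Illustrative only: runs one encode pass over precomputed frame parameters.
void SketchOnePass(aom::DuckyEncode &encoder,
                   const std::vector<FIRSTPASS_STATS> &stats_list,
                   const std::vector<aom::FrameEncodeParameters> &param_list) {
  encoder.StartEncode(stats_list);
  for (const auto &frame_param : param_list) {
    aom::EncodeFrameDecision decision = { aom::EncodeFrameMode::kQindex,
                                          aom::EncodeGopMode::kGopRcl,
                                          frame_param };
    encoder.EncodeFrame(decision);
  }
  encoder.EndEncode();
}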
diff --git a/av1/qmode_rc/ratectrl_qmode.cc b/av1/qmode_rc/ratectrl_qmode.cc
deleted file mode 100644
index 0a2892d..0000000
--- a/av1/qmode_rc/ratectrl_qmode.cc
+++ /dev/null
@@ -1,1552 +0,0 @@
-/*
- * Copyright (c) 2022, Alliance for Open Media. All rights reserved
- *
- * This source code is subject to the terms of the BSD 2 Clause License and
- * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
- * was not distributed with this source code in the LICENSE file, you can
- * obtain it at www.aomedia.org/license/software. If the Alliance for Open
- * Media Patent License 1.0 was not distributed with this source code in the
- * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
- */
-#include "av1/qmode_rc/ratectrl_qmode.h"
-
-#include <algorithm>
-#include <cassert>
-#include <climits>
-#include <functional>
-#include <numeric>
-#include <sstream>
-#include <unordered_map>
-#include <unordered_set>
-#include <vector>
-
-#include "aom/aom_codec.h"
-#include "av1/encoder/pass2_strategy.h"
-#include "av1/encoder/tpl_model.h"
-
-namespace aom {
-
-// This is used before division to ensure that the divisor isn't zero or
-// too close to zero.
-static double ModifyDivisor(double divisor) {
- const double kEpsilon = 0.0000001;
- return (divisor < 0 ? std::min(divisor, -kEpsilon)
- : std::max(divisor, kEpsilon));
-}
-
-GopFrame GopFrameInvalid() {
- GopFrame gop_frame = {};
- gop_frame.is_valid = false;
- gop_frame.coding_idx = -1;
- gop_frame.order_idx = -1;
- return gop_frame;
-}
-
-void SetGopFrameByType(GopFrameType gop_frame_type, GopFrame *gop_frame) {
- gop_frame->update_type = gop_frame_type;
- switch (gop_frame_type) {
- case GopFrameType::kRegularKey:
- gop_frame->is_key_frame = 1;
- gop_frame->is_arf_frame = 0;
- gop_frame->is_show_frame = 1;
- gop_frame->is_golden_frame = 1;
- gop_frame->encode_ref_mode = EncodeRefMode::kRegular;
- break;
- case GopFrameType::kRegularGolden:
- gop_frame->is_key_frame = 0;
- gop_frame->is_arf_frame = 0;
- gop_frame->is_show_frame = 1;
- gop_frame->is_golden_frame = 1;
- gop_frame->encode_ref_mode = EncodeRefMode::kRegular;
- break;
- case GopFrameType::kRegularArf:
- gop_frame->is_key_frame = 0;
- gop_frame->is_arf_frame = 1;
- gop_frame->is_show_frame = 0;
- gop_frame->is_golden_frame = 1;
- gop_frame->encode_ref_mode = EncodeRefMode::kRegular;
- break;
- case GopFrameType::kIntermediateArf:
- gop_frame->is_key_frame = 0;
- gop_frame->is_arf_frame = 1;
- gop_frame->is_show_frame = 0;
- gop_frame->is_golden_frame = gop_frame->layer_depth <= 2 ? 1 : 0;
- gop_frame->encode_ref_mode = EncodeRefMode::kRegular;
- break;
- case GopFrameType::kRegularLeaf:
- gop_frame->is_key_frame = 0;
- gop_frame->is_arf_frame = 0;
- gop_frame->is_show_frame = 1;
- gop_frame->is_golden_frame = 0;
- gop_frame->encode_ref_mode = EncodeRefMode::kRegular;
- break;
- case GopFrameType::kIntermediateOverlay:
- gop_frame->is_key_frame = 0;
- gop_frame->is_arf_frame = 0;
- gop_frame->is_show_frame = 1;
- gop_frame->is_golden_frame = 0;
- gop_frame->encode_ref_mode = EncodeRefMode::kShowExisting;
- break;
- case GopFrameType::kOverlay:
- gop_frame->is_key_frame = 0;
- gop_frame->is_arf_frame = 0;
- gop_frame->is_show_frame = 1;
- gop_frame->is_golden_frame = 0;
- gop_frame->encode_ref_mode = EncodeRefMode::kOverlay;
- break;
- }
-}
-
-GopFrame GopFrameBasic(int global_coding_idx_offset,
- int global_order_idx_offset, int coding_idx,
- int order_idx, int depth, int display_idx,
- GopFrameType gop_frame_type) {
- GopFrame gop_frame = {};
- gop_frame.is_valid = true;
- gop_frame.coding_idx = coding_idx;
- gop_frame.order_idx = order_idx;
- gop_frame.display_idx = display_idx;
- gop_frame.global_coding_idx = global_coding_idx_offset + coding_idx;
- gop_frame.global_order_idx = global_order_idx_offset + order_idx;
- gop_frame.layer_depth = depth + kLayerDepthOffset;
- gop_frame.colocated_ref_idx = -1;
- gop_frame.update_ref_idx = -1;
- SetGopFrameByType(gop_frame_type, &gop_frame);
- return gop_frame;
-}
-
-// This function creates gop frames with display-order indices from
-// order_start to order_end - 1. It recursively introduces intermediate ARFs
-// until the maximum depth is met or there are fewer than kMinIntervalToAddArf
-// regular frames between two ARFs. The remaining regular frames are then
-// added into the gop_struct.
-void ConstructGopMultiLayer(GopStruct *gop_struct,
- RefFrameManager *ref_frame_manager, int max_depth,
- int depth, int order_start, int order_end) {
- GopFrame gop_frame;
- int num_frames = order_end - order_start;
- const int global_coding_idx_offset = gop_struct->global_coding_idx_offset;
- const int global_order_idx_offset = gop_struct->global_order_idx_offset;
- // If there are fewer than kMinIntervalToAddArf frames, stop introducing ARFs
- if (depth < max_depth && num_frames >= kMinIntervalToAddArf) {
- int order_mid = (order_start + order_end) / 2;
- // intermediate ARF
- gop_frame = GopFrameBasic(
- global_coding_idx_offset, global_order_idx_offset,
- static_cast<int>(gop_struct->gop_frame_list.size()), order_mid, depth,
- gop_struct->display_tracker, GopFrameType::kIntermediateArf);
- ref_frame_manager->UpdateRefFrameTable(&gop_frame);
- gop_struct->gop_frame_list.push_back(gop_frame);
- ConstructGopMultiLayer(gop_struct, ref_frame_manager, max_depth, depth + 1,
- order_start, order_mid);
- // show existing intermediate ARF
- gop_frame =
- GopFrameBasic(global_coding_idx_offset, global_order_idx_offset,
- static_cast<int>(gop_struct->gop_frame_list.size()),
- order_mid, max_depth, gop_struct->display_tracker,
- GopFrameType::kIntermediateOverlay);
- ref_frame_manager->UpdateRefFrameTable(&gop_frame);
- gop_struct->gop_frame_list.push_back(gop_frame);
- ++gop_struct->display_tracker;
- ConstructGopMultiLayer(gop_struct, ref_frame_manager, max_depth, depth + 1,
- order_mid + 1, order_end);
- } else {
- // regular frame
- for (int i = order_start; i < order_end; ++i) {
- gop_frame = GopFrameBasic(
- global_coding_idx_offset, global_order_idx_offset,
- static_cast<int>(gop_struct->gop_frame_list.size()), i, max_depth,
- gop_struct->display_tracker, GopFrameType::kRegularLeaf);
- ref_frame_manager->UpdateRefFrameTable(&gop_frame);
- gop_struct->gop_frame_list.push_back(gop_frame);
- ++gop_struct->display_tracker;
- }
- }
-}
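A minimal standalone sketch of the recursion above, assuming a stand-in value of 3 for kMinIntervalToAddArf purely for illustration (the real constant is defined elsewhere in this library). For an 8-frame display range it prints the resulting coding order: an intermediate ARF at each midpoint, then the left half, the show-existing overlay, then the right half, with regular leaves at maximum depth.

#include <cstdio>

constexpr int kMinArfInterval = 3;  // hypothetical stand-in for illustration

void SketchGop(int depth, int max_depth, int order_start, int order_end) {
  const int num_frames = order_end - order_start;
  if (depth < max_depth && num_frames >= kMinArfInterval) {
    const int order_mid = (order_start + order_end) / 2;
    printf("depth %d: intermediate ARF at display %d\n", depth, order_mid);
    SketchGop(depth + 1, max_depth, order_start, order_mid);
    printf("depth %d: show-existing overlay at display %d\n", depth, order_mid);
    SketchGop(depth + 1, max_depth, order_mid + 1, order_end);
  } else {
    for (int i = order_start; i < order_end; ++i)
      printf("depth %d: regular leaf at display %d\n", depth, i);
  }
}

int main() {
  SketchGop(/*depth=*/1, /*max_depth=*/3, /*order_start=*/0, /*order_end=*/8);
  return 0;
}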
-
-GopStruct ConstructGop(RefFrameManager *ref_frame_manager, int show_frame_count,
- bool has_key_frame, int global_coding_idx_offset,
- int global_order_idx_offset) {
- GopStruct gop_struct;
- gop_struct.show_frame_count = show_frame_count;
- gop_struct.global_coding_idx_offset = global_coding_idx_offset;
- gop_struct.global_order_idx_offset = global_order_idx_offset;
- int order_start = 0;
- int order_end = show_frame_count - 1;
-
- // TODO(jingning): Re-enable the use of pyramid coding structure.
- bool has_arf_frame = show_frame_count > kMinIntervalToAddArf;
-
- gop_struct.display_tracker = 0;
-
- GopFrame gop_frame;
- if (has_key_frame) {
- const int key_frame_depth = -1;
- ref_frame_manager->Reset();
- gop_frame = GopFrameBasic(
- global_coding_idx_offset, global_order_idx_offset,
- static_cast<int>(gop_struct.gop_frame_list.size()), order_start,
- key_frame_depth, gop_struct.display_tracker, GopFrameType::kRegularKey);
- ref_frame_manager->UpdateRefFrameTable(&gop_frame);
- gop_struct.gop_frame_list.push_back(gop_frame);
- order_start++;
- ++gop_struct.display_tracker;
- }
-
- const int arf_depth = 0;
- if (has_arf_frame) {
- // Use a multi-layer pyramid coding structure.
- gop_frame = GopFrameBasic(
- global_coding_idx_offset, global_order_idx_offset,
- static_cast<int>(gop_struct.gop_frame_list.size()), order_end,
- arf_depth, gop_struct.display_tracker, GopFrameType::kRegularArf);
- ref_frame_manager->UpdateRefFrameTable(&gop_frame);
- gop_struct.gop_frame_list.push_back(gop_frame);
- ConstructGopMultiLayer(&gop_struct, ref_frame_manager,
- ref_frame_manager->MaxRefFrame() - 1, arf_depth + 1,
- order_start, order_end);
- // Overlay
- gop_frame =
- GopFrameBasic(global_coding_idx_offset, global_order_idx_offset,
- static_cast<int>(gop_struct.gop_frame_list.size()),
- order_end, ref_frame_manager->MaxRefFrame() - 1,
- gop_struct.display_tracker, GopFrameType::kOverlay);
- ref_frame_manager->UpdateRefFrameTable(&gop_frame);
- gop_struct.gop_frame_list.push_back(gop_frame);
- ++gop_struct.display_tracker;
- } else {
- // Use IPPP format.
- for (int i = order_start; i <= order_end; ++i) {
- gop_frame = GopFrameBasic(
- global_coding_idx_offset, global_order_idx_offset,
- static_cast<int>(gop_struct.gop_frame_list.size()), i, arf_depth + 1,
- gop_struct.display_tracker, GopFrameType::kRegularLeaf);
- ref_frame_manager->UpdateRefFrameTable(&gop_frame);
- gop_struct.gop_frame_list.push_back(gop_frame);
- ++gop_struct.display_tracker;
- }
- }
-
- return gop_struct;
-}
-
-Status AV1RateControlQMode::SetRcParam(const RateControlParam &rc_param) {
- std::ostringstream error_message;
- if (rc_param.max_gop_show_frame_count <
- std::max(4, rc_param.min_gop_show_frame_count)) {
- error_message << "max_gop_show_frame_count ("
- << rc_param.max_gop_show_frame_count
- << ") must be at least 4 and may not be less than "
- "min_gop_show_frame_count ("
- << rc_param.min_gop_show_frame_count << ")";
- return { AOM_CODEC_INVALID_PARAM, error_message.str() };
- }
- if (rc_param.ref_frame_table_size < 1 || rc_param.ref_frame_table_size > 8) {
- error_message << "ref_frame_table_size (" << rc_param.ref_frame_table_size
- << ") must be in the range [1, 8].";
- return { AOM_CODEC_INVALID_PARAM, error_message.str() };
- }
- if (rc_param.max_ref_frames < 1 || rc_param.max_ref_frames > 7) {
- error_message << "max_ref_frames (" << rc_param.max_ref_frames
- << ") must be in the range [1, 7].";
- return { AOM_CODEC_INVALID_PARAM, error_message.str() };
- }
- if (rc_param.base_q_index < 0 || rc_param.base_q_index > 255) {
- error_message << "base_q_index (" << rc_param.base_q_index
- << ") must be in the range [0, 255].";
- return { AOM_CODEC_INVALID_PARAM, error_message.str() };
- }
- if (rc_param.frame_width < 16 || rc_param.frame_width > 16384 ||
- rc_param.frame_height < 16 || rc_param.frame_height > 16384) {
- error_message << "frame_width (" << rc_param.frame_width
- << ") and frame_height (" << rc_param.frame_height
- << ") must be in the range [16, 16384].";
- return { AOM_CODEC_INVALID_PARAM, error_message.str() };
- }
- rc_param_ = rc_param;
- return { AOM_CODEC_OK, "" };
-}
-
-// Threshold for use of the lagging second reference frame. High second ref
-// usage may point to a transient event like a flash or occlusion rather than
-// a real scene cut.
-// We adapt the threshold based on the number of frames seen so far in this
-// key-frame group.
-static double GetSecondRefUsageThreshold(int frame_count_so_far) {
- const int adapt_upto = 32;
- const double min_second_ref_usage_thresh = 0.085;
- const double second_ref_usage_thresh_max_delta = 0.035;
- if (frame_count_so_far >= adapt_upto) {
- return min_second_ref_usage_thresh + second_ref_usage_thresh_max_delta;
- }
- return min_second_ref_usage_thresh +
- ((double)frame_count_so_far / (adapt_upto - 1)) *
- second_ref_usage_thresh_max_delta;
-}
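Concretely, the threshold ramps linearly from 0.085 at the start of a key-frame group to 0.12 once 32 frames have been seen. A minimal check of that arithmetic (a local restatement for the sketch, not the library function):

#include <cassert>
#include <cmath>

static double SecondRefUsageThreshold(int frame_count_so_far) {
  const int kAdaptUpto = 32;
  const double kMinThresh = 0.085;
  const double kMaxDelta = 0.035;
  if (frame_count_so_far >= kAdaptUpto) return kMinThresh + kMaxDelta;
  return kMinThresh +
         static_cast<double>(frame_count_so_far) / (kAdaptUpto - 1) * kMaxDelta;
}

int main() {
  assert(std::fabs(SecondRefUsageThreshold(0) - 0.085) < 1e-9);   // group start
  assert(std::fabs(SecondRefUsageThreshold(31) - 0.120) < 1e-9);  // fully adapted
  assert(std::fabs(SecondRefUsageThreshold(64) - 0.120) < 1e-9);  // clamped
  return 0;
}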
-
-// Slide show transition detection.
-// Tests for case where there is very low error either side of the current frame
-// but much higher just for this frame. This can help detect key frames in
-// slide shows even where the slides are pictures of different sizes.
-// Also requires that intra and inter errors are very similar to help eliminate
-// harmful false positives.
-// It will not help if the transition is a fade or other multi-frame effect.
-static bool DetectSlideTransition(const FIRSTPASS_STATS &this_frame,
- const FIRSTPASS_STATS &last_frame,
- const FIRSTPASS_STATS &next_frame) {
- // Intra / Inter threshold very low
- constexpr double kVeryLowII = 1.5;
- // For clean slide transitions we expect a sharp single-frame spike in error.
- constexpr double kErrorSpike = 5.0;
-
- // TODO(angiebird): Understand the meaning of these conditions.
- return (this_frame.intra_error < (this_frame.coded_error * kVeryLowII)) &&
- (this_frame.coded_error > (last_frame.coded_error * kErrorSpike)) &&
- (this_frame.coded_error > (next_frame.coded_error * kErrorSpike));
-}
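As a concrete instance of the two thresholds (illustrative numbers only): with last and next coded_error equal to 10, this coded_error equal to 60, and this intra_error equal to 80, both spike conditions hold (60 > 10 * 5.0) and the intra/inter condition holds (80 < 60 * 1.5), so the frame is flagged as a slide transition.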
-
-// Check if there is a significant intra/inter error change between the current
-// frame and its neighbor. If so, we should further test whether the current
-// frame should be a key frame.
-static bool DetectIntraInterErrorChange(const FIRSTPASS_STATS &this_stats,
- const FIRSTPASS_STATS &last_stats,
- const FIRSTPASS_STATS &next_stats) {
- // Minimum % intra coding observed in first pass (1.0 = 100%)
- constexpr double kMinIntraLevel = 0.25;
-  // Minimum ratio between the % of intra coding and inter coding in the first
-  // pass after discounting neutral blocks (discounting neutral blocks in this
-  // way helps catch scene cuts in clips with very flat areas or letterbox
-  // format clips with image padding).
- constexpr double kIntraVsInterRatio = 2.0;
-
- const double modified_pcnt_inter =
- this_stats.pcnt_inter - this_stats.pcnt_neutral;
- const double pcnt_intra_min =
- std::max(kMinIntraLevel, kIntraVsInterRatio * modified_pcnt_inter);
-
- // In real scene cuts there is almost always a sharp change in the intra
- // or inter error score.
- constexpr double kErrorChangeThreshold = 0.4;
- const double last_this_error_ratio =
- fabs(last_stats.coded_error - this_stats.coded_error) /
- ModifyDivisor(this_stats.coded_error);
-
- const double this_next_error_ratio =
- fabs(last_stats.intra_error - this_stats.intra_error) /
- ModifyDivisor(this_stats.intra_error);
-
- // Maximum threshold for the relative ratio of intra error score vs best
- // inter error score.
- constexpr double kThisIntraCodedErrorRatioMax = 1.9;
- const double this_intra_coded_error_ratio =
- this_stats.intra_error / ModifyDivisor(this_stats.coded_error);
-
-  // For real scene cuts we expect an improvement in the intra/inter error
-  // ratio in the next frame.
- constexpr double kNextIntraCodedErrorRatioMin = 3.5;
- const double next_intra_coded_error_ratio =
- next_stats.intra_error / ModifyDivisor(next_stats.coded_error);
-
- double pcnt_intra = 1.0 - this_stats.pcnt_inter;
- return pcnt_intra > pcnt_intra_min &&
- this_intra_coded_error_ratio < kThisIntraCodedErrorRatioMax &&
- (last_this_error_ratio > kErrorChangeThreshold ||
- this_next_error_ratio > kErrorChangeThreshold ||
- next_intra_coded_error_ratio > kNextIntraCodedErrorRatioMin);
-}
-
-// Check whether the candidate can be a key frame.
-// This is a rewrite of test_candidate_kf().
-static bool TestCandidateKey(const FirstpassInfo &first_pass_info,
- int candidate_key_idx, int frames_since_prev_key) {
- const auto &stats_list = first_pass_info.stats_list;
- const int stats_count = static_cast<int>(stats_list.size());
- if (candidate_key_idx + 1 >= stats_count || candidate_key_idx - 1 < 0) {
- return false;
- }
- const auto &last_stats = stats_list[candidate_key_idx - 1];
- const auto &this_stats = stats_list[candidate_key_idx];
- const auto &next_stats = stats_list[candidate_key_idx + 1];
-
- if (frames_since_prev_key < 3) return false;
- const double second_ref_usage_threshold =
- GetSecondRefUsageThreshold(frames_since_prev_key);
- if (this_stats.pcnt_second_ref >= second_ref_usage_threshold) return false;
- if (next_stats.pcnt_second_ref >= second_ref_usage_threshold) return false;
-
- // Hard threshold where the first pass chooses intra for almost all blocks.
- // In such a case even if the frame is not a scene cut coding a key frame
- // may be a good option.
- constexpr double kVeryLowInterThreshold = 0.05;
- if (this_stats.pcnt_inter < kVeryLowInterThreshold ||
- DetectSlideTransition(this_stats, last_stats, next_stats) ||
- DetectIntraInterErrorChange(this_stats, last_stats, next_stats)) {
- double boost_score = 0.0;
- double decay_accumulator = 1.0;
-
- // We do "-1" because the candidate key is not counted.
- int stats_after_this_stats = stats_count - candidate_key_idx - 1;
-
- // Number of frames required to test for scene cut detection
- constexpr int kSceneCutKeyTestIntervalMax = 16;
-
- // Make sure we have enough stats after the candidate key.
- const int frames_to_test_after_candidate_key =
- std::min(kSceneCutKeyTestIntervalMax, stats_after_this_stats);
-
- // Examine how well the key frame predicts subsequent frames.
- int i;
- for (i = 1; i <= frames_to_test_after_candidate_key; ++i) {
- // Get the next frame details
- const auto &stats = stats_list[candidate_key_idx + i];
-
- // Cumulative effect of decay in prediction quality.
- if (stats.pcnt_inter > 0.85) {
- decay_accumulator *= stats.pcnt_inter;
- } else {
- decay_accumulator *= (0.85 + stats.pcnt_inter) / 2.0;
- }
-
- constexpr double kBoostFactor = 12.5;
- double next_iiratio =
- (kBoostFactor * stats.intra_error / ModifyDivisor(stats.coded_error));
- next_iiratio = std::min(next_iiratio, 128.0);
- double boost_score_increment = decay_accumulator * next_iiratio;
-
- // Keep a running total.
- boost_score += boost_score_increment;
-
- // Test various breakout clauses.
- // TODO(any): Test of intra error should be normalized to an MB.
- // TODO(angiebird): Investigate the following questions.
-      // Question 1: next_iiratio = (intra_error / coded_error) * kBoostFactor.
-      // We know intra_error / coded_error >= 1 and kBoostFactor = 12.5,
-      // therefore (intra_error / coded_error) * kBoostFactor will always be
-      // greater than 1.5. Is "next_iiratio < 1.5" always false?
-      // Question 2: Similar to question 1, is "next_iiratio < 3.0" always
-      // false?
-      // Question 3: Why do we need to divide 200 by num_mbs_16x16?
- if ((stats.pcnt_inter < 0.05) || (next_iiratio < 1.5) ||
- (((stats.pcnt_inter - stats.pcnt_neutral) < 0.20) &&
- (next_iiratio < 3.0)) ||
- (boost_score_increment < 3.0) ||
- (stats.intra_error <
- (200.0 / static_cast<double>(first_pass_info.num_mbs_16x16)))) {
- break;
- }
- }
-
- // If there is tolerable prediction for at least the next 3 frames then
- // break out else discard this potential key frame and move on
- const int count_for_tolerable_prediction = 3;
- if (boost_score > 30.0 && (i > count_for_tolerable_prediction)) {
- return true;
- }
- }
- return false;
-}
-
-// Compute key frame location from first_pass_info.
-std::vector<int> GetKeyFrameList(const FirstpassInfo &first_pass_info) {
- std::vector<int> key_frame_list;
- key_frame_list.push_back(0); // The first frame is always a key frame
- int candidate_key_idx = 1;
- while (candidate_key_idx <
- static_cast<int>(first_pass_info.stats_list.size())) {
- const int frames_since_prev_key = candidate_key_idx - key_frame_list.back();
- // Check for a scene cut.
- const bool scenecut_detected = TestCandidateKey(
- first_pass_info, candidate_key_idx, frames_since_prev_key);
- if (scenecut_detected) {
- key_frame_list.push_back(candidate_key_idx);
- }
- ++candidate_key_idx;
- }
- return key_frame_list;
-}
-
-// initialize GF_GROUP_STATS
-static void InitGFStats(GF_GROUP_STATS *gf_stats) {
- gf_stats->gf_group_err = 0.0;
- gf_stats->gf_group_raw_error = 0.0;
- gf_stats->gf_group_skip_pct = 0.0;
- gf_stats->gf_group_inactive_zone_rows = 0.0;
-
- gf_stats->mv_ratio_accumulator = 0.0;
- gf_stats->decay_accumulator = 1.0;
- gf_stats->zero_motion_accumulator = 1.0;
- gf_stats->loop_decay_rate = 1.0;
- gf_stats->last_loop_decay_rate = 1.0;
- gf_stats->this_frame_mv_in_out = 0.0;
- gf_stats->mv_in_out_accumulator = 0.0;
- gf_stats->abs_mv_in_out_accumulator = 0.0;
-
- gf_stats->avg_sr_coded_error = 0.0;
- gf_stats->avg_pcnt_second_ref = 0.0;
- gf_stats->avg_new_mv_count = 0.0;
- gf_stats->avg_wavelet_energy = 0.0;
- gf_stats->avg_raw_err_stdev = 0.0;
- gf_stats->non_zero_stdev_count = 0;
-}
-
-static int FindRegionIndex(const std::vector<REGIONS> ®ions, int frame_idx) {
- for (int k = 0; k < static_cast<int>(regions.size()); k++) {
- if (regions[k].start <= frame_idx && regions[k].last >= frame_idx) {
- return k;
- }
- }
- return -1;
-}
-
-// This function detects a flash through the high relative pcnt_second_ref
-// score in the frame following a flash frame. The offset passed in should
-// reflect this.
-static bool DetectFlash(const std::vector<FIRSTPASS_STATS> &stats_list,
- int index) {
- int next_index = index + 1;
- if (next_index >= static_cast<int>(stats_list.size())) return false;
- const FIRSTPASS_STATS &next_frame = stats_list[next_index];
-
- // What we are looking for here is a situation where there is a
- // brief break in prediction (such as a flash) but subsequent frames
- // are reasonably well predicted by an earlier (pre flash) frame.
- // The recovery after a flash is indicated by a high pcnt_second_ref
- // compared to pcnt_inter.
- return next_frame.pcnt_second_ref > next_frame.pcnt_inter &&
- next_frame.pcnt_second_ref >= 0.5;
-}
-
-#define MIN_SHRINK_LEN 6
-
-// This function takes a suggested gop interval from cur_start to cur_last,
-// analyzes firstpass stats and region stats, and then returns a better gop
-// cut location.
-// TODO(b/231517281): Simplify the indices once we have a unit test.
-// We are using four indices here, order_index, cur_start, cur_last, and
-// frames_since_key. Ideally, only three indices are needed.
-// 1) start_index = order_index + cur_start
-// 2) end_index = order_index + cur_end
-// 3) key_index
-int FindBetterGopCut(const std::vector<FIRSTPASS_STATS> &stats_list,
- const std::vector<REGIONS> ®ions_list,
- int min_gop_show_frame_count, int max_gop_show_frame_count,
- int order_index, int cur_start, int cur_last,
- int frames_since_key) {
- // only try shrinking if interval smaller than active_max_gf_interval
- if (cur_last - cur_start > max_gop_show_frame_count ||
- cur_start >= cur_last) {
- return cur_last;
- }
- int num_regions = static_cast<int>(regions_list.size());
- int num_stats = static_cast<int>(stats_list.size());
- const int min_shrink_int = std::max(MIN_SHRINK_LEN, min_gop_show_frame_count);
-
- // find the region indices of where the first and last frame belong.
- int k_start = FindRegionIndex(regions_list, cur_start + frames_since_key);
- int k_last = FindRegionIndex(regions_list, cur_last + frames_since_key);
- if (cur_start + frames_since_key == 0) k_start = 0;
-
- int scenecut_idx = -1;
- // See if we have a scenecut in between
- for (int r = k_start + 1; r <= k_last; r++) {
- if (regions_list[r].type == SCENECUT_REGION &&
- regions_list[r].last - frames_since_key - cur_start >
- min_gop_show_frame_count) {
- scenecut_idx = r;
- break;
- }
- }
-
- // if the found scenecut is very close to the end, ignore it.
- if (scenecut_idx >= 0 &&
- regions_list[num_regions - 1].last - regions_list[scenecut_idx].last <
- 4) {
- scenecut_idx = -1;
- }
-
- if (scenecut_idx != -1) {
- // If we have a scenecut, then stop at it.
- // TODO(bohanli): add logic here to stop before the scenecut and for
- // the next gop start from the scenecut with GF
- int is_minor_sc =
- (regions_list[scenecut_idx].avg_cor_coeff *
- (1 - stats_list[order_index + regions_list[scenecut_idx].start -
- frames_since_key]
- .noise_var /
- regions_list[scenecut_idx].avg_intra_err) >
- 0.6);
- cur_last =
- regions_list[scenecut_idx].last - frames_since_key - !is_minor_sc;
- } else {
- int is_last_analysed =
- (k_last == num_regions - 1) &&
- (cur_last + frames_since_key == regions_list[k_last].last);
- int not_enough_regions =
- k_last - k_start <= 1 + (regions_list[k_start].type == SCENECUT_REGION);
- // if we are very close to the end, then do not shrink since it may
- // introduce intervals that are too short
- if (!(is_last_analysed && not_enough_regions)) {
- const double arf_length_factor = 0.1;
- double best_score = 0;
- int best_j = -1;
- const int first_frame = regions_list[0].start - frames_since_key;
- const int last_frame =
- regions_list[num_regions - 1].last - frames_since_key;
- // score of how much the arf helps the whole GOP
- double base_score = 0.0;
-      // Accumulate base_score over the frames up to the minimum shrink interval.
- for (int j = cur_start + 1; j < cur_start + min_shrink_int; j++) {
- if (order_index + j >= num_stats) break;
- base_score = (base_score + 1.0) * stats_list[order_index + j].cor_coeff;
- }
- int met_blending = 0; // Whether we have met blending areas before
-      int last_blending = 0;  // Whether the previous frame is blending
- for (int j = cur_start + min_shrink_int; j <= cur_last; j++) {
- if (order_index + j >= num_stats) break;
- base_score = (base_score + 1.0) * stats_list[order_index + j].cor_coeff;
- int this_reg = FindRegionIndex(regions_list, j + frames_since_key);
- if (this_reg < 0) continue;
- // A GOP should include at most 1 blending region.
- if (regions_list[this_reg].type == BLENDING_REGION) {
- last_blending = 1;
- if (met_blending) {
- break;
- } else {
- base_score = 0;
- continue;
- }
- } else {
- if (last_blending) met_blending = 1;
- last_blending = 0;
- }
-
- // Add the factor of how good the neighborhood is for this
- // candidate arf.
- double this_score = arf_length_factor * base_score;
- double temp_accu_coeff = 1.0;
- // following frames
- int count_f = 0;
- for (int n = j + 1; n <= j + 3 && n <= last_frame; n++) {
- if (order_index + n >= num_stats) break;
- temp_accu_coeff *= stats_list[order_index + n].cor_coeff;
- this_score +=
- temp_accu_coeff *
- (1 - stats_list[order_index + n].noise_var /
- AOMMAX(regions_list[this_reg].avg_intra_err, 0.001));
- count_f++;
- }
- // preceding frames
- temp_accu_coeff = 1.0;
- for (int n = j; n > j - 3 * 2 + count_f && n > first_frame; n--) {
- if (order_index + n < 0) break;
- temp_accu_coeff *= stats_list[order_index + n].cor_coeff;
- this_score +=
- temp_accu_coeff *
- (1 - stats_list[order_index + n].noise_var /
- AOMMAX(regions_list[this_reg].avg_intra_err, 0.001));
- }
-
- if (this_score > best_score) {
- best_score = this_score;
- best_j = j;
- }
- }
-
- // For blending areas, move one more frame in case we missed the
- // first blending frame.
- int best_reg = FindRegionIndex(regions_list, best_j + frames_since_key);
- if (best_reg < num_regions - 1 && best_reg > 0) {
- if (regions_list[best_reg - 1].type == BLENDING_REGION &&
- regions_list[best_reg + 1].type == BLENDING_REGION) {
- if (best_j + frames_since_key == regions_list[best_reg].start &&
- best_j + frames_since_key < regions_list[best_reg].last) {
- best_j += 1;
- } else if (best_j + frames_since_key == regions_list[best_reg].last &&
- best_j + frames_since_key > regions_list[best_reg].start) {
- best_j -= 1;
- }
- }
- }
-
- if (cur_last - best_j < 2) best_j = cur_last;
- if (best_j > 0 && best_score > 0.1) cur_last = best_j;
-      // If we cannot find anything, just cut at the original place.
- }
- }
-
- return cur_last;
-}
-
-// Function to test for a condition where a complex transition is followed
-// by a static section. For example in slide shows where there is a fade
-// between slides. This is to help with more optimal kf and gf positioning.
-static bool DetectTransitionToStill(
- const std::vector<FIRSTPASS_STATS> &stats_list, int next_stats_index,
- int min_gop_show_frame_count, int frame_interval, int still_interval,
- double loop_decay_rate, double last_decay_rate) {
- // Break clause to detect very still sections after motion
- // For example a static image after a fade or other transition
- // instead of a clean scene cut.
- if (frame_interval > min_gop_show_frame_count && loop_decay_rate >= 0.999 &&
- last_decay_rate < 0.9) {
- int stats_count = static_cast<int>(stats_list.size());
- int stats_left = stats_count - next_stats_index;
- if (stats_left >= still_interval) {
-      // Look ahead a few frames to see if the static condition persists...
- int j;
- for (j = 0; j < still_interval; ++j) {
- const FIRSTPASS_STATS &stats = stats_list[next_stats_index + j];
- if (stats.pcnt_inter - stats.pcnt_motion < 0.999) break;
- }
- // Only if it does do we signal a transition to still.
- return j == still_interval;
- }
- }
- return false;
-}
-
-static int DetectGopCut(const std::vector<FIRSTPASS_STATS> &stats_list,
- int start_idx, int candidate_cut_idx, int next_key_idx,
- int flash_detected, int min_gop_show_frame_count,
- int max_gop_show_frame_count, int frame_width,
- int frame_height, const GF_GROUP_STATS &gf_stats) {
- (void)max_gop_show_frame_count;
- const int candidate_gop_size = candidate_cut_idx - start_idx;
-
- if (!flash_detected) {
- // Break clause to detect very still sections after motion. For example,
- // a static image after a fade or other transition.
- if (DetectTransitionToStill(stats_list, start_idx, min_gop_show_frame_count,
- candidate_gop_size, 5, gf_stats.loop_decay_rate,
- gf_stats.last_loop_decay_rate)) {
- return 1;
- }
- const double arf_abs_zoom_thresh = 4.4;
- // Motion breakout threshold for loop below depends on image size.
- const double mv_ratio_accumulator_thresh =
- (frame_height + frame_width) / 4.0;
- // Some conditions to breakout after min interval.
- if (candidate_gop_size >= min_gop_show_frame_count &&
- // If possible don't break very close to a kf
- (next_key_idx - candidate_cut_idx >= min_gop_show_frame_count) &&
- (candidate_gop_size & 0x01) &&
- (gf_stats.mv_ratio_accumulator > mv_ratio_accumulator_thresh ||
- gf_stats.abs_mv_in_out_accumulator > arf_abs_zoom_thresh)) {
- return 1;
- }
- }
-
- // TODO(b/231489624): Check if we need this part.
-  // If almost totally static, we will not use the max GF length later,
- // so we can continue for more frames.
- // if ((candidate_gop_size >= active_max_gf_interval + 1) &&
- // !is_almost_static(gf_stats->zero_motion_accumulator,
- // twopass->kf_zeromotion_pct, cpi->ppi->lap_enabled)) {
- // return 0;
- // }
- return 0;
-}
-
-/*!\brief Determine the length of future GF groups.
- *
- * \ingroup gf_group_algo
- * This function decides the gf group length of future frames in batch
- *
- * \param[in] rc_param Rate control parameters
- * \param[in] stats_list List of first pass stats
- * \param[in] regions_list List of regions from av1_identify_regions
- * \param[in] order_index Index of current frame in stats_list
- * \param[in] frames_since_key Number of frames since the last key frame
- * \param[in] frames_to_key Number of frames to the next key frame
- *
- * \return Returns a vector of decided GF group lengths.
- */
-static std::vector<int> PartitionGopIntervals(
- const RateControlParam &rc_param,
- const std::vector<FIRSTPASS_STATS> &stats_list,
- const std::vector<REGIONS> ®ions_list, int order_index,
- int frames_since_key, int frames_to_key) {
- int i = 0;
- // If cpi->gf_state.arf_gf_boost_lst is 0, we are starting with a KF or GF.
- int cur_start = 0;
- // Each element is the last frame of the previous GOP. If there are n GOPs,
- // you need n + 1 cuts to find the durations. So cut_pos starts out with -1,
- // which is the last frame of the previous GOP.
- std::vector<int> cut_pos(1, -1);
- int cut_here = 0;
- GF_GROUP_STATS gf_stats;
- InitGFStats(&gf_stats);
- int num_stats = static_cast<int>(stats_list.size());
-
- while (i + order_index < num_stats) {
- // reaches next key frame, break here
- if (i >= frames_to_key - 1) {
- cut_here = 2;
- } else if (i - cur_start >= rc_param.max_gop_show_frame_count) {
- // reached maximum len, but nothing special yet (almost static)
- // let's look at the next interval
- cut_here = 2;
- } else {
- // Test for the case where there is a brief flash but the prediction
- // quality back to an earlier frame is then restored.
- const int gop_start_idx = cur_start + order_index;
- const int candidate_gop_cut_idx = i + order_index;
- const int next_key_idx = frames_to_key + order_index;
- const bool flash_detected =
- DetectFlash(stats_list, candidate_gop_cut_idx);
-
- // TODO(bohanli): remove redundant accumulations here, or unify
- // this and the ones in define_gf_group
- const FIRSTPASS_STATS *stats = &stats_list[candidate_gop_cut_idx];
- av1_accumulate_next_frame_stats(stats, flash_detected, frames_since_key,
- i, &gf_stats, rc_param.frame_width,
- rc_param.frame_height);
-
- // TODO(angiebird): Can we simplify this part? Looks like we are going to
- // change the gop cut index with FindBetterGopCut() anyway.
- cut_here = DetectGopCut(
- stats_list, gop_start_idx, candidate_gop_cut_idx, next_key_idx,
- flash_detected, rc_param.min_gop_show_frame_count,
- rc_param.max_gop_show_frame_count, rc_param.frame_width,
- rc_param.frame_height, gf_stats);
- }
-
- if (!cut_here) {
- ++i;
- continue;
- }
-
- // the current last frame in the gf group
- int original_last = cut_here > 1 ? i : i - 1;
- int cur_last = FindBetterGopCut(
- stats_list, regions_list, rc_param.min_gop_show_frame_count,
- rc_param.max_gop_show_frame_count, order_index, cur_start,
- original_last, frames_since_key);
-    // Record the (possibly shrunken) cut position.
- cut_pos.push_back(cur_last);
-
- // reset pointers to the shrunken location
- cur_start = cur_last;
- int cur_region_idx =
- FindRegionIndex(regions_list, cur_start + 1 + frames_since_key);
- if (cur_region_idx >= 0)
- if (regions_list[cur_region_idx].type == SCENECUT_REGION) cur_start++;
-
- // reset accumulators
- InitGFStats(&gf_stats);
- i = cur_last + 1;
-
- if (cut_here == 2 && i >= frames_to_key) break;
- }
-
- std::vector<int> gf_intervals;
- // save intervals
- for (size_t n = 1; n < cut_pos.size(); n++) {
- gf_intervals.push_back(cut_pos[n] - cut_pos[n - 1]);
- }
-
- return gf_intervals;
-}
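The bookkeeping at the end of the function deserves a concrete example: with n GOPs there are n + 1 cut positions (including the leading -1 sentinel), and adjacent differences recover the GOP lengths. A minimal sketch with hypothetical cut positions:

#include <cassert>
#include <vector>

int main() {
  // cut_pos[k] is the last display index of GOP k - 1; -1 is the sentinel.
  const std::vector<int> cut_pos = { -1, 15, 31, 47 };
  std::vector<int> gf_intervals;
  for (size_t n = 1; n < cut_pos.size(); ++n)
    gf_intervals.push_back(cut_pos[n] - cut_pos[n - 1]);
  assert(gf_intervals == (std::vector<int>{ 16, 16, 16 }));
  return 0;
}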
-
-StatusOr<GopStructList> AV1RateControlQMode::DetermineGopInfo(
- const FirstpassInfo &firstpass_info) {
- const int stats_size = static_cast<int>(firstpass_info.stats_list.size());
- GopStructList gop_list;
- RefFrameManager ref_frame_manager(rc_param_.ref_frame_table_size,
- rc_param_.max_ref_frames);
-
- // Make a copy of the first pass stats, and analyze them
- FirstpassInfo fp_info_copy = firstpass_info;
- av1_mark_flashes(fp_info_copy.stats_list.data(),
- fp_info_copy.stats_list.data() + stats_size);
- av1_estimate_noise(fp_info_copy.stats_list.data(),
- fp_info_copy.stats_list.data() + stats_size);
- av1_estimate_coeff(fp_info_copy.stats_list.data(),
- fp_info_copy.stats_list.data() + stats_size);
-
- int global_coding_idx_offset = 0;
- int global_order_idx_offset = 0;
- std::vector<int> key_frame_list = GetKeyFrameList(fp_info_copy);
- key_frame_list.push_back(stats_size); // a sentinel value
- for (size_t ki = 0; ki + 1 < key_frame_list.size(); ++ki) {
- int frames_to_key = key_frame_list[ki + 1] - key_frame_list[ki];
- int key_order_index = key_frame_list[ki]; // The key frame's display order
-
- std::vector<REGIONS> regions_list(MAX_FIRSTPASS_ANALYSIS_FRAMES);
- int total_regions = 0;
- av1_identify_regions(fp_info_copy.stats_list.data() + key_order_index,
- frames_to_key, 0, regions_list.data(), &total_regions);
- regions_list.resize(total_regions);
- std::vector<int> gf_intervals = PartitionGopIntervals(
- rc_param_, fp_info_copy.stats_list, regions_list, key_order_index,
- /*frames_since_key=*/0, frames_to_key);
- for (size_t gi = 0; gi < gf_intervals.size(); ++gi) {
- const bool has_key_frame = gi == 0;
- const int show_frame_count = gf_intervals[gi];
- GopStruct gop =
- ConstructGop(&ref_frame_manager, show_frame_count, has_key_frame,
- global_coding_idx_offset, global_order_idx_offset);
- assert(gop.show_frame_count == show_frame_count);
- global_coding_idx_offset += static_cast<int>(gop.gop_frame_list.size());
- global_order_idx_offset += gop.show_frame_count;
- gop_list.push_back(gop);
- }
- }
- return gop_list;
-}
-
-TplFrameDepStats CreateTplFrameDepStats(int frame_height, int frame_width,
- int min_block_size) {
- const int unit_rows = (frame_height + min_block_size - 1) / min_block_size;
- const int unit_cols = (frame_width + min_block_size - 1) / min_block_size;
- TplFrameDepStats frame_dep_stats;
- frame_dep_stats.unit_size = min_block_size;
- frame_dep_stats.unit_stats.resize(unit_rows);
- for (auto &row : frame_dep_stats.unit_stats) {
- row.resize(unit_cols);
- }
- return frame_dep_stats;
-}
-
-TplUnitDepStats TplBlockStatsToDepStats(const TplBlockStats &block_stats,
- int unit_count) {
- TplUnitDepStats dep_stats = {};
- dep_stats.intra_cost = block_stats.intra_cost * 1.0 / unit_count;
- dep_stats.inter_cost = block_stats.inter_cost * 1.0 / unit_count;
-  // In rare cases, inter_cost may be greater than intra_cost.
-  // If so, we clamp inter_cost so that inter_cost <= intra_cost, because that
-  // is required by GetPropagationFraction().
- dep_stats.inter_cost = std::min(dep_stats.intra_cost, dep_stats.inter_cost);
- dep_stats.mv = block_stats.mv;
- dep_stats.ref_frame_index = block_stats.ref_frame_index;
- return dep_stats;
-}
-
-namespace {
-Status ValidateBlockStats(const TplFrameStats &frame_stats,
- const TplBlockStats &block_stats,
- int min_block_size) {
- if (block_stats.col >= frame_stats.frame_width ||
- block_stats.row >= frame_stats.frame_height) {
- std::ostringstream error_message;
- error_message << "Block position (" << block_stats.col << ", "
- << block_stats.row
- << ") is out of range; frame dimensions are "
- << frame_stats.frame_width << " x "
- << frame_stats.frame_height;
- return { AOM_CODEC_INVALID_PARAM, error_message.str() };
- }
- if (block_stats.col % min_block_size != 0 ||
- block_stats.row % min_block_size != 0 ||
- block_stats.width % min_block_size != 0 ||
- block_stats.height % min_block_size != 0) {
- std::ostringstream error_message;
- error_message
- << "Invalid block position or dimension, must be a multiple of "
- << min_block_size << "; col = " << block_stats.col
- << ", row = " << block_stats.row << ", width = " << block_stats.width
- << ", height = " << block_stats.height;
- return { AOM_CODEC_INVALID_PARAM, error_message.str() };
- }
- return { AOM_CODEC_OK, "" };
-}
-
-Status ValidateTplStats(const GopStruct &gop_struct,
- const TplGopStats &tpl_gop_stats) {
- constexpr char kAdvice[] =
- "Do the current RateControlParam settings match those used to generate "
- "the TPL stats?";
- if (gop_struct.gop_frame_list.size() !=
- tpl_gop_stats.frame_stats_list.size()) {
- std::ostringstream error_message;
- error_message << "Frame count of GopStruct ("
- << gop_struct.gop_frame_list.size()
- << ") doesn't match frame count of TPL stats ("
- << tpl_gop_stats.frame_stats_list.size() << "). " << kAdvice;
- return { AOM_CODEC_INVALID_PARAM, error_message.str() };
- }
- for (int i = 0; i < static_cast<int>(gop_struct.gop_frame_list.size()); ++i) {
- const bool is_ref_frame = gop_struct.gop_frame_list[i].update_ref_idx >= 0;
- const bool has_tpl_stats =
- !tpl_gop_stats.frame_stats_list[i].block_stats_list.empty();
- if (is_ref_frame && !has_tpl_stats) {
- std::ostringstream error_message;
- error_message << "The frame with global_coding_idx "
- << gop_struct.gop_frame_list[i].global_coding_idx
- << " is a reference frame, but has no TPL stats. "
- << kAdvice;
- return { AOM_CODEC_INVALID_PARAM, error_message.str() };
- }
- }
- return { AOM_CODEC_OK, "" };
-}
-} // namespace
-
-StatusOr<TplFrameDepStats> CreateTplFrameDepStatsWithoutPropagation(
- const TplFrameStats &frame_stats) {
- if (frame_stats.block_stats_list.empty()) {
- return TplFrameDepStats();
- }
- const int min_block_size = frame_stats.min_block_size;
- const int unit_rows =
- (frame_stats.frame_height + min_block_size - 1) / min_block_size;
- const int unit_cols =
- (frame_stats.frame_width + min_block_size - 1) / min_block_size;
- TplFrameDepStats frame_dep_stats = CreateTplFrameDepStats(
- frame_stats.frame_height, frame_stats.frame_width, min_block_size);
- for (const TplBlockStats &block_stats : frame_stats.block_stats_list) {
- Status status =
- ValidateBlockStats(frame_stats, block_stats, min_block_size);
- if (!status.ok()) {
- return status;
- }
- const int block_unit_row = block_stats.row / min_block_size;
- const int block_unit_col = block_stats.col / min_block_size;
- // The block must start within the frame boundaries, but it may extend past
- // the right edge or bottom of the frame. Find the number of unit rows and
- // columns in the block which are fully within the frame.
- const int block_unit_rows = std::min(block_stats.height / min_block_size,
- unit_rows - block_unit_row);
- const int block_unit_cols = std::min(block_stats.width / min_block_size,
- unit_cols - block_unit_col);
- const int unit_count = block_unit_rows * block_unit_cols;
- TplUnitDepStats unit_stats =
- TplBlockStatsToDepStats(block_stats, unit_count);
- for (int r = 0; r < block_unit_rows; r++) {
- for (int c = 0; c < block_unit_cols; c++) {
- frame_dep_stats.unit_stats[block_unit_row + r][block_unit_col + c] =
- unit_stats;
- }
- }
- }
-
- frame_dep_stats.rdcost = TplFrameDepStatsAccumulateInterCost(frame_dep_stats);
-
- return frame_dep_stats;
-}
-
-int GetRefCodingIdxList(const TplUnitDepStats &unit_dep_stats,
- const RefFrameTable &ref_frame_table,
- int *ref_coding_idx_list) {
- int ref_frame_count = 0;
- for (int i = 0; i < kBlockRefCount; ++i) {
- ref_coding_idx_list[i] = -1;
- int ref_frame_index = unit_dep_stats.ref_frame_index[i];
- if (ref_frame_index != -1) {
- assert(ref_frame_index < static_cast<int>(ref_frame_table.size()));
- ref_coding_idx_list[i] = ref_frame_table[ref_frame_index].coding_idx;
- ref_frame_count++;
- }
- }
- return ref_frame_count;
-}
-
-int GetBlockOverlapArea(int r0, int c0, int r1, int c1, int size) {
- const int r_low = std::max(r0, r1);
- const int r_high = std::min(r0 + size, r1 + size);
- const int c_low = std::max(c0, c1);
- const int c_high = std::min(c0 + size, c1 + size);
- if (r_high >= r_low && c_high >= c_low) {
- return (r_high - r_low) * (c_high - c_low);
- }
- return 0;
-}
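A quick numeric check of the overlap computation (the helper below restates the same logic so the sketch is self-contained): two 16x16 units whose origins differ by (4, 4) overlap in a 12x12 region.

#include <algorithm>
#include <cassert>

static int Overlap(int r0, int c0, int r1, int c1, int size) {
  const int r_low = std::max(r0, r1);
  const int r_high = std::min(r0 + size, r1 + size);
  const int c_low = std::max(c0, c1);
  const int c_high = std::min(c0 + size, c1 + size);
  if (r_high >= r_low && c_high >= c_low)
    return (r_high - r_low) * (c_high - c_low);
  return 0;
}

int main() {
  assert(Overlap(0, 0, 4, 4, 16) == 144);  // 12 * 12 overlap
  assert(Overlap(0, 0, 16, 0, 16) == 0);   // edge-adjacent, no overlap
  assert(Overlap(0, 0, 20, 20, 16) == 0);  // fully disjoint
  return 0;
}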
-
-// TODO(angiebird): Merge TplFrameDepStatsAccumulateIntraCost and
-// TplFrameDepStatsAccumulate.
-double TplFrameDepStatsAccumulateIntraCost(
- const TplFrameDepStats &frame_dep_stats) {
- auto getIntraCost = [](double sum, const TplUnitDepStats &unit) {
- return sum + unit.intra_cost;
- };
- double sum = 0;
- for (const auto &row : frame_dep_stats.unit_stats) {
- sum = std::accumulate(row.begin(), row.end(), sum, getIntraCost);
- }
- return std::max(sum, 1.0);
-}
-
-double TplFrameDepStatsAccumulateInterCost(
- const TplFrameDepStats &frame_dep_stats) {
- auto getInterCost = [](double sum, const TplUnitDepStats &unit) {
- return sum + unit.inter_cost;
- };
- double sum = 0;
- for (const auto &row : frame_dep_stats.unit_stats) {
- sum = std::accumulate(row.begin(), row.end(), sum, getInterCost);
- }
- return std::max(sum, 1.0);
-}
-
-double TplFrameDepStatsAccumulate(const TplFrameDepStats &frame_dep_stats) {
- auto getOverallCost = [](double sum, const TplUnitDepStats &unit) {
- return sum + unit.propagation_cost + unit.intra_cost;
- };
- double sum = 0;
- for (const auto &row : frame_dep_stats.unit_stats) {
- sum = std::accumulate(row.begin(), row.end(), sum, getOverallCost);
- }
- return std::max(sum, 1.0);
-}
-
-// This is a generalization of GET_MV_RAWPEL that allows for an arbitrary
-// number of fractional bits.
-// TODO(angiebird): Add unit test to this function
-int GetFullpelValue(int subpel_value, int subpel_bits) {
- const int subpel_scale = (1 << subpel_bits);
- const int sign = subpel_value >= 0 ? 1 : -1;
- int fullpel_value = (abs(subpel_value) + subpel_scale / 2) >> subpel_bits;
- fullpel_value *= sign;
- return fullpel_value;
-}
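In the spirit of the TODO above, a minimal assertion-style sketch of the rounding behavior (round half away from zero at the given subpel precision); the helper restates the same logic so the sketch compiles on its own:

#include <cassert>
#include <cstdlib>

static int FullpelValue(int subpel_value, int subpel_bits) {
  const int subpel_scale = 1 << subpel_bits;
  const int sign = subpel_value >= 0 ? 1 : -1;
  return sign * ((abs(subpel_value) + subpel_scale / 2) >> subpel_bits);
}

int main() {
  assert(FullpelValue(8, 3) == 1);    // exactly 1.0 at 1/8-pel precision
  assert(FullpelValue(4, 3) == 1);    // 0.5 rounds away from zero
  assert(FullpelValue(3, 3) == 0);    // 0.375 rounds toward zero
  assert(FullpelValue(-4, 3) == -1);  // symmetric for negative values
  return 0;
}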
-
-double GetPropagationFraction(const TplUnitDepStats &unit_dep_stats) {
- assert(unit_dep_stats.intra_cost >= unit_dep_stats.inter_cost);
- return (unit_dep_stats.intra_cost - unit_dep_stats.inter_cost) /
- ModifyDivisor(unit_dep_stats.intra_cost);
-}
-
-void TplFrameDepStatsPropagate(int coding_idx,
- const RefFrameTable &ref_frame_table,
- TplGopDepStats *tpl_gop_dep_stats) {
- assert(!tpl_gop_dep_stats->frame_dep_stats_list.empty());
- TplFrameDepStats *frame_dep_stats =
- &tpl_gop_dep_stats->frame_dep_stats_list[coding_idx];
-
- if (frame_dep_stats->unit_stats.empty()) return;
-
- const int unit_size = frame_dep_stats->unit_size;
- const int frame_unit_rows =
- static_cast<int>(frame_dep_stats->unit_stats.size());
- const int frame_unit_cols =
- static_cast<int>(frame_dep_stats->unit_stats[0].size());
- for (int unit_row = 0; unit_row < frame_unit_rows; ++unit_row) {
- for (int unit_col = 0; unit_col < frame_unit_cols; ++unit_col) {
- TplUnitDepStats &unit_dep_stats =
- frame_dep_stats->unit_stats[unit_row][unit_col];
- int ref_coding_idx_list[kBlockRefCount] = { -1, -1 };
- int ref_frame_count = GetRefCodingIdxList(unit_dep_stats, ref_frame_table,
- ref_coding_idx_list);
- if (ref_frame_count == 0) continue;
- for (int i = 0; i < kBlockRefCount; ++i) {
- if (ref_coding_idx_list[i] == -1) continue;
- assert(
- ref_coding_idx_list[i] <
- static_cast<int>(tpl_gop_dep_stats->frame_dep_stats_list.size()));
- TplFrameDepStats &ref_frame_dep_stats =
- tpl_gop_dep_stats->frame_dep_stats_list[ref_coding_idx_list[i]];
- assert(!ref_frame_dep_stats.unit_stats.empty());
- const auto &mv = unit_dep_stats.mv[i];
- const int mv_row = GetFullpelValue(mv.row, mv.subpel_bits);
- const int mv_col = GetFullpelValue(mv.col, mv.subpel_bits);
- const int ref_pixel_r = unit_row * unit_size + mv_row;
- const int ref_pixel_c = unit_col * unit_size + mv_col;
- const int ref_unit_row_low =
- (unit_row * unit_size + mv_row) / unit_size;
- const int ref_unit_col_low =
- (unit_col * unit_size + mv_col) / unit_size;
-
- for (int j = 0; j < 2; ++j) {
- for (int k = 0; k < 2; ++k) {
- const int ref_unit_row = ref_unit_row_low + j;
- const int ref_unit_col = ref_unit_col_low + k;
- if (ref_unit_row >= 0 && ref_unit_row < frame_unit_rows &&
- ref_unit_col >= 0 && ref_unit_col < frame_unit_cols) {
- const int overlap_area = GetBlockOverlapArea(
- ref_pixel_r, ref_pixel_c, ref_unit_row * unit_size,
- ref_unit_col * unit_size, unit_size);
- const double overlap_ratio =
- overlap_area * 1.0 / (unit_size * unit_size);
- const double propagation_fraction =
- GetPropagationFraction(unit_dep_stats);
- const double propagation_ratio =
- 1.0 / ref_frame_count * overlap_ratio * propagation_fraction;
- TplUnitDepStats &ref_unit_stats =
- ref_frame_dep_stats.unit_stats[ref_unit_row][ref_unit_col];
- ref_unit_stats.propagation_cost +=
- (unit_dep_stats.intra_cost +
- unit_dep_stats.propagation_cost) *
- propagation_ratio;
- }
- }
- }
- }
- }
- }
-}
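To make the arithmetic of the inner loop concrete, a small sketch with hypothetical costs and a (4, 4) fullpel motion vector on 16x16 units: the four overlap ratios sum to one, so the entire amount (intra_cost + propagation_cost) * GetPropagationFraction(...) is distributed across the four touched reference units.

#include <cassert>
#include <cstdio>

int main() {
  const int unit_size = 16;
  // Hypothetical unit: intra_cost 100, inter_cost 40, no accumulated
  // propagation_cost, a single reference frame, fullpel mv of (4, 4).
  const double intra_cost = 100.0, inter_cost = 40.0, propagation_cost = 0.0;
  const double fraction = (intra_cost - inter_cost) / intra_cost;  // 0.6
  const int mv_row = 4, mv_col = 4;
  // Pixel overlap of the displaced unit with the four underlying ref units.
  const int areas[4] = { (unit_size - mv_row) * (unit_size - mv_col),  // 144
                         (unit_size - mv_row) * mv_col,                // 48
                         mv_row * (unit_size - mv_col),                // 48
                         mv_row * mv_col };                            // 16
  double total = 0.0;
  for (const int area : areas) {
    const double overlap_ratio = area / double(unit_size * unit_size);
    total += (intra_cost + propagation_cost) * fraction * overlap_ratio;
  }
  printf("propagated total = %f\n", total);  // 60.0: all of 100 * 0.6
  assert(total > 59.999 && total < 60.001);
  return 0;
}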
-
-std::vector<RefFrameTable> AV1RateControlQMode::GetRefFrameTableList(
- const GopStruct &gop_struct,
- const std::vector<LookaheadStats> &lookahead_stats,
- RefFrameTable ref_frame_table) {
- if (gop_struct.global_coding_idx_offset == 0) {
- // For the first GOP, ref_frame_table need not be initialized. This is fine,
- // because the first frame (a key frame) will fully initialize it.
- ref_frame_table.assign(rc_param_.ref_frame_table_size, GopFrameInvalid());
- } else {
- // It's not the first GOP, so ref_frame_table must be valid.
- assert(static_cast<int>(ref_frame_table.size()) ==
- rc_param_.ref_frame_table_size);
- assert(std::all_of(ref_frame_table.begin(), ref_frame_table.end(),
- std::mem_fn(&GopFrame::is_valid)));
- // Reset the frame processing order of the initial ref_frame_table.
- for (GopFrame &gop_frame : ref_frame_table) gop_frame.coding_idx = -1;
- }
-
- std::vector<RefFrameTable> ref_frame_table_list;
- ref_frame_table_list.push_back(ref_frame_table);
- for (const GopFrame &gop_frame : gop_struct.gop_frame_list) {
- if (gop_frame.is_key_frame) {
- ref_frame_table.assign(rc_param_.ref_frame_table_size, gop_frame);
- } else if (gop_frame.update_ref_idx != -1) {
- assert(gop_frame.update_ref_idx <
- static_cast<int>(ref_frame_table.size()));
- ref_frame_table[gop_frame.update_ref_idx] = gop_frame;
- }
- ref_frame_table_list.push_back(ref_frame_table);
- }
-
- int gop_size_offset = static_cast<int>(gop_struct.gop_frame_list.size());
-
- for (const auto &lookahead_stat : lookahead_stats) {
- for (GopFrame gop_frame : lookahead_stat.gop_struct->gop_frame_list) {
- if (gop_frame.is_key_frame) {
- ref_frame_table.assign(rc_param_.ref_frame_table_size, gop_frame);
- } else if (gop_frame.update_ref_idx != -1) {
- assert(gop_frame.update_ref_idx <
- static_cast<int>(ref_frame_table.size()));
- gop_frame.coding_idx += gop_size_offset;
- ref_frame_table[gop_frame.update_ref_idx] = gop_frame;
- }
- ref_frame_table_list.push_back(ref_frame_table);
- }
- gop_size_offset +=
- static_cast<int>(lookahead_stat.gop_struct->gop_frame_list.size());
- }
-
- return ref_frame_table_list;
-}
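A hedged usage sketch of the function above (variable names are hypothetical): for a GOP with n frames the returned list holds n + 1 snapshots, the table state before the first frame plus the state after each coded frame.

// Hypothetical usage; gop_struct and init_table are assumed to come from
// DetermineGopInfo() and the previous GOP's final snapshot, respectively.
// AV1RateControlQMode rc;
// std::vector<RefFrameTable> tables =
//     rc.GetRefFrameTableList(gop_struct, /*lookahead_stats=*/{}, init_table);
// assert(tables.size() == gop_struct.gop_frame_list.size() + 1);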
-
-StatusOr<TplGopDepStats> ComputeTplGopDepStats(
- const TplGopStats &tpl_gop_stats,
- const std::vector<LookaheadStats> &lookahead_stats,
- const std::vector<RefFrameTable> &ref_frame_table_list) {
- std::vector<const TplFrameStats *> tpl_frame_stats_list_with_lookahead;
- for (const auto &tpl_frame_stats : tpl_gop_stats.frame_stats_list) {
- tpl_frame_stats_list_with_lookahead.push_back(&tpl_frame_stats);
- }
- for (const auto &lookahead_stat : lookahead_stats) {
- for (const auto &tpl_frame_stats :
- lookahead_stat.tpl_gop_stats->frame_stats_list) {
- tpl_frame_stats_list_with_lookahead.push_back(&tpl_frame_stats);
- }
- }
-
- const int frame_count =
- static_cast<int>(tpl_frame_stats_list_with_lookahead.size());
-
- // Create the struct to store TPL dependency stats
- TplGopDepStats tpl_gop_dep_stats;
-
- tpl_gop_dep_stats.frame_dep_stats_list.reserve(frame_count);
- for (int coding_idx = 0; coding_idx < frame_count; coding_idx++) {
- const StatusOr<TplFrameDepStats> tpl_frame_dep_stats =
- CreateTplFrameDepStatsWithoutPropagation(
- *tpl_frame_stats_list_with_lookahead[coding_idx]);
- if (!tpl_frame_dep_stats.ok()) {
- return tpl_frame_dep_stats.status();
- }
- tpl_gop_dep_stats.frame_dep_stats_list.push_back(
- std::move(*tpl_frame_dep_stats));
- }
-
- // Back propagation
- for (int coding_idx = frame_count - 1; coding_idx >= 0; coding_idx--) {
- auto &ref_frame_table = ref_frame_table_list[coding_idx];
- // TODO(angiebird): Handle/test the case where reference frame
- // is in the previous GOP
- TplFrameDepStatsPropagate(coding_idx, ref_frame_table, &tpl_gop_dep_stats);
- }
- return tpl_gop_dep_stats;
-}
-
-static std::vector<uint8_t> SetupDeltaQ(const TplFrameDepStats &frame_dep_stats,
- int frame_width, int frame_height,
- int base_qindex,
- double frame_importance) {
- // TODO(jianj): Add support for various superblock sizes.
- const int sb_size = 64;
- const int delta_q_res = 4;
- const int num_unit_per_sb = sb_size / frame_dep_stats.unit_size;
- const int sb_rows = (frame_height + sb_size - 1) / sb_size;
- const int sb_cols = (frame_width + sb_size - 1) / sb_size;
- const int unit_rows = (frame_height + frame_dep_stats.unit_size - 1) /
- frame_dep_stats.unit_size;
- const int unit_cols =
- (frame_width + frame_dep_stats.unit_size - 1) / frame_dep_stats.unit_size;
- std::vector<uint8_t> superblock_q_indices;
- // Calculate delta_q offset for each superblock.
- for (int sb_row = 0; sb_row < sb_rows; ++sb_row) {
- for (int sb_col = 0; sb_col < sb_cols; ++sb_col) {
- double intra_cost = 0;
- double mc_dep_cost = 0;
- const int unit_row_start = sb_row * num_unit_per_sb;
- const int unit_row_end =
- std::min((sb_row + 1) * num_unit_per_sb, unit_rows);
- const int unit_col_start = sb_col * num_unit_per_sb;
- const int unit_col_end =
- std::min((sb_col + 1) * num_unit_per_sb, unit_cols);
- // A simplified version of av1_get_q_for_deltaq_objective()
- for (int unit_row = unit_row_start; unit_row < unit_row_end; ++unit_row) {
- for (int unit_col = unit_col_start; unit_col < unit_col_end;
- ++unit_col) {
- const TplUnitDepStats &unit_dep_stat =
- frame_dep_stats.unit_stats[unit_row][unit_col];
- intra_cost += unit_dep_stat.intra_cost;
- mc_dep_cost += unit_dep_stat.propagation_cost;
- }
- }
-
- double beta = 1.0;
- if (mc_dep_cost > 0 && intra_cost > 0) {
- const double r0 = 1 / frame_importance;
- const double rk = intra_cost / mc_dep_cost;
- beta = r0 / rk;
- assert(beta > 0.0);
- }
- int offset = av1_get_deltaq_offset(AOM_BITS_8, base_qindex, beta);
- offset = std::min(offset, delta_q_res * 9 - 1);
- offset = std::max(offset, -delta_q_res * 9 + 1);
- int qindex = offset + base_qindex;
- qindex = std::min(qindex, MAXQ);
- qindex = std::max(qindex, MINQ);
- qindex = av1_adjust_q_from_delta_q_res(delta_q_res, base_qindex, qindex);
- superblock_q_indices.push_back(static_cast<uint8_t>(qindex));
- }
- }
-
- return superblock_q_indices;
-}
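The per-superblock beta above compares the frame-level inverse importance r0 against the superblock's own intra/propagated cost ratio rk. A standalone sketch of just that arithmetic follows (the mapping from beta to a q offset is left to av1_get_deltaq_offset()):

// Sketch under the assumption that beta > 1 means the superblock contributes
// more to future frames than average and should receive a lower (finer) q.
static double BetaSketch(double frame_importance, double intra_cost,
                         double mc_dep_cost) {
  const double r0 = 1.0 / frame_importance;
  const double rk = intra_cost / mc_dep_cost;
  return r0 / rk;
}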
-
-static std::unordered_map<int, double> FindKMeansClusterMap(
- const std::vector<uint8_t> &qindices,
- const std::vector<double> ¢roids) {
- std::unordered_map<int, double> cluster_map;
- for (const uint8_t qindex : qindices) {
- double nearest_centroid = *std::min_element(
- centroids.begin(), centroids.end(),
- [qindex](const double centroid_a, const double centroid_b) {
- return fabs(centroid_a - qindex) < fabs(centroid_b - qindex);
- });
- cluster_map.insert({ qindex, nearest_centroid });
- }
- return cluster_map;
-}
-
-namespace internal {
-
-std::unordered_map<int, int> KMeans(std::vector<uint8_t> qindices, int k) {
- std::vector<double> centroids;
-  // Initialize the centroids with the first k distinct qindices.
- std::unordered_set<int> qindices_set;
-
- for (const uint8_t qp : qindices) {
- if (!qindices_set.insert(qp).second) continue; // Already added.
- centroids.push_back(qp);
- if (static_cast<int>(centroids.size()) >= k) break;
- }
-
- std::unordered_map<int, double> intermediate_cluster_map;
- while (true) {
- // Find the closest centroid for each qindex
- intermediate_cluster_map = FindKMeansClusterMap(qindices, centroids);
- // For each cluster, calculate the new centroids
- std::unordered_map<double, std::vector<int>> centroid_to_qindices;
- for (const auto &qindex_centroid : intermediate_cluster_map) {
- centroid_to_qindices[qindex_centroid.second].push_back(
- qindex_centroid.first);
- }
- bool centroids_changed = false;
- std::vector<double> new_centroids;
- for (const auto &cluster : centroid_to_qindices) {
- double sum = 0.0;
- for (const int qindex : cluster.second) {
- sum += qindex;
- }
- double new_centroid = sum / cluster.second.size();
- new_centroids.push_back(new_centroid);
- if (new_centroid != cluster.first) centroids_changed = true;
- }
- if (!centroids_changed) break;
- centroids = new_centroids;
- }
- std::unordered_map<int, int> cluster_map;
- for (const auto &qindex_centroid : intermediate_cluster_map) {
- cluster_map.insert(
- { qindex_centroid.first, static_cast<int>(qindex_centroid.second) });
- }
- return cluster_map;
-}
-} // namespace internal
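A hypothetical usage sketch of internal::KMeans(): per-superblock q indices are clustered into at most k representative values, and the returned map sends each input q index to its cluster centroid (cast to int).

// Hypothetical example values; not taken from the library's tests.
// std::vector<uint8_t> qindices = { 30, 31, 29, 60, 61, 90 };
// std::unordered_map<int, int> m = aom::internal::KMeans(qindices, 3);
// Expected: m[29], m[30], m[31] map to ~30; m[60], m[61] to ~60; m[90] to 90.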
-
-static int GetRDMult(const GopFrame &gop_frame, int q_index) {
- // TODO(angiebird):
- // 1) Check if these rdmult rules are good in our use case.
- // 2) Support high-bit-depth mode
- if (gop_frame.is_golden_frame) {
-    // Assume ARF_UPDATE/GF_UPDATE share the same rdmult rule.
- return av1_compute_rd_mult_based_on_qindex(AOM_BITS_8, GF_UPDATE, q_index);
- } else if (gop_frame.is_key_frame) {
- return av1_compute_rd_mult_based_on_qindex(AOM_BITS_8, KF_UPDATE, q_index);
- } else {
-    // Assume LF_UPDATE/OVERLAY_UPDATE/INTNL_OVERLAY_UPDATE/INTNL_ARF_UPDATE
-    // share the same rdmult rule.
- return av1_compute_rd_mult_based_on_qindex(AOM_BITS_8, LF_UPDATE, q_index);
- }
-}
-
-StatusOr<GopEncodeInfo> AV1RateControlQMode::GetGopEncodeInfoWithNoStats(
- const GopStruct &gop_struct) {
- GopEncodeInfo gop_encode_info;
- const int frame_count = static_cast<int>(gop_struct.gop_frame_list.size());
- for (int i = 0; i < frame_count; i++) {
- FrameEncodeParameters param;
- const GopFrame &gop_frame = gop_struct.gop_frame_list[i];
- // Use constant QP for TPL pass encoding. Keep the functionality
- // that allows QP changes across sub-gop.
- param.q_index = rc_param_.base_q_index;
- param.rdmult = av1_compute_rd_mult_based_on_qindex(AOM_BITS_8, LF_UPDATE,
- rc_param_.base_q_index);
- // TODO(jingning): gop_frame is needed in two pass tpl later.
- (void)gop_frame;
-
- if (rc_param_.tpl_pass_index) {
- if (gop_frame.update_type == GopFrameType::kRegularGolden ||
- gop_frame.update_type == GopFrameType::kRegularKey ||
- gop_frame.update_type == GopFrameType::kRegularArf) {
- double qstep_ratio = 1 / 3.0;
- param.q_index = av1_get_q_index_from_qstep_ratio(
- rc_param_.base_q_index, qstep_ratio, AOM_BITS_8);
- if (rc_param_.base_q_index) param.q_index = AOMMAX(param.q_index, 1);
- }
- }
- gop_encode_info.param_list.push_back(param);
- }
- return gop_encode_info;
-}
-
-StatusOr<GopEncodeInfo> AV1RateControlQMode::GetGopEncodeInfoWithFp(
- const GopStruct &gop_struct,
- const FirstpassInfo &firstpass_info AOM_UNUSED) {
- // TODO(b/260859962): This is currently a placeholder. Should use the fp
- // stats to calculate frame-level qp.
- return GetGopEncodeInfoWithNoStats(gop_struct);
-}
-
-StatusOr<GopEncodeInfo> AV1RateControlQMode::GetGopEncodeInfoWithTpl(
- const GopStruct &gop_struct, const TplGopStats &tpl_gop_stats,
- const std::vector<LookaheadStats> &lookahead_stats,
- const RefFrameTable &ref_frame_table_snapshot_init) {
- const std::vector<RefFrameTable> ref_frame_table_list = GetRefFrameTableList(
- gop_struct, lookahead_stats, ref_frame_table_snapshot_init);
-
- GopEncodeInfo gop_encode_info;
- gop_encode_info.final_snapshot = ref_frame_table_list.back();
- StatusOr<TplGopDepStats> gop_dep_stats = ComputeTplGopDepStats(
- tpl_gop_stats, lookahead_stats, ref_frame_table_list);
- if (!gop_dep_stats.ok()) {
- return gop_dep_stats.status();
- }
- const int frame_count =
- static_cast<int>(tpl_gop_stats.frame_stats_list.size());
- const int active_worst_quality = rc_param_.base_q_index;
- int active_best_quality = rc_param_.base_q_index;
- for (int i = 0; i < frame_count; i++) {
- FrameEncodeParameters param;
- const GopFrame &gop_frame = gop_struct.gop_frame_list[i];
-
- if (gop_frame.update_type == GopFrameType::kOverlay ||
- gop_frame.update_type == GopFrameType::kIntermediateOverlay ||
- gop_frame.update_type == GopFrameType::kRegularLeaf) {
- param.q_index = rc_param_.base_q_index;
- } else if (gop_frame.update_type == GopFrameType::kRegularGolden ||
- gop_frame.update_type == GopFrameType::kRegularKey ||
- gop_frame.update_type == GopFrameType::kRegularArf) {
- const TplFrameDepStats &frame_dep_stats =
- gop_dep_stats->frame_dep_stats_list[i];
- const double cost_without_propagation =
- TplFrameDepStatsAccumulateIntraCost(frame_dep_stats);
- const double cost_with_propagation =
- TplFrameDepStatsAccumulate(frame_dep_stats);
- const double frame_importance =
- cost_with_propagation / cost_without_propagation;
- // Imitate the behavior of av1_tpl_get_qstep_ratio()
- const double qstep_ratio = sqrt(1 / frame_importance);
- param.q_index = av1_get_q_index_from_qstep_ratio(rc_param_.base_q_index,
- qstep_ratio, AOM_BITS_8);
- if (rc_param_.base_q_index) param.q_index = AOMMAX(param.q_index, 1);
- active_best_quality = param.q_index;
-
- if (rc_param_.max_distinct_q_indices_per_frame > 1) {
- std::vector<uint8_t> superblock_q_indices = SetupDeltaQ(
- frame_dep_stats, rc_param_.frame_width, rc_param_.frame_height,
- param.q_index, frame_importance);
- std::unordered_map<int, int> qindex_centroids = internal::KMeans(
- superblock_q_indices, rc_param_.max_distinct_q_indices_per_frame);
-        for (size_t sb_idx = 0; sb_idx < superblock_q_indices.size();
-             ++sb_idx) {
-          const int curr_sb_qindex =
-              qindex_centroids.find(superblock_q_indices[sb_idx])->second;
- const int delta_q_res = 4;
- const int adjusted_qindex =
- param.q_index +
- (curr_sb_qindex - param.q_index) / delta_q_res * delta_q_res;
- const int rd_mult = GetRDMult(gop_frame, adjusted_qindex);
- param.superblock_encode_params.push_back(
- { static_cast<uint8_t>(adjusted_qindex), rd_mult });
- }
- }
- } else {
- // Intermediate ARFs
- assert(gop_frame.layer_depth >= 1);
- const int depth_factor = 1 << (gop_frame.layer_depth - 1);
- param.q_index =
- (active_worst_quality * (depth_factor - 1) + active_best_quality) /
- depth_factor;
- }
- param.rdmult = GetRDMult(gop_frame, param.q_index);
- gop_encode_info.param_list.push_back(param);
- }
- return gop_encode_info;
-}
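A numeric sketch of the ARF/key q derivation above, under the stated imitation of av1_tpl_get_qstep_ratio(): frame importance is the propagated cost divided by the intra-only cost, and the q-step ratio is its inverse square root.

#include <cmath>

static double QstepRatioSketch(double cost_with_propagation,
                               double cost_without_propagation) {
  const double frame_importance =
      cost_with_propagation / cost_without_propagation;
  return std::sqrt(1.0 / frame_importance);
}
// e.g. importance 4.0 -> qstep ratio 0.5, i.e. frames whose blocks feed many
// future frames are coded with a finer quantizer than the base.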
-
-StatusOr<GopEncodeInfo> AV1RateControlQMode::GetTplPassGopEncodeInfo(
- const GopStruct &gop_struct, const FirstpassInfo &firstpass_info) {
- return GetGopEncodeInfoWithFp(gop_struct, firstpass_info);
-}
-
-StatusOr<GopEncodeInfo> AV1RateControlQMode::GetGopEncodeInfo(
- const GopStruct &gop_struct, const TplGopStats &tpl_gop_stats,
- const std::vector<LookaheadStats> &lookahead_stats,
- const FirstpassInfo &firstpass_info AOM_UNUSED,
- const RefFrameTable &ref_frame_table_snapshot_init) {
- // When TPL stats are not valid, use first pass stats.
- Status status = ValidateTplStats(gop_struct, tpl_gop_stats);
- if (!status.ok()) {
- return status;
- }
-
- for (const auto &lookahead_stat : lookahead_stats) {
- Status status = ValidateTplStats(*lookahead_stat.gop_struct,
- *lookahead_stat.tpl_gop_stats);
- if (!status.ok()) {
- return status;
- }
- }
-
- // TODO(b/260859962): Currently firstpass stats are used as an alternative,
- // but we could also combine it with tpl results in the future for more
- // stable qp determination.
- return GetGopEncodeInfoWithTpl(gop_struct, tpl_gop_stats, lookahead_stats,
- ref_frame_table_snapshot_init);
-}
-
-} // namespace aom
diff --git a/av1/qmode_rc/ratectrl_qmode.h b/av1/qmode_rc/ratectrl_qmode.h
deleted file mode 100644
index f60000e..0000000
--- a/av1/qmode_rc/ratectrl_qmode.h
+++ /dev/null
@@ -1,141 +0,0 @@
-/*
- * Copyright (c) 2022, Alliance for Open Media. All rights reserved
- *
- * This source code is subject to the terms of the BSD 2 Clause License and
- * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
- * was not distributed with this source code in the LICENSE file, you can
- * obtain it at www.aomedia.org/license/software. If the Alliance for Open
- * Media Patent License 1.0 was not distributed with this source code in the
- * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
- */
-
-#ifndef AOM_AV1_QMODE_RC_RATECTRL_QMODE_H_
-#define AOM_AV1_QMODE_RC_RATECTRL_QMODE_H_
-
-#include <deque>
-#include <queue>
-#include <unordered_map>
-#include <vector>
-#include "av1/encoder/firstpass.h"
-#include "av1/qmode_rc/ratectrl_qmode_interface.h"
-#include "av1/qmode_rc/reference_manager.h"
-
-namespace aom {
-
-constexpr int kLayerDepthOffset = 1;
-constexpr int kMinIntervalToAddArf = 3;
-constexpr int kMinArfInterval = (kMinIntervalToAddArf + 1) / 2;
-
-struct TplUnitDepStats {
- double propagation_cost;
- double intra_cost;
- double inter_cost;
- std::array<MotionVector, kBlockRefCount> mv;
- std::array<int, kBlockRefCount> ref_frame_index;
-};
-
-struct TplFrameDepStats {
- int unit_size; // equivalent to min_block_size
- double rdcost; // overall rate-distortion cost
- std::vector<std::vector<TplUnitDepStats>> unit_stats;
-};
-
-struct TplGopDepStats {
- std::vector<TplFrameDepStats> frame_dep_stats_list;
-};
-
-GopFrame GopFrameInvalid();
-
-// Set up is_key_frame, is_arf_frame, is_show_frame, is_golden_frame and
-// encode_ref_mode in GopFrame based on gop_frame_type
-void SetGopFrameByType(GopFrameType gop_frame_type, GopFrame *gop_frame);
-
-GopFrame GopFrameBasic(int global_coding_idx_offset,
- int global_order_idx_offset, int coding_idx,
- int order_idx, int depth, int display_idx,
- GopFrameType gop_frame_type);
-
-GopStruct ConstructGop(RefFrameManager *ref_frame_manager, int show_frame_count,
- bool has_key_frame, int global_coding_idx_offset,
- int global_order_idx_offset);
-
-// Creates a TplFrameDepStats containing a 2D array of default-initialized
-// TplUnitDepStats, with dimensions of
-// ceil(frame_height / min_block_size) x ceil(frame_width / min_block_size).
-// i.e., there will be one entry for each square block of size min_block_size,
-// and blocks along the bottom or right edge of the frame may extend beyond the
-// edges of the frame.
-TplFrameDepStats CreateTplFrameDepStats(int frame_height, int frame_width,
- int min_block_size);
-
-TplUnitDepStats TplBlockStatsToDepStats(const TplBlockStats &block_stats,
- int unit_count);
-
-StatusOr<TplFrameDepStats> CreateTplFrameDepStatsWithoutPropagation(
- const TplFrameStats &frame_stats);
-
-std::vector<int> GetKeyFrameList(const FirstpassInfo &first_pass_info);
-
-double TplFrameDepStatsAccumulateIntraCost(
- const TplFrameDepStats &frame_dep_stats);
-
-double TplFrameDepStatsAccumulateInterCost(
- const TplFrameDepStats &frame_dep_stats);
-
-double TplFrameDepStatsAccumulate(const TplFrameDepStats &frame_dep_stats);
-
-void TplFrameDepStatsPropagate(int coding_idx,
- const RefFrameTable &ref_frame_table,
- TplGopDepStats *tpl_gop_dep_stats);
-
-int GetBlockOverlapArea(int r0, int c0, int r1, int c1, int size);
-
-namespace internal {
-std::unordered_map<int, int> KMeans(std::vector<uint8_t> qindices, int k);
-}
-
-StatusOr<TplGopDepStats> ComputeTplGopDepStats(
- const TplGopStats &tpl_gop_stats,
- const std::vector<LookaheadStats> &lookahead_stats,
- const std::vector<RefFrameTable> &ref_frame_table_list);
-
-class AV1RateControlQMode : public AV1RateControlQModeInterface {
- public:
- Status SetRcParam(const RateControlParam &rc_param) override;
- StatusOr<GopStructList> DetermineGopInfo(
- const FirstpassInfo &firstpass_info) override;
- StatusOr<GopEncodeInfo> GetGopEncodeInfo(
- const GopStruct &gop_struct, const TplGopStats &tpl_gop_stats,
- const std::vector<LookaheadStats> &lookahead_stats,
- const FirstpassInfo &firstpass_info,
- const RefFrameTable &ref_frame_table_snapshot) override;
- StatusOr<GopEncodeInfo> GetTplPassGopEncodeInfo(
- const GopStruct &gop_struct,
- const FirstpassInfo &firstpass_info) override;
-
- // Public for testing only.
-  // Returns snapshots of the ref frame table before and after each frame in
-  // gop_struct. The returned list will have n+1 entries for n frames.
-  // If this is the first GOP, ref_frame_table is ignored and all refs are
-  // assumed invalid; otherwise ref_frame_table is used as the initial state.
- std::vector<RefFrameTable> GetRefFrameTableList(
- const GopStruct &gop_struct,
- const std::vector<LookaheadStats> &lookahead_stats,
- RefFrameTable ref_frame_table);
-
- private:
- RateControlParam rc_param_;
-
- // Private methods to determine GOP encode info with different stats
- StatusOr<GopEncodeInfo> GetGopEncodeInfoWithNoStats(
- const GopStruct &gop_struct);
- StatusOr<GopEncodeInfo> GetGopEncodeInfoWithFp(
- const GopStruct &gop_struct, const FirstpassInfo &firstpass_info);
- StatusOr<GopEncodeInfo> GetGopEncodeInfoWithTpl(
- const GopStruct &gop_struct, const TplGopStats &tpl_gop_stats,
- const std::vector<LookaheadStats> &lookahead_stats,
- const RefFrameTable &ref_frame_table_snapshot_init);
-};
-} // namespace aom
-
-#endif // AOM_AV1_QMODE_RC_RATECTRL_QMODE_H_
diff --git a/av1/qmode_rc/ratectrl_qmode_interface.cc b/av1/qmode_rc/ratectrl_qmode_interface.cc
deleted file mode 100644
index 1f03e0c..0000000
--- a/av1/qmode_rc/ratectrl_qmode_interface.cc
+++ /dev/null
@@ -1,19 +0,0 @@
-/*
- * Copyright (c) 2022, Alliance for Open Media. All rights reserved
- *
- * This source code is subject to the terms of the BSD 2 Clause License and
- * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
- * was not distributed with this source code in the LICENSE file, you can
- * obtain it at www.aomedia.org/license/software. If the Alliance for Open
- * Media Patent License 1.0 was not distributed with this source code in the
- * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
- */
-
-#include "av1/qmode_rc/ratectrl_qmode_interface.h"
-
-namespace aom {
-
-AV1RateControlQModeInterface::AV1RateControlQModeInterface() = default;
-AV1RateControlQModeInterface::~AV1RateControlQModeInterface() = default;
-
-} // namespace aom
diff --git a/av1/qmode_rc/ratectrl_qmode_interface.h b/av1/qmode_rc/ratectrl_qmode_interface.h
deleted file mode 100644
index a7fff4a..0000000
--- a/av1/qmode_rc/ratectrl_qmode_interface.h
+++ /dev/null
@@ -1,358 +0,0 @@
-/*
- * Copyright (c) 2022, Alliance for Open Media. All rights reserved
- *
- * This source code is subject to the terms of the BSD 2 Clause License and
- * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
- * was not distributed with this source code in the LICENSE file, you can
- * obtain it at www.aomedia.org/license/software. If the Alliance for Open
- * Media Patent License 1.0 was not distributed with this source code in the
- * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
- */
-
-#ifndef AOM_AV1_QMODE_RC_RATECTRL_QMODE_INTERFACE_H_
-#define AOM_AV1_QMODE_RC_RATECTRL_QMODE_INTERFACE_H_
-
-#include <array>
-#include <string>
-#include <vector>
-
-#include "aom/aom_codec.h"
-#include "av1/encoder/firstpass.h"
-
-namespace aom {
-
-constexpr int kBlockRefCount = 2;
-
-struct MotionVector {
- int row; // subpel row
- int col; // subpel col
- // TODO(b/241589513): Move this to TplFrameStats; it's wasteful to code it
- // separately for each block.
- int subpel_bits; // number of fractional bits used by row/col
-};
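The propagation code converts these subpel coordinates to full-pel with GetFullpelValue(); that helper is defined elsewhere, so the following is only a guess at its behavior (a truncating shift that drops the fractional bits, ignoring any rounding the real helper may apply):

// Assumption, not the library implementation.
static int FullpelValueSketch(int subpel_value, int subpel_bits) {
  return subpel_value >> subpel_bits;
}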
-
-enum class TplPassCount {
- kOneTplPass = 1,
- kTwoTplPasses = 2,
-};
-
-struct RateControlParam {
- // Range of allowed GOP sizes (number of displayed frames).
- int max_gop_show_frame_count;
- int min_gop_show_frame_count;
- // Number of reference frame buffers, i.e., size of the DPB.
- int ref_frame_table_size;
- // Maximum number of references a single frame may use.
- int max_ref_frames;
-
- int base_q_index;
-
- // If greater than 1, enables per-superblock q_index, and limits the number of
- // unique q_index values which may be used in a frame (each of which will have
- // its own unique rdmult value).
- int max_distinct_q_indices_per_frame;
-
- // If per-superblock q_index is enabled and this is greater than 1, enables
- // additional per-superblock scaling of lambda, and limits the number of
- // unique lambda scale values which may be used in a frame.
- int max_distinct_lambda_scales_per_frame;
-
- int frame_width;
- int frame_height;
-
- // Total number of TPL passes.
- TplPassCount tpl_pass_count = TplPassCount::kOneTplPass;
- // Current TPL pass number, 0 or 1 (for GetTplPassGopEncodeInfo).
- int tpl_pass_index = 0;
-};
-
-struct TplBlockStats {
- int16_t height; // Pixel height.
- int16_t width; // Pixel width.
- int16_t row; // Pixel row of the top left corner.
- int16_t col; // Pixel col of the top left corner.
- int64_t intra_cost; // Rd cost of the best intra mode.
- int64_t inter_cost; // Rd cost of the best inter mode.
-
- // Valid only if TplFrameStats::rate_dist_present is true:
- int64_t recrf_rate; // Bits when using recon as reference.
- int64_t recrf_dist; // Distortion when using recon as reference.
- int64_t intra_pred_err; // Prediction residual of the intra mode.
- int64_t inter_pred_err; // Prediction residual of the inter mode.
-
- std::array<MotionVector, kBlockRefCount> mv;
- std::array<int, kBlockRefCount> ref_frame_index;
-};
-
-// GOP frame type, used to facilitate setting up GopFrame
-// TODO(angiebird): Define names for forward key frame and
-// key frame with overlay
-enum class GopFrameType {
- kRegularKey, // High quality key frame without overlay
- kRegularLeaf, // Regular leaf frame
- kRegularGolden, // Regular golden frame
- kRegularArf, // High quality arf with strong filtering followed by an overlay
- // later
- kOverlay, // Overlay frame
- kIntermediateOverlay, // Intermediate overlay frame
- kIntermediateArf, // Good quality arf with weak or no filtering followed by a
- // show_existing later
-};
-
-enum class EncodeRefMode {
- kRegular,
- kOverlay,
- kShowExisting,
-};
-
-enum class ReferenceName {
- kNoneFrame = -1,
- kIntraFrame = 0,
- kLastFrame = 1,
- kLast2Frame = 2,
- kLast3Frame = 3,
- kGoldenFrame = 4,
- kBwdrefFrame = 5,
- kAltref2Frame = 6,
- kAltrefFrame = 7,
-};
-
-struct Status {
- aom_codec_err_t code;
- std::string message; // Empty if code == AOM_CODEC_OK.
- bool ok() const { return code == AOM_CODEC_OK; }
-};
-
-// A very simple imitation of absl::StatusOr. It is conceptually a union of a
-// Status struct and an object of type T. It models an object that is either a
-// usable object, or an error explaining why such an object is not present. A
-// StatusOr<T> may never hold a status with a code of AOM_CODEC_OK.
-template <typename T>
-class StatusOr {
- public:
- StatusOr(const T &value) : value_(value) {}
- StatusOr(T &&value) : value_(std::move(value)) {}
- StatusOr(Status status) : status_(std::move(status)) {
- assert(status_.code != AOM_CODEC_OK);
- }
-
- const Status &status() const { return status_; }
- bool ok() const { return status().ok(); }
-
- // operator* returns the value; it should only be called after checking that
- // ok() returns true.
- const T &operator*() const & { return value_; }
- T &operator*() & { return value_; }
- const T &&operator*() const && { return value_; }
- T &&operator*() && { return std::move(value_); }
-
- // sor->field is equivalent to (*sor).field.
- const T *operator->() const & { return &value_; }
- T *operator->() & { return &value_; }
-
- // value() is equivalent to operator*, but asserts that ok() is true.
- const T &value() const & {
- assert(ok());
- return value_;
- }
- T &value() & {
- assert(ok());
- return value_;
- }
- const T &&value() const && {
- assert(ok());
- return value_;
- }
- T &&value() && {
- assert(ok());
- return std::move(value_);
- }
-
- private:
- T value_; // This could be std::optional<T> if it were available.
- Status status_ = { AOM_CODEC_OK, "" };
-};
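An illustrative (non-library) use of StatusOr<T>, following the contract documented above; AOM_CODEC_INVALID_PARAM is a standard aom_codec_err_t value.

// Returns the value on success, or a Status explaining the failure.
static StatusOr<int> ParsePositiveSketch(int x) {
  if (x <= 0) return Status{ AOM_CODEC_INVALID_PARAM, "x must be positive" };
  return x;
}
// Caller side:
// StatusOr<int> r = ParsePositiveSketch(5);
// if (!r.ok()) { /* inspect r.status().message */ } else { /* use *r */ }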
-
-struct ReferenceFrame {
- int index; // Index of reference slot containing the reference frame
- ReferenceName name;
-};
-
-struct GopFrame {
- // basic info
- bool is_valid;
- int order_idx; // Index in display order in a GOP
- int coding_idx; // Index in coding order in a GOP
- int display_idx; // The number of displayed frames preceding this frame in
- // a GOP
-
- int global_order_idx; // Index in display order in the whole video chunk
- int global_coding_idx; // Index in coding order in the whole video chunk
-
-  bool is_key_frame;    // If this is a key frame, the reference buffers
-                        // must be reset
-  bool is_arf_frame;    // Is this a forward frame, i.e., a frame with
-                        // order_idx higher than the current display order
- bool is_show_frame; // Is this frame a show frame after coding
- bool is_golden_frame; // Is this a high quality frame
-
- GopFrameType update_type; // This is a redundant field. It is only used for
- // easy conversion in SW integration.
-
- // reference frame info
- EncodeRefMode encode_ref_mode;
- int colocated_ref_idx; // colocated_ref_idx == -1 when encode_ref_mode ==
- // EncodeRefMode::kRegular
- int update_ref_idx; // The reference index that this frame should be
- // updated to. update_ref_idx == -1 when this frame
- // will not serve as a reference frame
- std::vector<ReferenceFrame>
- ref_frame_list; // A list of available reference frames in priority order
- // for the current to-be-coded frame. The list size
-                      // should be less than or equal to ref_frame_table_size. The
- // reference frames with smaller indices are more likely
- // to be a good reference frame. Therefore, they should
- // be prioritized when the reference frame count is
- // limited. For example, if we plan to use 3 reference
- // frames, we should choose ref_frame_list[0],
- // ref_frame_list[1] and ref_frame_list[2].
- int layer_depth; // Layer depth in the GOP structure
- ReferenceFrame primary_ref_frame; // We will use the primary reference frame
-                                    // to update the current frame's initial
- // probability model
-};
-
-struct GopStruct {
- int show_frame_count;
- int global_coding_idx_offset;
- int global_order_idx_offset;
-  // TODO(jingning): This can be removed once the framework is up and running.
-  int display_tracker;  // Track the number of frames displayed preceding the
-                        // current coding frame.
- std::vector<GopFrame> gop_frame_list;
-};
-
-using GopStructList = std::vector<GopStruct>;
-
-struct SuperblockEncodeParameters {
- int q_index;
- int rdmult;
-};
-
-struct FrameEncodeParameters {
- // Base q_index for the frame.
- int q_index;
-
- // Frame level Lagrangian multiplier.
- int rdmult;
-
- // If max_distinct_q_indices_per_frame <= 1, this will be empty.
- // Otherwise:
- // - There must be one entry per 64x64 superblock, in row-major order
- // - There may be no more than max_distinct_q_indices_per_frame unique q_index
- // values
- // - All entries with the same q_index must have the same rdmult
- // (If it's desired to use different rdmult values with the same q_index, this
- // must be done with superblock_lambda_scales.)
- std::vector<SuperblockEncodeParameters> superblock_encode_params;
-
- // If max_distinct_q_indices_per_frame <= 1 or
- // max_distinct_lambda_scales_per_frame <= 1, this will be empty. Otherwise,
- // it will have one entry per 64x64 superblock, in row-major order, with no
- // more than max_distinct_lambda_scales_per_frame unique values. Each entry
- // should be multiplied by the rdmult in the corresponding superblock's entry
- // in superblock_encode_params.
- std::vector<float> superblock_lambda_scales;
-};
-
-struct FirstpassInfo {
- int num_mbs_16x16; // Count of 16x16 unit blocks in each frame.
- // FIRSTPASS_STATS's unit block size is 16x16
- std::vector<FIRSTPASS_STATS> stats_list;
-};
-
-// In general, the number of elements in RefFrameTable must always equal
-// ref_frame_table_size (as specified in RateControlParam), but see
-// GetGopEncodeInfo for the one exception.
-using RefFrameTable = std::vector<GopFrame>;
-
-struct GopEncodeInfo {
- std::vector<FrameEncodeParameters> param_list;
- RefFrameTable final_snapshot; // RefFrameTable snapshot after coding this GOP
-};
-
-struct TplFrameStats {
- int min_block_size;
- int frame_width;
- int frame_height;
- bool rate_dist_present; // True if recrf_rate and recrf_dist are populated.
- std::vector<TplBlockStats> block_stats_list;
- // Optional stats computed with different settings, should be empty unless
- // tpl_pass_count == kTwoTplPasses.
- std::vector<TplBlockStats> alternate_block_stats_list;
-};
-
-struct TplGopStats {
- std::vector<TplFrameStats> frame_stats_list;
-};
-
-// Structure and TPL stats for a single GOP, to be used for lookahead.
-struct LookaheadStats {
- const GopStruct *gop_struct; // Not owned, may not be nullptr.
- const TplGopStats *tpl_gop_stats; // Not owned, may not be nullptr.
-};
-
-class AV1RateControlQModeInterface {
- public:
- AV1RateControlQModeInterface();
- virtual ~AV1RateControlQModeInterface();
-
- virtual Status SetRcParam(const RateControlParam &rc_param) = 0;
- virtual StatusOr<GopStructList> DetermineGopInfo(
- const FirstpassInfo &firstpass_info) = 0;
-
- // Accepts GOP structure and TPL info from the encoder and returns q index and
- // rdmult for each frame. This should be called with consecutive GOPs as
- // returned by DetermineGopInfo.
- //
- // GOP structure and TPL info from zero or more subsequent GOPs may optionally
- // be passed in lookahead_stats.
- //
- // For the first GOP, a default-constructed RefFrameTable may be passed in as
- // ref_frame_table_snapshot_init; for subsequent GOPs, it should be the
- // final_snapshot returned on the previous call.
- //
- // TODO(b/260859962): Remove these once all callers and overrides are gone.
- virtual StatusOr<GopEncodeInfo> GetGopEncodeInfo(
- const GopStruct &gop_struct AOM_UNUSED,
- const TplGopStats &tpl_gop_stats AOM_UNUSED,
- const std::vector<LookaheadStats> &lookahead_stats AOM_UNUSED,
- const RefFrameTable &ref_frame_table_snapshot AOM_UNUSED) {
- return Status{ AOM_CODEC_UNSUP_FEATURE, "Deprecated" };
- }
- virtual StatusOr<GopEncodeInfo> GetTplPassGopEncodeInfo(
- const GopStruct &gop_struct AOM_UNUSED) {
- return Status{ AOM_CODEC_UNSUP_FEATURE, "Deprecated" };
- }
-
- // Extensions to the API to pass in the first pass info. There should be stats
- // for all frames starting from the first frame of the GOP and continuing to
- // the end of the sequence.
- // TODO(b/260859962): Make pure virtual once all derived classes implement it.
- virtual StatusOr<GopEncodeInfo> GetGopEncodeInfo(
- const GopStruct &gop_struct AOM_UNUSED,
- const TplGopStats &tpl_gop_stats AOM_UNUSED,
- const std::vector<LookaheadStats> &lookahead_stats AOM_UNUSED,
- const FirstpassInfo &firstpass_info AOM_UNUSED,
- const RefFrameTable &ref_frame_table_snapshot AOM_UNUSED) {
- return Status{ AOM_CODEC_UNSUP_FEATURE, "Not yet implemented" };
- }
- virtual StatusOr<GopEncodeInfo> GetTplPassGopEncodeInfo(
- const GopStruct &gop_struct AOM_UNUSED,
- const FirstpassInfo &firstpass_info AOM_UNUSED) {
- return Status{ AOM_CODEC_UNSUP_FEATURE, "Not yet implemented" };
- }
-};
-} // namespace aom
-
-#endif // AOM_AV1_QMODE_RC_RATECTRL_QMODE_INTERFACE_H_
diff --git a/av1/qmode_rc/reference_manager.cc b/av1/qmode_rc/reference_manager.cc
deleted file mode 100644
index eea7b7d..0000000
--- a/av1/qmode_rc/reference_manager.cc
+++ /dev/null
@@ -1,339 +0,0 @@
-/*
- * Copyright (c) 2022, Alliance for Open Media. All rights reserved
- *
- * This source code is subject to the terms of the BSD 2 Clause License and
- * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
- * was not distributed with this source code in the LICENSE file, you can
- * obtain it at www.aomedia.org/license/software. If the Alliance for Open
- * Media Patent License 1.0 was not distributed with this source code in the
- * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
- */
-
-#include <algorithm>
-#include <set>
-#include <utility>
-#include <tuple>
-#include <vector>
-
-#include "av1/qmode_rc/reference_manager.h"
-#include "av1/qmode_rc/ratectrl_qmode.h"
-
-namespace aom {
-
-void RefFrameManager::Reset() {
- free_ref_idx_list_.clear();
- for (int i = 0; i < static_cast<int>(ref_frame_table_.size()); ++i) {
- free_ref_idx_list_.push_back(i);
- ref_frame_table_[i] = GopFrameInvalid();
- }
- forward_stack_.clear();
- backward_queue_.clear();
- last_queue_.clear();
-}
-
-int RefFrameManager::AllocateRefIdx() {
- if (free_ref_idx_list_.empty()) {
- size_t backward_size = backward_queue_.size();
- size_t last_size = last_queue_.size();
- if (last_size >= backward_size) {
- int ref_idx = last_queue_.front();
- last_queue_.pop_front();
- free_ref_idx_list_.push_back(ref_idx);
- } else {
- int ref_idx = backward_queue_.front();
- backward_queue_.pop_front();
- free_ref_idx_list_.push_back(ref_idx);
- }
- }
-
- int ref_idx = free_ref_idx_list_.front();
- free_ref_idx_list_.pop_front();
- return ref_idx;
-}
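A standalone sketch of the eviction rule above: when no slot is free, recycle from whichever of the last/backward pools is larger (preferring "last" on ties), always dropping the oldest entry at the front.

#include <deque>

static int EvictOldestSketch(std::deque<int> &last_q,
                             std::deque<int> &backward_q) {
  std::deque<int> &victim =
      (last_q.size() >= backward_q.size()) ? last_q : backward_q;
  const int ref_idx = victim.front();
  victim.pop_front();
  return ref_idx;
}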
-
-int RefFrameManager::GetRefFrameCountByType(
- RefUpdateType ref_update_type) const {
- size_t cnt = 0;
- switch (ref_update_type) {
- case RefUpdateType::kForward: cnt = forward_stack_.size(); break;
- case RefUpdateType::kBackward: cnt = backward_queue_.size(); break;
- case RefUpdateType::kLast: cnt = last_queue_.size(); break;
- case RefUpdateType::kNone: cnt = 0; break;
- }
- return static_cast<int>(cnt);
-}
-
-int RefFrameManager::GetRefFrameCount() const {
- return GetRefFrameCountByType(RefUpdateType::kForward) +
- GetRefFrameCountByType(RefUpdateType::kBackward) +
- GetRefFrameCountByType(RefUpdateType::kLast);
-}
-
-// TODO(angiebird): Add unit test.
-// Find the ref_idx corresponding to a ref_update_type.
-// Return -1 if no ref frame is found.
-// The priority_idx indicates closeness between the current frame and
-// the ref frame in display order.
-// For example, ref_update_type == kForward and priority_idx == 0 means
-// find the closest ref frame in forward_stack_.
-int RefFrameManager::GetRefFrameIdxByPriority(RefUpdateType ref_update_type,
- int priority_idx) const {
- if (ref_update_type == RefUpdateType::kForward) {
- int size = static_cast<int>(forward_stack_.size());
-    // When two or more forward reference frames can be used, first take
-    // the highest quality one as the ARF, then go from the nearest to
-    // the more distant ones in the forward reference frame list.
- if (priority_idx < size) {
- if (allow_two_fwd_frames_) {
- if (priority_idx == 0) return forward_stack_[0];
- return forward_stack_[size - priority_idx];
- }
-
- // Handle the special case where only one forward reference frame
- // can be used. In this setting, we prefer the nearest frame.
- return forward_stack_[size - 1 - priority_idx];
- }
- } else if (ref_update_type == RefUpdateType::kBackward) {
- int size = static_cast<int>(backward_queue_.size());
- if (priority_idx < size) {
- return backward_queue_[size - priority_idx - 1];
- }
- } else if (ref_update_type == RefUpdateType::kLast) {
- int size = static_cast<int>(last_queue_.size());
- if (priority_idx < size) {
- return last_queue_[size - priority_idx - 1];
- }
- }
- return -1;
-}
-
-// The priority_idx indicates closeness between the current frame and
-// the ref frame in display order.
-// For example, ref_update_type == kForward and priority_idx == 0 means
-// find the closest ref frame in forward_stack_.
-GopFrame RefFrameManager::GetRefFrameByPriority(RefUpdateType ref_update_type,
- int priority_idx) const {
- int ref_idx = GetRefFrameIdxByPriority(ref_update_type, priority_idx);
- if (ref_idx == -1) {
- return GopFrameInvalid();
- }
- assert(ref_frame_table_[ref_idx].update_ref_idx == ref_idx);
- return ref_frame_table_[ref_idx];
-}
-
-GopFrame RefFrameManager::GetRefFrameByIndex(int ref_idx) const {
- return ref_frame_table_[ref_idx];
-}
-
-ReferenceName get_ref_name(RefUpdateType ref_update_type, int priority_idx,
- const std::set<ReferenceName> &used_name_set) {
-// TODO(angiebird): Find a better way to assign name lists.
-// Maybe sort the names based on how frequently each name has been used in
-// the past?
- const std::vector<ReferenceName> forward_name_list{
- ReferenceName::kAltrefFrame, ReferenceName::kBwdrefFrame,
- ReferenceName::kAltref2Frame, ReferenceName::kGoldenFrame,
- ReferenceName::kLast3Frame, ReferenceName::kLast2Frame,
- ReferenceName::kLastFrame
- };
- const std::vector<ReferenceName> backward_name_list{
- ReferenceName::kGoldenFrame, ReferenceName::kLastFrame,
- ReferenceName::kLast2Frame, ReferenceName::kLast3Frame,
- ReferenceName::kBwdrefFrame, ReferenceName::kAltref2Frame,
- ReferenceName::kAltrefFrame
- };
- const std::vector<ReferenceName> last_name_list{
- ReferenceName::kLastFrame, ReferenceName::kLast2Frame,
- ReferenceName::kLast3Frame, ReferenceName::kGoldenFrame,
- ReferenceName::kBwdrefFrame, ReferenceName::kAltref2Frame,
- ReferenceName::kAltrefFrame
- };
-
- const std::vector<ReferenceName> *name_list = nullptr;
- switch (ref_update_type) {
- case RefUpdateType::kForward: name_list = &forward_name_list; break;
- case RefUpdateType::kBackward: name_list = &backward_name_list; break;
- case RefUpdateType::kLast: name_list = &last_name_list; break;
- case RefUpdateType::kNone: break;
- }
-
- if (name_list) {
- const int name_list_size = static_cast<int>(name_list->size());
- for (int idx = priority_idx; idx < name_list_size; ++idx) {
- ReferenceName ref_name = name_list->at(idx);
- bool not_used = used_name_set.find(ref_name) == used_name_set.end();
- if (not_used) return ref_name;
- }
- }
- return ReferenceName::kNoneFrame;
-}
-
-// Generate a list of available reference frames in priority order for the
-// current to-be-coded frame. The list size should be less than or equal to
-// the size of ref_frame_table_. The reference frames with smaller indices
-// are more likely to be good reference frames, so they should be prioritized
-// when the reference frame count is limited. For example, if we plan to use 3
-// reference frames, we should choose ref_frame_list[0], ref_frame_list[1] and
-// ref_frame_list[2].
-std::vector<ReferenceFrame> RefFrameManager::GetRefFrameListByPriority() const {
- constexpr int round_robin_size = 3;
- const std::vector<RefUpdateType> round_robin_list{ RefUpdateType::kForward,
- RefUpdateType::kBackward,
- RefUpdateType::kLast };
- std::vector<int> priority_idx_list(round_robin_size, 0);
- int available_ref_frames = GetRefFrameCount();
- std::vector<ReferenceFrame> ref_frame_list;
- int ref_frame_count = 0;
- int round_robin_idx = 0;
-
- std::set<ReferenceName> used_name_set;
- while (ref_frame_count < available_ref_frames &&
- ref_frame_count < max_ref_frames_) {
- const RefUpdateType ref_update_type = round_robin_list[round_robin_idx];
- int priority_idx = priority_idx_list[round_robin_idx];
- int ref_idx = GetRefFrameIdxByPriority(ref_update_type, priority_idx);
- if (ref_idx != -1) {
- const ReferenceName name =
- get_ref_name(ref_update_type, priority_idx, used_name_set);
- assert(name != ReferenceName::kNoneFrame);
- used_name_set.insert(name);
- ReferenceFrame ref_frame = { ref_idx, name };
- ref_frame_list.push_back(ref_frame);
- ++ref_frame_count;
- ++priority_idx_list[round_robin_idx];
- }
- round_robin_idx = (round_robin_idx + 1) % round_robin_size;
- }
- return ref_frame_list;
-}
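A standalone sketch of the round-robin interleaving used above, with the three pools generalized to a vector (forward, backward, last in the real code):

#include <cstddef>
#include <vector>

static std::vector<int> RoundRobinPickSketch(
    const std::vector<std::vector<int>> &pools, int max_refs) {
  std::vector<int> picked;
  if (pools.empty() || max_refs <= 0) return picked;
  std::vector<size_t> next(pools.size(), 0);
  size_t remaining = 0;
  for (const auto &p : pools) remaining += p.size();
  size_t pool = 0;
  // Visit the pools in turn, taking the next candidate from each non-empty
  // pool until the budget or the candidates run out.
  while (static_cast<int>(picked.size()) < max_refs && remaining > 0) {
    if (next[pool] < pools[pool].size()) {
      picked.push_back(pools[pool][next[pool]++]);
      --remaining;
    }
    pool = (pool + 1) % pools.size();
  }
  return picked;
}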
-
-void RefFrameManager::UpdateOrder(int global_order_idx) {
- cur_global_order_idx_ = global_order_idx;
- if (forward_stack_.empty()) {
- return;
- }
- int ref_idx = forward_stack_.back();
- const GopFrame &gf_frame = ref_frame_table_[ref_idx];
-
-  // If the frame currently being processed is an overlay / show-existing frame.
- if (gf_frame.global_order_idx == global_order_idx) {
- forward_stack_.pop_back();
- if (gf_frame.is_golden_frame) {
- // high quality frame
- backward_queue_.push_back(ref_idx);
- } else {
- last_queue_.push_back(ref_idx);
- }
- }
-}
-
-int RefFrameManager::ColocatedRefIdx(int global_order_idx) {
- if (forward_stack_.empty()) return -1;
- int ref_idx = forward_stack_.back();
- int arf_global_order_idx = ref_frame_table_[ref_idx].global_order_idx;
- if (arf_global_order_idx == global_order_idx) {
- return ref_idx;
- }
- return -1;
-}
-
-static RefUpdateType infer_ref_update_type(const GopFrame &gop_frame,
- int cur_global_order_idx) {
- if (gop_frame.global_order_idx > cur_global_order_idx) {
- return RefUpdateType::kForward;
- }
- if (gop_frame.is_golden_frame) {
- return RefUpdateType::kBackward;
- }
- if (gop_frame.encode_ref_mode == EncodeRefMode::kShowExisting ||
- gop_frame.encode_ref_mode == EncodeRefMode::kOverlay) {
- return RefUpdateType::kNone;
- }
- return RefUpdateType::kLast;
-}
-
-using PrimaryRefKey = std::tuple<int, // abs layer_depth delta
- bool, // is_key_frame differs
- bool, // is_golden_frame differs
- bool, // is_arf_frame differs
- bool, // is_show_frame differs
- bool, // encode_ref_mode differs
- int>; // abs order_idx delta
-
-// Generate PrimaryRefKey based on abs layer_depth delta,
-// frame flags and abs order_idx delta. These are the fields that will
-// be used to pick the primary reference frame for the probability model.
-static PrimaryRefKey get_primary_ref_key(const GopFrame &cur_frame,
- const GopFrame &ref_frame) {
- return std::make_tuple(abs(cur_frame.layer_depth - ref_frame.layer_depth),
- cur_frame.is_key_frame != ref_frame.is_key_frame,
- cur_frame.is_golden_frame != ref_frame.is_golden_frame,
- cur_frame.is_arf_frame != ref_frame.is_arf_frame,
- cur_frame.is_show_frame != ref_frame.is_show_frame,
- cur_frame.encode_ref_mode != ref_frame.encode_ref_mode,
- abs(cur_frame.order_idx - ref_frame.order_idx));
-}
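Because std::tuple compares lexicographically, the sort in GetPrimaryRefFrame() below prefers the smallest layer-depth delta first, with the later fields only breaking ties. A tiny self-contained check of that ordering:

#include <cassert>
#include <tuple>

static void PrimaryRefKeyOrderSketch() {
  // Layer-depth delta (the first field) dominates all later fields.
  const auto a = std::make_tuple(0, false, 7);
  const auto b = std::make_tuple(1, false, 0);
  assert(a < b);
}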
-
-// Pick primary_ref_idx for probability model.
-ReferenceFrame RefFrameManager::GetPrimaryRefFrame(
- const GopFrame &gop_frame) const {
- assert(gop_frame.is_valid);
- std::vector<std::pair<PrimaryRefKey, int>> candidate_list;
- for (auto &ref_frame_in_gop_frame : gop_frame.ref_frame_list) {
- const GopFrame &ref_frame = ref_frame_table_[ref_frame_in_gop_frame.index];
- if (ref_frame.is_valid) {
- assert(ref_frame_in_gop_frame.index == ref_frame.update_ref_idx);
- PrimaryRefKey key = get_primary_ref_key(gop_frame, ref_frame);
- std::pair<PrimaryRefKey, int> candidate = {
- key, ref_frame_in_gop_frame.index
- };
- candidate_list.push_back(candidate);
- }
- }
-
- std::sort(candidate_list.begin(), candidate_list.end());
-
- ReferenceFrame ref_frame = { -1, ReferenceName::kNoneFrame };
- assert(candidate_list.size() == gop_frame.ref_frame_list.size());
- if (!candidate_list.empty()) {
- int ref_idx = candidate_list[0].second;
- for (const auto &frame : gop_frame.ref_frame_list) {
- if (frame.index == ref_idx) {
- ref_frame = frame;
- }
- }
- }
- return ref_frame;
-}
-
-void RefFrameManager::UpdateRefFrameTable(GopFrame *gop_frame) {
- allow_two_fwd_frames_ =
- (max_ref_frames_ - !!GetRefFrameCountByType(RefUpdateType::kBackward) -
- !!GetRefFrameCountByType(RefUpdateType::kLast)) >= 2;
- gop_frame->ref_frame_list = GetRefFrameListByPriority();
- gop_frame->primary_ref_frame = GetPrimaryRefFrame(*gop_frame);
- gop_frame->colocated_ref_idx = ColocatedRefIdx(gop_frame->global_order_idx);
-
- if (gop_frame->is_show_frame) {
- UpdateOrder(gop_frame->global_order_idx);
- }
- // Call infer_ref_update_type() after UpdateOrder() so that
- // cur_global_order_idx_ is up-to-date
- RefUpdateType ref_update_type =
- infer_ref_update_type(*gop_frame, cur_global_order_idx_);
- if (ref_update_type == RefUpdateType::kNone) {
- gop_frame->update_ref_idx = -1;
- } else {
- const int ref_idx = AllocateRefIdx();
- gop_frame->update_ref_idx = ref_idx;
- switch (ref_update_type) {
- case RefUpdateType::kForward: forward_stack_.push_back(ref_idx); break;
- case RefUpdateType::kBackward: backward_queue_.push_back(ref_idx); break;
- case RefUpdateType::kLast: last_queue_.push_back(ref_idx); break;
- case RefUpdateType::kNone: break;
- }
- ref_frame_table_[ref_idx] = *gop_frame;
- }
-}
-
-} // namespace aom
diff --git a/av1/qmode_rc/reference_manager.h b/av1/qmode_rc/reference_manager.h
deleted file mode 100644
index 37b5038..0000000
--- a/av1/qmode_rc/reference_manager.h
+++ /dev/null
@@ -1,95 +0,0 @@
-/*
- * Copyright (c) 2022, Alliance for Open Media. All rights reserved
- *
- * This source code is subject to the terms of the BSD 2 Clause License and
- * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
- * was not distributed with this source code in the LICENSE file, you can
- * obtain it at www.aomedia.org/license/software. If the Alliance for Open
- * Media Patent License 1.0 was not distributed with this source code in the
- * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
- */
-
-#ifndef AOM_AV1_QMODE_RC_REFERENCE_MANAGER_H_
-#define AOM_AV1_QMODE_RC_REFERENCE_MANAGER_H_
-
-#include <deque>
-#include <iostream>
-#include <vector>
-
-#include "av1/qmode_rc/ratectrl_qmode_interface.h"
-
-namespace aom {
-
-enum class RefUpdateType { kForward, kBackward, kLast, kNone };
-
-class RefFrameManager {
- public:
- explicit RefFrameManager(int ref_frame_table_size, int max_ref_frames)
- : ref_frame_table_(ref_frame_table_size),
- max_ref_frames_(max_ref_frames) {
-    // forward_max_size_ defines the max number of ARF frames that can exist
-    // at the same time. In other words, it's the max size of forward_stack_.
- // TODO(angiebird): Figure out if this number is optimal.
- forward_max_size_ = ref_frame_table_size - 2;
- cur_global_order_idx_ = 0;
- Reset();
- }
- ~RefFrameManager() = default;
-
- RefFrameManager(const RefFrameManager &) = delete;
- RefFrameManager &operator=(const RefFrameManager &) = delete;
-
- friend std::ostream &operator<<(std::ostream &os,
- const RefFrameManager &rfm) {
- os << "=" << std::endl;
- os << "forward: ";
- for (const auto &ref_idx : rfm.forward_stack_) {
- os << rfm.ref_frame_table_[ref_idx].order_idx << " ";
- }
- os << std::endl;
- os << "backward: ";
- for (const auto &ref_idx : rfm.backward_queue_) {
- os << rfm.ref_frame_table_[ref_idx].order_idx << " ";
- }
- os << std::endl;
- os << "last: ";
- for (const auto &ref_idx : rfm.last_queue_) {
- os << rfm.ref_frame_table_[ref_idx].order_idx << " ";
- }
- os << std::endl;
- return os;
- }
-
- void Reset();
- int AllocateRefIdx();
- int GetRefFrameCountByType(RefUpdateType ref_update_type) const;
- int GetRefFrameCount() const;
- std::vector<ReferenceFrame> GetRefFrameListByPriority() const;
- int GetRefFrameIdxByPriority(RefUpdateType ref_update_type,
- int priority_idx) const;
- GopFrame GetRefFrameByPriority(RefUpdateType ref_update_type,
- int priority_idx) const;
- GopFrame GetRefFrameByIndex(int ref_idx) const;
- void UpdateOrder(int global_order_idx);
- int ColocatedRefIdx(int global_order_idx);
- int ForwardMaxSize() const { return forward_max_size_; }
- int MaxRefFrame() const { return max_ref_frames_; }
- int CurGlobalOrderIdx() const { return cur_global_order_idx_; }
- void UpdateRefFrameTable(GopFrame *gop_frame);
- ReferenceFrame GetPrimaryRefFrame(const GopFrame &gop_frame) const;
-
- private:
- int forward_max_size_;
- int cur_global_order_idx_;
- RefFrameTable ref_frame_table_;
- int max_ref_frames_;
- bool allow_two_fwd_frames_;
- std::deque<int> free_ref_idx_list_;
- std::vector<int> forward_stack_;
- std::deque<int> backward_queue_;
- std::deque<int> last_queue_;
-};
-
-} // namespace aom
-
-#endif // AOM_AV1_QMODE_RC_REFERENCE_MANAGER_H_
diff --git a/av1/ratectrl_rtc.cc b/av1/ratectrl_rtc.cc
index 6cf53f0..a3ec6f6 100644
--- a/av1/ratectrl_rtc.cc
+++ b/av1/ratectrl_rtc.cc
@@ -19,6 +19,8 @@
#include "aom_mem/aom_mem.h"
#include "av1/encoder/encoder.h"
#include "av1/encoder/encoder_utils.h"
+#include "av1/encoder/pickcdef.h"
+#include "av1/encoder/picklpf.h"
#include "av1/encoder/ratectrl.h"
#include "av1/encoder/rc_utils.h"
#include "av1/encoder/svc_layercontext.h"
@@ -38,6 +40,7 @@
max_intra_bitrate_pct = 50;
max_inter_bitrate_pct = 0;
framerate = 30.0;
+ ss_number_layers = 1;
ts_number_layers = 1;
aq_mode = 0;
layer_target_bitrate[0] = static_cast<int>(target_bandwidth);
@@ -68,9 +71,7 @@
av1_zero(*rc_api->cpi_->ppi);
rc_api->cpi_->common.seq_params = &rc_api->cpi_->ppi->seq_params;
av1_zero(*rc_api->cpi_->common.seq_params);
- const int num_layers = cfg.ss_number_layers * cfg.ts_number_layers;
- if (!av1_alloc_layer_context(rc_api->cpi_, num_layers)) return nullptr;
- rc_api->InitRateControl(cfg);
+ if (!rc_api->InitRateControl(cfg)) return nullptr;
if (cfg.aq_mode) {
AV1_COMP *const cpi = rc_api->cpi_;
cpi->enc_seg.map = static_cast<uint8_t *>(aom_calloc(
@@ -110,7 +111,7 @@
}
}
-void AV1RateControlRTC::InitRateControl(const AV1RateControlRtcConfig &rc_cfg) {
+bool AV1RateControlRTC::InitRateControl(const AV1RateControlRtcConfig &rc_cfg) {
AV1_COMMON *cm = &cpi_->common;
AV1EncoderConfig *oxcf = &cpi_->oxcf;
RATE_CONTROL *const rc = &cpi_->rc;
@@ -126,13 +127,14 @@
oxcf->rc_cfg.drop_frames_water_mark = 0;
oxcf->tool_cfg.bit_depth = AOM_BITS_8;
oxcf->tool_cfg.superblock_size = AOM_SUPERBLOCK_SIZE_DYNAMIC;
+ oxcf->algo_cfg.loopfilter_control = LOOPFILTER_ALL;
cm->current_frame.frame_number = 0;
cpi_->ppi->p_rc.kf_boost = DEFAULT_KF_BOOST_RT;
for (auto &lvl_idx : oxcf->target_seq_level_idx) lvl_idx = SEQ_LEVEL_MAX;
memcpy(cpi_->ppi->level_params.target_seq_level_idx,
oxcf->target_seq_level_idx, sizeof(oxcf->target_seq_level_idx));
- UpdateRateControl(rc_cfg);
+ if (!UpdateRateControl(rc_cfg)) return false;
set_sb_size(cm->seq_params,
av1_select_sb_size(oxcf, cm->width, cm->height,
cpi_->svc.number_spatial_layers));
@@ -146,14 +148,24 @@
// Enable external rate control.
cpi_->rc.rtc_external_ratectrl = 1;
cpi_->sf.rt_sf.use_nonrd_pick_mode = 1;
+ return true;
}
-void AV1RateControlRTC::UpdateRateControl(
+bool AV1RateControlRTC::UpdateRateControl(
const AV1RateControlRtcConfig &rc_cfg) {
+ if (rc_cfg.ss_number_layers < 1 ||
+ rc_cfg.ss_number_layers > AOM_MAX_SS_LAYERS ||
+ rc_cfg.ts_number_layers < 1 ||
+ rc_cfg.ts_number_layers > AOM_MAX_TS_LAYERS) {
+ return false;
+ }
+ const int num_layers = rc_cfg.ss_number_layers * rc_cfg.ts_number_layers;
+ if (num_layers > 1 && !av1_alloc_layer_context(cpi_, num_layers)) {
+ return false;
+ }
AV1_COMMON *cm = &cpi_->common;
AV1EncoderConfig *oxcf = &cpi_->oxcf;
RATE_CONTROL *const rc = &cpi_->rc;
-
initial_width_ = rc_cfg.width;
initial_height_ = rc_cfg.height;
cm->width = rc_cfg.width;
@@ -180,35 +192,38 @@
cpi_->svc.number_temporal_layers = rc_cfg.ts_number_layers;
set_primary_rc_buffer_sizes(oxcf, cpi_->ppi);
enc_set_mb_mi(&cm->mi_params, cm->width, cm->height, BLOCK_8X8);
- int64_t target_bandwidth_svc = 0;
- for (int sl = 0; sl < cpi_->svc.number_spatial_layers; ++sl) {
- for (int tl = 0; tl < cpi_->svc.number_temporal_layers; ++tl) {
- const int layer =
- LAYER_IDS_TO_IDX(sl, tl, cpi_->svc.number_temporal_layers);
- LAYER_CONTEXT *lc = &cpi_->svc.layer_context[layer];
- RATE_CONTROL *const lrc = &lc->rc;
- lc->layer_target_bitrate = 1000 * rc_cfg.layer_target_bitrate[layer];
- lc->max_q = rc_cfg.max_quantizers[layer];
- lc->min_q = rc_cfg.min_quantizers[layer];
- lrc->worst_quality =
- av1_quantizer_to_qindex(rc_cfg.max_quantizers[layer]);
- lrc->best_quality = av1_quantizer_to_qindex(rc_cfg.min_quantizers[layer]);
- lc->scaling_factor_num = rc_cfg.scaling_factor_num[sl];
- lc->scaling_factor_den = rc_cfg.scaling_factor_den[sl];
- lc->framerate_factor = rc_cfg.ts_rate_decimator[tl];
- if (tl == cpi_->svc.number_temporal_layers - 1)
- target_bandwidth_svc += lc->layer_target_bitrate;
- }
- }
av1_new_framerate(cpi_, cpi_->framerate);
if (cpi_->svc.number_temporal_layers > 1 ||
cpi_->svc.number_spatial_layers > 1) {
+ int64_t target_bandwidth_svc = 0;
+ for (int sl = 0; sl < cpi_->svc.number_spatial_layers; ++sl) {
+ for (int tl = 0; tl < cpi_->svc.number_temporal_layers; ++tl) {
+ const int layer =
+ LAYER_IDS_TO_IDX(sl, tl, cpi_->svc.number_temporal_layers);
+ LAYER_CONTEXT *lc = &cpi_->svc.layer_context[layer];
+ RATE_CONTROL *const lrc = &lc->rc;
+ lc->layer_target_bitrate = 1000 * rc_cfg.layer_target_bitrate[layer];
+ lc->max_q = rc_cfg.max_quantizers[layer];
+ lc->min_q = rc_cfg.min_quantizers[layer];
+ lrc->worst_quality =
+ av1_quantizer_to_qindex(rc_cfg.max_quantizers[layer]);
+ lrc->best_quality =
+ av1_quantizer_to_qindex(rc_cfg.min_quantizers[layer]);
+ lc->scaling_factor_num = rc_cfg.scaling_factor_num[sl];
+ lc->scaling_factor_den = rc_cfg.scaling_factor_den[sl];
+ lc->framerate_factor = rc_cfg.ts_rate_decimator[tl];
+ if (tl == cpi_->svc.number_temporal_layers - 1)
+ target_bandwidth_svc += lc->layer_target_bitrate;
+ }
+ }
+
if (cm->current_frame.frame_number == 0) av1_init_layer_context(cpi_);
// This is needed to initialize external RC flag in layer context structure.
cpi_->rc.rtc_external_ratectrl = 1;
av1_update_layer_context_change_config(cpi_, target_bandwidth_svc);
}
check_reset_rc_flag(cpi_);
+ return true;
}
void AV1RateControlRTC::ComputeQP(const AV1FrameParamsRTC &frame_params) {
@@ -291,6 +306,27 @@
return cpi_->common.quant_params.base_qindex;
}
+AV1LoopfilterLevel AV1RateControlRTC::GetLoopfilterLevel() const {
+ av1_pick_filter_level(nullptr, cpi_, LPF_PICK_FROM_Q);
+ AV1LoopfilterLevel lpf_level;
+ lpf_level.filter_level[0] = cpi_->common.lf.filter_level[0];
+ lpf_level.filter_level[1] = cpi_->common.lf.filter_level[1];
+ lpf_level.filter_level_u = cpi_->common.lf.filter_level_u;
+ lpf_level.filter_level_v = cpi_->common.lf.filter_level_v;
+
+ return lpf_level;
+}
+
+AV1CdefInfo AV1RateControlRTC::GetCdefInfo() const {
+ av1_pick_cdef_from_qp(&cpi_->common, 0, 0);
+ AV1CdefInfo cdef_level;
+ cdef_level.cdef_strength_y = cpi_->common.cdef_info.cdef_strengths[0];
+ cdef_level.cdef_strength_uv = cpi_->common.cdef_info.cdef_uv_strengths[0];
+ cdef_level.damping = cpi_->common.cdef_info.cdef_damping;
+
+ return cdef_level;
+}
+
signed char *AV1RateControlRTC::GetCyclicRefreshMap() const {
return cpi_->cyclic_refresh->map;
}
@@ -301,6 +337,8 @@
void AV1RateControlRTC::PostEncodeUpdate(uint64_t encoded_frame_size) {
cpi_->common.current_frame.frame_number++;
+ if (cpi_->svc.spatial_layer_id == cpi_->svc.number_spatial_layers - 1)
+ cpi_->svc.prev_number_spatial_layers = cpi_->svc.number_spatial_layers;
av1_rc_postencode_update(cpi_, encoded_frame_size);
if (cpi_->svc.number_spatial_layers > 1 ||
cpi_->svc.number_temporal_layers > 1)
diff --git a/av1/ratectrl_rtc.h b/av1/ratectrl_rtc.h
index 9843803..e96e210 100644
--- a/av1/ratectrl_rtc.h
+++ b/av1/ratectrl_rtc.h
@@ -32,6 +32,8 @@
int width;
int height;
+  // Flag indicating whether the content is screen content.
+ bool is_screen;
// 0-63
int max_quantizer;
int min_quantizer;
@@ -63,15 +65,31 @@
int temporal_layer_id;
};
+struct AV1LoopfilterLevel {
+ int filter_level[2];
+ int filter_level_u;
+ int filter_level_v;
+};
+
+struct AV1CdefInfo {
+ int cdef_strength_y;
+ int cdef_strength_uv;
+ int damping;
+};
+
class AV1RateControlRTC {
public:
static std::unique_ptr<AV1RateControlRTC> Create(
const AV1RateControlRtcConfig &cfg);
~AV1RateControlRTC();
- void UpdateRateControl(const AV1RateControlRtcConfig &rc_cfg);
+ bool UpdateRateControl(const AV1RateControlRtcConfig &rc_cfg);
// GetQP() needs to be called after ComputeQP() to get the latest QP
int GetQP() const;
+ // GetLoopfilterLevel() needs to be called after ComputeQP()
+ AV1LoopfilterLevel GetLoopfilterLevel() const;
+ // GetCdefInfo() needs to be called after ComputeQP()
+ AV1CdefInfo GetCdefInfo() const;
signed char *GetCyclicRefreshMap() const;
int *GetDeltaQ() const;
void ComputeQP(const AV1FrameParamsRTC &frame_params);
@@ -80,7 +98,7 @@
private:
AV1RateControlRTC() = default;
- void InitRateControl(const AV1RateControlRtcConfig &cfg);
+ bool InitRateControl(const AV1RateControlRtcConfig &cfg);
AV1_COMP *cpi_;
int initial_width_;
int initial_height_;
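Taken together, the ratectrl_rtc changes give callers a fallible setup path plus
per-frame filter levels. A minimal caller-side sketch in C++; RateControlOneFrame
is illustrative, and it assumes the aom namespace used by this header and a
config/frame-params pair populated elsewhere (bitrate, framerate, layer settings):

    #include <cstdint>
    #include <memory>

    #include "av1/ratectrl_rtc.h"

    // Drive the RTC rate control library for one frame with the updated
    // interface; UpdateRateControl() now reports failure instead of
    // accepting a bad config silently.
    void RateControlOneFrame(const aom::AV1RateControlRtcConfig &cfg,
                             const aom::AV1FrameParamsRTC &frame_params,
                             uint64_t encoded_frame_size) {
      std::unique_ptr<aom::AV1RateControlRTC> rc =
          aom::AV1RateControlRTC::Create(cfg);
      if (!rc) return;
      if (!rc->UpdateRateControl(cfg)) return;  // new bool return
      rc->ComputeQP(frame_params);
      const int qp = rc->GetQP();  // valid only after ComputeQP()
      // New in this release: loopfilter and CDEF levels picked from the QP.
      const aom::AV1LoopfilterLevel lpf = rc->GetLoopfilterLevel();
      const aom::AV1CdefInfo cdef = rc->GetCdefInfo();
      // ... encode the frame externally using qp / lpf / cdef ...
      (void)qp; (void)lpf; (void)cdef;
      rc->PostEncodeUpdate(encoded_frame_size);
    }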
diff --git a/build/cmake/aom_config.c.template b/build/cmake/aom_config.c.template
index 62f0a10..93a6d8f 100644
--- a/build/cmake/aom_config.c.template
+++ b/build/cmake/aom_config.c.template
@@ -1,5 +1,5 @@
/*
- * Copyright (c) 2016, Alliance for Open Media. All rights reserved
+ * Copyright (c) @year@, Alliance for Open Media. All rights reserved
*
* This source code is subject to the terms of the BSD 2 Clause License and
* the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
diff --git a/build/cmake/aom_config_defaults.cmake b/build/cmake/aom_config_defaults.cmake
index d63990b..5058022 100644
--- a/build/cmake/aom_config_defaults.cmake
+++ b/build/cmake/aom_config_defaults.cmake
@@ -23,10 +23,11 @@
set_aom_detect_var(INLINE "" "Sets INLINE value for current target.")
# CPUs.
-set_aom_detect_var(ARCH_ARM 0 "Enables ARM architecture.")
-set_aom_detect_var(ARCH_PPC 0 "Enables PPC architecture.")
-set_aom_detect_var(ARCH_X86 0 "Enables X86 architecture.")
-set_aom_detect_var(ARCH_X86_64 0 "Enables X86_64 architecture.")
+set_aom_detect_var(AOM_ARCH_AARCH64 0 "Enables AArch64 architecture.")
+set_aom_detect_var(AOM_ARCH_ARM 0 "Enables ARM architecture.")
+set_aom_detect_var(AOM_ARCH_PPC 0 "Enables PPC architecture.")
+set_aom_detect_var(AOM_ARCH_X86 0 "Enables X86 architecture.")
+set_aom_detect_var(AOM_ARCH_X86_64 0 "Enables X86_64 architecture.")
# ARM feature flags.
set_aom_detect_var(HAVE_NEON 0 "Enables NEON intrinsics optimizations.")
@@ -155,6 +156,11 @@
"AV1 experiment: Enable tensorflow lite library.")
set_aom_config_var(CONFIG_THREE_PASS 0
"AV1 experiment: Enable three-pass encoding.")
+set_aom_config_var(CONFIG_OUTPUT_FRAME_SIZE 0
+ "AV1 experiment: Output frame size information.")
+set_aom_config_var(
+ CONFIG_SALIENCY_MAP 0
+ "AV1 experiment: Enable saliency map based encoding tuning for VMAF.")
set_aom_config_var(CONFIG_CWG_C013 0
"AV1 experiment: Support for 7.x and 8.x levels.")
diff --git a/build/cmake/aom_configure.cmake b/build/cmake/aom_configure.cmake
index 427507f..aaef2c3 100644
--- a/build/cmake/aom_configure.cmake
+++ b/build/cmake/aom_configure.cmake
@@ -155,49 +155,61 @@
endif()
if(AOM_TARGET_CPU STREQUAL "x86" OR AOM_TARGET_CPU STREQUAL "x86_64")
- find_program(AS_EXECUTABLE yasm $ENV{YASM_PATH})
- if(NOT AS_EXECUTABLE OR ENABLE_NASM)
- unset(AS_EXECUTABLE CACHE)
- find_program(AS_EXECUTABLE nasm $ENV{NASM_PATH})
- if(AS_EXECUTABLE)
- test_nasm()
- endif()
+ find_program(CMAKE_ASM_NASM_COMPILER yasm $ENV{YASM_PATH})
+ if(NOT CMAKE_ASM_NASM_COMPILER OR ENABLE_NASM)
+ unset(CMAKE_ASM_NASM_COMPILER CACHE)
+ find_program(CMAKE_ASM_NASM_COMPILER nasm $ENV{NASM_PATH})
endif()
- if(NOT AS_EXECUTABLE)
+ include(CheckLanguage)
+ check_language(ASM_NASM)
+ if(CMAKE_ASM_NASM_COMPILER)
+ get_asm_obj_format("objformat")
+ unset(CMAKE_ASM_NASM_OBJECT_FORMAT)
+ set(CMAKE_ASM_NASM_OBJECT_FORMAT ${objformat})
+ enable_language(ASM_NASM)
+ if(CMAKE_ASM_NASM_COMPILER_ID STREQUAL "NASM")
+ test_nasm()
+ endif()
+ # Xcode requires building the objects manually, so pass the object format
+ # flag.
+ if(XCODE)
+ set(AOM_AS_FLAGS -f ${objformat} ${AOM_AS_FLAGS})
+ endif()
+ else()
message(
FATAL_ERROR
"Unable to find assembler. Install 'yasm' or 'nasm.' "
"To build without optimizations, add -DAOM_TARGET_CPU=generic to "
"your cmake command line.")
endif()
- get_asm_obj_format("objformat")
- set(AOM_AS_FLAGS -f ${objformat} ${AOM_AS_FLAGS})
string(STRIP "${AOM_AS_FLAGS}" AOM_AS_FLAGS)
elseif(AOM_TARGET_CPU MATCHES "arm")
if(AOM_TARGET_SYSTEM STREQUAL "Darwin")
- set(AS_EXECUTABLE as)
+ set(CMAKE_ASM_COMPILER as)
set(AOM_AS_FLAGS -arch ${AOM_TARGET_CPU} -isysroot ${CMAKE_OSX_SYSROOT})
elseif(AOM_TARGET_SYSTEM STREQUAL "Windows")
- if(NOT AS_EXECUTABLE)
- set(AS_EXECUTABLE ${CMAKE_C_COMPILER} -c -mimplicit-it=always)
+ if(NOT CMAKE_ASM_COMPILER)
+ set(CMAKE_ASM_COMPILER ${CMAKE_C_COMPILER} -c -mimplicit-it=always)
endif()
else()
- if(NOT AS_EXECUTABLE)
- set(AS_EXECUTABLE as)
+ if(NOT CMAKE_ASM_COMPILER)
+ set(CMAKE_ASM_COMPILER as)
endif()
endif()
- find_program(as_executable_found ${AS_EXECUTABLE})
- if(NOT as_executable_found)
+ include(CheckLanguage)
+ check_language(ASM)
+ if(NOT CMAKE_ASM_COMPILER)
message(
FATAL_ERROR
"Unable to find assembler and optimizations are enabled."
- "Searched for ${AS_EXECUTABLE}. Install it, add it to your path, or "
- "set the assembler directly by adding -DAS_EXECUTABLE=<assembler path> "
- "to your CMake command line."
+ "Searched for ${CMAKE_ASM_COMPILER}. Install it, add it to your path,"
+ "or set the assembler directly by adding "
+ "-DCMAKE_ASM_COMPILER=<assembler path> to your CMake command line."
"To build without optimizations, add -DAOM_TARGET_CPU=generic to your "
"cmake command line.")
endif()
+ enable_language(ASM)
string(STRIP "${AOM_AS_FLAGS}" AOM_AS_FLAGS)
endif()
@@ -230,6 +242,8 @@
# The default _WIN32_WINNT value in MinGW is 0x0502 (Windows XP with SP2). Set
# it to 0x0601 (Windows 7).
add_compiler_flag_if_supported("-D_WIN32_WINNT=0x0601")
+ # Quiet warnings related to fopen, printf, etc.
+ add_compiler_flag_if_supported("-D_CRT_SECURE_NO_WARNINGS")
endif()
#
@@ -288,6 +302,9 @@
# Test compiler flags.
if(MSVC)
+ # It isn't possible to specify C99 conformance for MSVC. MSVC doesn't support
+ # C++ standards modes earlier than C++14.
+ add_cxx_flag_if_supported("/std:c++14")
add_compiler_flag_if_supported("/W3")
# Disable MSVC warnings that suggest making code non-portable.
@@ -327,11 +344,13 @@
add_c_flag_if_supported("-Wimplicit-function-declaration")
add_compiler_flag_if_supported("-Wlogical-op")
add_compiler_flag_if_supported("-Wpointer-arith")
+ add_compiler_flag_if_supported("-Wshadow")
add_compiler_flag_if_supported("-Wshorten-64-to-32")
add_compiler_flag_if_supported("-Wsign-compare")
add_compiler_flag_if_supported("-Wstring-conversion")
add_compiler_flag_if_supported("-Wtype-limits")
add_compiler_flag_if_supported("-Wuninitialized")
+ add_compiler_flag_if_supported("-Wunreachable-code-aggressive")
add_compiler_flag_if_supported("-Wunused")
add_compiler_flag_if_supported("-Wvla")
add_cxx_flag_if_supported("-Wc++14-extensions")
@@ -357,9 +376,6 @@
add_compiler_flag_if_supported("-Wno-disabled-optimization")
endif()
- # Add -Wshadow only for C files to avoid massive gtest warning spam.
- add_c_flag_if_supported("-Wshadow")
-
# Add -Wundef only for C files to avoid massive gtest warning spam.
add_c_flag_if_supported("-Wundef")
@@ -428,6 +444,7 @@
message("--- Git missing, version will be read from CHANGELOG.")
endif()
+string(TIMESTAMP year "%Y")
configure_file("${AOM_ROOT}/build/cmake/aom_config.c.template"
"${AOM_CONFIG_DIR}/config/aom_config.c")
diff --git a/build/cmake/aom_install.cmake b/build/cmake/aom_install.cmake
index 3b52a68..b02c7b9 100644
--- a/build/cmake/aom_install.cmake
+++ b/build/cmake/aom_install.cmake
@@ -31,8 +31,8 @@
include("GNUInstallDirs")
set(AOM_PKG_CONFIG_FILE "${AOM_CONFIG_DIR}/aom.pc")
- # Create a dummy library target for creating aom.pc.
- create_dummy_source_file(aom_pc c AOM_PKG_CONFIG_SOURCES)
+ # Create a library target for creating aom.pc.
+ create_no_op_source_file(aom_pc c AOM_PKG_CONFIG_SOURCES)
add_library(aom_pc ${AOM_PKG_CONFIG_SOURCES})
# Setup a rule to generate aom.pc.
@@ -49,6 +49,7 @@
-DCONFIG_MULTITHREAD=${CONFIG_MULTITHREAD}
-DCONFIG_TUNE_VMAF=${CONFIG_TUNE_VMAF}
-DCONFIG_TUNE_BUTTERAUGLI=${CONFIG_TUNE_BUTTERAUGLI}
+ -DCONFIG_SALIENCY_MAP=${CONFIG_SALIENCY_MAP}
-DCONFIG_TFLITE=${CONFIG_TFLITE}
-DHAVE_PTHREAD_H=${HAVE_PTHREAD_H}
-P
diff --git a/build/cmake/aom_optimization.cmake b/build/cmake/aom_optimization.cmake
index 8d28711..6b0c55a 100644
--- a/build/cmake/aom_optimization.cmake
+++ b/build/cmake/aom_optimization.cmake
@@ -131,11 +131,12 @@
# Adds library target named $lib_name for ASM files in variable named by
# $asm_sources. Builds an output directory path from $lib_name. Links $lib_name
-# into the aom library target(s). Generates a dummy C file with a dummy function
-# to ensure that all cmake generators can determine the linker language, and
-# that build tools don't complain that an object exposes no symbols.
+# into the aom library target(s). Generates a C file with an unused no-op
+# function to ensure that all cmake generators can determine the linker
+# language, and that build tools don't complain that an object exposes no
+# symbols.
#
-# In shared library configs every step described above happens twice, and
+# In Xcode-based builds every step described above happens twice, and
# directory/target/object names are updated to include _shared and _static
# suffixes.
function(add_asm_library lib_name asm_sources)
@@ -143,49 +144,66 @@
return()
endif()
- list(APPEND asm_configs "static")
- if(BUILD_SHARED_LIBS)
- list(APPEND asm_configs "shared")
- endif()
-
- foreach(asm_config ${asm_configs})
- set(asm_lib_name ${lib_name}_${asm_config})
- set(asm_lib_obj_dir "${AOM_CONFIG_DIR}/asm_objects/${asm_lib_name}")
- if(NOT EXISTS "${asm_lib_obj_dir}")
- file(MAKE_DIRECTORY "${asm_lib_obj_dir}")
+ if(XCODE)
+ # CMake's Xcode generator does not output a build rule for Nasm files.
+ # Moreover, it makes Xcode believe Nasm files are of type "sourcecode"
+ # instead of "sourcecode.nasm", which prevents even the default rule from
+ # applying. That default rule is broken anyway, because it does not apply
+ # any of the flags specified for ASM_NASM. See
+ # https://discourse.cmake.org/t/building-nasm-files-with-xcode/7934
+ list(APPEND asm_configs "static")
+ if(BUILD_SHARED_LIBS)
+ list(APPEND asm_configs "shared")
endif()
- add_library(${asm_lib_name} STATIC ${${asm_sources}})
- set_property(TARGET ${asm_lib_name} PROPERTY FOLDER ${AOM_TARGET_CPU})
+ set(as_executable "${CMAKE_ASM_NASM_COMPILER}")
+ if(NOT as_executable)
+ set(as_executable "${CMAKE_ASM_COMPILER}")
+ endif()
- foreach(asm_source ${${asm_sources}})
- get_filename_component(asm_source_name "${asm_source}" NAME)
- set(asm_object "${asm_lib_obj_dir}/${asm_source_name}.o")
- add_custom_command(OUTPUT "${asm_object}"
- COMMAND ${AS_EXECUTABLE} ARGS ${AOM_AS_FLAGS}
- -I${AOM_ROOT}/ -I${AOM_CONFIG_DIR}/ -o
- "${asm_object}" "${asm_source}"
- DEPENDS "${asm_source}"
- COMMENT "Building ASM object ${asm_object}"
- WORKING_DIRECTORY "${AOM_CONFIG_DIR}"
- VERBATIM)
- if(BUILD_SHARED_LIBS AND "${asm_config}" STREQUAL "static")
- target_sources(aom_static PRIVATE "${asm_object}")
- else()
- target_sources(aom PRIVATE "${asm_object}")
+ foreach(asm_config ${asm_configs})
+ set(asm_lib_name ${lib_name}_${asm_config})
+ set(asm_lib_obj_dir "${AOM_CONFIG_DIR}/asm_objects/${asm_lib_name}")
+ if(NOT EXISTS "${asm_lib_obj_dir}")
+ file(MAKE_DIRECTORY "${asm_lib_obj_dir}")
endif()
- endforeach()
- # The above created a target containing only ASM sources. CMake needs help
- # here to determine the linker language. Add a dummy C file to force the
- # linker language to C. We don't bother with setting the LINKER_LANGUAGE
- # property on the library target because not all generators obey it (looking
- # at you, Xcode generator).
- add_dummy_source_file_to_target("${asm_lib_name}" "c")
+ foreach(asm_source ${${asm_sources}})
+ get_filename_component(asm_source_name "${asm_source}" NAME)
+ set(asm_object "${asm_lib_obj_dir}/${asm_source_name}.o")
+ add_custom_command(OUTPUT "${asm_object}"
+ COMMAND ${as_executable} ARGS ${AOM_AS_FLAGS}
+ -I${AOM_ROOT}/ -I${AOM_CONFIG_DIR}/ -o
+ "${asm_object}" "${asm_source}"
+ DEPENDS "${asm_source}"
+ COMMENT "Building ASM object ${asm_object}"
+ WORKING_DIRECTORY "${AOM_CONFIG_DIR}"
+ VERBATIM)
+ if(BUILD_SHARED_LIBS AND "${asm_config}" STREQUAL "static")
+ target_sources(aom_static PRIVATE "${asm_object}")
+ else()
+ target_sources(aom PRIVATE "${asm_object}")
+ endif()
+ endforeach()
+ endforeach()
+ else()
+ # For non-Xcode generators, CMake does not need extra help: the ASM_NASM /
+ # ASM language support enabled in aom_configure.cmake builds these sources
+ # directly.
+ set(asm_lib_name ${lib_name})
+
+ add_library(${asm_lib_name} OBJECT ${${asm_sources}})
+ target_include_directories(${asm_lib_name}
+ PRIVATE ${AOM_ROOT} ${AOM_CONFIG_DIR})
+ target_compile_options(${asm_lib_name} PRIVATE ${AOM_AS_FLAGS})
+ set_property(TARGET ${asm_lib_name} PROPERTY FOLDER ${AOM_TARGET_CPU})
+ if(BUILD_SHARED_LIBS)
+ target_sources(aom_static PRIVATE "$<TARGET_OBJECTS:${asm_lib_name}>")
+ endif()
+ target_sources(aom PRIVATE "$<TARGET_OBJECTS:${asm_lib_name}>")
# Add the new lib target to the global list of aom library targets.
list(APPEND AOM_LIB_TARGETS ${asm_lib_name})
- endforeach()
+ endif()
set(AOM_LIB_TARGETS ${AOM_LIB_TARGETS} PARENT_SCOPE)
endfunction()
@@ -194,7 +212,8 @@
# Currently checks only for presence of required object formats and support for
# the -Ox argument (multipass optimization).
function(test_nasm)
- execute_process(COMMAND ${AS_EXECUTABLE} -hf OUTPUT_VARIABLE nasm_helptext)
+ execute_process(COMMAND ${CMAKE_ASM_NASM_COMPILER} -hf
+ OUTPUT_VARIABLE nasm_helptext)
if(NOT "${nasm_helptext}" MATCHES "-Ox")
message(
diff --git a/build/cmake/cpu.cmake b/build/cmake/cpu.cmake
index 99ac38a..799a313 100644
--- a/build/cmake/cpu.cmake
+++ b/build/cmake/cpu.cmake
@@ -10,7 +10,10 @@
#
if("${AOM_TARGET_CPU}" MATCHES "^arm")
- set(ARCH_ARM 1)
+ set(AOM_ARCH_ARM 1)
+ if("${AOM_TARGET_CPU}" STREQUAL "arm64")
+ set(AOM_ARCH_AARCH64 1)
+ endif()
set(RTCD_ARCH_ARM "yes")
if(ENABLE_NEON)
@@ -34,7 +37,7 @@
endif()
elseif("${AOM_TARGET_CPU}" MATCHES "ppc")
- set(ARCH_PPC 1)
+ set(AOM_ARCH_PPC 1)
set(RTCD_ARCH_PPC "yes")
if(ENABLE_VSX)
@@ -46,10 +49,10 @@
endif()
elseif("${AOM_TARGET_CPU}" MATCHES "^x86")
if("${AOM_TARGET_CPU}" STREQUAL "x86")
- set(ARCH_X86 1)
+ set(AOM_ARCH_X86 1)
set(RTCD_ARCH_X86 "yes")
elseif("${AOM_TARGET_CPU}" STREQUAL "x86_64")
- set(ARCH_X86_64 1)
+ set(AOM_ARCH_X86_64 1)
set(RTCD_ARCH_X86_64 "yes")
endif()
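The cpu.cmake variables above feed the AOM_ARCH_* defines in aom_config.h (see
the config/arm diffs below), so code that previously tested the unprefixed
ARCH_* macros must switch to the prefixed names; note that AOM_ARCH_AARCH64 is
set alongside AOM_ARCH_ARM on arm64 targets. A minimal consumer sketch in C++;
AOM_ARCH_NAME is a hypothetical macro, and config/aom_config.h is assumed to be
on the include path:

    #include "config/aom_config.h"

    // The unprefixed ARCH_* macros no longer exist after this rename.
    #if AOM_ARCH_AARCH64
    #define AOM_ARCH_NAME "aarch64"
    #elif AOM_ARCH_ARM
    #define AOM_ARCH_NAME "arm"
    #elif AOM_ARCH_X86_64
    #define AOM_ARCH_NAME "x86_64"
    #elif AOM_ARCH_X86
    #define AOM_ARCH_NAME "x86"
    #elif AOM_ARCH_PPC
    #define AOM_ARCH_NAME "ppc"
    #else
    #define AOM_ARCH_NAME "generic"
    #endif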
diff --git a/build/cmake/toolchains/android.cmake b/build/cmake/toolchains/android.cmake
index f0b9fab..4d38c9a 100644
--- a/build/cmake/toolchains/android.cmake
+++ b/build/cmake/toolchains/android.cmake
@@ -45,11 +45,11 @@
endif()
if(ANDROID_ABI MATCHES "^arm")
- set(AS_EXECUTABLE as)
+ set(CMAKE_ASM_COMPILER as)
# No runtime cpu detect for arm targets.
set(CONFIG_RUNTIME_CPU_DETECT 0 CACHE STRING "")
elseif(ANDROID_ABI MATCHES "^x86")
- set(AS_EXECUTABLE yasm)
+ set(CMAKE_ASM_NASM_COMPILER yasm)
endif()
set(CMAKE_SYSTEM_NAME "Android")
diff --git a/build/cmake/toolchains/arm64-linux-gcc.cmake b/build/cmake/toolchains/arm64-linux-gcc.cmake
index 64e460b..133a96a 100644
--- a/build/cmake/toolchains/arm64-linux-gcc.cmake
+++ b/build/cmake/toolchains/arm64-linux-gcc.cmake
@@ -17,7 +17,8 @@
if("${CROSS}" STREQUAL "")
- # Default the cross compiler prefix to something known to work.
+ # Default the cross compiler prefix to one used by Debian and other package
+ # management systems.
set(CROSS aarch64-linux-gnu-)
endif()
@@ -27,8 +28,8 @@
if(NOT CMAKE_CXX_COMPILER)
set(CMAKE_CXX_COMPILER ${CROSS}g++)
endif()
-if(NOT AS_EXECUTABLE)
- set(AS_EXECUTABLE ${CROSS}as)
+if(NOT CMAKE_ASM_COMPILER)
+ set(CMAKE_ASM_COMPILER ${CROSS}as)
endif()
set(CMAKE_C_FLAGS_INIT "-march=armv8-a")
set(CMAKE_CXX_FLAGS_INIT "-march=armv8-a")
diff --git a/build/cmake/toolchains/arm64-mingw-gcc.cmake b/build/cmake/toolchains/arm64-mingw-gcc.cmake
index 5472ed4..7400423 100644
--- a/build/cmake/toolchains/arm64-mingw-gcc.cmake
+++ b/build/cmake/toolchains/arm64-mingw-gcc.cmake
@@ -17,6 +17,8 @@
set(CMAKE_SYSTEM_NAME "Windows")
if("${CROSS}" STREQUAL "")
+
+ # Default the cross compiler prefix to one used by MSYS2.
set(CROSS aarch64-w64-mingw32-)
endif()
diff --git a/build/cmake/toolchains/armv7-linux-gcc.cmake b/build/cmake/toolchains/armv7-linux-gcc.cmake
index 1201538..366e198 100644
--- a/build/cmake/toolchains/armv7-linux-gcc.cmake
+++ b/build/cmake/toolchains/armv7-linux-gcc.cmake
@@ -17,7 +17,8 @@
if("${CROSS}" STREQUAL "")
- # Default the cross compiler prefix to something known to work.
+ # Default the cross compiler prefix to one used by Debian and other package
+ # management systems.
set(CROSS arm-linux-gnueabihf-)
endif()
@@ -31,8 +32,8 @@
if(NOT CMAKE_CXX_COMPILER)
set(CMAKE_CXX_COMPILER ${CROSS}g++)
endif()
-if(NOT AS_EXECUTABLE)
- set(AS_EXECUTABLE ${CROSS}as)
+if(NOT CMAKE_ASM_COMPILER)
+ set(CMAKE_ASM_COMPILER ${CROSS}as)
endif()
set(CMAKE_C_FLAGS_INIT "-march=armv7-a -mfpu=vfpv3 \
${AOM_EXTRA_TOOLCHAIN_FLAGS}")
diff --git a/build/cmake/toolchains/armv7-mingw-gcc.cmake b/build/cmake/toolchains/armv7-mingw-gcc.cmake
index 8a92891..93f8c06 100644
--- a/build/cmake/toolchains/armv7-mingw-gcc.cmake
+++ b/build/cmake/toolchains/armv7-mingw-gcc.cmake
@@ -17,6 +17,8 @@
set(CMAKE_SYSTEM_NAME "Windows")
if("${CROSS}" STREQUAL "")
+
+ # Default the cross compiler prefix to one used by MSYS2.
set(CROSS armv7-w64-mingw32-)
endif()
diff --git a/build/cmake/toolchains/ppc-linux-gcc.cmake b/build/cmake/toolchains/ppc-linux-gcc.cmake
index ab0efea..3aa2652 100644
--- a/build/cmake/toolchains/ppc-linux-gcc.cmake
+++ b/build/cmake/toolchains/ppc-linux-gcc.cmake
@@ -17,8 +17,9 @@
if("${CROSS}" STREQUAL "")
- # Default the cross compiler prefix to something known to work.
- set(CROSS powerpc64le-unknown-linux-gnu-)
+ # Default the cross compiler prefix to one used by Debian and other package
+ # management systems.
+ set(CROSS powerpc64le-linux-gnu-)
endif()
if(NOT CMAKE_C_COMPILER)
@@ -27,8 +28,8 @@
if(NOT CMAKE_CXX_COMPILER)
set(CMAKE_CXX_COMPILER ${CROSS}g++)
endif()
-if(NOT AS_EXECUTABLE)
- set(AS_EXECUTABLE ${CROSS}as)
+if(NOT CMAKE_ASM_COMPILER)
+ set(CMAKE_ASM_COMPILER ${CROSS}as)
endif()
set(CMAKE_SYSTEM_PROCESSOR "ppc")
diff --git a/build/cmake/toolchains/riscv-linux-gcc.cmake b/build/cmake/toolchains/riscv-linux-gcc.cmake
index 21e7370..4133be6 100644
--- a/build/cmake/toolchains/riscv-linux-gcc.cmake
+++ b/build/cmake/toolchains/riscv-linux-gcc.cmake
@@ -17,8 +17,9 @@
if("${CROSS}" STREQUAL "")
- # Default the cross compiler prefix to something known to work.
- set(CROSS riscv64-unknown-linux-gnu-)
+ # Default the cross compiler prefix to one used by Debian and other package
+ # management systems.
+ set(CROSS riscv64-linux-gnu-)
endif()
if(NOT CMAKE_C_COMPILER)
@@ -27,8 +28,8 @@
if(NOT CMAKE_CXX_COMPILER)
set(CMAKE_CXX_COMPILER ${CROSS}g++)
endif()
-if(NOT AS_EXECUTABLE)
- set(AS_EXECUTABLE ${CROSS}as)
+if(NOT CMAKE_ASM_COMPILER)
+ set(CMAKE_ASM_COMPILER ${CROSS}as)
endif()
set(CMAKE_SYSTEM_PROCESSOR "riscv")
diff --git a/build/cmake/toolchains/x86-mingw-gcc.cmake b/build/cmake/toolchains/x86-mingw-gcc.cmake
index f75728f..2208333 100644
--- a/build/cmake/toolchains/x86-mingw-gcc.cmake
+++ b/build/cmake/toolchains/x86-mingw-gcc.cmake
@@ -20,6 +20,9 @@
set(CMAKE_EXE_LINKER_FLAGS_INIT "-m32")
if("${CROSS}" STREQUAL "")
+
+ # Default the cross compiler prefix to one used by Debian and other package
+ # management systems.
set(CROSS i686-w64-mingw32-)
endif()
diff --git a/build/cmake/toolchains/x86_64-mingw-gcc.cmake b/build/cmake/toolchains/x86_64-mingw-gcc.cmake
index 56e9b6e..978146a 100644
--- a/build/cmake/toolchains/x86_64-mingw-gcc.cmake
+++ b/build/cmake/toolchains/x86_64-mingw-gcc.cmake
@@ -17,6 +17,9 @@
set(CMAKE_SYSTEM_NAME "Windows")
if("${CROSS}" STREQUAL "")
+
+ # Default the cross compiler prefix to one used by Debian and other package
+ # management systems.
set(CROSS x86_64-w64-mingw32-)
endif()
diff --git a/build/cmake/util.cmake b/build/cmake/util.cmake
index 9b3da84..31de2e1 100644
--- a/build/cmake/util.cmake
+++ b/build/cmake/util.cmake
@@ -16,31 +16,32 @@
# Directory where generated sources will be written.
set(AOM_GEN_SRC_DIR "${AOM_CONFIG_DIR}/gen_src")
-# Creates dummy source file in $AOM_GEN_SRC_DIR named $basename.$extension and
-# returns the full path to the dummy source file via appending it to the list
-# variable referred to by $out_file_list_var parameter.
-macro(create_dummy_source_file basename extension out_file_list_var)
- set(dummy_source_file "${AOM_GEN_SRC_DIR}/${basename}_dummy.${extension}")
- file(WRITE "${dummy_source_file}"
+# Creates a no-op source file in $AOM_GEN_SRC_DIR named $basename.$extension and
+# returns the full path to the source file by appending it to the list variable
+# referred to by the $out_file_list_var parameter.
+macro(create_no_op_source_file basename extension out_file_list_var)
+ set(no_op_source_file "${AOM_GEN_SRC_DIR}/${basename}_no_op.${extension}")
+ file(WRITE "${no_op_source_file}"
"// Generated file. DO NOT EDIT!\n"
"// ${target_name} needs a ${extension} file to force link language, \n"
"// or to silence a harmless CMake warning: Ignore me.\n"
- "void aom_${target_name}_dummy_function(void) {}\n")
- list(APPEND "${out_file_list_var}" "${dummy_source_file}")
+ "void aom_${target_name}_no_op_function(void);\n"
+ "void aom_${target_name}_no_op_function(void) {}\n")
+ list(APPEND "${out_file_list_var}" "${no_op_source_file}")
endmacro()
-# Convenience function for adding a dummy source file to $target_name using
-# $extension as the file extension. Wraps create_dummy_source_file().
-function(add_dummy_source_file_to_target target_name extension)
- create_dummy_source_file("${target_name}" "${extension}"
- "dummy_source_file_list")
- target_sources(${target_name} PRIVATE ${dummy_source_file_list})
+# Convenience function for adding a no-op source file to $target_name using
+# $extension as the file extension. Wraps create_no_op_source_file().
+function(add_no_op_source_file_to_target target_name extension)
+ create_no_op_source_file("${target_name}" "${extension}"
+ "no_op_source_file_list")
+ target_sources(${target_name} PRIVATE ${no_op_source_file_list})
endfunction()
# Sets the value of the variable referenced by $feature to $value, and reports
# the change to the user via call to message(WARNING ...). $cause is expected to
# be a configuration variable that conflicts with $feature in some way. This
-# function is a noop if $feature is already set to $value.
+# function is a no-op if $feature is already set to $value.
function(change_config_and_warn feature value cause)
if(${feature} EQUAL ${value})
return()
@@ -100,7 +101,7 @@
# already been set via the CMake command line.
#
# The names of variables defaulted through this macro are added to
-# $AOM_CONFIG_VARS to facilitate build logging and diagnostics.
+# $AOM_DETECT_VARS to facilitate build logging and diagnostics.
macro(set_aom_detect_var name value helpstring)
unset(list_index)
list(FIND AOM_DETECT_VARS ${name} list_index)
diff --git a/common/tools_common.c b/common/tools_common.c
index 6b579e0..afe4619 100644
--- a/common/tools_common.c
+++ b/common/tools_common.c
@@ -26,15 +26,9 @@
#include "aom/aomdx.h"
#endif
-#if defined(_WIN32) || defined(__OS2__)
+#if defined(_WIN32)
#include <io.h>
#include <fcntl.h>
-
-#ifdef __OS2__
-#define _setmode setmode
-#define _fileno fileno
-#define _O_BINARY O_BINARY
-#endif
#endif
#define LOG_ERROR(label) \
@@ -50,7 +44,7 @@
FILE *set_binary_mode(FILE *stream) {
(void)stream;
-#if defined(_WIN32) || defined(__OS2__)
+#if defined(_WIN32)
_setmode(_fileno(stream), _O_BINARY);
#endif
return stream;
@@ -76,6 +70,21 @@
exit(EXIT_FAILURE);
}
+const char *image_format_to_string(aom_img_fmt_t fmt) {
+ switch (fmt) {
+ case AOM_IMG_FMT_I420: return "I420";
+ case AOM_IMG_FMT_I422: return "I422";
+ case AOM_IMG_FMT_I444: return "I444";
+ case AOM_IMG_FMT_YV12: return "YV12";
+ case AOM_IMG_FMT_NV12: return "NV12";
+ case AOM_IMG_FMT_YV1216: return "YV1216";
+ case AOM_IMG_FMT_I42016: return "I42016";
+ case AOM_IMG_FMT_I42216: return "I42216";
+ case AOM_IMG_FMT_I44416: return "I44416";
+ default: return "Other";
+ }
+}
+
int read_yuv_frame(struct AvxInputContext *input_ctx, aom_image_t *yuv_frame) {
FILE *f = input_ctx->file;
struct FileTypeDetectionBuffer *detect = &input_ctx->detect;
@@ -133,8 +142,8 @@
struct CodecInfo {
// Pointer to a function of zero arguments that returns an aom_codec_iface_t.
- aom_codec_iface_t *(*const interface)();
- char *short_name;
+ aom_codec_iface_t *(*interface)(void);
+ const char *short_name;
uint32_t fourcc;
};
@@ -300,7 +309,7 @@
case AOM_IMG_FMT_I42016:
case AOM_IMG_FMT_I42216:
case AOM_IMG_FMT_I44416: break;
- default: fatal("Unsupported image conversion"); break;
+ default: fatal("Unsupported image conversion");
}
for (plane = 0; plane < 3; plane++) {
int w = src->d_w;
@@ -336,7 +345,7 @@
case AOM_IMG_FMT_I420:
case AOM_IMG_FMT_I422:
case AOM_IMG_FMT_I444: break;
- default: fatal("Unsupported image conversion"); break;
+ default: fatal("Unsupported image conversion");
}
for (plane = 0; plane < 3; plane++) {
int w = src->d_w;
@@ -377,7 +386,7 @@
case AOM_IMG_FMT_I420:
case AOM_IMG_FMT_I422:
case AOM_IMG_FMT_I444: break;
- default: fatal("Unsupported image conversion"); break;
+ default: fatal("Unsupported image conversion");
}
for (plane = 0; plane < 3; plane++) {
int w = src->d_w;
@@ -411,7 +420,7 @@
case AOM_IMG_FMT_I42016:
case AOM_IMG_FMT_I42216:
case AOM_IMG_FMT_I44416: break;
- default: fatal("Unsupported image conversion"); break;
+ default: fatal("Unsupported image conversion");
}
for (plane = 0; plane < 3; plane++) {
int w = src->d_w;
@@ -444,7 +453,7 @@
case AOM_IMG_FMT_I420:
case AOM_IMG_FMT_I422:
case AOM_IMG_FMT_I444: break;
- default: fatal("Unsupported image conversion"); break;
+ default: fatal("Unsupported image conversion");
}
for (plane = 0; plane < 3; plane++) {
int w = src->d_w;
diff --git a/common/tools_common.h b/common/tools_common.h
index eeccbe4..b31371c 100644
--- a/common/tools_common.h
+++ b/common/tools_common.h
@@ -157,7 +157,7 @@
// The AOM library can support different encoders / decoders. These
// functions provide different ways to lookup / iterate through them.
// The return result may be NULL to indicate no codec was found.
-int get_aom_encoder_count();
+int get_aom_encoder_count(void);
aom_codec_iface_t *get_aom_encoder_by_index(int i);
aom_codec_iface_t *get_aom_encoder_by_short_name(const char *name);
// If the interface is unknown, returns NULL.
@@ -165,7 +165,7 @@
// If the interface is unknown, returns 0.
uint32_t get_fourcc_by_aom_encoder(aom_codec_iface_t *iface);
-int get_aom_decoder_count();
+int get_aom_decoder_count(void);
aom_codec_iface_t *get_aom_decoder_by_index(int i);
aom_codec_iface_t *get_aom_decoder_by_short_name(const char *name);
aom_codec_iface_t *get_aom_decoder_by_fourcc(uint32_t fourcc);
@@ -173,6 +173,8 @@
// If the interface is unknown, returns 0.
uint32_t get_fourcc_by_aom_decoder(aom_codec_iface_t *iface);
+const char *image_format_to_string(aom_img_fmt_t fmt);
+
int read_yuv_frame(struct AvxInputContext *input_ctx, aom_image_t *yuv_frame);
void aom_img_write(const aom_image_t *img, FILE *file);
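The new image_format_to_string() declared above maps an aom_img_fmt_t to a
printable name, returning "Other" for formats outside its switch. A small usage
sketch in C++; log_image_format is a hypothetical helper:

    #include <cstdio>

    #include "aom/aom_image.h"
    #include "common/tools_common.h"

    // Print an image's pixel format by name rather than as a raw enum value.
    static void log_image_format(const aom_image_t *img) {
      std::printf("Pixel format: %s, bit depth: %u\n",
                  image_format_to_string(img->fmt), img->bit_depth);
    }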
diff --git a/common/y4menc.c b/common/y4menc.c
index 7d32465..25086a9 100644
--- a/common/y4menc.c
+++ b/common/y4menc.c
@@ -28,7 +28,8 @@
// Return the Y4M name of the 8-bit colorspace, given the chroma position and
// image format.
-const char *colorspace8(aom_chroma_sample_position_t csp, aom_img_fmt_t fmt) {
+static const char *colorspace8(aom_chroma_sample_position_t csp,
+ aom_img_fmt_t fmt) {
switch (fmt) {
case AOM_IMG_FMT_I444: return "C444";
case AOM_IMG_FMT_I422: return "C422";
diff --git a/common/y4minput.c b/common/y4minput.c
index 2fc8379..1974d76 100644
--- a/common/y4minput.c
+++ b/common/y4minput.c
@@ -1202,6 +1202,7 @@
_img->fmt = _y4m->aom_fmt;
_img->w = _img->d_w = _y4m->pic_w;
_img->h = _img->d_h = _y4m->pic_h;
+ _img->bit_depth = _y4m->bit_depth;
_img->x_chroma_shift = _y4m->dst_c_dec_h >> 1;
_img->y_chroma_shift = _y4m->dst_c_dec_v >> 1;
_img->bps = _y4m->bps;
diff --git a/config/arm/config/aom_config.asm b/config/arm/config/aom_config.asm
index a6b5453..ce46e8b 100644
--- a/config/arm/config/aom_config.asm
+++ b/config/arm/config/aom_config.asm
@@ -8,10 +8,11 @@
; Media Patent License 1.0 was not distributed with this source code in the
; PATENTS file, you can obtain it at www.aomedia.org/license/patent.
;
-ARCH_ARM equ 1
-ARCH_PPC equ 0
-ARCH_X86 equ 0
-ARCH_X86_64 equ 0
+AOM_ARCH_AARCH64 equ 0
+AOM_ARCH_ARM equ 1
+AOM_ARCH_PPC equ 0
+AOM_ARCH_X86 equ 0
+AOM_ARCH_X86_64 equ 0
CONFIG_ACCOUNTING equ 0
CONFIG_ANALYZER equ 0
CONFIG_AV1_DECODER equ 1
@@ -47,6 +48,7 @@
CONFIG_NORMAL_TILE_MODE equ 1
CONFIG_OPTICAL_FLOW_API equ 0
CONFIG_OS_SUPPORT equ 1
+CONFIG_OUTPUT_FRAME_SIZE equ 0
CONFIG_PARTITION_SEARCH_ORDER equ 0
CONFIG_PIC equ 1
CONFIG_RATECTRL_LOG equ 0
@@ -55,6 +57,7 @@
CONFIG_REALTIME_ONLY equ 0
CONFIG_RT_ML_PARTITIONING equ 0
CONFIG_RUNTIME_CPU_DETECT equ 0
+CONFIG_SALIENCY_MAP equ 0
CONFIG_SHARED equ 0
CONFIG_SIZE_LIMIT equ 1
CONFIG_SPATIAL_RESAMPLING equ 1
diff --git a/config/arm/config/aom_config.c b/config/arm/config/aom_config.c
index 3fcba38..affe0d7 100644
--- a/config/arm/config/aom_config.c
+++ b/config/arm/config/aom_config.c
@@ -1,5 +1,5 @@
/*
- * Copyright (c) 2016, Alliance for Open Media. All rights reserved
+ * Copyright (c) 2023, Alliance for Open Media. All rights reserved
*
* This source code is subject to the terms of the BSD 2 Clause License and
* the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
diff --git a/config/arm/config/aom_config.h b/config/arm/config/aom_config.h
index 9f2cfc1..6611944 100644
--- a/config/arm/config/aom_config.h
+++ b/config/arm/config/aom_config.h
@@ -10,10 +10,11 @@
*/
#ifndef AOM_CONFIG_H_
#define AOM_CONFIG_H_
-#define ARCH_ARM 1
-#define ARCH_PPC 0
-#define ARCH_X86 0
-#define ARCH_X86_64 0
+#define AOM_ARCH_AARCH64 0
+#define AOM_ARCH_ARM 1
+#define AOM_ARCH_PPC 0
+#define AOM_ARCH_X86 0
+#define AOM_ARCH_X86_64 0
#define CONFIG_ACCOUNTING 0
#define CONFIG_ANALYZER 0
#define CONFIG_AV1_DECODER 1
@@ -49,6 +50,7 @@
#define CONFIG_NORMAL_TILE_MODE 1
#define CONFIG_OPTICAL_FLOW_API 0
#define CONFIG_OS_SUPPORT 1
+#define CONFIG_OUTPUT_FRAME_SIZE 0
#define CONFIG_PARTITION_SEARCH_ORDER 0
#define CONFIG_PIC 1
#define CONFIG_RATECTRL_LOG 0
@@ -57,6 +59,7 @@
#define CONFIG_REALTIME_ONLY 0
#define CONFIG_RT_ML_PARTITIONING 0
#define CONFIG_RUNTIME_CPU_DETECT 0
+#define CONFIG_SALIENCY_MAP 0
#define CONFIG_SHARED 0
#define CONFIG_SIZE_LIMIT 1
#define CONFIG_SPATIAL_RESAMPLING 1
diff --git a/config/arm/config/aom_dsp_rtcd.h b/config/arm/config/aom_dsp_rtcd.h
index 7ae6636..ad77b04 100644
--- a/config/arm/config/aom_dsp_rtcd.h
+++ b/config/arm/config/aom_dsp_rtcd.h
@@ -14,8 +14,8 @@
#include "aom/aom_integer.h"
#include "aom_dsp/aom_dsp_common.h"
-#include "av1/common/enums.h"
#include "av1/common/blockd.h"
+#include "av1/common/enums.h"
#ifdef __cplusplus
@@ -46,19 +46,26 @@
#define aom_blend_a64_vmask aom_blend_a64_vmask_neon
void aom_comp_avg_pred_c(uint8_t *comp_pred, const uint8_t *pred, int width, int height, const uint8_t *ref, int ref_stride);
-#define aom_comp_avg_pred aom_comp_avg_pred_c
+void aom_comp_avg_pred_neon(uint8_t *comp_pred, const uint8_t *pred, int width, int height, const uint8_t *ref, int ref_stride);
+#define aom_comp_avg_pred aom_comp_avg_pred_neon
void aom_comp_mask_pred_c(uint8_t *comp_pred, const uint8_t *pred, int width, int height, const uint8_t *ref, int ref_stride, const uint8_t *mask, int mask_stride, int invert_mask);
-#define aom_comp_mask_pred aom_comp_mask_pred_c
+void aom_comp_mask_pred_neon(uint8_t *comp_pred, const uint8_t *pred, int width, int height, const uint8_t *ref, int ref_stride, const uint8_t *mask, int mask_stride, int invert_mask);
+#define aom_comp_mask_pred aom_comp_mask_pred_neon
+
+void aom_compute_flow_at_point_c(const uint8_t *src, const uint8_t *ref, int x, int y, int width, int height, int stride, double *u, double *v);
+#define aom_compute_flow_at_point aom_compute_flow_at_point_c
void aom_convolve8_c(const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst, ptrdiff_t dst_stride, const InterpKernel *filter, int x0_q4, int x_step_q4, int y0_q4, int y_step_q4, int w, int h);
#define aom_convolve8 aom_convolve8_c
void aom_convolve8_horiz_c(const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst, ptrdiff_t dst_stride, const int16_t *filter_x, int x_step_q4, const int16_t *filter_y, int y_step_q4, int w, int h);
-#define aom_convolve8_horiz aom_convolve8_horiz_c
+void aom_convolve8_horiz_neon(const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst, ptrdiff_t dst_stride, const int16_t *filter_x, int x_step_q4, const int16_t *filter_y, int y_step_q4, int w, int h);
+#define aom_convolve8_horiz aom_convolve8_horiz_neon
void aom_convolve8_vert_c(const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst, ptrdiff_t dst_stride, const int16_t *filter_x, int x_step_q4, const int16_t *filter_y, int y_step_q4, int w, int h);
-#define aom_convolve8_vert aom_convolve8_vert_c
+void aom_convolve8_vert_neon(const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst, ptrdiff_t dst_stride, const int16_t *filter_x, int x_step_q4, const int16_t *filter_y, int y_step_q4, int w, int h);
+#define aom_convolve8_vert aom_convolve8_vert_neon
void aom_convolve_copy_c(const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst, ptrdiff_t dst_stride, int w, int h);
void aom_convolve_copy_neon(const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst, ptrdiff_t dst_stride, int w, int h);
@@ -69,57 +76,72 @@
#define aom_dc_128_predictor_16x16 aom_dc_128_predictor_16x16_neon
void aom_dc_128_predictor_16x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_16x32 aom_dc_128_predictor_16x32_c
+void aom_dc_128_predictor_16x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_16x32 aom_dc_128_predictor_16x32_neon
void aom_dc_128_predictor_16x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_16x4 aom_dc_128_predictor_16x4_c
+void aom_dc_128_predictor_16x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_16x4 aom_dc_128_predictor_16x4_neon
void aom_dc_128_predictor_16x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_16x64 aom_dc_128_predictor_16x64_c
+void aom_dc_128_predictor_16x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_16x64 aom_dc_128_predictor_16x64_neon
void aom_dc_128_predictor_16x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_16x8 aom_dc_128_predictor_16x8_c
+void aom_dc_128_predictor_16x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_16x8 aom_dc_128_predictor_16x8_neon
void aom_dc_128_predictor_32x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_32x16 aom_dc_128_predictor_32x16_c
+void aom_dc_128_predictor_32x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_32x16 aom_dc_128_predictor_32x16_neon
void aom_dc_128_predictor_32x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_128_predictor_32x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_dc_128_predictor_32x32 aom_dc_128_predictor_32x32_neon
void aom_dc_128_predictor_32x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_32x64 aom_dc_128_predictor_32x64_c
+void aom_dc_128_predictor_32x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_32x64 aom_dc_128_predictor_32x64_neon
void aom_dc_128_predictor_32x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_32x8 aom_dc_128_predictor_32x8_c
+void aom_dc_128_predictor_32x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_32x8 aom_dc_128_predictor_32x8_neon
void aom_dc_128_predictor_4x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_4x16 aom_dc_128_predictor_4x16_c
+void aom_dc_128_predictor_4x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_4x16 aom_dc_128_predictor_4x16_neon
void aom_dc_128_predictor_4x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_128_predictor_4x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_dc_128_predictor_4x4 aom_dc_128_predictor_4x4_neon
void aom_dc_128_predictor_4x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_4x8 aom_dc_128_predictor_4x8_c
+void aom_dc_128_predictor_4x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_4x8 aom_dc_128_predictor_4x8_neon
void aom_dc_128_predictor_64x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_64x16 aom_dc_128_predictor_64x16_c
+void aom_dc_128_predictor_64x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_64x16 aom_dc_128_predictor_64x16_neon
void aom_dc_128_predictor_64x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_64x32 aom_dc_128_predictor_64x32_c
+void aom_dc_128_predictor_64x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_64x32 aom_dc_128_predictor_64x32_neon
void aom_dc_128_predictor_64x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_64x64 aom_dc_128_predictor_64x64_c
+void aom_dc_128_predictor_64x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_64x64 aom_dc_128_predictor_64x64_neon
void aom_dc_128_predictor_8x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_8x16 aom_dc_128_predictor_8x16_c
+void aom_dc_128_predictor_8x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_8x16 aom_dc_128_predictor_8x16_neon
void aom_dc_128_predictor_8x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_8x32 aom_dc_128_predictor_8x32_c
+void aom_dc_128_predictor_8x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_8x32 aom_dc_128_predictor_8x32_neon
void aom_dc_128_predictor_8x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_8x4 aom_dc_128_predictor_8x4_c
+void aom_dc_128_predictor_8x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_8x4 aom_dc_128_predictor_8x4_neon
void aom_dc_128_predictor_8x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_128_predictor_8x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
@@ -130,57 +152,72 @@
#define aom_dc_left_predictor_16x16 aom_dc_left_predictor_16x16_neon
void aom_dc_left_predictor_16x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_16x32 aom_dc_left_predictor_16x32_c
+void aom_dc_left_predictor_16x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_16x32 aom_dc_left_predictor_16x32_neon
void aom_dc_left_predictor_16x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_16x4 aom_dc_left_predictor_16x4_c
+void aom_dc_left_predictor_16x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_16x4 aom_dc_left_predictor_16x4_neon
void aom_dc_left_predictor_16x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_16x64 aom_dc_left_predictor_16x64_c
+void aom_dc_left_predictor_16x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_16x64 aom_dc_left_predictor_16x64_neon
void aom_dc_left_predictor_16x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_16x8 aom_dc_left_predictor_16x8_c
+void aom_dc_left_predictor_16x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_16x8 aom_dc_left_predictor_16x8_neon
void aom_dc_left_predictor_32x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_32x16 aom_dc_left_predictor_32x16_c
+void aom_dc_left_predictor_32x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_32x16 aom_dc_left_predictor_32x16_neon
void aom_dc_left_predictor_32x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_left_predictor_32x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_dc_left_predictor_32x32 aom_dc_left_predictor_32x32_neon
void aom_dc_left_predictor_32x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_32x64 aom_dc_left_predictor_32x64_c
+void aom_dc_left_predictor_32x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_32x64 aom_dc_left_predictor_32x64_neon
void aom_dc_left_predictor_32x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_32x8 aom_dc_left_predictor_32x8_c
+void aom_dc_left_predictor_32x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_32x8 aom_dc_left_predictor_32x8_neon
void aom_dc_left_predictor_4x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_4x16 aom_dc_left_predictor_4x16_c
+void aom_dc_left_predictor_4x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_4x16 aom_dc_left_predictor_4x16_neon
void aom_dc_left_predictor_4x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_left_predictor_4x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_dc_left_predictor_4x4 aom_dc_left_predictor_4x4_neon
void aom_dc_left_predictor_4x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_4x8 aom_dc_left_predictor_4x8_c
+void aom_dc_left_predictor_4x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_4x8 aom_dc_left_predictor_4x8_neon
void aom_dc_left_predictor_64x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_64x16 aom_dc_left_predictor_64x16_c
+void aom_dc_left_predictor_64x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_64x16 aom_dc_left_predictor_64x16_neon
void aom_dc_left_predictor_64x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_64x32 aom_dc_left_predictor_64x32_c
+void aom_dc_left_predictor_64x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_64x32 aom_dc_left_predictor_64x32_neon
void aom_dc_left_predictor_64x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_64x64 aom_dc_left_predictor_64x64_c
+void aom_dc_left_predictor_64x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_64x64 aom_dc_left_predictor_64x64_neon
void aom_dc_left_predictor_8x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_8x16 aom_dc_left_predictor_8x16_c
+void aom_dc_left_predictor_8x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_8x16 aom_dc_left_predictor_8x16_neon
void aom_dc_left_predictor_8x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_8x32 aom_dc_left_predictor_8x32_c
+void aom_dc_left_predictor_8x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_8x32 aom_dc_left_predictor_8x32_neon
void aom_dc_left_predictor_8x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_8x4 aom_dc_left_predictor_8x4_c
+void aom_dc_left_predictor_8x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_8x4 aom_dc_left_predictor_8x4_neon
void aom_dc_left_predictor_8x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_left_predictor_8x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
@@ -191,57 +228,72 @@
#define aom_dc_predictor_16x16 aom_dc_predictor_16x16_neon
void aom_dc_predictor_16x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_16x32 aom_dc_predictor_16x32_c
+void aom_dc_predictor_16x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_16x32 aom_dc_predictor_16x32_neon
void aom_dc_predictor_16x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_16x4 aom_dc_predictor_16x4_c
+void aom_dc_predictor_16x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_16x4 aom_dc_predictor_16x4_neon
void aom_dc_predictor_16x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_16x64 aom_dc_predictor_16x64_c
+void aom_dc_predictor_16x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_16x64 aom_dc_predictor_16x64_neon
void aom_dc_predictor_16x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_16x8 aom_dc_predictor_16x8_c
+void aom_dc_predictor_16x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_16x8 aom_dc_predictor_16x8_neon
void aom_dc_predictor_32x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_32x16 aom_dc_predictor_32x16_c
+void aom_dc_predictor_32x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_32x16 aom_dc_predictor_32x16_neon
void aom_dc_predictor_32x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_predictor_32x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_dc_predictor_32x32 aom_dc_predictor_32x32_neon
void aom_dc_predictor_32x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_32x64 aom_dc_predictor_32x64_c
+void aom_dc_predictor_32x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_32x64 aom_dc_predictor_32x64_neon
void aom_dc_predictor_32x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_32x8 aom_dc_predictor_32x8_c
+void aom_dc_predictor_32x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_32x8 aom_dc_predictor_32x8_neon
void aom_dc_predictor_4x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_4x16 aom_dc_predictor_4x16_c
+void aom_dc_predictor_4x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_4x16 aom_dc_predictor_4x16_neon
void aom_dc_predictor_4x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_predictor_4x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_dc_predictor_4x4 aom_dc_predictor_4x4_neon
void aom_dc_predictor_4x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_4x8 aom_dc_predictor_4x8_c
+void aom_dc_predictor_4x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_4x8 aom_dc_predictor_4x8_neon
void aom_dc_predictor_64x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_64x16 aom_dc_predictor_64x16_c
+void aom_dc_predictor_64x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_64x16 aom_dc_predictor_64x16_neon
void aom_dc_predictor_64x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_64x32 aom_dc_predictor_64x32_c
+void aom_dc_predictor_64x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_64x32 aom_dc_predictor_64x32_neon
void aom_dc_predictor_64x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_64x64 aom_dc_predictor_64x64_c
+void aom_dc_predictor_64x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_64x64 aom_dc_predictor_64x64_neon
void aom_dc_predictor_8x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_8x16 aom_dc_predictor_8x16_c
+void aom_dc_predictor_8x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_8x16 aom_dc_predictor_8x16_neon
void aom_dc_predictor_8x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_8x32 aom_dc_predictor_8x32_c
+void aom_dc_predictor_8x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_8x32 aom_dc_predictor_8x32_neon
void aom_dc_predictor_8x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_8x4 aom_dc_predictor_8x4_c
+void aom_dc_predictor_8x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_8x4 aom_dc_predictor_8x4_neon
void aom_dc_predictor_8x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_predictor_8x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
@@ -252,57 +304,72 @@
#define aom_dc_top_predictor_16x16 aom_dc_top_predictor_16x16_neon
void aom_dc_top_predictor_16x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_16x32 aom_dc_top_predictor_16x32_c
+void aom_dc_top_predictor_16x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_16x32 aom_dc_top_predictor_16x32_neon
void aom_dc_top_predictor_16x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_16x4 aom_dc_top_predictor_16x4_c
+void aom_dc_top_predictor_16x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_16x4 aom_dc_top_predictor_16x4_neon
void aom_dc_top_predictor_16x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_16x64 aom_dc_top_predictor_16x64_c
+void aom_dc_top_predictor_16x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_16x64 aom_dc_top_predictor_16x64_neon
void aom_dc_top_predictor_16x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_16x8 aom_dc_top_predictor_16x8_c
+void aom_dc_top_predictor_16x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_16x8 aom_dc_top_predictor_16x8_neon
void aom_dc_top_predictor_32x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_32x16 aom_dc_top_predictor_32x16_c
+void aom_dc_top_predictor_32x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_32x16 aom_dc_top_predictor_32x16_neon
void aom_dc_top_predictor_32x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_top_predictor_32x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_dc_top_predictor_32x32 aom_dc_top_predictor_32x32_neon
void aom_dc_top_predictor_32x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_32x64 aom_dc_top_predictor_32x64_c
+void aom_dc_top_predictor_32x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_32x64 aom_dc_top_predictor_32x64_neon
void aom_dc_top_predictor_32x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_32x8 aom_dc_top_predictor_32x8_c
+void aom_dc_top_predictor_32x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_32x8 aom_dc_top_predictor_32x8_neon
void aom_dc_top_predictor_4x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_4x16 aom_dc_top_predictor_4x16_c
+void aom_dc_top_predictor_4x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_4x16 aom_dc_top_predictor_4x16_neon
void aom_dc_top_predictor_4x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_top_predictor_4x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_dc_top_predictor_4x4 aom_dc_top_predictor_4x4_neon
void aom_dc_top_predictor_4x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_4x8 aom_dc_top_predictor_4x8_c
+void aom_dc_top_predictor_4x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_4x8 aom_dc_top_predictor_4x8_neon
void aom_dc_top_predictor_64x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_64x16 aom_dc_top_predictor_64x16_c
+void aom_dc_top_predictor_64x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_64x16 aom_dc_top_predictor_64x16_neon
void aom_dc_top_predictor_64x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_64x32 aom_dc_top_predictor_64x32_c
+void aom_dc_top_predictor_64x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_64x32 aom_dc_top_predictor_64x32_neon
void aom_dc_top_predictor_64x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_64x64 aom_dc_top_predictor_64x64_c
+void aom_dc_top_predictor_64x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_64x64 aom_dc_top_predictor_64x64_neon
void aom_dc_top_predictor_8x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_8x16 aom_dc_top_predictor_8x16_c
+void aom_dc_top_predictor_8x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_8x16 aom_dc_top_predictor_8x16_neon
void aom_dc_top_predictor_8x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_8x32 aom_dc_top_predictor_8x32_c
+void aom_dc_top_predictor_8x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_8x32 aom_dc_top_predictor_8x32_neon
void aom_dc_top_predictor_8x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_8x4 aom_dc_top_predictor_8x4_c
+void aom_dc_top_predictor_8x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_8x4 aom_dc_top_predictor_8x4_neon
void aom_dc_top_predictor_8x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_top_predictor_8x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
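The blocks above follow the generated-RTCD pattern used throughout this header: for a fixed NEON target, each #define rebinds the public symbol from the _c fallback to the _neon specialization, so dispatch costs nothing at run time. For orientation, a minimal sketch of what every aom_dc_top_predictor_WxH variant computes (the left argument is accepted but ignored; this is an illustration, not the libaom source):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Sketch: DC_TOP prediction fills the block with the rounded average of
     * the row above it. w/h stand in for the WxH baked into each variant. */
    static void dc_top_predictor_sketch(uint8_t *dst, ptrdiff_t stride,
                                        const uint8_t *above,
                                        const uint8_t *left, int w, int h) {
      (void)left;  /* DC_TOP reads only the `above` row. */
      int sum = 0;
      for (int i = 0; i < w; ++i) sum += above[i];
      const uint8_t dc = (uint8_t)((sum + (w >> 1)) / w);  /* rounded mean */
      for (int r = 0; r < h; ++r, dst += stride) memset(dst, dc, (size_t)w);
    }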
@@ -451,10 +518,6 @@
void aom_fdct4x4_lp_neon(const int16_t *input, int16_t *output, int stride);
#define aom_fdct4x4_lp aom_fdct4x4_lp_neon
-void aom_fdct8x8_c(const int16_t *input, tran_low_t *output, int stride);
-void aom_fdct8x8_neon(const int16_t *input, tran_low_t *output, int stride);
-#define aom_fdct8x8 aom_fdct8x8_neon
-
void aom_fft16x16_float_c(const float *input, float *temp, float *output);
#define aom_fft16x16_float aom_fft16x16_float_c
@@ -470,18 +533,6 @@
void aom_fft8x8_float_c(const float *input, float *temp, float *output);
#define aom_fft8x8_float aom_fft8x8_float_c
-void aom_get16x16var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-void aom_get16x16var_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_get16x16var aom_get16x16var_neon
-
-unsigned int aom_get4x4sse_cs_c(const unsigned char *src_ptr, int source_stride, const unsigned char *ref_ptr, int ref_stride);
-unsigned int aom_get4x4sse_cs_neon(const unsigned char *src_ptr, int source_stride, const unsigned char *ref_ptr, int ref_stride);
-#define aom_get4x4sse_cs aom_get4x4sse_cs_neon
-
-void aom_get8x8var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-void aom_get8x8var_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_get8x8var aom_get8x8var_neon
-
void aom_get_blk_sse_sum_c(const int16_t *data, int stride, int bw, int bh, int *x_sum, int64_t *x2_sum);
#define aom_get_blk_sse_sum aom_get_blk_sse_sum_c
@@ -489,7 +540,8 @@
#define aom_get_mb_ss aom_get_mb_ss_c
void aom_get_var_sse_sum_16x16_dual_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse16x16, unsigned int *tot_sse, int *tot_sum, uint32_t *var16x16);
-#define aom_get_var_sse_sum_16x16_dual aom_get_var_sse_sum_16x16_dual_c
+void aom_get_var_sse_sum_16x16_dual_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse16x16, unsigned int *tot_sse, int *tot_sum, uint32_t *var16x16);
+#define aom_get_var_sse_sum_16x16_dual aom_get_var_sse_sum_16x16_dual_neon
void aom_get_var_sse_sum_8x8_quad_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse8x8, int *sum8x8, unsigned int *tot_sse, int *tot_sum, uint32_t *var8x8);
void aom_get_var_sse_sum_8x8_quad_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse8x8, int *sum8x8, unsigned int *tot_sse, int *tot_sum, uint32_t *var8x8);
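The *_sse_sum helpers above hand back both the sum of squared differences and the raw sum for each sub-block, so callers can derive each variance without a second pass over the pixels. A minimal sketch of that derivation, assuming the standard identity var = sse - sum^2 / N for an N-pixel block (illustrative names, not libaom API):

    #include <stdint.h>

    /* Variance of an N-pixel block from its SSE and sum; for a WxH block,
     * N = W * H and log2_n = log2(N), so the divide becomes a shift. */
    static uint32_t variance_from_sse_sum(uint32_t sse, int sum, int log2_n) {
      const int64_t s = sum;  /* widen before squaring */
      return sse - (uint32_t)((s * s) >> log2_n);
    }
    /* e.g. one 16x16 sub-block: var = sse16x16[i] - (sum * sum) / 256 */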
@@ -500,57 +552,72 @@
#define aom_h_predictor_16x16 aom_h_predictor_16x16_neon
void aom_h_predictor_16x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_16x32 aom_h_predictor_16x32_c
+void aom_h_predictor_16x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_16x32 aom_h_predictor_16x32_neon
void aom_h_predictor_16x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_16x4 aom_h_predictor_16x4_c
+void aom_h_predictor_16x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_16x4 aom_h_predictor_16x4_neon
void aom_h_predictor_16x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_16x64 aom_h_predictor_16x64_c
+void aom_h_predictor_16x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_16x64 aom_h_predictor_16x64_neon
void aom_h_predictor_16x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_16x8 aom_h_predictor_16x8_c
+void aom_h_predictor_16x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_16x8 aom_h_predictor_16x8_neon
void aom_h_predictor_32x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_32x16 aom_h_predictor_32x16_c
+void aom_h_predictor_32x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_32x16 aom_h_predictor_32x16_neon
void aom_h_predictor_32x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_h_predictor_32x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_h_predictor_32x32 aom_h_predictor_32x32_neon
void aom_h_predictor_32x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_32x64 aom_h_predictor_32x64_c
+void aom_h_predictor_32x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_32x64 aom_h_predictor_32x64_neon
void aom_h_predictor_32x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_32x8 aom_h_predictor_32x8_c
+void aom_h_predictor_32x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_32x8 aom_h_predictor_32x8_neon
void aom_h_predictor_4x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_4x16 aom_h_predictor_4x16_c
+void aom_h_predictor_4x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_4x16 aom_h_predictor_4x16_neon
void aom_h_predictor_4x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_h_predictor_4x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_h_predictor_4x4 aom_h_predictor_4x4_neon
void aom_h_predictor_4x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_4x8 aom_h_predictor_4x8_c
+void aom_h_predictor_4x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_4x8 aom_h_predictor_4x8_neon
void aom_h_predictor_64x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_64x16 aom_h_predictor_64x16_c
+void aom_h_predictor_64x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_64x16 aom_h_predictor_64x16_neon
void aom_h_predictor_64x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_64x32 aom_h_predictor_64x32_c
+void aom_h_predictor_64x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_64x32 aom_h_predictor_64x32_neon
void aom_h_predictor_64x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_64x64 aom_h_predictor_64x64_c
+void aom_h_predictor_64x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_64x64 aom_h_predictor_64x64_neon
void aom_h_predictor_8x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_8x16 aom_h_predictor_8x16_c
+void aom_h_predictor_8x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_8x16 aom_h_predictor_8x16_neon
void aom_h_predictor_8x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_8x32 aom_h_predictor_8x32_c
+void aom_h_predictor_8x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_8x32 aom_h_predictor_8x32_neon
void aom_h_predictor_8x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_8x4 aom_h_predictor_8x4_c
+void aom_h_predictor_8x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_8x4 aom_h_predictor_8x4_neon
void aom_h_predictor_8x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_h_predictor_8x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
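All of the aom_h_predictor_WxH variants above specialize the same trivial operation: each output row is a solid run of the matching left-column sample, and the above row is unused. A sketch (not the libaom source):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    static void h_predictor_sketch(uint8_t *dst, ptrdiff_t stride,
                                   const uint8_t *above, const uint8_t *left,
                                   int w, int h) {
      (void)above;  /* H prediction reads only the left column. */
      for (int r = 0; r < h; ++r, dst += stride)
        memset(dst, left[r], (size_t)w);  /* replicate left[r] across row r */
    }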
@@ -561,10 +628,12 @@
#define aom_hadamard_16x16 aom_hadamard_16x16_neon
void aom_hadamard_32x32_c(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
-#define aom_hadamard_32x32 aom_hadamard_32x32_c
+void aom_hadamard_32x32_neon(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
+#define aom_hadamard_32x32 aom_hadamard_32x32_neon
void aom_hadamard_4x4_c(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
-#define aom_hadamard_4x4 aom_hadamard_4x4_c
+void aom_hadamard_4x4_neon(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
+#define aom_hadamard_4x4 aom_hadamard_4x4_neon
void aom_hadamard_8x8_c(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
void aom_hadamard_8x8_neon(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
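The Hadamard transforms above feed the encoder's cheap rate/distortion estimates. As a reference point, a plain 4x4 forward Hadamard is two passes of 4-point butterflies; this sketch deliberately ignores the specific coefficient ordering and tran_low_t output type of the real aom_hadamard_4x4:

    #include <stddef.h>
    #include <stdint.h>

    /* Unnormalized 4-point Hadamard butterfly: y = H4 * x. */
    static void hadamard4(const int16_t *x, int16_t *y) {
      const int16_t b0 = x[0] + x[1], b1 = x[0] - x[1];
      const int16_t b2 = x[2] + x[3], b3 = x[2] - x[3];
      y[0] = b0 + b2; y[1] = b1 + b3; y[2] = b0 - b2; y[3] = b1 - b3;
    }

    /* 2-D 4x4 Hadamard: transform rows, then columns of the row result.
     * int16_t is wide enough here for 8-bit residuals (|in| <= 255). */
    static void hadamard_4x4_sketch(const int16_t *src, ptrdiff_t stride,
                                    int16_t *coeff) {
      int16_t rows[16], t[4], u[4];
      for (int r = 0; r < 4; ++r) hadamard4(src + r * stride, rows + 4 * r);
      for (int c = 0; c < 4; ++c) {
        for (int r = 0; r < 4; ++r) t[r] = rows[4 * r + c];
        hadamard4(t, u);
        for (int r = 0; r < 4; ++r) coeff[4 * r + c] = u[r];
      }
    }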
@@ -648,12 +717,6 @@
uint32_t aom_highbd_10_dist_wtd_sub_pixel_avg_variance8x8_c(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param);
#define aom_highbd_10_dist_wtd_sub_pixel_avg_variance8x8 aom_highbd_10_dist_wtd_sub_pixel_avg_variance8x8_c
-void aom_highbd_10_get16x16var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_10_get16x16var aom_highbd_10_get16x16var_c
-
-void aom_highbd_10_get8x8var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_10_get8x8var aom_highbd_10_get8x8var_c
-
unsigned int aom_highbd_10_masked_sub_pixel_variance128x128_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
#define aom_highbd_10_masked_sub_pixel_variance128x128 aom_highbd_10_masked_sub_pixel_variance128x128_c
@@ -721,16 +784,20 @@
#define aom_highbd_10_masked_sub_pixel_variance8x8 aom_highbd_10_masked_sub_pixel_variance8x8_c
unsigned int aom_highbd_10_mse16x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_10_mse16x16 aom_highbd_10_mse16x16_c
+unsigned int aom_highbd_10_mse16x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_10_mse16x16 aom_highbd_10_mse16x16_neon
unsigned int aom_highbd_10_mse16x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_10_mse16x8 aom_highbd_10_mse16x8_c
+unsigned int aom_highbd_10_mse16x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_10_mse16x8 aom_highbd_10_mse16x8_neon
unsigned int aom_highbd_10_mse8x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_10_mse8x16 aom_highbd_10_mse8x16_c
+unsigned int aom_highbd_10_mse8x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_10_mse8x16 aom_highbd_10_mse8x16_neon
unsigned int aom_highbd_10_mse8x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_10_mse8x8 aom_highbd_10_mse8x8_c
+unsigned int aom_highbd_10_mse8x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_10_mse8x8 aom_highbd_10_mse8x8_neon
unsigned int aom_highbd_10_obmc_sub_pixel_variance128x128_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
#define aom_highbd_10_obmc_sub_pixel_variance128x128 aom_highbd_10_obmc_sub_pixel_variance128x128_c
@@ -1012,12 +1079,12 @@
unsigned int aom_highbd_10_variance16x32_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance16x32 aom_highbd_10_variance16x32_neon
-unsigned int aom_highbd_10_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_10_variance16x4_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_10_variance16x4_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance16x4 aom_highbd_10_variance16x4_neon
-unsigned int aom_highbd_10_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_10_variance16x64_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_10_variance16x64_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance16x64 aom_highbd_10_variance16x64_neon
unsigned int aom_highbd_10_variance16x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1042,12 +1109,12 @@
unsigned int aom_highbd_10_variance32x64_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance32x64 aom_highbd_10_variance32x64_neon
-unsigned int aom_highbd_10_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_10_variance32x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_10_variance32x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance32x8 aom_highbd_10_variance32x8_neon
-unsigned int aom_highbd_10_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_10_variance4x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_10_variance4x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance4x16 aom_highbd_10_variance4x16_neon
unsigned int aom_highbd_10_variance4x2_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1065,8 +1132,8 @@
unsigned int aom_highbd_10_variance64x128_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance64x128 aom_highbd_10_variance64x128_neon
-unsigned int aom_highbd_10_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_10_variance64x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_10_variance64x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance64x16 aom_highbd_10_variance64x16_neon
unsigned int aom_highbd_10_variance64x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1081,8 +1148,8 @@
unsigned int aom_highbd_10_variance8x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance8x16 aom_highbd_10_variance8x16_neon
-unsigned int aom_highbd_10_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_10_variance8x32_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_10_variance8x32_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance8x32 aom_highbd_10_variance8x32_neon
unsigned int aom_highbd_10_variance8x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
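Note that the hunks above also flip the odd-sized variance prototypes from uint32_t *sse to unsigned int *sse. On the 32-/64-bit ABIs this header is generated for, the two spellings name the same 32-bit type, so the change only harmonizes these declarations with the rest of the family; a compile-time check makes that assumption explicit:

    #include <stdint.h>
    _Static_assert(sizeof(uint32_t) == sizeof(unsigned int),
                   "uint32_t and unsigned int interchangeable here");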
@@ -1159,12 +1226,6 @@
uint32_t aom_highbd_12_dist_wtd_sub_pixel_avg_variance8x8_c(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param);
#define aom_highbd_12_dist_wtd_sub_pixel_avg_variance8x8 aom_highbd_12_dist_wtd_sub_pixel_avg_variance8x8_c
-void aom_highbd_12_get16x16var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_12_get16x16var aom_highbd_12_get16x16var_c
-
-void aom_highbd_12_get8x8var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_12_get8x8var aom_highbd_12_get8x8var_c
-
unsigned int aom_highbd_12_masked_sub_pixel_variance128x128_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
#define aom_highbd_12_masked_sub_pixel_variance128x128 aom_highbd_12_masked_sub_pixel_variance128x128_c
@@ -1232,16 +1293,20 @@
#define aom_highbd_12_masked_sub_pixel_variance8x8 aom_highbd_12_masked_sub_pixel_variance8x8_c
unsigned int aom_highbd_12_mse16x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_12_mse16x16 aom_highbd_12_mse16x16_c
+unsigned int aom_highbd_12_mse16x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_12_mse16x16 aom_highbd_12_mse16x16_neon
unsigned int aom_highbd_12_mse16x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_12_mse16x8 aom_highbd_12_mse16x8_c
+unsigned int aom_highbd_12_mse16x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_12_mse16x8 aom_highbd_12_mse16x8_neon
unsigned int aom_highbd_12_mse8x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_12_mse8x16 aom_highbd_12_mse8x16_c
+unsigned int aom_highbd_12_mse8x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_12_mse8x16 aom_highbd_12_mse8x16_neon
unsigned int aom_highbd_12_mse8x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_12_mse8x8 aom_highbd_12_mse8x8_c
+unsigned int aom_highbd_12_mse8x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_12_mse8x8 aom_highbd_12_mse8x8_neon
unsigned int aom_highbd_12_obmc_sub_pixel_variance128x128_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
#define aom_highbd_12_obmc_sub_pixel_variance128x128 aom_highbd_12_obmc_sub_pixel_variance128x128_c
@@ -1508,25 +1573,32 @@
#define aom_highbd_12_sub_pixel_variance8x8 aom_highbd_12_sub_pixel_variance8x8_c
unsigned int aom_highbd_12_variance128x128_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance128x128 aom_highbd_12_variance128x128_c
+unsigned int aom_highbd_12_variance128x128_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance128x128 aom_highbd_12_variance128x128_neon
unsigned int aom_highbd_12_variance128x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance128x64 aom_highbd_12_variance128x64_c
+unsigned int aom_highbd_12_variance128x64_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance128x64 aom_highbd_12_variance128x64_neon
unsigned int aom_highbd_12_variance16x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance16x16 aom_highbd_12_variance16x16_c
+unsigned int aom_highbd_12_variance16x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance16x16 aom_highbd_12_variance16x16_neon
unsigned int aom_highbd_12_variance16x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance16x32 aom_highbd_12_variance16x32_c
+unsigned int aom_highbd_12_variance16x32_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance16x32 aom_highbd_12_variance16x32_neon
-unsigned int aom_highbd_12_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_12_variance16x4 aom_highbd_12_variance16x4_c
+unsigned int aom_highbd_12_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_12_variance16x4_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance16x4 aom_highbd_12_variance16x4_neon
-unsigned int aom_highbd_12_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_12_variance16x64 aom_highbd_12_variance16x64_c
+unsigned int aom_highbd_12_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_12_variance16x64_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance16x64 aom_highbd_12_variance16x64_neon
unsigned int aom_highbd_12_variance16x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance16x8 aom_highbd_12_variance16x8_c
+unsigned int aom_highbd_12_variance16x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance16x8 aom_highbd_12_variance16x8_neon
unsigned int aom_highbd_12_variance2x2_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance2x2 aom_highbd_12_variance2x2_c
@@ -1535,52 +1607,67 @@
#define aom_highbd_12_variance2x4 aom_highbd_12_variance2x4_c
unsigned int aom_highbd_12_variance32x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance32x16 aom_highbd_12_variance32x16_c
+unsigned int aom_highbd_12_variance32x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance32x16 aom_highbd_12_variance32x16_neon
unsigned int aom_highbd_12_variance32x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance32x32 aom_highbd_12_variance32x32_c
+unsigned int aom_highbd_12_variance32x32_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance32x32 aom_highbd_12_variance32x32_neon
unsigned int aom_highbd_12_variance32x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance32x64 aom_highbd_12_variance32x64_c
+unsigned int aom_highbd_12_variance32x64_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance32x64 aom_highbd_12_variance32x64_neon
-unsigned int aom_highbd_12_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_12_variance32x8 aom_highbd_12_variance32x8_c
+unsigned int aom_highbd_12_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_12_variance32x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance32x8 aom_highbd_12_variance32x8_neon
-unsigned int aom_highbd_12_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_12_variance4x16 aom_highbd_12_variance4x16_c
+unsigned int aom_highbd_12_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_12_variance4x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance4x16 aom_highbd_12_variance4x16_neon
unsigned int aom_highbd_12_variance4x2_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance4x2 aom_highbd_12_variance4x2_c
unsigned int aom_highbd_12_variance4x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance4x4 aom_highbd_12_variance4x4_c
+unsigned int aom_highbd_12_variance4x4_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance4x4 aom_highbd_12_variance4x4_neon
unsigned int aom_highbd_12_variance4x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance4x8 aom_highbd_12_variance4x8_c
+unsigned int aom_highbd_12_variance4x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance4x8 aom_highbd_12_variance4x8_neon
unsigned int aom_highbd_12_variance64x128_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance64x128 aom_highbd_12_variance64x128_c
+unsigned int aom_highbd_12_variance64x128_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance64x128 aom_highbd_12_variance64x128_neon
-unsigned int aom_highbd_12_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_12_variance64x16 aom_highbd_12_variance64x16_c
+unsigned int aom_highbd_12_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_12_variance64x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance64x16 aom_highbd_12_variance64x16_neon
unsigned int aom_highbd_12_variance64x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance64x32 aom_highbd_12_variance64x32_c
+unsigned int aom_highbd_12_variance64x32_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance64x32 aom_highbd_12_variance64x32_neon
unsigned int aom_highbd_12_variance64x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance64x64 aom_highbd_12_variance64x64_c
+unsigned int aom_highbd_12_variance64x64_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance64x64 aom_highbd_12_variance64x64_neon
unsigned int aom_highbd_12_variance8x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance8x16 aom_highbd_12_variance8x16_c
+unsigned int aom_highbd_12_variance8x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance8x16 aom_highbd_12_variance8x16_neon
-unsigned int aom_highbd_12_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_12_variance8x32 aom_highbd_12_variance8x32_c
+unsigned int aom_highbd_12_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_12_variance8x32_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance8x32 aom_highbd_12_variance8x32_neon
unsigned int aom_highbd_12_variance8x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance8x4 aom_highbd_12_variance8x4_c
+unsigned int aom_highbd_12_variance8x4_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance8x4 aom_highbd_12_variance8x4_neon
unsigned int aom_highbd_12_variance8x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance8x8 aom_highbd_12_variance8x8_c
+unsigned int aom_highbd_12_variance8x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance8x8 aom_highbd_12_variance8x8_neon
uint32_t aom_highbd_8_dist_wtd_sub_pixel_avg_variance128x128_c(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param);
#define aom_highbd_8_dist_wtd_sub_pixel_avg_variance128x128 aom_highbd_8_dist_wtd_sub_pixel_avg_variance128x128_c
@@ -1648,12 +1735,6 @@
uint32_t aom_highbd_8_dist_wtd_sub_pixel_avg_variance8x8_c(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param);
#define aom_highbd_8_dist_wtd_sub_pixel_avg_variance8x8 aom_highbd_8_dist_wtd_sub_pixel_avg_variance8x8_c
-void aom_highbd_8_get16x16var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_8_get16x16var aom_highbd_8_get16x16var_c
-
-void aom_highbd_8_get8x8var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_8_get8x8var aom_highbd_8_get8x8var_c
-
unsigned int aom_highbd_8_masked_sub_pixel_variance128x128_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
#define aom_highbd_8_masked_sub_pixel_variance128x128 aom_highbd_8_masked_sub_pixel_variance128x128_c
@@ -1721,16 +1802,20 @@
#define aom_highbd_8_masked_sub_pixel_variance8x8 aom_highbd_8_masked_sub_pixel_variance8x8_c
unsigned int aom_highbd_8_mse16x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_8_mse16x16 aom_highbd_8_mse16x16_c
+unsigned int aom_highbd_8_mse16x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_8_mse16x16 aom_highbd_8_mse16x16_neon
unsigned int aom_highbd_8_mse16x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_8_mse16x8 aom_highbd_8_mse16x8_c
+unsigned int aom_highbd_8_mse16x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_8_mse16x8 aom_highbd_8_mse16x8_neon
unsigned int aom_highbd_8_mse8x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_8_mse8x16 aom_highbd_8_mse8x16_c
+unsigned int aom_highbd_8_mse8x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_8_mse8x16 aom_highbd_8_mse8x16_neon
unsigned int aom_highbd_8_mse8x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_8_mse8x8 aom_highbd_8_mse8x8_c
+unsigned int aom_highbd_8_mse8x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_8_mse8x8 aom_highbd_8_mse8x8_neon
uint32_t aom_highbd_8_sub_pixel_avg_variance128x128_c(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred);
#define aom_highbd_8_sub_pixel_avg_variance128x128 aom_highbd_8_sub_pixel_avg_variance128x128_c
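On the aom_highbd_*_mse* groups above: despite the uint8_t * signatures, high-bitdepth buffers carry uint16_t samples behind libaom's CONVERT_TO_BYTEPTR/CONVERT_TO_SHORTPTR convention, and as I read this header family the mse entry points accumulate the raw sum of squared differences (no normalization), returning the same value they store through *sse. A sketch of the 8-bit flavor under those assumptions:

    #include <stdint.h>

    static unsigned int mse_sketch(const uint8_t *src, int src_stride,
                                   const uint8_t *ref, int ref_stride,
                                   int w, int h, unsigned int *sse) {
      unsigned int acc = 0;
      for (int r = 0; r < h; ++r, src += src_stride, ref += ref_stride) {
        for (int c = 0; c < w; ++c) {
          const int d = src[c] - ref[c];
          acc += (unsigned int)(d * d);
        }
      }
      *sse = acc;  /* accumulated squared error */
      return acc;  /* returned unnormalized, per the assumed convention */
    }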
@@ -1865,25 +1950,32 @@
#define aom_highbd_8_sub_pixel_variance8x8 aom_highbd_8_sub_pixel_variance8x8_c
unsigned int aom_highbd_8_variance128x128_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance128x128 aom_highbd_8_variance128x128_c
+unsigned int aom_highbd_8_variance128x128_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance128x128 aom_highbd_8_variance128x128_neon
unsigned int aom_highbd_8_variance128x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance128x64 aom_highbd_8_variance128x64_c
+unsigned int aom_highbd_8_variance128x64_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance128x64 aom_highbd_8_variance128x64_neon
unsigned int aom_highbd_8_variance16x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance16x16 aom_highbd_8_variance16x16_c
+unsigned int aom_highbd_8_variance16x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance16x16 aom_highbd_8_variance16x16_neon
unsigned int aom_highbd_8_variance16x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance16x32 aom_highbd_8_variance16x32_c
+unsigned int aom_highbd_8_variance16x32_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance16x32 aom_highbd_8_variance16x32_neon
-unsigned int aom_highbd_8_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_8_variance16x4 aom_highbd_8_variance16x4_c
+unsigned int aom_highbd_8_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_8_variance16x4_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance16x4 aom_highbd_8_variance16x4_neon
-unsigned int aom_highbd_8_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_8_variance16x64 aom_highbd_8_variance16x64_c
+unsigned int aom_highbd_8_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_8_variance16x64_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance16x64 aom_highbd_8_variance16x64_neon
unsigned int aom_highbd_8_variance16x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance16x8 aom_highbd_8_variance16x8_c
+unsigned int aom_highbd_8_variance16x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance16x8 aom_highbd_8_variance16x8_neon
unsigned int aom_highbd_8_variance2x2_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance2x2 aom_highbd_8_variance2x2_c
@@ -1892,59 +1984,75 @@
#define aom_highbd_8_variance2x4 aom_highbd_8_variance2x4_c
unsigned int aom_highbd_8_variance32x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance32x16 aom_highbd_8_variance32x16_c
+unsigned int aom_highbd_8_variance32x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance32x16 aom_highbd_8_variance32x16_neon
unsigned int aom_highbd_8_variance32x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance32x32 aom_highbd_8_variance32x32_c
+unsigned int aom_highbd_8_variance32x32_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance32x32 aom_highbd_8_variance32x32_neon
unsigned int aom_highbd_8_variance32x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance32x64 aom_highbd_8_variance32x64_c
+unsigned int aom_highbd_8_variance32x64_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance32x64 aom_highbd_8_variance32x64_neon
-unsigned int aom_highbd_8_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_8_variance32x8 aom_highbd_8_variance32x8_c
+unsigned int aom_highbd_8_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_8_variance32x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance32x8 aom_highbd_8_variance32x8_neon
-unsigned int aom_highbd_8_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_8_variance4x16 aom_highbd_8_variance4x16_c
+unsigned int aom_highbd_8_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_8_variance4x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance4x16 aom_highbd_8_variance4x16_neon
unsigned int aom_highbd_8_variance4x2_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance4x2 aom_highbd_8_variance4x2_c
unsigned int aom_highbd_8_variance4x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance4x4 aom_highbd_8_variance4x4_c
+unsigned int aom_highbd_8_variance4x4_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance4x4 aom_highbd_8_variance4x4_neon
unsigned int aom_highbd_8_variance4x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance4x8 aom_highbd_8_variance4x8_c
+unsigned int aom_highbd_8_variance4x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance4x8 aom_highbd_8_variance4x8_neon
unsigned int aom_highbd_8_variance64x128_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance64x128 aom_highbd_8_variance64x128_c
+unsigned int aom_highbd_8_variance64x128_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance64x128 aom_highbd_8_variance64x128_neon
-unsigned int aom_highbd_8_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_8_variance64x16 aom_highbd_8_variance64x16_c
+unsigned int aom_highbd_8_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_8_variance64x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance64x16 aom_highbd_8_variance64x16_neon
unsigned int aom_highbd_8_variance64x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance64x32 aom_highbd_8_variance64x32_c
+unsigned int aom_highbd_8_variance64x32_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance64x32 aom_highbd_8_variance64x32_neon
unsigned int aom_highbd_8_variance64x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance64x64 aom_highbd_8_variance64x64_c
+unsigned int aom_highbd_8_variance64x64_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance64x64 aom_highbd_8_variance64x64_neon
unsigned int aom_highbd_8_variance8x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance8x16 aom_highbd_8_variance8x16_c
+unsigned int aom_highbd_8_variance8x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance8x16 aom_highbd_8_variance8x16_neon
-unsigned int aom_highbd_8_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_8_variance8x32 aom_highbd_8_variance8x32_c
+unsigned int aom_highbd_8_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_8_variance8x32_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance8x32 aom_highbd_8_variance8x32_neon
unsigned int aom_highbd_8_variance8x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance8x4 aom_highbd_8_variance8x4_c
+unsigned int aom_highbd_8_variance8x4_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance8x4 aom_highbd_8_variance8x4_neon
unsigned int aom_highbd_8_variance8x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance8x8 aom_highbd_8_variance8x8_c
+unsigned int aom_highbd_8_variance8x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance8x8 aom_highbd_8_variance8x8_neon
unsigned int aom_highbd_avg_4x4_c(const uint8_t *, int p);
unsigned int aom_highbd_avg_4x4_neon(const uint8_t *, int p);
#define aom_highbd_avg_4x4 aom_highbd_avg_4x4_neon
unsigned int aom_highbd_avg_8x8_c(const uint8_t *, int p);
-#define aom_highbd_avg_8x8 aom_highbd_avg_8x8_c
+unsigned int aom_highbd_avg_8x8_neon(const uint8_t *, int p);
+#define aom_highbd_avg_8x8 aom_highbd_avg_8x8_neon
void aom_highbd_blend_a64_d16_mask_c(uint8_t *dst, uint32_t dst_stride, const CONV_BUF_TYPE *src0, uint32_t src0_stride, const CONV_BUF_TYPE *src1, uint32_t src1_stride, const uint8_t *mask, uint32_t mask_stride, int w, int h, int subw, int subh, ConvolveParams *conv_params, const int bd);
#define aom_highbd_blend_a64_d16_mask aom_highbd_blend_a64_d16_mask_c
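aom_highbd_avg_8x8 above likewise gains a NEON binding. The operation is a rounded mean of an 8x8 block, with the usual caveat that the uint8_t pointer aliases uint16_t samples in the high-bitdepth build. A sketch under that assumption (the real code recovers the sample pointer via CONVERT_TO_SHORTPTR rather than a plain cast):

    #include <stdint.h>

    static unsigned int highbd_avg_8x8_sketch(const uint8_t *s8, int p) {
      /* Simplification: assume s8 directly addresses uint16_t samples. */
      const uint16_t *s = (const uint16_t *)s8;
      unsigned int sum = 0;
      for (int r = 0; r < 8; ++r, s += p)
        for (int c = 0; c < 8; ++c) sum += s[c];
      return (sum + 32) >> 6;  /* rounded divide by 64 */
    }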
@@ -1974,237 +2082,308 @@
#define aom_highbd_convolve_copy aom_highbd_convolve_copy_c
void aom_highbd_dc_128_predictor_16x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_16x16 aom_highbd_dc_128_predictor_16x16_c
+void aom_highbd_dc_128_predictor_16x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_16x16 aom_highbd_dc_128_predictor_16x16_neon
void aom_highbd_dc_128_predictor_16x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_16x32 aom_highbd_dc_128_predictor_16x32_c
+void aom_highbd_dc_128_predictor_16x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_16x32 aom_highbd_dc_128_predictor_16x32_neon
void aom_highbd_dc_128_predictor_16x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_16x4 aom_highbd_dc_128_predictor_16x4_c
+void aom_highbd_dc_128_predictor_16x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_16x4 aom_highbd_dc_128_predictor_16x4_neon
void aom_highbd_dc_128_predictor_16x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_16x64 aom_highbd_dc_128_predictor_16x64_c
+void aom_highbd_dc_128_predictor_16x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_16x64 aom_highbd_dc_128_predictor_16x64_neon
void aom_highbd_dc_128_predictor_16x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_16x8 aom_highbd_dc_128_predictor_16x8_c
+void aom_highbd_dc_128_predictor_16x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_16x8 aom_highbd_dc_128_predictor_16x8_neon
void aom_highbd_dc_128_predictor_32x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_32x16 aom_highbd_dc_128_predictor_32x16_c
+void aom_highbd_dc_128_predictor_32x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_32x16 aom_highbd_dc_128_predictor_32x16_neon
void aom_highbd_dc_128_predictor_32x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_32x32 aom_highbd_dc_128_predictor_32x32_c
+void aom_highbd_dc_128_predictor_32x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_32x32 aom_highbd_dc_128_predictor_32x32_neon
void aom_highbd_dc_128_predictor_32x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_32x64 aom_highbd_dc_128_predictor_32x64_c
+void aom_highbd_dc_128_predictor_32x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_32x64 aom_highbd_dc_128_predictor_32x64_neon
void aom_highbd_dc_128_predictor_32x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_32x8 aom_highbd_dc_128_predictor_32x8_c
+void aom_highbd_dc_128_predictor_32x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_32x8 aom_highbd_dc_128_predictor_32x8_neon
void aom_highbd_dc_128_predictor_4x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_4x16 aom_highbd_dc_128_predictor_4x16_c
+void aom_highbd_dc_128_predictor_4x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_4x16 aom_highbd_dc_128_predictor_4x16_neon
void aom_highbd_dc_128_predictor_4x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_4x4 aom_highbd_dc_128_predictor_4x4_c
+void aom_highbd_dc_128_predictor_4x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_4x4 aom_highbd_dc_128_predictor_4x4_neon
void aom_highbd_dc_128_predictor_4x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_4x8 aom_highbd_dc_128_predictor_4x8_c
+void aom_highbd_dc_128_predictor_4x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_4x8 aom_highbd_dc_128_predictor_4x8_neon
void aom_highbd_dc_128_predictor_64x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_64x16 aom_highbd_dc_128_predictor_64x16_c
+void aom_highbd_dc_128_predictor_64x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_64x16 aom_highbd_dc_128_predictor_64x16_neon
void aom_highbd_dc_128_predictor_64x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_64x32 aom_highbd_dc_128_predictor_64x32_c
+void aom_highbd_dc_128_predictor_64x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_64x32 aom_highbd_dc_128_predictor_64x32_neon
void aom_highbd_dc_128_predictor_64x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_64x64 aom_highbd_dc_128_predictor_64x64_c
+void aom_highbd_dc_128_predictor_64x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_64x64 aom_highbd_dc_128_predictor_64x64_neon
void aom_highbd_dc_128_predictor_8x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_8x16 aom_highbd_dc_128_predictor_8x16_c
+void aom_highbd_dc_128_predictor_8x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_8x16 aom_highbd_dc_128_predictor_8x16_neon
void aom_highbd_dc_128_predictor_8x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_8x32 aom_highbd_dc_128_predictor_8x32_c
+void aom_highbd_dc_128_predictor_8x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_8x32 aom_highbd_dc_128_predictor_8x32_neon
void aom_highbd_dc_128_predictor_8x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_8x4 aom_highbd_dc_128_predictor_8x4_c
+void aom_highbd_dc_128_predictor_8x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_8x4 aom_highbd_dc_128_predictor_8x4_neon
void aom_highbd_dc_128_predictor_8x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_8x8 aom_highbd_dc_128_predictor_8x8_c
+void aom_highbd_dc_128_predictor_8x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_8x8 aom_highbd_dc_128_predictor_8x8_neon
void aom_highbd_dc_left_predictor_16x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_16x16 aom_highbd_dc_left_predictor_16x16_c
+void aom_highbd_dc_left_predictor_16x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_16x16 aom_highbd_dc_left_predictor_16x16_neon
void aom_highbd_dc_left_predictor_16x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_16x32 aom_highbd_dc_left_predictor_16x32_c
+void aom_highbd_dc_left_predictor_16x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_16x32 aom_highbd_dc_left_predictor_16x32_neon
void aom_highbd_dc_left_predictor_16x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_16x4 aom_highbd_dc_left_predictor_16x4_c
+void aom_highbd_dc_left_predictor_16x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_16x4 aom_highbd_dc_left_predictor_16x4_neon
void aom_highbd_dc_left_predictor_16x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_16x64 aom_highbd_dc_left_predictor_16x64_c
+void aom_highbd_dc_left_predictor_16x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_16x64 aom_highbd_dc_left_predictor_16x64_neon
void aom_highbd_dc_left_predictor_16x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_16x8 aom_highbd_dc_left_predictor_16x8_c
+void aom_highbd_dc_left_predictor_16x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_16x8 aom_highbd_dc_left_predictor_16x8_neon
void aom_highbd_dc_left_predictor_32x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_32x16 aom_highbd_dc_left_predictor_32x16_c
+void aom_highbd_dc_left_predictor_32x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_32x16 aom_highbd_dc_left_predictor_32x16_neon
void aom_highbd_dc_left_predictor_32x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_32x32 aom_highbd_dc_left_predictor_32x32_c
+void aom_highbd_dc_left_predictor_32x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_32x32 aom_highbd_dc_left_predictor_32x32_neon
void aom_highbd_dc_left_predictor_32x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_32x64 aom_highbd_dc_left_predictor_32x64_c
+void aom_highbd_dc_left_predictor_32x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_32x64 aom_highbd_dc_left_predictor_32x64_neon
void aom_highbd_dc_left_predictor_32x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_32x8 aom_highbd_dc_left_predictor_32x8_c
+void aom_highbd_dc_left_predictor_32x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_32x8 aom_highbd_dc_left_predictor_32x8_neon
void aom_highbd_dc_left_predictor_4x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_4x16 aom_highbd_dc_left_predictor_4x16_c
+void aom_highbd_dc_left_predictor_4x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_4x16 aom_highbd_dc_left_predictor_4x16_neon
void aom_highbd_dc_left_predictor_4x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_4x4 aom_highbd_dc_left_predictor_4x4_c
+void aom_highbd_dc_left_predictor_4x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_4x4 aom_highbd_dc_left_predictor_4x4_neon
void aom_highbd_dc_left_predictor_4x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_4x8 aom_highbd_dc_left_predictor_4x8_c
+void aom_highbd_dc_left_predictor_4x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_4x8 aom_highbd_dc_left_predictor_4x8_neon
void aom_highbd_dc_left_predictor_64x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_64x16 aom_highbd_dc_left_predictor_64x16_c
+void aom_highbd_dc_left_predictor_64x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_64x16 aom_highbd_dc_left_predictor_64x16_neon
void aom_highbd_dc_left_predictor_64x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_64x32 aom_highbd_dc_left_predictor_64x32_c
+void aom_highbd_dc_left_predictor_64x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_64x32 aom_highbd_dc_left_predictor_64x32_neon
void aom_highbd_dc_left_predictor_64x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_64x64 aom_highbd_dc_left_predictor_64x64_c
+void aom_highbd_dc_left_predictor_64x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_64x64 aom_highbd_dc_left_predictor_64x64_neon
void aom_highbd_dc_left_predictor_8x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_8x16 aom_highbd_dc_left_predictor_8x16_c
+void aom_highbd_dc_left_predictor_8x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_8x16 aom_highbd_dc_left_predictor_8x16_neon
void aom_highbd_dc_left_predictor_8x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_8x32 aom_highbd_dc_left_predictor_8x32_c
+void aom_highbd_dc_left_predictor_8x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_8x32 aom_highbd_dc_left_predictor_8x32_neon
void aom_highbd_dc_left_predictor_8x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_8x4 aom_highbd_dc_left_predictor_8x4_c
+void aom_highbd_dc_left_predictor_8x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_8x4 aom_highbd_dc_left_predictor_8x4_neon
void aom_highbd_dc_left_predictor_8x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_8x8 aom_highbd_dc_left_predictor_8x8_c
+void aom_highbd_dc_left_predictor_8x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_8x8 aom_highbd_dc_left_predictor_8x8_neon
void aom_highbd_dc_predictor_16x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
void aom_highbd_dc_predictor_16x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
#define aom_highbd_dc_predictor_16x16 aom_highbd_dc_predictor_16x16_neon
void aom_highbd_dc_predictor_16x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_16x32 aom_highbd_dc_predictor_16x32_c
+void aom_highbd_dc_predictor_16x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_16x32 aom_highbd_dc_predictor_16x32_neon
void aom_highbd_dc_predictor_16x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_16x4 aom_highbd_dc_predictor_16x4_c
+void aom_highbd_dc_predictor_16x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_16x4 aom_highbd_dc_predictor_16x4_neon
void aom_highbd_dc_predictor_16x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_16x64 aom_highbd_dc_predictor_16x64_c
+void aom_highbd_dc_predictor_16x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_16x64 aom_highbd_dc_predictor_16x64_neon
void aom_highbd_dc_predictor_16x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_16x8 aom_highbd_dc_predictor_16x8_c
+void aom_highbd_dc_predictor_16x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_16x8 aom_highbd_dc_predictor_16x8_neon
void aom_highbd_dc_predictor_32x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_32x16 aom_highbd_dc_predictor_32x16_c
+void aom_highbd_dc_predictor_32x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_32x16 aom_highbd_dc_predictor_32x16_neon
void aom_highbd_dc_predictor_32x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
void aom_highbd_dc_predictor_32x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
#define aom_highbd_dc_predictor_32x32 aom_highbd_dc_predictor_32x32_neon
void aom_highbd_dc_predictor_32x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_32x64 aom_highbd_dc_predictor_32x64_c
+void aom_highbd_dc_predictor_32x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_32x64 aom_highbd_dc_predictor_32x64_neon
void aom_highbd_dc_predictor_32x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_32x8 aom_highbd_dc_predictor_32x8_c
+void aom_highbd_dc_predictor_32x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_32x8 aom_highbd_dc_predictor_32x8_neon
void aom_highbd_dc_predictor_4x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_4x16 aom_highbd_dc_predictor_4x16_c
+void aom_highbd_dc_predictor_4x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_4x16 aom_highbd_dc_predictor_4x16_neon
void aom_highbd_dc_predictor_4x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
void aom_highbd_dc_predictor_4x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
#define aom_highbd_dc_predictor_4x4 aom_highbd_dc_predictor_4x4_neon
void aom_highbd_dc_predictor_4x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_4x8 aom_highbd_dc_predictor_4x8_c
+void aom_highbd_dc_predictor_4x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_4x8 aom_highbd_dc_predictor_4x8_neon
void aom_highbd_dc_predictor_64x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_64x16 aom_highbd_dc_predictor_64x16_c
+void aom_highbd_dc_predictor_64x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_64x16 aom_highbd_dc_predictor_64x16_neon
void aom_highbd_dc_predictor_64x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_64x32 aom_highbd_dc_predictor_64x32_c
+void aom_highbd_dc_predictor_64x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_64x32 aom_highbd_dc_predictor_64x32_neon
void aom_highbd_dc_predictor_64x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
void aom_highbd_dc_predictor_64x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
#define aom_highbd_dc_predictor_64x64 aom_highbd_dc_predictor_64x64_neon
void aom_highbd_dc_predictor_8x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_8x16 aom_highbd_dc_predictor_8x16_c
+void aom_highbd_dc_predictor_8x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_8x16 aom_highbd_dc_predictor_8x16_neon
void aom_highbd_dc_predictor_8x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_8x32 aom_highbd_dc_predictor_8x32_c
+void aom_highbd_dc_predictor_8x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_8x32 aom_highbd_dc_predictor_8x32_neon
void aom_highbd_dc_predictor_8x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_8x4 aom_highbd_dc_predictor_8x4_c
+void aom_highbd_dc_predictor_8x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_8x4 aom_highbd_dc_predictor_8x4_neon
void aom_highbd_dc_predictor_8x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
void aom_highbd_dc_predictor_8x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
#define aom_highbd_dc_predictor_8x8 aom_highbd_dc_predictor_8x8_neon
void aom_highbd_dc_top_predictor_16x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_16x16 aom_highbd_dc_top_predictor_16x16_c
+void aom_highbd_dc_top_predictor_16x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_16x16 aom_highbd_dc_top_predictor_16x16_neon
void aom_highbd_dc_top_predictor_16x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_16x32 aom_highbd_dc_top_predictor_16x32_c
+void aom_highbd_dc_top_predictor_16x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_16x32 aom_highbd_dc_top_predictor_16x32_neon
void aom_highbd_dc_top_predictor_16x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_16x4 aom_highbd_dc_top_predictor_16x4_c
+void aom_highbd_dc_top_predictor_16x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_16x4 aom_highbd_dc_top_predictor_16x4_neon
void aom_highbd_dc_top_predictor_16x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_16x64 aom_highbd_dc_top_predictor_16x64_c
+void aom_highbd_dc_top_predictor_16x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_16x64 aom_highbd_dc_top_predictor_16x64_neon
void aom_highbd_dc_top_predictor_16x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_16x8 aom_highbd_dc_top_predictor_16x8_c
+void aom_highbd_dc_top_predictor_16x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_16x8 aom_highbd_dc_top_predictor_16x8_neon
void aom_highbd_dc_top_predictor_32x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_32x16 aom_highbd_dc_top_predictor_32x16_c
+void aom_highbd_dc_top_predictor_32x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_32x16 aom_highbd_dc_top_predictor_32x16_neon
void aom_highbd_dc_top_predictor_32x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_32x32 aom_highbd_dc_top_predictor_32x32_c
+void aom_highbd_dc_top_predictor_32x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_32x32 aom_highbd_dc_top_predictor_32x32_neon
void aom_highbd_dc_top_predictor_32x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_32x64 aom_highbd_dc_top_predictor_32x64_c
+void aom_highbd_dc_top_predictor_32x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_32x64 aom_highbd_dc_top_predictor_32x64_neon
void aom_highbd_dc_top_predictor_32x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_32x8 aom_highbd_dc_top_predictor_32x8_c
+void aom_highbd_dc_top_predictor_32x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_32x8 aom_highbd_dc_top_predictor_32x8_neon
void aom_highbd_dc_top_predictor_4x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_4x16 aom_highbd_dc_top_predictor_4x16_c
+void aom_highbd_dc_top_predictor_4x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_4x16 aom_highbd_dc_top_predictor_4x16_neon
void aom_highbd_dc_top_predictor_4x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_4x4 aom_highbd_dc_top_predictor_4x4_c
+void aom_highbd_dc_top_predictor_4x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_4x4 aom_highbd_dc_top_predictor_4x4_neon
void aom_highbd_dc_top_predictor_4x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_4x8 aom_highbd_dc_top_predictor_4x8_c
+void aom_highbd_dc_top_predictor_4x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_4x8 aom_highbd_dc_top_predictor_4x8_neon
void aom_highbd_dc_top_predictor_64x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_64x16 aom_highbd_dc_top_predictor_64x16_c
+void aom_highbd_dc_top_predictor_64x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_64x16 aom_highbd_dc_top_predictor_64x16_neon
void aom_highbd_dc_top_predictor_64x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_64x32 aom_highbd_dc_top_predictor_64x32_c
+void aom_highbd_dc_top_predictor_64x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_64x32 aom_highbd_dc_top_predictor_64x32_neon
void aom_highbd_dc_top_predictor_64x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_64x64 aom_highbd_dc_top_predictor_64x64_c
+void aom_highbd_dc_top_predictor_64x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_64x64 aom_highbd_dc_top_predictor_64x64_neon
void aom_highbd_dc_top_predictor_8x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_8x16 aom_highbd_dc_top_predictor_8x16_c
+void aom_highbd_dc_top_predictor_8x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_8x16 aom_highbd_dc_top_predictor_8x16_neon
void aom_highbd_dc_top_predictor_8x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_8x32 aom_highbd_dc_top_predictor_8x32_c
+void aom_highbd_dc_top_predictor_8x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_8x32 aom_highbd_dc_top_predictor_8x32_neon
void aom_highbd_dc_top_predictor_8x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_8x4 aom_highbd_dc_top_predictor_8x4_c
+void aom_highbd_dc_top_predictor_8x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_8x4 aom_highbd_dc_top_predictor_8x4_neon
void aom_highbd_dc_top_predictor_8x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_8x8 aom_highbd_dc_top_predictor_8x8_c
+void aom_highbd_dc_top_predictor_8x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_8x8 aom_highbd_dc_top_predictor_8x8_neon
void aom_highbd_dist_wtd_comp_avg_pred_c(uint8_t *comp_pred8, const uint8_t *pred8, int width, int height, const uint8_t *ref8, int ref_stride, const DIST_WTD_COMP_PARAMS *jcp_param);
#define aom_highbd_dist_wtd_comp_avg_pred aom_highbd_dist_wtd_comp_avg_pred_c
@@ -2275,74 +2454,93 @@
unsigned int aom_highbd_dist_wtd_sad8x8_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param);
#define aom_highbd_dist_wtd_sad8x8_avg aom_highbd_dist_wtd_sad8x8_avg_c
-void aom_highbd_fdct8x8_c(const int16_t *input, tran_low_t *output, int stride);
-#define aom_highbd_fdct8x8 aom_highbd_fdct8x8_c
-
void aom_highbd_h_predictor_16x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_16x16 aom_highbd_h_predictor_16x16_c
+void aom_highbd_h_predictor_16x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_16x16 aom_highbd_h_predictor_16x16_neon
void aom_highbd_h_predictor_16x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_16x32 aom_highbd_h_predictor_16x32_c
+void aom_highbd_h_predictor_16x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_16x32 aom_highbd_h_predictor_16x32_neon
void aom_highbd_h_predictor_16x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_16x4 aom_highbd_h_predictor_16x4_c
+void aom_highbd_h_predictor_16x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_16x4 aom_highbd_h_predictor_16x4_neon
void aom_highbd_h_predictor_16x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_16x64 aom_highbd_h_predictor_16x64_c
+void aom_highbd_h_predictor_16x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_16x64 aom_highbd_h_predictor_16x64_neon
void aom_highbd_h_predictor_16x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_16x8 aom_highbd_h_predictor_16x8_c
+void aom_highbd_h_predictor_16x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_16x8 aom_highbd_h_predictor_16x8_neon
void aom_highbd_h_predictor_32x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_32x16 aom_highbd_h_predictor_32x16_c
+void aom_highbd_h_predictor_32x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_32x16 aom_highbd_h_predictor_32x16_neon
void aom_highbd_h_predictor_32x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_32x32 aom_highbd_h_predictor_32x32_c
+void aom_highbd_h_predictor_32x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_32x32 aom_highbd_h_predictor_32x32_neon
void aom_highbd_h_predictor_32x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_32x64 aom_highbd_h_predictor_32x64_c
+void aom_highbd_h_predictor_32x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_32x64 aom_highbd_h_predictor_32x64_neon
void aom_highbd_h_predictor_32x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_32x8 aom_highbd_h_predictor_32x8_c
+void aom_highbd_h_predictor_32x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_32x8 aom_highbd_h_predictor_32x8_neon
void aom_highbd_h_predictor_4x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_4x16 aom_highbd_h_predictor_4x16_c
+void aom_highbd_h_predictor_4x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_4x16 aom_highbd_h_predictor_4x16_neon
void aom_highbd_h_predictor_4x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_4x4 aom_highbd_h_predictor_4x4_c
+void aom_highbd_h_predictor_4x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_4x4 aom_highbd_h_predictor_4x4_neon
void aom_highbd_h_predictor_4x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_4x8 aom_highbd_h_predictor_4x8_c
+void aom_highbd_h_predictor_4x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_4x8 aom_highbd_h_predictor_4x8_neon
void aom_highbd_h_predictor_64x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_64x16 aom_highbd_h_predictor_64x16_c
+void aom_highbd_h_predictor_64x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_64x16 aom_highbd_h_predictor_64x16_neon
void aom_highbd_h_predictor_64x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_64x32 aom_highbd_h_predictor_64x32_c
+void aom_highbd_h_predictor_64x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_64x32 aom_highbd_h_predictor_64x32_neon
void aom_highbd_h_predictor_64x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_64x64 aom_highbd_h_predictor_64x64_c
+void aom_highbd_h_predictor_64x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_64x64 aom_highbd_h_predictor_64x64_neon
void aom_highbd_h_predictor_8x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_8x16 aom_highbd_h_predictor_8x16_c
+void aom_highbd_h_predictor_8x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_8x16 aom_highbd_h_predictor_8x16_neon
void aom_highbd_h_predictor_8x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_8x32 aom_highbd_h_predictor_8x32_c
+void aom_highbd_h_predictor_8x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_8x32 aom_highbd_h_predictor_8x32_neon
void aom_highbd_h_predictor_8x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_8x4 aom_highbd_h_predictor_8x4_c
+void aom_highbd_h_predictor_8x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_8x4 aom_highbd_h_predictor_8x4_neon
void aom_highbd_h_predictor_8x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_8x8 aom_highbd_h_predictor_8x8_c
+void aom_highbd_h_predictor_8x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_8x8 aom_highbd_h_predictor_8x8_neon
void aom_highbd_hadamard_16x16_c(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
-#define aom_highbd_hadamard_16x16 aom_highbd_hadamard_16x16_c
+void aom_highbd_hadamard_16x16_neon(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
+#define aom_highbd_hadamard_16x16 aom_highbd_hadamard_16x16_neon
void aom_highbd_hadamard_32x32_c(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
-#define aom_highbd_hadamard_32x32 aom_highbd_hadamard_32x32_c
+void aom_highbd_hadamard_32x32_neon(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
+#define aom_highbd_hadamard_32x32 aom_highbd_hadamard_32x32_neon
void aom_highbd_hadamard_8x8_c(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
-#define aom_highbd_hadamard_8x8 aom_highbd_hadamard_8x8_c
+void aom_highbd_hadamard_8x8_neon(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
+#define aom_highbd_hadamard_8x8 aom_highbd_hadamard_8x8_neon
void aom_highbd_lpf_horizontal_14_c(uint16_t *s, int pitch, const uint8_t *blimit, const uint8_t *limit, const uint8_t *thresh, int bd);
void aom_highbd_lpf_horizontal_14_neon(uint16_t *s, int pitch, const uint8_t *blimit, const uint8_t *limit, const uint8_t *thresh, int bd);
@@ -2475,7 +2673,8 @@
#define aom_highbd_masked_sad8x8 aom_highbd_masked_sad8x8_c
void aom_highbd_minmax_8x8_c(const uint8_t *s, int p, const uint8_t *d, int dp, int *min, int *max);
-#define aom_highbd_minmax_8x8 aom_highbd_minmax_8x8_c
+void aom_highbd_minmax_8x8_neon(const uint8_t *s, int p, const uint8_t *d, int dp, int *min, int *max);
+#define aom_highbd_minmax_8x8 aom_highbd_minmax_8x8_neon
unsigned int aom_highbd_obmc_sad128x128_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
#define aom_highbd_obmc_sad128x128 aom_highbd_obmc_sad128x128_c
@@ -2776,7 +2975,8 @@
#define aom_highbd_quantize_b_adaptive aom_highbd_quantize_b_adaptive_neon
unsigned int aom_highbd_sad128x128_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad128x128 aom_highbd_sad128x128_c
+unsigned int aom_highbd_sad128x128_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad128x128 aom_highbd_sad128x128_neon
unsigned int aom_highbd_sad128x128_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad128x128_avg aom_highbd_sad128x128_avg_c
@@ -2785,10 +2985,12 @@
#define aom_highbd_sad128x128x3d aom_highbd_sad128x128x3d_c
void aom_highbd_sad128x128x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad128x128x4d aom_highbd_sad128x128x4d_c
+void aom_highbd_sad128x128x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad128x128x4d aom_highbd_sad128x128x4d_neon
unsigned int aom_highbd_sad128x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad128x64 aom_highbd_sad128x64_c
+unsigned int aom_highbd_sad128x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad128x64 aom_highbd_sad128x64_neon
unsigned int aom_highbd_sad128x64_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad128x64_avg aom_highbd_sad128x64_avg_c
@@ -2797,10 +2999,12 @@
#define aom_highbd_sad128x64x3d aom_highbd_sad128x64x3d_c
void aom_highbd_sad128x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad128x64x4d aom_highbd_sad128x64x4d_c
+void aom_highbd_sad128x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad128x64x4d aom_highbd_sad128x64x4d_neon
unsigned int aom_highbd_sad16x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad16x16 aom_highbd_sad16x16_c
+unsigned int aom_highbd_sad16x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad16x16 aom_highbd_sad16x16_neon
unsigned int aom_highbd_sad16x16_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad16x16_avg aom_highbd_sad16x16_avg_c
@@ -2809,10 +3013,12 @@
#define aom_highbd_sad16x16x3d aom_highbd_sad16x16x3d_c
void aom_highbd_sad16x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad16x16x4d aom_highbd_sad16x16x4d_c
+void aom_highbd_sad16x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad16x16x4d aom_highbd_sad16x16x4d_neon
unsigned int aom_highbd_sad16x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad16x32 aom_highbd_sad16x32_c
+unsigned int aom_highbd_sad16x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad16x32 aom_highbd_sad16x32_neon
unsigned int aom_highbd_sad16x32_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad16x32_avg aom_highbd_sad16x32_avg_c
@@ -2821,10 +3027,12 @@
#define aom_highbd_sad16x32x3d aom_highbd_sad16x32x3d_c
void aom_highbd_sad16x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad16x32x4d aom_highbd_sad16x32x4d_c
+void aom_highbd_sad16x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad16x32x4d aom_highbd_sad16x32x4d_neon
unsigned int aom_highbd_sad16x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad16x4 aom_highbd_sad16x4_c
+unsigned int aom_highbd_sad16x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad16x4 aom_highbd_sad16x4_neon
unsigned int aom_highbd_sad16x4_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad16x4_avg aom_highbd_sad16x4_avg_c
@@ -2833,10 +3041,12 @@
#define aom_highbd_sad16x4x3d aom_highbd_sad16x4x3d_c
void aom_highbd_sad16x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad16x4x4d aom_highbd_sad16x4x4d_c
+void aom_highbd_sad16x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad16x4x4d aom_highbd_sad16x4x4d_neon
unsigned int aom_highbd_sad16x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad16x64 aom_highbd_sad16x64_c
+unsigned int aom_highbd_sad16x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad16x64 aom_highbd_sad16x64_neon
unsigned int aom_highbd_sad16x64_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad16x64_avg aom_highbd_sad16x64_avg_c
@@ -2845,10 +3055,12 @@
#define aom_highbd_sad16x64x3d aom_highbd_sad16x64x3d_c
void aom_highbd_sad16x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad16x64x4d aom_highbd_sad16x64x4d_c
+void aom_highbd_sad16x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad16x64x4d aom_highbd_sad16x64x4d_neon
unsigned int aom_highbd_sad16x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad16x8 aom_highbd_sad16x8_c
+unsigned int aom_highbd_sad16x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad16x8 aom_highbd_sad16x8_neon
unsigned int aom_highbd_sad16x8_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad16x8_avg aom_highbd_sad16x8_avg_c
@@ -2857,10 +3069,12 @@
#define aom_highbd_sad16x8x3d aom_highbd_sad16x8x3d_c
void aom_highbd_sad16x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad16x8x4d aom_highbd_sad16x8x4d_c
+void aom_highbd_sad16x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad16x8x4d aom_highbd_sad16x8x4d_neon
unsigned int aom_highbd_sad32x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad32x16 aom_highbd_sad32x16_c
+unsigned int aom_highbd_sad32x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad32x16 aom_highbd_sad32x16_neon
unsigned int aom_highbd_sad32x16_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad32x16_avg aom_highbd_sad32x16_avg_c
@@ -2869,10 +3083,12 @@
#define aom_highbd_sad32x16x3d aom_highbd_sad32x16x3d_c
void aom_highbd_sad32x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad32x16x4d aom_highbd_sad32x16x4d_c
+void aom_highbd_sad32x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad32x16x4d aom_highbd_sad32x16x4d_neon
unsigned int aom_highbd_sad32x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad32x32 aom_highbd_sad32x32_c
+unsigned int aom_highbd_sad32x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad32x32 aom_highbd_sad32x32_neon
unsigned int aom_highbd_sad32x32_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad32x32_avg aom_highbd_sad32x32_avg_c
@@ -2881,10 +3097,12 @@
#define aom_highbd_sad32x32x3d aom_highbd_sad32x32x3d_c
void aom_highbd_sad32x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad32x32x4d aom_highbd_sad32x32x4d_c
+void aom_highbd_sad32x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad32x32x4d aom_highbd_sad32x32x4d_neon
unsigned int aom_highbd_sad32x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad32x64 aom_highbd_sad32x64_c
+unsigned int aom_highbd_sad32x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad32x64 aom_highbd_sad32x64_neon
unsigned int aom_highbd_sad32x64_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad32x64_avg aom_highbd_sad32x64_avg_c
@@ -2893,10 +3111,12 @@
#define aom_highbd_sad32x64x3d aom_highbd_sad32x64x3d_c
void aom_highbd_sad32x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad32x64x4d aom_highbd_sad32x64x4d_c
+void aom_highbd_sad32x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad32x64x4d aom_highbd_sad32x64x4d_neon
unsigned int aom_highbd_sad32x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad32x8 aom_highbd_sad32x8_c
+unsigned int aom_highbd_sad32x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad32x8 aom_highbd_sad32x8_neon
unsigned int aom_highbd_sad32x8_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad32x8_avg aom_highbd_sad32x8_avg_c
@@ -2905,10 +3125,12 @@
#define aom_highbd_sad32x8x3d aom_highbd_sad32x8x3d_c
void aom_highbd_sad32x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad32x8x4d aom_highbd_sad32x8x4d_c
+void aom_highbd_sad32x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad32x8x4d aom_highbd_sad32x8x4d_neon
unsigned int aom_highbd_sad4x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad4x16 aom_highbd_sad4x16_c
+unsigned int aom_highbd_sad4x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad4x16 aom_highbd_sad4x16_neon
unsigned int aom_highbd_sad4x16_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad4x16_avg aom_highbd_sad4x16_avg_c
@@ -2917,10 +3139,12 @@
#define aom_highbd_sad4x16x3d aom_highbd_sad4x16x3d_c
void aom_highbd_sad4x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad4x16x4d aom_highbd_sad4x16x4d_c
+void aom_highbd_sad4x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad4x16x4d aom_highbd_sad4x16x4d_neon
unsigned int aom_highbd_sad4x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad4x4 aom_highbd_sad4x4_c
+unsigned int aom_highbd_sad4x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad4x4 aom_highbd_sad4x4_neon
unsigned int aom_highbd_sad4x4_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad4x4_avg aom_highbd_sad4x4_avg_c
@@ -2929,10 +3153,12 @@
#define aom_highbd_sad4x4x3d aom_highbd_sad4x4x3d_c
void aom_highbd_sad4x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad4x4x4d aom_highbd_sad4x4x4d_c
+void aom_highbd_sad4x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad4x4x4d aom_highbd_sad4x4x4d_neon
unsigned int aom_highbd_sad4x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad4x8 aom_highbd_sad4x8_c
+unsigned int aom_highbd_sad4x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad4x8 aom_highbd_sad4x8_neon
unsigned int aom_highbd_sad4x8_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad4x8_avg aom_highbd_sad4x8_avg_c
@@ -2941,10 +3167,12 @@
#define aom_highbd_sad4x8x3d aom_highbd_sad4x8x3d_c
void aom_highbd_sad4x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad4x8x4d aom_highbd_sad4x8x4d_c
+void aom_highbd_sad4x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad4x8x4d aom_highbd_sad4x8x4d_neon
unsigned int aom_highbd_sad64x128_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad64x128 aom_highbd_sad64x128_c
+unsigned int aom_highbd_sad64x128_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad64x128 aom_highbd_sad64x128_neon
unsigned int aom_highbd_sad64x128_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad64x128_avg aom_highbd_sad64x128_avg_c
@@ -2953,10 +3181,12 @@
#define aom_highbd_sad64x128x3d aom_highbd_sad64x128x3d_c
void aom_highbd_sad64x128x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad64x128x4d aom_highbd_sad64x128x4d_c
+void aom_highbd_sad64x128x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad64x128x4d aom_highbd_sad64x128x4d_neon
unsigned int aom_highbd_sad64x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad64x16 aom_highbd_sad64x16_c
+unsigned int aom_highbd_sad64x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad64x16 aom_highbd_sad64x16_neon
unsigned int aom_highbd_sad64x16_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad64x16_avg aom_highbd_sad64x16_avg_c
@@ -2965,10 +3195,12 @@
#define aom_highbd_sad64x16x3d aom_highbd_sad64x16x3d_c
void aom_highbd_sad64x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad64x16x4d aom_highbd_sad64x16x4d_c
+void aom_highbd_sad64x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad64x16x4d aom_highbd_sad64x16x4d_neon
unsigned int aom_highbd_sad64x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad64x32 aom_highbd_sad64x32_c
+unsigned int aom_highbd_sad64x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad64x32 aom_highbd_sad64x32_neon
unsigned int aom_highbd_sad64x32_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad64x32_avg aom_highbd_sad64x32_avg_c
@@ -2977,10 +3209,12 @@
#define aom_highbd_sad64x32x3d aom_highbd_sad64x32x3d_c
void aom_highbd_sad64x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad64x32x4d aom_highbd_sad64x32x4d_c
+void aom_highbd_sad64x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad64x32x4d aom_highbd_sad64x32x4d_neon
unsigned int aom_highbd_sad64x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad64x64 aom_highbd_sad64x64_c
+unsigned int aom_highbd_sad64x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad64x64 aom_highbd_sad64x64_neon
unsigned int aom_highbd_sad64x64_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad64x64_avg aom_highbd_sad64x64_avg_c
@@ -2989,10 +3223,12 @@
#define aom_highbd_sad64x64x3d aom_highbd_sad64x64x3d_c
void aom_highbd_sad64x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad64x64x4d aom_highbd_sad64x64x4d_c
+void aom_highbd_sad64x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad64x64x4d aom_highbd_sad64x64x4d_neon
unsigned int aom_highbd_sad8x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad8x16 aom_highbd_sad8x16_c
+unsigned int aom_highbd_sad8x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad8x16 aom_highbd_sad8x16_neon
unsigned int aom_highbd_sad8x16_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad8x16_avg aom_highbd_sad8x16_avg_c
@@ -3001,10 +3237,12 @@
#define aom_highbd_sad8x16x3d aom_highbd_sad8x16x3d_c
void aom_highbd_sad8x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad8x16x4d aom_highbd_sad8x16x4d_c
+void aom_highbd_sad8x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad8x16x4d aom_highbd_sad8x16x4d_neon
unsigned int aom_highbd_sad8x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad8x32 aom_highbd_sad8x32_c
+unsigned int aom_highbd_sad8x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad8x32 aom_highbd_sad8x32_neon
unsigned int aom_highbd_sad8x32_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad8x32_avg aom_highbd_sad8x32_avg_c
@@ -3013,10 +3251,12 @@
#define aom_highbd_sad8x32x3d aom_highbd_sad8x32x3d_c
void aom_highbd_sad8x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad8x32x4d aom_highbd_sad8x32x4d_c
+void aom_highbd_sad8x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad8x32x4d aom_highbd_sad8x32x4d_neon
unsigned int aom_highbd_sad8x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad8x4 aom_highbd_sad8x4_c
+unsigned int aom_highbd_sad8x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad8x4 aom_highbd_sad8x4_neon
unsigned int aom_highbd_sad8x4_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad8x4_avg aom_highbd_sad8x4_avg_c
@@ -3025,10 +3265,12 @@
#define aom_highbd_sad8x4x3d aom_highbd_sad8x4x3d_c
void aom_highbd_sad8x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad8x4x4d aom_highbd_sad8x4x4d_c
+void aom_highbd_sad8x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad8x4x4d aom_highbd_sad8x4x4d_neon
unsigned int aom_highbd_sad8x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad8x8 aom_highbd_sad8x8_c
+unsigned int aom_highbd_sad8x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad8x8 aom_highbd_sad8x8_neon
unsigned int aom_highbd_sad8x8_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad8x8_avg aom_highbd_sad8x8_avg_c
@@ -3037,139 +3279,184 @@
#define aom_highbd_sad8x8x3d aom_highbd_sad8x8x3d_c
void aom_highbd_sad8x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad8x8x4d aom_highbd_sad8x8x4d_c
+void aom_highbd_sad8x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad8x8x4d aom_highbd_sad8x8x4d_neon
unsigned int aom_highbd_sad_skip_128x128_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_128x128 aom_highbd_sad_skip_128x128_c
+unsigned int aom_highbd_sad_skip_128x128_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_128x128 aom_highbd_sad_skip_128x128_neon
void aom_highbd_sad_skip_128x128x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_128x128x4d aom_highbd_sad_skip_128x128x4d_c
+void aom_highbd_sad_skip_128x128x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_128x128x4d aom_highbd_sad_skip_128x128x4d_neon
unsigned int aom_highbd_sad_skip_128x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_128x64 aom_highbd_sad_skip_128x64_c
+unsigned int aom_highbd_sad_skip_128x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_128x64 aom_highbd_sad_skip_128x64_neon
void aom_highbd_sad_skip_128x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_128x64x4d aom_highbd_sad_skip_128x64x4d_c
+void aom_highbd_sad_skip_128x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_128x64x4d aom_highbd_sad_skip_128x64x4d_neon
unsigned int aom_highbd_sad_skip_16x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_16x16 aom_highbd_sad_skip_16x16_c
+unsigned int aom_highbd_sad_skip_16x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_16x16 aom_highbd_sad_skip_16x16_neon
void aom_highbd_sad_skip_16x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_16x16x4d aom_highbd_sad_skip_16x16x4d_c
+void aom_highbd_sad_skip_16x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_16x16x4d aom_highbd_sad_skip_16x16x4d_neon
unsigned int aom_highbd_sad_skip_16x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_16x32 aom_highbd_sad_skip_16x32_c
+unsigned int aom_highbd_sad_skip_16x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_16x32 aom_highbd_sad_skip_16x32_neon
void aom_highbd_sad_skip_16x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_16x32x4d aom_highbd_sad_skip_16x32x4d_c
+void aom_highbd_sad_skip_16x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_16x32x4d aom_highbd_sad_skip_16x32x4d_neon
unsigned int aom_highbd_sad_skip_16x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_16x4 aom_highbd_sad_skip_16x4_c
+unsigned int aom_highbd_sad_skip_16x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_16x4 aom_highbd_sad_skip_16x4_neon
void aom_highbd_sad_skip_16x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_16x4x4d aom_highbd_sad_skip_16x4x4d_c
+void aom_highbd_sad_skip_16x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_16x4x4d aom_highbd_sad_skip_16x4x4d_neon
unsigned int aom_highbd_sad_skip_16x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_16x64 aom_highbd_sad_skip_16x64_c
+unsigned int aom_highbd_sad_skip_16x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_16x64 aom_highbd_sad_skip_16x64_neon
void aom_highbd_sad_skip_16x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_16x64x4d aom_highbd_sad_skip_16x64x4d_c
+void aom_highbd_sad_skip_16x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_16x64x4d aom_highbd_sad_skip_16x64x4d_neon
unsigned int aom_highbd_sad_skip_16x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_16x8 aom_highbd_sad_skip_16x8_c
+unsigned int aom_highbd_sad_skip_16x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_16x8 aom_highbd_sad_skip_16x8_neon
void aom_highbd_sad_skip_16x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_16x8x4d aom_highbd_sad_skip_16x8x4d_c
+void aom_highbd_sad_skip_16x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_16x8x4d aom_highbd_sad_skip_16x8x4d_neon
unsigned int aom_highbd_sad_skip_32x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_32x16 aom_highbd_sad_skip_32x16_c
+unsigned int aom_highbd_sad_skip_32x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_32x16 aom_highbd_sad_skip_32x16_neon
void aom_highbd_sad_skip_32x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_32x16x4d aom_highbd_sad_skip_32x16x4d_c
+void aom_highbd_sad_skip_32x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_32x16x4d aom_highbd_sad_skip_32x16x4d_neon
unsigned int aom_highbd_sad_skip_32x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_32x32 aom_highbd_sad_skip_32x32_c
+unsigned int aom_highbd_sad_skip_32x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_32x32 aom_highbd_sad_skip_32x32_neon
void aom_highbd_sad_skip_32x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_32x32x4d aom_highbd_sad_skip_32x32x4d_c
+void aom_highbd_sad_skip_32x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_32x32x4d aom_highbd_sad_skip_32x32x4d_neon
unsigned int aom_highbd_sad_skip_32x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_32x64 aom_highbd_sad_skip_32x64_c
+unsigned int aom_highbd_sad_skip_32x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_32x64 aom_highbd_sad_skip_32x64_neon
void aom_highbd_sad_skip_32x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_32x64x4d aom_highbd_sad_skip_32x64x4d_c
+void aom_highbd_sad_skip_32x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_32x64x4d aom_highbd_sad_skip_32x64x4d_neon
unsigned int aom_highbd_sad_skip_32x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_32x8 aom_highbd_sad_skip_32x8_c
+unsigned int aom_highbd_sad_skip_32x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_32x8 aom_highbd_sad_skip_32x8_neon
void aom_highbd_sad_skip_32x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_32x8x4d aom_highbd_sad_skip_32x8x4d_c
+void aom_highbd_sad_skip_32x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_32x8x4d aom_highbd_sad_skip_32x8x4d_neon
unsigned int aom_highbd_sad_skip_4x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_4x16 aom_highbd_sad_skip_4x16_c
+unsigned int aom_highbd_sad_skip_4x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_4x16 aom_highbd_sad_skip_4x16_neon
void aom_highbd_sad_skip_4x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_4x16x4d aom_highbd_sad_skip_4x16x4d_c
+void aom_highbd_sad_skip_4x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_4x16x4d aom_highbd_sad_skip_4x16x4d_neon
unsigned int aom_highbd_sad_skip_4x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_4x4 aom_highbd_sad_skip_4x4_c
+unsigned int aom_highbd_sad_skip_4x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_4x4 aom_highbd_sad_skip_4x4_neon
void aom_highbd_sad_skip_4x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_4x4x4d aom_highbd_sad_skip_4x4x4d_c
+void aom_highbd_sad_skip_4x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_4x4x4d aom_highbd_sad_skip_4x4x4d_neon
unsigned int aom_highbd_sad_skip_4x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_4x8 aom_highbd_sad_skip_4x8_c
+unsigned int aom_highbd_sad_skip_4x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_4x8 aom_highbd_sad_skip_4x8_neon
void aom_highbd_sad_skip_4x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_4x8x4d aom_highbd_sad_skip_4x8x4d_c
+void aom_highbd_sad_skip_4x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_4x8x4d aom_highbd_sad_skip_4x8x4d_neon
unsigned int aom_highbd_sad_skip_64x128_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_64x128 aom_highbd_sad_skip_64x128_c
+unsigned int aom_highbd_sad_skip_64x128_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_64x128 aom_highbd_sad_skip_64x128_neon
void aom_highbd_sad_skip_64x128x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_64x128x4d aom_highbd_sad_skip_64x128x4d_c
+void aom_highbd_sad_skip_64x128x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_64x128x4d aom_highbd_sad_skip_64x128x4d_neon
unsigned int aom_highbd_sad_skip_64x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_64x16 aom_highbd_sad_skip_64x16_c
+unsigned int aom_highbd_sad_skip_64x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_64x16 aom_highbd_sad_skip_64x16_neon
void aom_highbd_sad_skip_64x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_64x16x4d aom_highbd_sad_skip_64x16x4d_c
+void aom_highbd_sad_skip_64x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_64x16x4d aom_highbd_sad_skip_64x16x4d_neon
unsigned int aom_highbd_sad_skip_64x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_64x32 aom_highbd_sad_skip_64x32_c
+unsigned int aom_highbd_sad_skip_64x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_64x32 aom_highbd_sad_skip_64x32_neon
void aom_highbd_sad_skip_64x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_64x32x4d aom_highbd_sad_skip_64x32x4d_c
+void aom_highbd_sad_skip_64x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_64x32x4d aom_highbd_sad_skip_64x32x4d_neon
unsigned int aom_highbd_sad_skip_64x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_64x64 aom_highbd_sad_skip_64x64_c
+unsigned int aom_highbd_sad_skip_64x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_64x64 aom_highbd_sad_skip_64x64_neon
void aom_highbd_sad_skip_64x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_64x64x4d aom_highbd_sad_skip_64x64x4d_c
+void aom_highbd_sad_skip_64x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_64x64x4d aom_highbd_sad_skip_64x64x4d_neon
unsigned int aom_highbd_sad_skip_8x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_8x16 aom_highbd_sad_skip_8x16_c
+unsigned int aom_highbd_sad_skip_8x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_8x16 aom_highbd_sad_skip_8x16_neon
void aom_highbd_sad_skip_8x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_8x16x4d aom_highbd_sad_skip_8x16x4d_c
+void aom_highbd_sad_skip_8x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_8x16x4d aom_highbd_sad_skip_8x16x4d_neon
unsigned int aom_highbd_sad_skip_8x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_8x32 aom_highbd_sad_skip_8x32_c
+unsigned int aom_highbd_sad_skip_8x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_8x32 aom_highbd_sad_skip_8x32_neon
void aom_highbd_sad_skip_8x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_8x32x4d aom_highbd_sad_skip_8x32x4d_c
+void aom_highbd_sad_skip_8x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_8x32x4d aom_highbd_sad_skip_8x32x4d_neon
unsigned int aom_highbd_sad_skip_8x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_8x4 aom_highbd_sad_skip_8x4_c
+unsigned int aom_highbd_sad_skip_8x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_8x4 aom_highbd_sad_skip_8x4_neon
void aom_highbd_sad_skip_8x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_8x4x4d aom_highbd_sad_skip_8x4x4d_c
+void aom_highbd_sad_skip_8x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_8x4x4d aom_highbd_sad_skip_8x4x4d_neon
unsigned int aom_highbd_sad_skip_8x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_8x8 aom_highbd_sad_skip_8x8_c
+unsigned int aom_highbd_sad_skip_8x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_8x8 aom_highbd_sad_skip_8x8_neon
void aom_highbd_sad_skip_8x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_8x8x4d aom_highbd_sad_skip_8x8x4d_c
+void aom_highbd_sad_skip_8x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_8x8x4d aom_highbd_sad_skip_8x8x4d_neon
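/* A minimal sketch of the reduction the *_skip kernels above perform,
 * assuming the usual libaom convention: SAD is taken over every other row
 * and the result is doubled, trading accuracy for speed in coarse motion
 * search. Helper name is hypothetical. */
static unsigned int sketch_sad_skip_wxh(const uint8_t *src, int src_stride,
                                        const uint8_t *ref, int ref_stride,
                                        int w, int h) {
  unsigned int sad = 0;
  for (int y = 0; y < h; y += 2) /* even rows only */
    for (int x = 0; x < w; x++)
      sad += (unsigned int)abs(src[y * src_stride + x] -
                               ref[y * ref_stride + x]);
  return 2 * sad; /* scale back to the full-height range */
}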
void aom_highbd_smooth_h_predictor_16x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
void aom_highbd_smooth_h_predictor_16x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
@@ -3610,205 +3897,272 @@
#define aom_lpf_vertical_8_quad aom_lpf_vertical_8_quad_neon
unsigned int aom_masked_sad128x128_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad128x128 aom_masked_sad128x128_c
+unsigned int aom_masked_sad128x128_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad128x128 aom_masked_sad128x128_neon
void aom_masked_sad128x128x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad128x128x4d aom_masked_sad128x128x4d_c
+void aom_masked_sad128x128x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad128x128x4d aom_masked_sad128x128x4d_neon
unsigned int aom_masked_sad128x64_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad128x64 aom_masked_sad128x64_c
+unsigned int aom_masked_sad128x64_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad128x64 aom_masked_sad128x64_neon
void aom_masked_sad128x64x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad128x64x4d aom_masked_sad128x64x4d_c
+void aom_masked_sad128x64x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad128x64x4d aom_masked_sad128x64x4d_neon
unsigned int aom_masked_sad16x16_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad16x16 aom_masked_sad16x16_c
+unsigned int aom_masked_sad16x16_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad16x16 aom_masked_sad16x16_neon
void aom_masked_sad16x16x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad16x16x4d aom_masked_sad16x16x4d_c
+void aom_masked_sad16x16x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad16x16x4d aom_masked_sad16x16x4d_neon
unsigned int aom_masked_sad16x32_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad16x32 aom_masked_sad16x32_c
+unsigned int aom_masked_sad16x32_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad16x32 aom_masked_sad16x32_neon
void aom_masked_sad16x32x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad16x32x4d aom_masked_sad16x32x4d_c
+void aom_masked_sad16x32x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad16x32x4d aom_masked_sad16x32x4d_neon
unsigned int aom_masked_sad16x4_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad16x4 aom_masked_sad16x4_c
+unsigned int aom_masked_sad16x4_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad16x4 aom_masked_sad16x4_neon
void aom_masked_sad16x4x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad16x4x4d aom_masked_sad16x4x4d_c
+void aom_masked_sad16x4x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad16x4x4d aom_masked_sad16x4x4d_neon
unsigned int aom_masked_sad16x64_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad16x64 aom_masked_sad16x64_c
+unsigned int aom_masked_sad16x64_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad16x64 aom_masked_sad16x64_neon
void aom_masked_sad16x64x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad16x64x4d aom_masked_sad16x64x4d_c
+void aom_masked_sad16x64x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad16x64x4d aom_masked_sad16x64x4d_neon
unsigned int aom_masked_sad16x8_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad16x8 aom_masked_sad16x8_c
+unsigned int aom_masked_sad16x8_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad16x8 aom_masked_sad16x8_neon
void aom_masked_sad16x8x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad16x8x4d aom_masked_sad16x8x4d_c
+void aom_masked_sad16x8x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad16x8x4d aom_masked_sad16x8x4d_neon
unsigned int aom_masked_sad32x16_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad32x16 aom_masked_sad32x16_c
+unsigned int aom_masked_sad32x16_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad32x16 aom_masked_sad32x16_neon
void aom_masked_sad32x16x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad32x16x4d aom_masked_sad32x16x4d_c
+void aom_masked_sad32x16x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad32x16x4d aom_masked_sad32x16x4d_neon
unsigned int aom_masked_sad32x32_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad32x32 aom_masked_sad32x32_c
+unsigned int aom_masked_sad32x32_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad32x32 aom_masked_sad32x32_neon
void aom_masked_sad32x32x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad32x32x4d aom_masked_sad32x32x4d_c
+void aom_masked_sad32x32x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad32x32x4d aom_masked_sad32x32x4d_neon
unsigned int aom_masked_sad32x64_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad32x64 aom_masked_sad32x64_c
+unsigned int aom_masked_sad32x64_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad32x64 aom_masked_sad32x64_neon
void aom_masked_sad32x64x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad32x64x4d aom_masked_sad32x64x4d_c
+void aom_masked_sad32x64x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad32x64x4d aom_masked_sad32x64x4d_neon
unsigned int aom_masked_sad32x8_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad32x8 aom_masked_sad32x8_c
+unsigned int aom_masked_sad32x8_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad32x8 aom_masked_sad32x8_neon
void aom_masked_sad32x8x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad32x8x4d aom_masked_sad32x8x4d_c
+void aom_masked_sad32x8x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad32x8x4d aom_masked_sad32x8x4d_neon
unsigned int aom_masked_sad4x16_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad4x16 aom_masked_sad4x16_c
+unsigned int aom_masked_sad4x16_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad4x16 aom_masked_sad4x16_neon
void aom_masked_sad4x16x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad4x16x4d aom_masked_sad4x16x4d_c
+void aom_masked_sad4x16x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad4x16x4d aom_masked_sad4x16x4d_neon
unsigned int aom_masked_sad4x4_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad4x4 aom_masked_sad4x4_c
+unsigned int aom_masked_sad4x4_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad4x4 aom_masked_sad4x4_neon
void aom_masked_sad4x4x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad4x4x4d aom_masked_sad4x4x4d_c
+void aom_masked_sad4x4x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad4x4x4d aom_masked_sad4x4x4d_neon
unsigned int aom_masked_sad4x8_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad4x8 aom_masked_sad4x8_c
+unsigned int aom_masked_sad4x8_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad4x8 aom_masked_sad4x8_neon
void aom_masked_sad4x8x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad4x8x4d aom_masked_sad4x8x4d_c
+void aom_masked_sad4x8x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad4x8x4d aom_masked_sad4x8x4d_neon
unsigned int aom_masked_sad64x128_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad64x128 aom_masked_sad64x128_c
+unsigned int aom_masked_sad64x128_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad64x128 aom_masked_sad64x128_neon
void aom_masked_sad64x128x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad64x128x4d aom_masked_sad64x128x4d_c
+void aom_masked_sad64x128x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad64x128x4d aom_masked_sad64x128x4d_neon
unsigned int aom_masked_sad64x16_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad64x16 aom_masked_sad64x16_c
+unsigned int aom_masked_sad64x16_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad64x16 aom_masked_sad64x16_neon
void aom_masked_sad64x16x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad64x16x4d aom_masked_sad64x16x4d_c
+void aom_masked_sad64x16x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad64x16x4d aom_masked_sad64x16x4d_neon
unsigned int aom_masked_sad64x32_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad64x32 aom_masked_sad64x32_c
+unsigned int aom_masked_sad64x32_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad64x32 aom_masked_sad64x32_neon
void aom_masked_sad64x32x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad64x32x4d aom_masked_sad64x32x4d_c
+void aom_masked_sad64x32x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad64x32x4d aom_masked_sad64x32x4d_neon
unsigned int aom_masked_sad64x64_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad64x64 aom_masked_sad64x64_c
+unsigned int aom_masked_sad64x64_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad64x64 aom_masked_sad64x64_neon
void aom_masked_sad64x64x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad64x64x4d aom_masked_sad64x64x4d_c
+void aom_masked_sad64x64x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad64x64x4d aom_masked_sad64x64x4d_neon
unsigned int aom_masked_sad8x16_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad8x16 aom_masked_sad8x16_c
+unsigned int aom_masked_sad8x16_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad8x16 aom_masked_sad8x16_neon
void aom_masked_sad8x16x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad8x16x4d aom_masked_sad8x16x4d_c
+void aom_masked_sad8x16x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad8x16x4d aom_masked_sad8x16x4d_neon
unsigned int aom_masked_sad8x32_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad8x32 aom_masked_sad8x32_c
+unsigned int aom_masked_sad8x32_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad8x32 aom_masked_sad8x32_neon
void aom_masked_sad8x32x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad8x32x4d aom_masked_sad8x32x4d_c
+void aom_masked_sad8x32x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad8x32x4d aom_masked_sad8x32x4d_neon
unsigned int aom_masked_sad8x4_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad8x4 aom_masked_sad8x4_c
+unsigned int aom_masked_sad8x4_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad8x4 aom_masked_sad8x4_neon
void aom_masked_sad8x4x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad8x4x4d aom_masked_sad8x4x4d_c
+void aom_masked_sad8x4x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad8x4x4d aom_masked_sad8x4x4d_neon
unsigned int aom_masked_sad8x8_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad8x8 aom_masked_sad8x8_c
+unsigned int aom_masked_sad8x8_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad8x8 aom_masked_sad8x8_neon
void aom_masked_sad8x8x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad8x8x4d aom_masked_sad8x8x4d_c
+void aom_masked_sad8x8x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad8x8x4d aom_masked_sad8x8x4d_neon
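/* A sketch of the masked SAD these kernels compute, assuming the
 * AOM_BLEND_A64 convention (6-bit mask weights in 0..64): ref and
 * second_pred are blended under msk (invert_mask swaps their roles), and
 * the blend is compared against src. second_pred is a contiguous wxh block.
 * Helper name is hypothetical. */
static unsigned int sketch_masked_sad_wxh(const uint8_t *src, int src_stride,
                                          const uint8_t *ref, int ref_stride,
                                          const uint8_t *second_pred,
                                          const uint8_t *msk, int msk_stride,
                                          int invert_mask, int w, int h) {
  unsigned int sad = 0;
  for (int y = 0; y < h; y++)
    for (int x = 0; x < w; x++) {
      const int m = msk[y * msk_stride + x]; /* 0..64 */
      const int a = ref[y * ref_stride + x];
      const int b = second_pred[y * w + x];
      const int blend = invert_mask ? ((64 - m) * a + m * b + 32) >> 6
                                    : (m * a + (64 - m) * b + 32) >> 6;
      sad += (unsigned int)abs(src[y * src_stride + x] - blend);
    }
  return sad;
}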
unsigned int aom_masked_sub_pixel_variance128x128_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance128x128 aom_masked_sub_pixel_variance128x128_c
+unsigned int aom_masked_sub_pixel_variance128x128_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance128x128 aom_masked_sub_pixel_variance128x128_neon
unsigned int aom_masked_sub_pixel_variance128x64_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance128x64 aom_masked_sub_pixel_variance128x64_c
+unsigned int aom_masked_sub_pixel_variance128x64_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance128x64 aom_masked_sub_pixel_variance128x64_neon
unsigned int aom_masked_sub_pixel_variance16x16_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance16x16 aom_masked_sub_pixel_variance16x16_c
+unsigned int aom_masked_sub_pixel_variance16x16_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance16x16 aom_masked_sub_pixel_variance16x16_neon
unsigned int aom_masked_sub_pixel_variance16x32_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance16x32 aom_masked_sub_pixel_variance16x32_c
+unsigned int aom_masked_sub_pixel_variance16x32_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance16x32 aom_masked_sub_pixel_variance16x32_neon
unsigned int aom_masked_sub_pixel_variance16x4_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance16x4 aom_masked_sub_pixel_variance16x4_c
+unsigned int aom_masked_sub_pixel_variance16x4_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance16x4 aom_masked_sub_pixel_variance16x4_neon
unsigned int aom_masked_sub_pixel_variance16x64_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance16x64 aom_masked_sub_pixel_variance16x64_c
+unsigned int aom_masked_sub_pixel_variance16x64_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance16x64 aom_masked_sub_pixel_variance16x64_neon
unsigned int aom_masked_sub_pixel_variance16x8_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance16x8 aom_masked_sub_pixel_variance16x8_c
+unsigned int aom_masked_sub_pixel_variance16x8_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance16x8 aom_masked_sub_pixel_variance16x8_neon
unsigned int aom_masked_sub_pixel_variance32x16_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance32x16 aom_masked_sub_pixel_variance32x16_c
+unsigned int aom_masked_sub_pixel_variance32x16_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance32x16 aom_masked_sub_pixel_variance32x16_neon
unsigned int aom_masked_sub_pixel_variance32x32_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance32x32 aom_masked_sub_pixel_variance32x32_c
+unsigned int aom_masked_sub_pixel_variance32x32_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance32x32 aom_masked_sub_pixel_variance32x32_neon
unsigned int aom_masked_sub_pixel_variance32x64_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance32x64 aom_masked_sub_pixel_variance32x64_c
+unsigned int aom_masked_sub_pixel_variance32x64_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance32x64 aom_masked_sub_pixel_variance32x64_neon
unsigned int aom_masked_sub_pixel_variance32x8_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance32x8 aom_masked_sub_pixel_variance32x8_c
+unsigned int aom_masked_sub_pixel_variance32x8_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance32x8 aom_masked_sub_pixel_variance32x8_neon
unsigned int aom_masked_sub_pixel_variance4x16_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance4x16 aom_masked_sub_pixel_variance4x16_c
+unsigned int aom_masked_sub_pixel_variance4x16_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance4x16 aom_masked_sub_pixel_variance4x16_neon
unsigned int aom_masked_sub_pixel_variance4x4_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance4x4 aom_masked_sub_pixel_variance4x4_c
+unsigned int aom_masked_sub_pixel_variance4x4_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance4x4 aom_masked_sub_pixel_variance4x4_neon
unsigned int aom_masked_sub_pixel_variance4x8_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance4x8 aom_masked_sub_pixel_variance4x8_c
+unsigned int aom_masked_sub_pixel_variance4x8_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance4x8 aom_masked_sub_pixel_variance4x8_neon
unsigned int aom_masked_sub_pixel_variance64x128_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance64x128 aom_masked_sub_pixel_variance64x128_c
+unsigned int aom_masked_sub_pixel_variance64x128_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance64x128 aom_masked_sub_pixel_variance64x128_neon
unsigned int aom_masked_sub_pixel_variance64x16_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance64x16 aom_masked_sub_pixel_variance64x16_c
+unsigned int aom_masked_sub_pixel_variance64x16_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance64x16 aom_masked_sub_pixel_variance64x16_neon
unsigned int aom_masked_sub_pixel_variance64x32_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance64x32 aom_masked_sub_pixel_variance64x32_c
+unsigned int aom_masked_sub_pixel_variance64x32_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance64x32 aom_masked_sub_pixel_variance64x32_neon
unsigned int aom_masked_sub_pixel_variance64x64_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance64x64 aom_masked_sub_pixel_variance64x64_c
+unsigned int aom_masked_sub_pixel_variance64x64_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance64x64 aom_masked_sub_pixel_variance64x64_neon
unsigned int aom_masked_sub_pixel_variance8x16_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance8x16 aom_masked_sub_pixel_variance8x16_c
+unsigned int aom_masked_sub_pixel_variance8x16_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance8x16 aom_masked_sub_pixel_variance8x16_neon
unsigned int aom_masked_sub_pixel_variance8x32_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance8x32 aom_masked_sub_pixel_variance8x32_c
+unsigned int aom_masked_sub_pixel_variance8x32_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance8x32 aom_masked_sub_pixel_variance8x32_neon
unsigned int aom_masked_sub_pixel_variance8x4_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance8x4 aom_masked_sub_pixel_variance8x4_c
+unsigned int aom_masked_sub_pixel_variance8x4_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance8x4 aom_masked_sub_pixel_variance8x4_neon
unsigned int aom_masked_sub_pixel_variance8x8_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance8x8 aom_masked_sub_pixel_variance8x8_c
+unsigned int aom_masked_sub_pixel_variance8x8_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance8x8 aom_masked_sub_pixel_variance8x8_neon
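/* The masked sub-pixel variance kernels above bilinearly filter ref to the
 * (xoffset, yoffset) sub-pel position, blend with second_pred under the
 * mask as sketched earlier, and then apply the standard variance reduction
 * below (hypothetical helper, not the libaom source): */
static unsigned int sketch_variance_wxh(const uint8_t *a, int a_stride,
                                        const uint8_t *b, int b_stride,
                                        int w, int h, unsigned int *sse) {
  int64_t sum = 0;
  uint64_t sq = 0;
  for (int y = 0; y < h; y++)
    for (int x = 0; x < w; x++) {
      const int d = a[y * a_stride + x] - b[y * b_stride + x];
      sum += d;
      sq += (uint64_t)(d * d);
    }
  *sse = (unsigned int)sq;
  /* variance = sum of squares minus the squared mean term */
  return (unsigned int)(sq - (uint64_t)((sum * sum) / (w * h)));
}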
void aom_minmax_8x8_c(const uint8_t *s, int p, const uint8_t *d, int dp, int *min, int *max);
-#define aom_minmax_8x8 aom_minmax_8x8_c
+void aom_minmax_8x8_neon(const uint8_t *s, int p, const uint8_t *d, int dp, int *min, int *max);
+#define aom_minmax_8x8 aom_minmax_8x8_neon
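/* aom_minmax_8x8 scans an 8x8 block pair and reports the smallest and
 * largest absolute sample difference; a minimal sketch of that reduction
 * (hypothetical helper; assumes <stdlib.h> for abs): */
static void sketch_minmax_8x8(const uint8_t *s, int p, const uint8_t *d,
                              int dp, int *min, int *max) {
  *min = 255;
  *max = 0;
  for (int y = 0; y < 8; y++)
    for (int x = 0; x < 8; x++) {
      const int diff = abs(s[y * p + x] - d[y * dp + x]);
      if (diff < *min) *min = diff;
      if (diff > *max) *max = diff;
    }
}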
unsigned int aom_mse16x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
unsigned int aom_mse16x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
@@ -3837,202 +4191,268 @@
#define aom_mse_wxh_16bit_highbd aom_mse_wxh_16bit_highbd_c
unsigned int aom_obmc_sad128x128_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad128x128 aom_obmc_sad128x128_c
+unsigned int aom_obmc_sad128x128_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad128x128 aom_obmc_sad128x128_neon
unsigned int aom_obmc_sad128x64_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad128x64 aom_obmc_sad128x64_c
+unsigned int aom_obmc_sad128x64_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad128x64 aom_obmc_sad128x64_neon
unsigned int aom_obmc_sad16x16_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad16x16 aom_obmc_sad16x16_c
+unsigned int aom_obmc_sad16x16_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad16x16 aom_obmc_sad16x16_neon
unsigned int aom_obmc_sad16x32_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad16x32 aom_obmc_sad16x32_c
+unsigned int aom_obmc_sad16x32_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad16x32 aom_obmc_sad16x32_neon
unsigned int aom_obmc_sad16x4_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad16x4 aom_obmc_sad16x4_c
+unsigned int aom_obmc_sad16x4_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad16x4 aom_obmc_sad16x4_neon
unsigned int aom_obmc_sad16x64_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad16x64 aom_obmc_sad16x64_c
+unsigned int aom_obmc_sad16x64_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad16x64 aom_obmc_sad16x64_neon
unsigned int aom_obmc_sad16x8_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad16x8 aom_obmc_sad16x8_c
+unsigned int aom_obmc_sad16x8_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad16x8 aom_obmc_sad16x8_neon
unsigned int aom_obmc_sad32x16_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad32x16 aom_obmc_sad32x16_c
+unsigned int aom_obmc_sad32x16_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad32x16 aom_obmc_sad32x16_neon
unsigned int aom_obmc_sad32x32_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad32x32 aom_obmc_sad32x32_c
+unsigned int aom_obmc_sad32x32_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad32x32 aom_obmc_sad32x32_neon
unsigned int aom_obmc_sad32x64_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad32x64 aom_obmc_sad32x64_c
+unsigned int aom_obmc_sad32x64_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad32x64 aom_obmc_sad32x64_neon
unsigned int aom_obmc_sad32x8_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad32x8 aom_obmc_sad32x8_c
+unsigned int aom_obmc_sad32x8_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad32x8 aom_obmc_sad32x8_neon
unsigned int aom_obmc_sad4x16_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad4x16 aom_obmc_sad4x16_c
+unsigned int aom_obmc_sad4x16_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad4x16 aom_obmc_sad4x16_neon
unsigned int aom_obmc_sad4x4_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad4x4 aom_obmc_sad4x4_c
+unsigned int aom_obmc_sad4x4_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad4x4 aom_obmc_sad4x4_neon
unsigned int aom_obmc_sad4x8_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad4x8 aom_obmc_sad4x8_c
+unsigned int aom_obmc_sad4x8_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad4x8 aom_obmc_sad4x8_neon
unsigned int aom_obmc_sad64x128_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad64x128 aom_obmc_sad64x128_c
+unsigned int aom_obmc_sad64x128_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad64x128 aom_obmc_sad64x128_neon
unsigned int aom_obmc_sad64x16_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad64x16 aom_obmc_sad64x16_c
+unsigned int aom_obmc_sad64x16_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad64x16 aom_obmc_sad64x16_neon
unsigned int aom_obmc_sad64x32_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad64x32 aom_obmc_sad64x32_c
+unsigned int aom_obmc_sad64x32_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad64x32 aom_obmc_sad64x32_neon
unsigned int aom_obmc_sad64x64_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad64x64 aom_obmc_sad64x64_c
+unsigned int aom_obmc_sad64x64_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad64x64 aom_obmc_sad64x64_neon
unsigned int aom_obmc_sad8x16_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad8x16 aom_obmc_sad8x16_c
+unsigned int aom_obmc_sad8x16_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad8x16 aom_obmc_sad8x16_neon
unsigned int aom_obmc_sad8x32_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad8x32 aom_obmc_sad8x32_c
+unsigned int aom_obmc_sad8x32_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad8x32 aom_obmc_sad8x32_neon
unsigned int aom_obmc_sad8x4_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad8x4 aom_obmc_sad8x4_c
+unsigned int aom_obmc_sad8x4_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad8x4 aom_obmc_sad8x4_neon
unsigned int aom_obmc_sad8x8_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad8x8 aom_obmc_sad8x8_c
+unsigned int aom_obmc_sad8x8_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad8x8 aom_obmc_sad8x8_neon
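The aom_obmc_sad* kernels gaining NEON versions above compare a weighted source against a masked predictor rather than two plain pixel buffers. The following is a from-memory paraphrase of the C reference behavior, assuming the usual 12-bit mask precision; it is a sketch, not a verbatim copy of libaom's code:

#include <stdint.h>
#include <stdlib.h> /* abs() */

/* Round-shift by n with rounding to nearest (libaom-style macro). */
#define RPOT(v, n) (((v) + (1 << ((n) - 1))) >> (n))

static unsigned int obmc_sad_sketch(const uint8_t *pre, int pre_stride,
                                    const int32_t *wsrc, const int32_t *mask,
                                    int w, int h) {
  unsigned int sad = 0;
  for (int y = 0; y < h; y++) {
    for (int x = 0; x < w; x++)
      /* wsrc and mask are pre-scaled; shift back by the mask precision. */
      sad += RPOT(abs(wsrc[x] - pre[x] * mask[x]), 12);
    pre += pre_stride;
    wsrc += w; /* wsrc/mask are densely packed at block width */
    mask += w;
  }
  return sad;
}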
unsigned int aom_obmc_sub_pixel_variance128x128_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance128x128 aom_obmc_sub_pixel_variance128x128_c
+unsigned int aom_obmc_sub_pixel_variance128x128_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance128x128 aom_obmc_sub_pixel_variance128x128_neon
unsigned int aom_obmc_sub_pixel_variance128x64_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance128x64 aom_obmc_sub_pixel_variance128x64_c
+unsigned int aom_obmc_sub_pixel_variance128x64_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance128x64 aom_obmc_sub_pixel_variance128x64_neon
unsigned int aom_obmc_sub_pixel_variance16x16_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance16x16 aom_obmc_sub_pixel_variance16x16_c
+unsigned int aom_obmc_sub_pixel_variance16x16_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance16x16 aom_obmc_sub_pixel_variance16x16_neon
unsigned int aom_obmc_sub_pixel_variance16x32_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance16x32 aom_obmc_sub_pixel_variance16x32_c
+unsigned int aom_obmc_sub_pixel_variance16x32_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance16x32 aom_obmc_sub_pixel_variance16x32_neon
unsigned int aom_obmc_sub_pixel_variance16x4_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance16x4 aom_obmc_sub_pixel_variance16x4_c
+unsigned int aom_obmc_sub_pixel_variance16x4_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance16x4 aom_obmc_sub_pixel_variance16x4_neon
unsigned int aom_obmc_sub_pixel_variance16x64_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance16x64 aom_obmc_sub_pixel_variance16x64_c
+unsigned int aom_obmc_sub_pixel_variance16x64_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance16x64 aom_obmc_sub_pixel_variance16x64_neon
unsigned int aom_obmc_sub_pixel_variance16x8_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance16x8 aom_obmc_sub_pixel_variance16x8_c
+unsigned int aom_obmc_sub_pixel_variance16x8_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance16x8 aom_obmc_sub_pixel_variance16x8_neon
unsigned int aom_obmc_sub_pixel_variance32x16_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance32x16 aom_obmc_sub_pixel_variance32x16_c
+unsigned int aom_obmc_sub_pixel_variance32x16_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance32x16 aom_obmc_sub_pixel_variance32x16_neon
unsigned int aom_obmc_sub_pixel_variance32x32_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance32x32 aom_obmc_sub_pixel_variance32x32_c
+unsigned int aom_obmc_sub_pixel_variance32x32_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance32x32 aom_obmc_sub_pixel_variance32x32_neon
unsigned int aom_obmc_sub_pixel_variance32x64_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance32x64 aom_obmc_sub_pixel_variance32x64_c
+unsigned int aom_obmc_sub_pixel_variance32x64_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance32x64 aom_obmc_sub_pixel_variance32x64_neon
unsigned int aom_obmc_sub_pixel_variance32x8_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance32x8 aom_obmc_sub_pixel_variance32x8_c
+unsigned int aom_obmc_sub_pixel_variance32x8_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance32x8 aom_obmc_sub_pixel_variance32x8_neon
unsigned int aom_obmc_sub_pixel_variance4x16_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance4x16 aom_obmc_sub_pixel_variance4x16_c
+unsigned int aom_obmc_sub_pixel_variance4x16_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance4x16 aom_obmc_sub_pixel_variance4x16_neon
unsigned int aom_obmc_sub_pixel_variance4x4_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance4x4 aom_obmc_sub_pixel_variance4x4_c
+unsigned int aom_obmc_sub_pixel_variance4x4_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance4x4 aom_obmc_sub_pixel_variance4x4_neon
unsigned int aom_obmc_sub_pixel_variance4x8_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance4x8 aom_obmc_sub_pixel_variance4x8_c
+unsigned int aom_obmc_sub_pixel_variance4x8_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance4x8 aom_obmc_sub_pixel_variance4x8_neon
unsigned int aom_obmc_sub_pixel_variance64x128_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance64x128 aom_obmc_sub_pixel_variance64x128_c
+unsigned int aom_obmc_sub_pixel_variance64x128_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance64x128 aom_obmc_sub_pixel_variance64x128_neon
unsigned int aom_obmc_sub_pixel_variance64x16_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance64x16 aom_obmc_sub_pixel_variance64x16_c
+unsigned int aom_obmc_sub_pixel_variance64x16_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance64x16 aom_obmc_sub_pixel_variance64x16_neon
unsigned int aom_obmc_sub_pixel_variance64x32_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance64x32 aom_obmc_sub_pixel_variance64x32_c
+unsigned int aom_obmc_sub_pixel_variance64x32_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance64x32 aom_obmc_sub_pixel_variance64x32_neon
unsigned int aom_obmc_sub_pixel_variance64x64_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance64x64 aom_obmc_sub_pixel_variance64x64_c
+unsigned int aom_obmc_sub_pixel_variance64x64_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance64x64 aom_obmc_sub_pixel_variance64x64_neon
unsigned int aom_obmc_sub_pixel_variance8x16_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance8x16 aom_obmc_sub_pixel_variance8x16_c
+unsigned int aom_obmc_sub_pixel_variance8x16_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance8x16 aom_obmc_sub_pixel_variance8x16_neon
unsigned int aom_obmc_sub_pixel_variance8x32_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance8x32 aom_obmc_sub_pixel_variance8x32_c
+unsigned int aom_obmc_sub_pixel_variance8x32_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance8x32 aom_obmc_sub_pixel_variance8x32_neon
unsigned int aom_obmc_sub_pixel_variance8x4_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance8x4 aom_obmc_sub_pixel_variance8x4_c
+unsigned int aom_obmc_sub_pixel_variance8x4_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance8x4 aom_obmc_sub_pixel_variance8x4_neon
unsigned int aom_obmc_sub_pixel_variance8x8_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance8x8 aom_obmc_sub_pixel_variance8x8_c
+unsigned int aom_obmc_sub_pixel_variance8x8_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance8x8 aom_obmc_sub_pixel_variance8x8_neon
unsigned int aom_obmc_variance128x128_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance128x128 aom_obmc_variance128x128_c
+unsigned int aom_obmc_variance128x128_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance128x128 aom_obmc_variance128x128_neon
unsigned int aom_obmc_variance128x64_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance128x64 aom_obmc_variance128x64_c
+unsigned int aom_obmc_variance128x64_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance128x64 aom_obmc_variance128x64_neon
unsigned int aom_obmc_variance16x16_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance16x16 aom_obmc_variance16x16_c
+unsigned int aom_obmc_variance16x16_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance16x16 aom_obmc_variance16x16_neon
unsigned int aom_obmc_variance16x32_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance16x32 aom_obmc_variance16x32_c
+unsigned int aom_obmc_variance16x32_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance16x32 aom_obmc_variance16x32_neon
unsigned int aom_obmc_variance16x4_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance16x4 aom_obmc_variance16x4_c
+unsigned int aom_obmc_variance16x4_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance16x4 aom_obmc_variance16x4_neon
unsigned int aom_obmc_variance16x64_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance16x64 aom_obmc_variance16x64_c
+unsigned int aom_obmc_variance16x64_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance16x64 aom_obmc_variance16x64_neon
unsigned int aom_obmc_variance16x8_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance16x8 aom_obmc_variance16x8_c
+unsigned int aom_obmc_variance16x8_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance16x8 aom_obmc_variance16x8_neon
unsigned int aom_obmc_variance32x16_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance32x16 aom_obmc_variance32x16_c
+unsigned int aom_obmc_variance32x16_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance32x16 aom_obmc_variance32x16_neon
unsigned int aom_obmc_variance32x32_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance32x32 aom_obmc_variance32x32_c
+unsigned int aom_obmc_variance32x32_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance32x32 aom_obmc_variance32x32_neon
unsigned int aom_obmc_variance32x64_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance32x64 aom_obmc_variance32x64_c
+unsigned int aom_obmc_variance32x64_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance32x64 aom_obmc_variance32x64_neon
unsigned int aom_obmc_variance32x8_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance32x8 aom_obmc_variance32x8_c
+unsigned int aom_obmc_variance32x8_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance32x8 aom_obmc_variance32x8_neon
unsigned int aom_obmc_variance4x16_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance4x16 aom_obmc_variance4x16_c
+unsigned int aom_obmc_variance4x16_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance4x16 aom_obmc_variance4x16_neon
unsigned int aom_obmc_variance4x4_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance4x4 aom_obmc_variance4x4_c
+unsigned int aom_obmc_variance4x4_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance4x4 aom_obmc_variance4x4_neon
unsigned int aom_obmc_variance4x8_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance4x8 aom_obmc_variance4x8_c
+unsigned int aom_obmc_variance4x8_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance4x8 aom_obmc_variance4x8_neon
unsigned int aom_obmc_variance64x128_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance64x128 aom_obmc_variance64x128_c
+unsigned int aom_obmc_variance64x128_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance64x128 aom_obmc_variance64x128_neon
unsigned int aom_obmc_variance64x16_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance64x16 aom_obmc_variance64x16_c
+unsigned int aom_obmc_variance64x16_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance64x16 aom_obmc_variance64x16_neon
unsigned int aom_obmc_variance64x32_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance64x32 aom_obmc_variance64x32_c
+unsigned int aom_obmc_variance64x32_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance64x32 aom_obmc_variance64x32_neon
unsigned int aom_obmc_variance64x64_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance64x64 aom_obmc_variance64x64_c
+unsigned int aom_obmc_variance64x64_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance64x64 aom_obmc_variance64x64_neon
unsigned int aom_obmc_variance8x16_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance8x16 aom_obmc_variance8x16_c
+unsigned int aom_obmc_variance8x16_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance8x16 aom_obmc_variance8x16_neon
unsigned int aom_obmc_variance8x32_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance8x32 aom_obmc_variance8x32_c
+unsigned int aom_obmc_variance8x32_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance8x32 aom_obmc_variance8x32_neon
unsigned int aom_obmc_variance8x4_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance8x4 aom_obmc_variance8x4_c
+unsigned int aom_obmc_variance8x4_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance8x4 aom_obmc_variance8x4_neon
unsigned int aom_obmc_variance8x8_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance8x8 aom_obmc_variance8x8_c
+unsigned int aom_obmc_variance8x8_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance8x8 aom_obmc_variance8x8_neon
void aom_paeth_predictor_16x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_paeth_predictor_16x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
@@ -4110,9 +4530,6 @@
void aom_paeth_predictor_8x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_paeth_predictor_8x8 aom_paeth_predictor_8x8_neon
-void aom_pixel_scale_c(const int16_t *src_diff, ptrdiff_t src_stride, int16_t *coeff, int log_scale, int h8, int w8);
-#define aom_pixel_scale aom_pixel_scale_c
-
void aom_quantize_b_c(const tran_low_t *coeff_ptr, intptr_t n_coeffs, const int16_t *zbin_ptr, const int16_t *round_ptr, const int16_t *quant_ptr, const int16_t *quant_shift_ptr, tran_low_t *qcoeff_ptr, tran_low_t *dqcoeff_ptr, const int16_t *dequant_ptr, uint16_t *eob_ptr, const int16_t *scan, const int16_t *iscan);
void aom_quantize_b_neon(const tran_low_t *coeff_ptr, intptr_t n_coeffs, const int16_t *zbin_ptr, const int16_t *round_ptr, const int16_t *quant_ptr, const int16_t *quant_shift_ptr, tran_low_t *qcoeff_ptr, tran_low_t *dqcoeff_ptr, const int16_t *dequant_ptr, uint16_t *eob_ptr, const int16_t *scan, const int16_t *iscan);
#define aom_quantize_b aom_quantize_b_neon
@@ -4143,15 +4560,13 @@
#define aom_sad128x128_avg aom_sad128x128_avg_neon
void aom_sad128x128x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad128x128x3d aom_sad128x128x3d_c
+void aom_sad128x128x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad128x128x3d aom_sad128x128x3d_neon
void aom_sad128x128x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad128x128x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad128x128x4d aom_sad128x128x4d_neon
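The x4d kernels being switched to NEON here evaluate one source block against four candidate references in a single call, which is cheaper than four independent SAD calls during motion search; as I understand it, the x3d variants keep the same four-entry signature but evaluate three candidates. A hedged sketch of the x4d semantics, with illustrative names only:

#include <stdint.h>
#include <stdlib.h> /* abs() */

static void sad_x4d_sketch(const uint8_t *src, int src_stride,
                           const uint8_t *const ref[4], int ref_stride,
                           int w, int h, uint32_t out[4]) {
  for (int k = 0; k < 4; k++) { /* one SAD per candidate reference */
    uint32_t sad = 0;
    const uint8_t *s = src;
    const uint8_t *r = ref[k];
    for (int y = 0; y < h; y++) {
      for (int x = 0; x < w; x++) sad += abs(s[x] - r[x]);
      s += src_stride;
      r += ref_stride;
    }
    out[k] = sad;
  }
}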
-void aom_sad128x128x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad128x128x4d_avg aom_sad128x128x4d_avg_c
-
unsigned int aom_sad128x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad128x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad128x64 aom_sad128x64_neon
@@ -4161,18 +4576,13 @@
#define aom_sad128x64_avg aom_sad128x64_avg_neon
void aom_sad128x64x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad128x64x3d aom_sad128x64x3d_c
+void aom_sad128x64x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad128x64x3d aom_sad128x64x3d_neon
void aom_sad128x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad128x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad128x64x4d aom_sad128x64x4d_neon
-void aom_sad128x64x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad128x64x4d_avg aom_sad128x64x4d_avg_c
-
-unsigned int aom_sad128xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad128xh aom_sad128xh_c
-
unsigned int aom_sad16x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad16x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x16 aom_sad16x16_neon
@@ -4182,15 +4592,13 @@
#define aom_sad16x16_avg aom_sad16x16_avg_neon
void aom_sad16x16x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad16x16x3d aom_sad16x16x3d_c
+void aom_sad16x16x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad16x16x3d aom_sad16x16x3d_neon
void aom_sad16x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad16x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x16x4d aom_sad16x16x4d_neon
-void aom_sad16x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x16x4d_avg aom_sad16x16x4d_avg_c
-
unsigned int aom_sad16x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad16x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x32 aom_sad16x32_neon
@@ -4200,15 +4608,13 @@
#define aom_sad16x32_avg aom_sad16x32_avg_neon
void aom_sad16x32x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad16x32x3d aom_sad16x32x3d_c
+void aom_sad16x32x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad16x32x3d aom_sad16x32x3d_neon
void aom_sad16x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad16x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x32x4d aom_sad16x32x4d_neon
-void aom_sad16x32x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x32x4d_avg aom_sad16x32x4d_avg_c
-
unsigned int aom_sad16x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad16x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x4 aom_sad16x4_neon
@@ -4218,15 +4624,13 @@
#define aom_sad16x4_avg aom_sad16x4_avg_neon
void aom_sad16x4x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad16x4x3d aom_sad16x4x3d_c
+void aom_sad16x4x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad16x4x3d aom_sad16x4x3d_neon
void aom_sad16x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad16x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x4x4d aom_sad16x4x4d_neon
-void aom_sad16x4x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x4x4d_avg aom_sad16x4x4d_avg_c
-
unsigned int aom_sad16x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad16x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x64 aom_sad16x64_neon
@@ -4236,15 +4640,13 @@
#define aom_sad16x64_avg aom_sad16x64_avg_neon
void aom_sad16x64x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad16x64x3d aom_sad16x64x3d_c
+void aom_sad16x64x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad16x64x3d aom_sad16x64x3d_neon
void aom_sad16x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad16x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x64x4d aom_sad16x64x4d_neon
-void aom_sad16x64x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x64x4d_avg aom_sad16x64x4d_avg_c
-
unsigned int aom_sad16x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad16x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x8 aom_sad16x8_neon
@@ -4254,18 +4656,13 @@
#define aom_sad16x8_avg aom_sad16x8_avg_neon
void aom_sad16x8x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad16x8x3d aom_sad16x8x3d_c
+void aom_sad16x8x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad16x8x3d aom_sad16x8x3d_neon
void aom_sad16x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad16x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x8x4d aom_sad16x8x4d_neon
-void aom_sad16x8x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x8x4d_avg aom_sad16x8x4d_avg_c
-
-unsigned int aom_sad16xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad16xh aom_sad16xh_c
-
unsigned int aom_sad32x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad32x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad32x16 aom_sad32x16_neon
@@ -4275,15 +4672,13 @@
#define aom_sad32x16_avg aom_sad32x16_avg_neon
void aom_sad32x16x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad32x16x3d aom_sad32x16x3d_c
+void aom_sad32x16x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad32x16x3d aom_sad32x16x3d_neon
void aom_sad32x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad32x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad32x16x4d aom_sad32x16x4d_neon
-void aom_sad32x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad32x16x4d_avg aom_sad32x16x4d_avg_c
-
unsigned int aom_sad32x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad32x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad32x32 aom_sad32x32_neon
@@ -4293,15 +4688,13 @@
#define aom_sad32x32_avg aom_sad32x32_avg_neon
void aom_sad32x32x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad32x32x3d aom_sad32x32x3d_c
+void aom_sad32x32x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad32x32x3d aom_sad32x32x3d_neon
void aom_sad32x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad32x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad32x32x4d aom_sad32x32x4d_neon
-void aom_sad32x32x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad32x32x4d_avg aom_sad32x32x4d_avg_c
-
unsigned int aom_sad32x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad32x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad32x64 aom_sad32x64_neon
@@ -4311,15 +4704,13 @@
#define aom_sad32x64_avg aom_sad32x64_avg_neon
void aom_sad32x64x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad32x64x3d aom_sad32x64x3d_c
+void aom_sad32x64x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad32x64x3d aom_sad32x64x3d_neon
void aom_sad32x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad32x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad32x64x4d aom_sad32x64x4d_neon
-void aom_sad32x64x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad32x64x4d_avg aom_sad32x64x4d_avg_c
-
unsigned int aom_sad32x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad32x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad32x8 aom_sad32x8_neon
@@ -4329,18 +4720,13 @@
#define aom_sad32x8_avg aom_sad32x8_avg_neon
void aom_sad32x8x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad32x8x3d aom_sad32x8x3d_c
+void aom_sad32x8x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad32x8x3d aom_sad32x8x3d_neon
void aom_sad32x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad32x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad32x8x4d aom_sad32x8x4d_neon
-void aom_sad32x8x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad32x8x4d_avg aom_sad32x8x4d_avg_c
-
-unsigned int aom_sad32xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad32xh aom_sad32xh_c
-
unsigned int aom_sad4x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad4x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad4x16 aom_sad4x16_neon
@@ -4350,15 +4736,13 @@
#define aom_sad4x16_avg aom_sad4x16_avg_neon
void aom_sad4x16x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad4x16x3d aom_sad4x16x3d_c
+void aom_sad4x16x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad4x16x3d aom_sad4x16x3d_neon
void aom_sad4x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad4x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad4x16x4d aom_sad4x16x4d_neon
-void aom_sad4x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad4x16x4d_avg aom_sad4x16x4d_avg_c
-
unsigned int aom_sad4x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad4x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad4x4 aom_sad4x4_neon
@@ -4368,15 +4752,13 @@
#define aom_sad4x4_avg aom_sad4x4_avg_neon
void aom_sad4x4x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad4x4x3d aom_sad4x4x3d_c
+void aom_sad4x4x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad4x4x3d aom_sad4x4x3d_neon
void aom_sad4x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad4x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad4x4x4d aom_sad4x4x4d_neon
-void aom_sad4x4x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad4x4x4d_avg aom_sad4x4x4d_avg_c
-
unsigned int aom_sad4x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad4x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad4x8 aom_sad4x8_neon
@@ -4386,18 +4768,13 @@
#define aom_sad4x8_avg aom_sad4x8_avg_neon
void aom_sad4x8x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad4x8x3d aom_sad4x8x3d_c
+void aom_sad4x8x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad4x8x3d aom_sad4x8x3d_neon
void aom_sad4x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad4x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad4x8x4d aom_sad4x8x4d_neon
-void aom_sad4x8x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad4x8x4d_avg aom_sad4x8x4d_avg_c
-
-unsigned int aom_sad4xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad4xh aom_sad4xh_c
-
unsigned int aom_sad64x128_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad64x128_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad64x128 aom_sad64x128_neon
@@ -4407,15 +4784,13 @@
#define aom_sad64x128_avg aom_sad64x128_avg_neon
void aom_sad64x128x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad64x128x3d aom_sad64x128x3d_c
+void aom_sad64x128x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad64x128x3d aom_sad64x128x3d_neon
void aom_sad64x128x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad64x128x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad64x128x4d aom_sad64x128x4d_neon
-void aom_sad64x128x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad64x128x4d_avg aom_sad64x128x4d_avg_c
-
unsigned int aom_sad64x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad64x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad64x16 aom_sad64x16_neon
@@ -4425,15 +4800,13 @@
#define aom_sad64x16_avg aom_sad64x16_avg_neon
void aom_sad64x16x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad64x16x3d aom_sad64x16x3d_c
+void aom_sad64x16x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad64x16x3d aom_sad64x16x3d_neon
void aom_sad64x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad64x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad64x16x4d aom_sad64x16x4d_neon
-void aom_sad64x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad64x16x4d_avg aom_sad64x16x4d_avg_c
-
unsigned int aom_sad64x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad64x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad64x32 aom_sad64x32_neon
@@ -4443,15 +4816,13 @@
#define aom_sad64x32_avg aom_sad64x32_avg_neon
void aom_sad64x32x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad64x32x3d aom_sad64x32x3d_c
+void aom_sad64x32x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad64x32x3d aom_sad64x32x3d_neon
void aom_sad64x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad64x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad64x32x4d aom_sad64x32x4d_neon
-void aom_sad64x32x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad64x32x4d_avg aom_sad64x32x4d_avg_c
-
unsigned int aom_sad64x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad64x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad64x64 aom_sad64x64_neon
@@ -4461,18 +4832,13 @@
#define aom_sad64x64_avg aom_sad64x64_avg_neon
void aom_sad64x64x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad64x64x3d aom_sad64x64x3d_c
+void aom_sad64x64x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad64x64x3d aom_sad64x64x3d_neon
void aom_sad64x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad64x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad64x64x4d aom_sad64x64x4d_neon
-void aom_sad64x64x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad64x64x4d_avg aom_sad64x64x4d_avg_c
-
-unsigned int aom_sad64xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad64xh aom_sad64xh_c
-
unsigned int aom_sad8x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad8x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad8x16 aom_sad8x16_neon
@@ -4482,15 +4848,13 @@
#define aom_sad8x16_avg aom_sad8x16_avg_neon
void aom_sad8x16x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad8x16x3d aom_sad8x16x3d_c
+void aom_sad8x16x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad8x16x3d aom_sad8x16x3d_neon
void aom_sad8x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad8x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad8x16x4d aom_sad8x16x4d_neon
-void aom_sad8x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad8x16x4d_avg aom_sad8x16x4d_avg_c
-
unsigned int aom_sad8x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad8x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad8x32 aom_sad8x32_neon
@@ -4500,15 +4864,13 @@
#define aom_sad8x32_avg aom_sad8x32_avg_neon
void aom_sad8x32x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad8x32x3d aom_sad8x32x3d_c
+void aom_sad8x32x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad8x32x3d aom_sad8x32x3d_neon
void aom_sad8x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad8x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad8x32x4d aom_sad8x32x4d_neon
-void aom_sad8x32x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad8x32x4d_avg aom_sad8x32x4d_avg_c
-
unsigned int aom_sad8x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad8x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad8x4 aom_sad8x4_neon
@@ -4518,15 +4880,13 @@
#define aom_sad8x4_avg aom_sad8x4_avg_neon
void aom_sad8x4x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad8x4x3d aom_sad8x4x3d_c
+void aom_sad8x4x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad8x4x3d aom_sad8x4x3d_neon
void aom_sad8x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad8x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad8x4x4d aom_sad8x4x4d_neon
-void aom_sad8x4x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad8x4x4d_avg aom_sad8x4x4d_avg_c
-
unsigned int aom_sad8x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad8x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad8x8 aom_sad8x8_neon
@@ -4536,18 +4896,13 @@
#define aom_sad8x8_avg aom_sad8x8_avg_neon
void aom_sad8x8x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad8x8x3d aom_sad8x8x3d_c
+void aom_sad8x8x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad8x8x3d aom_sad8x8x3d_neon
void aom_sad8x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad8x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad8x8x4d aom_sad8x8x4d_neon
-void aom_sad8x8x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad8x8x4d_avg aom_sad8x8x4d_avg_c
-
-unsigned int aom_sad8xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad8xh aom_sad8xh_c
-
unsigned int aom_sad_skip_128x128_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad_skip_128x128_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad_skip_128x128 aom_sad_skip_128x128_neon
@@ -4581,10 +4936,12 @@
#define aom_sad_skip_16x32x4d aom_sad_skip_16x32x4d_neon
unsigned int aom_sad_skip_16x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_sad_skip_16x4 aom_sad_skip_16x4_c
+unsigned int aom_sad_skip_16x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_sad_skip_16x4 aom_sad_skip_16x4_neon
void aom_sad_skip_16x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad_skip_16x4x4d aom_sad_skip_16x4x4d_c
+void aom_sad_skip_16x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad_skip_16x4x4d aom_sad_skip_16x4x4d_neon
unsigned int aom_sad_skip_16x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad_skip_16x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
@@ -4643,10 +5000,12 @@
#define aom_sad_skip_4x16x4d aom_sad_skip_4x16x4d_neon
unsigned int aom_sad_skip_4x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_sad_skip_4x4 aom_sad_skip_4x4_c
+unsigned int aom_sad_skip_4x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_sad_skip_4x4 aom_sad_skip_4x4_neon
void aom_sad_skip_4x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad_skip_4x4x4d aom_sad_skip_4x4x4d_c
+void aom_sad_skip_4x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad_skip_4x4x4d aom_sad_skip_4x4x4d_neon
unsigned int aom_sad_skip_4x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad_skip_4x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
@@ -4705,10 +5064,12 @@
#define aom_sad_skip_8x32x4d aom_sad_skip_8x32x4d_neon
unsigned int aom_sad_skip_8x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_sad_skip_8x4 aom_sad_skip_8x4_c
+unsigned int aom_sad_skip_8x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_sad_skip_8x4 aom_sad_skip_8x4_neon
void aom_sad_skip_8x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad_skip_8x4x4d aom_sad_skip_8x4x4d_c
+void aom_sad_skip_8x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad_skip_8x4x4d aom_sad_skip_8x4x4d_neon
unsigned int aom_sad_skip_8x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad_skip_8x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
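The aom_sad_skip_* family filled in above trades accuracy for speed during coarse motion search: the SAD is computed on every other row and the result is doubled to approximate the full-height value. A minimal sketch of that idea, assuming this even-row/double-result convention; names are illustrative:

#include <stdint.h>
#include <stdlib.h> /* abs() */

static unsigned int sad_skip_sketch(const uint8_t *src, int src_stride,
                                    const uint8_t *ref, int ref_stride,
                                    int w, int h) {
  unsigned int sad = 0;
  for (int y = 0; y < h; y += 2) { /* visit even rows only */
    for (int x = 0; x < w; x++) sad += abs(src[x] - ref[x]);
    src += 2 * src_stride;
    ref += 2 * ref_stride;
  }
  return 2 * sad; /* scale back toward the full-height SAD */
}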
@@ -5150,7 +5511,8 @@
#define aom_sum_squares_2d_i16 aom_sum_squares_2d_i16_neon
uint64_t aom_sum_squares_i16_c(const int16_t *src, uint32_t N);
-#define aom_sum_squares_i16 aom_sum_squares_i16_c
+uint64_t aom_sum_squares_i16_neon(const int16_t *src, uint32_t N);
+#define aom_sum_squares_i16 aom_sum_squares_i16_neon
uint64_t aom_sum_sse_2d_i16_c(const int16_t *src, int src_stride, int width, int height, int *sum);
uint64_t aom_sum_sse_2d_i16_neon(const int16_t *src, int src_stride, int width, int height, int *sum);
@@ -5161,67 +5523,84 @@
#define aom_v_predictor_16x16 aom_v_predictor_16x16_neon
void aom_v_predictor_16x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_16x32 aom_v_predictor_16x32_c
+void aom_v_predictor_16x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_16x32 aom_v_predictor_16x32_neon
void aom_v_predictor_16x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_16x4 aom_v_predictor_16x4_c
+void aom_v_predictor_16x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_16x4 aom_v_predictor_16x4_neon
void aom_v_predictor_16x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_16x64 aom_v_predictor_16x64_c
+void aom_v_predictor_16x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_16x64 aom_v_predictor_16x64_neon
void aom_v_predictor_16x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_16x8 aom_v_predictor_16x8_c
+void aom_v_predictor_16x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_16x8 aom_v_predictor_16x8_neon
void aom_v_predictor_32x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_32x16 aom_v_predictor_32x16_c
+void aom_v_predictor_32x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_32x16 aom_v_predictor_32x16_neon
void aom_v_predictor_32x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_v_predictor_32x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_v_predictor_32x32 aom_v_predictor_32x32_neon
void aom_v_predictor_32x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_32x64 aom_v_predictor_32x64_c
+void aom_v_predictor_32x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_32x64 aom_v_predictor_32x64_neon
void aom_v_predictor_32x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_32x8 aom_v_predictor_32x8_c
+void aom_v_predictor_32x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_32x8 aom_v_predictor_32x8_neon
void aom_v_predictor_4x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_4x16 aom_v_predictor_4x16_c
+void aom_v_predictor_4x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_4x16 aom_v_predictor_4x16_neon
void aom_v_predictor_4x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_v_predictor_4x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_v_predictor_4x4 aom_v_predictor_4x4_neon
void aom_v_predictor_4x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_4x8 aom_v_predictor_4x8_c
+void aom_v_predictor_4x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_4x8 aom_v_predictor_4x8_neon
void aom_v_predictor_64x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_64x16 aom_v_predictor_64x16_c
+void aom_v_predictor_64x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_64x16 aom_v_predictor_64x16_neon
void aom_v_predictor_64x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_64x32 aom_v_predictor_64x32_c
+void aom_v_predictor_64x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_64x32 aom_v_predictor_64x32_neon
void aom_v_predictor_64x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_64x64 aom_v_predictor_64x64_c
+void aom_v_predictor_64x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_64x64 aom_v_predictor_64x64_neon
void aom_v_predictor_8x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_8x16 aom_v_predictor_8x16_c
+void aom_v_predictor_8x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_8x16 aom_v_predictor_8x16_neon
void aom_v_predictor_8x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_8x32 aom_v_predictor_8x32_c
+void aom_v_predictor_8x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_8x32 aom_v_predictor_8x32_neon
void aom_v_predictor_8x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_8x4 aom_v_predictor_8x4_c
+void aom_v_predictor_8x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_8x4 aom_v_predictor_8x4_neon
void aom_v_predictor_8x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_v_predictor_8x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_v_predictor_8x8 aom_v_predictor_8x8_neon
uint64_t aom_var_2d_u16_c(uint8_t *src, int src_stride, int width, int height);
-#define aom_var_2d_u16 aom_var_2d_u16_c
+uint64_t aom_var_2d_u16_neon(uint8_t *src, int src_stride, int width, int height);
+#define aom_var_2d_u16 aom_var_2d_u16_neon
uint64_t aom_var_2d_u8_c(uint8_t *src, int src_stride, int width, int height);
-#define aom_var_2d_u8 aom_var_2d_u8_c
+uint64_t aom_var_2d_u8_neon(uint8_t *src, int src_stride, int width, int height);
+#define aom_var_2d_u8 aom_var_2d_u8_neon
unsigned int aom_variance128x128_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
unsigned int aom_variance128x128_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -5324,7 +5703,7 @@
int aom_vector_var_neon(const int16_t *ref, const int16_t *src, int bwl);
#define aom_vector_var aom_vector_var_neon
-double av1_compute_cross_correlation_c(unsigned char *im1, int stride1, int x1, int y1, unsigned char *im2, int stride2, int x2, int y2);
+double av1_compute_cross_correlation_c(const unsigned char *frame1, int stride1, int x1, int y1, const unsigned char *frame2, int stride2, int x2, int y2);
#define av1_compute_cross_correlation av1_compute_cross_correlation_c
void aom_dsp_rtcd(void);
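
Every hunk above follows the same mechanical pattern: a dispatch macro that previously pointed at the C reference implementation gains a matching _neon prototype and is re-pointed at it. Because these generated configs set CONFIG_RUNTIME_CPU_DETECT to 0 (see the aom_config.h hunk further down), the binding is resolved entirely at compile time rather than through a function-pointer table. A minimal, self-contained sketch of the pattern, assuming hypothetical names rather than the real libaom API:

    /* sketch.c: illustrative only; the my_sad_* names are hypothetical. */
    #include <stdint.h>
    #include <stdio.h>

    static unsigned int my_sad_c(const uint8_t *src, const uint8_t *ref,
                                 int n) {
      unsigned int sad = 0;
      for (int i = 0; i < n; i++)
        sad += (unsigned int)(src[i] > ref[i] ? src[i] - ref[i]
                                              : ref[i] - src[i]);
      return sad;
    }

    /* Stand-in for a NEON-intrinsics implementation; real code would use
     * uint8x16_t vectors instead of delegating to the scalar version. */
    static unsigned int my_sad_neon(const uint8_t *src, const uint8_t *ref,
                                    int n) {
      return my_sad_c(src, ref, n);
    }

    /* With CONFIG_RUNTIME_CPU_DETECT == 0 the dispatcher is a plain macro,
     * exactly like "#define aom_sad_skip_16x4 aom_sad_skip_16x4_neon". */
    #define my_sad my_sad_neon

    int main(void) {
      const uint8_t a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1};
      printf("%u\n", my_sad(a, b, 4));
      return 0;
    }

With runtime detection compiled out there is nothing to patch at startup, so the macro indirection has zero run-time cost; callers compile directly against the NEON symbol.
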
diff --git a/config/arm/config/aom_scale_rtcd.h b/config/arm/config/aom_scale_rtcd.h
index df4b96f..d296957 100644
--- a/config/arm/config/aom_scale_rtcd.h
+++ b/config/arm/config/aom_scale_rtcd.h
@@ -80,7 +80,7 @@
void aom_yv12_partial_copy_y_c(const struct yv12_buffer_config *src_ybc, int hstart1, int hend1, int vstart1, int vend1, struct yv12_buffer_config *dst_ybc, int hstart2, int vstart2);
#define aom_yv12_partial_copy_y aom_yv12_partial_copy_y_c
-int aom_yv12_realloc_with_new_border_c(struct yv12_buffer_config *ybf, int new_border, int byte_alignment, int num_planes);
+int aom_yv12_realloc_with_new_border_c(struct yv12_buffer_config *ybf, int new_border, int byte_alignment, int num_pyramid_levels, int num_planes);
#define aom_yv12_realloc_with_new_border aom_yv12_realloc_with_new_border_c
void aom_scale_rtcd(void);
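
The one substantive change to aom_scale_rtcd.h is the widened prototype: aom_yv12_realloc_with_new_border now takes num_pyramid_levels between byte_alignment and num_planes, so any out-of-tree caller must pass the extra count. A hedged sketch of the call-site change; the stub exists only so the snippet compiles standalone and mirrors nothing but the argument order:

    #include <stdio.h>

    struct yv12_buffer_config;  /* opaque for this sketch */

    static int realloc_with_new_border_stub(struct yv12_buffer_config *ybf,
                                            int new_border, int byte_alignment,
                                            int num_pyramid_levels,
                                            int num_planes) {
      (void)ybf;
      printf("border=%d align=%d pyramid_levels=%d planes=%d\n", new_border,
             byte_alignment, num_pyramid_levels, num_planes);
      return 0;
    }

    int main(void) {
      /* num_pyramid_levels now sits between byte_alignment and num_planes;
       * passing 0 when no pyramid storage is needed is an assumption. */
      return realloc_with_new_border_stub(NULL, 288, 32, /*pyramid=*/0, 3);
    }
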
diff --git a/config/arm/config/av1_rtcd.h b/config/arm/config/av1_rtcd.h
index 964bb72..1a3fa19 100644
--- a/config/arm/config/av1_rtcd.h
+++ b/config/arm/config/av1_rtcd.h
@@ -15,12 +15,12 @@
#include "aom/aom_integer.h"
#include "aom_dsp/odintrin.h"
#include "aom_dsp/txfm_common.h"
-#include "av1/common/common.h"
-#include "av1/common/enums.h"
-#include "av1/common/quant_common.h"
-#include "av1/common/filter.h"
-#include "av1/common/convolve.h"
#include "av1/common/av1_txfm.h"
+#include "av1/common/common.h"
+#include "av1/common/convolve.h"
+#include "av1/common/enums.h"
+#include "av1/common/filter.h"
+#include "av1/common/quant_common.h"
#include "av1/common/restoration.h"
struct macroblockd;
@@ -80,14 +80,11 @@
const MV *const mv, uint8_t *comp_pred, const uint8_t *pred, int width,
int height, int subpel_x_q3, int subpel_y_q3, const uint8_t *ref,
int ref_stride, int subpel_search);
-#define aom_comp_avg_upsampled_pred aom_comp_avg_upsampled_pred_c
-
-void aom_comp_mask_upsampled_pred_c(MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
- const MV *const mv, uint8_t *comp_pred, const uint8_t *pred, int width,
- int height, int subpel_x_q3, int subpel_y_q3, const uint8_t *ref,
- int ref_stride, const uint8_t *mask, int mask_stride, int invert_mask,
- int subpel_search);
-#define aom_comp_mask_upsampled_pred aom_comp_mask_upsampled_pred_c
+void aom_comp_avg_upsampled_pred_neon(MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
+ const MV *const mv, uint8_t *comp_pred, const uint8_t *pred, int width,
+ int height, int subpel_x_q3, int subpel_y_q3, const uint8_t *ref,
+ int ref_stride, int subpel_search);
+#define aom_comp_avg_upsampled_pred aom_comp_avg_upsampled_pred_neon
void aom_dist_wtd_comp_avg_upsampled_pred_c(MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
const MV *const mv, uint8_t *comp_pred, const uint8_t *pred, int width,
@@ -118,14 +115,17 @@
void aom_upsampled_pred_c(MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
const MV *const mv, uint8_t *comp_pred, int width, int height, int subpel_x_q3,
int subpel_y_q3, const uint8_t *ref, int ref_stride, int subpel_search);
-#define aom_upsampled_pred aom_upsampled_pred_c
+void aom_upsampled_pred_neon(MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
+ const MV *const mv, uint8_t *comp_pred, int width, int height, int subpel_x_q3,
+ int subpel_y_q3, const uint8_t *ref, int ref_stride, int subpel_search);
+#define aom_upsampled_pred aom_upsampled_pred_neon
void av1_apply_selfguided_restoration_c(const uint8_t *dat, int width, int height, int stride, int eps, const int *xqd, uint8_t *dst, int dst_stride, int32_t *tmpbuf, int bit_depth, int highbd);
void av1_apply_selfguided_restoration_neon(const uint8_t *dat, int width, int height, int stride, int eps, const int *xqd, uint8_t *dst, int dst_stride, int32_t *tmpbuf, int bit_depth, int highbd);
#define av1_apply_selfguided_restoration av1_apply_selfguided_restoration_neon
-void av1_apply_temporal_filter_c(const struct yv12_buffer_config *ref_frame, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, const uint8_t *pred, uint32_t *accum, uint16_t *count);
-void av1_apply_temporal_filter_neon(const struct yv12_buffer_config *ref_frame, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, const uint8_t *pred, uint32_t *accum, uint16_t *count);
+void av1_apply_temporal_filter_c(const struct yv12_buffer_config *frame_to_filter, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum, uint16_t *count);
+void av1_apply_temporal_filter_neon(const struct yv12_buffer_config *frame_to_filter, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum, uint16_t *count);
#define av1_apply_temporal_filter av1_apply_temporal_filter_neon
int64_t av1_block_error_c(const tran_low_t *coeff, const tran_low_t *dqcoeff, intptr_t block_size, int64_t *ssz);
@@ -150,10 +150,12 @@
#define av1_calc_frame_error av1_calc_frame_error_c
void av1_calc_indices_dim1_c(const int16_t *data, const int16_t *centroids, uint8_t *indices, int64_t *total_dist, int n, int k);
-#define av1_calc_indices_dim1 av1_calc_indices_dim1_c
+void av1_calc_indices_dim1_neon(const int16_t *data, const int16_t *centroids, uint8_t *indices, int64_t *total_dist, int n, int k);
+#define av1_calc_indices_dim1 av1_calc_indices_dim1_neon
void av1_calc_indices_dim2_c(const int16_t *data, const int16_t *centroids, uint8_t *indices, int64_t *total_dist, int n, int k);
-#define av1_calc_indices_dim2 av1_calc_indices_dim2_c
+void av1_calc_indices_dim2_neon(const int16_t *data, const int16_t *centroids, uint8_t *indices, int64_t *total_dist, int n, int k);
+#define av1_calc_indices_dim2 av1_calc_indices_dim2_neon
void av1_calc_proj_params_c( const uint8_t *src8, int width, int height, int src_stride, const uint8_t *dat8, int dat_stride, int32_t *flt0, int flt0_stride, int32_t *flt1, int flt1_stride, int64_t H[2][2], int64_t C[2], const sgr_params_type *params);
#define av1_calc_proj_params av1_calc_proj_params_c
@@ -179,7 +181,7 @@
bool av1_cnn_predict_c( const float **input, int in_width, int in_height, int in_stride, const CNN_CONFIG *cnn_config, const CNN_THREAD_DATA *thread_data, CNN_MULTI_OUT *output_struct);
#define av1_cnn_predict av1_cnn_predict_c
-void av1_compute_stats_c(int wiener_win, const uint8_t *dgd8, const uint8_t *src8, int h_start, int h_end, int v_start, int v_end, int dgd_stride, int src_stride, int64_t *M, int64_t *H, int use_downsampled_wiener_stats);
+void av1_compute_stats_c(int wiener_win, const uint8_t *dgd8, const uint8_t *src8, int16_t *dgd_avg, int16_t *src_avg, int h_start, int h_end, int v_start, int v_end, int dgd_stride, int src_stride, int64_t *M, int64_t *H, int use_downsampled_wiener_stats);
#define av1_compute_stats av1_compute_stats_c
void av1_compute_stats_highbd_c(int wiener_win, const uint8_t *dgd8, const uint8_t *src8, int h_start, int h_end, int v_start, int v_end, int dgd_stride, int src_stride, int64_t *M, int64_t *H, aom_bit_depth_t bit_depth);
@@ -231,6 +233,9 @@
void av1_dr_prediction_z3_neon(uint8_t *dst, ptrdiff_t stride, int bw, int bh, const uint8_t *above, const uint8_t *left, int upsample_left, int dx, int dy);
#define av1_dr_prediction_z3 av1_dr_prediction_z3_neon
+double av1_estimate_noise_from_single_plane_c(const uint8_t *src, int height, int width, int stride, int edge_thresh);
+#define av1_estimate_noise_from_single_plane av1_estimate_noise_from_single_plane_c
+
void av1_filter_intra_edge_c(uint8_t *p, int sz, int strength);
#define av1_filter_intra_edge av1_filter_intra_edge_c
@@ -332,7 +337,7 @@
void av1_get_nz_map_contexts_neon(const uint8_t *const levels, const int16_t *const scan, const uint16_t eob, const TX_SIZE tx_size, const TX_CLASS tx_class, int8_t *const coeff_contexts);
#define av1_get_nz_map_contexts av1_get_nz_map_contexts_neon
-void av1_highbd_apply_temporal_filter_c(const struct yv12_buffer_config *ref_frame, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, const uint8_t *pred, uint32_t *accum, uint16_t *count);
+void av1_highbd_apply_temporal_filter_c(const struct yv12_buffer_config *frame_to_filter, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum, uint16_t *count);
#define av1_highbd_apply_temporal_filter av1_highbd_apply_temporal_filter_c
int64_t av1_highbd_block_error_c(const tran_low_t *coeff, const tran_low_t *dqcoeff, intptr_t block_size, int64_t *ssz, int bd);
@@ -348,10 +353,12 @@
#define av1_highbd_convolve8_vert av1_highbd_convolve8_vert_c
void av1_highbd_convolve_2d_scale_c(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_x, const InterpFilterParams *filter_params_y, const int subpel_x_qn, const int x_step_qn, const int subpel_y_qn, const int y_step_qn, ConvolveParams *conv_params, int bd);
-#define av1_highbd_convolve_2d_scale av1_highbd_convolve_2d_scale_c
+void av1_highbd_convolve_2d_scale_neon(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_x, const InterpFilterParams *filter_params_y, const int subpel_x_qn, const int x_step_qn, const int subpel_y_qn, const int y_step_qn, ConvolveParams *conv_params, int bd);
+#define av1_highbd_convolve_2d_scale av1_highbd_convolve_2d_scale_neon
void av1_highbd_convolve_2d_sr_c(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_x, const InterpFilterParams *filter_params_y, const int subpel_x_qn, const int subpel_y_qn, ConvolveParams *conv_params, int bd);
-#define av1_highbd_convolve_2d_sr av1_highbd_convolve_2d_sr_c
+void av1_highbd_convolve_2d_sr_neon(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_x, const InterpFilterParams *filter_params_y, const int subpel_x_qn, const int subpel_y_qn, ConvolveParams *conv_params, int bd);
+#define av1_highbd_convolve_2d_sr av1_highbd_convolve_2d_sr_neon
void av1_highbd_convolve_avg_c(const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst, ptrdiff_t dst_stride, const int16_t *filter_x, int x_step_q4, const int16_t *filter_y, int y_step_q4, int w, int h, int bps);
#define av1_highbd_convolve_avg av1_highbd_convolve_avg_c
@@ -360,25 +367,32 @@
#define av1_highbd_convolve_copy av1_highbd_convolve_copy_c
void av1_highbd_convolve_horiz_rs_c(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const int16_t *x_filters, int x0_qn, int x_step_qn, int bd);
-#define av1_highbd_convolve_horiz_rs av1_highbd_convolve_horiz_rs_c
+void av1_highbd_convolve_horiz_rs_neon(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const int16_t *x_filters, int x0_qn, int x_step_qn, int bd);
+#define av1_highbd_convolve_horiz_rs av1_highbd_convolve_horiz_rs_neon
void av1_highbd_convolve_x_sr_c(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_x, const int subpel_x_qn, ConvolveParams *conv_params, int bd);
-#define av1_highbd_convolve_x_sr av1_highbd_convolve_x_sr_c
+void av1_highbd_convolve_x_sr_neon(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_x, const int subpel_x_qn, ConvolveParams *conv_params, int bd);
+#define av1_highbd_convolve_x_sr av1_highbd_convolve_x_sr_neon
void av1_highbd_convolve_y_sr_c(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_y, const int subpel_y_qn, int bd);
-#define av1_highbd_convolve_y_sr av1_highbd_convolve_y_sr_c
+void av1_highbd_convolve_y_sr_neon(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_y, const int subpel_y_qn, int bd);
+#define av1_highbd_convolve_y_sr av1_highbd_convolve_y_sr_neon
void av1_highbd_dist_wtd_convolve_2d_c(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_x, const InterpFilterParams *filter_params_y, const int subpel_x_qn, const int subpel_y_qn, ConvolveParams *conv_params, int bd);
-#define av1_highbd_dist_wtd_convolve_2d av1_highbd_dist_wtd_convolve_2d_c
+void av1_highbd_dist_wtd_convolve_2d_neon(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_x, const InterpFilterParams *filter_params_y, const int subpel_x_qn, const int subpel_y_qn, ConvolveParams *conv_params, int bd);
+#define av1_highbd_dist_wtd_convolve_2d av1_highbd_dist_wtd_convolve_2d_neon
void av1_highbd_dist_wtd_convolve_2d_copy_c(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, ConvolveParams *conv_params, int bd);
-#define av1_highbd_dist_wtd_convolve_2d_copy av1_highbd_dist_wtd_convolve_2d_copy_c
+void av1_highbd_dist_wtd_convolve_2d_copy_neon(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, ConvolveParams *conv_params, int bd);
+#define av1_highbd_dist_wtd_convolve_2d_copy av1_highbd_dist_wtd_convolve_2d_copy_neon
void av1_highbd_dist_wtd_convolve_x_c(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_x, const int subpel_x_qn, ConvolveParams *conv_params, int bd);
-#define av1_highbd_dist_wtd_convolve_x av1_highbd_dist_wtd_convolve_x_c
+void av1_highbd_dist_wtd_convolve_x_neon(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_x, const int subpel_x_qn, ConvolveParams *conv_params, int bd);
+#define av1_highbd_dist_wtd_convolve_x av1_highbd_dist_wtd_convolve_x_neon
void av1_highbd_dist_wtd_convolve_y_c(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_y, const int subpel_y_qn, ConvolveParams *conv_params, int bd);
-#define av1_highbd_dist_wtd_convolve_y av1_highbd_dist_wtd_convolve_y_c
+void av1_highbd_dist_wtd_convolve_y_neon(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_y, const int subpel_y_qn, ConvolveParams *conv_params, int bd);
+#define av1_highbd_dist_wtd_convolve_y av1_highbd_dist_wtd_convolve_y_neon
void av1_highbd_dr_prediction_z1_c(uint16_t *dst, ptrdiff_t stride, int bw, int bh, const uint16_t *above, const uint16_t *left, int upsample_above, int dx, int dy, int bd);
#define av1_highbd_dr_prediction_z1 av1_highbd_dr_prediction_z1_c
@@ -389,9 +403,8 @@
void av1_highbd_dr_prediction_z3_c(uint16_t *dst, ptrdiff_t stride, int bw, int bh, const uint16_t *above, const uint16_t *left, int upsample_left, int dx, int dy, int bd);
#define av1_highbd_dr_prediction_z3 av1_highbd_dr_prediction_z3_c
-void av1_highbd_fwht4x4_c(const int16_t *input, tran_low_t *output, int stride);
-void av1_highbd_fwht4x4_neon(const int16_t *input, tran_low_t *output, int stride);
-#define av1_highbd_fwht4x4 av1_highbd_fwht4x4_neon
+double av1_highbd_estimate_noise_from_single_plane_c(const uint16_t *src, int height, int width, int stride, int bit_depth, int edge_thresh);
+#define av1_highbd_estimate_noise_from_single_plane av1_highbd_estimate_noise_from_single_plane_c
void av1_highbd_inv_txfm_add_c(const tran_low_t *input, uint8_t *dest, int stride, const TxfmParam *txfm_param);
void av1_highbd_inv_txfm_add_neon(const tran_low_t *input, uint8_t *dest, int stride, const TxfmParam *txfm_param);
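
Beyond the alphabetized includes, this av1_rtcd.h hunk makes three kinds of changes: a batch of highbd convolve and upsampled-pred entries gains NEON bindings; av1_apply_temporal_filter and av1_compute_stats grow extra parameters (tf_wgt_calc_lvl, and the dgd_avg/src_avg scratch buffers, respectively); and av1_estimate_noise_from_single_plane joins the dispatch surface while aom_comp_mask_upsampled_pred and av1_highbd_fwht4x4 leave it. A standalone sketch of invoking the new noise estimator through its macro; the stub body and the edge_thresh value are placeholders, and in a real build the libaom definition is linked instead:

    #include <stdint.h>
    #include <stdio.h>

    double av1_estimate_noise_from_single_plane_c(const uint8_t *src,
                                                  int height, int width,
                                                  int stride, int edge_thresh);
    #define av1_estimate_noise_from_single_plane \
      av1_estimate_noise_from_single_plane_c

    /* Stand-in definition so this snippet runs on its own. */
    double av1_estimate_noise_from_single_plane_c(const uint8_t *src,
                                                  int height, int width,
                                                  int stride, int edge_thresh) {
      (void)src; (void)height; (void)width; (void)stride; (void)edge_thresh;
      return 0.0;  /* the real routine returns a noise-level estimate */
    }

    int main(void) {
      uint8_t plane[16 * 16] = {0};
      double sigma =
          av1_estimate_noise_from_single_plane(plane, 16, 16, 16, 50);
      printf("noise estimate: %f\n", sigma);
      return 0;
    }
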
diff --git a/config/arm64/config/aom_config.asm b/config/arm64/config/aom_config.asm
index a6b5453..9214692 100644
--- a/config/arm64/config/aom_config.asm
+++ b/config/arm64/config/aom_config.asm
@@ -8,10 +8,11 @@
; Media Patent License 1.0 was not distributed with this source code in the
; PATENTS file, you can obtain it at www.aomedia.org/license/patent.
;
-ARCH_ARM equ 1
-ARCH_PPC equ 0
-ARCH_X86 equ 0
-ARCH_X86_64 equ 0
+AOM_ARCH_AARCH64 equ 1
+AOM_ARCH_ARM equ 1
+AOM_ARCH_PPC equ 0
+AOM_ARCH_X86 equ 0
+AOM_ARCH_X86_64 equ 0
CONFIG_ACCOUNTING equ 0
CONFIG_ANALYZER equ 0
CONFIG_AV1_DECODER equ 1
@@ -47,6 +48,7 @@
CONFIG_NORMAL_TILE_MODE equ 1
CONFIG_OPTICAL_FLOW_API equ 0
CONFIG_OS_SUPPORT equ 1
+CONFIG_OUTPUT_FRAME_SIZE equ 0
CONFIG_PARTITION_SEARCH_ORDER equ 0
CONFIG_PIC equ 1
CONFIG_RATECTRL_LOG equ 0
@@ -55,6 +57,7 @@
CONFIG_REALTIME_ONLY equ 0
CONFIG_RT_ML_PARTITIONING equ 0
CONFIG_RUNTIME_CPU_DETECT equ 0
+CONFIG_SALIENCY_MAP equ 0
CONFIG_SHARED equ 0
CONFIG_SIZE_LIMIT equ 1
CONFIG_SPATIAL_RESAMPLING equ 1
diff --git a/config/arm64/config/aom_config.c b/config/arm64/config/aom_config.c
index a4d09f7..0a75709 100644
--- a/config/arm64/config/aom_config.c
+++ b/config/arm64/config/aom_config.c
@@ -1,5 +1,5 @@
/*
- * Copyright (c) 2016, Alliance for Open Media. All rights reserved
+ * Copyright (c) 2023, Alliance for Open Media. All rights reserved
*
* This source code is subject to the terms of the BSD 2 Clause License and
* the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
diff --git a/config/arm64/config/aom_config.h b/config/arm64/config/aom_config.h
index 9f2cfc1..239527c 100644
--- a/config/arm64/config/aom_config.h
+++ b/config/arm64/config/aom_config.h
@@ -10,10 +10,11 @@
*/
#ifndef AOM_CONFIG_H_
#define AOM_CONFIG_H_
-#define ARCH_ARM 1
-#define ARCH_PPC 0
-#define ARCH_X86 0
-#define ARCH_X86_64 0
+#define AOM_ARCH_AARCH64 1
+#define AOM_ARCH_ARM 1
+#define AOM_ARCH_PPC 0
+#define AOM_ARCH_X86 0
+#define AOM_ARCH_X86_64 0
#define CONFIG_ACCOUNTING 0
#define CONFIG_ANALYZER 0
#define CONFIG_AV1_DECODER 1
@@ -49,6 +50,7 @@
#define CONFIG_NORMAL_TILE_MODE 1
#define CONFIG_OPTICAL_FLOW_API 0
#define CONFIG_OS_SUPPORT 1
+#define CONFIG_OUTPUT_FRAME_SIZE 0
#define CONFIG_PARTITION_SEARCH_ORDER 0
#define CONFIG_PIC 1
#define CONFIG_RATECTRL_LOG 0
@@ -57,6 +59,7 @@
#define CONFIG_REALTIME_ONLY 0
#define CONFIG_RT_ML_PARTITIONING 0
#define CONFIG_RUNTIME_CPU_DETECT 0
+#define CONFIG_SALIENCY_MAP 0
#define CONFIG_SHARED 0
#define CONFIG_SIZE_LIMIT 1
#define CONFIG_SPATIAL_RESAMPLING 1
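
The headline in these config hunks is the rename: the old ARCH_* symbols become AOM_ARCH_*, and AOM_ARCH_AARCH64 appears alongside AOM_ARCH_ARM so 64-bit Arm can be told apart from 32-bit. Two new feature flags, CONFIG_OUTPUT_FRAME_SIZE and CONFIG_SALIENCY_MAP, are also added, both off in this config. A self-contained sketch of consuming the renamed macros; the two defines simulate the generated header, and the compatibility shim is an assumption about downstream code, not something libaom ships:

    #include <stdio.h>

    /* Simulating config/arm64/config/aom_config.h for a standalone build. */
    #define AOM_ARCH_AARCH64 1
    #define AOM_ARCH_ARM 1

    /* Hypothetical shim so code still spelling the old ARCH_ARM keeps
     * compiling during a migration. */
    #ifndef ARCH_ARM
    #define ARCH_ARM AOM_ARCH_ARM
    #endif

    int main(void) {
    #if AOM_ARCH_AARCH64
      puts("AArch64 config: NEON is baseline");
    #endif
      return ARCH_ARM ? 0 : 1;
    }
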
diff --git a/config/arm64/config/aom_dsp_rtcd.h b/config/arm64/config/aom_dsp_rtcd.h
index 7ae6636..ad77b04 100644
--- a/config/arm64/config/aom_dsp_rtcd.h
+++ b/config/arm64/config/aom_dsp_rtcd.h
@@ -14,8 +14,8 @@
#include "aom/aom_integer.h"
#include "aom_dsp/aom_dsp_common.h"
-#include "av1/common/enums.h"
#include "av1/common/blockd.h"
+#include "av1/common/enums.h"
#ifdef __cplusplus
@@ -46,19 +46,26 @@
#define aom_blend_a64_vmask aom_blend_a64_vmask_neon
void aom_comp_avg_pred_c(uint8_t *comp_pred, const uint8_t *pred, int width, int height, const uint8_t *ref, int ref_stride);
-#define aom_comp_avg_pred aom_comp_avg_pred_c
+void aom_comp_avg_pred_neon(uint8_t *comp_pred, const uint8_t *pred, int width, int height, const uint8_t *ref, int ref_stride);
+#define aom_comp_avg_pred aom_comp_avg_pred_neon
void aom_comp_mask_pred_c(uint8_t *comp_pred, const uint8_t *pred, int width, int height, const uint8_t *ref, int ref_stride, const uint8_t *mask, int mask_stride, int invert_mask);
-#define aom_comp_mask_pred aom_comp_mask_pred_c
+void aom_comp_mask_pred_neon(uint8_t *comp_pred, const uint8_t *pred, int width, int height, const uint8_t *ref, int ref_stride, const uint8_t *mask, int mask_stride, int invert_mask);
+#define aom_comp_mask_pred aom_comp_mask_pred_neon
+
+void aom_compute_flow_at_point_c(const uint8_t *src, const uint8_t *ref, int x, int y, int width, int height, int stride, double *u, double *v);
+#define aom_compute_flow_at_point aom_compute_flow_at_point_c
void aom_convolve8_c(const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst, ptrdiff_t dst_stride, const InterpKernel *filter, int x0_q4, int x_step_q4, int y0_q4, int y_step_q4, int w, int h);
#define aom_convolve8 aom_convolve8_c
void aom_convolve8_horiz_c(const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst, ptrdiff_t dst_stride, const int16_t *filter_x, int x_step_q4, const int16_t *filter_y, int y_step_q4, int w, int h);
-#define aom_convolve8_horiz aom_convolve8_horiz_c
+void aom_convolve8_horiz_neon(const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst, ptrdiff_t dst_stride, const int16_t *filter_x, int x_step_q4, const int16_t *filter_y, int y_step_q4, int w, int h);
+#define aom_convolve8_horiz aom_convolve8_horiz_neon
void aom_convolve8_vert_c(const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst, ptrdiff_t dst_stride, const int16_t *filter_x, int x_step_q4, const int16_t *filter_y, int y_step_q4, int w, int h);
-#define aom_convolve8_vert aom_convolve8_vert_c
+void aom_convolve8_vert_neon(const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst, ptrdiff_t dst_stride, const int16_t *filter_x, int x_step_q4, const int16_t *filter_y, int y_step_q4, int w, int h);
+#define aom_convolve8_vert aom_convolve8_vert_neon
void aom_convolve_copy_c(const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst, ptrdiff_t dst_stride, int w, int h);
void aom_convolve_copy_neon(const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst, ptrdiff_t dst_stride, int w, int h);
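
Two additions stand out in this first arm64 hunk: aom_comp_avg_pred and aom_comp_mask_pred pick up NEON bindings, and aom_compute_flow_at_point enters the table with only a C implementation. A standalone sketch of driving the latter through its macro; the stub body is a placeholder for the real computation, and the assumption that *u and *v are refined in place is inferred from the double-pointer outputs, not confirmed here:

    #include <stdint.h>
    #include <stdio.h>

    void aom_compute_flow_at_point_c(const uint8_t *src, const uint8_t *ref,
                                     int x, int y, int width, int height,
                                     int stride, double *u, double *v);
    #define aom_compute_flow_at_point aom_compute_flow_at_point_c

    /* Stand-in definition so the snippet compiles standalone. */
    void aom_compute_flow_at_point_c(const uint8_t *src, const uint8_t *ref,
                                     int x, int y, int width, int height,
                                     int stride, double *u, double *v) {
      (void)src; (void)ref; (void)x; (void)y;
      (void)width; (void)height; (void)stride;
      (void)u; (void)v;  /* the real routine would refine *u and *v */
    }

    int main(void) {
      uint8_t src[64 * 64] = {0}, ref[64 * 64] = {0};
      double u = 0.0, v = 0.0;  /* initial flow guess at (32, 32) */
      aom_compute_flow_at_point(src, ref, 32, 32, 64, 64, 64, &u, &v);
      printf("flow: (%f, %f)\n", u, v);
      return 0;
    }
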
@@ -69,57 +76,72 @@
#define aom_dc_128_predictor_16x16 aom_dc_128_predictor_16x16_neon
void aom_dc_128_predictor_16x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_16x32 aom_dc_128_predictor_16x32_c
+void aom_dc_128_predictor_16x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_16x32 aom_dc_128_predictor_16x32_neon
void aom_dc_128_predictor_16x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_16x4 aom_dc_128_predictor_16x4_c
+void aom_dc_128_predictor_16x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_16x4 aom_dc_128_predictor_16x4_neon
void aom_dc_128_predictor_16x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_16x64 aom_dc_128_predictor_16x64_c
+void aom_dc_128_predictor_16x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_16x64 aom_dc_128_predictor_16x64_neon
void aom_dc_128_predictor_16x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_16x8 aom_dc_128_predictor_16x8_c
+void aom_dc_128_predictor_16x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_16x8 aom_dc_128_predictor_16x8_neon
void aom_dc_128_predictor_32x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_32x16 aom_dc_128_predictor_32x16_c
+void aom_dc_128_predictor_32x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_32x16 aom_dc_128_predictor_32x16_neon
void aom_dc_128_predictor_32x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_128_predictor_32x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_dc_128_predictor_32x32 aom_dc_128_predictor_32x32_neon
void aom_dc_128_predictor_32x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_32x64 aom_dc_128_predictor_32x64_c
+void aom_dc_128_predictor_32x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_32x64 aom_dc_128_predictor_32x64_neon
void aom_dc_128_predictor_32x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_32x8 aom_dc_128_predictor_32x8_c
+void aom_dc_128_predictor_32x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_32x8 aom_dc_128_predictor_32x8_neon
void aom_dc_128_predictor_4x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_4x16 aom_dc_128_predictor_4x16_c
+void aom_dc_128_predictor_4x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_4x16 aom_dc_128_predictor_4x16_neon
void aom_dc_128_predictor_4x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_128_predictor_4x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_dc_128_predictor_4x4 aom_dc_128_predictor_4x4_neon
void aom_dc_128_predictor_4x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_4x8 aom_dc_128_predictor_4x8_c
+void aom_dc_128_predictor_4x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_4x8 aom_dc_128_predictor_4x8_neon
void aom_dc_128_predictor_64x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_64x16 aom_dc_128_predictor_64x16_c
+void aom_dc_128_predictor_64x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_64x16 aom_dc_128_predictor_64x16_neon
void aom_dc_128_predictor_64x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_64x32 aom_dc_128_predictor_64x32_c
+void aom_dc_128_predictor_64x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_64x32 aom_dc_128_predictor_64x32_neon
void aom_dc_128_predictor_64x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_64x64 aom_dc_128_predictor_64x64_c
+void aom_dc_128_predictor_64x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_64x64 aom_dc_128_predictor_64x64_neon
void aom_dc_128_predictor_8x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_8x16 aom_dc_128_predictor_8x16_c
+void aom_dc_128_predictor_8x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_8x16 aom_dc_128_predictor_8x16_neon
void aom_dc_128_predictor_8x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_8x32 aom_dc_128_predictor_8x32_c
+void aom_dc_128_predictor_8x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_8x32 aom_dc_128_predictor_8x32_neon
void aom_dc_128_predictor_8x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_128_predictor_8x4 aom_dc_128_predictor_8x4_c
+void aom_dc_128_predictor_8x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_128_predictor_8x4 aom_dc_128_predictor_8x4_neon
void aom_dc_128_predictor_8x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_128_predictor_8x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
@@ -130,57 +152,72 @@
#define aom_dc_left_predictor_16x16 aom_dc_left_predictor_16x16_neon
void aom_dc_left_predictor_16x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_16x32 aom_dc_left_predictor_16x32_c
+void aom_dc_left_predictor_16x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_16x32 aom_dc_left_predictor_16x32_neon
void aom_dc_left_predictor_16x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_16x4 aom_dc_left_predictor_16x4_c
+void aom_dc_left_predictor_16x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_16x4 aom_dc_left_predictor_16x4_neon
void aom_dc_left_predictor_16x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_16x64 aom_dc_left_predictor_16x64_c
+void aom_dc_left_predictor_16x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_16x64 aom_dc_left_predictor_16x64_neon
void aom_dc_left_predictor_16x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_16x8 aom_dc_left_predictor_16x8_c
+void aom_dc_left_predictor_16x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_16x8 aom_dc_left_predictor_16x8_neon
void aom_dc_left_predictor_32x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_32x16 aom_dc_left_predictor_32x16_c
+void aom_dc_left_predictor_32x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_32x16 aom_dc_left_predictor_32x16_neon
void aom_dc_left_predictor_32x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_left_predictor_32x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_dc_left_predictor_32x32 aom_dc_left_predictor_32x32_neon
void aom_dc_left_predictor_32x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_32x64 aom_dc_left_predictor_32x64_c
+void aom_dc_left_predictor_32x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_32x64 aom_dc_left_predictor_32x64_neon
void aom_dc_left_predictor_32x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_32x8 aom_dc_left_predictor_32x8_c
+void aom_dc_left_predictor_32x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_32x8 aom_dc_left_predictor_32x8_neon
void aom_dc_left_predictor_4x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_4x16 aom_dc_left_predictor_4x16_c
+void aom_dc_left_predictor_4x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_4x16 aom_dc_left_predictor_4x16_neon
void aom_dc_left_predictor_4x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_left_predictor_4x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_dc_left_predictor_4x4 aom_dc_left_predictor_4x4_neon
void aom_dc_left_predictor_4x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_4x8 aom_dc_left_predictor_4x8_c
+void aom_dc_left_predictor_4x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_4x8 aom_dc_left_predictor_4x8_neon
void aom_dc_left_predictor_64x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_64x16 aom_dc_left_predictor_64x16_c
+void aom_dc_left_predictor_64x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_64x16 aom_dc_left_predictor_64x16_neon
void aom_dc_left_predictor_64x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_64x32 aom_dc_left_predictor_64x32_c
+void aom_dc_left_predictor_64x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_64x32 aom_dc_left_predictor_64x32_neon
void aom_dc_left_predictor_64x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_64x64 aom_dc_left_predictor_64x64_c
+void aom_dc_left_predictor_64x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_64x64 aom_dc_left_predictor_64x64_neon
void aom_dc_left_predictor_8x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_8x16 aom_dc_left_predictor_8x16_c
+void aom_dc_left_predictor_8x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_8x16 aom_dc_left_predictor_8x16_neon
void aom_dc_left_predictor_8x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_8x32 aom_dc_left_predictor_8x32_c
+void aom_dc_left_predictor_8x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_8x32 aom_dc_left_predictor_8x32_neon
void aom_dc_left_predictor_8x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_left_predictor_8x4 aom_dc_left_predictor_8x4_c
+void aom_dc_left_predictor_8x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_left_predictor_8x4 aom_dc_left_predictor_8x4_neon
void aom_dc_left_predictor_8x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_left_predictor_8x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
@@ -191,57 +228,72 @@
#define aom_dc_predictor_16x16 aom_dc_predictor_16x16_neon
void aom_dc_predictor_16x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_16x32 aom_dc_predictor_16x32_c
+void aom_dc_predictor_16x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_16x32 aom_dc_predictor_16x32_neon
void aom_dc_predictor_16x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_16x4 aom_dc_predictor_16x4_c
+void aom_dc_predictor_16x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_16x4 aom_dc_predictor_16x4_neon
void aom_dc_predictor_16x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_16x64 aom_dc_predictor_16x64_c
+void aom_dc_predictor_16x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_16x64 aom_dc_predictor_16x64_neon
void aom_dc_predictor_16x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_16x8 aom_dc_predictor_16x8_c
+void aom_dc_predictor_16x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_16x8 aom_dc_predictor_16x8_neon
void aom_dc_predictor_32x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_32x16 aom_dc_predictor_32x16_c
+void aom_dc_predictor_32x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_32x16 aom_dc_predictor_32x16_neon
void aom_dc_predictor_32x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_predictor_32x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_dc_predictor_32x32 aom_dc_predictor_32x32_neon
void aom_dc_predictor_32x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_32x64 aom_dc_predictor_32x64_c
+void aom_dc_predictor_32x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_32x64 aom_dc_predictor_32x64_neon
void aom_dc_predictor_32x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_32x8 aom_dc_predictor_32x8_c
+void aom_dc_predictor_32x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_32x8 aom_dc_predictor_32x8_neon
void aom_dc_predictor_4x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_4x16 aom_dc_predictor_4x16_c
+void aom_dc_predictor_4x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_4x16 aom_dc_predictor_4x16_neon
void aom_dc_predictor_4x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_predictor_4x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_dc_predictor_4x4 aom_dc_predictor_4x4_neon
void aom_dc_predictor_4x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_4x8 aom_dc_predictor_4x8_c
+void aom_dc_predictor_4x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_4x8 aom_dc_predictor_4x8_neon
void aom_dc_predictor_64x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_64x16 aom_dc_predictor_64x16_c
+void aom_dc_predictor_64x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_64x16 aom_dc_predictor_64x16_neon
void aom_dc_predictor_64x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_64x32 aom_dc_predictor_64x32_c
+void aom_dc_predictor_64x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_64x32 aom_dc_predictor_64x32_neon
void aom_dc_predictor_64x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_64x64 aom_dc_predictor_64x64_c
+void aom_dc_predictor_64x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_64x64 aom_dc_predictor_64x64_neon
void aom_dc_predictor_8x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_8x16 aom_dc_predictor_8x16_c
+void aom_dc_predictor_8x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_8x16 aom_dc_predictor_8x16_neon
void aom_dc_predictor_8x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_8x32 aom_dc_predictor_8x32_c
+void aom_dc_predictor_8x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_8x32 aom_dc_predictor_8x32_neon
void aom_dc_predictor_8x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_predictor_8x4 aom_dc_predictor_8x4_c
+void aom_dc_predictor_8x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_predictor_8x4 aom_dc_predictor_8x4_neon
void aom_dc_predictor_8x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_predictor_8x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
@@ -252,57 +304,72 @@
#define aom_dc_top_predictor_16x16 aom_dc_top_predictor_16x16_neon
void aom_dc_top_predictor_16x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_16x32 aom_dc_top_predictor_16x32_c
+void aom_dc_top_predictor_16x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_16x32 aom_dc_top_predictor_16x32_neon
void aom_dc_top_predictor_16x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_16x4 aom_dc_top_predictor_16x4_c
+void aom_dc_top_predictor_16x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_16x4 aom_dc_top_predictor_16x4_neon
void aom_dc_top_predictor_16x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_16x64 aom_dc_top_predictor_16x64_c
+void aom_dc_top_predictor_16x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_16x64 aom_dc_top_predictor_16x64_neon
void aom_dc_top_predictor_16x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_16x8 aom_dc_top_predictor_16x8_c
+void aom_dc_top_predictor_16x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_16x8 aom_dc_top_predictor_16x8_neon
void aom_dc_top_predictor_32x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_32x16 aom_dc_top_predictor_32x16_c
+void aom_dc_top_predictor_32x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_32x16 aom_dc_top_predictor_32x16_neon
void aom_dc_top_predictor_32x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_top_predictor_32x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_dc_top_predictor_32x32 aom_dc_top_predictor_32x32_neon
void aom_dc_top_predictor_32x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_32x64 aom_dc_top_predictor_32x64_c
+void aom_dc_top_predictor_32x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_32x64 aom_dc_top_predictor_32x64_neon
void aom_dc_top_predictor_32x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_32x8 aom_dc_top_predictor_32x8_c
+void aom_dc_top_predictor_32x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_32x8 aom_dc_top_predictor_32x8_neon
void aom_dc_top_predictor_4x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_4x16 aom_dc_top_predictor_4x16_c
+void aom_dc_top_predictor_4x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_4x16 aom_dc_top_predictor_4x16_neon
void aom_dc_top_predictor_4x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_top_predictor_4x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_dc_top_predictor_4x4 aom_dc_top_predictor_4x4_neon
void aom_dc_top_predictor_4x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_4x8 aom_dc_top_predictor_4x8_c
+void aom_dc_top_predictor_4x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_4x8 aom_dc_top_predictor_4x8_neon
void aom_dc_top_predictor_64x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_64x16 aom_dc_top_predictor_64x16_c
+void aom_dc_top_predictor_64x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_64x16 aom_dc_top_predictor_64x16_neon
void aom_dc_top_predictor_64x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_64x32 aom_dc_top_predictor_64x32_c
+void aom_dc_top_predictor_64x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_64x32 aom_dc_top_predictor_64x32_neon
void aom_dc_top_predictor_64x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_64x64 aom_dc_top_predictor_64x64_c
+void aom_dc_top_predictor_64x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_64x64 aom_dc_top_predictor_64x64_neon
void aom_dc_top_predictor_8x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_8x16 aom_dc_top_predictor_8x16_c
+void aom_dc_top_predictor_8x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_8x16 aom_dc_top_predictor_8x16_neon
void aom_dc_top_predictor_8x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_8x32 aom_dc_top_predictor_8x32_c
+void aom_dc_top_predictor_8x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_8x32 aom_dc_top_predictor_8x32_neon
void aom_dc_top_predictor_8x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_dc_top_predictor_8x4 aom_dc_top_predictor_8x4_c
+void aom_dc_top_predictor_8x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_dc_top_predictor_8x4 aom_dc_top_predictor_8x4_neon
void aom_dc_top_predictor_8x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_dc_top_predictor_8x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
@@ -451,10 +518,6 @@
void aom_fdct4x4_lp_neon(const int16_t *input, int16_t *output, int stride);
#define aom_fdct4x4_lp aom_fdct4x4_lp_neon
-void aom_fdct8x8_c(const int16_t *input, tran_low_t *output, int stride);
-void aom_fdct8x8_neon(const int16_t *input, tran_low_t *output, int stride);
-#define aom_fdct8x8 aom_fdct8x8_neon
-
void aom_fft16x16_float_c(const float *input, float *temp, float *output);
#define aom_fft16x16_float aom_fft16x16_float_c
@@ -470,18 +533,6 @@
void aom_fft8x8_float_c(const float *input, float *temp, float *output);
#define aom_fft8x8_float aom_fft8x8_float_c
-void aom_get16x16var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-void aom_get16x16var_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_get16x16var aom_get16x16var_neon
-
-unsigned int aom_get4x4sse_cs_c(const unsigned char *src_ptr, int source_stride, const unsigned char *ref_ptr, int ref_stride);
-unsigned int aom_get4x4sse_cs_neon(const unsigned char *src_ptr, int source_stride, const unsigned char *ref_ptr, int ref_stride);
-#define aom_get4x4sse_cs aom_get4x4sse_cs_neon
-
-void aom_get8x8var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-void aom_get8x8var_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_get8x8var aom_get8x8var_neon
-
void aom_get_blk_sse_sum_c(const int16_t *data, int stride, int bw, int bh, int *x_sum, int64_t *x2_sum);
#define aom_get_blk_sse_sum aom_get_blk_sse_sum_c
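
The two deletion hunks above drop aom_fdct8x8, aom_get16x16var, aom_get4x4sse_cs, and aom_get8x8var from the dispatch surface outright rather than remapping them. A caller that relied on the sse/sum outputs of the removed get-var helpers can recompute them directly; the following is a hedged stand-in grounded only in the removed prototypes' outputs, not the upstream replacement:

    #include <stdint.h>
    #include <stdio.h>

    /* Recompute what aom_get8x8var-style helpers returned: the sum of
     * squared differences (sse) and the signed sum of differences. */
    static void block_var_8x8(const uint8_t *src, int src_stride,
                              const uint8_t *ref, int ref_stride,
                              unsigned int *sse, int *sum) {
      unsigned int s2 = 0;
      int s = 0;
      for (int r = 0; r < 8; ++r) {
        for (int c = 0; c < 8; ++c) {
          const int d = src[r * src_stride + c] - ref[r * ref_stride + c];
          s += d;
          s2 += (unsigned int)(d * d);
        }
      }
      *sse = s2;
      *sum = s;
    }

    int main(void) {
      uint8_t src[8 * 8], ref[8 * 8] = {0};
      for (int i = 0; i < 64; ++i) src[i] = (uint8_t)i;
      unsigned int sse;
      int sum;
      block_var_8x8(src, 8, ref, 8, &sse, &sum);
      /* variance = sse - sum*sum/64, the usual downstream computation */
      printf("sse=%u sum=%d var=%u\n", sse, sum,
             sse - (unsigned int)(((int64_t)sum * sum) / 64));
      return 0;
    }
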
@@ -489,7 +540,8 @@
#define aom_get_mb_ss aom_get_mb_ss_c
void aom_get_var_sse_sum_16x16_dual_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse16x16, unsigned int *tot_sse, int *tot_sum, uint32_t *var16x16);
-#define aom_get_var_sse_sum_16x16_dual aom_get_var_sse_sum_16x16_dual_c
+void aom_get_var_sse_sum_16x16_dual_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse16x16, unsigned int *tot_sse, int *tot_sum, uint32_t *var16x16);
+#define aom_get_var_sse_sum_16x16_dual aom_get_var_sse_sum_16x16_dual_neon
void aom_get_var_sse_sum_8x8_quad_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse8x8, int *sum8x8, unsigned int *tot_sse, int *tot_sum, uint32_t *var8x8);
void aom_get_var_sse_sum_8x8_quad_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse8x8, int *sum8x8, unsigned int *tot_sse, int *tot_sum, uint32_t *var8x8);
@@ -500,57 +552,72 @@
#define aom_h_predictor_16x16 aom_h_predictor_16x16_neon
void aom_h_predictor_16x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_16x32 aom_h_predictor_16x32_c
+void aom_h_predictor_16x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_16x32 aom_h_predictor_16x32_neon
void aom_h_predictor_16x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_16x4 aom_h_predictor_16x4_c
+void aom_h_predictor_16x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_16x4 aom_h_predictor_16x4_neon
void aom_h_predictor_16x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_16x64 aom_h_predictor_16x64_c
+void aom_h_predictor_16x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_16x64 aom_h_predictor_16x64_neon
void aom_h_predictor_16x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_16x8 aom_h_predictor_16x8_c
+void aom_h_predictor_16x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_16x8 aom_h_predictor_16x8_neon
void aom_h_predictor_32x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_32x16 aom_h_predictor_32x16_c
+void aom_h_predictor_32x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_32x16 aom_h_predictor_32x16_neon
void aom_h_predictor_32x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_h_predictor_32x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_h_predictor_32x32 aom_h_predictor_32x32_neon
void aom_h_predictor_32x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_32x64 aom_h_predictor_32x64_c
+void aom_h_predictor_32x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_32x64 aom_h_predictor_32x64_neon
void aom_h_predictor_32x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_32x8 aom_h_predictor_32x8_c
+void aom_h_predictor_32x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_32x8 aom_h_predictor_32x8_neon
void aom_h_predictor_4x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_4x16 aom_h_predictor_4x16_c
+void aom_h_predictor_4x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_4x16 aom_h_predictor_4x16_neon
void aom_h_predictor_4x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_h_predictor_4x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_h_predictor_4x4 aom_h_predictor_4x4_neon
void aom_h_predictor_4x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_4x8 aom_h_predictor_4x8_c
+void aom_h_predictor_4x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_4x8 aom_h_predictor_4x8_neon
void aom_h_predictor_64x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_64x16 aom_h_predictor_64x16_c
+void aom_h_predictor_64x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_64x16 aom_h_predictor_64x16_neon
void aom_h_predictor_64x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_64x32 aom_h_predictor_64x32_c
+void aom_h_predictor_64x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_64x32 aom_h_predictor_64x32_neon
void aom_h_predictor_64x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_64x64 aom_h_predictor_64x64_c
+void aom_h_predictor_64x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_64x64 aom_h_predictor_64x64_neon
void aom_h_predictor_8x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_8x16 aom_h_predictor_8x16_c
+void aom_h_predictor_8x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_8x16 aom_h_predictor_8x16_neon
void aom_h_predictor_8x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_8x32 aom_h_predictor_8x32_c
+void aom_h_predictor_8x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_8x32 aom_h_predictor_8x32_neon
void aom_h_predictor_8x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_h_predictor_8x4 aom_h_predictor_8x4_c
+void aom_h_predictor_8x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_h_predictor_8x4 aom_h_predictor_8x4_neon
void aom_h_predictor_8x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_h_predictor_8x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
@@ -561,10 +628,12 @@
#define aom_hadamard_16x16 aom_hadamard_16x16_neon
void aom_hadamard_32x32_c(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
-#define aom_hadamard_32x32 aom_hadamard_32x32_c
+void aom_hadamard_32x32_neon(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
+#define aom_hadamard_32x32 aom_hadamard_32x32_neon
void aom_hadamard_4x4_c(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
-#define aom_hadamard_4x4 aom_hadamard_4x4_c
+void aom_hadamard_4x4_neon(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
+#define aom_hadamard_4x4 aom_hadamard_4x4_neon
void aom_hadamard_8x8_c(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
void aom_hadamard_8x8_neon(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
@@ -648,12 +717,6 @@
uint32_t aom_highbd_10_dist_wtd_sub_pixel_avg_variance8x8_c(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param);
#define aom_highbd_10_dist_wtd_sub_pixel_avg_variance8x8 aom_highbd_10_dist_wtd_sub_pixel_avg_variance8x8_c
-void aom_highbd_10_get16x16var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_10_get16x16var aom_highbd_10_get16x16var_c
-
-void aom_highbd_10_get8x8var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_10_get8x8var aom_highbd_10_get8x8var_c
-
unsigned int aom_highbd_10_masked_sub_pixel_variance128x128_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
#define aom_highbd_10_masked_sub_pixel_variance128x128 aom_highbd_10_masked_sub_pixel_variance128x128_c
@@ -721,16 +784,20 @@
#define aom_highbd_10_masked_sub_pixel_variance8x8 aom_highbd_10_masked_sub_pixel_variance8x8_c
unsigned int aom_highbd_10_mse16x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_10_mse16x16 aom_highbd_10_mse16x16_c
+unsigned int aom_highbd_10_mse16x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_10_mse16x16 aom_highbd_10_mse16x16_neon
unsigned int aom_highbd_10_mse16x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_10_mse16x8 aom_highbd_10_mse16x8_c
+unsigned int aom_highbd_10_mse16x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_10_mse16x8 aom_highbd_10_mse16x8_neon
unsigned int aom_highbd_10_mse8x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_10_mse8x16 aom_highbd_10_mse8x16_c
+unsigned int aom_highbd_10_mse8x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_10_mse8x16 aom_highbd_10_mse8x16_neon
unsigned int aom_highbd_10_mse8x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_10_mse8x8 aom_highbd_10_mse8x8_c
+unsigned int aom_highbd_10_mse8x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_10_mse8x8 aom_highbd_10_mse8x8_neon
unsigned int aom_highbd_10_obmc_sub_pixel_variance128x128_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
#define aom_highbd_10_obmc_sub_pixel_variance128x128 aom_highbd_10_obmc_sub_pixel_variance128x128_c
@@ -1012,12 +1079,12 @@
unsigned int aom_highbd_10_variance16x32_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance16x32 aom_highbd_10_variance16x32_neon
-unsigned int aom_highbd_10_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_10_variance16x4_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_10_variance16x4_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance16x4 aom_highbd_10_variance16x4_neon
-unsigned int aom_highbd_10_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_10_variance16x64_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_10_variance16x64_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance16x64 aom_highbd_10_variance16x64_neon
unsigned int aom_highbd_10_variance16x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
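
Alongside the NEON additions, these variance hunks respell `uint32_t *sse` as `unsigned int *sse` in the high-bitdepth prototypes. Assuming, as this generated header does, that unsigned int is 32 bits wide on every supported target, this only harmonizes the declarations with the rest of the file and changes no ABI. A one-line compile-time check of that assumption:

    #include <stdint.h>

    /* Assumption being checked: unsigned int and uint32_t have the same
     * width on the targets this header is generated for, so the sse
     * parameter change is type-spelling only. */
    _Static_assert(sizeof(unsigned int) == sizeof(uint32_t),
                   "uint32_t -> unsigned int must not change the ABI");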
@@ -1042,12 +1109,12 @@
unsigned int aom_highbd_10_variance32x64_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance32x64 aom_highbd_10_variance32x64_neon
-unsigned int aom_highbd_10_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_10_variance32x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_10_variance32x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance32x8 aom_highbd_10_variance32x8_neon
-unsigned int aom_highbd_10_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_10_variance4x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_10_variance4x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance4x16 aom_highbd_10_variance4x16_neon
unsigned int aom_highbd_10_variance4x2_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1065,8 +1132,8 @@
unsigned int aom_highbd_10_variance64x128_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance64x128 aom_highbd_10_variance64x128_neon
-unsigned int aom_highbd_10_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_10_variance64x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_10_variance64x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance64x16 aom_highbd_10_variance64x16_neon
unsigned int aom_highbd_10_variance64x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1081,8 +1148,8 @@
unsigned int aom_highbd_10_variance8x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance8x16 aom_highbd_10_variance8x16_neon
-unsigned int aom_highbd_10_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_10_variance8x32_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_10_variance8x32_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance8x32 aom_highbd_10_variance8x32_neon
unsigned int aom_highbd_10_variance8x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1159,12 +1226,6 @@
uint32_t aom_highbd_12_dist_wtd_sub_pixel_avg_variance8x8_c(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param);
#define aom_highbd_12_dist_wtd_sub_pixel_avg_variance8x8 aom_highbd_12_dist_wtd_sub_pixel_avg_variance8x8_c
-void aom_highbd_12_get16x16var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_12_get16x16var aom_highbd_12_get16x16var_c
-
-void aom_highbd_12_get8x8var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_12_get8x8var aom_highbd_12_get8x8var_c
-
unsigned int aom_highbd_12_masked_sub_pixel_variance128x128_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
#define aom_highbd_12_masked_sub_pixel_variance128x128 aom_highbd_12_masked_sub_pixel_variance128x128_c
@@ -1232,16 +1293,20 @@
#define aom_highbd_12_masked_sub_pixel_variance8x8 aom_highbd_12_masked_sub_pixel_variance8x8_c
unsigned int aom_highbd_12_mse16x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_12_mse16x16 aom_highbd_12_mse16x16_c
+unsigned int aom_highbd_12_mse16x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_12_mse16x16 aom_highbd_12_mse16x16_neon
unsigned int aom_highbd_12_mse16x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_12_mse16x8 aom_highbd_12_mse16x8_c
+unsigned int aom_highbd_12_mse16x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_12_mse16x8 aom_highbd_12_mse16x8_neon
unsigned int aom_highbd_12_mse8x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_12_mse8x16 aom_highbd_12_mse8x16_c
+unsigned int aom_highbd_12_mse8x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_12_mse8x16 aom_highbd_12_mse8x16_neon
unsigned int aom_highbd_12_mse8x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_12_mse8x8 aom_highbd_12_mse8x8_c
+unsigned int aom_highbd_12_mse8x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_12_mse8x8 aom_highbd_12_mse8x8_neon
unsigned int aom_highbd_12_obmc_sub_pixel_variance128x128_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
#define aom_highbd_12_obmc_sub_pixel_variance128x128 aom_highbd_12_obmc_sub_pixel_variance128x128_c
@@ -1508,25 +1573,32 @@
#define aom_highbd_12_sub_pixel_variance8x8 aom_highbd_12_sub_pixel_variance8x8_c
unsigned int aom_highbd_12_variance128x128_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance128x128 aom_highbd_12_variance128x128_c
+unsigned int aom_highbd_12_variance128x128_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance128x128 aom_highbd_12_variance128x128_neon
unsigned int aom_highbd_12_variance128x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance128x64 aom_highbd_12_variance128x64_c
+unsigned int aom_highbd_12_variance128x64_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance128x64 aom_highbd_12_variance128x64_neon
unsigned int aom_highbd_12_variance16x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance16x16 aom_highbd_12_variance16x16_c
+unsigned int aom_highbd_12_variance16x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance16x16 aom_highbd_12_variance16x16_neon
unsigned int aom_highbd_12_variance16x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance16x32 aom_highbd_12_variance16x32_c
+unsigned int aom_highbd_12_variance16x32_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance16x32 aom_highbd_12_variance16x32_neon
-unsigned int aom_highbd_12_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_12_variance16x4 aom_highbd_12_variance16x4_c
+unsigned int aom_highbd_12_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_12_variance16x4_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance16x4 aom_highbd_12_variance16x4_neon
-unsigned int aom_highbd_12_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_12_variance16x64 aom_highbd_12_variance16x64_c
+unsigned int aom_highbd_12_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_12_variance16x64_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance16x64 aom_highbd_12_variance16x64_neon
unsigned int aom_highbd_12_variance16x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance16x8 aom_highbd_12_variance16x8_c
+unsigned int aom_highbd_12_variance16x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance16x8 aom_highbd_12_variance16x8_neon
unsigned int aom_highbd_12_variance2x2_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance2x2 aom_highbd_12_variance2x2_c
@@ -1535,52 +1607,67 @@
#define aom_highbd_12_variance2x4 aom_highbd_12_variance2x4_c
unsigned int aom_highbd_12_variance32x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance32x16 aom_highbd_12_variance32x16_c
+unsigned int aom_highbd_12_variance32x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance32x16 aom_highbd_12_variance32x16_neon
unsigned int aom_highbd_12_variance32x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance32x32 aom_highbd_12_variance32x32_c
+unsigned int aom_highbd_12_variance32x32_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance32x32 aom_highbd_12_variance32x32_neon
unsigned int aom_highbd_12_variance32x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance32x64 aom_highbd_12_variance32x64_c
+unsigned int aom_highbd_12_variance32x64_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance32x64 aom_highbd_12_variance32x64_neon
-unsigned int aom_highbd_12_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_12_variance32x8 aom_highbd_12_variance32x8_c
+unsigned int aom_highbd_12_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_12_variance32x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance32x8 aom_highbd_12_variance32x8_neon
-unsigned int aom_highbd_12_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_12_variance4x16 aom_highbd_12_variance4x16_c
+unsigned int aom_highbd_12_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_12_variance4x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance4x16 aom_highbd_12_variance4x16_neon
unsigned int aom_highbd_12_variance4x2_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance4x2 aom_highbd_12_variance4x2_c
unsigned int aom_highbd_12_variance4x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance4x4 aom_highbd_12_variance4x4_c
+unsigned int aom_highbd_12_variance4x4_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance4x4 aom_highbd_12_variance4x4_neon
unsigned int aom_highbd_12_variance4x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance4x8 aom_highbd_12_variance4x8_c
+unsigned int aom_highbd_12_variance4x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance4x8 aom_highbd_12_variance4x8_neon
unsigned int aom_highbd_12_variance64x128_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance64x128 aom_highbd_12_variance64x128_c
+unsigned int aom_highbd_12_variance64x128_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance64x128 aom_highbd_12_variance64x128_neon
-unsigned int aom_highbd_12_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_12_variance64x16 aom_highbd_12_variance64x16_c
+unsigned int aom_highbd_12_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_12_variance64x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance64x16 aom_highbd_12_variance64x16_neon
unsigned int aom_highbd_12_variance64x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance64x32 aom_highbd_12_variance64x32_c
+unsigned int aom_highbd_12_variance64x32_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance64x32 aom_highbd_12_variance64x32_neon
unsigned int aom_highbd_12_variance64x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance64x64 aom_highbd_12_variance64x64_c
+unsigned int aom_highbd_12_variance64x64_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance64x64 aom_highbd_12_variance64x64_neon
unsigned int aom_highbd_12_variance8x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance8x16 aom_highbd_12_variance8x16_c
+unsigned int aom_highbd_12_variance8x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance8x16 aom_highbd_12_variance8x16_neon
-unsigned int aom_highbd_12_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_12_variance8x32 aom_highbd_12_variance8x32_c
+unsigned int aom_highbd_12_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_12_variance8x32_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance8x32 aom_highbd_12_variance8x32_neon
unsigned int aom_highbd_12_variance8x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance8x4 aom_highbd_12_variance8x4_c
+unsigned int aom_highbd_12_variance8x4_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance8x4 aom_highbd_12_variance8x4_neon
unsigned int aom_highbd_12_variance8x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_12_variance8x8 aom_highbd_12_variance8x8_c
+unsigned int aom_highbd_12_variance8x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_12_variance8x8 aom_highbd_12_variance8x8_neon
uint32_t aom_highbd_8_dist_wtd_sub_pixel_avg_variance128x128_c(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param);
#define aom_highbd_8_dist_wtd_sub_pixel_avg_variance128x128 aom_highbd_8_dist_wtd_sub_pixel_avg_variance128x128_c
@@ -1648,12 +1735,6 @@
uint32_t aom_highbd_8_dist_wtd_sub_pixel_avg_variance8x8_c(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param);
#define aom_highbd_8_dist_wtd_sub_pixel_avg_variance8x8 aom_highbd_8_dist_wtd_sub_pixel_avg_variance8x8_c
-void aom_highbd_8_get16x16var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_8_get16x16var aom_highbd_8_get16x16var_c
-
-void aom_highbd_8_get8x8var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_8_get8x8var aom_highbd_8_get8x8var_c
-
unsigned int aom_highbd_8_masked_sub_pixel_variance128x128_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
#define aom_highbd_8_masked_sub_pixel_variance128x128 aom_highbd_8_masked_sub_pixel_variance128x128_c
@@ -1721,16 +1802,20 @@
#define aom_highbd_8_masked_sub_pixel_variance8x8 aom_highbd_8_masked_sub_pixel_variance8x8_c
unsigned int aom_highbd_8_mse16x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_8_mse16x16 aom_highbd_8_mse16x16_c
+unsigned int aom_highbd_8_mse16x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_8_mse16x16 aom_highbd_8_mse16x16_neon
unsigned int aom_highbd_8_mse16x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_8_mse16x8 aom_highbd_8_mse16x8_c
+unsigned int aom_highbd_8_mse16x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_8_mse16x8 aom_highbd_8_mse16x8_neon
unsigned int aom_highbd_8_mse8x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_8_mse8x16 aom_highbd_8_mse8x16_c
+unsigned int aom_highbd_8_mse8x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_8_mse8x16 aom_highbd_8_mse8x16_neon
unsigned int aom_highbd_8_mse8x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
-#define aom_highbd_8_mse8x8 aom_highbd_8_mse8x8_c
+unsigned int aom_highbd_8_mse8x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
+#define aom_highbd_8_mse8x8 aom_highbd_8_mse8x8_neon
uint32_t aom_highbd_8_sub_pixel_avg_variance128x128_c(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred);
#define aom_highbd_8_sub_pixel_avg_variance128x128 aom_highbd_8_sub_pixel_avg_variance128x128_c
@@ -1865,25 +1950,32 @@
#define aom_highbd_8_sub_pixel_variance8x8 aom_highbd_8_sub_pixel_variance8x8_c
unsigned int aom_highbd_8_variance128x128_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance128x128 aom_highbd_8_variance128x128_c
+unsigned int aom_highbd_8_variance128x128_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance128x128 aom_highbd_8_variance128x128_neon
unsigned int aom_highbd_8_variance128x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance128x64 aom_highbd_8_variance128x64_c
+unsigned int aom_highbd_8_variance128x64_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance128x64 aom_highbd_8_variance128x64_neon
unsigned int aom_highbd_8_variance16x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance16x16 aom_highbd_8_variance16x16_c
+unsigned int aom_highbd_8_variance16x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance16x16 aom_highbd_8_variance16x16_neon
unsigned int aom_highbd_8_variance16x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance16x32 aom_highbd_8_variance16x32_c
+unsigned int aom_highbd_8_variance16x32_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance16x32 aom_highbd_8_variance16x32_neon
-unsigned int aom_highbd_8_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_8_variance16x4 aom_highbd_8_variance16x4_c
+unsigned int aom_highbd_8_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_8_variance16x4_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance16x4 aom_highbd_8_variance16x4_neon
-unsigned int aom_highbd_8_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_8_variance16x64 aom_highbd_8_variance16x64_c
+unsigned int aom_highbd_8_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_8_variance16x64_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance16x64 aom_highbd_8_variance16x64_neon
unsigned int aom_highbd_8_variance16x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance16x8 aom_highbd_8_variance16x8_c
+unsigned int aom_highbd_8_variance16x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance16x8 aom_highbd_8_variance16x8_neon
unsigned int aom_highbd_8_variance2x2_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance2x2 aom_highbd_8_variance2x2_c
@@ -1892,59 +1984,75 @@
#define aom_highbd_8_variance2x4 aom_highbd_8_variance2x4_c
unsigned int aom_highbd_8_variance32x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance32x16 aom_highbd_8_variance32x16_c
+unsigned int aom_highbd_8_variance32x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance32x16 aom_highbd_8_variance32x16_neon
unsigned int aom_highbd_8_variance32x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance32x32 aom_highbd_8_variance32x32_c
+unsigned int aom_highbd_8_variance32x32_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance32x32 aom_highbd_8_variance32x32_neon
unsigned int aom_highbd_8_variance32x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance32x64 aom_highbd_8_variance32x64_c
+unsigned int aom_highbd_8_variance32x64_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance32x64 aom_highbd_8_variance32x64_neon
-unsigned int aom_highbd_8_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_8_variance32x8 aom_highbd_8_variance32x8_c
+unsigned int aom_highbd_8_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_8_variance32x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance32x8 aom_highbd_8_variance32x8_neon
-unsigned int aom_highbd_8_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_8_variance4x16 aom_highbd_8_variance4x16_c
+unsigned int aom_highbd_8_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_8_variance4x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance4x16 aom_highbd_8_variance4x16_neon
unsigned int aom_highbd_8_variance4x2_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance4x2 aom_highbd_8_variance4x2_c
unsigned int aom_highbd_8_variance4x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance4x4 aom_highbd_8_variance4x4_c
+unsigned int aom_highbd_8_variance4x4_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance4x4 aom_highbd_8_variance4x4_neon
unsigned int aom_highbd_8_variance4x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance4x8 aom_highbd_8_variance4x8_c
+unsigned int aom_highbd_8_variance4x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance4x8 aom_highbd_8_variance4x8_neon
unsigned int aom_highbd_8_variance64x128_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance64x128 aom_highbd_8_variance64x128_c
+unsigned int aom_highbd_8_variance64x128_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance64x128 aom_highbd_8_variance64x128_neon
-unsigned int aom_highbd_8_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_8_variance64x16 aom_highbd_8_variance64x16_c
+unsigned int aom_highbd_8_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_8_variance64x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance64x16 aom_highbd_8_variance64x16_neon
unsigned int aom_highbd_8_variance64x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance64x32 aom_highbd_8_variance64x32_c
+unsigned int aom_highbd_8_variance64x32_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance64x32 aom_highbd_8_variance64x32_neon
unsigned int aom_highbd_8_variance64x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance64x64 aom_highbd_8_variance64x64_c
+unsigned int aom_highbd_8_variance64x64_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance64x64 aom_highbd_8_variance64x64_neon
unsigned int aom_highbd_8_variance8x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance8x16 aom_highbd_8_variance8x16_c
+unsigned int aom_highbd_8_variance8x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance8x16 aom_highbd_8_variance8x16_neon
-unsigned int aom_highbd_8_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-#define aom_highbd_8_variance8x32 aom_highbd_8_variance8x32_c
+unsigned int aom_highbd_8_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_8_variance8x32_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance8x32 aom_highbd_8_variance8x32_neon
unsigned int aom_highbd_8_variance8x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance8x4 aom_highbd_8_variance8x4_c
+unsigned int aom_highbd_8_variance8x4_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance8x4 aom_highbd_8_variance8x4_neon
unsigned int aom_highbd_8_variance8x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
-#define aom_highbd_8_variance8x8 aom_highbd_8_variance8x8_c
+unsigned int aom_highbd_8_variance8x8_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+#define aom_highbd_8_variance8x8 aom_highbd_8_variance8x8_neon
unsigned int aom_highbd_avg_4x4_c(const uint8_t *, int p);
unsigned int aom_highbd_avg_4x4_neon(const uint8_t *, int p);
#define aom_highbd_avg_4x4 aom_highbd_avg_4x4_neon
unsigned int aom_highbd_avg_8x8_c(const uint8_t *, int p);
-#define aom_highbd_avg_8x8 aom_highbd_avg_8x8_c
+unsigned int aom_highbd_avg_8x8_neon(const uint8_t *, int p);
+#define aom_highbd_avg_8x8 aom_highbd_avg_8x8_neon
void aom_highbd_blend_a64_d16_mask_c(uint8_t *dst, uint32_t dst_stride, const CONV_BUF_TYPE *src0, uint32_t src0_stride, const CONV_BUF_TYPE *src1, uint32_t src1_stride, const uint8_t *mask, uint32_t mask_stride, int w, int h, int subw, int subh, ConvolveParams *conv_params, const int bd);
#define aom_highbd_blend_a64_d16_mask aom_highbd_blend_a64_d16_mask_c
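
The remaining hunks switch the high-bitdepth DC-style predictors from the scalar _c paths to new _neon kernels. These modes vectorize trivially: DC_128, for example, fills the block with the mid-gray value 1 << (bd - 1), ignoring the above/left neighbors. A simplified, illustrative sketch of such a kernel (not libaom's actual implementation, which is specialized per block size):

    #include <arm_neon.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Fill an 8-wide high-bitdepth block of height h with 1 << (bd - 1).
     * Illustrative only; the above/left arguments of the real predictors
     * are unused in the DC_128 mode and omitted here. */
    static void hbd_dc_128_predictor_8xh(uint16_t *dst, ptrdiff_t stride,
                                         int h, int bd) {
      const uint16x8_t mid = vdupq_n_u16((uint16_t)(1 << (bd - 1)));
      for (int r = 0; r < h; ++r) {
        vst1q_u16(dst, mid);
        dst += stride;
      }
    }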
@@ -1974,237 +2082,308 @@
#define aom_highbd_convolve_copy aom_highbd_convolve_copy_c
void aom_highbd_dc_128_predictor_16x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_16x16 aom_highbd_dc_128_predictor_16x16_c
+void aom_highbd_dc_128_predictor_16x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_16x16 aom_highbd_dc_128_predictor_16x16_neon
void aom_highbd_dc_128_predictor_16x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_16x32 aom_highbd_dc_128_predictor_16x32_c
+void aom_highbd_dc_128_predictor_16x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_16x32 aom_highbd_dc_128_predictor_16x32_neon
void aom_highbd_dc_128_predictor_16x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_16x4 aom_highbd_dc_128_predictor_16x4_c
+void aom_highbd_dc_128_predictor_16x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_16x4 aom_highbd_dc_128_predictor_16x4_neon
void aom_highbd_dc_128_predictor_16x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_16x64 aom_highbd_dc_128_predictor_16x64_c
+void aom_highbd_dc_128_predictor_16x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_16x64 aom_highbd_dc_128_predictor_16x64_neon
void aom_highbd_dc_128_predictor_16x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_16x8 aom_highbd_dc_128_predictor_16x8_c
+void aom_highbd_dc_128_predictor_16x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_16x8 aom_highbd_dc_128_predictor_16x8_neon
void aom_highbd_dc_128_predictor_32x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_32x16 aom_highbd_dc_128_predictor_32x16_c
+void aom_highbd_dc_128_predictor_32x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_32x16 aom_highbd_dc_128_predictor_32x16_neon
void aom_highbd_dc_128_predictor_32x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_32x32 aom_highbd_dc_128_predictor_32x32_c
+void aom_highbd_dc_128_predictor_32x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_32x32 aom_highbd_dc_128_predictor_32x32_neon
void aom_highbd_dc_128_predictor_32x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_32x64 aom_highbd_dc_128_predictor_32x64_c
+void aom_highbd_dc_128_predictor_32x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_32x64 aom_highbd_dc_128_predictor_32x64_neon
void aom_highbd_dc_128_predictor_32x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_32x8 aom_highbd_dc_128_predictor_32x8_c
+void aom_highbd_dc_128_predictor_32x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_32x8 aom_highbd_dc_128_predictor_32x8_neon
void aom_highbd_dc_128_predictor_4x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_4x16 aom_highbd_dc_128_predictor_4x16_c
+void aom_highbd_dc_128_predictor_4x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_4x16 aom_highbd_dc_128_predictor_4x16_neon
void aom_highbd_dc_128_predictor_4x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_4x4 aom_highbd_dc_128_predictor_4x4_c
+void aom_highbd_dc_128_predictor_4x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_4x4 aom_highbd_dc_128_predictor_4x4_neon
void aom_highbd_dc_128_predictor_4x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_4x8 aom_highbd_dc_128_predictor_4x8_c
+void aom_highbd_dc_128_predictor_4x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_4x8 aom_highbd_dc_128_predictor_4x8_neon
void aom_highbd_dc_128_predictor_64x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_64x16 aom_highbd_dc_128_predictor_64x16_c
+void aom_highbd_dc_128_predictor_64x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_64x16 aom_highbd_dc_128_predictor_64x16_neon
void aom_highbd_dc_128_predictor_64x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_64x32 aom_highbd_dc_128_predictor_64x32_c
+void aom_highbd_dc_128_predictor_64x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_64x32 aom_highbd_dc_128_predictor_64x32_neon
void aom_highbd_dc_128_predictor_64x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_64x64 aom_highbd_dc_128_predictor_64x64_c
+void aom_highbd_dc_128_predictor_64x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_64x64 aom_highbd_dc_128_predictor_64x64_neon
void aom_highbd_dc_128_predictor_8x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_8x16 aom_highbd_dc_128_predictor_8x16_c
+void aom_highbd_dc_128_predictor_8x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_8x16 aom_highbd_dc_128_predictor_8x16_neon
void aom_highbd_dc_128_predictor_8x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_8x32 aom_highbd_dc_128_predictor_8x32_c
+void aom_highbd_dc_128_predictor_8x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_8x32 aom_highbd_dc_128_predictor_8x32_neon
void aom_highbd_dc_128_predictor_8x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_8x4 aom_highbd_dc_128_predictor_8x4_c
+void aom_highbd_dc_128_predictor_8x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_8x4 aom_highbd_dc_128_predictor_8x4_neon
void aom_highbd_dc_128_predictor_8x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_128_predictor_8x8 aom_highbd_dc_128_predictor_8x8_c
+void aom_highbd_dc_128_predictor_8x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_128_predictor_8x8 aom_highbd_dc_128_predictor_8x8_neon
void aom_highbd_dc_left_predictor_16x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_16x16 aom_highbd_dc_left_predictor_16x16_c
+void aom_highbd_dc_left_predictor_16x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_16x16 aom_highbd_dc_left_predictor_16x16_neon
void aom_highbd_dc_left_predictor_16x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_16x32 aom_highbd_dc_left_predictor_16x32_c
+void aom_highbd_dc_left_predictor_16x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_16x32 aom_highbd_dc_left_predictor_16x32_neon
void aom_highbd_dc_left_predictor_16x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_16x4 aom_highbd_dc_left_predictor_16x4_c
+void aom_highbd_dc_left_predictor_16x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_16x4 aom_highbd_dc_left_predictor_16x4_neon
void aom_highbd_dc_left_predictor_16x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_16x64 aom_highbd_dc_left_predictor_16x64_c
+void aom_highbd_dc_left_predictor_16x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_16x64 aom_highbd_dc_left_predictor_16x64_neon
void aom_highbd_dc_left_predictor_16x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_16x8 aom_highbd_dc_left_predictor_16x8_c
+void aom_highbd_dc_left_predictor_16x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_16x8 aom_highbd_dc_left_predictor_16x8_neon
void aom_highbd_dc_left_predictor_32x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_32x16 aom_highbd_dc_left_predictor_32x16_c
+void aom_highbd_dc_left_predictor_32x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_32x16 aom_highbd_dc_left_predictor_32x16_neon
void aom_highbd_dc_left_predictor_32x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_32x32 aom_highbd_dc_left_predictor_32x32_c
+void aom_highbd_dc_left_predictor_32x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_32x32 aom_highbd_dc_left_predictor_32x32_neon
void aom_highbd_dc_left_predictor_32x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_32x64 aom_highbd_dc_left_predictor_32x64_c
+void aom_highbd_dc_left_predictor_32x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_32x64 aom_highbd_dc_left_predictor_32x64_neon
void aom_highbd_dc_left_predictor_32x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_32x8 aom_highbd_dc_left_predictor_32x8_c
+void aom_highbd_dc_left_predictor_32x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_32x8 aom_highbd_dc_left_predictor_32x8_neon
void aom_highbd_dc_left_predictor_4x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_4x16 aom_highbd_dc_left_predictor_4x16_c
+void aom_highbd_dc_left_predictor_4x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_4x16 aom_highbd_dc_left_predictor_4x16_neon
void aom_highbd_dc_left_predictor_4x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_4x4 aom_highbd_dc_left_predictor_4x4_c
+void aom_highbd_dc_left_predictor_4x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_4x4 aom_highbd_dc_left_predictor_4x4_neon
void aom_highbd_dc_left_predictor_4x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_4x8 aom_highbd_dc_left_predictor_4x8_c
+void aom_highbd_dc_left_predictor_4x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_4x8 aom_highbd_dc_left_predictor_4x8_neon
void aom_highbd_dc_left_predictor_64x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_64x16 aom_highbd_dc_left_predictor_64x16_c
+void aom_highbd_dc_left_predictor_64x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_64x16 aom_highbd_dc_left_predictor_64x16_neon
void aom_highbd_dc_left_predictor_64x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_64x32 aom_highbd_dc_left_predictor_64x32_c
+void aom_highbd_dc_left_predictor_64x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_64x32 aom_highbd_dc_left_predictor_64x32_neon
void aom_highbd_dc_left_predictor_64x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_64x64 aom_highbd_dc_left_predictor_64x64_c
+void aom_highbd_dc_left_predictor_64x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_64x64 aom_highbd_dc_left_predictor_64x64_neon
void aom_highbd_dc_left_predictor_8x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_8x16 aom_highbd_dc_left_predictor_8x16_c
+void aom_highbd_dc_left_predictor_8x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_8x16 aom_highbd_dc_left_predictor_8x16_neon
void aom_highbd_dc_left_predictor_8x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_8x32 aom_highbd_dc_left_predictor_8x32_c
+void aom_highbd_dc_left_predictor_8x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_8x32 aom_highbd_dc_left_predictor_8x32_neon
void aom_highbd_dc_left_predictor_8x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_8x4 aom_highbd_dc_left_predictor_8x4_c
+void aom_highbd_dc_left_predictor_8x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_8x4 aom_highbd_dc_left_predictor_8x4_neon
void aom_highbd_dc_left_predictor_8x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_left_predictor_8x8 aom_highbd_dc_left_predictor_8x8_c
+void aom_highbd_dc_left_predictor_8x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_left_predictor_8x8 aom_highbd_dc_left_predictor_8x8_neon
void aom_highbd_dc_predictor_16x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
void aom_highbd_dc_predictor_16x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
#define aom_highbd_dc_predictor_16x16 aom_highbd_dc_predictor_16x16_neon
void aom_highbd_dc_predictor_16x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_16x32 aom_highbd_dc_predictor_16x32_c
+void aom_highbd_dc_predictor_16x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_16x32 aom_highbd_dc_predictor_16x32_neon
void aom_highbd_dc_predictor_16x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_16x4 aom_highbd_dc_predictor_16x4_c
+void aom_highbd_dc_predictor_16x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_16x4 aom_highbd_dc_predictor_16x4_neon
void aom_highbd_dc_predictor_16x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_16x64 aom_highbd_dc_predictor_16x64_c
+void aom_highbd_dc_predictor_16x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_16x64 aom_highbd_dc_predictor_16x64_neon
void aom_highbd_dc_predictor_16x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_16x8 aom_highbd_dc_predictor_16x8_c
+void aom_highbd_dc_predictor_16x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_16x8 aom_highbd_dc_predictor_16x8_neon
void aom_highbd_dc_predictor_32x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_32x16 aom_highbd_dc_predictor_32x16_c
+void aom_highbd_dc_predictor_32x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_32x16 aom_highbd_dc_predictor_32x16_neon
void aom_highbd_dc_predictor_32x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
void aom_highbd_dc_predictor_32x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
#define aom_highbd_dc_predictor_32x32 aom_highbd_dc_predictor_32x32_neon
void aom_highbd_dc_predictor_32x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_32x64 aom_highbd_dc_predictor_32x64_c
+void aom_highbd_dc_predictor_32x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_32x64 aom_highbd_dc_predictor_32x64_neon
void aom_highbd_dc_predictor_32x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_32x8 aom_highbd_dc_predictor_32x8_c
+void aom_highbd_dc_predictor_32x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_32x8 aom_highbd_dc_predictor_32x8_neon
void aom_highbd_dc_predictor_4x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_4x16 aom_highbd_dc_predictor_4x16_c
+void aom_highbd_dc_predictor_4x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_4x16 aom_highbd_dc_predictor_4x16_neon
void aom_highbd_dc_predictor_4x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
void aom_highbd_dc_predictor_4x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
#define aom_highbd_dc_predictor_4x4 aom_highbd_dc_predictor_4x4_neon
void aom_highbd_dc_predictor_4x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_4x8 aom_highbd_dc_predictor_4x8_c
+void aom_highbd_dc_predictor_4x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_4x8 aom_highbd_dc_predictor_4x8_neon
void aom_highbd_dc_predictor_64x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_64x16 aom_highbd_dc_predictor_64x16_c
+void aom_highbd_dc_predictor_64x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_64x16 aom_highbd_dc_predictor_64x16_neon
void aom_highbd_dc_predictor_64x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_64x32 aom_highbd_dc_predictor_64x32_c
+void aom_highbd_dc_predictor_64x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_64x32 aom_highbd_dc_predictor_64x32_neon
void aom_highbd_dc_predictor_64x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
void aom_highbd_dc_predictor_64x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
#define aom_highbd_dc_predictor_64x64 aom_highbd_dc_predictor_64x64_neon
void aom_highbd_dc_predictor_8x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_8x16 aom_highbd_dc_predictor_8x16_c
+void aom_highbd_dc_predictor_8x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_8x16 aom_highbd_dc_predictor_8x16_neon
void aom_highbd_dc_predictor_8x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_8x32 aom_highbd_dc_predictor_8x32_c
+void aom_highbd_dc_predictor_8x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_8x32 aom_highbd_dc_predictor_8x32_neon
void aom_highbd_dc_predictor_8x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_predictor_8x4 aom_highbd_dc_predictor_8x4_c
+void aom_highbd_dc_predictor_8x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_predictor_8x4 aom_highbd_dc_predictor_8x4_neon
void aom_highbd_dc_predictor_8x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
void aom_highbd_dc_predictor_8x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
#define aom_highbd_dc_predictor_8x8 aom_highbd_dc_predictor_8x8_neon
void aom_highbd_dc_top_predictor_16x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_16x16 aom_highbd_dc_top_predictor_16x16_c
+void aom_highbd_dc_top_predictor_16x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_16x16 aom_highbd_dc_top_predictor_16x16_neon
void aom_highbd_dc_top_predictor_16x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_16x32 aom_highbd_dc_top_predictor_16x32_c
+void aom_highbd_dc_top_predictor_16x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_16x32 aom_highbd_dc_top_predictor_16x32_neon
void aom_highbd_dc_top_predictor_16x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_16x4 aom_highbd_dc_top_predictor_16x4_c
+void aom_highbd_dc_top_predictor_16x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_16x4 aom_highbd_dc_top_predictor_16x4_neon
void aom_highbd_dc_top_predictor_16x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_16x64 aom_highbd_dc_top_predictor_16x64_c
+void aom_highbd_dc_top_predictor_16x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_16x64 aom_highbd_dc_top_predictor_16x64_neon
void aom_highbd_dc_top_predictor_16x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_16x8 aom_highbd_dc_top_predictor_16x8_c
+void aom_highbd_dc_top_predictor_16x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_16x8 aom_highbd_dc_top_predictor_16x8_neon
void aom_highbd_dc_top_predictor_32x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_32x16 aom_highbd_dc_top_predictor_32x16_c
+void aom_highbd_dc_top_predictor_32x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_32x16 aom_highbd_dc_top_predictor_32x16_neon
void aom_highbd_dc_top_predictor_32x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_32x32 aom_highbd_dc_top_predictor_32x32_c
+void aom_highbd_dc_top_predictor_32x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_32x32 aom_highbd_dc_top_predictor_32x32_neon
void aom_highbd_dc_top_predictor_32x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_32x64 aom_highbd_dc_top_predictor_32x64_c
+void aom_highbd_dc_top_predictor_32x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_32x64 aom_highbd_dc_top_predictor_32x64_neon
void aom_highbd_dc_top_predictor_32x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_32x8 aom_highbd_dc_top_predictor_32x8_c
+void aom_highbd_dc_top_predictor_32x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_32x8 aom_highbd_dc_top_predictor_32x8_neon
void aom_highbd_dc_top_predictor_4x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_4x16 aom_highbd_dc_top_predictor_4x16_c
+void aom_highbd_dc_top_predictor_4x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_4x16 aom_highbd_dc_top_predictor_4x16_neon
void aom_highbd_dc_top_predictor_4x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_4x4 aom_highbd_dc_top_predictor_4x4_c
+void aom_highbd_dc_top_predictor_4x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_4x4 aom_highbd_dc_top_predictor_4x4_neon
void aom_highbd_dc_top_predictor_4x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_4x8 aom_highbd_dc_top_predictor_4x8_c
+void aom_highbd_dc_top_predictor_4x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_4x8 aom_highbd_dc_top_predictor_4x8_neon
void aom_highbd_dc_top_predictor_64x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_64x16 aom_highbd_dc_top_predictor_64x16_c
+void aom_highbd_dc_top_predictor_64x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_64x16 aom_highbd_dc_top_predictor_64x16_neon
void aom_highbd_dc_top_predictor_64x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_64x32 aom_highbd_dc_top_predictor_64x32_c
+void aom_highbd_dc_top_predictor_64x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_64x32 aom_highbd_dc_top_predictor_64x32_neon
void aom_highbd_dc_top_predictor_64x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_64x64 aom_highbd_dc_top_predictor_64x64_c
+void aom_highbd_dc_top_predictor_64x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_64x64 aom_highbd_dc_top_predictor_64x64_neon
void aom_highbd_dc_top_predictor_8x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_8x16 aom_highbd_dc_top_predictor_8x16_c
+void aom_highbd_dc_top_predictor_8x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_8x16 aom_highbd_dc_top_predictor_8x16_neon
void aom_highbd_dc_top_predictor_8x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_8x32 aom_highbd_dc_top_predictor_8x32_c
+void aom_highbd_dc_top_predictor_8x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_8x32 aom_highbd_dc_top_predictor_8x32_neon
void aom_highbd_dc_top_predictor_8x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_8x4 aom_highbd_dc_top_predictor_8x4_c
+void aom_highbd_dc_top_predictor_8x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_8x4 aom_highbd_dc_top_predictor_8x4_neon
void aom_highbd_dc_top_predictor_8x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_dc_top_predictor_8x8 aom_highbd_dc_top_predictor_8x8_c
+void aom_highbd_dc_top_predictor_8x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_dc_top_predictor_8x8 aom_highbd_dc_top_predictor_8x8_neon
void aom_highbd_dist_wtd_comp_avg_pred_c(uint8_t *comp_pred8, const uint8_t *pred8, int width, int height, const uint8_t *ref8, int ref_stride, const DIST_WTD_COMP_PARAMS *jcp_param);
#define aom_highbd_dist_wtd_comp_avg_pred aom_highbd_dist_wtd_comp_avg_pred_c
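[Editor's aside, not part of the generated header: the hunks above follow libaom's RTCD convention, where each generic symbol is a #define bound at configure time to the best available implementation, here the NEON variants. A minimal caller sketch under that assumption; the wrapper name, buffers, and include path are illustrative, and the call signature is taken from the declarations above.]

    #include <stddef.h>
    #include <stdint.h>
    #include "config/aom_dsp_rtcd.h" /* assumed include path for this generated header */

    /* Resolves at compile time to aom_highbd_dc_left_predictor_16x4_neon
     * through the macro mapping above; no function-pointer indirection. */
    static void predict_dc_left_16x4(uint16_t *dst, ptrdiff_t stride,
                                     const uint16_t *above,
                                     const uint16_t *left, int bd) {
      aom_highbd_dc_left_predictor_16x4(dst, stride, above, left, bd);
    }
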
@@ -2275,74 +2454,93 @@
unsigned int aom_highbd_dist_wtd_sad8x8_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param);
#define aom_highbd_dist_wtd_sad8x8_avg aom_highbd_dist_wtd_sad8x8_avg_c
-void aom_highbd_fdct8x8_c(const int16_t *input, tran_low_t *output, int stride);
-#define aom_highbd_fdct8x8 aom_highbd_fdct8x8_c
-
void aom_highbd_h_predictor_16x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_16x16 aom_highbd_h_predictor_16x16_c
+void aom_highbd_h_predictor_16x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_16x16 aom_highbd_h_predictor_16x16_neon
void aom_highbd_h_predictor_16x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_16x32 aom_highbd_h_predictor_16x32_c
+void aom_highbd_h_predictor_16x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_16x32 aom_highbd_h_predictor_16x32_neon
void aom_highbd_h_predictor_16x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_16x4 aom_highbd_h_predictor_16x4_c
+void aom_highbd_h_predictor_16x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_16x4 aom_highbd_h_predictor_16x4_neon
void aom_highbd_h_predictor_16x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_16x64 aom_highbd_h_predictor_16x64_c
+void aom_highbd_h_predictor_16x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_16x64 aom_highbd_h_predictor_16x64_neon
void aom_highbd_h_predictor_16x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_16x8 aom_highbd_h_predictor_16x8_c
+void aom_highbd_h_predictor_16x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_16x8 aom_highbd_h_predictor_16x8_neon
void aom_highbd_h_predictor_32x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_32x16 aom_highbd_h_predictor_32x16_c
+void aom_highbd_h_predictor_32x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_32x16 aom_highbd_h_predictor_32x16_neon
void aom_highbd_h_predictor_32x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_32x32 aom_highbd_h_predictor_32x32_c
+void aom_highbd_h_predictor_32x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_32x32 aom_highbd_h_predictor_32x32_neon
void aom_highbd_h_predictor_32x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_32x64 aom_highbd_h_predictor_32x64_c
+void aom_highbd_h_predictor_32x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_32x64 aom_highbd_h_predictor_32x64_neon
void aom_highbd_h_predictor_32x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_32x8 aom_highbd_h_predictor_32x8_c
+void aom_highbd_h_predictor_32x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_32x8 aom_highbd_h_predictor_32x8_neon
void aom_highbd_h_predictor_4x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_4x16 aom_highbd_h_predictor_4x16_c
+void aom_highbd_h_predictor_4x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_4x16 aom_highbd_h_predictor_4x16_neon
void aom_highbd_h_predictor_4x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_4x4 aom_highbd_h_predictor_4x4_c
+void aom_highbd_h_predictor_4x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_4x4 aom_highbd_h_predictor_4x4_neon
void aom_highbd_h_predictor_4x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_4x8 aom_highbd_h_predictor_4x8_c
+void aom_highbd_h_predictor_4x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_4x8 aom_highbd_h_predictor_4x8_neon
void aom_highbd_h_predictor_64x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_64x16 aom_highbd_h_predictor_64x16_c
+void aom_highbd_h_predictor_64x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_64x16 aom_highbd_h_predictor_64x16_neon
void aom_highbd_h_predictor_64x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_64x32 aom_highbd_h_predictor_64x32_c
+void aom_highbd_h_predictor_64x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_64x32 aom_highbd_h_predictor_64x32_neon
void aom_highbd_h_predictor_64x64_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_64x64 aom_highbd_h_predictor_64x64_c
+void aom_highbd_h_predictor_64x64_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_64x64 aom_highbd_h_predictor_64x64_neon
void aom_highbd_h_predictor_8x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_8x16 aom_highbd_h_predictor_8x16_c
+void aom_highbd_h_predictor_8x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_8x16 aom_highbd_h_predictor_8x16_neon
void aom_highbd_h_predictor_8x32_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_8x32 aom_highbd_h_predictor_8x32_c
+void aom_highbd_h_predictor_8x32_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_8x32 aom_highbd_h_predictor_8x32_neon
void aom_highbd_h_predictor_8x4_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_8x4 aom_highbd_h_predictor_8x4_c
+void aom_highbd_h_predictor_8x4_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_8x4 aom_highbd_h_predictor_8x4_neon
void aom_highbd_h_predictor_8x8_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
-#define aom_highbd_h_predictor_8x8 aom_highbd_h_predictor_8x8_c
+void aom_highbd_h_predictor_8x8_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
+#define aom_highbd_h_predictor_8x8 aom_highbd_h_predictor_8x8_neon
void aom_highbd_hadamard_16x16_c(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
-#define aom_highbd_hadamard_16x16 aom_highbd_hadamard_16x16_c
+void aom_highbd_hadamard_16x16_neon(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
+#define aom_highbd_hadamard_16x16 aom_highbd_hadamard_16x16_neon
void aom_highbd_hadamard_32x32_c(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
-#define aom_highbd_hadamard_32x32 aom_highbd_hadamard_32x32_c
+void aom_highbd_hadamard_32x32_neon(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
+#define aom_highbd_hadamard_32x32 aom_highbd_hadamard_32x32_neon
void aom_highbd_hadamard_8x8_c(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
-#define aom_highbd_hadamard_8x8 aom_highbd_hadamard_8x8_c
+void aom_highbd_hadamard_8x8_neon(const int16_t *src_diff, ptrdiff_t src_stride, tran_low_t *coeff);
+#define aom_highbd_hadamard_8x8 aom_highbd_hadamard_8x8_neon
void aom_highbd_lpf_horizontal_14_c(uint16_t *s, int pitch, const uint8_t *blimit, const uint8_t *limit, const uint8_t *thresh, int bd);
void aom_highbd_lpf_horizontal_14_neon(uint16_t *s, int pitch, const uint8_t *blimit, const uint8_t *limit, const uint8_t *thresh, int bd);
@@ -2475,7 +2673,8 @@
#define aom_highbd_masked_sad8x8 aom_highbd_masked_sad8x8_c
void aom_highbd_minmax_8x8_c(const uint8_t *s, int p, const uint8_t *d, int dp, int *min, int *max);
-#define aom_highbd_minmax_8x8 aom_highbd_minmax_8x8_c
+void aom_highbd_minmax_8x8_neon(const uint8_t *s, int p, const uint8_t *d, int dp, int *min, int *max);
+#define aom_highbd_minmax_8x8 aom_highbd_minmax_8x8_neon
unsigned int aom_highbd_obmc_sad128x128_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
#define aom_highbd_obmc_sad128x128 aom_highbd_obmc_sad128x128_c
@@ -2776,7 +2975,8 @@
#define aom_highbd_quantize_b_adaptive aom_highbd_quantize_b_adaptive_neon
unsigned int aom_highbd_sad128x128_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad128x128 aom_highbd_sad128x128_c
+unsigned int aom_highbd_sad128x128_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad128x128 aom_highbd_sad128x128_neon
unsigned int aom_highbd_sad128x128_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad128x128_avg aom_highbd_sad128x128_avg_c
@@ -2785,10 +2985,12 @@
#define aom_highbd_sad128x128x3d aom_highbd_sad128x128x3d_c
void aom_highbd_sad128x128x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad128x128x4d aom_highbd_sad128x128x4d_c
+void aom_highbd_sad128x128x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad128x128x4d aom_highbd_sad128x128x4d_neon
unsigned int aom_highbd_sad128x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad128x64 aom_highbd_sad128x64_c
+unsigned int aom_highbd_sad128x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad128x64 aom_highbd_sad128x64_neon
unsigned int aom_highbd_sad128x64_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad128x64_avg aom_highbd_sad128x64_avg_c
@@ -2797,10 +2999,12 @@
#define aom_highbd_sad128x64x3d aom_highbd_sad128x64x3d_c
void aom_highbd_sad128x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad128x64x4d aom_highbd_sad128x64x4d_c
+void aom_highbd_sad128x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad128x64x4d aom_highbd_sad128x64x4d_neon
unsigned int aom_highbd_sad16x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad16x16 aom_highbd_sad16x16_c
+unsigned int aom_highbd_sad16x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad16x16 aom_highbd_sad16x16_neon
unsigned int aom_highbd_sad16x16_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad16x16_avg aom_highbd_sad16x16_avg_c
@@ -2809,10 +3013,12 @@
#define aom_highbd_sad16x16x3d aom_highbd_sad16x16x3d_c
void aom_highbd_sad16x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad16x16x4d aom_highbd_sad16x16x4d_c
+void aom_highbd_sad16x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad16x16x4d aom_highbd_sad16x16x4d_neon
unsigned int aom_highbd_sad16x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad16x32 aom_highbd_sad16x32_c
+unsigned int aom_highbd_sad16x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad16x32 aom_highbd_sad16x32_neon
unsigned int aom_highbd_sad16x32_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad16x32_avg aom_highbd_sad16x32_avg_c
@@ -2821,10 +3027,12 @@
#define aom_highbd_sad16x32x3d aom_highbd_sad16x32x3d_c
void aom_highbd_sad16x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad16x32x4d aom_highbd_sad16x32x4d_c
+void aom_highbd_sad16x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad16x32x4d aom_highbd_sad16x32x4d_neon
unsigned int aom_highbd_sad16x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad16x4 aom_highbd_sad16x4_c
+unsigned int aom_highbd_sad16x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad16x4 aom_highbd_sad16x4_neon
unsigned int aom_highbd_sad16x4_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad16x4_avg aom_highbd_sad16x4_avg_c
@@ -2833,10 +3041,12 @@
#define aom_highbd_sad16x4x3d aom_highbd_sad16x4x3d_c
void aom_highbd_sad16x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad16x4x4d aom_highbd_sad16x4x4d_c
+void aom_highbd_sad16x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad16x4x4d aom_highbd_sad16x4x4d_neon
unsigned int aom_highbd_sad16x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad16x64 aom_highbd_sad16x64_c
+unsigned int aom_highbd_sad16x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad16x64 aom_highbd_sad16x64_neon
unsigned int aom_highbd_sad16x64_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad16x64_avg aom_highbd_sad16x64_avg_c
@@ -2845,10 +3055,12 @@
#define aom_highbd_sad16x64x3d aom_highbd_sad16x64x3d_c
void aom_highbd_sad16x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad16x64x4d aom_highbd_sad16x64x4d_c
+void aom_highbd_sad16x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad16x64x4d aom_highbd_sad16x64x4d_neon
unsigned int aom_highbd_sad16x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad16x8 aom_highbd_sad16x8_c
+unsigned int aom_highbd_sad16x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad16x8 aom_highbd_sad16x8_neon
unsigned int aom_highbd_sad16x8_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad16x8_avg aom_highbd_sad16x8_avg_c
@@ -2857,10 +3069,12 @@
#define aom_highbd_sad16x8x3d aom_highbd_sad16x8x3d_c
void aom_highbd_sad16x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad16x8x4d aom_highbd_sad16x8x4d_c
+void aom_highbd_sad16x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad16x8x4d aom_highbd_sad16x8x4d_neon
unsigned int aom_highbd_sad32x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad32x16 aom_highbd_sad32x16_c
+unsigned int aom_highbd_sad32x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad32x16 aom_highbd_sad32x16_neon
unsigned int aom_highbd_sad32x16_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad32x16_avg aom_highbd_sad32x16_avg_c
@@ -2869,10 +3083,12 @@
#define aom_highbd_sad32x16x3d aom_highbd_sad32x16x3d_c
void aom_highbd_sad32x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad32x16x4d aom_highbd_sad32x16x4d_c
+void aom_highbd_sad32x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad32x16x4d aom_highbd_sad32x16x4d_neon
unsigned int aom_highbd_sad32x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad32x32 aom_highbd_sad32x32_c
+unsigned int aom_highbd_sad32x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad32x32 aom_highbd_sad32x32_neon
unsigned int aom_highbd_sad32x32_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad32x32_avg aom_highbd_sad32x32_avg_c
@@ -2881,10 +3097,12 @@
#define aom_highbd_sad32x32x3d aom_highbd_sad32x32x3d_c
void aom_highbd_sad32x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad32x32x4d aom_highbd_sad32x32x4d_c
+void aom_highbd_sad32x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad32x32x4d aom_highbd_sad32x32x4d_neon
unsigned int aom_highbd_sad32x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad32x64 aom_highbd_sad32x64_c
+unsigned int aom_highbd_sad32x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad32x64 aom_highbd_sad32x64_neon
unsigned int aom_highbd_sad32x64_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad32x64_avg aom_highbd_sad32x64_avg_c
@@ -2893,10 +3111,12 @@
#define aom_highbd_sad32x64x3d aom_highbd_sad32x64x3d_c
void aom_highbd_sad32x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad32x64x4d aom_highbd_sad32x64x4d_c
+void aom_highbd_sad32x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad32x64x4d aom_highbd_sad32x64x4d_neon
unsigned int aom_highbd_sad32x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad32x8 aom_highbd_sad32x8_c
+unsigned int aom_highbd_sad32x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad32x8 aom_highbd_sad32x8_neon
unsigned int aom_highbd_sad32x8_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad32x8_avg aom_highbd_sad32x8_avg_c
@@ -2905,10 +3125,12 @@
#define aom_highbd_sad32x8x3d aom_highbd_sad32x8x3d_c
void aom_highbd_sad32x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad32x8x4d aom_highbd_sad32x8x4d_c
+void aom_highbd_sad32x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad32x8x4d aom_highbd_sad32x8x4d_neon
unsigned int aom_highbd_sad4x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad4x16 aom_highbd_sad4x16_c
+unsigned int aom_highbd_sad4x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad4x16 aom_highbd_sad4x16_neon
unsigned int aom_highbd_sad4x16_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad4x16_avg aom_highbd_sad4x16_avg_c
@@ -2917,10 +3139,12 @@
#define aom_highbd_sad4x16x3d aom_highbd_sad4x16x3d_c
void aom_highbd_sad4x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad4x16x4d aom_highbd_sad4x16x4d_c
+void aom_highbd_sad4x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad4x16x4d aom_highbd_sad4x16x4d_neon
unsigned int aom_highbd_sad4x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad4x4 aom_highbd_sad4x4_c
+unsigned int aom_highbd_sad4x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad4x4 aom_highbd_sad4x4_neon
unsigned int aom_highbd_sad4x4_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad4x4_avg aom_highbd_sad4x4_avg_c
@@ -2929,10 +3153,12 @@
#define aom_highbd_sad4x4x3d aom_highbd_sad4x4x3d_c
void aom_highbd_sad4x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad4x4x4d aom_highbd_sad4x4x4d_c
+void aom_highbd_sad4x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad4x4x4d aom_highbd_sad4x4x4d_neon
unsigned int aom_highbd_sad4x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad4x8 aom_highbd_sad4x8_c
+unsigned int aom_highbd_sad4x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad4x8 aom_highbd_sad4x8_neon
unsigned int aom_highbd_sad4x8_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad4x8_avg aom_highbd_sad4x8_avg_c
@@ -2941,10 +3167,12 @@
#define aom_highbd_sad4x8x3d aom_highbd_sad4x8x3d_c
void aom_highbd_sad4x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad4x8x4d aom_highbd_sad4x8x4d_c
+void aom_highbd_sad4x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad4x8x4d aom_highbd_sad4x8x4d_neon
unsigned int aom_highbd_sad64x128_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad64x128 aom_highbd_sad64x128_c
+unsigned int aom_highbd_sad64x128_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad64x128 aom_highbd_sad64x128_neon
unsigned int aom_highbd_sad64x128_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad64x128_avg aom_highbd_sad64x128_avg_c
@@ -2953,10 +3181,12 @@
#define aom_highbd_sad64x128x3d aom_highbd_sad64x128x3d_c
void aom_highbd_sad64x128x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad64x128x4d aom_highbd_sad64x128x4d_c
+void aom_highbd_sad64x128x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad64x128x4d aom_highbd_sad64x128x4d_neon
unsigned int aom_highbd_sad64x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad64x16 aom_highbd_sad64x16_c
+unsigned int aom_highbd_sad64x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad64x16 aom_highbd_sad64x16_neon
unsigned int aom_highbd_sad64x16_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad64x16_avg aom_highbd_sad64x16_avg_c
@@ -2965,10 +3195,12 @@
#define aom_highbd_sad64x16x3d aom_highbd_sad64x16x3d_c
void aom_highbd_sad64x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad64x16x4d aom_highbd_sad64x16x4d_c
+void aom_highbd_sad64x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad64x16x4d aom_highbd_sad64x16x4d_neon
unsigned int aom_highbd_sad64x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad64x32 aom_highbd_sad64x32_c
+unsigned int aom_highbd_sad64x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad64x32 aom_highbd_sad64x32_neon
unsigned int aom_highbd_sad64x32_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad64x32_avg aom_highbd_sad64x32_avg_c
@@ -2977,10 +3209,12 @@
#define aom_highbd_sad64x32x3d aom_highbd_sad64x32x3d_c
void aom_highbd_sad64x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad64x32x4d aom_highbd_sad64x32x4d_c
+void aom_highbd_sad64x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad64x32x4d aom_highbd_sad64x32x4d_neon
unsigned int aom_highbd_sad64x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad64x64 aom_highbd_sad64x64_c
+unsigned int aom_highbd_sad64x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad64x64 aom_highbd_sad64x64_neon
unsigned int aom_highbd_sad64x64_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad64x64_avg aom_highbd_sad64x64_avg_c
@@ -2989,10 +3223,12 @@
#define aom_highbd_sad64x64x3d aom_highbd_sad64x64x3d_c
void aom_highbd_sad64x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad64x64x4d aom_highbd_sad64x64x4d_c
+void aom_highbd_sad64x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad64x64x4d aom_highbd_sad64x64x4d_neon
unsigned int aom_highbd_sad8x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad8x16 aom_highbd_sad8x16_c
+unsigned int aom_highbd_sad8x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad8x16 aom_highbd_sad8x16_neon
unsigned int aom_highbd_sad8x16_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad8x16_avg aom_highbd_sad8x16_avg_c
@@ -3001,10 +3237,12 @@
#define aom_highbd_sad8x16x3d aom_highbd_sad8x16x3d_c
void aom_highbd_sad8x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad8x16x4d aom_highbd_sad8x16x4d_c
+void aom_highbd_sad8x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad8x16x4d aom_highbd_sad8x16x4d_neon
unsigned int aom_highbd_sad8x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad8x32 aom_highbd_sad8x32_c
+unsigned int aom_highbd_sad8x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad8x32 aom_highbd_sad8x32_neon
unsigned int aom_highbd_sad8x32_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad8x32_avg aom_highbd_sad8x32_avg_c
@@ -3013,10 +3251,12 @@
#define aom_highbd_sad8x32x3d aom_highbd_sad8x32x3d_c
void aom_highbd_sad8x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad8x32x4d aom_highbd_sad8x32x4d_c
+void aom_highbd_sad8x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad8x32x4d aom_highbd_sad8x32x4d_neon
unsigned int aom_highbd_sad8x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad8x4 aom_highbd_sad8x4_c
+unsigned int aom_highbd_sad8x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad8x4 aom_highbd_sad8x4_neon
unsigned int aom_highbd_sad8x4_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad8x4_avg aom_highbd_sad8x4_avg_c
@@ -3025,10 +3265,12 @@
#define aom_highbd_sad8x4x3d aom_highbd_sad8x4x3d_c
void aom_highbd_sad8x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad8x4x4d aom_highbd_sad8x4x4d_c
+void aom_highbd_sad8x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad8x4x4d aom_highbd_sad8x4x4d_neon
unsigned int aom_highbd_sad8x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad8x8 aom_highbd_sad8x8_c
+unsigned int aom_highbd_sad8x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad8x8 aom_highbd_sad8x8_neon
unsigned int aom_highbd_sad8x8_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred);
#define aom_highbd_sad8x8_avg aom_highbd_sad8x8_avg_c
@@ -3037,139 +3279,184 @@
#define aom_highbd_sad8x8x3d aom_highbd_sad8x8x3d_c
void aom_highbd_sad8x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad8x8x4d aom_highbd_sad8x8x4d_c
+void aom_highbd_sad8x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad8x8x4d aom_highbd_sad8x8x4d_neon
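[Editor's aside, illustrative only: the x4d SAD variants score one source block against four reference candidates per call, filling a four-entry SAD array, which is the batching that makes the NEON mappings above pay off in motion search. A hedged usage sketch; the wrapper name and buffers are hypothetical, and per libaom's high-bitdepth convention the uint8_t pointers wrap 16-bit sample buffers.]

    #include <stdint.h>
    #include "config/aom_dsp_rtcd.h" /* assumed include path for this generated header */

    /* Dispatches to aom_highbd_sad8x8x4d_neon via the macro above and
     * writes one SAD per reference candidate into sads[0..3]. */
    static void sad8x8_four_candidates(const uint8_t *src, int src_stride,
                                       const uint8_t *const refs[4],
                                       int ref_stride, uint32_t sads[4]) {
      aom_highbd_sad8x8x4d(src, src_stride, refs, ref_stride, sads);
    }
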
unsigned int aom_highbd_sad_skip_128x128_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_128x128 aom_highbd_sad_skip_128x128_c
+unsigned int aom_highbd_sad_skip_128x128_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_128x128 aom_highbd_sad_skip_128x128_neon
void aom_highbd_sad_skip_128x128x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_128x128x4d aom_highbd_sad_skip_128x128x4d_c
+void aom_highbd_sad_skip_128x128x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_128x128x4d aom_highbd_sad_skip_128x128x4d_neon
unsigned int aom_highbd_sad_skip_128x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_128x64 aom_highbd_sad_skip_128x64_c
+unsigned int aom_highbd_sad_skip_128x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_128x64 aom_highbd_sad_skip_128x64_neon
void aom_highbd_sad_skip_128x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_128x64x4d aom_highbd_sad_skip_128x64x4d_c
+void aom_highbd_sad_skip_128x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_128x64x4d aom_highbd_sad_skip_128x64x4d_neon
unsigned int aom_highbd_sad_skip_16x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_16x16 aom_highbd_sad_skip_16x16_c
+unsigned int aom_highbd_sad_skip_16x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_16x16 aom_highbd_sad_skip_16x16_neon
void aom_highbd_sad_skip_16x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_16x16x4d aom_highbd_sad_skip_16x16x4d_c
+void aom_highbd_sad_skip_16x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_16x16x4d aom_highbd_sad_skip_16x16x4d_neon
unsigned int aom_highbd_sad_skip_16x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_16x32 aom_highbd_sad_skip_16x32_c
+unsigned int aom_highbd_sad_skip_16x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_16x32 aom_highbd_sad_skip_16x32_neon
void aom_highbd_sad_skip_16x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_16x32x4d aom_highbd_sad_skip_16x32x4d_c
+void aom_highbd_sad_skip_16x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_16x32x4d aom_highbd_sad_skip_16x32x4d_neon
unsigned int aom_highbd_sad_skip_16x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_16x4 aom_highbd_sad_skip_16x4_c
+unsigned int aom_highbd_sad_skip_16x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_16x4 aom_highbd_sad_skip_16x4_neon
void aom_highbd_sad_skip_16x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_16x4x4d aom_highbd_sad_skip_16x4x4d_c
+void aom_highbd_sad_skip_16x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_16x4x4d aom_highbd_sad_skip_16x4x4d_neon
unsigned int aom_highbd_sad_skip_16x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_16x64 aom_highbd_sad_skip_16x64_c
+unsigned int aom_highbd_sad_skip_16x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_16x64 aom_highbd_sad_skip_16x64_neon
void aom_highbd_sad_skip_16x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_16x64x4d aom_highbd_sad_skip_16x64x4d_c
+void aom_highbd_sad_skip_16x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_16x64x4d aom_highbd_sad_skip_16x64x4d_neon
unsigned int aom_highbd_sad_skip_16x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_16x8 aom_highbd_sad_skip_16x8_c
+unsigned int aom_highbd_sad_skip_16x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_16x8 aom_highbd_sad_skip_16x8_neon
void aom_highbd_sad_skip_16x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_16x8x4d aom_highbd_sad_skip_16x8x4d_c
+void aom_highbd_sad_skip_16x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_16x8x4d aom_highbd_sad_skip_16x8x4d_neon
unsigned int aom_highbd_sad_skip_32x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_32x16 aom_highbd_sad_skip_32x16_c
+unsigned int aom_highbd_sad_skip_32x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_32x16 aom_highbd_sad_skip_32x16_neon
void aom_highbd_sad_skip_32x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_32x16x4d aom_highbd_sad_skip_32x16x4d_c
+void aom_highbd_sad_skip_32x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_32x16x4d aom_highbd_sad_skip_32x16x4d_neon
unsigned int aom_highbd_sad_skip_32x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_32x32 aom_highbd_sad_skip_32x32_c
+unsigned int aom_highbd_sad_skip_32x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_32x32 aom_highbd_sad_skip_32x32_neon
void aom_highbd_sad_skip_32x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_32x32x4d aom_highbd_sad_skip_32x32x4d_c
+void aom_highbd_sad_skip_32x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_32x32x4d aom_highbd_sad_skip_32x32x4d_neon
unsigned int aom_highbd_sad_skip_32x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_32x64 aom_highbd_sad_skip_32x64_c
+unsigned int aom_highbd_sad_skip_32x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_32x64 aom_highbd_sad_skip_32x64_neon
void aom_highbd_sad_skip_32x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_32x64x4d aom_highbd_sad_skip_32x64x4d_c
+void aom_highbd_sad_skip_32x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_32x64x4d aom_highbd_sad_skip_32x64x4d_neon
unsigned int aom_highbd_sad_skip_32x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_32x8 aom_highbd_sad_skip_32x8_c
+unsigned int aom_highbd_sad_skip_32x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_32x8 aom_highbd_sad_skip_32x8_neon
void aom_highbd_sad_skip_32x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_32x8x4d aom_highbd_sad_skip_32x8x4d_c
+void aom_highbd_sad_skip_32x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_32x8x4d aom_highbd_sad_skip_32x8x4d_neon
unsigned int aom_highbd_sad_skip_4x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_4x16 aom_highbd_sad_skip_4x16_c
+unsigned int aom_highbd_sad_skip_4x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_4x16 aom_highbd_sad_skip_4x16_neon
void aom_highbd_sad_skip_4x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_4x16x4d aom_highbd_sad_skip_4x16x4d_c
+void aom_highbd_sad_skip_4x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_4x16x4d aom_highbd_sad_skip_4x16x4d_neon
unsigned int aom_highbd_sad_skip_4x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_4x4 aom_highbd_sad_skip_4x4_c
+unsigned int aom_highbd_sad_skip_4x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_4x4 aom_highbd_sad_skip_4x4_neon
void aom_highbd_sad_skip_4x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_4x4x4d aom_highbd_sad_skip_4x4x4d_c
+void aom_highbd_sad_skip_4x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_4x4x4d aom_highbd_sad_skip_4x4x4d_neon
unsigned int aom_highbd_sad_skip_4x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_4x8 aom_highbd_sad_skip_4x8_c
+unsigned int aom_highbd_sad_skip_4x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_4x8 aom_highbd_sad_skip_4x8_neon
void aom_highbd_sad_skip_4x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_4x8x4d aom_highbd_sad_skip_4x8x4d_c
+void aom_highbd_sad_skip_4x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_4x8x4d aom_highbd_sad_skip_4x8x4d_neon
unsigned int aom_highbd_sad_skip_64x128_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_64x128 aom_highbd_sad_skip_64x128_c
+unsigned int aom_highbd_sad_skip_64x128_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_64x128 aom_highbd_sad_skip_64x128_neon
void aom_highbd_sad_skip_64x128x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_64x128x4d aom_highbd_sad_skip_64x128x4d_c
+void aom_highbd_sad_skip_64x128x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_64x128x4d aom_highbd_sad_skip_64x128x4d_neon
unsigned int aom_highbd_sad_skip_64x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_64x16 aom_highbd_sad_skip_64x16_c
+unsigned int aom_highbd_sad_skip_64x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_64x16 aom_highbd_sad_skip_64x16_neon
void aom_highbd_sad_skip_64x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_64x16x4d aom_highbd_sad_skip_64x16x4d_c
+void aom_highbd_sad_skip_64x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_64x16x4d aom_highbd_sad_skip_64x16x4d_neon
unsigned int aom_highbd_sad_skip_64x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_64x32 aom_highbd_sad_skip_64x32_c
+unsigned int aom_highbd_sad_skip_64x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_64x32 aom_highbd_sad_skip_64x32_neon
void aom_highbd_sad_skip_64x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_64x32x4d aom_highbd_sad_skip_64x32x4d_c
+void aom_highbd_sad_skip_64x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_64x32x4d aom_highbd_sad_skip_64x32x4d_neon
unsigned int aom_highbd_sad_skip_64x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_64x64 aom_highbd_sad_skip_64x64_c
+unsigned int aom_highbd_sad_skip_64x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_64x64 aom_highbd_sad_skip_64x64_neon
void aom_highbd_sad_skip_64x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_64x64x4d aom_highbd_sad_skip_64x64x4d_c
+void aom_highbd_sad_skip_64x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_64x64x4d aom_highbd_sad_skip_64x64x4d_neon
unsigned int aom_highbd_sad_skip_8x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_8x16 aom_highbd_sad_skip_8x16_c
+unsigned int aom_highbd_sad_skip_8x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_8x16 aom_highbd_sad_skip_8x16_neon
void aom_highbd_sad_skip_8x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_8x16x4d aom_highbd_sad_skip_8x16x4d_c
+void aom_highbd_sad_skip_8x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_8x16x4d aom_highbd_sad_skip_8x16x4d_neon
unsigned int aom_highbd_sad_skip_8x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_8x32 aom_highbd_sad_skip_8x32_c
+unsigned int aom_highbd_sad_skip_8x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_8x32 aom_highbd_sad_skip_8x32_neon
void aom_highbd_sad_skip_8x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_8x32x4d aom_highbd_sad_skip_8x32x4d_c
+void aom_highbd_sad_skip_8x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_8x32x4d aom_highbd_sad_skip_8x32x4d_neon
unsigned int aom_highbd_sad_skip_8x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_8x4 aom_highbd_sad_skip_8x4_c
+unsigned int aom_highbd_sad_skip_8x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_8x4 aom_highbd_sad_skip_8x4_neon
void aom_highbd_sad_skip_8x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_8x4x4d aom_highbd_sad_skip_8x4x4d_c
+void aom_highbd_sad_skip_8x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_8x4x4d aom_highbd_sad_skip_8x4x4d_neon
unsigned int aom_highbd_sad_skip_8x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_highbd_sad_skip_8x8 aom_highbd_sad_skip_8x8_c
+unsigned int aom_highbd_sad_skip_8x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_highbd_sad_skip_8x8 aom_highbd_sad_skip_8x8_neon
void aom_highbd_sad_skip_8x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
-#define aom_highbd_sad_skip_8x8x4d aom_highbd_sad_skip_8x8x4d_c
+void aom_highbd_sad_skip_8x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[], int ref_stride, uint32_t *sad_array);
+#define aom_highbd_sad_skip_8x8x4d aom_highbd_sad_skip_8x8x4d_neon
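The _skip variants above trade accuracy for speed: they sample only every other row of the block and double the sum, and the x4d forms score four candidate references in one call. A minimal sketch of that reference behavior, assuming libaom's usual conventions (the names below are illustrative, not library API; the high-bitdepth kernels read 16-bit samples behind the uint8_t * casts):

    #include <stdint.h>
    #include <stdlib.h>

    /* Skip-row SAD: accumulate even rows only, then scale back up. */
    static unsigned int sad_skip_wxh(const uint16_t *src, int src_stride,
                                     const uint16_t *ref, int ref_stride,
                                     int w, int h) {
      unsigned int sad = 0;
      for (int y = 0; y < h; y += 2)
        for (int x = 0; x < w; ++x)
          sad += abs((int)src[y * src_stride + x] -
                     (int)ref[y * ref_stride + x]);
      return 2 * sad;  /* compensate for the skipped rows */
    }

    /* x4d form: one call scores four candidate reference blocks. */
    static void sad_skip_wxh_x4d(const uint16_t *src, int src_stride,
                                 const uint16_t *const ref[4], int ref_stride,
                                 int w, int h, uint32_t sad_array[4]) {
      for (int i = 0; i < 4; ++i)
        sad_array[i] = sad_skip_wxh(src, src_stride, ref[i], ref_stride, w, h);
    }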
void aom_highbd_smooth_h_predictor_16x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
void aom_highbd_smooth_h_predictor_16x16_neon(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
@@ -3610,205 +3897,272 @@
#define aom_lpf_vertical_8_quad aom_lpf_vertical_8_quad_neon
unsigned int aom_masked_sad128x128_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad128x128 aom_masked_sad128x128_c
+unsigned int aom_masked_sad128x128_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad128x128 aom_masked_sad128x128_neon
void aom_masked_sad128x128x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad128x128x4d aom_masked_sad128x128x4d_c
+void aom_masked_sad128x128x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad128x128x4d aom_masked_sad128x128x4d_neon
unsigned int aom_masked_sad128x64_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad128x64 aom_masked_sad128x64_c
+unsigned int aom_masked_sad128x64_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad128x64 aom_masked_sad128x64_neon
void aom_masked_sad128x64x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad128x64x4d aom_masked_sad128x64x4d_c
+void aom_masked_sad128x64x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad128x64x4d aom_masked_sad128x64x4d_neon
unsigned int aom_masked_sad16x16_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad16x16 aom_masked_sad16x16_c
+unsigned int aom_masked_sad16x16_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad16x16 aom_masked_sad16x16_neon
void aom_masked_sad16x16x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad16x16x4d aom_masked_sad16x16x4d_c
+void aom_masked_sad16x16x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad16x16x4d aom_masked_sad16x16x4d_neon
unsigned int aom_masked_sad16x32_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad16x32 aom_masked_sad16x32_c
+unsigned int aom_masked_sad16x32_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad16x32 aom_masked_sad16x32_neon
void aom_masked_sad16x32x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad16x32x4d aom_masked_sad16x32x4d_c
+void aom_masked_sad16x32x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad16x32x4d aom_masked_sad16x32x4d_neon
unsigned int aom_masked_sad16x4_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad16x4 aom_masked_sad16x4_c
+unsigned int aom_masked_sad16x4_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad16x4 aom_masked_sad16x4_neon
void aom_masked_sad16x4x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad16x4x4d aom_masked_sad16x4x4d_c
+void aom_masked_sad16x4x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad16x4x4d aom_masked_sad16x4x4d_neon
unsigned int aom_masked_sad16x64_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad16x64 aom_masked_sad16x64_c
+unsigned int aom_masked_sad16x64_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad16x64 aom_masked_sad16x64_neon
void aom_masked_sad16x64x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad16x64x4d aom_masked_sad16x64x4d_c
+void aom_masked_sad16x64x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad16x64x4d aom_masked_sad16x64x4d_neon
unsigned int aom_masked_sad16x8_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad16x8 aom_masked_sad16x8_c
+unsigned int aom_masked_sad16x8_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad16x8 aom_masked_sad16x8_neon
void aom_masked_sad16x8x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad16x8x4d aom_masked_sad16x8x4d_c
+void aom_masked_sad16x8x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad16x8x4d aom_masked_sad16x8x4d_neon
unsigned int aom_masked_sad32x16_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad32x16 aom_masked_sad32x16_c
+unsigned int aom_masked_sad32x16_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad32x16 aom_masked_sad32x16_neon
void aom_masked_sad32x16x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad32x16x4d aom_masked_sad32x16x4d_c
+void aom_masked_sad32x16x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad32x16x4d aom_masked_sad32x16x4d_neon
unsigned int aom_masked_sad32x32_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad32x32 aom_masked_sad32x32_c
+unsigned int aom_masked_sad32x32_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad32x32 aom_masked_sad32x32_neon
void aom_masked_sad32x32x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad32x32x4d aom_masked_sad32x32x4d_c
+void aom_masked_sad32x32x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad32x32x4d aom_masked_sad32x32x4d_neon
unsigned int aom_masked_sad32x64_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad32x64 aom_masked_sad32x64_c
+unsigned int aom_masked_sad32x64_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad32x64 aom_masked_sad32x64_neon
void aom_masked_sad32x64x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad32x64x4d aom_masked_sad32x64x4d_c
+void aom_masked_sad32x64x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad32x64x4d aom_masked_sad32x64x4d_neon
unsigned int aom_masked_sad32x8_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad32x8 aom_masked_sad32x8_c
+unsigned int aom_masked_sad32x8_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad32x8 aom_masked_sad32x8_neon
void aom_masked_sad32x8x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad32x8x4d aom_masked_sad32x8x4d_c
+void aom_masked_sad32x8x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad32x8x4d aom_masked_sad32x8x4d_neon
unsigned int aom_masked_sad4x16_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad4x16 aom_masked_sad4x16_c
+unsigned int aom_masked_sad4x16_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad4x16 aom_masked_sad4x16_neon
void aom_masked_sad4x16x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad4x16x4d aom_masked_sad4x16x4d_c
+void aom_masked_sad4x16x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad4x16x4d aom_masked_sad4x16x4d_neon
unsigned int aom_masked_sad4x4_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad4x4 aom_masked_sad4x4_c
+unsigned int aom_masked_sad4x4_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad4x4 aom_masked_sad4x4_neon
void aom_masked_sad4x4x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad4x4x4d aom_masked_sad4x4x4d_c
+void aom_masked_sad4x4x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad4x4x4d aom_masked_sad4x4x4d_neon
unsigned int aom_masked_sad4x8_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad4x8 aom_masked_sad4x8_c
+unsigned int aom_masked_sad4x8_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad4x8 aom_masked_sad4x8_neon
void aom_masked_sad4x8x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad4x8x4d aom_masked_sad4x8x4d_c
+void aom_masked_sad4x8x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad4x8x4d aom_masked_sad4x8x4d_neon
unsigned int aom_masked_sad64x128_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad64x128 aom_masked_sad64x128_c
+unsigned int aom_masked_sad64x128_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad64x128 aom_masked_sad64x128_neon
void aom_masked_sad64x128x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad64x128x4d aom_masked_sad64x128x4d_c
+void aom_masked_sad64x128x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad64x128x4d aom_masked_sad64x128x4d_neon
unsigned int aom_masked_sad64x16_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad64x16 aom_masked_sad64x16_c
+unsigned int aom_masked_sad64x16_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad64x16 aom_masked_sad64x16_neon
void aom_masked_sad64x16x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad64x16x4d aom_masked_sad64x16x4d_c
+void aom_masked_sad64x16x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad64x16x4d aom_masked_sad64x16x4d_neon
unsigned int aom_masked_sad64x32_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad64x32 aom_masked_sad64x32_c
+unsigned int aom_masked_sad64x32_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad64x32 aom_masked_sad64x32_neon
void aom_masked_sad64x32x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad64x32x4d aom_masked_sad64x32x4d_c
+void aom_masked_sad64x32x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad64x32x4d aom_masked_sad64x32x4d_neon
unsigned int aom_masked_sad64x64_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad64x64 aom_masked_sad64x64_c
+unsigned int aom_masked_sad64x64_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad64x64 aom_masked_sad64x64_neon
void aom_masked_sad64x64x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad64x64x4d aom_masked_sad64x64x4d_c
+void aom_masked_sad64x64x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad64x64x4d aom_masked_sad64x64x4d_neon
unsigned int aom_masked_sad8x16_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad8x16 aom_masked_sad8x16_c
+unsigned int aom_masked_sad8x16_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad8x16 aom_masked_sad8x16_neon
void aom_masked_sad8x16x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad8x16x4d aom_masked_sad8x16x4d_c
+void aom_masked_sad8x16x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad8x16x4d aom_masked_sad8x16x4d_neon
unsigned int aom_masked_sad8x32_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad8x32 aom_masked_sad8x32_c
+unsigned int aom_masked_sad8x32_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad8x32 aom_masked_sad8x32_neon
void aom_masked_sad8x32x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad8x32x4d aom_masked_sad8x32x4d_c
+void aom_masked_sad8x32x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad8x32x4d aom_masked_sad8x32x4d_neon
unsigned int aom_masked_sad8x4_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad8x4 aom_masked_sad8x4_c
+unsigned int aom_masked_sad8x4_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad8x4 aom_masked_sad8x4_neon
void aom_masked_sad8x4x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad8x4x4d aom_masked_sad8x4x4d_c
+void aom_masked_sad8x4x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad8x4x4d aom_masked_sad8x4x4d_neon
unsigned int aom_masked_sad8x8_c(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
-#define aom_masked_sad8x8 aom_masked_sad8x8_c
+unsigned int aom_masked_sad8x8_neon(const uint8_t *src, int src_stride, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask);
+#define aom_masked_sad8x8 aom_masked_sad8x8_neon
void aom_masked_sad8x8x4d_c(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
-#define aom_masked_sad8x8x4d aom_masked_sad8x8x4d_c
+void aom_masked_sad8x8x4d_neon(const uint8_t *src, int src_stride, const uint8_t *ref[4], int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned sads[4]);
+#define aom_masked_sad8x8x4d aom_masked_sad8x8x4d_neon
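For the masked kernels above, the prediction being scored is a per-pixel blend of ref and second_pred under a 0..64 weight mask (the library's a64 blend), with invert_mask swapping which operand takes the mask weight. A hedged sketch of that reference behavior (illustrative name; second_pred is assumed packed at block-width stride, as is usual in libaom):

    #include <stdint.h>
    #include <stdlib.h>

    static unsigned int masked_sad_wxh(const uint8_t *src, int src_stride,
                                       const uint8_t *ref, int ref_stride,
                                       const uint8_t *second_pred,
                                       const uint8_t *msk, int msk_stride,
                                       int invert_mask, int w, int h) {
      unsigned int sad = 0;
      for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
          const int m = msk[y * msk_stride + x];  /* weight in 0..64 */
          const int a = ref[y * ref_stride + x];
          const int b = second_pred[y * w + x];   /* block-width stride */
          /* a64 blend: pred = (m*a + (64-m)*b + 32) >> 6, operands
             swapped when invert_mask is set. */
          const int pred = invert_mask ? (m * b + (64 - m) * a + 32) >> 6
                                       : (m * a + (64 - m) * b + 32) >> 6;
          sad += abs(src[y * src_stride + x] - pred);
        }
      }
      return sad;
    }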
unsigned int aom_masked_sub_pixel_variance128x128_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance128x128 aom_masked_sub_pixel_variance128x128_c
+unsigned int aom_masked_sub_pixel_variance128x128_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance128x128 aom_masked_sub_pixel_variance128x128_neon
unsigned int aom_masked_sub_pixel_variance128x64_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance128x64 aom_masked_sub_pixel_variance128x64_c
+unsigned int aom_masked_sub_pixel_variance128x64_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance128x64 aom_masked_sub_pixel_variance128x64_neon
unsigned int aom_masked_sub_pixel_variance16x16_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance16x16 aom_masked_sub_pixel_variance16x16_c
+unsigned int aom_masked_sub_pixel_variance16x16_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance16x16 aom_masked_sub_pixel_variance16x16_neon
unsigned int aom_masked_sub_pixel_variance16x32_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance16x32 aom_masked_sub_pixel_variance16x32_c
+unsigned int aom_masked_sub_pixel_variance16x32_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance16x32 aom_masked_sub_pixel_variance16x32_neon
unsigned int aom_masked_sub_pixel_variance16x4_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance16x4 aom_masked_sub_pixel_variance16x4_c
+unsigned int aom_masked_sub_pixel_variance16x4_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance16x4 aom_masked_sub_pixel_variance16x4_neon
unsigned int aom_masked_sub_pixel_variance16x64_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance16x64 aom_masked_sub_pixel_variance16x64_c
+unsigned int aom_masked_sub_pixel_variance16x64_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance16x64 aom_masked_sub_pixel_variance16x64_neon
unsigned int aom_masked_sub_pixel_variance16x8_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance16x8 aom_masked_sub_pixel_variance16x8_c
+unsigned int aom_masked_sub_pixel_variance16x8_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance16x8 aom_masked_sub_pixel_variance16x8_neon
unsigned int aom_masked_sub_pixel_variance32x16_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance32x16 aom_masked_sub_pixel_variance32x16_c
+unsigned int aom_masked_sub_pixel_variance32x16_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance32x16 aom_masked_sub_pixel_variance32x16_neon
unsigned int aom_masked_sub_pixel_variance32x32_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance32x32 aom_masked_sub_pixel_variance32x32_c
+unsigned int aom_masked_sub_pixel_variance32x32_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance32x32 aom_masked_sub_pixel_variance32x32_neon
unsigned int aom_masked_sub_pixel_variance32x64_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance32x64 aom_masked_sub_pixel_variance32x64_c
+unsigned int aom_masked_sub_pixel_variance32x64_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance32x64 aom_masked_sub_pixel_variance32x64_neon
unsigned int aom_masked_sub_pixel_variance32x8_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance32x8 aom_masked_sub_pixel_variance32x8_c
+unsigned int aom_masked_sub_pixel_variance32x8_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance32x8 aom_masked_sub_pixel_variance32x8_neon
unsigned int aom_masked_sub_pixel_variance4x16_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance4x16 aom_masked_sub_pixel_variance4x16_c
+unsigned int aom_masked_sub_pixel_variance4x16_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance4x16 aom_masked_sub_pixel_variance4x16_neon
unsigned int aom_masked_sub_pixel_variance4x4_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance4x4 aom_masked_sub_pixel_variance4x4_c
+unsigned int aom_masked_sub_pixel_variance4x4_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance4x4 aom_masked_sub_pixel_variance4x4_neon
unsigned int aom_masked_sub_pixel_variance4x8_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance4x8 aom_masked_sub_pixel_variance4x8_c
+unsigned int aom_masked_sub_pixel_variance4x8_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance4x8 aom_masked_sub_pixel_variance4x8_neon
unsigned int aom_masked_sub_pixel_variance64x128_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance64x128 aom_masked_sub_pixel_variance64x128_c
+unsigned int aom_masked_sub_pixel_variance64x128_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance64x128 aom_masked_sub_pixel_variance64x128_neon
unsigned int aom_masked_sub_pixel_variance64x16_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance64x16 aom_masked_sub_pixel_variance64x16_c
+unsigned int aom_masked_sub_pixel_variance64x16_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance64x16 aom_masked_sub_pixel_variance64x16_neon
unsigned int aom_masked_sub_pixel_variance64x32_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance64x32 aom_masked_sub_pixel_variance64x32_c
+unsigned int aom_masked_sub_pixel_variance64x32_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance64x32 aom_masked_sub_pixel_variance64x32_neon
unsigned int aom_masked_sub_pixel_variance64x64_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance64x64 aom_masked_sub_pixel_variance64x64_c
+unsigned int aom_masked_sub_pixel_variance64x64_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance64x64 aom_masked_sub_pixel_variance64x64_neon
unsigned int aom_masked_sub_pixel_variance8x16_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance8x16 aom_masked_sub_pixel_variance8x16_c
+unsigned int aom_masked_sub_pixel_variance8x16_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance8x16 aom_masked_sub_pixel_variance8x16_neon
unsigned int aom_masked_sub_pixel_variance8x32_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance8x32 aom_masked_sub_pixel_variance8x32_c
+unsigned int aom_masked_sub_pixel_variance8x32_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance8x32 aom_masked_sub_pixel_variance8x32_neon
unsigned int aom_masked_sub_pixel_variance8x4_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance8x4 aom_masked_sub_pixel_variance8x4_c
+unsigned int aom_masked_sub_pixel_variance8x4_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance8x4 aom_masked_sub_pixel_variance8x4_neon
unsigned int aom_masked_sub_pixel_variance8x8_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
-#define aom_masked_sub_pixel_variance8x8 aom_masked_sub_pixel_variance8x8_c
+unsigned int aom_masked_sub_pixel_variance8x8_neon(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
+#define aom_masked_sub_pixel_variance8x8 aom_masked_sub_pixel_variance8x8_neon
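The masked sub-pixel variance kernels above follow the same outline: filter the reference bilinearly at the (xoffset, yoffset) sub-pel position, blend with second_pred under the mask, then measure variance against src. The closing step is common to every variance kernel in this header and is sketched below with an illustrative helper name:

    #include <stdint.h>

    /* Given the squared-error sum (SSE) and signed error sum over a
       w*h block, variance = SSE - sum^2 / (w*h); SSE is also returned
       through the out-parameter, matching the kernel signatures. */
    static unsigned int variance_from_sums(int64_t sse, int64_t sum,
                                           int w, int h,
                                           unsigned int *sse_out) {
      *sse_out = (unsigned int)sse;
      return (unsigned int)(sse - ((sum * sum) / (w * h)));
    }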
void aom_minmax_8x8_c(const uint8_t *s, int p, const uint8_t *d, int dp, int *min, int *max);
-#define aom_minmax_8x8 aom_minmax_8x8_c
+void aom_minmax_8x8_neon(const uint8_t *s, int p, const uint8_t *d, int dp, int *min, int *max);
+#define aom_minmax_8x8 aom_minmax_8x8_neon
unsigned int aom_mse16x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
unsigned int aom_mse16x16_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int recon_stride, unsigned int *sse);
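Entries such as aom_mse16x16 above already carried NEON bindings before this change; the hunks in this diff extend the same pattern to the remaining kernels. With a single extension enabled, the generated header binds each entry point straight to its _neon specialization via #define; when several extensions are compiled in, it instead declares a function pointer assigned at runtime from CPU flags. A hedged, self-contained sketch of that second shape (HAS_NEON and cpu_caps() are stand-ins for the aom_ports equivalents, and static replaces the header's RTCD_EXTERN):

    #include <stdint.h>

    #define HAS_NEON 0x01
    static int cpu_caps(void) { return HAS_NEON; }  /* stand-in probe */

    unsigned int aom_mse16x16_c(const uint8_t *src, int src_stride,
                                const uint8_t *ref, int ref_stride,
                                unsigned int *sse);
    unsigned int aom_mse16x16_neon(const uint8_t *src, int src_stride,
                                   const uint8_t *ref, int ref_stride,
                                   unsigned int *sse);

    /* Runtime-dispatched entry point: a pointer instead of a #define. */
    static unsigned int (*aom_mse16x16)(const uint8_t *, int,
                                        const uint8_t *, int,
                                        unsigned int *);

    static void setup_rtcd_internal(void) {
      aom_mse16x16 = aom_mse16x16_c;  /* portable fallback */
      if (cpu_caps() & HAS_NEON) aom_mse16x16 = aom_mse16x16_neon;
    }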
@@ -3837,202 +4191,268 @@
#define aom_mse_wxh_16bit_highbd aom_mse_wxh_16bit_highbd_c
unsigned int aom_obmc_sad128x128_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad128x128 aom_obmc_sad128x128_c
+unsigned int aom_obmc_sad128x128_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad128x128 aom_obmc_sad128x128_neon
unsigned int aom_obmc_sad128x64_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad128x64 aom_obmc_sad128x64_c
+unsigned int aom_obmc_sad128x64_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad128x64 aom_obmc_sad128x64_neon
unsigned int aom_obmc_sad16x16_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad16x16 aom_obmc_sad16x16_c
+unsigned int aom_obmc_sad16x16_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad16x16 aom_obmc_sad16x16_neon
unsigned int aom_obmc_sad16x32_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad16x32 aom_obmc_sad16x32_c
+unsigned int aom_obmc_sad16x32_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad16x32 aom_obmc_sad16x32_neon
unsigned int aom_obmc_sad16x4_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad16x4 aom_obmc_sad16x4_c
+unsigned int aom_obmc_sad16x4_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad16x4 aom_obmc_sad16x4_neon
unsigned int aom_obmc_sad16x64_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad16x64 aom_obmc_sad16x64_c
+unsigned int aom_obmc_sad16x64_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad16x64 aom_obmc_sad16x64_neon
unsigned int aom_obmc_sad16x8_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad16x8 aom_obmc_sad16x8_c
+unsigned int aom_obmc_sad16x8_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad16x8 aom_obmc_sad16x8_neon
unsigned int aom_obmc_sad32x16_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad32x16 aom_obmc_sad32x16_c
+unsigned int aom_obmc_sad32x16_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad32x16 aom_obmc_sad32x16_neon
unsigned int aom_obmc_sad32x32_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad32x32 aom_obmc_sad32x32_c
+unsigned int aom_obmc_sad32x32_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad32x32 aom_obmc_sad32x32_neon
unsigned int aom_obmc_sad32x64_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad32x64 aom_obmc_sad32x64_c
+unsigned int aom_obmc_sad32x64_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad32x64 aom_obmc_sad32x64_neon
unsigned int aom_obmc_sad32x8_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad32x8 aom_obmc_sad32x8_c
+unsigned int aom_obmc_sad32x8_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad32x8 aom_obmc_sad32x8_neon
unsigned int aom_obmc_sad4x16_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad4x16 aom_obmc_sad4x16_c
+unsigned int aom_obmc_sad4x16_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad4x16 aom_obmc_sad4x16_neon
unsigned int aom_obmc_sad4x4_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad4x4 aom_obmc_sad4x4_c
+unsigned int aom_obmc_sad4x4_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad4x4 aom_obmc_sad4x4_neon
unsigned int aom_obmc_sad4x8_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad4x8 aom_obmc_sad4x8_c
+unsigned int aom_obmc_sad4x8_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad4x8 aom_obmc_sad4x8_neon
unsigned int aom_obmc_sad64x128_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad64x128 aom_obmc_sad64x128_c
+unsigned int aom_obmc_sad64x128_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad64x128 aom_obmc_sad64x128_neon
unsigned int aom_obmc_sad64x16_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad64x16 aom_obmc_sad64x16_c
+unsigned int aom_obmc_sad64x16_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad64x16 aom_obmc_sad64x16_neon
unsigned int aom_obmc_sad64x32_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad64x32 aom_obmc_sad64x32_c
+unsigned int aom_obmc_sad64x32_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad64x32 aom_obmc_sad64x32_neon
unsigned int aom_obmc_sad64x64_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad64x64 aom_obmc_sad64x64_c
+unsigned int aom_obmc_sad64x64_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad64x64 aom_obmc_sad64x64_neon
unsigned int aom_obmc_sad8x16_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad8x16 aom_obmc_sad8x16_c
+unsigned int aom_obmc_sad8x16_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad8x16 aom_obmc_sad8x16_neon
unsigned int aom_obmc_sad8x32_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad8x32 aom_obmc_sad8x32_c
+unsigned int aom_obmc_sad8x32_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad8x32 aom_obmc_sad8x32_neon
unsigned int aom_obmc_sad8x4_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad8x4 aom_obmc_sad8x4_c
+unsigned int aom_obmc_sad8x4_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad8x4 aom_obmc_sad8x4_neon
unsigned int aom_obmc_sad8x8_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
-#define aom_obmc_sad8x8 aom_obmc_sad8x8_c
+unsigned int aom_obmc_sad8x8_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask);
+#define aom_obmc_sad8x8 aom_obmc_sad8x8_neon
unsigned int aom_obmc_sub_pixel_variance128x128_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance128x128 aom_obmc_sub_pixel_variance128x128_c
+unsigned int aom_obmc_sub_pixel_variance128x128_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance128x128 aom_obmc_sub_pixel_variance128x128_neon
unsigned int aom_obmc_sub_pixel_variance128x64_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance128x64 aom_obmc_sub_pixel_variance128x64_c
+unsigned int aom_obmc_sub_pixel_variance128x64_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance128x64 aom_obmc_sub_pixel_variance128x64_neon
unsigned int aom_obmc_sub_pixel_variance16x16_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance16x16 aom_obmc_sub_pixel_variance16x16_c
+unsigned int aom_obmc_sub_pixel_variance16x16_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance16x16 aom_obmc_sub_pixel_variance16x16_neon
unsigned int aom_obmc_sub_pixel_variance16x32_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance16x32 aom_obmc_sub_pixel_variance16x32_c
+unsigned int aom_obmc_sub_pixel_variance16x32_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance16x32 aom_obmc_sub_pixel_variance16x32_neon
unsigned int aom_obmc_sub_pixel_variance16x4_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance16x4 aom_obmc_sub_pixel_variance16x4_c
+unsigned int aom_obmc_sub_pixel_variance16x4_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance16x4 aom_obmc_sub_pixel_variance16x4_neon
unsigned int aom_obmc_sub_pixel_variance16x64_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance16x64 aom_obmc_sub_pixel_variance16x64_c
+unsigned int aom_obmc_sub_pixel_variance16x64_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance16x64 aom_obmc_sub_pixel_variance16x64_neon
unsigned int aom_obmc_sub_pixel_variance16x8_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance16x8 aom_obmc_sub_pixel_variance16x8_c
+unsigned int aom_obmc_sub_pixel_variance16x8_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance16x8 aom_obmc_sub_pixel_variance16x8_neon
unsigned int aom_obmc_sub_pixel_variance32x16_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance32x16 aom_obmc_sub_pixel_variance32x16_c
+unsigned int aom_obmc_sub_pixel_variance32x16_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance32x16 aom_obmc_sub_pixel_variance32x16_neon
unsigned int aom_obmc_sub_pixel_variance32x32_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance32x32 aom_obmc_sub_pixel_variance32x32_c
+unsigned int aom_obmc_sub_pixel_variance32x32_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance32x32 aom_obmc_sub_pixel_variance32x32_neon
unsigned int aom_obmc_sub_pixel_variance32x64_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance32x64 aom_obmc_sub_pixel_variance32x64_c
+unsigned int aom_obmc_sub_pixel_variance32x64_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance32x64 aom_obmc_sub_pixel_variance32x64_neon
unsigned int aom_obmc_sub_pixel_variance32x8_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance32x8 aom_obmc_sub_pixel_variance32x8_c
+unsigned int aom_obmc_sub_pixel_variance32x8_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance32x8 aom_obmc_sub_pixel_variance32x8_neon
unsigned int aom_obmc_sub_pixel_variance4x16_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance4x16 aom_obmc_sub_pixel_variance4x16_c
+unsigned int aom_obmc_sub_pixel_variance4x16_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance4x16 aom_obmc_sub_pixel_variance4x16_neon
unsigned int aom_obmc_sub_pixel_variance4x4_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance4x4 aom_obmc_sub_pixel_variance4x4_c
+unsigned int aom_obmc_sub_pixel_variance4x4_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance4x4 aom_obmc_sub_pixel_variance4x4_neon
unsigned int aom_obmc_sub_pixel_variance4x8_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance4x8 aom_obmc_sub_pixel_variance4x8_c
+unsigned int aom_obmc_sub_pixel_variance4x8_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance4x8 aom_obmc_sub_pixel_variance4x8_neon
unsigned int aom_obmc_sub_pixel_variance64x128_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance64x128 aom_obmc_sub_pixel_variance64x128_c
+unsigned int aom_obmc_sub_pixel_variance64x128_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance64x128 aom_obmc_sub_pixel_variance64x128_neon
unsigned int aom_obmc_sub_pixel_variance64x16_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance64x16 aom_obmc_sub_pixel_variance64x16_c
+unsigned int aom_obmc_sub_pixel_variance64x16_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance64x16 aom_obmc_sub_pixel_variance64x16_neon
unsigned int aom_obmc_sub_pixel_variance64x32_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance64x32 aom_obmc_sub_pixel_variance64x32_c
+unsigned int aom_obmc_sub_pixel_variance64x32_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance64x32 aom_obmc_sub_pixel_variance64x32_neon
unsigned int aom_obmc_sub_pixel_variance64x64_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance64x64 aom_obmc_sub_pixel_variance64x64_c
+unsigned int aom_obmc_sub_pixel_variance64x64_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance64x64 aom_obmc_sub_pixel_variance64x64_neon
unsigned int aom_obmc_sub_pixel_variance8x16_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance8x16 aom_obmc_sub_pixel_variance8x16_c
+unsigned int aom_obmc_sub_pixel_variance8x16_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance8x16 aom_obmc_sub_pixel_variance8x16_neon
unsigned int aom_obmc_sub_pixel_variance8x32_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance8x32 aom_obmc_sub_pixel_variance8x32_c
+unsigned int aom_obmc_sub_pixel_variance8x32_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance8x32 aom_obmc_sub_pixel_variance8x32_neon
unsigned int aom_obmc_sub_pixel_variance8x4_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance8x4 aom_obmc_sub_pixel_variance8x4_c
+unsigned int aom_obmc_sub_pixel_variance8x4_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance8x4 aom_obmc_sub_pixel_variance8x4_neon
unsigned int aom_obmc_sub_pixel_variance8x8_c(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_sub_pixel_variance8x8 aom_obmc_sub_pixel_variance8x8_c
+unsigned int aom_obmc_sub_pixel_variance8x8_neon(const uint8_t *pre, int pre_stride, int xoffset, int yoffset, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_sub_pixel_variance8x8 aom_obmc_sub_pixel_variance8x8_neon
unsigned int aom_obmc_variance128x128_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance128x128 aom_obmc_variance128x128_c
+unsigned int aom_obmc_variance128x128_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance128x128 aom_obmc_variance128x128_neon
unsigned int aom_obmc_variance128x64_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance128x64 aom_obmc_variance128x64_c
+unsigned int aom_obmc_variance128x64_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance128x64 aom_obmc_variance128x64_neon
unsigned int aom_obmc_variance16x16_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance16x16 aom_obmc_variance16x16_c
+unsigned int aom_obmc_variance16x16_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance16x16 aom_obmc_variance16x16_neon
unsigned int aom_obmc_variance16x32_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance16x32 aom_obmc_variance16x32_c
+unsigned int aom_obmc_variance16x32_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance16x32 aom_obmc_variance16x32_neon
unsigned int aom_obmc_variance16x4_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance16x4 aom_obmc_variance16x4_c
+unsigned int aom_obmc_variance16x4_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance16x4 aom_obmc_variance16x4_neon
unsigned int aom_obmc_variance16x64_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance16x64 aom_obmc_variance16x64_c
+unsigned int aom_obmc_variance16x64_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance16x64 aom_obmc_variance16x64_neon
unsigned int aom_obmc_variance16x8_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance16x8 aom_obmc_variance16x8_c
+unsigned int aom_obmc_variance16x8_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance16x8 aom_obmc_variance16x8_neon
unsigned int aom_obmc_variance32x16_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance32x16 aom_obmc_variance32x16_c
+unsigned int aom_obmc_variance32x16_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance32x16 aom_obmc_variance32x16_neon
unsigned int aom_obmc_variance32x32_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance32x32 aom_obmc_variance32x32_c
+unsigned int aom_obmc_variance32x32_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance32x32 aom_obmc_variance32x32_neon
unsigned int aom_obmc_variance32x64_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance32x64 aom_obmc_variance32x64_c
+unsigned int aom_obmc_variance32x64_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance32x64 aom_obmc_variance32x64_neon
unsigned int aom_obmc_variance32x8_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance32x8 aom_obmc_variance32x8_c
+unsigned int aom_obmc_variance32x8_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance32x8 aom_obmc_variance32x8_neon
unsigned int aom_obmc_variance4x16_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance4x16 aom_obmc_variance4x16_c
+unsigned int aom_obmc_variance4x16_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance4x16 aom_obmc_variance4x16_neon
unsigned int aom_obmc_variance4x4_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance4x4 aom_obmc_variance4x4_c
+unsigned int aom_obmc_variance4x4_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance4x4 aom_obmc_variance4x4_neon
unsigned int aom_obmc_variance4x8_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance4x8 aom_obmc_variance4x8_c
+unsigned int aom_obmc_variance4x8_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance4x8 aom_obmc_variance4x8_neon
unsigned int aom_obmc_variance64x128_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance64x128 aom_obmc_variance64x128_c
+unsigned int aom_obmc_variance64x128_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance64x128 aom_obmc_variance64x128_neon
unsigned int aom_obmc_variance64x16_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance64x16 aom_obmc_variance64x16_c
+unsigned int aom_obmc_variance64x16_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance64x16 aom_obmc_variance64x16_neon
unsigned int aom_obmc_variance64x32_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance64x32 aom_obmc_variance64x32_c
+unsigned int aom_obmc_variance64x32_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance64x32 aom_obmc_variance64x32_neon
unsigned int aom_obmc_variance64x64_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance64x64 aom_obmc_variance64x64_c
+unsigned int aom_obmc_variance64x64_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance64x64 aom_obmc_variance64x64_neon
unsigned int aom_obmc_variance8x16_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance8x16 aom_obmc_variance8x16_c
+unsigned int aom_obmc_variance8x16_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance8x16 aom_obmc_variance8x16_neon
unsigned int aom_obmc_variance8x32_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance8x32 aom_obmc_variance8x32_c
+unsigned int aom_obmc_variance8x32_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance8x32 aom_obmc_variance8x32_neon
unsigned int aom_obmc_variance8x4_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance8x4 aom_obmc_variance8x4_c
+unsigned int aom_obmc_variance8x4_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance8x4 aom_obmc_variance8x4_neon
unsigned int aom_obmc_variance8x8_c(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
-#define aom_obmc_variance8x8 aom_obmc_variance8x8_c
+unsigned int aom_obmc_variance8x8_neon(const uint8_t *pre, int pre_stride, const int32_t *wsrc, const int32_t *mask, unsigned int *sse);
+#define aom_obmc_variance8x8 aom_obmc_variance8x8_neon
void aom_paeth_predictor_16x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_paeth_predictor_16x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
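(The hunk above rebinds every OBMC — overlapped block motion compensation — SAD and variance size from the C fallback to a NEON kernel. As a hedged sketch of the contract these kernels share, following the shape of libaom's scalar fallback: `wsrc` is the pre-weighted source, `mask` holds per-pixel blend weights, and the 12-bit rounding shift is an assumption taken from the library's A64 blend precision.)

  #include <stdint.h>
  #include <stdlib.h>

  /* Rounding helper in the style used throughout aom_dsp. */
  #define ROUND_POWER_OF_TWO(value, n) (((value) + ((1 << (n)) >> 1)) >> (n))

  /* Hedged sketch of the scalar OBMC SAD these NEON bindings supersede. */
  static unsigned int obmc_sad_sketch(const uint8_t *pre, int pre_stride,
                                      const int32_t *wsrc, const int32_t *mask,
                                      int width, int height) {
    unsigned int sad = 0;
    for (int y = 0; y < height; ++y) {
      for (int x = 0; x < width; ++x)
        sad += ROUND_POWER_OF_TWO(abs(wsrc[x] - pre[x] * mask[x]), 12);
      pre += pre_stride;
      wsrc += width;  /* wsrc and mask are stored densely at block width */
      mask += width;
    }
    return sad;
  }
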
@@ -4110,9 +4530,6 @@
void aom_paeth_predictor_8x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_paeth_predictor_8x8 aom_paeth_predictor_8x8_neon
-void aom_pixel_scale_c(const int16_t *src_diff, ptrdiff_t src_stride, int16_t *coeff, int log_scale, int h8, int w8);
-#define aom_pixel_scale aom_pixel_scale_c
-
void aom_quantize_b_c(const tran_low_t *coeff_ptr, intptr_t n_coeffs, const int16_t *zbin_ptr, const int16_t *round_ptr, const int16_t *quant_ptr, const int16_t *quant_shift_ptr, tran_low_t *qcoeff_ptr, tran_low_t *dqcoeff_ptr, const int16_t *dequant_ptr, uint16_t *eob_ptr, const int16_t *scan, const int16_t *iscan);
void aom_quantize_b_neon(const tran_low_t *coeff_ptr, intptr_t n_coeffs, const int16_t *zbin_ptr, const int16_t *round_ptr, const int16_t *quant_ptr, const int16_t *quant_shift_ptr, tran_low_t *qcoeff_ptr, tran_low_t *dqcoeff_ptr, const int16_t *dequant_ptr, uint16_t *eob_ptr, const int16_t *scan, const int16_t *iscan);
#define aom_quantize_b aom_quantize_b_neon
@@ -4143,15 +4560,13 @@
#define aom_sad128x128_avg aom_sad128x128_avg_neon
void aom_sad128x128x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad128x128x3d aom_sad128x128x3d_c
+void aom_sad128x128x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad128x128x3d aom_sad128x128x3d_neon
void aom_sad128x128x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad128x128x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad128x128x4d aom_sad128x128x4d_neon
-void aom_sad128x128x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad128x128x4d_avg aom_sad128x128x4d_avg_c
-
unsigned int aom_sad128x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad128x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad128x64 aom_sad128x64_neon
@@ -4161,18 +4576,13 @@
#define aom_sad128x64_avg aom_sad128x64_avg_neon
void aom_sad128x64x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad128x64x3d aom_sad128x64x3d_c
+void aom_sad128x64x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad128x64x3d aom_sad128x64x3d_neon
void aom_sad128x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad128x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad128x64x4d aom_sad128x64x4d_neon
-void aom_sad128x64x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad128x64x4d_avg aom_sad128x64x4d_avg_c
-
-unsigned int aom_sad128xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad128xh aom_sad128xh_c
-
unsigned int aom_sad16x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad16x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x16 aom_sad16x16_neon
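(Two patterns repeat through the SAD hunks here: the `x3d` entry points gain NEON bindings, while the `-` blocks delete prototypes removed upstream in this release — the `x4d_avg` variants and the width-only `aom_sadWxh` helpers — rather than porting them. A hedged sketch of what an `x3d` kernel computes, using 16x16 as the example size: one SAD per candidate reference, with only the first three slots used; the 4-element arrays mirror the `x4d` prototype shape.)

  #include <stdint.h>
  #include <stdlib.h>

  /* Hedged sketch of the x3d contract: three SADs in one call. */
  static void sad16x16x3d_sketch(const uint8_t *src, int src_stride,
                                 const uint8_t *const ref[4], int ref_stride,
                                 uint32_t sad_array[4]) {
    for (int i = 0; i < 3; ++i) {
      uint32_t sad = 0;
      for (int y = 0; y < 16; ++y)
        for (int x = 0; x < 16; ++x)
          sad += abs(src[y * src_stride + x] - ref[i][y * ref_stride + x]);
      sad_array[i] = sad;
    }
  }
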
@@ -4182,15 +4592,13 @@
#define aom_sad16x16_avg aom_sad16x16_avg_neon
void aom_sad16x16x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad16x16x3d aom_sad16x16x3d_c
+void aom_sad16x16x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad16x16x3d aom_sad16x16x3d_neon
void aom_sad16x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad16x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x16x4d aom_sad16x16x4d_neon
-void aom_sad16x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x16x4d_avg aom_sad16x16x4d_avg_c
-
unsigned int aom_sad16x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad16x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x32 aom_sad16x32_neon
@@ -4200,15 +4608,13 @@
#define aom_sad16x32_avg aom_sad16x32_avg_neon
void aom_sad16x32x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad16x32x3d aom_sad16x32x3d_c
+void aom_sad16x32x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad16x32x3d aom_sad16x32x3d_neon
void aom_sad16x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad16x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x32x4d aom_sad16x32x4d_neon
-void aom_sad16x32x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x32x4d_avg aom_sad16x32x4d_avg_c
-
unsigned int aom_sad16x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad16x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x4 aom_sad16x4_neon
@@ -4218,15 +4624,13 @@
#define aom_sad16x4_avg aom_sad16x4_avg_neon
void aom_sad16x4x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad16x4x3d aom_sad16x4x3d_c
+void aom_sad16x4x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad16x4x3d aom_sad16x4x3d_neon
void aom_sad16x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad16x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x4x4d aom_sad16x4x4d_neon
-void aom_sad16x4x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x4x4d_avg aom_sad16x4x4d_avg_c
-
unsigned int aom_sad16x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad16x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x64 aom_sad16x64_neon
@@ -4236,15 +4640,13 @@
#define aom_sad16x64_avg aom_sad16x64_avg_neon
void aom_sad16x64x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad16x64x3d aom_sad16x64x3d_c
+void aom_sad16x64x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad16x64x3d aom_sad16x64x3d_neon
void aom_sad16x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad16x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x64x4d aom_sad16x64x4d_neon
-void aom_sad16x64x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x64x4d_avg aom_sad16x64x4d_avg_c
-
unsigned int aom_sad16x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad16x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x8 aom_sad16x8_neon
@@ -4254,18 +4656,13 @@
#define aom_sad16x8_avg aom_sad16x8_avg_neon
void aom_sad16x8x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad16x8x3d aom_sad16x8x3d_c
+void aom_sad16x8x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad16x8x3d aom_sad16x8x3d_neon
void aom_sad16x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad16x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x8x4d aom_sad16x8x4d_neon
-void aom_sad16x8x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x8x4d_avg aom_sad16x8x4d_avg_c
-
-unsigned int aom_sad16xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad16xh aom_sad16xh_c
-
unsigned int aom_sad32x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad32x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad32x16 aom_sad32x16_neon
@@ -4275,15 +4672,13 @@
#define aom_sad32x16_avg aom_sad32x16_avg_neon
void aom_sad32x16x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad32x16x3d aom_sad32x16x3d_c
+void aom_sad32x16x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad32x16x3d aom_sad32x16x3d_neon
void aom_sad32x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad32x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad32x16x4d aom_sad32x16x4d_neon
-void aom_sad32x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad32x16x4d_avg aom_sad32x16x4d_avg_c
-
unsigned int aom_sad32x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad32x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad32x32 aom_sad32x32_neon
@@ -4293,15 +4688,13 @@
#define aom_sad32x32_avg aom_sad32x32_avg_neon
void aom_sad32x32x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad32x32x3d aom_sad32x32x3d_c
+void aom_sad32x32x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad32x32x3d aom_sad32x32x3d_neon
void aom_sad32x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad32x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad32x32x4d aom_sad32x32x4d_neon
-void aom_sad32x32x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad32x32x4d_avg aom_sad32x32x4d_avg_c
-
unsigned int aom_sad32x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad32x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad32x64 aom_sad32x64_neon
@@ -4311,15 +4704,13 @@
#define aom_sad32x64_avg aom_sad32x64_avg_neon
void aom_sad32x64x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad32x64x3d aom_sad32x64x3d_c
+void aom_sad32x64x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad32x64x3d aom_sad32x64x3d_neon
void aom_sad32x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad32x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad32x64x4d aom_sad32x64x4d_neon
-void aom_sad32x64x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad32x64x4d_avg aom_sad32x64x4d_avg_c
-
unsigned int aom_sad32x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad32x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad32x8 aom_sad32x8_neon
@@ -4329,18 +4720,13 @@
#define aom_sad32x8_avg aom_sad32x8_avg_neon
void aom_sad32x8x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad32x8x3d aom_sad32x8x3d_c
+void aom_sad32x8x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad32x8x3d aom_sad32x8x3d_neon
void aom_sad32x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad32x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad32x8x4d aom_sad32x8x4d_neon
-void aom_sad32x8x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad32x8x4d_avg aom_sad32x8x4d_avg_c
-
-unsigned int aom_sad32xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad32xh aom_sad32xh_c
-
unsigned int aom_sad4x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad4x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad4x16 aom_sad4x16_neon
@@ -4350,15 +4736,13 @@
#define aom_sad4x16_avg aom_sad4x16_avg_neon
void aom_sad4x16x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad4x16x3d aom_sad4x16x3d_c
+void aom_sad4x16x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad4x16x3d aom_sad4x16x3d_neon
void aom_sad4x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad4x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad4x16x4d aom_sad4x16x4d_neon
-void aom_sad4x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad4x16x4d_avg aom_sad4x16x4d_avg_c
-
unsigned int aom_sad4x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad4x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad4x4 aom_sad4x4_neon
@@ -4368,15 +4752,13 @@
#define aom_sad4x4_avg aom_sad4x4_avg_neon
void aom_sad4x4x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad4x4x3d aom_sad4x4x3d_c
+void aom_sad4x4x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad4x4x3d aom_sad4x4x3d_neon
void aom_sad4x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad4x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad4x4x4d aom_sad4x4x4d_neon
-void aom_sad4x4x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad4x4x4d_avg aom_sad4x4x4d_avg_c
-
unsigned int aom_sad4x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad4x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad4x8 aom_sad4x8_neon
@@ -4386,18 +4768,13 @@
#define aom_sad4x8_avg aom_sad4x8_avg_neon
void aom_sad4x8x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad4x8x3d aom_sad4x8x3d_c
+void aom_sad4x8x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad4x8x3d aom_sad4x8x3d_neon
void aom_sad4x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad4x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad4x8x4d aom_sad4x8x4d_neon
-void aom_sad4x8x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad4x8x4d_avg aom_sad4x8x4d_avg_c
-
-unsigned int aom_sad4xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad4xh aom_sad4xh_c
-
unsigned int aom_sad64x128_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad64x128_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad64x128 aom_sad64x128_neon
@@ -4407,15 +4784,13 @@
#define aom_sad64x128_avg aom_sad64x128_avg_neon
void aom_sad64x128x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad64x128x3d aom_sad64x128x3d_c
+void aom_sad64x128x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad64x128x3d aom_sad64x128x3d_neon
void aom_sad64x128x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad64x128x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad64x128x4d aom_sad64x128x4d_neon
-void aom_sad64x128x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad64x128x4d_avg aom_sad64x128x4d_avg_c
-
unsigned int aom_sad64x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad64x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad64x16 aom_sad64x16_neon
@@ -4425,15 +4800,13 @@
#define aom_sad64x16_avg aom_sad64x16_avg_neon
void aom_sad64x16x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad64x16x3d aom_sad64x16x3d_c
+void aom_sad64x16x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad64x16x3d aom_sad64x16x3d_neon
void aom_sad64x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad64x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad64x16x4d aom_sad64x16x4d_neon
-void aom_sad64x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad64x16x4d_avg aom_sad64x16x4d_avg_c
-
unsigned int aom_sad64x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad64x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad64x32 aom_sad64x32_neon
@@ -4443,15 +4816,13 @@
#define aom_sad64x32_avg aom_sad64x32_avg_neon
void aom_sad64x32x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad64x32x3d aom_sad64x32x3d_c
+void aom_sad64x32x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad64x32x3d aom_sad64x32x3d_neon
void aom_sad64x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad64x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad64x32x4d aom_sad64x32x4d_neon
-void aom_sad64x32x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad64x32x4d_avg aom_sad64x32x4d_avg_c
-
unsigned int aom_sad64x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad64x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad64x64 aom_sad64x64_neon
@@ -4461,18 +4832,13 @@
#define aom_sad64x64_avg aom_sad64x64_avg_neon
void aom_sad64x64x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad64x64x3d aom_sad64x64x3d_c
+void aom_sad64x64x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad64x64x3d aom_sad64x64x3d_neon
void aom_sad64x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad64x64x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad64x64x4d aom_sad64x64x4d_neon
-void aom_sad64x64x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad64x64x4d_avg aom_sad64x64x4d_avg_c
-
-unsigned int aom_sad64xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad64xh aom_sad64xh_c
-
unsigned int aom_sad8x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad8x16_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad8x16 aom_sad8x16_neon
@@ -4482,15 +4848,13 @@
#define aom_sad8x16_avg aom_sad8x16_avg_neon
void aom_sad8x16x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad8x16x3d aom_sad8x16x3d_c
+void aom_sad8x16x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad8x16x3d aom_sad8x16x3d_neon
void aom_sad8x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad8x16x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad8x16x4d aom_sad8x16x4d_neon
-void aom_sad8x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad8x16x4d_avg aom_sad8x16x4d_avg_c
-
unsigned int aom_sad8x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad8x32_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad8x32 aom_sad8x32_neon
@@ -4500,15 +4864,13 @@
#define aom_sad8x32_avg aom_sad8x32_avg_neon
void aom_sad8x32x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad8x32x3d aom_sad8x32x3d_c
+void aom_sad8x32x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad8x32x3d aom_sad8x32x3d_neon
void aom_sad8x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad8x32x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad8x32x4d aom_sad8x32x4d_neon
-void aom_sad8x32x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad8x32x4d_avg aom_sad8x32x4d_avg_c
-
unsigned int aom_sad8x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad8x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad8x4 aom_sad8x4_neon
@@ -4518,15 +4880,13 @@
#define aom_sad8x4_avg aom_sad8x4_avg_neon
void aom_sad8x4x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad8x4x3d aom_sad8x4x3d_c
+void aom_sad8x4x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad8x4x3d aom_sad8x4x3d_neon
void aom_sad8x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad8x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad8x4x4d aom_sad8x4x4d_neon
-void aom_sad8x4x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad8x4x4d_avg aom_sad8x4x4d_avg_c
-
unsigned int aom_sad8x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad8x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad8x8 aom_sad8x8_neon
@@ -4536,18 +4896,13 @@
#define aom_sad8x8_avg aom_sad8x8_avg_neon
void aom_sad8x8x3d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad8x8x3d aom_sad8x8x3d_c
+void aom_sad8x8x3d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad8x8x3d aom_sad8x8x3d_neon
void aom_sad8x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
void aom_sad8x8x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad8x8x4d aom_sad8x8x4d_neon
-void aom_sad8x8x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad8x8x4d_avg aom_sad8x8x4d_avg_c
-
-unsigned int aom_sad8xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad8xh aom_sad8xh_c
-
unsigned int aom_sad_skip_128x128_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad_skip_128x128_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad_skip_128x128 aom_sad_skip_128x128_neon
@@ -4581,10 +4936,12 @@
#define aom_sad_skip_16x32x4d aom_sad_skip_16x32x4d_neon
unsigned int aom_sad_skip_16x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_sad_skip_16x4 aom_sad_skip_16x4_c
+unsigned int aom_sad_skip_16x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_sad_skip_16x4 aom_sad_skip_16x4_neon
void aom_sad_skip_16x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad_skip_16x4x4d aom_sad_skip_16x4x4d_c
+void aom_sad_skip_16x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad_skip_16x4x4d aom_sad_skip_16x4x4d_neon
unsigned int aom_sad_skip_16x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad_skip_16x64_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
@@ -4643,10 +5000,12 @@
#define aom_sad_skip_4x16x4d aom_sad_skip_4x16x4d_neon
unsigned int aom_sad_skip_4x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_sad_skip_4x4 aom_sad_skip_4x4_c
+unsigned int aom_sad_skip_4x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_sad_skip_4x4 aom_sad_skip_4x4_neon
void aom_sad_skip_4x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad_skip_4x4x4d aom_sad_skip_4x4x4d_c
+void aom_sad_skip_4x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad_skip_4x4x4d aom_sad_skip_4x4x4d_neon
unsigned int aom_sad_skip_4x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad_skip_4x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
@@ -4705,10 +5064,12 @@
#define aom_sad_skip_8x32x4d aom_sad_skip_8x32x4d_neon
unsigned int aom_sad_skip_8x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
-#define aom_sad_skip_8x4 aom_sad_skip_8x4_c
+unsigned int aom_sad_skip_8x4_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
+#define aom_sad_skip_8x4 aom_sad_skip_8x4_neon
void aom_sad_skip_8x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
-#define aom_sad_skip_8x4x4d aom_sad_skip_8x4x4d_c
+void aom_sad_skip_8x4x4d_neon(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
+#define aom_sad_skip_8x4x4d aom_sad_skip_8x4x4d_neon
unsigned int aom_sad_skip_8x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad_skip_8x8_neon(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
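(The `aom_sad_skip_*` bindings above — including the newly NEON-routed 16x4, 4x4, and 8x4 sizes — follow libaom's subsampled-SAD convention: compare every other row and double the result so the score stays comparable with a full SAD. A minimal sketch for the 8x4 case:)

  #include <stdint.h>
  #include <stdlib.h>

  /* Hedged sketch of the skip-SAD contract: 2:1 vertical subsampling,
   * result doubled to compensate for the skipped rows. */
  static unsigned int sad_skip_8x4_sketch(const uint8_t *src, int src_stride,
                                          const uint8_t *ref, int ref_stride) {
    unsigned int sad = 0;
    for (int y = 0; y < 4; y += 2)
      for (int x = 0; x < 8; ++x)
        sad += abs(src[y * src_stride + x] - ref[y * ref_stride + x]);
    return 2 * sad;
  }
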
@@ -5150,7 +5511,8 @@
#define aom_sum_squares_2d_i16 aom_sum_squares_2d_i16_neon
uint64_t aom_sum_squares_i16_c(const int16_t *src, uint32_t N);
-#define aom_sum_squares_i16 aom_sum_squares_i16_c
+uint64_t aom_sum_squares_i16_neon(const int16_t *src, uint32_t N);
+#define aom_sum_squares_i16 aom_sum_squares_i16_neon
uint64_t aom_sum_sse_2d_i16_c(const int16_t *src, int src_stride, int width, int height, int *sum);
uint64_t aom_sum_sse_2d_i16_neon(const int16_t *src, int src_stride, int width, int height, int *sum);
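(aom_sum_squares_i16, rebound to NEON just above, has a one-line contract: the sum of src[i]^2 over N values. A sketch of the scalar equivalent, accumulating in 64 bits so no overflow is possible even for N near 2^32:)

  #include <stdint.h>

  /* Hedged sketch: sum of squares of n int16 values, 64-bit accumulator. */
  static uint64_t sum_squares_i16_sketch(const int16_t *src, uint32_t n) {
    uint64_t ss = 0;
    for (uint32_t i = 0; i < n; ++i)
      ss += (int64_t)src[i] * src[i];
    return ss;
  }
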
@@ -5161,67 +5523,84 @@
#define aom_v_predictor_16x16 aom_v_predictor_16x16_neon
void aom_v_predictor_16x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_16x32 aom_v_predictor_16x32_c
+void aom_v_predictor_16x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_16x32 aom_v_predictor_16x32_neon
void aom_v_predictor_16x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_16x4 aom_v_predictor_16x4_c
+void aom_v_predictor_16x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_16x4 aom_v_predictor_16x4_neon
void aom_v_predictor_16x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_16x64 aom_v_predictor_16x64_c
+void aom_v_predictor_16x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_16x64 aom_v_predictor_16x64_neon
void aom_v_predictor_16x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_16x8 aom_v_predictor_16x8_c
+void aom_v_predictor_16x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_16x8 aom_v_predictor_16x8_neon
void aom_v_predictor_32x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_32x16 aom_v_predictor_32x16_c
+void aom_v_predictor_32x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_32x16 aom_v_predictor_32x16_neon
void aom_v_predictor_32x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_v_predictor_32x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_v_predictor_32x32 aom_v_predictor_32x32_neon
void aom_v_predictor_32x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_32x64 aom_v_predictor_32x64_c
+void aom_v_predictor_32x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_32x64 aom_v_predictor_32x64_neon
void aom_v_predictor_32x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_32x8 aom_v_predictor_32x8_c
+void aom_v_predictor_32x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_32x8 aom_v_predictor_32x8_neon
void aom_v_predictor_4x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_4x16 aom_v_predictor_4x16_c
+void aom_v_predictor_4x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_4x16 aom_v_predictor_4x16_neon
void aom_v_predictor_4x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_v_predictor_4x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_v_predictor_4x4 aom_v_predictor_4x4_neon
void aom_v_predictor_4x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_4x8 aom_v_predictor_4x8_c
+void aom_v_predictor_4x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_4x8 aom_v_predictor_4x8_neon
void aom_v_predictor_64x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_64x16 aom_v_predictor_64x16_c
+void aom_v_predictor_64x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_64x16 aom_v_predictor_64x16_neon
void aom_v_predictor_64x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_64x32 aom_v_predictor_64x32_c
+void aom_v_predictor_64x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_64x32 aom_v_predictor_64x32_neon
void aom_v_predictor_64x64_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_64x64 aom_v_predictor_64x64_c
+void aom_v_predictor_64x64_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_64x64 aom_v_predictor_64x64_neon
void aom_v_predictor_8x16_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_8x16 aom_v_predictor_8x16_c
+void aom_v_predictor_8x16_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_8x16 aom_v_predictor_8x16_neon
void aom_v_predictor_8x32_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_8x32 aom_v_predictor_8x32_c
+void aom_v_predictor_8x32_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_8x32 aom_v_predictor_8x32_neon
void aom_v_predictor_8x4_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
-#define aom_v_predictor_8x4 aom_v_predictor_8x4_c
+void aom_v_predictor_8x4_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
+#define aom_v_predictor_8x4 aom_v_predictor_8x4_neon
void aom_v_predictor_8x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
void aom_v_predictor_8x8_neon(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_v_predictor_8x8 aom_v_predictor_8x8_neon
uint64_t aom_var_2d_u16_c(uint8_t *src, int src_stride, int width, int height);
-#define aom_var_2d_u16 aom_var_2d_u16_c
+uint64_t aom_var_2d_u16_neon(uint8_t *src, int src_stride, int width, int height);
+#define aom_var_2d_u16 aom_var_2d_u16_neon
uint64_t aom_var_2d_u8_c(uint8_t *src, int src_stride, int width, int height);
-#define aom_var_2d_u8 aom_var_2d_u8_c
+uint64_t aom_var_2d_u8_neon(uint8_t *src, int src_stride, int width, int height);
+#define aom_var_2d_u8 aom_var_2d_u8_neon
unsigned int aom_variance128x128_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
unsigned int aom_variance128x128_neon(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -5324,7 +5703,7 @@
int aom_vector_var_neon(const int16_t *ref, const int16_t *src, int bwl);
#define aom_vector_var aom_vector_var_neon
-double av1_compute_cross_correlation_c(unsigned char *im1, int stride1, int x1, int y1, unsigned char *im2, int stride2, int x2, int y2);
+double av1_compute_cross_correlation_c(const unsigned char *frame1, int stride1, int x1, int y1, const unsigned char *frame2, int stride2, int x2, int y2);
#define av1_compute_cross_correlation av1_compute_cross_correlation_c
void aom_dsp_rtcd(void);
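(Note: a minimal standalone sketch, not libaom source, of the RTCD aliasing
pattern visible throughout the generated header above. With
CONFIG_RUNTIME_CPU_DETECT=0 each generic symbol is a #define fixed at
configure time, which is why adding a NEON kernel only retargets the macro
and never touches call sites. All "demo_" names here are hypothetical.)

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical C fallback: the vertical predictor copies the row above
 * the block into every row of the block; the left column is unused. */
static void demo_v_predictor_4x4_c(uint8_t *dst, ptrdiff_t stride,
                                   const uint8_t *above,
                                   const uint8_t *left) {
  (void)left;
  for (int r = 0; r < 4; ++r)
    for (int c = 0; c < 4; ++c) dst[r * stride + c] = above[c];
}

/* On this arm64 config the alias would point at a _neon kernel; the C
 * fallback stands in here so the sketch builds anywhere. */
#define demo_v_predictor_4x4 demo_v_predictor_4x4_c

int main(void) {
  uint8_t dst[16] = {0};
  const uint8_t above[4] = {10, 20, 30, 40};
  const uint8_t left[4] = {0};
  demo_v_predictor_4x4(dst, 4, above, left);
  for (int i = 0; i < 16; ++i)
    printf("%3d%c", dst[i], (i % 4 == 3) ? '\n' : ' ');
  return 0;
}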
diff --git a/config/arm64/config/aom_scale_rtcd.h b/config/arm64/config/aom_scale_rtcd.h
index df4b96f..d296957 100644
--- a/config/arm64/config/aom_scale_rtcd.h
+++ b/config/arm64/config/aom_scale_rtcd.h
@@ -80,7 +80,7 @@
void aom_yv12_partial_copy_y_c(const struct yv12_buffer_config *src_ybc, int hstart1, int hend1, int vstart1, int vend1, struct yv12_buffer_config *dst_ybc, int hstart2, int vstart2);
#define aom_yv12_partial_copy_y aom_yv12_partial_copy_y_c
-int aom_yv12_realloc_with_new_border_c(struct yv12_buffer_config *ybf, int new_border, int byte_alignment, int num_planes);
+int aom_yv12_realloc_with_new_border_c(struct yv12_buffer_config *ybf, int new_border, int byte_alignment, int num_pyramid_levels, int num_planes);
#define aom_yv12_realloc_with_new_border aom_yv12_realloc_with_new_border_c
void aom_scale_rtcd(void);
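(Note: the num_pyramid_levels argument added to
aom_yv12_realloc_with_new_border_c above plausibly sizes a downscaled
image pyramid alongside the frame buffer; that interpretation, and every
name below, is an assumption rather than libaom API. The sketch only
shows why such an allocation is cheap: halving each dimension per level
bounds the total by a geometric series, under 4/3 of one full plane.)

#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: bytes needed for num_pyramid_levels downscaled
 * copies of a w x h 8-bit plane, level 0 being full resolution. */
static size_t demo_pyramid_bytes(size_t w, size_t h, int num_pyramid_levels) {
  size_t total = 0;
  for (int l = 0; l < num_pyramid_levels; ++l) {
    total += w * h;
    w = (w + 1) / 2;
    h = (h + 1) / 2;
  }
  return total;
}

int main(void) {
  /* A 1920x1080 plane is 2073600 bytes; 4 levels total 2754000 bytes,
   * which stays under 4/3 of the single full-resolution plane. */
  printf("%zu of %zu bytes\n", demo_pyramid_bytes(1920, 1080, 4),
         (size_t)(1920 * 1080));
  return 0;
}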
diff --git a/config/arm64/config/av1_rtcd.h b/config/arm64/config/av1_rtcd.h
index 964bb72..1a3fa19 100644
--- a/config/arm64/config/av1_rtcd.h
+++ b/config/arm64/config/av1_rtcd.h
@@ -15,12 +15,12 @@
#include "aom/aom_integer.h"
#include "aom_dsp/odintrin.h"
#include "aom_dsp/txfm_common.h"
-#include "av1/common/common.h"
-#include "av1/common/enums.h"
-#include "av1/common/quant_common.h"
-#include "av1/common/filter.h"
-#include "av1/common/convolve.h"
#include "av1/common/av1_txfm.h"
+#include "av1/common/common.h"
+#include "av1/common/convolve.h"
+#include "av1/common/enums.h"
+#include "av1/common/filter.h"
+#include "av1/common/quant_common.h"
#include "av1/common/restoration.h"
struct macroblockd;
@@ -80,14 +80,11 @@
const MV *const mv, uint8_t *comp_pred, const uint8_t *pred, int width,
int height, int subpel_x_q3, int subpel_y_q3, const uint8_t *ref,
int ref_stride, int subpel_search);
-#define aom_comp_avg_upsampled_pred aom_comp_avg_upsampled_pred_c
-
-void aom_comp_mask_upsampled_pred_c(MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
- const MV *const mv, uint8_t *comp_pred, const uint8_t *pred, int width,
- int height, int subpel_x_q3, int subpel_y_q3, const uint8_t *ref,
- int ref_stride, const uint8_t *mask, int mask_stride, int invert_mask,
- int subpel_search);
-#define aom_comp_mask_upsampled_pred aom_comp_mask_upsampled_pred_c
+void aom_comp_avg_upsampled_pred_neon(MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
+ const MV *const mv, uint8_t *comp_pred, const uint8_t *pred, int width,
+ int height, int subpel_x_q3, int subpel_y_q3, const uint8_t *ref,
+ int ref_stride, int subpel_search);
+#define aom_comp_avg_upsampled_pred aom_comp_avg_upsampled_pred_neon
void aom_dist_wtd_comp_avg_upsampled_pred_c(MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
const MV *const mv, uint8_t *comp_pred, const uint8_t *pred, int width,
@@ -118,14 +115,17 @@
void aom_upsampled_pred_c(MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
const MV *const mv, uint8_t *comp_pred, int width, int height, int subpel_x_q3,
int subpel_y_q3, const uint8_t *ref, int ref_stride, int subpel_search);
-#define aom_upsampled_pred aom_upsampled_pred_c
+void aom_upsampled_pred_neon(MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
+ const MV *const mv, uint8_t *comp_pred, int width, int height, int subpel_x_q3,
+ int subpel_y_q3, const uint8_t *ref, int ref_stride, int subpel_search);
+#define aom_upsampled_pred aom_upsampled_pred_neon
void av1_apply_selfguided_restoration_c(const uint8_t *dat, int width, int height, int stride, int eps, const int *xqd, uint8_t *dst, int dst_stride, int32_t *tmpbuf, int bit_depth, int highbd);
void av1_apply_selfguided_restoration_neon(const uint8_t *dat, int width, int height, int stride, int eps, const int *xqd, uint8_t *dst, int dst_stride, int32_t *tmpbuf, int bit_depth, int highbd);
#define av1_apply_selfguided_restoration av1_apply_selfguided_restoration_neon
-void av1_apply_temporal_filter_c(const struct yv12_buffer_config *ref_frame, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, const uint8_t *pred, uint32_t *accum, uint16_t *count);
-void av1_apply_temporal_filter_neon(const struct yv12_buffer_config *ref_frame, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, const uint8_t *pred, uint32_t *accum, uint16_t *count);
+void av1_apply_temporal_filter_c(const struct yv12_buffer_config *frame_to_filter, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum, uint16_t *count);
+void av1_apply_temporal_filter_neon(const struct yv12_buffer_config *frame_to_filter, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum, uint16_t *count);
#define av1_apply_temporal_filter av1_apply_temporal_filter_neon
int64_t av1_block_error_c(const tran_low_t *coeff, const tran_low_t *dqcoeff, intptr_t block_size, int64_t *ssz);
@@ -150,10 +150,12 @@
#define av1_calc_frame_error av1_calc_frame_error_c
void av1_calc_indices_dim1_c(const int16_t *data, const int16_t *centroids, uint8_t *indices, int64_t *total_dist, int n, int k);
-#define av1_calc_indices_dim1 av1_calc_indices_dim1_c
+void av1_calc_indices_dim1_neon(const int16_t *data, const int16_t *centroids, uint8_t *indices, int64_t *total_dist, int n, int k);
+#define av1_calc_indices_dim1 av1_calc_indices_dim1_neon
void av1_calc_indices_dim2_c(const int16_t *data, const int16_t *centroids, uint8_t *indices, int64_t *total_dist, int n, int k);
-#define av1_calc_indices_dim2 av1_calc_indices_dim2_c
+void av1_calc_indices_dim2_neon(const int16_t *data, const int16_t *centroids, uint8_t *indices, int64_t *total_dist, int n, int k);
+#define av1_calc_indices_dim2 av1_calc_indices_dim2_neon
void av1_calc_proj_params_c( const uint8_t *src8, int width, int height, int src_stride, const uint8_t *dat8, int dat_stride, int32_t *flt0, int flt0_stride, int32_t *flt1, int flt1_stride, int64_t H[2][2], int64_t C[2], const sgr_params_type *params);
#define av1_calc_proj_params av1_calc_proj_params_c
@@ -179,7 +181,7 @@
bool av1_cnn_predict_c( const float **input, int in_width, int in_height, int in_stride, const CNN_CONFIG *cnn_config, const CNN_THREAD_DATA *thread_data, CNN_MULTI_OUT *output_struct);
#define av1_cnn_predict av1_cnn_predict_c
-void av1_compute_stats_c(int wiener_win, const uint8_t *dgd8, const uint8_t *src8, int h_start, int h_end, int v_start, int v_end, int dgd_stride, int src_stride, int64_t *M, int64_t *H, int use_downsampled_wiener_stats);
+void av1_compute_stats_c(int wiener_win, const uint8_t *dgd8, const uint8_t *src8, int16_t *dgd_avg, int16_t *src_avg, int h_start, int h_end, int v_start, int v_end, int dgd_stride, int src_stride, int64_t *M, int64_t *H, int use_downsampled_wiener_stats);
#define av1_compute_stats av1_compute_stats_c
void av1_compute_stats_highbd_c(int wiener_win, const uint8_t *dgd8, const uint8_t *src8, int h_start, int h_end, int v_start, int v_end, int dgd_stride, int src_stride, int64_t *M, int64_t *H, aom_bit_depth_t bit_depth);
@@ -231,6 +233,9 @@
void av1_dr_prediction_z3_neon(uint8_t *dst, ptrdiff_t stride, int bw, int bh, const uint8_t *above, const uint8_t *left, int upsample_left, int dx, int dy);
#define av1_dr_prediction_z3 av1_dr_prediction_z3_neon
+double av1_estimate_noise_from_single_plane_c(const uint8_t *src, int height, int width, int stride, int edge_thresh);
+#define av1_estimate_noise_from_single_plane av1_estimate_noise_from_single_plane_c
+
void av1_filter_intra_edge_c(uint8_t *p, int sz, int strength);
#define av1_filter_intra_edge av1_filter_intra_edge_c
@@ -332,7 +337,7 @@
void av1_get_nz_map_contexts_neon(const uint8_t *const levels, const int16_t *const scan, const uint16_t eob, const TX_SIZE tx_size, const TX_CLASS tx_class, int8_t *const coeff_contexts);
#define av1_get_nz_map_contexts av1_get_nz_map_contexts_neon
-void av1_highbd_apply_temporal_filter_c(const struct yv12_buffer_config *ref_frame, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, const uint8_t *pred, uint32_t *accum, uint16_t *count);
+void av1_highbd_apply_temporal_filter_c(const struct yv12_buffer_config *frame_to_filter, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum, uint16_t *count);
#define av1_highbd_apply_temporal_filter av1_highbd_apply_temporal_filter_c
int64_t av1_highbd_block_error_c(const tran_low_t *coeff, const tran_low_t *dqcoeff, intptr_t block_size, int64_t *ssz, int bd);
@@ -348,10 +353,12 @@
#define av1_highbd_convolve8_vert av1_highbd_convolve8_vert_c
void av1_highbd_convolve_2d_scale_c(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_x, const InterpFilterParams *filter_params_y, const int subpel_x_qn, const int x_step_qn, const int subpel_y_qn, const int y_step_qn, ConvolveParams *conv_params, int bd);
-#define av1_highbd_convolve_2d_scale av1_highbd_convolve_2d_scale_c
+void av1_highbd_convolve_2d_scale_neon(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_x, const InterpFilterParams *filter_params_y, const int subpel_x_qn, const int x_step_qn, const int subpel_y_qn, const int y_step_qn, ConvolveParams *conv_params, int bd);
+#define av1_highbd_convolve_2d_scale av1_highbd_convolve_2d_scale_neon
void av1_highbd_convolve_2d_sr_c(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_x, const InterpFilterParams *filter_params_y, const int subpel_x_qn, const int subpel_y_qn, ConvolveParams *conv_params, int bd);
-#define av1_highbd_convolve_2d_sr av1_highbd_convolve_2d_sr_c
+void av1_highbd_convolve_2d_sr_neon(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_x, const InterpFilterParams *filter_params_y, const int subpel_x_qn, const int subpel_y_qn, ConvolveParams *conv_params, int bd);
+#define av1_highbd_convolve_2d_sr av1_highbd_convolve_2d_sr_neon
void av1_highbd_convolve_avg_c(const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst, ptrdiff_t dst_stride, const int16_t *filter_x, int x_step_q4, const int16_t *filter_y, int y_step_q4, int w, int h, int bps);
#define av1_highbd_convolve_avg av1_highbd_convolve_avg_c
@@ -360,25 +367,32 @@
#define av1_highbd_convolve_copy av1_highbd_convolve_copy_c
void av1_highbd_convolve_horiz_rs_c(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const int16_t *x_filters, int x0_qn, int x_step_qn, int bd);
-#define av1_highbd_convolve_horiz_rs av1_highbd_convolve_horiz_rs_c
+void av1_highbd_convolve_horiz_rs_neon(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const int16_t *x_filters, int x0_qn, int x_step_qn, int bd);
+#define av1_highbd_convolve_horiz_rs av1_highbd_convolve_horiz_rs_neon
void av1_highbd_convolve_x_sr_c(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_x, const int subpel_x_qn, ConvolveParams *conv_params, int bd);
-#define av1_highbd_convolve_x_sr av1_highbd_convolve_x_sr_c
+void av1_highbd_convolve_x_sr_neon(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_x, const int subpel_x_qn, ConvolveParams *conv_params, int bd);
+#define av1_highbd_convolve_x_sr av1_highbd_convolve_x_sr_neon
void av1_highbd_convolve_y_sr_c(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_y, const int subpel_y_qn, int bd);
-#define av1_highbd_convolve_y_sr av1_highbd_convolve_y_sr_c
+void av1_highbd_convolve_y_sr_neon(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_y, const int subpel_y_qn, int bd);
+#define av1_highbd_convolve_y_sr av1_highbd_convolve_y_sr_neon
void av1_highbd_dist_wtd_convolve_2d_c(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_x, const InterpFilterParams *filter_params_y, const int subpel_x_qn, const int subpel_y_qn, ConvolveParams *conv_params, int bd);
-#define av1_highbd_dist_wtd_convolve_2d av1_highbd_dist_wtd_convolve_2d_c
+void av1_highbd_dist_wtd_convolve_2d_neon(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_x, const InterpFilterParams *filter_params_y, const int subpel_x_qn, const int subpel_y_qn, ConvolveParams *conv_params, int bd);
+#define av1_highbd_dist_wtd_convolve_2d av1_highbd_dist_wtd_convolve_2d_neon
void av1_highbd_dist_wtd_convolve_2d_copy_c(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, ConvolveParams *conv_params, int bd);
-#define av1_highbd_dist_wtd_convolve_2d_copy av1_highbd_dist_wtd_convolve_2d_copy_c
+void av1_highbd_dist_wtd_convolve_2d_copy_neon(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, ConvolveParams *conv_params, int bd);
+#define av1_highbd_dist_wtd_convolve_2d_copy av1_highbd_dist_wtd_convolve_2d_copy_neon
void av1_highbd_dist_wtd_convolve_x_c(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_x, const int subpel_x_qn, ConvolveParams *conv_params, int bd);
-#define av1_highbd_dist_wtd_convolve_x av1_highbd_dist_wtd_convolve_x_c
+void av1_highbd_dist_wtd_convolve_x_neon(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_x, const int subpel_x_qn, ConvolveParams *conv_params, int bd);
+#define av1_highbd_dist_wtd_convolve_x av1_highbd_dist_wtd_convolve_x_neon
void av1_highbd_dist_wtd_convolve_y_c(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_y, const int subpel_y_qn, ConvolveParams *conv_params, int bd);
-#define av1_highbd_dist_wtd_convolve_y av1_highbd_dist_wtd_convolve_y_c
+void av1_highbd_dist_wtd_convolve_y_neon(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w, int h, const InterpFilterParams *filter_params_y, const int subpel_y_qn, ConvolveParams *conv_params, int bd);
+#define av1_highbd_dist_wtd_convolve_y av1_highbd_dist_wtd_convolve_y_neon
void av1_highbd_dr_prediction_z1_c(uint16_t *dst, ptrdiff_t stride, int bw, int bh, const uint16_t *above, const uint16_t *left, int upsample_above, int dx, int dy, int bd);
#define av1_highbd_dr_prediction_z1 av1_highbd_dr_prediction_z1_c
@@ -389,9 +403,8 @@
void av1_highbd_dr_prediction_z3_c(uint16_t *dst, ptrdiff_t stride, int bw, int bh, const uint16_t *above, const uint16_t *left, int upsample_left, int dx, int dy, int bd);
#define av1_highbd_dr_prediction_z3 av1_highbd_dr_prediction_z3_c
-void av1_highbd_fwht4x4_c(const int16_t *input, tran_low_t *output, int stride);
-void av1_highbd_fwht4x4_neon(const int16_t *input, tran_low_t *output, int stride);
-#define av1_highbd_fwht4x4 av1_highbd_fwht4x4_neon
+double av1_highbd_estimate_noise_from_single_plane_c(const uint16_t *src, int height, int width, int stride, int bit_depth, int edge_thresh);
+#define av1_highbd_estimate_noise_from_single_plane av1_highbd_estimate_noise_from_single_plane_c
void av1_highbd_inv_txfm_add_c(const tran_low_t *input, uint8_t *dest, int stride, const TxfmParam *txfm_param);
void av1_highbd_inv_txfm_add_neon(const tran_low_t *input, uint8_t *dest, int stride, const TxfmParam *txfm_param);
diff --git a/config/config/aom_version.h b/config/config/aom_version.h
index 4586009..c3705db 100644
--- a/config/config/aom_version.h
+++ b/config/config/aom_version.h
@@ -10,10 +10,10 @@
*/
#define VERSION_MAJOR 3
-#define VERSION_MINOR 6
-#define VERSION_PATCH 1
-#define VERSION_EXTRA "226-g5cf4c68cb3"
+#define VERSION_MINOR 7
+#define VERSION_PATCH 0
+#define VERSION_EXTRA "273-g722272fc9"
#define VERSION_PACKED \
((VERSION_MAJOR << 16) | (VERSION_MINOR << 8) | (VERSION_PATCH))
-#define VERSION_STRING_NOSP "3.6.1-226-g5cf4c68cb3"
-#define VERSION_STRING " 3.6.1-226-g5cf4c68cb3"
+#define VERSION_STRING_NOSP "3.7.0-273-g722272fc9"
+#define VERSION_STRING " 3.7.0-273-g722272fc9"
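(Note: a standalone worked example of the VERSION_PACKED macro left
unchanged above: major in bits 16 and up, minor in bits 8-15, patch in
bits 0-7, so the new 3.7.0 packs to 0x030700.)

#include <stdio.h>

#define VERSION_MAJOR 3
#define VERSION_MINOR 7
#define VERSION_PATCH 0
#define VERSION_PACKED \
  ((VERSION_MAJOR << 16) | (VERSION_MINOR << 8) | (VERSION_PATCH))

int main(void) {
  /* Prints 0x030700 (198400); a single integer compare orders releases. */
  printf("0x%06x (%d)\n", VERSION_PACKED, VERSION_PACKED);
  return 0;
}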
diff --git a/config/riscv64/config/aom_config.asm b/config/riscv64/config/aom_config.asm
index b9c668a..02ff408 100644
--- a/config/riscv64/config/aom_config.asm
+++ b/config/riscv64/config/aom_config.asm
@@ -8,10 +8,11 @@
; Media Patent License 1.0 was not distributed with this source code in the
; PATENTS file, you can obtain it at www.aomedia.org/license/patent.
;
-ARCH_ARM equ 0
-ARCH_PPC equ 0
-ARCH_X86 equ 0
-ARCH_X86_64 equ 0
+AOM_ARCH_AARCH64 equ 0
+AOM_ARCH_ARM equ 0
+AOM_ARCH_PPC equ 0
+AOM_ARCH_X86 equ 0
+AOM_ARCH_X86_64 equ 0
CONFIG_ACCOUNTING equ 0
CONFIG_ANALYZER equ 0
CONFIG_AV1_DECODER equ 1
@@ -47,6 +48,7 @@
CONFIG_NORMAL_TILE_MODE equ 1
CONFIG_OPTICAL_FLOW_API equ 0
CONFIG_OS_SUPPORT equ 1
+CONFIG_OUTPUT_FRAME_SIZE equ 0
CONFIG_PARTITION_SEARCH_ORDER equ 0
CONFIG_PIC equ 1
CONFIG_RATECTRL_LOG equ 0
@@ -55,6 +57,7 @@
CONFIG_REALTIME_ONLY equ 0
CONFIG_RT_ML_PARTITIONING equ 0
CONFIG_RUNTIME_CPU_DETECT equ 0
+CONFIG_SALIENCY_MAP equ 0
CONFIG_SHARED equ 0
CONFIG_SIZE_LIMIT equ 1
CONFIG_SPATIAL_RESAMPLING equ 1
diff --git a/config/riscv64/config/aom_config.c b/config/riscv64/config/aom_config.c
index 14ddb81..07609ac 100644
--- a/config/riscv64/config/aom_config.c
+++ b/config/riscv64/config/aom_config.c
@@ -1,5 +1,5 @@
/*
- * Copyright (c) 2016, Alliance for Open Media. All rights reserved
+ * Copyright (c) 2023, Alliance for Open Media. All rights reserved
*
* This source code is subject to the terms of the BSD 2 Clause License and
* the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
diff --git a/config/riscv64/config/aom_config.h b/config/riscv64/config/aom_config.h
index e629873..91b6249 100644
--- a/config/riscv64/config/aom_config.h
+++ b/config/riscv64/config/aom_config.h
@@ -10,10 +10,11 @@
*/
#ifndef AOM_CONFIG_H_
#define AOM_CONFIG_H_
-#define ARCH_ARM 0
-#define ARCH_PPC 0
-#define ARCH_X86 0
-#define ARCH_X86_64 0
+#define AOM_ARCH_AARCH64 0
+#define AOM_ARCH_ARM 0
+#define AOM_ARCH_PPC 0
+#define AOM_ARCH_X86 0
+#define AOM_ARCH_X86_64 0
#define CONFIG_ACCOUNTING 0
#define CONFIG_ANALYZER 0
#define CONFIG_AV1_DECODER 1
@@ -49,6 +50,7 @@
#define CONFIG_NORMAL_TILE_MODE 1
#define CONFIG_OPTICAL_FLOW_API 0
#define CONFIG_OS_SUPPORT 1
+#define CONFIG_OUTPUT_FRAME_SIZE 0
#define CONFIG_PARTITION_SEARCH_ORDER 0
#define CONFIG_PIC 1
#define CONFIG_RATECTRL_LOG 0
@@ -57,6 +59,7 @@
#define CONFIG_REALTIME_ONLY 0
#define CONFIG_RT_ML_PARTITIONING 0
#define CONFIG_RUNTIME_CPU_DETECT 0
+#define CONFIG_SALIENCY_MAP 0
#define CONFIG_SHARED 0
#define CONFIG_SIZE_LIMIT 1
#define CONFIG_SPATIAL_RESAMPLING 1
diff --git a/config/riscv64/config/aom_dsp_rtcd.h b/config/riscv64/config/aom_dsp_rtcd.h
index 4a9a683..e724d0d 100644
--- a/config/riscv64/config/aom_dsp_rtcd.h
+++ b/config/riscv64/config/aom_dsp_rtcd.h
@@ -14,8 +14,8 @@
#include "aom/aom_integer.h"
#include "aom_dsp/aom_dsp_common.h"
-#include "av1/common/enums.h"
#include "av1/common/blockd.h"
+#include "av1/common/enums.h"
#ifdef __cplusplus
@@ -46,6 +46,9 @@
void aom_comp_mask_pred_c(uint8_t *comp_pred, const uint8_t *pred, int width, int height, const uint8_t *ref, int ref_stride, const uint8_t *mask, int mask_stride, int invert_mask);
#define aom_comp_mask_pred aom_comp_mask_pred_c
+void aom_compute_flow_at_point_c(const uint8_t *src, const uint8_t *ref, int x, int y, int width, int height, int stride, double *u, double *v);
+#define aom_compute_flow_at_point aom_compute_flow_at_point_c
+
void aom_convolve8_c(const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst, ptrdiff_t dst_stride, const InterpKernel *filter, int x0_q4, int x_step_q4, int y0_q4, int y_step_q4, int w, int h);
#define aom_convolve8 aom_convolve8_c
@@ -427,9 +430,6 @@
void aom_fdct4x4_lp_c(const int16_t *input, int16_t *output, int stride);
#define aom_fdct4x4_lp aom_fdct4x4_lp_c
-void aom_fdct8x8_c(const int16_t *input, tran_low_t *output, int stride);
-#define aom_fdct8x8 aom_fdct8x8_c
-
void aom_fft16x16_float_c(const float *input, float *temp, float *output);
#define aom_fft16x16_float aom_fft16x16_float_c
@@ -445,15 +445,6 @@
void aom_fft8x8_float_c(const float *input, float *temp, float *output);
#define aom_fft8x8_float aom_fft8x8_float_c
-void aom_get16x16var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_get16x16var aom_get16x16var_c
-
-unsigned int aom_get4x4sse_cs_c(const unsigned char *src_ptr, int source_stride, const unsigned char *ref_ptr, int ref_stride);
-#define aom_get4x4sse_cs aom_get4x4sse_cs_c
-
-void aom_get8x8var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_get8x8var aom_get8x8var_c
-
void aom_get_blk_sse_sum_c(const int16_t *data, int stride, int bw, int bh, int *x_sum, int64_t *x2_sum);
#define aom_get_blk_sse_sum aom_get_blk_sse_sum_c
@@ -610,12 +601,6 @@
uint32_t aom_highbd_10_dist_wtd_sub_pixel_avg_variance8x8_c(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param);
#define aom_highbd_10_dist_wtd_sub_pixel_avg_variance8x8 aom_highbd_10_dist_wtd_sub_pixel_avg_variance8x8_c
-void aom_highbd_10_get16x16var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_10_get16x16var aom_highbd_10_get16x16var_c
-
-void aom_highbd_10_get8x8var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_10_get8x8var aom_highbd_10_get8x8var_c
-
unsigned int aom_highbd_10_masked_sub_pixel_variance128x128_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
#define aom_highbd_10_masked_sub_pixel_variance128x128 aom_highbd_10_masked_sub_pixel_variance128x128_c
@@ -970,10 +955,10 @@
unsigned int aom_highbd_10_variance16x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance16x32 aom_highbd_10_variance16x32_c
-unsigned int aom_highbd_10_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance16x4 aom_highbd_10_variance16x4_c
-unsigned int aom_highbd_10_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance16x64 aom_highbd_10_variance16x64_c
unsigned int aom_highbd_10_variance16x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -994,10 +979,10 @@
unsigned int aom_highbd_10_variance32x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance32x64 aom_highbd_10_variance32x64_c
-unsigned int aom_highbd_10_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance32x8 aom_highbd_10_variance32x8_c
-unsigned int aom_highbd_10_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance4x16 aom_highbd_10_variance4x16_c
unsigned int aom_highbd_10_variance4x2_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1012,7 +997,7 @@
unsigned int aom_highbd_10_variance64x128_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance64x128 aom_highbd_10_variance64x128_c
-unsigned int aom_highbd_10_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance64x16 aom_highbd_10_variance64x16_c
unsigned int aom_highbd_10_variance64x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1024,7 +1009,7 @@
unsigned int aom_highbd_10_variance8x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance8x16 aom_highbd_10_variance8x16_c
-unsigned int aom_highbd_10_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance8x32 aom_highbd_10_variance8x32_c
unsigned int aom_highbd_10_variance8x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1099,12 +1084,6 @@
uint32_t aom_highbd_12_dist_wtd_sub_pixel_avg_variance8x8_c(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param);
#define aom_highbd_12_dist_wtd_sub_pixel_avg_variance8x8 aom_highbd_12_dist_wtd_sub_pixel_avg_variance8x8_c
-void aom_highbd_12_get16x16var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_12_get16x16var aom_highbd_12_get16x16var_c
-
-void aom_highbd_12_get8x8var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_12_get8x8var aom_highbd_12_get8x8var_c
-
unsigned int aom_highbd_12_masked_sub_pixel_variance128x128_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
#define aom_highbd_12_masked_sub_pixel_variance128x128 aom_highbd_12_masked_sub_pixel_variance128x128_c
@@ -1459,10 +1438,10 @@
unsigned int aom_highbd_12_variance16x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance16x32 aom_highbd_12_variance16x32_c
-unsigned int aom_highbd_12_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_12_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance16x4 aom_highbd_12_variance16x4_c
-unsigned int aom_highbd_12_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_12_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance16x64 aom_highbd_12_variance16x64_c
unsigned int aom_highbd_12_variance16x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1483,10 +1462,10 @@
unsigned int aom_highbd_12_variance32x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance32x64 aom_highbd_12_variance32x64_c
-unsigned int aom_highbd_12_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_12_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance32x8 aom_highbd_12_variance32x8_c
-unsigned int aom_highbd_12_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_12_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance4x16 aom_highbd_12_variance4x16_c
unsigned int aom_highbd_12_variance4x2_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1501,7 +1480,7 @@
unsigned int aom_highbd_12_variance64x128_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance64x128 aom_highbd_12_variance64x128_c
-unsigned int aom_highbd_12_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_12_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance64x16 aom_highbd_12_variance64x16_c
unsigned int aom_highbd_12_variance64x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1513,7 +1492,7 @@
unsigned int aom_highbd_12_variance8x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance8x16 aom_highbd_12_variance8x16_c
-unsigned int aom_highbd_12_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_12_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance8x32 aom_highbd_12_variance8x32_c
unsigned int aom_highbd_12_variance8x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1588,12 +1567,6 @@
uint32_t aom_highbd_8_dist_wtd_sub_pixel_avg_variance8x8_c(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param);
#define aom_highbd_8_dist_wtd_sub_pixel_avg_variance8x8 aom_highbd_8_dist_wtd_sub_pixel_avg_variance8x8_c
-void aom_highbd_8_get16x16var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_8_get16x16var aom_highbd_8_get16x16var_c
-
-void aom_highbd_8_get8x8var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_8_get8x8var aom_highbd_8_get8x8var_c
-
unsigned int aom_highbd_8_masked_sub_pixel_variance128x128_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
#define aom_highbd_8_masked_sub_pixel_variance128x128 aom_highbd_8_masked_sub_pixel_variance128x128_c
@@ -1816,10 +1789,10 @@
unsigned int aom_highbd_8_variance16x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance16x32 aom_highbd_8_variance16x32_c
-unsigned int aom_highbd_8_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_8_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance16x4 aom_highbd_8_variance16x4_c
-unsigned int aom_highbd_8_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_8_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance16x64 aom_highbd_8_variance16x64_c
unsigned int aom_highbd_8_variance16x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1840,10 +1813,10 @@
unsigned int aom_highbd_8_variance32x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance32x64 aom_highbd_8_variance32x64_c
-unsigned int aom_highbd_8_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_8_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance32x8 aom_highbd_8_variance32x8_c
-unsigned int aom_highbd_8_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_8_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance4x16 aom_highbd_8_variance4x16_c
unsigned int aom_highbd_8_variance4x2_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1858,7 +1831,7 @@
unsigned int aom_highbd_8_variance64x128_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance64x128 aom_highbd_8_variance64x128_c
-unsigned int aom_highbd_8_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_8_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance64x16 aom_highbd_8_variance64x16_c
unsigned int aom_highbd_8_variance64x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1870,7 +1843,7 @@
unsigned int aom_highbd_8_variance8x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance8x16 aom_highbd_8_variance8x16_c
-unsigned int aom_highbd_8_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_8_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance8x32 aom_highbd_8_variance8x32_c
unsigned int aom_highbd_8_variance8x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -2209,9 +2182,6 @@
unsigned int aom_highbd_dist_wtd_sad8x8_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param);
#define aom_highbd_dist_wtd_sad8x8_avg aom_highbd_dist_wtd_sad8x8_avg_c
-void aom_highbd_fdct8x8_c(const int16_t *input, tran_low_t *output, int stride);
-#define aom_highbd_fdct8x8 aom_highbd_fdct8x8_c
-
void aom_highbd_h_predictor_16x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
#define aom_highbd_h_predictor_16x16 aom_highbd_h_predictor_16x16_c
@@ -3874,9 +3844,6 @@
void aom_paeth_predictor_8x8_c(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_paeth_predictor_8x8 aom_paeth_predictor_8x8_c
-void aom_pixel_scale_c(const int16_t *src_diff, ptrdiff_t src_stride, int16_t *coeff, int log_scale, int h8, int w8);
-#define aom_pixel_scale aom_pixel_scale_c
-
void aom_quantize_b_c(const tran_low_t *coeff_ptr, intptr_t n_coeffs, const int16_t *zbin_ptr, const int16_t *round_ptr, const int16_t *quant_ptr, const int16_t *quant_shift_ptr, tran_low_t *qcoeff_ptr, tran_low_t *dqcoeff_ptr, const int16_t *dequant_ptr, uint16_t *eob_ptr, const int16_t *scan, const int16_t *iscan);
#define aom_quantize_b aom_quantize_b_c
@@ -3907,9 +3874,6 @@
void aom_sad128x128x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad128x128x4d aom_sad128x128x4d_c
-void aom_sad128x128x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad128x128x4d_avg aom_sad128x128x4d_avg_c
-
unsigned int aom_sad128x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad128x64 aom_sad128x64_c
@@ -3922,12 +3886,6 @@
void aom_sad128x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad128x64x4d aom_sad128x64x4d_c
-void aom_sad128x64x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad128x64x4d_avg aom_sad128x64x4d_avg_c
-
-unsigned int aom_sad128xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad128xh aom_sad128xh_c
-
unsigned int aom_sad16x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x16 aom_sad16x16_c
@@ -3940,9 +3898,6 @@
void aom_sad16x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x16x4d aom_sad16x16x4d_c
-void aom_sad16x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x16x4d_avg aom_sad16x16x4d_avg_c
-
unsigned int aom_sad16x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x32 aom_sad16x32_c
@@ -3955,9 +3910,6 @@
void aom_sad16x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x32x4d aom_sad16x32x4d_c
-void aom_sad16x32x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x32x4d_avg aom_sad16x32x4d_avg_c
-
unsigned int aom_sad16x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x4 aom_sad16x4_c
@@ -3970,9 +3922,6 @@
void aom_sad16x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x4x4d aom_sad16x4x4d_c
-void aom_sad16x4x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x4x4d_avg aom_sad16x4x4d_avg_c
-
unsigned int aom_sad16x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x64 aom_sad16x64_c
@@ -3985,9 +3934,6 @@
void aom_sad16x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x64x4d aom_sad16x64x4d_c
-void aom_sad16x64x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x64x4d_avg aom_sad16x64x4d_avg_c
-
unsigned int aom_sad16x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x8 aom_sad16x8_c
@@ -4000,12 +3946,6 @@
void aom_sad16x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x8x4d aom_sad16x8x4d_c
-void aom_sad16x8x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x8x4d_avg aom_sad16x8x4d_avg_c
-
-unsigned int aom_sad16xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad16xh aom_sad16xh_c
-
unsigned int aom_sad32x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad32x16 aom_sad32x16_c
@@ -4018,9 +3958,6 @@
void aom_sad32x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad32x16x4d aom_sad32x16x4d_c
-void aom_sad32x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad32x16x4d_avg aom_sad32x16x4d_avg_c
-
unsigned int aom_sad32x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad32x32 aom_sad32x32_c
@@ -4033,9 +3970,6 @@
void aom_sad32x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad32x32x4d aom_sad32x32x4d_c
-void aom_sad32x32x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad32x32x4d_avg aom_sad32x32x4d_avg_c
-
unsigned int aom_sad32x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad32x64 aom_sad32x64_c
@@ -4048,9 +3982,6 @@
void aom_sad32x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad32x64x4d aom_sad32x64x4d_c
-void aom_sad32x64x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad32x64x4d_avg aom_sad32x64x4d_avg_c
-
unsigned int aom_sad32x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad32x8 aom_sad32x8_c
@@ -4063,12 +3994,6 @@
void aom_sad32x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad32x8x4d aom_sad32x8x4d_c
-void aom_sad32x8x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad32x8x4d_avg aom_sad32x8x4d_avg_c
-
-unsigned int aom_sad32xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad32xh aom_sad32xh_c
-
unsigned int aom_sad4x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad4x16 aom_sad4x16_c
@@ -4081,9 +4006,6 @@
void aom_sad4x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad4x16x4d aom_sad4x16x4d_c
-void aom_sad4x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad4x16x4d_avg aom_sad4x16x4d_avg_c
-
unsigned int aom_sad4x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad4x4 aom_sad4x4_c
@@ -4096,9 +4018,6 @@
void aom_sad4x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad4x4x4d aom_sad4x4x4d_c
-void aom_sad4x4x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad4x4x4d_avg aom_sad4x4x4d_avg_c
-
unsigned int aom_sad4x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad4x8 aom_sad4x8_c
@@ -4111,12 +4030,6 @@
void aom_sad4x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad4x8x4d aom_sad4x8x4d_c
-void aom_sad4x8x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad4x8x4d_avg aom_sad4x8x4d_avg_c
-
-unsigned int aom_sad4xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad4xh aom_sad4xh_c
-
unsigned int aom_sad64x128_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad64x128 aom_sad64x128_c
@@ -4129,9 +4042,6 @@
void aom_sad64x128x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad64x128x4d aom_sad64x128x4d_c
-void aom_sad64x128x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad64x128x4d_avg aom_sad64x128x4d_avg_c
-
unsigned int aom_sad64x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad64x16 aom_sad64x16_c
@@ -4144,9 +4054,6 @@
void aom_sad64x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad64x16x4d aom_sad64x16x4d_c
-void aom_sad64x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad64x16x4d_avg aom_sad64x16x4d_avg_c
-
unsigned int aom_sad64x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad64x32 aom_sad64x32_c
@@ -4159,9 +4066,6 @@
void aom_sad64x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad64x32x4d aom_sad64x32x4d_c
-void aom_sad64x32x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad64x32x4d_avg aom_sad64x32x4d_avg_c
-
unsigned int aom_sad64x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad64x64 aom_sad64x64_c
@@ -4174,12 +4078,6 @@
void aom_sad64x64x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad64x64x4d aom_sad64x64x4d_c
-void aom_sad64x64x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad64x64x4d_avg aom_sad64x64x4d_avg_c
-
-unsigned int aom_sad64xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad64xh aom_sad64xh_c
-
unsigned int aom_sad8x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad8x16 aom_sad8x16_c
@@ -4192,9 +4090,6 @@
void aom_sad8x16x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad8x16x4d aom_sad8x16x4d_c
-void aom_sad8x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad8x16x4d_avg aom_sad8x16x4d_avg_c
-
unsigned int aom_sad8x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad8x32 aom_sad8x32_c
@@ -4207,9 +4102,6 @@
void aom_sad8x32x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad8x32x4d aom_sad8x32x4d_c
-void aom_sad8x32x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad8x32x4d_avg aom_sad8x32x4d_avg_c
-
unsigned int aom_sad8x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad8x4 aom_sad8x4_c
@@ -4222,9 +4114,6 @@
void aom_sad8x4x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad8x4x4d aom_sad8x4x4d_c
-void aom_sad8x4x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad8x4x4d_avg aom_sad8x4x4d_avg_c
-
unsigned int aom_sad8x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad8x8 aom_sad8x8_c
@@ -4237,12 +4126,6 @@
void aom_sad8x8x4d_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad8x8x4d aom_sad8x8x4d_c
-void aom_sad8x8x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad8x8x4d_avg aom_sad8x8x4d_avg_c
-
-unsigned int aom_sad8xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad8xh aom_sad8xh_c
-
unsigned int aom_sad_skip_128x128_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad_skip_128x128 aom_sad_skip_128x128_c
@@ -4846,7 +4729,7 @@
int aom_vector_var_c(const int16_t *ref, const int16_t *src, int bwl);
#define aom_vector_var aom_vector_var_c
-double av1_compute_cross_correlation_c(unsigned char *im1, int stride1, int x1, int y1, unsigned char *im2, int stride2, int x2, int y2);
+double av1_compute_cross_correlation_c(const unsigned char *frame1, int stride1, int x1, int y1, const unsigned char *frame2, int stride2, int x2, int y2);
#define av1_compute_cross_correlation av1_compute_cross_correlation_c
void aom_dsp_rtcd(void);
diff --git a/config/riscv64/config/aom_scale_rtcd.h b/config/riscv64/config/aom_scale_rtcd.h
index 69d50c9..733b2d9 100644
--- a/config/riscv64/config/aom_scale_rtcd.h
+++ b/config/riscv64/config/aom_scale_rtcd.h
@@ -80,7 +80,7 @@
void aom_yv12_partial_copy_y_c(const struct yv12_buffer_config *src_ybc, int hstart1, int hend1, int vstart1, int vend1, struct yv12_buffer_config *dst_ybc, int hstart2, int vstart2);
#define aom_yv12_partial_copy_y aom_yv12_partial_copy_y_c
-int aom_yv12_realloc_with_new_border_c(struct yv12_buffer_config *ybf, int new_border, int byte_alignment, int num_planes);
+int aom_yv12_realloc_with_new_border_c(struct yv12_buffer_config *ybf, int new_border, int byte_alignment, int num_pyramid_levels, int num_planes);
#define aom_yv12_realloc_with_new_border aom_yv12_realloc_with_new_border_c
void aom_scale_rtcd(void);
diff --git a/config/riscv64/config/av1_rtcd.h b/config/riscv64/config/av1_rtcd.h
index 01da23f..3d406ef 100644
--- a/config/riscv64/config/av1_rtcd.h
+++ b/config/riscv64/config/av1_rtcd.h
@@ -15,12 +15,12 @@
#include "aom/aom_integer.h"
#include "aom_dsp/odintrin.h"
#include "aom_dsp/txfm_common.h"
-#include "av1/common/common.h"
-#include "av1/common/enums.h"
-#include "av1/common/quant_common.h"
-#include "av1/common/filter.h"
-#include "av1/common/convolve.h"
#include "av1/common/av1_txfm.h"
+#include "av1/common/common.h"
+#include "av1/common/convolve.h"
+#include "av1/common/enums.h"
+#include "av1/common/filter.h"
+#include "av1/common/quant_common.h"
#include "av1/common/restoration.h"
struct macroblockd;
@@ -82,13 +82,6 @@
int ref_stride, int subpel_search);
#define aom_comp_avg_upsampled_pred aom_comp_avg_upsampled_pred_c
-void aom_comp_mask_upsampled_pred_c(MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
- const MV *const mv, uint8_t *comp_pred, const uint8_t *pred, int width,
- int height, int subpel_x_q3, int subpel_y_q3, const uint8_t *ref,
- int ref_stride, const uint8_t *mask, int mask_stride, int invert_mask,
- int subpel_search);
-#define aom_comp_mask_upsampled_pred aom_comp_mask_upsampled_pred_c
-
void aom_dist_wtd_comp_avg_upsampled_pred_c(MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
const MV *const mv, uint8_t *comp_pred, const uint8_t *pred, int width,
int height, int subpel_x_q3, int subpel_y_q3, const uint8_t *ref,
@@ -122,7 +115,7 @@
void av1_apply_selfguided_restoration_c(const uint8_t *dat, int width, int height, int stride, int eps, const int *xqd, uint8_t *dst, int dst_stride, int32_t *tmpbuf, int bit_depth, int highbd);
#define av1_apply_selfguided_restoration av1_apply_selfguided_restoration_c
-void av1_apply_temporal_filter_c(const struct yv12_buffer_config *ref_frame, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, const uint8_t *pred, uint32_t *accum, uint16_t *count);
+void av1_apply_temporal_filter_c(const struct yv12_buffer_config *frame_to_filter, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum, uint16_t *count);
#define av1_apply_temporal_filter av1_apply_temporal_filter_c
int64_t av1_block_error_c(const tran_low_t *coeff, const tran_low_t *dqcoeff, intptr_t block_size, int64_t *ssz);
@@ -173,7 +166,7 @@
bool av1_cnn_predict_c( const float **input, int in_width, int in_height, int in_stride, const CNN_CONFIG *cnn_config, const CNN_THREAD_DATA *thread_data, CNN_MULTI_OUT *output_struct);
#define av1_cnn_predict av1_cnn_predict_c
-void av1_compute_stats_c(int wiener_win, const uint8_t *dgd8, const uint8_t *src8, int h_start, int h_end, int v_start, int v_end, int dgd_stride, int src_stride, int64_t *M, int64_t *H, int use_downsampled_wiener_stats);
+void av1_compute_stats_c(int wiener_win, const uint8_t *dgd8, const uint8_t *src8, int16_t *dgd_avg, int16_t *src_avg, int h_start, int h_end, int v_start, int v_end, int dgd_stride, int src_stride, int64_t *M, int64_t *H, int use_downsampled_wiener_stats);
#define av1_compute_stats av1_compute_stats_c
void av1_compute_stats_highbd_c(int wiener_win, const uint8_t *dgd8, const uint8_t *src8, int h_start, int h_end, int v_start, int v_end, int dgd_stride, int src_stride, int64_t *M, int64_t *H, aom_bit_depth_t bit_depth);
@@ -215,6 +208,9 @@
void av1_dr_prediction_z3_c(uint8_t *dst, ptrdiff_t stride, int bw, int bh, const uint8_t *above, const uint8_t *left, int upsample_left, int dx, int dy);
#define av1_dr_prediction_z3 av1_dr_prediction_z3_c
+double av1_estimate_noise_from_single_plane_c(const uint8_t *src, int height, int width, int stride, int edge_thresh);
+#define av1_estimate_noise_from_single_plane av1_estimate_noise_from_single_plane_c
+
void av1_filter_intra_edge_c(uint8_t *p, int sz, int strength);
#define av1_filter_intra_edge av1_filter_intra_edge_c
@@ -293,7 +289,7 @@
void av1_get_nz_map_contexts_c(const uint8_t *const levels, const int16_t *const scan, const uint16_t eob, const TX_SIZE tx_size, const TX_CLASS tx_class, int8_t *const coeff_contexts);
#define av1_get_nz_map_contexts av1_get_nz_map_contexts_c
-void av1_highbd_apply_temporal_filter_c(const struct yv12_buffer_config *ref_frame, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, const uint8_t *pred, uint32_t *accum, uint16_t *count);
+void av1_highbd_apply_temporal_filter_c(const struct yv12_buffer_config *frame_to_filter, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum, uint16_t *count);
#define av1_highbd_apply_temporal_filter av1_highbd_apply_temporal_filter_c
int64_t av1_highbd_block_error_c(const tran_low_t *coeff, const tran_low_t *dqcoeff, intptr_t block_size, int64_t *ssz, int bd);
@@ -350,8 +346,8 @@
void av1_highbd_dr_prediction_z3_c(uint16_t *dst, ptrdiff_t stride, int bw, int bh, const uint16_t *above, const uint16_t *left, int upsample_left, int dx, int dy, int bd);
#define av1_highbd_dr_prediction_z3 av1_highbd_dr_prediction_z3_c
-void av1_highbd_fwht4x4_c(const int16_t *input, tran_low_t *output, int stride);
-#define av1_highbd_fwht4x4 av1_highbd_fwht4x4_c
+double av1_highbd_estimate_noise_from_single_plane_c(const uint16_t *src, int height, int width, int stride, int bit_depth, int edge_thresh);
+#define av1_highbd_estimate_noise_from_single_plane av1_highbd_estimate_noise_from_single_plane_c
void av1_highbd_inv_txfm_add_c(const tran_low_t *input, uint8_t *dest, int stride, const TxfmParam *txfm_param);
#define av1_highbd_inv_txfm_add av1_highbd_inv_txfm_add_c
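Editor's note: two API shifts stand out in the av1_rtcd.h hunks above. First, av1_apply_temporal_filter and av1_highbd_apply_temporal_filter each gain an int tf_wgt_calc_lvl parameter, so every caller needs updating. Second, noise estimation becomes an RTCD hook (av1_estimate_noise_from_single_plane and its high-bitdepth twin), replacing the dropped av1_highbd_fwht4x4 slot in this header. A small sketch of the new noise hook, assuming plane describes a valid 8-bit luma plane; the edge_thresh value of 50 is an assumption modeled on the encoder's temporal-filter default, not something this diff specifies:

    #include <stdint.h>

    /* Declaration and alias as in the hunk above; note the header's
     * (height, width) parameter order. */
    double av1_estimate_noise_from_single_plane_c(const uint8_t *src,
                                                  int height, int width,
                                                  int stride, int edge_thresh);
    #define av1_estimate_noise_from_single_plane \
      av1_estimate_noise_from_single_plane_c

    /* Hypothetical helper: estimated noise level of one luma plane. */
    static double luma_noise(const uint8_t *plane, int width, int height,
                             int stride) {
      return av1_estimate_noise_from_single_plane(plane, height, width,
                                                  stride, /*edge_thresh=*/50);
    }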
diff --git a/config/x86/config/aom_config.asm b/config/x86/config/aom_config.asm
index 0a256ea..e01202e 100644
--- a/config/x86/config/aom_config.asm
+++ b/config/x86/config/aom_config.asm
@@ -1,7 +1,8 @@
-%define ARCH_ARM 0
-%define ARCH_PPC 0
-%define ARCH_X86 1
-%define ARCH_X86_64 0
+%define AOM_ARCH_AARCH64 0
+%define AOM_ARCH_ARM 0
+%define AOM_ARCH_PPC 0
+%define AOM_ARCH_X86 1
+%define AOM_ARCH_X86_64 0
%define CONFIG_ACCOUNTING 0
%define CONFIG_ANALYZER 0
%define CONFIG_AV1_DECODER 1
@@ -37,6 +38,7 @@
%define CONFIG_NORMAL_TILE_MODE 1
%define CONFIG_OPTICAL_FLOW_API 0
%define CONFIG_OS_SUPPORT 1
+%define CONFIG_OUTPUT_FRAME_SIZE 0
%define CONFIG_PARTITION_SEARCH_ORDER 0
%define CONFIG_PIC 1
%define CONFIG_RATECTRL_LOG 0
@@ -45,6 +47,7 @@
%define CONFIG_REALTIME_ONLY 0
%define CONFIG_RT_ML_PARTITIONING 0
%define CONFIG_RUNTIME_CPU_DETECT 0
+%define CONFIG_SALIENCY_MAP 0
%define CONFIG_SHARED 0
%define CONFIG_SIZE_LIMIT 1
%define CONFIG_SPATIAL_RESAMPLING 1
diff --git a/config/x86/config/aom_config.c b/config/x86/config/aom_config.c
index d81f6b9..9873194 100644
--- a/config/x86/config/aom_config.c
+++ b/config/x86/config/aom_config.c
@@ -1,5 +1,5 @@
/*
- * Copyright (c) 2016, Alliance for Open Media. All rights reserved
+ * Copyright (c) 2023, Alliance for Open Media. All rights reserved
*
* This source code is subject to the terms of the BSD 2 Clause License and
* the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
diff --git a/config/x86/config/aom_config.h b/config/x86/config/aom_config.h
index 55e58ee..502262e 100644
--- a/config/x86/config/aom_config.h
+++ b/config/x86/config/aom_config.h
@@ -10,10 +10,11 @@
*/
#ifndef AOM_CONFIG_H_
#define AOM_CONFIG_H_
-#define ARCH_ARM 0
-#define ARCH_PPC 0
-#define ARCH_X86 1
-#define ARCH_X86_64 0
+#define AOM_ARCH_AARCH64 0
+#define AOM_ARCH_ARM 0
+#define AOM_ARCH_PPC 0
+#define AOM_ARCH_X86 1
+#define AOM_ARCH_X86_64 0
#define CONFIG_ACCOUNTING 0
#define CONFIG_ANALYZER 0
#define CONFIG_AV1_DECODER 1
@@ -49,6 +50,7 @@
#define CONFIG_NORMAL_TILE_MODE 1
#define CONFIG_OPTICAL_FLOW_API 0
#define CONFIG_OS_SUPPORT 1
+#define CONFIG_OUTPUT_FRAME_SIZE 0
#define CONFIG_PARTITION_SEARCH_ORDER 0
#define CONFIG_PIC 1
#define CONFIG_RATECTRL_LOG 0
@@ -57,6 +59,7 @@
#define CONFIG_REALTIME_ONLY 0
#define CONFIG_RT_ML_PARTITIONING 0
#define CONFIG_RUNTIME_CPU_DETECT 0
+#define CONFIG_SALIENCY_MAP 0
#define CONFIG_SHARED 0
#define CONFIG_SIZE_LIMIT 1
#define CONFIG_SPATIAL_RESAMPLING 1
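Editor's note: the aom_config.h hunks mirror the .asm ones. The unprefixed ARCH_* macros are replaced by AOM_ARCH_* names (with a new AOM_ARCH_AARCH64), and CONFIG_OUTPUT_FRAME_SIZE and CONFIG_SALIENCY_MAP are added, both 0 in this x86 configuration. Downstream code still guarding on the old names will see the macro as undefined and silently fall off its fast path, so guards need updating. A minimal sketch, assuming the generated header is on the include path; LOCAL_VEC_BYTES is a hypothetical name:

    #include "config/aom_config.h"

    /* Old-style ARCH_X86 guards no longer match anything this header
     * defines; only the AOM_-prefixed names exist now. */
    #if AOM_ARCH_X86 || AOM_ARCH_X86_64
    #  define LOCAL_VEC_BYTES 16
    #else
    #  define LOCAL_VEC_BYTES 8
    #endif

    #if CONFIG_SALIENCY_MAP
    /* Saliency-map tuning compiled in (0, i.e. disabled, in this config). */
    #endif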
diff --git a/config/x86/config/aom_dsp_rtcd.h b/config/x86/config/aom_dsp_rtcd.h
index a259b8f..4521b9d 100644
--- a/config/x86/config/aom_dsp_rtcd.h
+++ b/config/x86/config/aom_dsp_rtcd.h
@@ -14,8 +14,8 @@
#include "aom/aom_integer.h"
#include "aom_dsp/aom_dsp_common.h"
-#include "av1/common/enums.h"
#include "av1/common/blockd.h"
+#include "av1/common/enums.h"
#ifdef __cplusplus
@@ -50,6 +50,9 @@
void aom_comp_mask_pred_ssse3(uint8_t *comp_pred, const uint8_t *pred, int width, int height, const uint8_t *ref, int ref_stride, const uint8_t *mask, int mask_stride, int invert_mask);
#define aom_comp_mask_pred aom_comp_mask_pred_ssse3
+void aom_compute_flow_at_point_c(const uint8_t *src, const uint8_t *ref, int x, int y, int width, int height, int stride, double *u, double *v);
+#define aom_compute_flow_at_point aom_compute_flow_at_point_c
+
void aom_convolve8_c(const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst, ptrdiff_t dst_stride, const InterpKernel *filter, int x0_q4, int x_step_q4, int y0_q4, int y_step_q4, int w, int h);
#define aom_convolve8 aom_convolve8_c
@@ -376,92 +379,92 @@
#define aom_dist_wtd_comp_avg_pred aom_dist_wtd_comp_avg_pred_ssse3
unsigned int aom_dist_wtd_sad128x128_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad128x128_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad128x128_avg aom_dist_wtd_sad128x128_avg_ssse3
+unsigned int aom_dist_wtd_sad128x128_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad128x128_avg aom_dist_wtd_sad128x128_avg_sse2
unsigned int aom_dist_wtd_sad128x64_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad128x64_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad128x64_avg aom_dist_wtd_sad128x64_avg_ssse3
+unsigned int aom_dist_wtd_sad128x64_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad128x64_avg aom_dist_wtd_sad128x64_avg_sse2
unsigned int aom_dist_wtd_sad16x16_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad16x16_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad16x16_avg aom_dist_wtd_sad16x16_avg_ssse3
+unsigned int aom_dist_wtd_sad16x16_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad16x16_avg aom_dist_wtd_sad16x16_avg_sse2
unsigned int aom_dist_wtd_sad16x32_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad16x32_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad16x32_avg aom_dist_wtd_sad16x32_avg_ssse3
+unsigned int aom_dist_wtd_sad16x32_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad16x32_avg aom_dist_wtd_sad16x32_avg_sse2
unsigned int aom_dist_wtd_sad16x4_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad16x4_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad16x4_avg aom_dist_wtd_sad16x4_avg_ssse3
+unsigned int aom_dist_wtd_sad16x4_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad16x4_avg aom_dist_wtd_sad16x4_avg_sse2
unsigned int aom_dist_wtd_sad16x64_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad16x64_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad16x64_avg aom_dist_wtd_sad16x64_avg_ssse3
+unsigned int aom_dist_wtd_sad16x64_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad16x64_avg aom_dist_wtd_sad16x64_avg_sse2
unsigned int aom_dist_wtd_sad16x8_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad16x8_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad16x8_avg aom_dist_wtd_sad16x8_avg_ssse3
+unsigned int aom_dist_wtd_sad16x8_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad16x8_avg aom_dist_wtd_sad16x8_avg_sse2
unsigned int aom_dist_wtd_sad32x16_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad32x16_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad32x16_avg aom_dist_wtd_sad32x16_avg_ssse3
+unsigned int aom_dist_wtd_sad32x16_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad32x16_avg aom_dist_wtd_sad32x16_avg_sse2
unsigned int aom_dist_wtd_sad32x32_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad32x32_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad32x32_avg aom_dist_wtd_sad32x32_avg_ssse3
+unsigned int aom_dist_wtd_sad32x32_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad32x32_avg aom_dist_wtd_sad32x32_avg_sse2
unsigned int aom_dist_wtd_sad32x64_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad32x64_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad32x64_avg aom_dist_wtd_sad32x64_avg_ssse3
+unsigned int aom_dist_wtd_sad32x64_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad32x64_avg aom_dist_wtd_sad32x64_avg_sse2
unsigned int aom_dist_wtd_sad32x8_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad32x8_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad32x8_avg aom_dist_wtd_sad32x8_avg_ssse3
+unsigned int aom_dist_wtd_sad32x8_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad32x8_avg aom_dist_wtd_sad32x8_avg_sse2
unsigned int aom_dist_wtd_sad4x16_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad4x16_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad4x16_avg aom_dist_wtd_sad4x16_avg_ssse3
+unsigned int aom_dist_wtd_sad4x16_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad4x16_avg aom_dist_wtd_sad4x16_avg_sse2
unsigned int aom_dist_wtd_sad4x4_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad4x4_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad4x4_avg aom_dist_wtd_sad4x4_avg_ssse3
+unsigned int aom_dist_wtd_sad4x4_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad4x4_avg aom_dist_wtd_sad4x4_avg_sse2
unsigned int aom_dist_wtd_sad4x8_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad4x8_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad4x8_avg aom_dist_wtd_sad4x8_avg_ssse3
+unsigned int aom_dist_wtd_sad4x8_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad4x8_avg aom_dist_wtd_sad4x8_avg_sse2
unsigned int aom_dist_wtd_sad64x128_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad64x128_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad64x128_avg aom_dist_wtd_sad64x128_avg_ssse3
+unsigned int aom_dist_wtd_sad64x128_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad64x128_avg aom_dist_wtd_sad64x128_avg_sse2
unsigned int aom_dist_wtd_sad64x16_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad64x16_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad64x16_avg aom_dist_wtd_sad64x16_avg_ssse3
+unsigned int aom_dist_wtd_sad64x16_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad64x16_avg aom_dist_wtd_sad64x16_avg_sse2
unsigned int aom_dist_wtd_sad64x32_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad64x32_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad64x32_avg aom_dist_wtd_sad64x32_avg_ssse3
+unsigned int aom_dist_wtd_sad64x32_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad64x32_avg aom_dist_wtd_sad64x32_avg_sse2
unsigned int aom_dist_wtd_sad64x64_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad64x64_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad64x64_avg aom_dist_wtd_sad64x64_avg_ssse3
+unsigned int aom_dist_wtd_sad64x64_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad64x64_avg aom_dist_wtd_sad64x64_avg_sse2
unsigned int aom_dist_wtd_sad8x16_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad8x16_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad8x16_avg aom_dist_wtd_sad8x16_avg_ssse3
+unsigned int aom_dist_wtd_sad8x16_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad8x16_avg aom_dist_wtd_sad8x16_avg_sse2
unsigned int aom_dist_wtd_sad8x32_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad8x32_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad8x32_avg aom_dist_wtd_sad8x32_avg_ssse3
+unsigned int aom_dist_wtd_sad8x32_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad8x32_avg aom_dist_wtd_sad8x32_avg_sse2
unsigned int aom_dist_wtd_sad8x4_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad8x4_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad8x4_avg aom_dist_wtd_sad8x4_avg_ssse3
+unsigned int aom_dist_wtd_sad8x4_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad8x4_avg aom_dist_wtd_sad8x4_avg_sse2
unsigned int aom_dist_wtd_sad8x8_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad8x8_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad8x8_avg aom_dist_wtd_sad8x8_avg_ssse3
+unsigned int aom_dist_wtd_sad8x8_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad8x8_avg aom_dist_wtd_sad8x8_avg_sse2
uint32_t aom_dist_wtd_sub_pixel_avg_variance128x128_c(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
uint32_t aom_dist_wtd_sub_pixel_avg_variance128x128_ssse3(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
@@ -559,10 +562,6 @@
void aom_fdct4x4_lp_sse2(const int16_t *input, int16_t *output, int stride);
#define aom_fdct4x4_lp aom_fdct4x4_lp_sse2
-void aom_fdct8x8_c(const int16_t *input, tran_low_t *output, int stride);
-void aom_fdct8x8_sse2(const int16_t *input, tran_low_t *output, int stride);
-#define aom_fdct8x8 aom_fdct8x8_sse2
-
void aom_fft16x16_float_c(const float *input, float *temp, float *output);
void aom_fft16x16_float_sse2(const float *input, float *temp, float *output);
#define aom_fft16x16_float aom_fft16x16_float_sse2
@@ -582,16 +581,6 @@
void aom_fft8x8_float_sse2(const float *input, float *temp, float *output);
#define aom_fft8x8_float aom_fft8x8_float_sse2
-void aom_get16x16var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_get16x16var aom_get16x16var_c
-
-unsigned int aom_get4x4sse_cs_c(const unsigned char *src_ptr, int source_stride, const unsigned char *ref_ptr, int ref_stride);
-#define aom_get4x4sse_cs aom_get4x4sse_cs_c
-
-void aom_get8x8var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-void aom_get8x8var_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_get8x8var aom_get8x8var_sse2
-
void aom_get_blk_sse_sum_c(const int16_t *data, int stride, int bw, int bh, int *x_sum, int64_t *x2_sum);
void aom_get_blk_sse_sum_sse2(const int16_t *data, int stride, int bw, int bh, int *x_sum, int64_t *x2_sum);
#define aom_get_blk_sse_sum aom_get_blk_sse_sum_sse2
@@ -601,7 +590,8 @@
#define aom_get_mb_ss aom_get_mb_ss_sse2
void aom_get_var_sse_sum_16x16_dual_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse16x16, unsigned int *tot_sse, int *tot_sum, uint32_t *var16x16);
-#define aom_get_var_sse_sum_16x16_dual aom_get_var_sse_sum_16x16_dual_c
+void aom_get_var_sse_sum_16x16_dual_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse16x16, unsigned int *tot_sse, int *tot_sum, uint32_t *var16x16);
+#define aom_get_var_sse_sum_16x16_dual aom_get_var_sse_sum_16x16_dual_sse2
void aom_get_var_sse_sum_8x8_quad_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse8x8, int *sum8x8, unsigned int *tot_sse, int *tot_sum, uint32_t *var8x8);
void aom_get_var_sse_sum_8x8_quad_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse8x8, int *sum8x8, unsigned int *tot_sse, int *tot_sum, uint32_t *var8x8);
@@ -777,12 +767,6 @@
uint32_t aom_highbd_10_dist_wtd_sub_pixel_avg_variance8x8_c(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param);
#define aom_highbd_10_dist_wtd_sub_pixel_avg_variance8x8 aom_highbd_10_dist_wtd_sub_pixel_avg_variance8x8_c
-void aom_highbd_10_get16x16var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_10_get16x16var aom_highbd_10_get16x16var_c
-
-void aom_highbd_10_get8x8var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_10_get8x8var aom_highbd_10_get8x8var_c
-
unsigned int aom_highbd_10_masked_sub_pixel_variance128x128_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
unsigned int aom_highbd_10_masked_sub_pixel_variance128x128_ssse3(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
#define aom_highbd_10_masked_sub_pixel_variance128x128 aom_highbd_10_masked_sub_pixel_variance128x128_ssse3
@@ -1200,11 +1184,11 @@
unsigned int aom_highbd_10_variance16x32_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance16x32 aom_highbd_10_variance16x32_sse2
-unsigned int aom_highbd_10_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance16x4 aom_highbd_10_variance16x4_c
-unsigned int aom_highbd_10_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_10_variance16x64_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_10_variance16x64_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance16x64 aom_highbd_10_variance16x64_sse2
unsigned int aom_highbd_10_variance16x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1229,11 +1213,11 @@
unsigned int aom_highbd_10_variance32x64_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance32x64 aom_highbd_10_variance32x64_sse2
-unsigned int aom_highbd_10_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_10_variance32x8_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_10_variance32x8_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance32x8 aom_highbd_10_variance32x8_sse2
-unsigned int aom_highbd_10_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance4x16 aom_highbd_10_variance4x16_c
unsigned int aom_highbd_10_variance4x2_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1249,8 +1233,8 @@
unsigned int aom_highbd_10_variance64x128_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance64x128 aom_highbd_10_variance64x128_sse2
-unsigned int aom_highbd_10_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_10_variance64x16_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_10_variance64x16_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance64x16 aom_highbd_10_variance64x16_sse2
unsigned int aom_highbd_10_variance64x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1265,8 +1249,8 @@
unsigned int aom_highbd_10_variance8x16_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance8x16 aom_highbd_10_variance8x16_sse2
-unsigned int aom_highbd_10_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_10_variance8x32_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_10_variance8x32_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance8x32 aom_highbd_10_variance8x32_sse2
unsigned int aom_highbd_10_variance8x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1342,12 +1326,6 @@
uint32_t aom_highbd_12_dist_wtd_sub_pixel_avg_variance8x8_c(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param);
#define aom_highbd_12_dist_wtd_sub_pixel_avg_variance8x8 aom_highbd_12_dist_wtd_sub_pixel_avg_variance8x8_c
-void aom_highbd_12_get16x16var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_12_get16x16var aom_highbd_12_get16x16var_c
-
-void aom_highbd_12_get8x8var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_12_get8x8var aom_highbd_12_get8x8var_c
-
unsigned int aom_highbd_12_masked_sub_pixel_variance128x128_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
unsigned int aom_highbd_12_masked_sub_pixel_variance128x128_ssse3(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
#define aom_highbd_12_masked_sub_pixel_variance128x128 aom_highbd_12_masked_sub_pixel_variance128x128_ssse3
@@ -1765,11 +1743,11 @@
unsigned int aom_highbd_12_variance16x32_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance16x32 aom_highbd_12_variance16x32_sse2
-unsigned int aom_highbd_12_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_12_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance16x4 aom_highbd_12_variance16x4_c
-unsigned int aom_highbd_12_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_12_variance16x64_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_12_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_12_variance16x64_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance16x64 aom_highbd_12_variance16x64_sse2
unsigned int aom_highbd_12_variance16x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1794,11 +1772,11 @@
unsigned int aom_highbd_12_variance32x64_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance32x64 aom_highbd_12_variance32x64_sse2
-unsigned int aom_highbd_12_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_12_variance32x8_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_12_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_12_variance32x8_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance32x8 aom_highbd_12_variance32x8_sse2
-unsigned int aom_highbd_12_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_12_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance4x16 aom_highbd_12_variance4x16_c
unsigned int aom_highbd_12_variance4x2_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1814,8 +1792,8 @@
unsigned int aom_highbd_12_variance64x128_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance64x128 aom_highbd_12_variance64x128_sse2
-unsigned int aom_highbd_12_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_12_variance64x16_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_12_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_12_variance64x16_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance64x16 aom_highbd_12_variance64x16_sse2
unsigned int aom_highbd_12_variance64x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1830,8 +1808,8 @@
unsigned int aom_highbd_12_variance8x16_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance8x16 aom_highbd_12_variance8x16_sse2
-unsigned int aom_highbd_12_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_12_variance8x32_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_12_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_12_variance8x32_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance8x32 aom_highbd_12_variance8x32_sse2
unsigned int aom_highbd_12_variance8x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1907,12 +1885,6 @@
uint32_t aom_highbd_8_dist_wtd_sub_pixel_avg_variance8x8_c(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param);
#define aom_highbd_8_dist_wtd_sub_pixel_avg_variance8x8 aom_highbd_8_dist_wtd_sub_pixel_avg_variance8x8_c
-void aom_highbd_8_get16x16var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_8_get16x16var aom_highbd_8_get16x16var_c
-
-void aom_highbd_8_get8x8var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_8_get8x8var aom_highbd_8_get8x8var_c
-
unsigned int aom_highbd_8_masked_sub_pixel_variance128x128_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
unsigned int aom_highbd_8_masked_sub_pixel_variance128x128_ssse3(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
#define aom_highbd_8_masked_sub_pixel_variance128x128 aom_highbd_8_masked_sub_pixel_variance128x128_ssse3
@@ -2198,11 +2170,11 @@
unsigned int aom_highbd_8_variance16x32_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance16x32 aom_highbd_8_variance16x32_sse2
-unsigned int aom_highbd_8_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_8_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance16x4 aom_highbd_8_variance16x4_c
-unsigned int aom_highbd_8_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_8_variance16x64_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_8_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_8_variance16x64_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance16x64 aom_highbd_8_variance16x64_sse2
unsigned int aom_highbd_8_variance16x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -2227,11 +2199,11 @@
unsigned int aom_highbd_8_variance32x64_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance32x64 aom_highbd_8_variance32x64_sse2
-unsigned int aom_highbd_8_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_8_variance32x8_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_8_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_8_variance32x8_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance32x8 aom_highbd_8_variance32x8_sse2
-unsigned int aom_highbd_8_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_8_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance4x16 aom_highbd_8_variance4x16_c
unsigned int aom_highbd_8_variance4x2_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -2247,8 +2219,8 @@
unsigned int aom_highbd_8_variance64x128_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance64x128 aom_highbd_8_variance64x128_sse2
-unsigned int aom_highbd_8_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_8_variance64x16_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_8_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_8_variance64x16_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance64x16 aom_highbd_8_variance64x16_sse2
unsigned int aom_highbd_8_variance64x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -2263,8 +2235,8 @@
unsigned int aom_highbd_8_variance8x16_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance8x16 aom_highbd_8_variance8x16_sse2
-unsigned int aom_highbd_8_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_8_variance8x32_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_8_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_8_variance8x32_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance8x32 aom_highbd_8_variance8x32_sse2
unsigned int aom_highbd_8_variance8x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -2649,10 +2621,6 @@
unsigned int aom_highbd_dist_wtd_sad8x8_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param);
#define aom_highbd_dist_wtd_sad8x8_avg aom_highbd_dist_wtd_sad8x8_avg_c
-void aom_highbd_fdct8x8_c(const int16_t *input, tran_low_t *output, int stride);
-void aom_highbd_fdct8x8_sse2(const int16_t *input, tran_low_t *output, int stride);
-#define aom_highbd_fdct8x8 aom_highbd_fdct8x8_sse2
-
void aom_highbd_h_predictor_16x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
void aom_highbd_h_predictor_16x16_sse2(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
#define aom_highbd_h_predictor_16x16 aom_highbd_h_predictor_16x16_sse2
@@ -4592,10 +4560,6 @@
void aom_paeth_predictor_8x8_ssse3(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_paeth_predictor_8x8 aom_paeth_predictor_8x8_ssse3
-void aom_pixel_scale_c(const int16_t *src_diff, ptrdiff_t src_stride, int16_t *coeff, int log_scale, int h8, int w8);
-void aom_pixel_scale_sse2(const int16_t *src_diff, ptrdiff_t src_stride, int16_t *coeff, int log_scale, int h8, int w8);
-#define aom_pixel_scale aom_pixel_scale_sse2
-
void aom_quantize_b_c(const tran_low_t *coeff_ptr, intptr_t n_coeffs, const int16_t *zbin_ptr, const int16_t *round_ptr, const int16_t *quant_ptr, const int16_t *quant_shift_ptr, tran_low_t *qcoeff_ptr, tran_low_t *dqcoeff_ptr, const int16_t *dequant_ptr, uint16_t *eob_ptr, const int16_t *scan, const int16_t *iscan);
void aom_quantize_b_sse2(const tran_low_t *coeff_ptr, intptr_t n_coeffs, const int16_t *zbin_ptr, const int16_t *round_ptr, const int16_t *quant_ptr, const int16_t *quant_shift_ptr, tran_low_t *qcoeff_ptr, tran_low_t *dqcoeff_ptr, const int16_t *dequant_ptr, uint16_t *eob_ptr, const int16_t *scan, const int16_t *iscan);
#define aom_quantize_b aom_quantize_b_sse2
@@ -4634,10 +4598,6 @@
void aom_sad128x128x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad128x128x4d aom_sad128x128x4d_sse2
-void aom_sad128x128x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad128x128x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad128x128x4d_avg aom_sad128x128x4d_avg_sse2
-
unsigned int aom_sad128x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad128x64_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad128x64 aom_sad128x64_sse2
@@ -4653,14 +4613,6 @@
void aom_sad128x64x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad128x64x4d aom_sad128x64x4d_sse2
-void aom_sad128x64x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad128x64x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad128x64x4d_avg aom_sad128x64x4d_avg_sse2
-
-unsigned int aom_sad128xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-unsigned int aom_sad128xh_sse2(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad128xh aom_sad128xh_sse2
-
unsigned int aom_sad16x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad16x16_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x16 aom_sad16x16_sse2
@@ -4676,10 +4628,6 @@
void aom_sad16x16x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x16x4d aom_sad16x16x4d_sse2
-void aom_sad16x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad16x16x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x16x4d_avg aom_sad16x16x4d_avg_sse2
-
unsigned int aom_sad16x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad16x32_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x32 aom_sad16x32_sse2
@@ -4695,10 +4643,6 @@
void aom_sad16x32x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x32x4d aom_sad16x32x4d_sse2
-void aom_sad16x32x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad16x32x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x32x4d_avg aom_sad16x32x4d_avg_sse2
-
unsigned int aom_sad16x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad16x4_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x4 aom_sad16x4_sse2
@@ -4714,10 +4658,6 @@
void aom_sad16x4x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x4x4d aom_sad16x4x4d_sse2
-void aom_sad16x4x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad16x4x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x4x4d_avg aom_sad16x4x4d_avg_sse2
-
unsigned int aom_sad16x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad16x64_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x64 aom_sad16x64_sse2
@@ -4733,10 +4673,6 @@
void aom_sad16x64x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x64x4d aom_sad16x64x4d_sse2
-void aom_sad16x64x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad16x64x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x64x4d_avg aom_sad16x64x4d_avg_sse2
-
unsigned int aom_sad16x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad16x8_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x8 aom_sad16x8_sse2
@@ -4752,14 +4688,6 @@
void aom_sad16x8x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x8x4d aom_sad16x8x4d_sse2
-void aom_sad16x8x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad16x8x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x8x4d_avg aom_sad16x8x4d_avg_sse2
-
-unsigned int aom_sad16xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-unsigned int aom_sad16xh_sse2(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad16xh aom_sad16xh_sse2
-
unsigned int aom_sad32x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad32x16_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad32x16 aom_sad32x16_sse2
@@ -4775,10 +4703,6 @@
void aom_sad32x16x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad32x16x4d aom_sad32x16x4d_sse2
-void aom_sad32x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad32x16x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad32x16x4d_avg aom_sad32x16x4d_avg_sse2
-
unsigned int aom_sad32x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad32x32_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad32x32 aom_sad32x32_sse2
@@ -4794,10 +4718,6 @@
void aom_sad32x32x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad32x32x4d aom_sad32x32x4d_sse2
-void aom_sad32x32x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad32x32x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad32x32x4d_avg aom_sad32x32x4d_avg_sse2
-
unsigned int aom_sad32x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad32x64_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad32x64 aom_sad32x64_sse2
@@ -4813,10 +4733,6 @@
void aom_sad32x64x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad32x64x4d aom_sad32x64x4d_sse2
-void aom_sad32x64x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad32x64x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad32x64x4d_avg aom_sad32x64x4d_avg_sse2
-
unsigned int aom_sad32x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad32x8_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad32x8 aom_sad32x8_sse2
@@ -4832,14 +4748,6 @@
void aom_sad32x8x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad32x8x4d aom_sad32x8x4d_sse2
-void aom_sad32x8x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad32x8x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad32x8x4d_avg aom_sad32x8x4d_avg_sse2
-
-unsigned int aom_sad32xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-unsigned int aom_sad32xh_sse2(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad32xh aom_sad32xh_sse2
-
unsigned int aom_sad4x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad4x16_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad4x16 aom_sad4x16_sse2
@@ -4855,10 +4763,6 @@
void aom_sad4x16x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad4x16x4d aom_sad4x16x4d_sse2
-void aom_sad4x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad4x16x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad4x16x4d_avg aom_sad4x16x4d_avg_sse2
-
unsigned int aom_sad4x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad4x4_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad4x4 aom_sad4x4_sse2
@@ -4874,10 +4778,6 @@
void aom_sad4x4x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad4x4x4d aom_sad4x4x4d_sse2
-void aom_sad4x4x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad4x4x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad4x4x4d_avg aom_sad4x4x4d_avg_sse2
-
unsigned int aom_sad4x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad4x8_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad4x8 aom_sad4x8_sse2
@@ -4893,14 +4793,6 @@
void aom_sad4x8x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad4x8x4d aom_sad4x8x4d_sse2
-void aom_sad4x8x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad4x8x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad4x8x4d_avg aom_sad4x8x4d_avg_sse2
-
-unsigned int aom_sad4xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-unsigned int aom_sad4xh_sse2(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad4xh aom_sad4xh_sse2
-
unsigned int aom_sad64x128_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad64x128_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad64x128 aom_sad64x128_sse2
@@ -4916,10 +4808,6 @@
void aom_sad64x128x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad64x128x4d aom_sad64x128x4d_sse2
-void aom_sad64x128x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad64x128x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad64x128x4d_avg aom_sad64x128x4d_avg_sse2
-
unsigned int aom_sad64x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad64x16_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad64x16 aom_sad64x16_sse2
@@ -4935,10 +4823,6 @@
void aom_sad64x16x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad64x16x4d aom_sad64x16x4d_sse2
-void aom_sad64x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad64x16x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad64x16x4d_avg aom_sad64x16x4d_avg_sse2
-
unsigned int aom_sad64x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad64x32_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad64x32 aom_sad64x32_sse2
@@ -4954,10 +4838,6 @@
void aom_sad64x32x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad64x32x4d aom_sad64x32x4d_sse2
-void aom_sad64x32x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad64x32x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad64x32x4d_avg aom_sad64x32x4d_avg_sse2
-
unsigned int aom_sad64x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad64x64_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad64x64 aom_sad64x64_sse2
@@ -4973,14 +4853,6 @@
void aom_sad64x64x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad64x64x4d aom_sad64x64x4d_sse2
-void aom_sad64x64x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad64x64x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad64x64x4d_avg aom_sad64x64x4d_avg_sse2
-
-unsigned int aom_sad64xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-unsigned int aom_sad64xh_sse2(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad64xh aom_sad64xh_sse2
-
unsigned int aom_sad8x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad8x16_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad8x16 aom_sad8x16_sse2
@@ -4996,10 +4868,6 @@
void aom_sad8x16x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad8x16x4d aom_sad8x16x4d_sse2
-void aom_sad8x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad8x16x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad8x16x4d_avg aom_sad8x16x4d_avg_sse2
-
unsigned int aom_sad8x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad8x32_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad8x32 aom_sad8x32_sse2
@@ -5015,10 +4883,6 @@
void aom_sad8x32x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad8x32x4d aom_sad8x32x4d_sse2
-void aom_sad8x32x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad8x32x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad8x32x4d_avg aom_sad8x32x4d_avg_sse2
-
unsigned int aom_sad8x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad8x4_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad8x4 aom_sad8x4_sse2
@@ -5034,10 +4898,6 @@
void aom_sad8x4x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad8x4x4d aom_sad8x4x4d_sse2
-void aom_sad8x4x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad8x4x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad8x4x4d_avg aom_sad8x4x4d_avg_sse2
-
unsigned int aom_sad8x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad8x8_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad8x8 aom_sad8x8_sse2
@@ -5053,14 +4913,6 @@
void aom_sad8x8x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad8x8x4d aom_sad8x8x4d_sse2
-void aom_sad8x8x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad8x8x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad8x8x4d_avg aom_sad8x8x4d_avg_sse2
-
-unsigned int aom_sad8xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-unsigned int aom_sad8xh_sse2(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad8xh aom_sad8xh_sse2
-
unsigned int aom_sad_skip_128x128_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad_skip_128x128_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad_skip_128x128 aom_sad_skip_128x128_sse2
@@ -5897,7 +5749,7 @@
int aom_vector_var_c(const int16_t *ref, const int16_t *src, int bwl);
#define aom_vector_var aom_vector_var_c
-double av1_compute_cross_correlation_c(unsigned char *im1, int stride1, int x1, int y1, unsigned char *im2, int stride2, int x2, int y2);
+double av1_compute_cross_correlation_c(const unsigned char *frame1, int stride1, int x1, int y1, const unsigned char *frame2, int stride2, int x2, int y2);
#define av1_compute_cross_correlation av1_compute_cross_correlation_c
void aom_dsp_rtcd(void);
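
A note on the dispatch pattern visible throughout this header: because these configs set CONFIG_RUNTIME_CPU_DETECT to 0, each aom_* entry point is a plain macro bound to one implementation at configure time rather than a runtime function pointer. The sketch below is illustrative only and not part of the diff; the buffer names and strides are made-up placeholders, but the macro and its signature are taken verbatim from the declarations above.

/* Illustrative sketch (not part of this change): with runtime CPU
 * detection disabled, the call below resolves directly to
 * aom_sad16x64x4d_sse2() through the #define above. */
#include "config/aom_dsp_rtcd.h"

static void sketch_sad4d(const uint8_t *src, int src_stride,
                         const uint8_t *const refs[4], int ref_stride) {
  uint32_t sad[4];  /* one SAD per candidate reference block */
  aom_sad16x64x4d(src, src_stride, refs, ref_stride, sad);
}
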
diff --git a/config/x86/config/aom_scale_rtcd.h b/config/x86/config/aom_scale_rtcd.h
index 28e903d..3b70fb4 100644
--- a/config/x86/config/aom_scale_rtcd.h
+++ b/config/x86/config/aom_scale_rtcd.h
@@ -80,7 +80,7 @@
void aom_yv12_partial_copy_y_c(const struct yv12_buffer_config *src_ybc, int hstart1, int hend1, int vstart1, int vend1, struct yv12_buffer_config *dst_ybc, int hstart2, int vstart2);
#define aom_yv12_partial_copy_y aom_yv12_partial_copy_y_c
-int aom_yv12_realloc_with_new_border_c(struct yv12_buffer_config *ybf, int new_border, int byte_alignment, int num_planes);
+int aom_yv12_realloc_with_new_border_c(struct yv12_buffer_config *ybf, int new_border, int byte_alignment, int num_pyramid_levels, int num_planes);
#define aom_yv12_realloc_with_new_border aom_yv12_realloc_with_new_border_c
void aom_scale_rtcd(void);
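
The aom_yv12_realloc_with_new_border prototype above gains a num_pyramid_levels argument, presumably for the image pyramids used by the global motion work mentioned in the release notes; that interpretation is an assumption. The sketch below only shows the new call shape; every numeric argument is a made-up placeholder.

/* Illustrative sketch (not part of this change): old call sites must now
 * pass a pyramid-level count between byte_alignment and num_planes. */
#include "config/aom_scale_rtcd.h"

static int sketch_realloc(struct yv12_buffer_config *buf) {
  return aom_yv12_realloc_with_new_border(buf, /*new_border=*/288,
                                          /*byte_alignment=*/32,
                                          /*num_pyramid_levels=*/0,
                                          /*num_planes=*/3);
}
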
diff --git a/config/x86/config/av1_rtcd.h b/config/x86/config/av1_rtcd.h
index ef17ccb..d05b1d5 100644
--- a/config/x86/config/av1_rtcd.h
+++ b/config/x86/config/av1_rtcd.h
@@ -15,12 +15,12 @@
#include "aom/aom_integer.h"
#include "aom_dsp/odintrin.h"
#include "aom_dsp/txfm_common.h"
-#include "av1/common/common.h"
-#include "av1/common/enums.h"
-#include "av1/common/quant_common.h"
-#include "av1/common/filter.h"
-#include "av1/common/convolve.h"
#include "av1/common/av1_txfm.h"
+#include "av1/common/common.h"
+#include "av1/common/convolve.h"
+#include "av1/common/enums.h"
+#include "av1/common/filter.h"
+#include "av1/common/quant_common.h"
#include "av1/common/restoration.h"
struct macroblockd;
@@ -86,18 +86,6 @@
int ref_stride, int subpel_search);
#define aom_comp_avg_upsampled_pred aom_comp_avg_upsampled_pred_sse2
-void aom_comp_mask_upsampled_pred_c(MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
- const MV *const mv, uint8_t *comp_pred, const uint8_t *pred, int width,
- int height, int subpel_x_q3, int subpel_y_q3, const uint8_t *ref,
- int ref_stride, const uint8_t *mask, int mask_stride, int invert_mask,
- int subpel_search);
-void aom_comp_mask_upsampled_pred_sse2(MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
- const MV *const mv, uint8_t *comp_pred, const uint8_t *pred, int width,
- int height, int subpel_x_q3, int subpel_y_q3, const uint8_t *ref,
- int ref_stride, const uint8_t *mask, int mask_stride, int invert_mask,
- int subpel_search);
-#define aom_comp_mask_upsampled_pred aom_comp_mask_upsampled_pred_sse2
-
void aom_dist_wtd_comp_avg_upsampled_pred_c(MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
const MV *const mv, uint8_t *comp_pred, const uint8_t *pred, int width,
int height, int subpel_x_q3, int subpel_y_q3, const uint8_t *ref,
@@ -148,8 +136,8 @@
void av1_apply_selfguided_restoration_c(const uint8_t *dat, int width, int height, int stride, int eps, const int *xqd, uint8_t *dst, int dst_stride, int32_t *tmpbuf, int bit_depth, int highbd);
#define av1_apply_selfguided_restoration av1_apply_selfguided_restoration_c
-void av1_apply_temporal_filter_c(const struct yv12_buffer_config *ref_frame, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, const uint8_t *pred, uint32_t *accum, uint16_t *count);
-void av1_apply_temporal_filter_sse2(const struct yv12_buffer_config *ref_frame, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, const uint8_t *pred, uint32_t *accum, uint16_t *count);
+void av1_apply_temporal_filter_c(const struct yv12_buffer_config *frame_to_filter, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum, uint16_t *count);
+void av1_apply_temporal_filter_sse2(const struct yv12_buffer_config *frame_to_filter, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum, uint16_t *count);
#define av1_apply_temporal_filter av1_apply_temporal_filter_sse2
int64_t av1_block_error_c(const tran_low_t *coeff, const tran_low_t *dqcoeff, intptr_t block_size, int64_t *ssz);
@@ -206,7 +194,7 @@
bool av1_cnn_predict_c( const float **input, int in_width, int in_height, int in_stride, const CNN_CONFIG *cnn_config, const CNN_THREAD_DATA *thread_data, CNN_MULTI_OUT *output_struct);
#define av1_cnn_predict av1_cnn_predict_c
-void av1_compute_stats_c(int wiener_win, const uint8_t *dgd8, const uint8_t *src8, int h_start, int h_end, int v_start, int v_end, int dgd_stride, int src_stride, int64_t *M, int64_t *H, int use_downsampled_wiener_stats);
+void av1_compute_stats_c(int wiener_win, const uint8_t *dgd8, const uint8_t *src8, int16_t *dgd_avg, int16_t *src_avg, int h_start, int h_end, int v_start, int v_end, int dgd_stride, int src_stride, int64_t *M, int64_t *H, int use_downsampled_wiener_stats);
#define av1_compute_stats av1_compute_stats_c
void av1_compute_stats_highbd_c(int wiener_win, const uint8_t *dgd8, const uint8_t *src8, int h_start, int h_end, int v_start, int v_end, int dgd_stride, int src_stride, int64_t *M, int64_t *H, aom_bit_depth_t bit_depth);
@@ -256,6 +244,9 @@
void av1_dr_prediction_z3_c(uint8_t *dst, ptrdiff_t stride, int bw, int bh, const uint8_t *above, const uint8_t *left, int upsample_left, int dx, int dy);
#define av1_dr_prediction_z3 av1_dr_prediction_z3_c
+double av1_estimate_noise_from_single_plane_c(const uint8_t *src, int height, int width, int stride, int edge_thresh);
+#define av1_estimate_noise_from_single_plane av1_estimate_noise_from_single_plane_c
+
void av1_filter_intra_edge_c(uint8_t *p, int sz, int strength);
#define av1_filter_intra_edge av1_filter_intra_edge_c
@@ -335,8 +326,8 @@
void av1_get_nz_map_contexts_sse2(const uint8_t *const levels, const int16_t *const scan, const uint16_t eob, const TX_SIZE tx_size, const TX_CLASS tx_class, int8_t *const coeff_contexts);
#define av1_get_nz_map_contexts av1_get_nz_map_contexts_sse2
-void av1_highbd_apply_temporal_filter_c(const struct yv12_buffer_config *ref_frame, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, const uint8_t *pred, uint32_t *accum, uint16_t *count);
-void av1_highbd_apply_temporal_filter_sse2(const struct yv12_buffer_config *ref_frame, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, const uint8_t *pred, uint32_t *accum, uint16_t *count);
+void av1_highbd_apply_temporal_filter_c(const struct yv12_buffer_config *frame_to_filter, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum, uint16_t *count);
+void av1_highbd_apply_temporal_filter_sse2(const struct yv12_buffer_config *frame_to_filter, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum, uint16_t *count);
#define av1_highbd_apply_temporal_filter av1_highbd_apply_temporal_filter_sse2
int64_t av1_highbd_block_error_c(const tran_low_t *coeff, const tran_low_t *dqcoeff, intptr_t block_size, int64_t *ssz, int bd);
@@ -397,8 +388,8 @@
void av1_highbd_dr_prediction_z3_c(uint16_t *dst, ptrdiff_t stride, int bw, int bh, const uint16_t *above, const uint16_t *left, int upsample_left, int dx, int dy, int bd);
#define av1_highbd_dr_prediction_z3 av1_highbd_dr_prediction_z3_c
-void av1_highbd_fwht4x4_c(const int16_t *input, tran_low_t *output, int stride);
-#define av1_highbd_fwht4x4 av1_highbd_fwht4x4_c
+double av1_highbd_estimate_noise_from_single_plane_c(const uint16_t *src, int height, int width, int stride, int bit_depth, int edge_thresh);
+#define av1_highbd_estimate_noise_from_single_plane av1_highbd_estimate_noise_from_single_plane_c
void av1_highbd_inv_txfm_add_c(const tran_low_t *input, uint8_t *dest, int stride, const TxfmParam *txfm_param);
#define av1_highbd_inv_txfm_add av1_highbd_inv_txfm_add_c
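
This header also adds av1_estimate_noise_from_single_plane and its high-bitdepth counterpart. A minimal call sketch follows, illustrative only and not part of the diff; the edge threshold is a made-up placeholder, and the (height, width) argument order is copied from the declaration above.

/* Illustrative sketch (not part of this change): calling the new
 * single-plane noise estimator declared above. Note that height
 * precedes width in the prototype. */
#include "config/av1_rtcd.h"

static double sketch_noise(const uint8_t *y_plane, int w, int h,
                           int stride) {
  return av1_estimate_noise_from_single_plane(y_plane, h, w, stride,
                                              /*edge_thresh=*/50);
}
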
diff --git a/config/x86_64/config/aom_config.asm b/config/x86_64/config/aom_config.asm
index 81d4e24..dc45eaa 100644
--- a/config/x86_64/config/aom_config.asm
+++ b/config/x86_64/config/aom_config.asm
@@ -1,7 +1,8 @@
-%define ARCH_ARM 0
-%define ARCH_PPC 0
-%define ARCH_X86 0
-%define ARCH_X86_64 1
+%define AOM_ARCH_AARCH64 0
+%define AOM_ARCH_ARM 0
+%define AOM_ARCH_PPC 0
+%define AOM_ARCH_X86 0
+%define AOM_ARCH_X86_64 1
%define CONFIG_ACCOUNTING 0
%define CONFIG_ANALYZER 0
%define CONFIG_AV1_DECODER 1
@@ -37,6 +38,7 @@
%define CONFIG_NORMAL_TILE_MODE 1
%define CONFIG_OPTICAL_FLOW_API 0
%define CONFIG_OS_SUPPORT 1
+%define CONFIG_OUTPUT_FRAME_SIZE 0
%define CONFIG_PARTITION_SEARCH_ORDER 0
%define CONFIG_PIC 1
%define CONFIG_RATECTRL_LOG 0
@@ -45,6 +47,7 @@
%define CONFIG_REALTIME_ONLY 0
%define CONFIG_RT_ML_PARTITIONING 0
%define CONFIG_RUNTIME_CPU_DETECT 0
+%define CONFIG_SALIENCY_MAP 0
%define CONFIG_SHARED 0
%define CONFIG_SIZE_LIMIT 1
%define CONFIG_SPATIAL_RESAMPLING 1
diff --git a/config/x86_64/config/aom_config.c b/config/x86_64/config/aom_config.c
index 3801952..8a75212 100644
--- a/config/x86_64/config/aom_config.c
+++ b/config/x86_64/config/aom_config.c
@@ -1,5 +1,5 @@
/*
- * Copyright (c) 2016, Alliance for Open Media. All rights reserved
+ * Copyright (c) 2023, Alliance for Open Media. All rights reserved
*
* This source code is subject to the terms of the BSD 2 Clause License and
* the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
diff --git a/config/x86_64/config/aom_config.h b/config/x86_64/config/aom_config.h
index dcaa0fc..bad7861 100644
--- a/config/x86_64/config/aom_config.h
+++ b/config/x86_64/config/aom_config.h
@@ -10,10 +10,11 @@
*/
#ifndef AOM_CONFIG_H_
#define AOM_CONFIG_H_
-#define ARCH_ARM 0
-#define ARCH_PPC 0
-#define ARCH_X86 0
-#define ARCH_X86_64 1
+#define AOM_ARCH_AARCH64 0
+#define AOM_ARCH_ARM 0
+#define AOM_ARCH_PPC 0
+#define AOM_ARCH_X86 0
+#define AOM_ARCH_X86_64 1
#define CONFIG_ACCOUNTING 0
#define CONFIG_ANALYZER 0
#define CONFIG_AV1_DECODER 1
@@ -49,6 +50,7 @@
#define CONFIG_NORMAL_TILE_MODE 1
#define CONFIG_OPTICAL_FLOW_API 0
#define CONFIG_OS_SUPPORT 1
+#define CONFIG_OUTPUT_FRAME_SIZE 0
#define CONFIG_PARTITION_SEARCH_ORDER 0
#define CONFIG_PIC 1
#define CONFIG_RATECTRL_LOG 0
@@ -57,6 +59,7 @@
#define CONFIG_REALTIME_ONLY 0
#define CONFIG_RT_ML_PARTITIONING 0
#define CONFIG_RUNTIME_CPU_DETECT 0
+#define CONFIG_SALIENCY_MAP 0
#define CONFIG_SHARED 0
#define CONFIG_SIZE_LIMIT 1
#define CONFIG_SPATIAL_RESAMPLING 1
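
The hunks above rename the architecture macros from ARCH_* to AOM_ARCH_* (and add AOM_ARCH_AARCH64), so any downstream code keying on the old names must be updated. The fragment below is an illustrative sketch, not part of the diff; SKETCH_SIMD_WIDTH is a hypothetical name used only to show the rename.

/* Illustrative sketch (not part of this change): conditional compilation
 * must switch to the AOM_ARCH_* spellings defined above. */
#include "config/aom_config.h"

#if AOM_ARCH_X86_64 /* was: #if ARCH_X86_64 */
#define SKETCH_SIMD_WIDTH 16
#else
#define SKETCH_SIMD_WIDTH 1
#endif
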
diff --git a/config/x86_64/config/aom_dsp_rtcd.h b/config/x86_64/config/aom_dsp_rtcd.h
index c4f99d6..cfb7380 100644
--- a/config/x86_64/config/aom_dsp_rtcd.h
+++ b/config/x86_64/config/aom_dsp_rtcd.h
@@ -14,8 +14,8 @@
#include "aom/aom_integer.h"
#include "aom_dsp/aom_dsp_common.h"
-#include "av1/common/enums.h"
#include "av1/common/blockd.h"
+#include "av1/common/enums.h"
#ifdef __cplusplus
@@ -50,6 +50,9 @@
void aom_comp_mask_pred_ssse3(uint8_t *comp_pred, const uint8_t *pred, int width, int height, const uint8_t *ref, int ref_stride, const uint8_t *mask, int mask_stride, int invert_mask);
#define aom_comp_mask_pred aom_comp_mask_pred_ssse3
+void aom_compute_flow_at_point_c(const uint8_t *src, const uint8_t *ref, int x, int y, int width, int height, int stride, double *u, double *v);
+#define aom_compute_flow_at_point aom_compute_flow_at_point_c
+
void aom_convolve8_c(const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst, ptrdiff_t dst_stride, const InterpKernel *filter, int x0_q4, int x_step_q4, int y0_q4, int y_step_q4, int w, int h);
#define aom_convolve8 aom_convolve8_c
@@ -376,92 +379,92 @@
#define aom_dist_wtd_comp_avg_pred aom_dist_wtd_comp_avg_pred_ssse3
unsigned int aom_dist_wtd_sad128x128_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad128x128_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad128x128_avg aom_dist_wtd_sad128x128_avg_ssse3
+unsigned int aom_dist_wtd_sad128x128_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad128x128_avg aom_dist_wtd_sad128x128_avg_sse2
unsigned int aom_dist_wtd_sad128x64_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad128x64_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad128x64_avg aom_dist_wtd_sad128x64_avg_ssse3
+unsigned int aom_dist_wtd_sad128x64_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad128x64_avg aom_dist_wtd_sad128x64_avg_sse2
unsigned int aom_dist_wtd_sad16x16_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad16x16_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad16x16_avg aom_dist_wtd_sad16x16_avg_ssse3
+unsigned int aom_dist_wtd_sad16x16_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad16x16_avg aom_dist_wtd_sad16x16_avg_sse2
unsigned int aom_dist_wtd_sad16x32_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad16x32_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad16x32_avg aom_dist_wtd_sad16x32_avg_ssse3
+unsigned int aom_dist_wtd_sad16x32_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad16x32_avg aom_dist_wtd_sad16x32_avg_sse2
unsigned int aom_dist_wtd_sad16x4_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad16x4_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad16x4_avg aom_dist_wtd_sad16x4_avg_ssse3
+unsigned int aom_dist_wtd_sad16x4_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad16x4_avg aom_dist_wtd_sad16x4_avg_sse2
unsigned int aom_dist_wtd_sad16x64_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad16x64_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad16x64_avg aom_dist_wtd_sad16x64_avg_ssse3
+unsigned int aom_dist_wtd_sad16x64_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad16x64_avg aom_dist_wtd_sad16x64_avg_sse2
unsigned int aom_dist_wtd_sad16x8_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad16x8_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad16x8_avg aom_dist_wtd_sad16x8_avg_ssse3
+unsigned int aom_dist_wtd_sad16x8_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad16x8_avg aom_dist_wtd_sad16x8_avg_sse2
unsigned int aom_dist_wtd_sad32x16_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad32x16_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad32x16_avg aom_dist_wtd_sad32x16_avg_ssse3
+unsigned int aom_dist_wtd_sad32x16_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad32x16_avg aom_dist_wtd_sad32x16_avg_sse2
unsigned int aom_dist_wtd_sad32x32_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad32x32_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad32x32_avg aom_dist_wtd_sad32x32_avg_ssse3
+unsigned int aom_dist_wtd_sad32x32_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad32x32_avg aom_dist_wtd_sad32x32_avg_sse2
unsigned int aom_dist_wtd_sad32x64_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad32x64_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad32x64_avg aom_dist_wtd_sad32x64_avg_ssse3
+unsigned int aom_dist_wtd_sad32x64_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad32x64_avg aom_dist_wtd_sad32x64_avg_sse2
unsigned int aom_dist_wtd_sad32x8_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad32x8_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad32x8_avg aom_dist_wtd_sad32x8_avg_ssse3
+unsigned int aom_dist_wtd_sad32x8_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad32x8_avg aom_dist_wtd_sad32x8_avg_sse2
unsigned int aom_dist_wtd_sad4x16_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad4x16_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad4x16_avg aom_dist_wtd_sad4x16_avg_ssse3
+unsigned int aom_dist_wtd_sad4x16_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad4x16_avg aom_dist_wtd_sad4x16_avg_sse2
unsigned int aom_dist_wtd_sad4x4_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad4x4_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad4x4_avg aom_dist_wtd_sad4x4_avg_ssse3
+unsigned int aom_dist_wtd_sad4x4_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad4x4_avg aom_dist_wtd_sad4x4_avg_sse2
unsigned int aom_dist_wtd_sad4x8_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad4x8_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad4x8_avg aom_dist_wtd_sad4x8_avg_ssse3
+unsigned int aom_dist_wtd_sad4x8_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad4x8_avg aom_dist_wtd_sad4x8_avg_sse2
unsigned int aom_dist_wtd_sad64x128_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad64x128_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad64x128_avg aom_dist_wtd_sad64x128_avg_ssse3
+unsigned int aom_dist_wtd_sad64x128_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad64x128_avg aom_dist_wtd_sad64x128_avg_sse2
unsigned int aom_dist_wtd_sad64x16_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad64x16_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad64x16_avg aom_dist_wtd_sad64x16_avg_ssse3
+unsigned int aom_dist_wtd_sad64x16_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad64x16_avg aom_dist_wtd_sad64x16_avg_sse2
unsigned int aom_dist_wtd_sad64x32_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad64x32_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad64x32_avg aom_dist_wtd_sad64x32_avg_ssse3
+unsigned int aom_dist_wtd_sad64x32_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad64x32_avg aom_dist_wtd_sad64x32_avg_sse2
unsigned int aom_dist_wtd_sad64x64_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad64x64_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad64x64_avg aom_dist_wtd_sad64x64_avg_ssse3
+unsigned int aom_dist_wtd_sad64x64_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad64x64_avg aom_dist_wtd_sad64x64_avg_sse2
unsigned int aom_dist_wtd_sad8x16_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad8x16_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad8x16_avg aom_dist_wtd_sad8x16_avg_ssse3
+unsigned int aom_dist_wtd_sad8x16_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad8x16_avg aom_dist_wtd_sad8x16_avg_sse2
unsigned int aom_dist_wtd_sad8x32_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad8x32_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad8x32_avg aom_dist_wtd_sad8x32_avg_ssse3
+unsigned int aom_dist_wtd_sad8x32_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad8x32_avg aom_dist_wtd_sad8x32_avg_sse2
unsigned int aom_dist_wtd_sad8x4_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad8x4_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad8x4_avg aom_dist_wtd_sad8x4_avg_ssse3
+unsigned int aom_dist_wtd_sad8x4_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad8x4_avg aom_dist_wtd_sad8x4_avg_sse2
unsigned int aom_dist_wtd_sad8x8_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-unsigned int aom_dist_wtd_sad8x8_avg_ssse3(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
-#define aom_dist_wtd_sad8x8_avg aom_dist_wtd_sad8x8_avg_ssse3
+unsigned int aom_dist_wtd_sad8x8_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
+#define aom_dist_wtd_sad8x8_avg aom_dist_wtd_sad8x8_avg_sse2
uint32_t aom_dist_wtd_sub_pixel_avg_variance128x128_c(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
uint32_t aom_dist_wtd_sub_pixel_avg_variance128x128_ssse3(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS *jcp_param);
@@ -559,11 +562,6 @@
void aom_fdct4x4_lp_sse2(const int16_t *input, int16_t *output, int stride);
#define aom_fdct4x4_lp aom_fdct4x4_lp_sse2
-void aom_fdct8x8_c(const int16_t *input, tran_low_t *output, int stride);
-void aom_fdct8x8_sse2(const int16_t *input, tran_low_t *output, int stride);
-void aom_fdct8x8_ssse3(const int16_t *input, tran_low_t *output, int stride);
-#define aom_fdct8x8 aom_fdct8x8_ssse3
-
void aom_fft16x16_float_c(const float *input, float *temp, float *output);
void aom_fft16x16_float_sse2(const float *input, float *temp, float *output);
#define aom_fft16x16_float aom_fft16x16_float_sse2
@@ -583,16 +581,6 @@
void aom_fft8x8_float_sse2(const float *input, float *temp, float *output);
#define aom_fft8x8_float aom_fft8x8_float_sse2
-void aom_get16x16var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_get16x16var aom_get16x16var_c
-
-unsigned int aom_get4x4sse_cs_c(const unsigned char *src_ptr, int source_stride, const unsigned char *ref_ptr, int ref_stride);
-#define aom_get4x4sse_cs aom_get4x4sse_cs_c
-
-void aom_get8x8var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-void aom_get8x8var_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_get8x8var aom_get8x8var_sse2
-
void aom_get_blk_sse_sum_c(const int16_t *data, int stride, int bw, int bh, int *x_sum, int64_t *x2_sum);
void aom_get_blk_sse_sum_sse2(const int16_t *data, int stride, int bw, int bh, int *x_sum, int64_t *x2_sum);
#define aom_get_blk_sse_sum aom_get_blk_sse_sum_sse2
@@ -602,7 +590,8 @@
#define aom_get_mb_ss aom_get_mb_ss_sse2
void aom_get_var_sse_sum_16x16_dual_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse16x16, unsigned int *tot_sse, int *tot_sum, uint32_t *var16x16);
-#define aom_get_var_sse_sum_16x16_dual aom_get_var_sse_sum_16x16_dual_c
+void aom_get_var_sse_sum_16x16_dual_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse16x16, unsigned int *tot_sse, int *tot_sum, uint32_t *var16x16);
+#define aom_get_var_sse_sum_16x16_dual aom_get_var_sse_sum_16x16_dual_sse2
void aom_get_var_sse_sum_8x8_quad_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse8x8, int *sum8x8, unsigned int *tot_sse, int *tot_sum, uint32_t *var8x8);
void aom_get_var_sse_sum_8x8_quad_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse8x8, int *sum8x8, unsigned int *tot_sse, int *tot_sum, uint32_t *var8x8);
@@ -778,12 +767,6 @@
uint32_t aom_highbd_10_dist_wtd_sub_pixel_avg_variance8x8_c(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param);
#define aom_highbd_10_dist_wtd_sub_pixel_avg_variance8x8 aom_highbd_10_dist_wtd_sub_pixel_avg_variance8x8_c
-void aom_highbd_10_get16x16var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_10_get16x16var aom_highbd_10_get16x16var_c
-
-void aom_highbd_10_get8x8var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_10_get8x8var aom_highbd_10_get8x8var_c
-
unsigned int aom_highbd_10_masked_sub_pixel_variance128x128_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
unsigned int aom_highbd_10_masked_sub_pixel_variance128x128_ssse3(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
#define aom_highbd_10_masked_sub_pixel_variance128x128 aom_highbd_10_masked_sub_pixel_variance128x128_ssse3
@@ -1201,11 +1184,11 @@
unsigned int aom_highbd_10_variance16x32_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance16x32 aom_highbd_10_variance16x32_sse2
-unsigned int aom_highbd_10_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance16x4 aom_highbd_10_variance16x4_c
-unsigned int aom_highbd_10_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_10_variance16x64_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_10_variance16x64_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance16x64 aom_highbd_10_variance16x64_sse2
unsigned int aom_highbd_10_variance16x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1230,11 +1213,11 @@
unsigned int aom_highbd_10_variance32x64_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance32x64 aom_highbd_10_variance32x64_sse2
-unsigned int aom_highbd_10_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_10_variance32x8_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_10_variance32x8_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance32x8 aom_highbd_10_variance32x8_sse2
-unsigned int aom_highbd_10_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance4x16 aom_highbd_10_variance4x16_c
unsigned int aom_highbd_10_variance4x2_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1250,8 +1233,8 @@
unsigned int aom_highbd_10_variance64x128_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance64x128 aom_highbd_10_variance64x128_sse2
-unsigned int aom_highbd_10_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_10_variance64x16_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_10_variance64x16_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance64x16 aom_highbd_10_variance64x16_sse2
unsigned int aom_highbd_10_variance64x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1266,8 +1249,8 @@
unsigned int aom_highbd_10_variance8x16_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance8x16 aom_highbd_10_variance8x16_sse2
-unsigned int aom_highbd_10_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_10_variance8x32_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_10_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_10_variance8x32_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_10_variance8x32 aom_highbd_10_variance8x32_sse2
unsigned int aom_highbd_10_variance8x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1343,12 +1326,6 @@
uint32_t aom_highbd_12_dist_wtd_sub_pixel_avg_variance8x8_c(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param);
#define aom_highbd_12_dist_wtd_sub_pixel_avg_variance8x8 aom_highbd_12_dist_wtd_sub_pixel_avg_variance8x8_c
-void aom_highbd_12_get16x16var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_12_get16x16var aom_highbd_12_get16x16var_c
-
-void aom_highbd_12_get8x8var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_12_get8x8var aom_highbd_12_get8x8var_c
-
unsigned int aom_highbd_12_masked_sub_pixel_variance128x128_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
unsigned int aom_highbd_12_masked_sub_pixel_variance128x128_ssse3(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
#define aom_highbd_12_masked_sub_pixel_variance128x128 aom_highbd_12_masked_sub_pixel_variance128x128_ssse3
@@ -1766,11 +1743,11 @@
unsigned int aom_highbd_12_variance16x32_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance16x32 aom_highbd_12_variance16x32_sse2
-unsigned int aom_highbd_12_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_12_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance16x4 aom_highbd_12_variance16x4_c
-unsigned int aom_highbd_12_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_12_variance16x64_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_12_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_12_variance16x64_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance16x64 aom_highbd_12_variance16x64_sse2
unsigned int aom_highbd_12_variance16x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1795,11 +1772,11 @@
unsigned int aom_highbd_12_variance32x64_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance32x64 aom_highbd_12_variance32x64_sse2
-unsigned int aom_highbd_12_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_12_variance32x8_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_12_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_12_variance32x8_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance32x8 aom_highbd_12_variance32x8_sse2
-unsigned int aom_highbd_12_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_12_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance4x16 aom_highbd_12_variance4x16_c
unsigned int aom_highbd_12_variance4x2_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1815,8 +1792,8 @@
unsigned int aom_highbd_12_variance64x128_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance64x128 aom_highbd_12_variance64x128_sse2
-unsigned int aom_highbd_12_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_12_variance64x16_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_12_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_12_variance64x16_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance64x16 aom_highbd_12_variance64x16_sse2
unsigned int aom_highbd_12_variance64x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1831,8 +1808,8 @@
unsigned int aom_highbd_12_variance8x16_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance8x16 aom_highbd_12_variance8x16_sse2
-unsigned int aom_highbd_12_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_12_variance8x32_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_12_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_12_variance8x32_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_12_variance8x32 aom_highbd_12_variance8x32_sse2
unsigned int aom_highbd_12_variance8x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -1908,12 +1885,6 @@
uint32_t aom_highbd_8_dist_wtd_sub_pixel_avg_variance8x8_c(const uint8_t *src_ptr, int source_stride, int xoffset, int yoffset, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param);
#define aom_highbd_8_dist_wtd_sub_pixel_avg_variance8x8 aom_highbd_8_dist_wtd_sub_pixel_avg_variance8x8_c
-void aom_highbd_8_get16x16var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_8_get16x16var aom_highbd_8_get16x16var_c
-
-void aom_highbd_8_get8x8var_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse, int *sum);
-#define aom_highbd_8_get8x8var aom_highbd_8_get8x8var_c
-
unsigned int aom_highbd_8_masked_sub_pixel_variance128x128_c(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
unsigned int aom_highbd_8_masked_sub_pixel_variance128x128_ssse3(const uint8_t *src, int src_stride, int xoffset, int yoffset, const uint8_t *ref, int ref_stride, const uint8_t *second_pred, const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
#define aom_highbd_8_masked_sub_pixel_variance128x128 aom_highbd_8_masked_sub_pixel_variance128x128_ssse3
@@ -2199,11 +2170,11 @@
unsigned int aom_highbd_8_variance16x32_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance16x32 aom_highbd_8_variance16x32_sse2
-unsigned int aom_highbd_8_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_8_variance16x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance16x4 aom_highbd_8_variance16x4_c
-unsigned int aom_highbd_8_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_8_variance16x64_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_8_variance16x64_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_8_variance16x64_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance16x64 aom_highbd_8_variance16x64_sse2
unsigned int aom_highbd_8_variance16x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -2228,11 +2199,11 @@
unsigned int aom_highbd_8_variance32x64_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance32x64 aom_highbd_8_variance32x64_sse2
-unsigned int aom_highbd_8_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_8_variance32x8_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_8_variance32x8_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_8_variance32x8_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance32x8 aom_highbd_8_variance32x8_sse2
-unsigned int aom_highbd_8_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_8_variance4x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance4x16 aom_highbd_8_variance4x16_c
unsigned int aom_highbd_8_variance4x2_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -2248,8 +2219,8 @@
unsigned int aom_highbd_8_variance64x128_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance64x128 aom_highbd_8_variance64x128_sse2
-unsigned int aom_highbd_8_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_8_variance64x16_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_8_variance64x16_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_8_variance64x16_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance64x16 aom_highbd_8_variance64x16_sse2
unsigned int aom_highbd_8_variance64x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -2264,8 +2235,8 @@
unsigned int aom_highbd_8_variance8x16_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance8x16 aom_highbd_8_variance8x16_sse2
-unsigned int aom_highbd_8_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
-unsigned int aom_highbd_8_variance8x32_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, uint32_t *sse);
+unsigned int aom_highbd_8_variance8x32_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
+unsigned int aom_highbd_8_variance8x32_sse2(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
#define aom_highbd_8_variance8x32 aom_highbd_8_variance8x32_sse2
unsigned int aom_highbd_8_variance8x4_c(const uint8_t *src_ptr, int source_stride, const uint8_t *ref_ptr, int ref_stride, unsigned int *sse);
@@ -2650,10 +2621,6 @@
unsigned int aom_highbd_dist_wtd_sad8x8_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride, const uint8_t *second_pred, const DIST_WTD_COMP_PARAMS* jcp_param);
#define aom_highbd_dist_wtd_sad8x8_avg aom_highbd_dist_wtd_sad8x8_avg_c
-void aom_highbd_fdct8x8_c(const int16_t *input, tran_low_t *output, int stride);
-void aom_highbd_fdct8x8_sse2(const int16_t *input, tran_low_t *output, int stride);
-#define aom_highbd_fdct8x8 aom_highbd_fdct8x8_sse2
-
void aom_highbd_h_predictor_16x16_c(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
void aom_highbd_h_predictor_16x16_sse2(uint16_t *dst, ptrdiff_t y_stride, const uint16_t *above, const uint16_t *left, int bd);
#define aom_highbd_h_predictor_16x16 aom_highbd_h_predictor_16x16_sse2
@@ -4593,10 +4560,6 @@
void aom_paeth_predictor_8x8_ssse3(uint8_t *dst, ptrdiff_t y_stride, const uint8_t *above, const uint8_t *left);
#define aom_paeth_predictor_8x8 aom_paeth_predictor_8x8_ssse3
-void aom_pixel_scale_c(const int16_t *src_diff, ptrdiff_t src_stride, int16_t *coeff, int log_scale, int h8, int w8);
-void aom_pixel_scale_sse2(const int16_t *src_diff, ptrdiff_t src_stride, int16_t *coeff, int log_scale, int h8, int w8);
-#define aom_pixel_scale aom_pixel_scale_sse2
-
void aom_quantize_b_c(const tran_low_t *coeff_ptr, intptr_t n_coeffs, const int16_t *zbin_ptr, const int16_t *round_ptr, const int16_t *quant_ptr, const int16_t *quant_shift_ptr, tran_low_t *qcoeff_ptr, tran_low_t *dqcoeff_ptr, const int16_t *dequant_ptr, uint16_t *eob_ptr, const int16_t *scan, const int16_t *iscan);
void aom_quantize_b_sse2(const tran_low_t *coeff_ptr, intptr_t n_coeffs, const int16_t *zbin_ptr, const int16_t *round_ptr, const int16_t *quant_ptr, const int16_t *quant_shift_ptr, tran_low_t *qcoeff_ptr, tran_low_t *dqcoeff_ptr, const int16_t *dequant_ptr, uint16_t *eob_ptr, const int16_t *scan, const int16_t *iscan);
void aom_quantize_b_ssse3(const tran_low_t *coeff_ptr, intptr_t n_coeffs, const int16_t *zbin_ptr, const int16_t *round_ptr, const int16_t *quant_ptr, const int16_t *quant_shift_ptr, tran_low_t *qcoeff_ptr, tran_low_t *dqcoeff_ptr, const int16_t *dequant_ptr, uint16_t *eob_ptr, const int16_t *scan, const int16_t *iscan);
@@ -4637,10 +4600,6 @@
void aom_sad128x128x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad128x128x4d aom_sad128x128x4d_sse2
-void aom_sad128x128x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad128x128x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad128x128x4d_avg aom_sad128x128x4d_avg_sse2
-
unsigned int aom_sad128x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad128x64_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad128x64 aom_sad128x64_sse2
@@ -4656,14 +4615,6 @@
void aom_sad128x64x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad128x64x4d aom_sad128x64x4d_sse2
-void aom_sad128x64x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad128x64x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad128x64x4d_avg aom_sad128x64x4d_avg_sse2
-
-unsigned int aom_sad128xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-unsigned int aom_sad128xh_sse2(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad128xh aom_sad128xh_sse2
-
unsigned int aom_sad16x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad16x16_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x16 aom_sad16x16_sse2
@@ -4679,10 +4630,6 @@
void aom_sad16x16x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x16x4d aom_sad16x16x4d_sse2
-void aom_sad16x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad16x16x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x16x4d_avg aom_sad16x16x4d_avg_sse2
-
unsigned int aom_sad16x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad16x32_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x32 aom_sad16x32_sse2
@@ -4698,10 +4645,6 @@
void aom_sad16x32x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x32x4d aom_sad16x32x4d_sse2
-void aom_sad16x32x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad16x32x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x32x4d_avg aom_sad16x32x4d_avg_sse2
-
unsigned int aom_sad16x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad16x4_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x4 aom_sad16x4_sse2
@@ -4717,10 +4660,6 @@
void aom_sad16x4x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x4x4d aom_sad16x4x4d_sse2
-void aom_sad16x4x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad16x4x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x4x4d_avg aom_sad16x4x4d_avg_sse2
-
unsigned int aom_sad16x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad16x64_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x64 aom_sad16x64_sse2
@@ -4736,10 +4675,6 @@
void aom_sad16x64x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x64x4d aom_sad16x64x4d_sse2
-void aom_sad16x64x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad16x64x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x64x4d_avg aom_sad16x64x4d_avg_sse2
-
unsigned int aom_sad16x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad16x8_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad16x8 aom_sad16x8_sse2
@@ -4755,14 +4690,6 @@
void aom_sad16x8x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad16x8x4d aom_sad16x8x4d_sse2
-void aom_sad16x8x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad16x8x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad16x8x4d_avg aom_sad16x8x4d_avg_sse2
-
-unsigned int aom_sad16xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-unsigned int aom_sad16xh_sse2(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad16xh aom_sad16xh_sse2
-
unsigned int aom_sad32x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad32x16_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad32x16 aom_sad32x16_sse2
@@ -4778,10 +4705,6 @@
void aom_sad32x16x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad32x16x4d aom_sad32x16x4d_sse2
-void aom_sad32x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad32x16x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad32x16x4d_avg aom_sad32x16x4d_avg_sse2
-
unsigned int aom_sad32x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad32x32_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad32x32 aom_sad32x32_sse2
@@ -4797,10 +4720,6 @@
void aom_sad32x32x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad32x32x4d aom_sad32x32x4d_sse2
-void aom_sad32x32x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad32x32x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad32x32x4d_avg aom_sad32x32x4d_avg_sse2
-
unsigned int aom_sad32x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad32x64_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad32x64 aom_sad32x64_sse2
@@ -4816,10 +4735,6 @@
void aom_sad32x64x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad32x64x4d aom_sad32x64x4d_sse2
-void aom_sad32x64x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad32x64x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad32x64x4d_avg aom_sad32x64x4d_avg_sse2
-
unsigned int aom_sad32x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad32x8_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad32x8 aom_sad32x8_sse2
@@ -4835,14 +4750,6 @@
void aom_sad32x8x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad32x8x4d aom_sad32x8x4d_sse2
-void aom_sad32x8x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad32x8x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad32x8x4d_avg aom_sad32x8x4d_avg_sse2
-
-unsigned int aom_sad32xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-unsigned int aom_sad32xh_sse2(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad32xh aom_sad32xh_sse2
-
unsigned int aom_sad4x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad4x16_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad4x16 aom_sad4x16_sse2
@@ -4858,10 +4765,6 @@
void aom_sad4x16x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad4x16x4d aom_sad4x16x4d_sse2
-void aom_sad4x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad4x16x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad4x16x4d_avg aom_sad4x16x4d_avg_sse2
-
unsigned int aom_sad4x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad4x4_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad4x4 aom_sad4x4_sse2
@@ -4877,10 +4780,6 @@
void aom_sad4x4x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad4x4x4d aom_sad4x4x4d_sse2
-void aom_sad4x4x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad4x4x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad4x4x4d_avg aom_sad4x4x4d_avg_sse2
-
unsigned int aom_sad4x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad4x8_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad4x8 aom_sad4x8_sse2
@@ -4896,14 +4795,6 @@
void aom_sad4x8x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad4x8x4d aom_sad4x8x4d_sse2
-void aom_sad4x8x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad4x8x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad4x8x4d_avg aom_sad4x8x4d_avg_sse2
-
-unsigned int aom_sad4xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-unsigned int aom_sad4xh_sse2(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad4xh aom_sad4xh_sse2
-
unsigned int aom_sad64x128_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad64x128_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad64x128 aom_sad64x128_sse2
@@ -4919,10 +4810,6 @@
void aom_sad64x128x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad64x128x4d aom_sad64x128x4d_sse2
-void aom_sad64x128x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad64x128x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad64x128x4d_avg aom_sad64x128x4d_avg_sse2
-
unsigned int aom_sad64x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad64x16_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad64x16 aom_sad64x16_sse2
@@ -4938,10 +4825,6 @@
void aom_sad64x16x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad64x16x4d aom_sad64x16x4d_sse2
-void aom_sad64x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad64x16x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad64x16x4d_avg aom_sad64x16x4d_avg_sse2
-
unsigned int aom_sad64x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad64x32_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad64x32 aom_sad64x32_sse2
@@ -4957,10 +4840,6 @@
void aom_sad64x32x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad64x32x4d aom_sad64x32x4d_sse2
-void aom_sad64x32x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad64x32x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad64x32x4d_avg aom_sad64x32x4d_avg_sse2
-
unsigned int aom_sad64x64_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad64x64_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad64x64 aom_sad64x64_sse2
@@ -4976,14 +4855,6 @@
void aom_sad64x64x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad64x64x4d aom_sad64x64x4d_sse2
-void aom_sad64x64x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad64x64x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad64x64x4d_avg aom_sad64x64x4d_avg_sse2
-
-unsigned int aom_sad64xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-unsigned int aom_sad64xh_sse2(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad64xh aom_sad64xh_sse2
-
unsigned int aom_sad8x16_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad8x16_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad8x16 aom_sad8x16_sse2
@@ -4999,10 +4870,6 @@
void aom_sad8x16x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad8x16x4d aom_sad8x16x4d_sse2
-void aom_sad8x16x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad8x16x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad8x16x4d_avg aom_sad8x16x4d_avg_sse2
-
unsigned int aom_sad8x32_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad8x32_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad8x32 aom_sad8x32_sse2
@@ -5018,10 +4885,6 @@
void aom_sad8x32x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad8x32x4d aom_sad8x32x4d_sse2
-void aom_sad8x32x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad8x32x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad8x32x4d_avg aom_sad8x32x4d_avg_sse2
-
unsigned int aom_sad8x4_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad8x4_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad8x4 aom_sad8x4_sse2
@@ -5037,10 +4900,6 @@
void aom_sad8x4x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad8x4x4d aom_sad8x4x4d_sse2
-void aom_sad8x4x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad8x4x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad8x4x4d_avg aom_sad8x4x4d_avg_sse2
-
unsigned int aom_sad8x8_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad8x8_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad8x8 aom_sad8x8_sse2
@@ -5056,14 +4915,6 @@
void aom_sad8x8x4d_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, uint32_t sad_array[4]);
#define aom_sad8x8x4d aom_sad8x8x4d_sse2
-void aom_sad8x8x4d_avg_c(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-void aom_sad8x8x4d_avg_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t * const ref_ptr[4], int ref_stride, const uint8_t *second_pred, uint32_t sad_array[4]);
-#define aom_sad8x8x4d_avg aom_sad8x8x4d_avg_sse2
-
-unsigned int aom_sad8xh_c(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-unsigned int aom_sad8xh_sse2(const uint8_t *a, int a_stride, const uint8_t *b, int b_stride, int width, int height);
-#define aom_sad8xh aom_sad8xh_sse2
-
unsigned int aom_sad_skip_128x128_c(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
unsigned int aom_sad_skip_128x128_sse2(const uint8_t *src_ptr, int src_stride, const uint8_t *ref_ptr, int ref_stride);
#define aom_sad_skip_128x128 aom_sad_skip_128x128_sse2
@@ -5901,7 +5752,7 @@
int aom_vector_var_c(const int16_t *ref, const int16_t *src, int bwl);
#define aom_vector_var aom_vector_var_c
-double av1_compute_cross_correlation_c(unsigned char *im1, int stride1, int x1, int y1, unsigned char *im2, int stride2, int x2, int y2);
+double av1_compute_cross_correlation_c(const unsigned char *frame1, int stride1, int x1, int y1, const unsigned char *frame2, int stride2, int x2, int y2);
#define av1_compute_cross_correlation av1_compute_cross_correlation_c
void aom_dsp_rtcd(void);
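The long run of uint32_t to unsigned int swaps above is a prototype harmonization of this generated header; on this x86_64 config the two types are identical and the kernels themselves are unchanged. For orientation, every one of these variance functions follows the same contract: fill *sse with the sum of squared differences over the block and return sse - sum*sum/(w*h). A plain-C sketch of that contract (the textbook definition, not the libaom source):

    #include <stdint.h>

    /* Reference block variance: writes the SSE through *sse and returns
     * sse - sum^2 / (w * h), matching the prototypes above. */
    static unsigned int variance_ref(const uint8_t *src, int src_stride,
                                     const uint8_t *ref, int ref_stride,
                                     int w, int h, unsigned int *sse) {
      int64_t sum = 0;
      uint64_t sse64 = 0;
      for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
          const int diff = src[x] - ref[x];
          sum += diff;
          sse64 += (uint64_t)(diff * diff);
        }
        src += src_stride;
        ref += ref_stride;
      }
      *sse = (unsigned int)sse64;
      return (unsigned int)(sse64 - (uint64_t)(sum * sum) / (w * h));
    }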
diff --git a/config/x86_64/config/aom_scale_rtcd.h b/config/x86_64/config/aom_scale_rtcd.h
index 28e903d..3b70fb4 100644
--- a/config/x86_64/config/aom_scale_rtcd.h
+++ b/config/x86_64/config/aom_scale_rtcd.h
@@ -80,7 +80,7 @@
void aom_yv12_partial_copy_y_c(const struct yv12_buffer_config *src_ybc, int hstart1, int hend1, int vstart1, int vend1, struct yv12_buffer_config *dst_ybc, int hstart2, int vstart2);
#define aom_yv12_partial_copy_y aom_yv12_partial_copy_y_c
-int aom_yv12_realloc_with_new_border_c(struct yv12_buffer_config *ybf, int new_border, int byte_alignment, int num_planes);
+int aom_yv12_realloc_with_new_border_c(struct yv12_buffer_config *ybf, int new_border, int byte_alignment, int num_pyramid_levels, int num_planes);
#define aom_yv12_realloc_with_new_border aom_yv12_realloc_with_new_border_c
void aom_scale_rtcd(void);
diff --git a/config/x86_64/config/av1_rtcd.h b/config/x86_64/config/av1_rtcd.h
index 00a607d..c64a024 100644
--- a/config/x86_64/config/av1_rtcd.h
+++ b/config/x86_64/config/av1_rtcd.h
@@ -15,12 +15,12 @@
#include "aom/aom_integer.h"
#include "aom_dsp/odintrin.h"
#include "aom_dsp/txfm_common.h"
-#include "av1/common/common.h"
-#include "av1/common/enums.h"
-#include "av1/common/quant_common.h"
-#include "av1/common/filter.h"
-#include "av1/common/convolve.h"
#include "av1/common/av1_txfm.h"
+#include "av1/common/common.h"
+#include "av1/common/convolve.h"
+#include "av1/common/enums.h"
+#include "av1/common/filter.h"
+#include "av1/common/quant_common.h"
#include "av1/common/restoration.h"
struct macroblockd;
@@ -86,18 +86,6 @@
int ref_stride, int subpel_search);
#define aom_comp_avg_upsampled_pred aom_comp_avg_upsampled_pred_sse2
-void aom_comp_mask_upsampled_pred_c(MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
- const MV *const mv, uint8_t *comp_pred, const uint8_t *pred, int width,
- int height, int subpel_x_q3, int subpel_y_q3, const uint8_t *ref,
- int ref_stride, const uint8_t *mask, int mask_stride, int invert_mask,
- int subpel_search);
-void aom_comp_mask_upsampled_pred_sse2(MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
- const MV *const mv, uint8_t *comp_pred, const uint8_t *pred, int width,
- int height, int subpel_x_q3, int subpel_y_q3, const uint8_t *ref,
- int ref_stride, const uint8_t *mask, int mask_stride, int invert_mask,
- int subpel_search);
-#define aom_comp_mask_upsampled_pred aom_comp_mask_upsampled_pred_sse2
-
void aom_dist_wtd_comp_avg_upsampled_pred_c(MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
const MV *const mv, uint8_t *comp_pred, const uint8_t *pred, int width,
int height, int subpel_x_q3, int subpel_y_q3, const uint8_t *ref,
@@ -148,8 +136,8 @@
void av1_apply_selfguided_restoration_c(const uint8_t *dat, int width, int height, int stride, int eps, const int *xqd, uint8_t *dst, int dst_stride, int32_t *tmpbuf, int bit_depth, int highbd);
#define av1_apply_selfguided_restoration av1_apply_selfguided_restoration_c
-void av1_apply_temporal_filter_c(const struct yv12_buffer_config *ref_frame, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, const uint8_t *pred, uint32_t *accum, uint16_t *count);
-void av1_apply_temporal_filter_sse2(const struct yv12_buffer_config *ref_frame, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, const uint8_t *pred, uint32_t *accum, uint16_t *count);
+void av1_apply_temporal_filter_c(const struct yv12_buffer_config *frame_to_filter, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum, uint16_t *count);
+void av1_apply_temporal_filter_sse2(const struct yv12_buffer_config *frame_to_filter, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum, uint16_t *count);
#define av1_apply_temporal_filter av1_apply_temporal_filter_sse2
int64_t av1_block_error_c(const tran_low_t *coeff, const tran_low_t *dqcoeff, intptr_t block_size, int64_t *ssz);
@@ -206,7 +194,7 @@
bool av1_cnn_predict_c( const float **input, int in_width, int in_height, int in_stride, const CNN_CONFIG *cnn_config, const CNN_THREAD_DATA *thread_data, CNN_MULTI_OUT *output_struct);
#define av1_cnn_predict av1_cnn_predict_c
-void av1_compute_stats_c(int wiener_win, const uint8_t *dgd8, const uint8_t *src8, int h_start, int h_end, int v_start, int v_end, int dgd_stride, int src_stride, int64_t *M, int64_t *H, int use_downsampled_wiener_stats);
+void av1_compute_stats_c(int wiener_win, const uint8_t *dgd8, const uint8_t *src8, int16_t *dgd_avg, int16_t *src_avg, int h_start, int h_end, int v_start, int v_end, int dgd_stride, int src_stride, int64_t *M, int64_t *H, int use_downsampled_wiener_stats);
#define av1_compute_stats av1_compute_stats_c
void av1_compute_stats_highbd_c(int wiener_win, const uint8_t *dgd8, const uint8_t *src8, int h_start, int h_end, int v_start, int v_end, int dgd_stride, int src_stride, int64_t *M, int64_t *H, aom_bit_depth_t bit_depth);
@@ -256,6 +244,9 @@
void av1_dr_prediction_z3_c(uint8_t *dst, ptrdiff_t stride, int bw, int bh, const uint8_t *above, const uint8_t *left, int upsample_left, int dx, int dy);
#define av1_dr_prediction_z3 av1_dr_prediction_z3_c
+double av1_estimate_noise_from_single_plane_c(const uint8_t *src, int height, int width, int stride, int edge_thresh);
+#define av1_estimate_noise_from_single_plane av1_estimate_noise_from_single_plane_c
+
void av1_filter_intra_edge_c(uint8_t *p, int sz, int strength);
#define av1_filter_intra_edge av1_filter_intra_edge_c
@@ -335,8 +326,8 @@
void av1_get_nz_map_contexts_sse2(const uint8_t *const levels, const int16_t *const scan, const uint16_t eob, const TX_SIZE tx_size, const TX_CLASS tx_class, int8_t *const coeff_contexts);
#define av1_get_nz_map_contexts av1_get_nz_map_contexts_sse2
-void av1_highbd_apply_temporal_filter_c(const struct yv12_buffer_config *ref_frame, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, const uint8_t *pred, uint32_t *accum, uint16_t *count);
-void av1_highbd_apply_temporal_filter_sse2(const struct yv12_buffer_config *ref_frame, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, const uint8_t *pred, uint32_t *accum, uint16_t *count);
+void av1_highbd_apply_temporal_filter_c(const struct yv12_buffer_config *frame_to_filter, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum, uint16_t *count);
+void av1_highbd_apply_temporal_filter_sse2(const struct yv12_buffer_config *frame_to_filter, const struct macroblockd *mbd, const BLOCK_SIZE block_size, const int mb_row, const int mb_col, const int num_planes, const double *noise_levels, const MV *subblock_mvs, const int *subblock_mses, const int q_factor, const int filter_strength, int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum, uint16_t *count);
#define av1_highbd_apply_temporal_filter av1_highbd_apply_temporal_filter_sse2
int64_t av1_highbd_block_error_c(const tran_low_t *coeff, const tran_low_t *dqcoeff, intptr_t block_size, int64_t *ssz, int bd);
@@ -400,8 +391,8 @@
void av1_highbd_dr_prediction_z3_c(uint16_t *dst, ptrdiff_t stride, int bw, int bh, const uint16_t *above, const uint16_t *left, int upsample_left, int dx, int dy, int bd);
#define av1_highbd_dr_prediction_z3 av1_highbd_dr_prediction_z3_c
-void av1_highbd_fwht4x4_c(const int16_t *input, tran_low_t *output, int stride);
-#define av1_highbd_fwht4x4 av1_highbd_fwht4x4_c
+double av1_highbd_estimate_noise_from_single_plane_c(const uint16_t *src, int height, int width, int stride, int bit_depth, int edge_thresh);
+#define av1_highbd_estimate_noise_from_single_plane av1_highbd_estimate_noise_from_single_plane_c
void av1_highbd_inv_txfm_add_c(const tran_low_t *input, uint8_t *dest, int stride, const TxfmParam *txfm_param);
#define av1_highbd_inv_txfm_add av1_highbd_inv_txfm_add_c
diff --git a/docs.cmake b/docs.cmake
index 0825ca4..0d8db92 100644
--- a/docs.cmake
+++ b/docs.cmake
@@ -100,7 +100,7 @@
"Scalable encoder loop.")
set(AOM_DOXYGEN_EXAMPLE_SOURCES ${AOM_DOXYGEN_EXAMPLE_SOURCES}
- "${AOM_ROOT}/examples/svc_encoder_rtc.c")
+ "${AOM_ROOT}/examples/svc_encoder_rtc.cc")
set(AOM_DOXYGEN_EXAMPLE_DESCRIPTIONS ${AOM_DOXYGEN_EXAMPLE_DESCRIPTIONS}
"Layered encoder for RTC.")
diff --git a/examples/encoder_util.h b/examples/encoder_util.h
index a6bb3fb..fa0e7d1 100644
--- a/examples/encoder_util.h
+++ b/examples/encoder_util.h
@@ -14,6 +14,10 @@
#ifndef AOM_EXAMPLES_ENCODER_UTIL_H_
#define AOM_EXAMPLES_ENCODER_UTIL_H_
+#ifdef __cplusplus
+extern "C" {
+#endif
+
#include "aom/aom_image.h"
// Returns mismatch location (?loc[0],?loc[1]) and the values at that location
@@ -30,4 +34,7 @@
int aom_compare_img(const aom_image_t *const img1,
const aom_image_t *const img2);
+#ifdef __cplusplus
+}
+#endif
#endif // AOM_EXAMPLES_ENCODER_UTIL_H_
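These guards matter because svc_encoder_rtc.c becomes a C++ file later in this patch while encoder_util remains a C translation unit; without extern "C" the C++ includer would emit mangled references to aom_compare_img and the other helpers here and fail at link time. A minimal sketch of the idiom, with a hypothetical header and function:

    /* util.h -- a C header that both C and C++ callers can include. */
    #ifdef __cplusplus
    extern "C" {
    #endif

    int util_add(int a, int b); /* defined in a C translation unit */

    #ifdef __cplusplus
    } /* C linkage: C++ callers now link against the unmangled symbol */
    #endif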
diff --git a/examples/inspect.c b/examples/inspect.c
index 8e7213a..ed77b5d 100644
--- a/examples/inspect.c
+++ b/examples/inspect.c
@@ -509,7 +509,6 @@
int r, c, t, i;
if (compress && len == 1) {
die("Can't encode scalars as arrays when RLE compression is enabled.");
- return -1;
}
if (map) {
buf += snprintf(buf, MAX_BUFFER, " \"%sMap\": {", name);
diff --git a/examples/lightfield_bitstream_parsing.c b/examples/lightfield_bitstream_parsing.c
index 35b4ad0..05272ba 100644
--- a/examples/lightfield_bitstream_parsing.c
+++ b/examples/lightfield_bitstream_parsing.c
@@ -92,15 +92,14 @@
case AOM_IMG_FMT_I44416: return 48;
default: die("Invalid image format");
}
- return 0;
}
-void process_tile_list(const TILE_LIST_INFO *tiles, int num_tiles,
- aom_codec_pts_t tl_pts, unsigned char **frames,
- const size_t *frame_sizes, aom_codec_ctx_t *codec,
- unsigned char *tl_buf, AvxVideoWriter *writer,
- uint8_t output_frame_width_in_tiles_minus_1,
- uint8_t output_frame_height_in_tiles_minus_1) {
+static void process_tile_list(const TILE_LIST_INFO *tiles, int num_tiles,
+ aom_codec_pts_t tl_pts, unsigned char **frames,
+ const size_t *frame_sizes, aom_codec_ctx_t *codec,
+ unsigned char *tl_buf, AvxVideoWriter *writer,
+ uint8_t output_frame_width_in_tiles_minus_1,
+ uint8_t output_frame_height_in_tiles_minus_1) {
unsigned char *tl = tl_buf;
struct aom_write_bit_buffer wb = { tl, 0 };
unsigned char *saved_obu_size_loc = NULL;
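This hunk and the inspect.c hunk above both delete a dead return after die(); that stays warning-clean only because die() never returns, which the tools presumably advertise with a noreturn attribute. A self-contained sketch of the idiom (stand-in die(), not the tools' exact declaration):

    #include <stdarg.h>
    #include <stdio.h>
    #include <stdlib.h>

    #ifdef __GNUC__
    #define TOOLS_NORETURN __attribute__((noreturn))
    #else
    #define TOOLS_NORETURN
    #endif

    /* Stand-in for the tools' die(): report and exit, never return. */
    TOOLS_NORETURN static void die(const char *fmt, ...) {
      va_list ap;
      va_start(ap, fmt);
      vfprintf(stderr, fmt, ap);
      va_end(ap);
      fputc('\n', stderr);
      exit(EXIT_FAILURE);
    }

    /* No dummy return needed after the switch: the default arm cannot
     * fall through once die() is known to be noreturn. */
    static int bytes_per_sample(int high_bitdepth) {
      switch (high_bitdepth) {
        case 0: return 1;
        case 1: return 2;
        default: die("Invalid image format");
      }
    }

    int main(void) { return bytes_per_sample(1) - 2; }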
diff --git a/examples/svc_encoder_rtc.c b/examples/svc_encoder_rtc.cc
similarity index 82%
rename from examples/svc_encoder_rtc.c
rename to examples/svc_encoder_rtc.cc
index bceb7d2..1730f89 100644
--- a/examples/svc_encoder_rtc.c
+++ b/examples/svc_encoder_rtc.cc
@@ -12,15 +12,19 @@
// encoding scheme for RTC video applications.
#include <assert.h>
+#include <limits.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
+#include "config/aom_config.h"
+
+#if CONFIG_AV1_DECODER
+#include "aom/aom_decoder.h"
+#endif
#include "aom/aom_encoder.h"
#include "aom/aomcx.h"
-#include "av1/common/enums.h"
-#include "av1/encoder/encoder.h"
#include "common/args.h"
#include "common/tools_common.h"
#include "common/video_writer.h"
@@ -39,6 +43,7 @@
int output_obu;
int decode;
int tune_content;
+ int show_psnr;
} AppInput;
typedef enum {
@@ -92,6 +97,8 @@
static const arg_def_t test_decode_arg =
ARG_DEF(NULL, "test-decode", 1,
"Attempt to test decoding the output when set to 1. Default is 1.");
+static const arg_def_t psnr_arg =
+ ARG_DEF(NULL, "psnr", -1, "Show PSNR in status line.");
static const struct arg_enum_list tune_content_enum[] = {
{ "default", AOM_CONTENT_DEFAULT },
{ "screen", AOM_CONTENT_SCREEN },
@@ -102,40 +109,27 @@
NULL, "tune-content", 1, "Tune content type", tune_content_enum);
#if CONFIG_AV1_HIGHBITDEPTH
-static const struct arg_enum_list bitdepth_enum[] = {
- { "8", AOM_BITS_8 }, { "10", AOM_BITS_10 }, { "12", AOM_BITS_12 }, { NULL, 0 }
-};
+static const struct arg_enum_list bitdepth_enum[] = { { "8", AOM_BITS_8 },
+ { "10", AOM_BITS_10 },
+ { NULL, 0 } };
static const arg_def_t bitdepth_arg = ARG_DEF_ENUM(
- "d", "bit-depth", 1, "Bit depth for codec 8, 10 or 12. ", bitdepth_enum);
+ "d", "bit-depth", 1, "Bit depth for codec 8 or 10. ", bitdepth_enum);
#endif // CONFIG_AV1_HIGHBITDEPTH
-static const arg_def_t *svc_args[] = { &frames_arg,
- &outputfile,
- &width_arg,
- &height_arg,
- &timebase_arg,
- &bitrate_arg,
- &spatial_layers_arg,
- &kf_dist_arg,
- &scale_factors_arg,
- &min_q_arg,
- &max_q_arg,
- &temporal_layers_arg,
- &layering_mode_arg,
- &threads_arg,
- &aqmode_arg,
+static const arg_def_t *svc_args[] = {
+ &frames_arg, &outputfile, &width_arg,
+ &height_arg, &timebase_arg, &bitrate_arg,
+ &spatial_layers_arg, &kf_dist_arg, &scale_factors_arg,
+ &min_q_arg, &max_q_arg, &temporal_layers_arg,
+ &layering_mode_arg, &threads_arg, &aqmode_arg,
#if CONFIG_AV1_HIGHBITDEPTH
- &bitdepth_arg,
+ &bitdepth_arg,
#endif
- &speed_arg,
- &bitrates_arg,
- &dropframe_thresh_arg,
- &error_resilient_arg,
- &output_obu_arg,
- &test_decode_arg,
- &tune_content_arg,
- NULL };
+ &speed_arg, &bitrates_arg, &dropframe_thresh_arg,
+ &error_resilient_arg, &output_obu_arg, &test_decode_arg,
+ &tune_content_arg, &psnr_arg, NULL,
+};
#define zero(Dest) memset(&(Dest), 0, sizeof(Dest))
@@ -202,7 +196,7 @@
input->framerate.numerator = input->y4m.fps_n;
input->framerate.denominator = input->y4m.fps_d;
input->fmt = input->y4m.aom_fmt;
- input->bit_depth = input->y4m.bit_depth;
+ input->bit_depth = static_cast<aom_bit_depth_t>(input->y4m.bit_depth);
} else {
fatal("Unsupported Y4M stream.");
}
@@ -252,10 +246,10 @@
(option1 == NULL && type == SCALE_FACTOR))
return AOM_CODEC_INVALID_PARAM;
- input_string = malloc(strlen(input));
- if (!input_string) die("Failed to allocate input string.");
- memcpy(input_string, input, strlen(input));
+ const size_t input_length = strlen(input);
+ input_string = reinterpret_cast<char *>(malloc(input_length + 1));
if (input_string == NULL) return AOM_CODEC_MEM_ERROR;
+ memcpy(input_string, input, input_length + 1);
token = strtok(input_string, delim); // NOLINT
for (i = 0; i < num_layers; ++i) {
if (token != NULL) {
@@ -263,12 +257,10 @@
if (res != AOM_CODEC_OK) break;
token = strtok(NULL, delim); // NOLINT
} else {
+ res = AOM_CODEC_INVALID_PARAM;
break;
}
}
- if (res == AOM_CODEC_OK && i != num_layers) {
- res = AOM_CODEC_INVALID_PARAM;
- }
free(input_string);
return res;
}
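The allocation fix above is worth spelling out: the old code malloc'd strlen(input) bytes and copied strlen(input) bytes, so the copy was never NUL-terminated and the strtok() that follows could read past the end of the buffer. The corrected idiom as a standalone plain-C sketch (the patch itself spells the cast with reinterpret_cast because the file is now C++):

    #include <stdlib.h>
    #include <string.h>

    /* Duplicate a string so strtok() can safely scribble on the copy. */
    static char *dup_for_tokenize(const char *input) {
      const size_t len = strlen(input);
      char *copy = malloc(len + 1); /* +1 for the terminating NUL */
      if (copy == NULL) return NULL;
      memcpy(copy, input, len + 1); /* the NUL comes along */
      return copy;
    }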
@@ -317,8 +309,8 @@
svc_params->number_temporal_layers = arg_parse_uint(&arg);
} else if (arg_match(&arg, &speed_arg, argi)) {
app_input->speed = arg_parse_uint(&arg);
- if (app_input->speed > 10) {
- aom_tools_warn("Mapping speed %d to speed 10.\n", app_input->speed);
+ if (app_input->speed > 11) {
+ aom_tools_warn("Mapping speed %d to speed 11.\n", app_input->speed);
}
} else if (arg_match(&arg, &aqmode_arg, argi)) {
app_input->aq_mode = arg_parse_uint(&arg);
@@ -330,16 +322,21 @@
enc_cfg->kf_min_dist = arg_parse_uint(&arg);
enc_cfg->kf_max_dist = enc_cfg->kf_min_dist;
} else if (arg_match(&arg, &scale_factors_arg, argi)) {
- parse_layer_options_from_string(svc_params, SCALE_FACTOR, arg.val,
- svc_params->scaling_factor_num,
- svc_params->scaling_factor_den);
+ aom_codec_err_t res = parse_layer_options_from_string(
+ svc_params, SCALE_FACTOR, arg.val, svc_params->scaling_factor_num,
+ svc_params->scaling_factor_den);
+ if (res != AOM_CODEC_OK) {
+ die("Failed to parse scale factors: %s\n",
+ aom_codec_err_to_string(res));
+ }
} else if (arg_match(&arg, &min_q_arg, argi)) {
enc_cfg->rc_min_quantizer = arg_parse_uint(&arg);
} else if (arg_match(&arg, &max_q_arg, argi)) {
enc_cfg->rc_max_quantizer = arg_parse_uint(&arg);
#if CONFIG_AV1_HIGHBITDEPTH
} else if (arg_match(&arg, &bitdepth_arg, argi)) {
- enc_cfg->g_bit_depth = arg_parse_enum_or_int(&arg);
+ enc_cfg->g_bit_depth =
+ static_cast<aom_bit_depth_t>(arg_parse_enum_or_int(&arg));
switch (enc_cfg->g_bit_depth) {
case AOM_BITS_8:
enc_cfg->g_input_bit_depth = 8;
@@ -347,15 +344,10 @@
break;
case AOM_BITS_10:
enc_cfg->g_input_bit_depth = 10;
- enc_cfg->g_profile = 2;
- break;
- case AOM_BITS_12:
- enc_cfg->g_input_bit_depth = 12;
- enc_cfg->g_profile = 2;
+ enc_cfg->g_profile = 0;
break;
default:
die("Error: Invalid bit depth selected (%d)\n", enc_cfg->g_bit_depth);
- break;
}
#endif // CONFIG_AV1_HIGHBITDEPTH
} else if (arg_match(&arg, &dropframe_thresh_arg, argi)) {
@@ -378,6 +370,8 @@
} else if (arg_match(&arg, &tune_content_arg, argi)) {
app_input->tune_content = arg_parse_enum_or_int(&arg);
printf("tune content %d\n", app_input->tune_content);
+ } else if (arg_match(&arg, &psnr_arg, argi)) {
+ app_input->show_psnr = 1;
} else {
++argj;
}
@@ -387,8 +381,11 @@
for (argi = argj = argv; (*argj = *argi); argi += arg.argv_step) {
arg.argv_step = 1;
if (arg_match(&arg, &bitrates_arg, argi)) {
- parse_layer_options_from_string(svc_params, BITRATE, arg.val,
- svc_params->layer_target_bitrate, NULL);
+ aom_codec_err_t res = parse_layer_options_from_string(
+ svc_params, BITRATE, arg.val, svc_params->layer_target_bitrate, NULL);
+ if (res != AOM_CODEC_OK) {
+ die("Failed to parse bitrates: %s\n", aom_codec_err_to_string(res));
+ }
} else {
++argj;
}
@@ -410,7 +407,7 @@
app_input->input_ctx.filename = argv[0];
free(argv);
- open_input_file(&app_input->input_ctx, 0);
+ open_input_file(&app_input->input_ctx, AOM_CSP_UNKNOWN);
if (app_input->input_ctx.file_type == FILE_TYPE_Y4M) {
enc_cfg->g_w = app_input->input_ctx.width;
enc_cfg->g_h = app_input->input_ctx.height;
@@ -432,10 +429,10 @@
enc_cfg->rc_target_bitrate, enc_cfg->kf_max_dist);
}
-static unsigned int mode_to_num_temporal_layers[11] = { 1, 2, 3, 3, 2, 1,
- 1, 3, 3, 3, 3 };
-static unsigned int mode_to_num_spatial_layers[11] = { 1, 1, 1, 1, 1, 2,
- 3, 2, 3, 3, 3 };
+static int mode_to_num_temporal_layers[11] = {
+ 1, 2, 3, 3, 2, 1, 1, 3, 3, 3, 3
+};
+static int mode_to_num_spatial_layers[11] = { 1, 1, 1, 1, 1, 2, 3, 2, 3, 3, 3 };
// For rate control encoding stats.
struct RateControlMetrics {
@@ -465,6 +462,10 @@
int layer_target_bitrate[AOM_MAX_LAYERS];
};
+static const int REF_FRAMES = 8;
+
+static const int INTER_REFS_PER_FRAME = 7;
+
// Reference frames used in this example encoder.
enum {
SVC_LAST_FRAME = 0,
@@ -502,9 +503,8 @@
// TODO(marpan): Update these metrics to account for multiple key frames
// in the stream.
static void set_rate_control_metrics(struct RateControlMetrics *rc,
- double framerate,
- unsigned int ss_number_layers,
- unsigned int ts_number_layers) {
+ double framerate, int ss_number_layers,
+ int ts_number_layers) {
int ts_rate_decimator[AOM_MAX_TS_LAYERS] = { 1 };
ts_rate_decimator[0] = 1;
if (ts_number_layers == 2) {
@@ -518,12 +518,12 @@
}
// Set the layer (cumulative) framerate and the target layer (non-cumulative)
// per-frame-bandwidth, for the rate control encoding stats below.
- for (unsigned int sl = 0; sl < ss_number_layers; ++sl) {
- unsigned int i = sl * ts_number_layers;
+ for (int sl = 0; sl < ss_number_layers; ++sl) {
+ int i = sl * ts_number_layers;
rc->layer_framerate[0] = framerate / ts_rate_decimator[0];
rc->layer_pfb[i] =
1000.0 * rc->layer_target_bitrate[i] / rc->layer_framerate[0];
- for (unsigned int tl = 0; tl < ts_number_layers; ++tl) {
+ for (int tl = 0; tl < ts_number_layers; ++tl) {
i = sl * ts_number_layers + tl;
if (tl > 0) {
rc->layer_framerate[tl] = framerate / ts_rate_decimator[tl];
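The budgeting in this loop is cumulative: rc->layer_target_bitrate[i] is the total rate of temporal layers 0..tl, and layer_pfb[i] divides it (in bits) by the cumulative layer framerate. A worked example with hypothetical numbers, using 30 fps and the {4, 2, 1} decimators that 3 temporal layers get:

    #include <stdio.h>

    int main(void) {
      const double framerate = 30.0;
      const int ts_rate_decimator[3] = { 4, 2, 1 };
      /* Hypothetical cumulative per-layer targets, in kbps. */
      const int layer_target_bitrate[3] = { 200, 400, 600 };
      for (int tl = 0; tl < 3; ++tl) {
        const double layer_fps = framerate / ts_rate_decimator[tl];
        const double layer_pfb = 1000.0 * layer_target_bitrate[tl] / layer_fps;
        printf("TL%d: %.1f fps, %.0f bits per frame\n", tl, layer_fps,
               layer_pfb);
      }
      return 0;
    }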
@@ -546,17 +546,16 @@
}
static void printout_rate_control_summary(struct RateControlMetrics *rc,
- int frame_cnt,
- unsigned int ss_number_layers,
- unsigned int ts_number_layers) {
+ int frame_cnt, int ss_number_layers,
+ int ts_number_layers) {
int tot_num_frames = 0;
double perc_fluctuation = 0.0;
printf("Total number of processed frames: %d\n\n", frame_cnt - 1);
- printf("Rate control layer stats for %u layer(s):\n\n", ts_number_layers);
- for (unsigned int sl = 0; sl < ss_number_layers; ++sl) {
+ printf("Rate control layer stats for %d layer(s):\n\n", ts_number_layers);
+ for (int sl = 0; sl < ss_number_layers; ++sl) {
tot_num_frames = 0;
- for (unsigned int tl = 0; tl < ts_number_layers; ++tl) {
- unsigned int i = sl * ts_number_layers + tl;
+ for (int tl = 0; tl < ts_number_layers; ++tl) {
+ int i = sl * ts_number_layers + tl;
const int num_dropped =
tl > 0 ? rc->layer_input_frames[tl] - rc->layer_enc_frames[tl]
: rc->layer_input_frames[tl] - rc->layer_enc_frames[tl] - 1;
@@ -568,7 +567,7 @@
rc->layer_avg_frame_size[i] / rc->layer_enc_frames[tl];
rc->layer_avg_rate_mismatch[i] =
100.0 * rc->layer_avg_rate_mismatch[i] / rc->layer_enc_frames[tl];
- printf("For layer#: %u %u \n", sl, tl);
+ printf("For layer#: %d %d \n", sl, tl);
printf("Bitrate (target vs actual): %d %f\n", rc->layer_target_bitrate[i],
rc->layer_encoding_bitrate[i]);
printf("Average frame size (target vs actual): %f %f\n", rc->layer_pfb[i],
@@ -637,10 +636,10 @@
ref_frame_config->reference[SVC_LAST_FRAME] = 1;
} else {
// Pattern of 2 references (ALTREF and GOLDEN) trailing
- // LAST by 4 and 8 frame, with some switching logic to
- // sometimes only predict from longer-term reference.
- // This is simple example to test RPS (reference picture selection)
- // as method to handle network packet loss.
+ // LAST by 4 and 8 frames, with some switching logic to
+ // sometimes only predict from the longer-term reference
+      // (golden here). This is a simple example to test RPS
+ // (reference picture selection).
int last_idx = 0;
int last_idx_refresh = 0;
int gld_idx = 0;
@@ -674,17 +673,20 @@
ref_frame_config->reference[SVC_LAST_FRAME] = 1;
ref_frame_config->reference[SVC_ALTREF_FRAME] = 1;
ref_frame_config->reference[SVC_GOLDEN_FRAME] = 1;
- // Switch to only ALTREF for frames 200 to 250.
- if (superframe_cnt >= 200 && superframe_cnt < 250) {
- ref_frame_config->reference[SVC_LAST_FRAME] = 0;
- ref_frame_config->reference[SVC_ALTREF_FRAME] = 1;
- ref_frame_config->reference[SVC_GOLDEN_FRAME] = 0;
- }
- // Switch to only GOLDEN for frames 400 to 450.
- if (superframe_cnt >= 400 && superframe_cnt < 450) {
+      // Switch to only GOLDEN every 200 frames.
+ if (superframe_cnt % 200 == 0 && superframe_cnt > 0) {
ref_frame_config->reference[SVC_LAST_FRAME] = 0;
ref_frame_config->reference[SVC_ALTREF_FRAME] = 0;
ref_frame_config->reference[SVC_GOLDEN_FRAME] = 1;
+          // Test if the long-term reference is LAST instead; this is just
+          // a renaming, but it checks that the encoder behaves the same
+          // whether the long-term slot is LAST or GOLDEN.
+ if (superframe_cnt % 400 == 0 && superframe_cnt > 0) {
+ ref_frame_config->ref_idx[SVC_LAST_FRAME] = gld_idx;
+ ref_frame_config->reference[SVC_LAST_FRAME] = 1;
+ ref_frame_config->reference[SVC_ALTREF_FRAME] = 0;
+ ref_frame_config->reference[SVC_GOLDEN_FRAME] = 0;
+ }
}
}
break;
@@ -692,16 +694,36 @@
// 2-temporal layer.
// 1 3 5
// 0 2 4
+ // Keep golden fixed at slot 3.
+ base_count = superframe_cnt >> 1;
+ ref_frame_config->ref_idx[SVC_GOLDEN_FRAME] = 3;
+      // Cyclically refresh slots 5, 6, 7 for the lagging altref.
+ lag_index = 5;
+ if (base_count > 0) {
+ lag_index = 5 + (base_count % 3);
+ if (superframe_cnt % 2 != 0) lag_index = 5 + ((base_count + 1) % 3);
+ }
+ // Set the altref slot to lag_index.
+ ref_frame_config->ref_idx[SVC_ALTREF_FRAME] = lag_index;
if (superframe_cnt % 2 == 0) {
layer_id->temporal_layer_id = 0;
// Update LAST on layer 0, reference LAST.
ref_frame_config->refresh[0] = 1;
ref_frame_config->reference[SVC_LAST_FRAME] = 1;
+      // Refresh lag_index slot, needed for the lagging golden.
+ ref_frame_config->refresh[lag_index] = 1;
+      // Refresh GOLDEN every 32 base layer frames.
+ if (base_count % 32 == 0) ref_frame_config->refresh[3] = 1;
} else {
layer_id->temporal_layer_id = 1;
- // No updates on layer 1, only reference LAST (TL0).
+ // No updates on layer 1, reference LAST (TL0).
ref_frame_config->reference[SVC_LAST_FRAME] = 1;
}
+ // Always reference golden and altref on TL0.
+ if (layer_id->temporal_layer_id == 0) {
+ ref_frame_config->reference[SVC_GOLDEN_FRAME] = 1;
+ ref_frame_config->reference[SVC_ALTREF_FRAME] = 1;
+ }
break;
case 2:
// 3-temporal layer:
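The slot schedule in the 2-temporal-layer case above is easiest to verify by printing a few frames: GOLDEN stays pinned to slot 3 while the lagging altref cycles through slots 5, 6 and 7, with odd (TL1) frames biased one step ahead of the base count. A standalone sketch using the same formulas as the hunk:

    #include <stdio.h>

    int main(void) {
      for (int superframe_cnt = 0; superframe_cnt < 8; ++superframe_cnt) {
        const int base_count = superframe_cnt >> 1;
        int lag_index = 5;
        if (base_count > 0) {
          lag_index = 5 + (base_count % 3);
          if (superframe_cnt % 2 != 0) lag_index = 5 + ((base_count + 1) % 3);
        }
        printf("frame %d: TL%d, altref slot %d, golden slot 3\n",
               superframe_cnt, superframe_cnt % 2, lag_index);
      }
      return 0;
    }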
@@ -781,8 +803,11 @@
// Every frame can reference GOLDEN AND ALTREF.
ref_frame_config->reference[SVC_GOLDEN_FRAME] = 1;
ref_frame_config->reference[SVC_ALTREF_FRAME] = 1;
- // Allow for compound prediction using LAST and ALTREF.
- if (speed >= 7) ref_frame_comp_pred->use_comp_pred[2] = 1;
+ // Allow for compound prediction for LAST-ALTREF and LAST-GOLDEN.
+ if (speed >= 7) {
+ ref_frame_comp_pred->use_comp_pred[2] = 1;
+ ref_frame_comp_pred->use_comp_pred[0] = 1;
+ }
break;
case 4:
// 3-temporal layer: but middle layer updates GF, so 2nd TL2 will
@@ -1108,13 +1133,14 @@
}
#if CONFIG_AV1_DECODER
-static void test_decode(aom_codec_ctx_t *encoder, aom_codec_ctx_t *decoder,
- const int frames_out, int *mismatch_seen) {
+// Returns whether there is a mismatch between the encoder's new frame and the
+// decoder's new frame.
+static int test_decode(aom_codec_ctx_t *encoder, aom_codec_ctx_t *decoder,
+ const int frames_out) {
aom_image_t enc_img, dec_img;
+ int mismatch = 0;
- if (*mismatch_seen) return;
-
- /* Get the internal reference frame */
+ /* Get the internal new frame */
AOM_CODEC_CONTROL_TYPECHECKED(encoder, AV1_GET_NEW_FRAME_IMAGE, &enc_img);
AOM_CODEC_CONTROL_TYPECHECKED(decoder, AV1_GET_NEW_FRAME_IMAGE, &dec_img);
@@ -1123,15 +1149,19 @@
(dec_img.fmt & AOM_IMG_FMT_HIGHBITDEPTH)) {
if (enc_img.fmt & AOM_IMG_FMT_HIGHBITDEPTH) {
aom_image_t enc_hbd_img;
- aom_img_alloc(&enc_hbd_img, enc_img.fmt - AOM_IMG_FMT_HIGHBITDEPTH,
- enc_img.d_w, enc_img.d_h, 16);
+ aom_img_alloc(
+ &enc_hbd_img,
+ static_cast<aom_img_fmt_t>(enc_img.fmt - AOM_IMG_FMT_HIGHBITDEPTH),
+ enc_img.d_w, enc_img.d_h, 16);
aom_img_truncate_16_to_8(&enc_hbd_img, &enc_img);
enc_img = enc_hbd_img;
}
if (dec_img.fmt & AOM_IMG_FMT_HIGHBITDEPTH) {
aom_image_t dec_hbd_img;
- aom_img_alloc(&dec_hbd_img, dec_img.fmt - AOM_IMG_FMT_HIGHBITDEPTH,
- dec_img.d_w, dec_img.d_h, 16);
+ aom_img_alloc(
+ &dec_hbd_img,
+ static_cast<aom_img_fmt_t>(dec_img.fmt - AOM_IMG_FMT_HIGHBITDEPTH),
+ dec_img.d_w, dec_img.d_h, 16);
aom_img_truncate_16_to_8(&dec_hbd_img, &dec_img);
dec_img = dec_hbd_img;
}
@@ -1149,22 +1179,47 @@
#else
aom_find_mismatch(&enc_img, &dec_img, y, u, v);
#endif
- decoder->err = 1;
- printf(
- "Encode/decode mismatch on frame %d at"
- " Y[%d, %d] {%d/%d},"
- " U[%d, %d] {%d/%d},"
- " V[%d, %d] {%d/%d}",
- frames_out, y[0], y[1], y[2], y[3], u[0], u[1], u[2], u[3], v[0], v[1],
- v[2], v[3]);
- *mismatch_seen = frames_out;
+ fprintf(stderr,
+ "Encode/decode mismatch on frame %d at"
+ " Y[%d, %d] {%d/%d},"
+ " U[%d, %d] {%d/%d},"
+ " V[%d, %d] {%d/%d}\n",
+ frames_out, y[0], y[1], y[2], y[3], u[0], u[1], u[2], u[3], v[0],
+ v[1], v[2], v[3]);
+ mismatch = 1;
}
aom_img_free(&enc_img);
aom_img_free(&dec_img);
+ return mismatch;
}
#endif // CONFIG_AV1_DECODER
+struct psnr_stats {
+ // The second element of these arrays is reserved for high bitdepth.
+ uint64_t psnr_sse_total[2];
+ uint64_t psnr_samples_total[2];
+ double psnr_totals[2][4];
+ int psnr_count[2];
+};
+
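+// show_psnr() below reports the overall and per-plane averages;
+// sse_to_psnr() follows the standard definition,
+// 10 * log10(peak^2 * samples / sse), typically capped for zero sse.
+// peak is 255.0 for the 8-bit case used here.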
+static void show_psnr(struct psnr_stats *psnr_stream, double peak) {
+ double ovpsnr;
+
+ if (!psnr_stream->psnr_count[0]) return;
+
+ fprintf(stderr, "\nPSNR (Overall/Avg/Y/U/V)");
+ ovpsnr = sse_to_psnr((double)psnr_stream->psnr_samples_total[0], peak,
+ (double)psnr_stream->psnr_sse_total[0]);
+ fprintf(stderr, " %.3f", ovpsnr);
+
+ for (int i = 0; i < 4; i++) {
+ fprintf(stderr, " %.3f",
+ psnr_stream->psnr_totals[0][i] / psnr_stream->psnr_count[0]);
+ }
+ fprintf(stderr, "\n");
+}
+
int main(int argc, const char **argv) {
AppInput app_input;
AvxVideoWriter *outfile[AOM_MAX_LAYERS] = { NULL };
@@ -1177,7 +1232,7 @@
int frame_avail;
int got_data = 0;
int flags = 0;
- unsigned i;
+ int i;
int pts = 0; // PTS starts at 0.
int frame_duration = 1; // 1 timebase tick per frame.
aom_svc_layer_id_t layer_id;
@@ -1192,7 +1247,6 @@
}
#endif
#if CONFIG_AV1_DECODER
- int mismatch_seen = 0;
aom_codec_ctx_t decoder;
#endif
@@ -1205,6 +1259,7 @@
double framerate = 30.0;
int use_svc_control = 1;
int set_err_resil_frame = 0;
+ int test_changing_bitrate = 0;
zero(rc.layer_target_bitrate);
memset(&layer_id, 0, sizeof(aom_svc_layer_id_t));
memset(&app_input, 0, sizeof(AppInput));
@@ -1214,18 +1269,21 @@
// spatial stream, using the scaling_mode control.
const int test_dynamic_scaling_single_layer = 0;
+ // Flag to test setting speed per layer.
+ const int test_speed_per_layer = 0;
+
/* Setup default input stream settings */
app_input.input_ctx.framerate.numerator = 30;
app_input.input_ctx.framerate.denominator = 1;
- app_input.input_ctx.only_i420 = 1;
- app_input.input_ctx.bit_depth = 0;
+ app_input.input_ctx.only_i420 = 0;
+ app_input.input_ctx.bit_depth = AOM_BITS_8;
app_input.speed = 7;
exec_name = argv[0];
// start with default encoder configuration
aom_codec_err_t res = aom_codec_enc_config_default(aom_codec_av1_cx(), &cfg,
AOM_USAGE_REALTIME);
- if (res) {
+ if (res != AOM_CODEC_OK) {
die("Failed to get config: %s\n", aom_codec_err_to_string(res));
}
@@ -1246,8 +1304,8 @@
parse_command_line(argc, argv, &app_input, &svc_params, &cfg);
- unsigned int ts_number_layers = svc_params.number_temporal_layers;
- unsigned int ss_number_layers = svc_params.number_spatial_layers;
+ int ts_number_layers = svc_params.number_temporal_layers;
+ int ss_number_layers = svc_params.number_spatial_layers;
unsigned int width = cfg.g_w;
unsigned int height = cfg.g_h;
@@ -1268,7 +1326,7 @@
}
}
- aom_codec_iface_t *encoder = get_aom_encoder_by_short_name("av1");
+ aom_codec_iface_t *encoder = aom_codec_av1_cx();
memcpy(&rc.layer_target_bitrate[0], &svc_params.layer_target_bitrate[0],
sizeof(svc_params.layer_target_bitrate));
@@ -1311,11 +1369,11 @@
info.time_base.numerator = cfg.g_timebase.num;
info.time_base.denominator = cfg.g_timebase.den;
// Open an output file for each stream.
- for (unsigned int sl = 0; sl < ss_number_layers; ++sl) {
- for (unsigned tl = 0; tl < ts_number_layers; ++tl) {
+ for (int sl = 0; sl < ss_number_layers; ++sl) {
+ for (int tl = 0; tl < ts_number_layers; ++tl) {
i = sl * ts_number_layers + tl;
char file_name[PATH_MAX];
- snprintf(file_name, sizeof(file_name), "%s_%u.av1",
+ snprintf(file_name, sizeof(file_name), "%s_%d.av1",
app_input.output_filename, i);
if (app_input.output_obu) {
obu_files[i] = fopen(file_name, "wb");
@@ -1339,14 +1397,16 @@
// Initialize codec.
aom_codec_ctx_t codec;
- if (aom_codec_enc_init(&codec, encoder, &cfg, 0))
- die("Failed to initialize encoder");
+ aom_codec_flags_t flag = 0;
+ flag |= cfg.g_input_bit_depth == AOM_BITS_8 ? 0 : AOM_CODEC_USE_HIGHBITDEPTH;
+ flag |= app_input.show_psnr ? AOM_CODEC_USE_PSNR : 0;
+ if (aom_codec_enc_init(&codec, encoder, &cfg, flag))
+ die_codec(&codec, "Failed to initialize encoder");
#if CONFIG_AV1_DECODER
if (app_input.decode) {
- if (aom_codec_dec_init(&decoder, get_aom_decoder_by_index(0), NULL, 0)) {
- die("Failed to initialize decoder");
- }
+ if (aom_codec_dec_init(&decoder, get_aom_decoder_by_index(0), NULL, 0))
+ die_codec(&decoder, "Failed to initialize decoder");
}
#endif
@@ -1374,9 +1434,10 @@
aom_codec_control(&codec, AV1E_SET_ENABLE_FILTER_INTRA, 0);
aom_codec_control(&codec, AV1E_SET_INTRA_DEFAULT_TX_ONLY, 1);
- aom_codec_control(&codec, AV1E_SET_TILE_COLUMNS,
- cfg.g_threads ? get_msb(cfg.g_threads) : 0);
- if (cfg.g_threads > 1) aom_codec_control(&codec, AV1E_SET_ROW_MT, 1);
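+ // AV1E_SET_TILE_COLUMNS takes log2 of the tile column count, so
+ // log2(threads) gives roughly one tile column per thread.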
+ if (cfg.g_threads > 1) {
+ aom_codec_control(&codec, AV1E_SET_TILE_COLUMNS,
+ (unsigned int)log2(cfg.g_threads));
+ }
aom_codec_control(&codec, AV1E_SET_TUNE_CONTENT, app_input.tune_content);
if (app_input.tune_content == AOM_CONTENT_SCREEN) {
@@ -1417,17 +1478,19 @@
max_intra_size_pct);
}
- for (unsigned int lx = 0; lx < ts_number_layers * ss_number_layers; lx++) {
+ for (int lx = 0; lx < ts_number_layers * ss_number_layers; lx++) {
cx_time_layer[lx] = 0;
frame_cnt_layer[lx] = 0;
}
frame_avail = 1;
+ struct psnr_stats psnr_stream;
+ memset(&psnr_stream, 0, sizeof(psnr_stream));
while (frame_avail || got_data) {
struct aom_usec_timer timer;
frame_avail = read_frame(&(app_input.input_ctx), &raw);
// Loop over spatial layers.
- for (unsigned int slx = 0; slx < ss_number_layers; slx++) {
+ for (int slx = 0; slx < ss_number_layers; slx++) {
aom_codec_iter_t iter = NULL;
const aom_codec_cx_pkt_t *pkt;
int layer = 0;
@@ -1448,6 +1511,24 @@
aom_codec_control(&codec, AV1E_SET_SVC_REF_FRAME_COMP_PRED,
&ref_frame_comp_pred);
}
+ // Set the speed per layer.
+ if (test_speed_per_layer) {
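+ // Use a slower speed on the lower layers and a faster speed on the
+ // upper ones, e.g. SL0/TL0 gets speed 6 and SL2/TL2 gets speed 10.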
+ int speed_per_layer = 10;
+ if (layer_id.spatial_layer_id == 0) {
+ if (layer_id.temporal_layer_id == 0) speed_per_layer = 6;
+ if (layer_id.temporal_layer_id == 1) speed_per_layer = 7;
+ if (layer_id.temporal_layer_id == 2) speed_per_layer = 8;
+ } else if (layer_id.spatial_layer_id == 1) {
+ if (layer_id.temporal_layer_id == 0) speed_per_layer = 7;
+ if (layer_id.temporal_layer_id == 1) speed_per_layer = 8;
+ if (layer_id.temporal_layer_id == 2) speed_per_layer = 9;
+ } else if (layer_id.spatial_layer_id == 2) {
+ if (layer_id.temporal_layer_id == 0) speed_per_layer = 8;
+ if (layer_id.temporal_layer_id == 1) speed_per_layer = 9;
+ if (layer_id.temporal_layer_id == 2) speed_per_layer = 10;
+ }
+ aom_codec_control(&codec, AOME_SET_CPUUSED, speed_per_layer);
+ }
} else {
// Only up to 3 temporal layers supported in fixed mode.
// Only need to set spatial and temporal layer_id: reference
@@ -1465,11 +1546,16 @@
aom_codec_control(&codec, AV1E_SET_SVC_LAYER_ID, &layer_id);
}
- if (set_err_resil_frame) {
+ if (set_err_resil_frame && cfg.g_error_resilient == 0) {
// Set error_resilient per frame: off/0 for base layer and
// on/1 for enhancement layer frames.
- int err_resil_mode =
- (layer_id.spatial_layer_id > 0 || layer_id.temporal_layer_id > 0);
+ // Note that this can only be done on the fly/per-frame/layer
+ // if the config error_resilience is off/0. See the logic for updating
+ // in set_encoder_config():
+ // tool_cfg->error_resilient_mode =
+ // cfg->g_error_resilient | extra_cfg->error_resilient_mode;
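+ // With cfg.g_error_resilient == 1 that OR would force error
+ // resilience on every frame, making the per-frame control a no-op.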
+ const int err_resil_mode =
+ layer_id.spatial_layer_id > 0 || layer_id.temporal_layer_id > 0;
aom_codec_control(&codec, AV1E_SET_ERROR_RESILIENT_MODE,
err_resil_mode);
}
@@ -1518,6 +1604,23 @@
}
}
+ // Change target_bitrate every other frame.
+ if (test_changing_bitrate && frame_cnt % 2 == 0) {
+ if (frame_cnt < 500)
+ cfg.rc_target_bitrate += 10;
+ else
+ cfg.rc_target_bitrate -= 10;
+ // Do a big increase and decrease.
+ if (frame_cnt == 100) cfg.rc_target_bitrate <<= 1;
+ if (frame_cnt == 600) cfg.rc_target_bitrate >>= 1;
+ if (cfg.rc_target_bitrate < 100) cfg.rc_target_bitrate = 100;
+ // Either call change_config (aom_codec_enc_config_set()), or bypass
+ // it with the new control below.
+ // res = aom_codec_enc_config_set(&codec, &cfg);
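+ // AV1E_SET_BITRATE_ONE_PASS_CBR updates the target bitrate directly
+ // and is intended for one-pass CBR rate control only.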
+ if (aom_codec_control(&codec, AV1E_SET_BITRATE_ONE_PASS_CBR,
+ cfg.rc_target_bitrate))
+ die_codec(&codec, "Failed to SET_BITRATE_ONE_PASS_CBR");
+ }
+
// Do the layer encode.
aom_usec_timer_start(&timer);
if (aom_codec_encode(&codec, frame_avail ? &raw : NULL, pts, 1, flags))
@@ -1529,38 +1632,42 @@
got_data = 0;
while ((pkt = aom_codec_get_cx_data(&codec, &iter))) {
- got_data = 1;
switch (pkt->kind) {
case AOM_CODEC_CX_FRAME_PKT:
- for (unsigned int sl = layer_id.spatial_layer_id;
- sl < ss_number_layers; ++sl) {
- for (unsigned tl = layer_id.temporal_layer_id;
- tl < ts_number_layers; ++tl) {
- unsigned int j = sl * ts_number_layers + tl;
+ for (int sl = layer_id.spatial_layer_id; sl < ss_number_layers;
+ ++sl) {
+ for (int tl = layer_id.temporal_layer_id; tl < ts_number_layers;
+ ++tl) {
+ int j = sl * ts_number_layers + tl;
if (app_input.output_obu) {
fwrite(pkt->data.frame.buf, 1, pkt->data.frame.sz,
obu_files[j]);
} else {
- aom_video_writer_write_frame(outfile[j], pkt->data.frame.buf,
- pkt->data.frame.sz, pts);
+ aom_video_writer_write_frame(
+ outfile[j],
+ reinterpret_cast<const uint8_t *>(pkt->data.frame.buf),
+ pkt->data.frame.sz, pts);
}
- if (sl == (unsigned int)layer_id.spatial_layer_id)
+ if (sl == layer_id.spatial_layer_id)
rc.layer_encoding_bitrate[j] += 8.0 * pkt->data.frame.sz;
}
}
+ got_data = 1;
// Write everything into the top layer.
if (app_input.output_obu) {
fwrite(pkt->data.frame.buf, 1, pkt->data.frame.sz,
total_layer_obu_file);
} else {
- aom_video_writer_write_frame(total_layer_file,
- pkt->data.frame.buf,
- pkt->data.frame.sz, pts);
+ aom_video_writer_write_frame(
+ total_layer_file,
+ reinterpret_cast<const uint8_t *>(pkt->data.frame.buf),
+ pkt->data.frame.sz, pts);
}
// Keep count of rate control stats per layer (for non-key).
if (!(pkt->data.frame.flags & AOM_FRAME_IS_KEY)) {
- unsigned int j = layer_id.spatial_layer_id * ts_number_layers +
- layer_id.temporal_layer_id;
+ int j = layer_id.spatial_layer_id * ts_number_layers +
+ layer_id.temporal_layer_id;
+ assert(j >= 0);
rc.layer_avg_frame_size[j] += 8.0 * pkt->data.frame.sz;
rc.layer_avg_rate_mismatch[j] +=
fabs(8.0 * pkt->data.frame.sz - rc.layer_pfb[j]) /
@@ -1601,24 +1708,43 @@
#if CONFIG_AV1_DECODER
if (app_input.decode) {
- if (aom_codec_decode(&decoder, pkt->data.frame.buf,
- (unsigned int)pkt->data.frame.sz, NULL))
- die_codec(&decoder, "Failed to decode frame.");
+ if (aom_codec_decode(
+ &decoder,
+ reinterpret_cast<const uint8_t *>(pkt->data.frame.buf),
+ pkt->data.frame.sz, NULL))
+ die_codec(&decoder, "Failed to decode frame");
}
#endif
break;
+ case AOM_CODEC_PSNR_PKT:
+ if (app_input.show_psnr) {
+ psnr_stream.psnr_sse_total[0] += pkt->data.psnr.sse[0];
+ psnr_stream.psnr_samples_total[0] += pkt->data.psnr.samples[0];
+ for (int plane = 0; plane < 4; plane++) {
+ psnr_stream.psnr_totals[0][plane] += pkt->data.psnr.psnr[plane];
+ }
+ psnr_stream.psnr_count[0]++;
+ }
+ break;
default: break;
}
}
#if CONFIG_AV1_DECODER
- if (app_input.decode) {
+ if (got_data && app_input.decode) {
// Don't look for a mismatch on the top spatial and top temporal layers,
// as they are non-reference frames.
if ((ss_number_layers > 1 || ts_number_layers > 1) &&
!(layer_id.temporal_layer_id > 0 &&
- layer_id.temporal_layer_id == (int)ts_number_layers - 1)) {
- test_decode(&codec, &decoder, frame_cnt, &mismatch_seen);
+ layer_id.temporal_layer_id == ts_number_layers - 1)) {
+ if (test_decode(&codec, &decoder, frame_cnt)) {
+#if CONFIG_INTERNAL_STATS
+ fprintf(stats_file, "First mismatch occurred in frame %d\n",
+ frame_cnt);
+ fclose(stats_file);
+#endif
+ fatal("Mismatch seen");
+ }
}
}
#endif
@@ -1632,8 +1758,8 @@
ts_number_layers);
printf("\n");
- for (unsigned int slx = 0; slx < ss_number_layers; slx++)
- for (unsigned int tlx = 0; tlx < ts_number_layers; tlx++) {
+ for (int slx = 0; slx < ss_number_layers; slx++)
+ for (int tlx = 0; tlx < ts_number_layers; tlx++) {
int lx = slx * ts_number_layers + tlx;
printf("Per layer encoding time/FPS stats for encoder: %d %d %d %f %f \n",
slx, tlx, frame_cnt_layer[lx],
@@ -1646,14 +1772,21 @@
frame_cnt, 1000 * (float)cx_time / (double)(frame_cnt * 1000000),
1000000 * (double)frame_cnt / (double)cx_time);
- if (aom_codec_destroy(&codec)) die_codec(&codec, "Failed to destroy codec");
+ if (app_input.show_psnr) {
+ show_psnr(&psnr_stream, 255.0);
+ }
+
+ if (aom_codec_destroy(&codec)) die_codec(&codec, "Failed to destroy encoder");
+
+#if CONFIG_AV1_DECODER
+ if (app_input.decode) {
+ if (aom_codec_destroy(&decoder))
+ die_codec(&decoder, "Failed to destroy decoder");
+ }
+#endif
#if CONFIG_INTERNAL_STATS
- if (mismatch_seen) {
- fprintf(stats_file, "First mismatch occurred in frame %d\n", mismatch_seen);
- } else {
- fprintf(stats_file, "No mismatch detected in recon buffers\n");
- }
+ fprintf(stats_file, "No mismatch detected in recon buffers\n");
fclose(stats_file);
#endif
diff --git a/libaom_blocklist.txt b/libaom_blocklist.txt
index 19851d3..06a721b 100644
--- a/libaom_blocklist.txt
+++ b/libaom_blocklist.txt
@@ -7,6 +7,7 @@
# libaom/av1/encoder/ratectrl.c: indirect call to assembly code on x86/x86_64 platform
fun:rc_scene_detection_onepass_rt
# libaom/av1/encoder/var_based_part.c: indirect call to assembly code on x86/x86_64 platform
+fun:evaluate_neighbour_mvs
fun:setup_planes
fun:chroma_check
# libaom/av1/encoder/rd.c: indirect call to assembly code on x86/x86_64 platform
diff --git a/test/acm_random.h b/test/acm_random.h
index bc38ba4..15e8c9c 100644
--- a/test/acm_random.h
+++ b/test/acm_random.h
@@ -59,12 +59,6 @@
return (value >> 19) & 0xfff;
}
- int16_t Rand9Signed() {
- // Use 9 bits: values between 255 (0x0FF) and -256 (0x100).
- const uint32_t value = random_.Generate(512);
- return static_cast<int16_t>(value) - 256;
- }
-
uint8_t Rand8() {
const uint32_t value =
random_.Generate(testing::internal::Random::kMaxRange);
diff --git a/test/aomenc.sh b/test/aomenc.sh
index ed98313..0bb9fba 100755
--- a/test/aomenc.sh
+++ b/test/aomenc.sh
@@ -40,12 +40,6 @@
fi
}
-aomenc_can_encode_av1() {
- if [ "$(av1_encode_available)" = "yes" ]; then
- echo yes
- fi
-}
-
# Utilities that echo aomenc input file parameters.
y4m_input_non_square_par() {
echo ""${Y4M_NOSQ_PAR_INPUT}""
diff --git a/test/av1_convolve_scale_test.cc b/test/av1_convolve_scale_test.cc
index 3f35025..c321de2 100644
--- a/test/av1_convolve_scale_test.cc
+++ b/test/av1_convolve_scale_test.cc
@@ -244,8 +244,10 @@
typedef tuple<int, int> BlockDimension;
struct BaseParams {
- BaseParams(BlockDimension dims, NTaps ntaps_x, NTaps ntaps_y, bool avg)
- : dims(dims), ntaps_x(ntaps_x), ntaps_y(ntaps_y), avg(avg) {}
+ BaseParams(BlockDimension dimensions, NTaps num_taps_x, NTaps num_taps_y,
+ bool average)
+ : dims(dimensions), ntaps_x(num_taps_x), ntaps_y(num_taps_y),
+ avg(average) {}
BlockDimension dims;
NTaps ntaps_x, ntaps_y;
@@ -455,11 +457,20 @@
TEST_P(LowBDConvolveScaleTest, DISABLED_Speed) { SpeedTest(); }
INSTANTIATE_TEST_SUITE_P(
+ C, LowBDConvolveScaleTest,
+ ::testing::Combine(::testing::Values(av1_convolve_2d_scale_c),
+ ::testing::ValuesIn(kBlockDim),
+ ::testing::ValuesIn(kNTaps), ::testing::ValuesIn(kNTaps),
+ ::testing::Bool()));
+
+#if HAVE_SSE4_1
+INSTANTIATE_TEST_SUITE_P(
SSE4_1, LowBDConvolveScaleTest,
::testing::Combine(::testing::Values(av1_convolve_2d_scale_sse4_1),
::testing::ValuesIn(kBlockDim),
::testing::ValuesIn(kNTaps), ::testing::ValuesIn(kNTaps),
::testing::Bool()));
+#endif // HAVE_SSE4_1
#if CONFIG_AV1_HIGHBITDEPTH
typedef void (*HighbdConvolveFunc)(const uint16_t *src, int src_stride,
@@ -522,10 +533,30 @@
TEST_P(HighBDConvolveScaleTest, DISABLED_Speed) { SpeedTest(); }
INSTANTIATE_TEST_SUITE_P(
+ C, HighBDConvolveScaleTest,
+ ::testing::Combine(::testing::Values(av1_highbd_convolve_2d_scale_c),
+ ::testing::ValuesIn(kBlockDim),
+ ::testing::ValuesIn(kNTaps), ::testing::ValuesIn(kNTaps),
+ ::testing::Bool(), ::testing::ValuesIn(kBDs)));
+
+#if HAVE_SSE4_1
+INSTANTIATE_TEST_SUITE_P(
SSE4_1, HighBDConvolveScaleTest,
::testing::Combine(::testing::Values(av1_highbd_convolve_2d_scale_sse4_1),
::testing::ValuesIn(kBlockDim),
::testing::ValuesIn(kNTaps), ::testing::ValuesIn(kNTaps),
::testing::Bool(), ::testing::ValuesIn(kBDs)));
+#endif // HAVE_SSE4_1
+
+#if HAVE_NEON
+INSTANTIATE_TEST_SUITE_P(
+ NEON, HighBDConvolveScaleTest,
+ ::testing::Combine(::testing::Values(av1_highbd_convolve_2d_scale_neon),
+ ::testing::ValuesIn(kBlockDim),
+ ::testing::ValuesIn(kNTaps), ::testing::ValuesIn(kNTaps),
+ ::testing::Bool(), ::testing::ValuesIn(kBDs)));
+
+#endif // HAVE_NEON
+
#endif // CONFIG_AV1_HIGHBITDEPTH
} // namespace
diff --git a/test/av1_convolve_test.cc b/test/av1_convolve_test.cc
index 12edfac..873960d 100644
--- a/test/av1_convolve_test.cc
+++ b/test/av1_convolve_test.cc
@@ -535,6 +535,11 @@
BuildHighbdParams(av1_highbd_convolve_x_sr_avx2));
#endif
+#if HAVE_NEON
+INSTANTIATE_TEST_SUITE_P(NEON, AV1ConvolveXHighbdTest,
+ BuildHighbdParams(av1_highbd_convolve_x_sr_neon));
+#endif
+
#endif // CONFIG_AV1_HIGHBITDEPTH
////////////////////////////////////////////////////////
@@ -735,6 +740,11 @@
BuildHighbdParams(av1_highbd_convolve_y_sr_avx2));
#endif
+#if HAVE_NEON
+INSTANTIATE_TEST_SUITE_P(NEON, AV1ConvolveYHighbdTest,
+ BuildHighbdParams(av1_highbd_convolve_y_sr_neon));
+#endif
+
#endif // CONFIG_AV1_HIGHBITDEPTH
//////////////////////////////////////////////////////////////
@@ -1072,6 +1082,11 @@
BuildHighbdParams(av1_highbd_convolve_2d_sr_avx2));
#endif
+#if HAVE_NEON
+INSTANTIATE_TEST_SUITE_P(NEON, AV1Convolve2DHighbdTest,
+ BuildHighbdParams(av1_highbd_convolve_2d_sr_neon));
+#endif
+
#endif // CONFIG_AV1_HIGHBITDEPTH
//////////////////////////
@@ -1377,6 +1392,12 @@
BuildHighbdLumaParams(av1_highbd_dist_wtd_convolve_x_avx2));
#endif
+#if HAVE_NEON
+INSTANTIATE_TEST_SUITE_P(
+ NEON, AV1ConvolveXHighbdCompoundTest,
+ BuildHighbdLumaParams(av1_highbd_dist_wtd_convolve_x_neon));
+#endif
+
#endif // CONFIG_AV1_HIGHBITDEPTH
////////////////////////////////////////////////
@@ -1451,6 +1472,12 @@
BuildHighbdLumaParams(av1_highbd_dist_wtd_convolve_y_avx2));
#endif
+#if HAVE_NEON
+INSTANTIATE_TEST_SUITE_P(
+ NEON, AV1ConvolveYHighbdCompoundTest,
+ BuildHighbdLumaParams(av1_highbd_dist_wtd_convolve_y_neon));
+#endif
+
#endif // CONFIG_AV1_HIGHBITDEPTH
//////////////////////////////////////////////////////
@@ -1655,6 +1682,12 @@
BuildHighbdLumaParams(av1_highbd_dist_wtd_convolve_2d_copy_avx2));
#endif
+#if HAVE_NEON
+INSTANTIATE_TEST_SUITE_P(
+ NEON, AV1Convolve2DCopyHighbdCompoundTest,
+ BuildHighbdLumaParams(av1_highbd_dist_wtd_convolve_2d_copy_neon));
+#endif
+
#endif // CONFIG_AV1_HIGHBITDEPTH
/////////////////////////////////////////////////
@@ -1846,6 +1879,12 @@
BuildHighbdLumaParams(av1_highbd_dist_wtd_convolve_2d_avx2));
#endif
+#if HAVE_NEON
+INSTANTIATE_TEST_SUITE_P(
+ NEON, AV1Convolve2DHighbdCompoundTest,
+ BuildHighbdLumaParams(av1_highbd_dist_wtd_convolve_2d_neon));
+#endif
+
#endif // CONFIG_AV1_HIGHBITDEPTH
} // namespace
diff --git a/test/av1_fwd_txfm1d_test.cc b/test/av1_fwd_txfm1d_test.cc
index df504ea..885a6db 100644
--- a/test/av1_fwd_txfm1d_test.cc
+++ b/test/av1_fwd_txfm1d_test.cc
@@ -84,7 +84,7 @@
const int count_test_block = 5000;
if (fwd_txfm_func != nullptr) {
- for (int ti = 0; ti < count_test_block; ++ti) {
+ for (int i = 0; i < count_test_block; ++i) {
for (int ni = 0; ni < txfm_size; ++ni) {
input[ni] = rnd.Rand16() % input_base - rnd.Rand16() % input_base;
ref_input[ni] = static_cast<double>(input[ni]);
diff --git a/test/av1_fwd_txfm2d_test.cc b/test/av1_fwd_txfm2d_test.cc
index 525e0cc..7b84eb9 100644
--- a/test/av1_fwd_txfm2d_test.cc
+++ b/test/av1_fwd_txfm2d_test.cc
@@ -27,6 +27,7 @@
using libaom_test::bd;
using libaom_test::compute_avg_abs_error;
using libaom_test::input_base;
+using libaom_test::tx_type_name;
using libaom_test::TYPE_TXFM;
using std::vector;
@@ -99,7 +100,8 @@
actual_max_error = AOMMAX(actual_max_error, this_error);
}
EXPECT_GE(max_error_, actual_max_error)
- << "tx_size = " << tx_size_ << ", tx_type = " << tx_type_;
+ << "tx_w: " << tx_width_ << " tx_h: " << tx_height_
+ << ", tx_type = " << (int)tx_type_;
if (actual_max_error > max_error_) { // exit early.
break;
}
@@ -260,8 +262,8 @@
ACMRandom rnd(ACMRandom::DeterministicSeed());
for (int cnt = 0; cnt < 500; ++cnt) {
if (cnt == 0) {
- for (int r = 0; r < rows; ++r) {
- for (int c = 0; c < cols; ++c) {
+ for (int c = 0; c < cols; ++c) {
+ for (int r = 0; r < rows; ++r) {
input[r * input_stride + c] = (1 << bd) - 1;
}
}
@@ -278,14 +280,15 @@
param.bd = bd;
ref_func(input, ref_output, input_stride, (TX_TYPE)tx_type, bd);
target_func(input, output, input_stride, &param);
- const int check_rows = AOMMIN(32, rows);
- const int check_cols = AOMMIN(32, rows * cols / check_rows);
+ const int check_cols = AOMMIN(32, cols);
+ const int check_rows = AOMMIN(32, rows * cols / check_cols);
for (int r = 0; r < check_rows; ++r) {
for (int c = 0; c < check_cols; ++c) {
ASSERT_EQ(ref_output[r * check_cols + c],
output[r * check_cols + c])
<< "[" << r << "," << c << "] cnt:" << cnt
- << " tx_size: " << tx_size << " tx_type: " << tx_type;
+ << " tx_size: " << cols << "x" << rows
+ << " tx_type: " << tx_type_name[tx_type];
}
}
}
@@ -300,57 +303,55 @@
const int cols = tx_size_wide[tx_size];
const int num_loops = 1000000 / (rows * cols);
- for (int i = 0; i < 2; ++i) {
- const int bd = 8;
- for (int tx_type = 0; tx_type < TX_TYPES; ++tx_type) {
- if (libaom_test::IsTxSizeTypeValid(
- tx_size, static_cast<TX_TYPE>(tx_type)) == false) {
- continue;
+ const int bd = 8;
+ for (int tx_type = 0; tx_type < TX_TYPES; ++tx_type) {
+ if (libaom_test::IsTxSizeTypeValid(
+ tx_size, static_cast<TX_TYPE>(tx_type)) == false) {
+ continue;
+ }
+
+ FwdTxfm2dFunc ref_func = libaom_test::fwd_txfm_func_ls[tx_size];
+ if (ref_func != nullptr) {
+ DECLARE_ALIGNED(32, int16_t, input[64 * 64]) = { 0 };
+ DECLARE_ALIGNED(32, int32_t, output[64 * 64]);
+ DECLARE_ALIGNED(32, int32_t, ref_output[64 * 64]);
+ int input_stride = 64;
+ ACMRandom rnd(ACMRandom::DeterministicSeed());
+
+ for (int r = 0; r < rows; ++r) {
+ for (int c = 0; c < cols; ++c) {
+ input[r * input_stride + c] = rnd.Rand16() % (1 << bd);
+ }
}
- FwdTxfm2dFunc ref_func = libaom_test::fwd_txfm_func_ls[tx_size];
- if (ref_func != nullptr) {
- DECLARE_ALIGNED(32, int16_t, input[64 * 64]) = { 0 };
- DECLARE_ALIGNED(32, int32_t, output[64 * 64]);
- DECLARE_ALIGNED(32, int32_t, ref_output[64 * 64]);
- int input_stride = 64;
- ACMRandom rnd(ACMRandom::DeterministicSeed());
+ param.tx_type = (TX_TYPE)tx_type;
+ param.tx_size = (TX_SIZE)tx_size;
+ param.tx_set_type = EXT_TX_SET_ALL16;
+ param.bd = bd;
- for (int r = 0; r < rows; ++r) {
- for (int c = 0; c < cols; ++c) {
- input[r * input_stride + c] = rnd.Rand16() % (1 << bd);
- }
- }
+ aom_usec_timer ref_timer, test_timer;
- param.tx_type = (TX_TYPE)tx_type;
- param.tx_size = (TX_SIZE)tx_size;
- param.tx_set_type = EXT_TX_SET_ALL16;
- param.bd = bd;
-
- aom_usec_timer ref_timer, test_timer;
-
- aom_usec_timer_start(&ref_timer);
- for (int i = 0; i < num_loops; ++i) {
- ref_func(input, ref_output, input_stride, (TX_TYPE)tx_type, bd);
- }
- aom_usec_timer_mark(&ref_timer);
- const int elapsed_time_c =
- static_cast<int>(aom_usec_timer_elapsed(&ref_timer));
-
- aom_usec_timer_start(&test_timer);
- for (int i = 0; i < num_loops; ++i) {
- target_func(input, output, input_stride, &param);
- }
- aom_usec_timer_mark(&test_timer);
- const int elapsed_time_simd =
- static_cast<int>(aom_usec_timer_elapsed(&test_timer));
-
- printf(
- "txfm_size[%d] \t txfm_type[%d] \t c_time=%d \t simd_time=%d \t "
- "gain=%d \n",
- tx_size, tx_type, elapsed_time_c, elapsed_time_simd,
- (elapsed_time_c / elapsed_time_simd));
+ aom_usec_timer_start(&ref_timer);
+ for (int i = 0; i < num_loops; ++i) {
+ ref_func(input, ref_output, input_stride, (TX_TYPE)tx_type, bd);
}
+ aom_usec_timer_mark(&ref_timer);
+ const int elapsed_time_c =
+ static_cast<int>(aom_usec_timer_elapsed(&ref_timer));
+
+ aom_usec_timer_start(&test_timer);
+ for (int i = 0; i < num_loops; ++i) {
+ target_func(input, output, input_stride, &param);
+ }
+ aom_usec_timer_mark(&test_timer);
+ const int elapsed_time_simd =
+ static_cast<int>(aom_usec_timer_elapsed(&test_timer));
+
+ printf(
+ "txfm_size[%2dx%-2d] \t txfm_type[%d] \t c_time=%d \t"
+ "simd_time=%d \t gain=%d \n",
+ rows, cols, tx_type, elapsed_time_c, elapsed_time_simd,
+ (elapsed_time_c / elapsed_time_simd));
}
}
}
@@ -382,9 +383,9 @@
int stride = stride_list[i];
int array_size = stride * stride;
- for (int i = 0; i < array_size; i++) {
- src_diff[i] = 8;
- coeff[i] = 0;
+ for (int j = 0; j < array_size; j++) {
+ src_diff[j] = 8;
+ coeff[j] = 0;
}
av1_quick_txfm(/*use_hadamard=*/0, tx_size, bd_info, src_diff, stride,
@@ -392,9 +393,9 @@
double input_sse = 0;
double output_sse = 0;
- for (int i = 0; i < array_size; i++) {
- input_sse += pow(src_diff[i], 2);
- output_sse += pow(coeff[i], 2);
+ for (int j = 0; j < array_size; j++) {
+ input_sse += pow(src_diff[j], 2);
+ output_sse += pow(coeff[j], 2);
}
double scale = output_sse / input_sse;
@@ -418,9 +419,9 @@
int stride = stride_list[i];
int array_size = stride * stride;
- for (int i = 0; i < array_size; i++) {
- src_diff[i] = 8;
- coeff[i] = 0;
+ for (int j = 0; j < array_size; j++) {
+ src_diff[j] = 8;
+ coeff[j] = 0;
}
av1_quick_txfm(/*use_hadamard=*/1, tx_size, bd_info, src_diff, stride,
@@ -428,9 +429,9 @@
double input_sse = 0;
double output_sse = 0;
- for (int i = 0; i < array_size; i++) {
- input_sse += pow(src_diff[i], 2);
- output_sse += pow(coeff[i], 2);
+ for (int j = 0; j < array_size; j++) {
+ input_sse += pow(src_diff[j], 2);
+ output_sse += pow(coeff[j], 2);
}
double scale = output_sse / input_sse;
@@ -555,14 +556,15 @@
ref_func(input, ref_output, input_stride, (TX_TYPE)tx_type, bd);
target_func(input, output, input_stride, &param);
- const int check_rows = AOMMIN(32, rows);
- const int check_cols = AOMMIN(32, rows * cols / check_rows);
+ const int check_cols = AOMMIN(32, cols);
+ const int check_rows = AOMMIN(32, rows * cols / check_cols);
for (int r = 0; r < check_rows; ++r) {
for (int c = 0; c < check_cols; ++c) {
- ASSERT_EQ(ref_output[r * check_cols + c],
- output[r * check_cols + c])
+ ASSERT_EQ(ref_output[c * check_rows + r],
+ output[c * check_rows + r])
<< "[" << r << "," << c << "] cnt:" << cnt
- << " tx_size: " << tx_size << " tx_type: " << tx_type;
+ << " tx_size: " << cols << "x" << rows
+ << " tx_type: " << tx_type;
}
}
}
@@ -610,7 +612,7 @@
aom_usec_timer ref_timer, test_timer;
aom_usec_timer_start(&ref_timer);
- for (int i = 0; i < num_loops; ++i) {
+ for (int j = 0; j < num_loops; ++j) {
ref_func(input, ref_output, input_stride, (TX_TYPE)tx_type, bd);
}
aom_usec_timer_mark(&ref_timer);
@@ -618,7 +620,7 @@
static_cast<int>(aom_usec_timer_elapsed(&ref_timer));
aom_usec_timer_start(&test_timer);
- for (int i = 0; i < num_loops; ++i) {
+ for (int j = 0; j < num_loops; ++j) {
target_func(input, output, input_stride, &param);
}
aom_usec_timer_mark(&test_timer);
@@ -626,9 +628,9 @@
static_cast<int>(aom_usec_timer_elapsed(&test_timer));
printf(
- "txfm_size[%d] \t txfm_type[%d] \t c_time=%d \t simd_time=%d \t "
- "gain=%d \n",
- tx_size, tx_type, elapsed_time_c, elapsed_time_simd,
+ "txfm_size[%2dx%-2d] \t txfm_type[%d] \t c_time=%d \t"
+ "simd_time=%d \t gain=%d \n",
+ cols, rows, tx_type, elapsed_time_c, elapsed_time_simd,
(elapsed_time_c / elapsed_time_simd));
}
}
diff --git a/test/av1_highbd_iht_test.cc b/test/av1_highbd_iht_test.cc
index 07c6036..dae53ea 100644
--- a/test/av1_highbd_iht_test.cc
+++ b/test/av1_highbd_iht_test.cc
@@ -298,9 +298,8 @@
for (int r = 0; r < rows; ++r) {
for (int c = 0; c < cols; ++c) {
ASSERT_EQ(ref_output[r * stride + c], output[r * stride + c])
- << "[" << r << "," << c << "] " << cnt
- << " tx_size: " << static_cast<int>(tx_size_)
- << " bit_depth_: " << bit_depth_
+ << "[" << r << "," << c << "] " << cnt << " tx_size: " << cols
+ << "x" << rows << " bit_depth_: " << bit_depth_
<< " tx_type: " << tx_type_name[tx_type_] << " eob " << eob;
}
}
diff --git a/test/av1_horz_only_frame_superres_test.cc b/test/av1_horz_only_frame_superres_test.cc
index f503b63..28ee534 100644
--- a/test/av1_horz_only_frame_superres_test.cc
+++ b/test/av1_horz_only_frame_superres_test.cc
@@ -299,8 +299,13 @@
TEST_P(LowBDConvolveHorizRSTest, Correctness) { CorrectnessTest(); }
TEST_P(LowBDConvolveHorizRSTest, DISABLED_Speed) { SpeedTest(); }
+INSTANTIATE_TEST_SUITE_P(C, LowBDConvolveHorizRSTest,
+ ::testing::Values(av1_convolve_horiz_rs_c));
+
+#if HAVE_SSE4_1
INSTANTIATE_TEST_SUITE_P(SSE4_1, LowBDConvolveHorizRSTest,
::testing::Values(av1_convolve_horiz_rs_sse4_1));
+#endif
#if CONFIG_AV1_HIGHBITDEPTH
typedef void (*HighBDConvolveHorizRsFunc)(const uint16_t *src, int src_stride,
@@ -358,9 +363,24 @@
TEST_P(HighBDConvolveHorizRSTest, DISABLED_Speed) { SpeedTest(); }
INSTANTIATE_TEST_SUITE_P(
+ C, HighBDConvolveHorizRSTest,
+ ::testing::Combine(::testing::Values(av1_highbd_convolve_horiz_rs_c),
+ ::testing::ValuesIn(kBDs)));
+
+#if HAVE_SSE4_1
+INSTANTIATE_TEST_SUITE_P(
SSE4_1, HighBDConvolveHorizRSTest,
::testing::Combine(::testing::Values(av1_highbd_convolve_horiz_rs_sse4_1),
::testing::ValuesIn(kBDs)));
+#endif // HAVE_SSE4_1
+
+#if HAVE_NEON
+INSTANTIATE_TEST_SUITE_P(
+ NEON, HighBDConvolveHorizRSTest,
+ ::testing::Combine(::testing::Values(av1_highbd_convolve_horiz_rs_neon),
+ ::testing::ValuesIn(kBDs)));
+#endif // HAVE_NEON
+
#endif // CONFIG_AV1_HIGHBITDEPTH
} // namespace
diff --git a/test/av1_inv_txfm1d_test.cc b/test/av1_inv_txfm1d_test.cc
index ab8a6f8..e70b22a 100644
--- a/test/av1_inv_txfm1d_test.cc
+++ b/test/av1_inv_txfm1d_test.cc
@@ -73,31 +73,31 @@
const int max_error[] = { 6, 10, 19, 31, 40 };
ASSERT_EQ(NELEMENTS(max_error), TX_SIZES);
ASSERT_EQ(NELEMENTS(inv_txfm_func_ls), TX_SIZES);
- for (int k = 0; k < count_test_block; ++k) {
+ for (int i = 0; i < count_test_block; ++i) {
// choose a random transform to test
const TxSize tx_size = static_cast<TxSize>(rnd.Rand8() % TX_SIZES);
- const int tx_size_pix = txfm_size_ls[tx_size];
+ const int txfm_size = txfm_size_ls[tx_size];
const TxfmFunc inv_txfm_func = inv_txfm_func_ls[tx_size][0];
int32_t input[64];
- random_matrix(input, tx_size_pix, &rnd);
+ random_matrix(input, txfm_size, &rnd);
// 64x64 transform assumes last 32 values are zero.
memset(input + 32, 0, 32 * sizeof(input[0]));
int32_t ref_output[64];
memset(ref_output, 0, sizeof(ref_output));
- reference_idct_1d_int(input, ref_output, tx_size_pix);
+ reference_idct_1d_int(input, ref_output, txfm_size);
int32_t output[64];
memset(output, 0, sizeof(output));
inv_txfm_func(input, output, cos_bit, range_bit);
- for (int i = 0; i < tx_size_pix; ++i) {
- EXPECT_LE(abs(output[i] - ref_output[i]), max_error[tx_size])
- << "tx_size = " << tx_size << ", i = " << i
- << ", output[i] = " << output[i]
- << ", ref_output[i] = " << ref_output[i];
+ for (int ni = 0; ni < txfm_size; ++ni) {
+ EXPECT_LE(abs(output[ni] - ref_output[ni]), max_error[tx_size])
+ << "tx_size = " << tx_size << ", ni = " << ni
+ << ", output[ni] = " << output[ni]
+ << ", ref_output[ni] = " << ref_output[ni];
}
}
}
@@ -129,7 +129,7 @@
if (!fwd_txfm_func) continue;
const int count_test_block = 5000;
- for (int ci = 0; ci < count_test_block; ++ci) {
+ for (int i = 0; i < count_test_block; ++i) {
int32_t input[64];
int32_t output[64];
int32_t round_trip_output[64];
diff --git a/test/av1_inv_txfm2d_test.cc b/test/av1_inv_txfm2d_test.cc
index e13350a..dfa0481 100644
--- a/test/av1_inv_txfm2d_test.cc
+++ b/test/av1_inv_txfm2d_test.cc
@@ -30,6 +30,7 @@
using libaom_test::input_base;
using libaom_test::InvTxfm2dFunc;
using libaom_test::LbdInvTxfm2dFunc;
+using libaom_test::tx_type_name;
using ::testing::Combine;
using ::testing::Range;
@@ -42,25 +43,6 @@
namespace {
-static const char *tx_type_name[] = {
- "DCT_DCT",
- "ADST_DCT",
- "DCT_ADST",
- "ADST_ADST",
- "FLIPADST_DCT",
- "DCT_FLIPADST",
- "FLIPADST_FLIPADST",
- "ADST_FLIPADST",
- "FLIPADST_ADST",
- "IDTX",
- "V_DCT",
- "H_DCT",
- "V_ADST",
- "H_ADST",
- "V_FLIPADST",
- "H_FLIPADST",
-};
-
// AV1InvTxfm2dParam argument list:
// tx_type_, tx_size_, max_error_, max_avg_error_
typedef std::tuple<TxType, TxSize, int, double> AV1InvTxfm2dParam;
@@ -139,7 +121,8 @@
actual_max_error = AOMMAX(actual_max_error, this_error);
}
EXPECT_GE(max_error_, actual_max_error)
- << " tx_w: " << tx_w << " tx_h " << tx_h << " tx_type: " << tx_type_;
+ << " tx_w: " << tx_w << " tx_h " << tx_h
+ << " tx_type: " << tx_type_name[tx_type_];
if (actual_max_error > max_error_) { // exit early.
break;
}
@@ -149,7 +132,8 @@
avg_abs_error /= count;
EXPECT_GE(max_avg_error_, avg_abs_error)
- << " tx_w: " << tx_w << " tx_h " << tx_h << " tx_type: " << tx_type_;
+ << " tx_w: " << tx_w << " tx_h " << tx_h
+ << " tx_type: " << tx_type_name[tx_type_];
}
private:
@@ -345,9 +329,9 @@
printf(" ");
}
ASSERT_EQ(ref_value, output[r * stride + c])
- << "[" << r << "," << c << "] " << cnt
- << " tx_size: " << static_cast<int>(tx_size)
- << " tx_type: " << tx_type_name[tx_type] << " eob " << eob;
+ << "[" << r << "," << c << "] " << cnt << " tx_size: " << cols
+ << "x" << rows << " tx_type: " << tx_type_name[tx_type] << " eob "
+ << eob;
}
}
}
@@ -391,11 +375,12 @@
}
#if HAVE_SSSE3
-#if defined(_MSC_VER) || defined(__SSSE3__)
-#include "av1/common/x86/av1_inv_txfm_ssse3.h"
+extern "C" void av1_lowbd_inv_txfm2d_add_ssse3(const int32_t *input,
+ uint8_t *output, int stride,
+ TxType tx_type, TxSize tx_size,
+ int eob);
INSTANTIATE_TEST_SUITE_P(SSSE3, AV1LbdInvTxfm2d,
::testing::Values(av1_lowbd_inv_txfm2d_add_ssse3));
-#endif // _MSC_VER || __SSSE3__
#endif // HAVE_SSSE3
#if HAVE_AVX2
diff --git a/test/av1_k_means_test.cc b/test/av1_k_means_test.cc
index 221dd10..99f0fba 100644
--- a/test/av1_k_means_test.cc
+++ b/test/av1_k_means_test.cc
@@ -259,7 +259,7 @@
RunSpeedTest(GET_PARAM(0), GET_PARAM(1), 8);
}
-#if HAVE_AVX2 || HAVE_SSE2
+#if HAVE_SSE2 || HAVE_AVX2 || HAVE_NEON
const BLOCK_SIZE kValidBlockSize[] = { BLOCK_8X8, BLOCK_8X16, BLOCK_8X32,
BLOCK_16X8, BLOCK_16X16, BLOCK_16X32,
BLOCK_32X8, BLOCK_32X16, BLOCK_32X32,
@@ -267,6 +267,17 @@
BLOCK_16X64, BLOCK_64X16 };
#endif
+#if HAVE_SSE2
+INSTANTIATE_TEST_SUITE_P(
+ SSE2, AV1KmeansTest1,
+ ::testing::Combine(::testing::Values(&av1_calc_indices_dim1_sse2),
+ ::testing::ValuesIn(kValidBlockSize)));
+INSTANTIATE_TEST_SUITE_P(
+ SSE2, AV1KmeansTest2,
+ ::testing::Combine(::testing::Values(&av1_calc_indices_dim2_sse2),
+ ::testing::ValuesIn(kValidBlockSize)));
+#endif
+
#if HAVE_AVX2
INSTANTIATE_TEST_SUITE_P(
AVX2, AV1KmeansTest1,
@@ -278,15 +289,14 @@
::testing::ValuesIn(kValidBlockSize)));
#endif
-#if HAVE_SSE2
-
+#if HAVE_NEON
INSTANTIATE_TEST_SUITE_P(
- SSE2, AV1KmeansTest1,
- ::testing::Combine(::testing::Values(&av1_calc_indices_dim1_sse2),
+ NEON, AV1KmeansTest1,
+ ::testing::Combine(::testing::Values(&av1_calc_indices_dim1_neon),
::testing::ValuesIn(kValidBlockSize)));
INSTANTIATE_TEST_SUITE_P(
- SSE2, AV1KmeansTest2,
- ::testing::Combine(::testing::Values(&av1_calc_indices_dim2_sse2),
+ NEON, AV1KmeansTest2,
+ ::testing::Combine(::testing::Values(&av1_calc_indices_dim2_neon),
::testing::ValuesIn(kValidBlockSize)));
#endif
diff --git a/test/av1_nn_predict_test.cc b/test/av1_nn_predict_test.cc
index 7a3067d..48504c8 100644
--- a/test/av1_nn_predict_test.cc
+++ b/test/av1_nn_predict_test.cc
@@ -175,7 +175,7 @@
// These are all the neural network shapes observed executing in a few
// runs of the encoder. It also conveniently covers all the kernels
// implemented.
-static const NN_CONFIG shapes[] = {
+static const NN_CONFIG kShapes[] = {
{ 10, 16, 1, { 64 }, { 0 }, { 0 } }, { 12, 1, 1, { 12 }, { 0 }, { 0 } },
{ 12, 1, 1, { 24 }, { 0 }, { 0 } }, { 12, 1, 1, { 32 }, { 0 }, { 0 } },
{ 18, 4, 1, { 24 }, { 0 }, { 0 } }, { 18, 4, 1, { 32 }, { 0 }, { 0 } },
@@ -198,11 +198,12 @@
}
TEST_P(NnPredictTest, RandomValues) {
- RunNnPredictTest_all(shapes, sizeof(shapes) / sizeof(*shapes));
+ RunNnPredictTest_all(kShapes, sizeof(kShapes) / sizeof(kShapes[0]));
}
TEST_P(NnPredictTest, DISABLED_Speed) {
- RunNnPredictSpeedTest_all(shapes, sizeof(shapes) / sizeof(*shapes), 10000000);
+ RunNnPredictSpeedTest_all(kShapes, sizeof(kShapes) / sizeof(kShapes[0]),
+ 10000000);
}
#if HAVE_SSE3 && !CONFIG_EXCLUDE_SIMD_MISMATCH
diff --git a/test/av1_txfm_test.cc b/test/av1_txfm_test.cc
index f741e7c..77c0ec1 100644
--- a/test/av1_txfm_test.cc
+++ b/test/av1_txfm_test.cc
@@ -18,6 +18,25 @@
namespace libaom_test {
+const char *tx_type_name[] = {
+ "DCT_DCT",
+ "ADST_DCT",
+ "DCT_ADST",
+ "ADST_ADST",
+ "FLIPADST_DCT",
+ "DCT_FLIPADST",
+ "FLIPADST_FLIPADST",
+ "ADST_FLIPADST",
+ "FLIPADST_ADST",
+ "IDTX",
+ "V_DCT",
+ "H_DCT",
+ "V_ADST",
+ "H_ADST",
+ "V_FLIPADST",
+ "H_FLIPADST",
+};
+
int get_txfm1d_size(TX_SIZE tx_size) { return tx_size_wide[tx_size]; }
void get_txfm1d_type(TX_TYPE txfm2d_type, TYPE_TXFM *type0, TYPE_TXFM *type1) {
@@ -250,23 +269,25 @@
ASSERT_NE(temp_in, nullptr);
ASSERT_NE(temp_out, nullptr);
ASSERT_NE(out_interm, nullptr);
- const int stride = tx_width;
// Transform columns.
for (int c = 0; c < tx_width; ++c) {
for (int r = 0; r < tx_height; ++r) {
- temp_in[r] = in[r * stride + c];
+ temp_in[r] = in[r * tx_width + c];
}
reference_hybrid_1d(temp_in.get(), temp_out.get(), tx_height, type0);
for (int r = 0; r < tx_height; ++r) {
- out_interm[r * stride + c] = temp_out[r];
+ out_interm[r * tx_width + c] = temp_out[r];
}
}
// Transform rows.
for (int r = 0; r < tx_height; ++r) {
- reference_hybrid_1d(out_interm.get() + r * stride, out + r * stride,
+ reference_hybrid_1d(out_interm.get() + r * tx_width, temp_out.get(),
tx_width, type1);
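+    // Store the row-transform result transposed, so the reference output
+    // is laid out column-major like the functions under test.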
+ for (int c = 0; c < tx_width; ++c) {
+ out[c * tx_height + r] = temp_out[c];
+ }
}
// These transforms use an approximate 2D DCT transform, by only keeping the
@@ -275,48 +296,48 @@
// TODO(urvang): Refactor this code.
if (tx_width == 64 && tx_height == 64) { // tx_size == TX_64X64
// Zero out top-right 32x32 area.
- for (int row = 0; row < 32; ++row) {
- memset(out + row * 64 + 32, 0, 32 * sizeof(*out));
+ for (int col = 0; col < 32; ++col) {
+ memset(out + col * 64 + 32, 0, 32 * sizeof(*out));
}
// Zero out the bottom 64x32 area.
memset(out + 32 * 64, 0, 32 * 64 * sizeof(*out));
// Re-pack non-zero coeffs in the first 32x32 indices.
- for (int row = 1; row < 32; ++row) {
- memcpy(out + row * 32, out + row * 64, 32 * sizeof(*out));
+ for (int col = 1; col < 32; ++col) {
+ memcpy(out + col * 32, out + col * 64, 32 * sizeof(*out));
}
} else if (tx_width == 32 && tx_height == 64) { // tx_size == TX_32X64
+ // Zero out right 32x32 area.
+ for (int col = 0; col < 32; ++col) {
+ memset(out + col * 64 + 32, 0, 32 * sizeof(*out));
+ }
+ // Re-pack non-zero coeffs in the first 32x32 indices.
+ for (int col = 1; col < 32; ++col) {
+ memcpy(out + col * 32, out + col * 64, 32 * sizeof(*out));
+ }
+ } else if (tx_width == 64 && tx_height == 32) { // tx_size == TX_64X32
// Zero out the bottom 32x32 area.
memset(out + 32 * 32, 0, 32 * 32 * sizeof(*out));
// Note: no repacking needed here.
- } else if (tx_width == 64 && tx_height == 32) { // tx_size == TX_64X32
- // Zero out right 32x32 area.
- for (int row = 0; row < 32; ++row) {
- memset(out + row * 64 + 32, 0, 32 * sizeof(*out));
- }
- // Re-pack non-zero coeffs in the first 32x32 indices.
- for (int row = 1; row < 32; ++row) {
- memcpy(out + row * 32, out + row * 64, 32 * sizeof(*out));
- }
} else if (tx_width == 16 && tx_height == 64) { // tx_size == TX_16X64
- // Zero out the bottom 16x32 area.
- memset(out + 16 * 32, 0, 16 * 32 * sizeof(*out));
// Note: no repacking needed here.
- } else if (tx_width == 64 && tx_height == 16) { // tx_size == TX_64X16
// Zero out right 32x16 area.
- for (int row = 0; row < 16; ++row) {
- memset(out + row * 64 + 32, 0, 32 * sizeof(*out));
+ for (int col = 0; col < 16; ++col) {
+ memset(out + col * 64 + 32, 0, 32 * sizeof(*out));
}
// Re-pack non-zero coeffs in the first 32x16 indices.
- for (int row = 1; row < 16; ++row) {
- memcpy(out + row * 32, out + row * 64, 32 * sizeof(*out));
+ for (int col = 1; col < 16; ++col) {
+ memcpy(out + col * 32, out + col * 64, 32 * sizeof(*out));
}
+ } else if (tx_width == 64 && tx_height == 16) { // tx_size == TX_64X16
+ // Zero out the bottom 16x32 area.
+ memset(out + 16 * 32, 0, 16 * 32 * sizeof(*out));
}
// Apply appropriate scale.
const double amplify_factor = get_amplification_factor(tx_type, tx_size);
for (int c = 0; c < tx_width; ++c) {
for (int r = 0; r < tx_height; ++r) {
- out[r * stride + c] *= amplify_factor;
+ out[c * tx_height + r] *= amplify_factor;
}
}
}
diff --git a/test/av1_txfm_test.h b/test/av1_txfm_test.h
index 13a7e8a..d285e3d 100644
--- a/test/av1_txfm_test.h
+++ b/test/av1_txfm_test.h
@@ -29,6 +29,9 @@
#include "av1/common/enums.h"
namespace libaom_test {
+
+extern const char *tx_type_name[];
+
enum {
TYPE_DCT = 0,
TYPE_ADST,
diff --git a/test/avg_test.cc b/test/avg_test.cc
index 4e86f06..8865915 100644
--- a/test/avg_test.cc
+++ b/test/avg_test.cc
@@ -847,7 +847,13 @@
make_tuple(32, 32, 10, 15, 4, &aom_highbd_avg_4x4_neon),
make_tuple(16, 16, 12, 0, 4, &aom_highbd_avg_4x4_neon),
make_tuple(16, 16, 12, 5, 4, &aom_highbd_avg_4x4_neon),
- make_tuple(32, 32, 12, 15, 4, &aom_highbd_avg_4x4_neon)));
+ make_tuple(32, 32, 12, 15, 4, &aom_highbd_avg_4x4_neon),
+ make_tuple(16, 16, 10, 0, 8, &aom_highbd_avg_8x8_neon),
+ make_tuple(16, 16, 10, 5, 8, &aom_highbd_avg_8x8_neon),
+ make_tuple(32, 32, 10, 15, 8, &aom_highbd_avg_8x8_neon),
+ make_tuple(16, 16, 12, 0, 8, &aom_highbd_avg_8x8_neon),
+ make_tuple(16, 16, 12, 5, 8, &aom_highbd_avg_8x8_neon),
+ make_tuple(32, 32, 12, 15, 8, &aom_highbd_avg_8x8_neon)));
#endif // HAVE_NEON
#endif // CONFIG_AV1_HIGHBITDEPTH
diff --git a/test/avif_progressive_test.cc b/test/avif_progressive_test.cc
index 4a00a5a..2a28ca3 100644
--- a/test/avif_progressive_test.cc
+++ b/test/avif_progressive_test.cc
@@ -33,18 +33,22 @@
aom_image_t img;
EXPECT_EQ(&img, aom_img_wrap(&img, AOM_IMG_FMT_I444, kWidth, kHeight, 1,
buffer.data()));
+ img.cp = AOM_CICP_CP_UNSPECIFIED;
+ img.tc = AOM_CICP_TC_UNSPECIFIED;
+ img.mc = AOM_CICP_MC_UNSPECIFIED;
+ img.range = AOM_CR_FULL_RANGE;
aom_codec_iface_t *iface = aom_codec_av1_cx();
aom_codec_enc_cfg_t cfg;
- const unsigned int usage = AOM_USAGE_GOOD_QUALITY;
- EXPECT_EQ(AOM_CODEC_OK, aom_codec_enc_config_default(iface, &cfg, usage));
+ EXPECT_EQ(AOM_CODEC_OK,
+ aom_codec_enc_config_default(iface, &cfg, AOM_USAGE_GOOD_QUALITY));
+ cfg.g_profile = 1;
cfg.g_w = kWidth;
cfg.g_h = kHeight;
- cfg.rc_end_usage = AOM_Q;
- cfg.g_profile = 1;
cfg.g_bit_depth = AOM_BITS_8;
cfg.g_input_bit_depth = 8;
cfg.g_lag_in_frames = 0;
+ cfg.rc_end_usage = AOM_Q;
cfg.rc_min_quantizer = 50;
cfg.rc_max_quantizer = 50;
aom_codec_ctx_t enc;
@@ -64,7 +68,7 @@
EXPECT_EQ(AOM_CODEC_OK, aom_codec_encode(&enc, &img, 0, 1, 0));
aom_codec_iter_t iter = nullptr;
const aom_codec_cx_pkt_t *pkt = aom_codec_get_cx_data(&enc, &iter);
- EXPECT_NE(pkt, nullptr);
+ ASSERT_NE(pkt, nullptr);
EXPECT_EQ(pkt->kind, AOM_CODEC_CX_FRAME_PKT);
// pkt->data.frame.flags is 0x1f0011.
EXPECT_EQ(pkt->data.frame.flags & AOM_FRAME_IS_KEY, AOM_FRAME_IS_KEY);
@@ -85,7 +89,7 @@
EXPECT_EQ(AOM_CODEC_OK, aom_codec_encode(&enc, &img, 0, 1, encode_flags));
iter = nullptr;
pkt = aom_codec_get_cx_data(&enc, &iter);
- EXPECT_NE(pkt, nullptr);
+ ASSERT_NE(pkt, nullptr);
EXPECT_EQ(pkt->kind, AOM_CODEC_CX_FRAME_PKT);
// pkt->data.frame.flags is 0.
EXPECT_EQ(pkt->data.frame.flags & AOM_FRAME_IS_KEY, 0u);
@@ -114,18 +118,22 @@
aom_image_t img;
EXPECT_EQ(&img, aom_img_wrap(&img, AOM_IMG_FMT_I444, kWidth, kHeight, 1,
buffer.data()));
+ img.cp = AOM_CICP_CP_UNSPECIFIED;
+ img.tc = AOM_CICP_TC_UNSPECIFIED;
+ img.mc = AOM_CICP_MC_UNSPECIFIED;
+ img.range = AOM_CR_FULL_RANGE;
aom_codec_iface_t *iface = aom_codec_av1_cx();
aom_codec_enc_cfg_t cfg;
- const unsigned int usage = AOM_USAGE_GOOD_QUALITY;
- EXPECT_EQ(AOM_CODEC_OK, aom_codec_enc_config_default(iface, &cfg, usage));
+ EXPECT_EQ(AOM_CODEC_OK,
+ aom_codec_enc_config_default(iface, &cfg, AOM_USAGE_GOOD_QUALITY));
+ cfg.g_profile = 1;
cfg.g_w = kWidth;
cfg.g_h = kHeight;
- cfg.rc_end_usage = AOM_Q;
- cfg.g_profile = 1;
cfg.g_bit_depth = AOM_BITS_8;
cfg.g_input_bit_depth = 8;
cfg.g_lag_in_frames = 0;
+ cfg.rc_end_usage = AOM_Q;
cfg.rc_min_quantizer = 0;
cfg.rc_max_quantizer = 0;
aom_codec_ctx_t enc;
@@ -149,7 +157,7 @@
EXPECT_EQ(AOM_CODEC_OK, aom_codec_encode(&enc, &img, 0, 1, 0));
aom_codec_iter_t iter = nullptr;
const aom_codec_cx_pkt_t *pkt = aom_codec_get_cx_data(&enc, &iter);
- EXPECT_NE(pkt, nullptr);
+ ASSERT_NE(pkt, nullptr);
EXPECT_EQ(pkt->kind, AOM_CODEC_CX_FRAME_PKT);
// pkt->data.frame.flags is 0x1f0011.
EXPECT_EQ(pkt->data.frame.flags & AOM_FRAME_IS_KEY, AOM_FRAME_IS_KEY);
@@ -165,7 +173,7 @@
EXPECT_EQ(AOM_CODEC_OK, aom_codec_encode(&enc, &img, 0, 1, encode_flags));
iter = nullptr;
pkt = aom_codec_get_cx_data(&enc, &iter);
- EXPECT_NE(pkt, nullptr);
+ ASSERT_NE(pkt, nullptr);
EXPECT_EQ(pkt->kind, AOM_CODEC_CX_FRAME_PKT);
// pkt->data.frame.flags is 0.
EXPECT_EQ(pkt->data.frame.flags & AOM_FRAME_IS_KEY, 0u);
@@ -181,6 +189,9 @@
EXPECT_EQ(AOM_CODEC_OK, aom_codec_destroy(&enc));
}
+// This test reproduces bug aomedia:3382. Certain parameters such as width,
+// height, g_threads, usage, etc. were carefully chosen based on the
+// complicated logic of av1_select_sb_size() to cause an inconsistent sb_size.
TEST(AVIFProgressiveTest, DimensionChangeLargeImageMultiThread) {
constexpr int kWidth = 1920;
constexpr int kHeight = 1080;
@@ -233,7 +244,7 @@
EXPECT_EQ(AOM_CODEC_OK, aom_codec_encode(&enc, &img, 0, 1, 0));
aom_codec_iter_t iter = nullptr;
const aom_codec_cx_pkt_t *pkt = aom_codec_get_cx_data(&enc, &iter);
- EXPECT_NE(pkt, nullptr);
+ ASSERT_NE(pkt, nullptr);
EXPECT_EQ(pkt->kind, AOM_CODEC_CX_FRAME_PKT);
// pkt->data.frame.flags is 0x1f0011.
EXPECT_EQ(pkt->data.frame.flags & AOM_FRAME_IS_KEY, AOM_FRAME_IS_KEY);
@@ -249,7 +260,7 @@
EXPECT_EQ(AOM_CODEC_OK, aom_codec_encode(&enc, &img, 0, 1, encode_flags));
iter = nullptr;
pkt = aom_codec_get_cx_data(&enc, &iter);
- EXPECT_NE(pkt, nullptr);
+ ASSERT_NE(pkt, nullptr);
EXPECT_EQ(pkt->kind, AOM_CODEC_CX_FRAME_PKT);
// pkt->data.frame.flags is 0.
EXPECT_EQ(pkt->data.frame.flags & AOM_FRAME_IS_KEY, 0u);
diff --git a/test/comp_mask_pred_test.cc b/test/comp_mask_pred_test.cc
new file mode 100644
index 0000000..06c3192
--- /dev/null
+++ b/test/comp_mask_pred_test.cc
@@ -0,0 +1,716 @@
+/*
+ * Copyright (c) 2018, Alliance for Open Media. All rights reserved
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#include <cstdlib>
+#include <new>
+#include <tuple>
+
+#include "config/aom_config.h"
+#include "config/aom_dsp_rtcd.h"
+
+#include "aom/aom_codec.h"
+#include "aom/aom_integer.h"
+#include "aom_dsp/variance.h"
+#include "aom_mem/aom_mem.h"
+#include "aom_ports/aom_timer.h"
+#include "aom_ports/mem.h"
+#include "av1/common/reconinter.h"
+#include "av1/encoder/reconinter_enc.h"
+#include "test/acm_random.h"
+#include "test/register_state_check.h"
+#include "test/util.h"
+#include "third_party/googletest/src/googletest/include/gtest/gtest.h"
+
+namespace {
+typedef void (*comp_mask_pred_func)(uint8_t *comp_pred, const uint8_t *pred,
+ int width, int height, const uint8_t *ref,
+ int ref_stride, const uint8_t *mask,
+ int mask_stride, int invert_mask);
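+
+// A comp_mask_pred_func blends the two predictors with a 6-bit (0..64)
+// mask, roughly comp_pred[i] = (mask[i] * a + (64 - mask[i]) * b + 32) >> 6;
+// invert_mask swaps which input gets the mask weight.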
+
+typedef void (*comp_avg_pred_func)(uint8_t *comp_pred, const uint8_t *pred,
+ int width, int height, const uint8_t *ref,
+ int ref_stride);
+
+#if HAVE_SSSE3 || HAVE_SSE2 || HAVE_AVX2 || HAVE_NEON
+const BLOCK_SIZE kCompMaskPredParams[] = {
+ BLOCK_8X8, BLOCK_8X16, BLOCK_8X32, BLOCK_16X8, BLOCK_16X16,
+ BLOCK_16X32, BLOCK_32X8, BLOCK_32X16, BLOCK_32X32
+};
+#endif
+
+class AV1CompMaskPredBase : public ::testing::Test {
+ public:
+ ~AV1CompMaskPredBase();
+ void SetUp();
+
+ void TearDown();
+
+ protected:
+ bool CheckResult(int width, int height) {
+ for (int y = 0; y < height; ++y) {
+ for (int x = 0; x < width; ++x) {
+ const int idx = y * width + x;
+ if (comp_pred1_[idx] != comp_pred2_[idx]) {
+ printf("%dx%d mismatch @%d(%d,%d) ", width, height, idx, y, x);
+ printf("%d != %d ", comp_pred1_[idx], comp_pred2_[idx]);
+ return false;
+ }
+ }
+ }
+ return true;
+ }
+
+ libaom_test::ACMRandom rnd_;
+ uint8_t *comp_pred1_;
+ uint8_t *comp_pred2_;
+ uint8_t *pred_;
+ uint8_t *ref_buffer_;
+ uint8_t *ref_;
+};
+
+AV1CompMaskPredBase::~AV1CompMaskPredBase() {}
+
+void AV1CompMaskPredBase::SetUp() {
+ rnd_.Reset(libaom_test::ACMRandom::DeterministicSeed());
+ av1_init_wedge_masks();
+ comp_pred1_ = (uint8_t *)aom_memalign(16, MAX_SB_SQUARE);
+ ASSERT_NE(comp_pred1_, nullptr);
+ comp_pred2_ = (uint8_t *)aom_memalign(16, MAX_SB_SQUARE);
+ ASSERT_NE(comp_pred2_, nullptr);
+ pred_ = (uint8_t *)aom_memalign(16, MAX_SB_SQUARE);
+ ASSERT_NE(pred_, nullptr);
+ // The biggest block size is MAX_SB_SQUARE (128*128). However, for the
+ // convolution we need to access 3 bytes before and 4 bytes after (for an
+ // 8-tap filter) in both directions, so we need to allocate
+ // (128 + 7) * (128 + 7) = MAX_SB_SQUARE + (14 * MAX_SB_SIZE) + 49 bytes.
+ ref_buffer_ =
+ (uint8_t *)aom_memalign(16, MAX_SB_SQUARE + (14 * MAX_SB_SIZE) + 49);
+ ASSERT_NE(ref_buffer_, nullptr);
+ // Start of the actual block where the convolution will be computed
+ ref_ = ref_buffer_ + (3 * MAX_SB_SIZE + 3);
+ for (int i = 0; i < MAX_SB_SQUARE; ++i) {
+ pred_[i] = rnd_.Rand8();
+ }
+ for (int i = 0; i < MAX_SB_SQUARE + (14 * MAX_SB_SIZE) + 49; ++i) {
+ ref_buffer_[i] = rnd_.Rand8();
+ }
+}
+
+void AV1CompMaskPredBase::TearDown() {
+ aom_free(comp_pred1_);
+ aom_free(comp_pred2_);
+ aom_free(pred_);
+ aom_free(ref_buffer_);
+}
+
+typedef std::tuple<comp_mask_pred_func, BLOCK_SIZE> CompMaskPredParam;
+
+class AV1CompMaskPredTest
+ : public AV1CompMaskPredBase,
+ public ::testing::WithParamInterface<CompMaskPredParam> {
+ protected:
+ void RunCheckOutput(comp_mask_pred_func test_impl, BLOCK_SIZE bsize, int inv);
+ void RunSpeedTest(comp_mask_pred_func test_impl, BLOCK_SIZE bsize);
+};
+
+void AV1CompMaskPredTest::RunCheckOutput(comp_mask_pred_func test_impl,
+ BLOCK_SIZE bsize, int inv) {
+ const int w = block_size_wide[bsize];
+ const int h = block_size_high[bsize];
+ const int wedge_types = get_wedge_types_lookup(bsize);
+ for (int wedge_index = 0; wedge_index < wedge_types; ++wedge_index) {
+ const uint8_t *mask = av1_get_contiguous_soft_mask(wedge_index, 1, bsize);
+
+ aom_comp_mask_pred_c(comp_pred1_, pred_, w, h, ref_, MAX_SB_SIZE, mask, w,
+ inv);
+ test_impl(comp_pred2_, pred_, w, h, ref_, MAX_SB_SIZE, mask, w, inv);
+
+ ASSERT_EQ(CheckResult(w, h), true)
+ << " wedge " << wedge_index << " inv " << inv;
+ }
+}
+
+void AV1CompMaskPredTest::RunSpeedTest(comp_mask_pred_func test_impl,
+ BLOCK_SIZE bsize) {
+ const int w = block_size_wide[bsize];
+ const int h = block_size_high[bsize];
+ const int wedge_types = get_wedge_types_lookup(bsize);
+ int wedge_index = wedge_types / 2;
+ const uint8_t *mask = av1_get_contiguous_soft_mask(wedge_index, 1, bsize);
+ const int num_loops = 1000000000 / (w + h);
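+ // Scale the iteration count inversely with block size so every block
+ // size runs for a roughly comparable amount of time.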
+
+ comp_mask_pred_func funcs[2] = { aom_comp_mask_pred_c, test_impl };
+ double elapsed_time[2] = { 0 };
+ for (int i = 0; i < 2; ++i) {
+ aom_usec_timer timer;
+ aom_usec_timer_start(&timer);
+ comp_mask_pred_func func = funcs[i];
+ for (int j = 0; j < num_loops; ++j) {
+ func(comp_pred1_, pred_, w, h, ref_, MAX_SB_SIZE, mask, w, 0);
+ }
+ aom_usec_timer_mark(&timer);
+ double time = static_cast<double>(aom_usec_timer_elapsed(&timer));
+ elapsed_time[i] = 1000.0 * time / num_loops;
+ }
+ printf("compMask %3dx%-3d: %7.2f/%7.2fns", w, h, elapsed_time[0],
+ elapsed_time[1]);
+ printf("(%3.2f)\n", elapsed_time[0] / elapsed_time[1]);
+}
+
+GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(AV1CompMaskPredTest);
+
+TEST_P(AV1CompMaskPredTest, CheckOutput) {
+ // inv = 0, 1
+ RunCheckOutput(GET_PARAM(0), GET_PARAM(1), 0);
+ RunCheckOutput(GET_PARAM(0), GET_PARAM(1), 1);
+}
+
+TEST_P(AV1CompMaskPredTest, DISABLED_Speed) {
+ RunSpeedTest(GET_PARAM(0), GET_PARAM(1));
+}
+
+#if HAVE_SSSE3
+INSTANTIATE_TEST_SUITE_P(
+ SSSE3, AV1CompMaskPredTest,
+ ::testing::Combine(::testing::Values(&aom_comp_mask_pred_ssse3),
+ ::testing::ValuesIn(kCompMaskPredParams)));
+#endif
+
+#if HAVE_AVX2
+INSTANTIATE_TEST_SUITE_P(
+ AVX2, AV1CompMaskPredTest,
+ ::testing::Combine(::testing::Values(&aom_comp_mask_pred_avx2),
+ ::testing::ValuesIn(kCompMaskPredParams)));
+#endif
+
+#if HAVE_NEON
+INSTANTIATE_TEST_SUITE_P(
+ NEON, AV1CompMaskPredTest,
+ ::testing::Combine(::testing::Values(&aom_comp_mask_pred_neon),
+ ::testing::ValuesIn(kCompMaskPredParams)));
+#endif
+
+#if HAVE_SSSE3 || HAVE_SSE2 || HAVE_AVX2 || HAVE_NEON
+const BLOCK_SIZE kValidBlockSize[] = {
+ BLOCK_4X4, BLOCK_8X8, BLOCK_8X16, BLOCK_8X32, BLOCK_16X8,
+ BLOCK_16X16, BLOCK_16X32, BLOCK_32X8, BLOCK_32X16, BLOCK_32X32,
+ BLOCK_32X64, BLOCK_64X32, BLOCK_64X64, BLOCK_64X128, BLOCK_128X64,
+ BLOCK_128X128, BLOCK_16X64, BLOCK_64X16
+};
+#endif
+
+typedef void (*upsampled_pred_func)(MACROBLOCKD *xd, const AV1_COMMON *const cm,
+ int mi_row, int mi_col, const MV *const mv,
+ uint8_t *comp_pred, int width, int height,
+ int subpel_x_q3, int subpel_y_q3,
+ const uint8_t *ref, int ref_stride,
+ int subpel_search);
+
+typedef std::tuple<upsampled_pred_func, BLOCK_SIZE> UpsampledPredParam;
+
+class AV1UpsampledPredTest
+ : public AV1CompMaskPredBase,
+ public ::testing::WithParamInterface<UpsampledPredParam> {
+ protected:
+ void RunCheckOutput(upsampled_pred_func test_impl, BLOCK_SIZE bsize);
+ void RunSpeedTest(upsampled_pred_func test_impl, BLOCK_SIZE bsize,
+ int havSub);
+};
+
+void AV1UpsampledPredTest::RunCheckOutput(upsampled_pred_func test_impl,
+ BLOCK_SIZE bsize) {
+ const int w = block_size_wide[bsize];
+ const int h = block_size_high[bsize];
+ for (int subpel_search = USE_4_TAPS; subpel_search <= USE_8_TAPS;
+ ++subpel_search) {
+ // loop through subx and suby
+ for (int sub = 0; sub < 8 * 8; ++sub) {
+ int subx = sub & 0x7;
+ int suby = (sub >> 3);
+
+ aom_upsampled_pred_c(nullptr, nullptr, 0, 0, nullptr, comp_pred1_, w, h,
+ subx, suby, ref_, MAX_SB_SIZE, subpel_search);
+
+ test_impl(nullptr, nullptr, 0, 0, nullptr, comp_pred2_, w, h, subx, suby,
+ ref_, MAX_SB_SIZE, subpel_search);
+ ASSERT_EQ(CheckResult(w, h), true)
+ << "sub (" << subx << "," << suby << ")";
+ }
+ }
+}
+
+void AV1UpsampledPredTest::RunSpeedTest(upsampled_pred_func test_impl,
+ BLOCK_SIZE bsize, int havSub) {
+ const int w = block_size_wide[bsize];
+ const int h = block_size_high[bsize];
+ const int subx = havSub ? 3 : 0;
+ const int suby = havSub ? 4 : 0;
+
+ const int num_loops = 1000000000 / (w + h);
+ upsampled_pred_func funcs[2] = { aom_upsampled_pred_c, test_impl };
+ double elapsed_time[2] = { 0 };
+ int subpel_search = USE_8_TAPS; // set to USE_4_TAPS to test 4-tap filter.
+ for (int i = 0; i < 2; ++i) {
+ aom_usec_timer timer;
+ aom_usec_timer_start(&timer);
+ upsampled_pred_func func = funcs[i];
+ for (int j = 0; j < num_loops; ++j) {
+ func(nullptr, nullptr, 0, 0, nullptr, comp_pred1_, w, h, subx, suby, ref_,
+ MAX_SB_SIZE, subpel_search);
+ }
+ aom_usec_timer_mark(&timer);
+ double time = static_cast<double>(aom_usec_timer_elapsed(&timer));
+ elapsed_time[i] = 1000.0 * time / num_loops;
+ }
+ printf("UpsampledPred[%d] %3dx%-3d:%7.2f/%7.2fns", havSub, w, h,
+ elapsed_time[0], elapsed_time[1]);
+ printf("(%3.2f)\n", elapsed_time[0] / elapsed_time[1]);
+}
+
+GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(AV1UpsampledPredTest);
+
+TEST_P(AV1UpsampledPredTest, CheckOutput) {
+ RunCheckOutput(GET_PARAM(0), GET_PARAM(1));
+}
+
+TEST_P(AV1UpsampledPredTest, DISABLED_Speed) {
+ RunSpeedTest(GET_PARAM(0), GET_PARAM(1), 1);
+}
+
+#if HAVE_SSE2
+INSTANTIATE_TEST_SUITE_P(
+ SSE2, AV1UpsampledPredTest,
+ ::testing::Combine(::testing::Values(&aom_upsampled_pred_sse2),
+ ::testing::ValuesIn(kValidBlockSize)));
+#endif
+
+#if HAVE_NEON
+INSTANTIATE_TEST_SUITE_P(
+ NEON, AV1UpsampledPredTest,
+ ::testing::Combine(::testing::Values(&aom_upsampled_pred_neon),
+ ::testing::ValuesIn(kValidBlockSize)));
+#endif
+
+typedef std::tuple<comp_avg_pred_func, BLOCK_SIZE> CompAvgPredParam;
+
+class AV1CompAvgPredTest : public ::testing::TestWithParam<CompAvgPredParam> {
+ public:
+ ~AV1CompAvgPredTest();
+ void SetUp();
+
+ void TearDown();
+
+ protected:
+ void RunCheckOutput(comp_avg_pred_func test_impl, BLOCK_SIZE bsize);
+ void RunSpeedTest(comp_avg_pred_func test_impl, BLOCK_SIZE bsize);
+ bool CheckResult(int width, int height) {
+ for (int y = 0; y < height; ++y) {
+ for (int x = 0; x < width; ++x) {
+ const int idx = y * width + x;
+ if (comp_pred1_[idx] != comp_pred2_[idx]) {
+ printf("%dx%d mismatch @%d(%d,%d) ", width, height, idx, x, y);
+ printf("%d != %d ", comp_pred1_[idx], comp_pred2_[idx]);
+ return false;
+ }
+ }
+ }
+ return true;
+ }
+
+ libaom_test::ACMRandom rnd_;
+ uint8_t *comp_pred1_;
+ uint8_t *comp_pred2_;
+ uint8_t *pred_;
+ uint8_t *ref_;
+};
+GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(AV1CompAvgPredTest);
+
+AV1CompAvgPredTest::~AV1CompAvgPredTest() {}
+
+void AV1CompAvgPredTest::SetUp() {
+ rnd_.Reset(libaom_test::ACMRandom::DeterministicSeed());
+
+ comp_pred1_ = (uint8_t *)aom_memalign(16, MAX_SB_SQUARE);
+ ASSERT_NE(comp_pred1_, nullptr);
+ comp_pred2_ = (uint8_t *)aom_memalign(16, MAX_SB_SQUARE);
+ ASSERT_NE(comp_pred2_, nullptr);
+ pred_ = (uint8_t *)aom_memalign(16, MAX_SB_SQUARE);
+ ASSERT_NE(pred_, nullptr);
+ ref_ = (uint8_t *)aom_memalign(16, MAX_SB_SQUARE);
+ ASSERT_NE(ref_, nullptr);
+ for (int i = 0; i < MAX_SB_SQUARE; ++i) {
+ pred_[i] = rnd_.Rand8();
+ }
+ for (int i = 0; i < MAX_SB_SQUARE; ++i) {
+ ref_[i] = rnd_.Rand8();
+ }
+}
+
+void AV1CompAvgPredTest::TearDown() {
+ aom_free(comp_pred1_);
+ aom_free(comp_pred2_);
+ aom_free(pred_);
+ aom_free(ref_);
+}
+
+void AV1CompAvgPredTest::RunCheckOutput(comp_avg_pred_func test_impl,
+ BLOCK_SIZE bsize) {
+ const int w = block_size_wide[bsize];
+ const int h = block_size_high[bsize];
+ aom_comp_avg_pred_c(comp_pred1_, pred_, w, h, ref_, MAX_SB_SIZE);
+ test_impl(comp_pred2_, pred_, w, h, ref_, MAX_SB_SIZE);
+
+ ASSERT_EQ(CheckResult(w, h), true);
+}
+
+void AV1CompAvgPredTest::RunSpeedTest(comp_avg_pred_func test_impl,
+ BLOCK_SIZE bsize) {
+ const int w = block_size_wide[bsize];
+ const int h = block_size_high[bsize];
+ const int num_loops = 1000000000 / (w + h);
+
+ comp_avg_pred_func functions[2] = { aom_comp_avg_pred_c, test_impl };
+ double elapsed_time[2] = { 0.0 };
+ for (int i = 0; i < 2; ++i) {
+ aom_usec_timer timer;
+ aom_usec_timer_start(&timer);
+ comp_avg_pred_func func = functions[i];
+ for (int j = 0; j < num_loops; ++j) {
+ func(comp_pred1_, pred_, w, h, ref_, MAX_SB_SIZE);
+ }
+ aom_usec_timer_mark(&timer);
+ const double time = static_cast<double>(aom_usec_timer_elapsed(&timer));
+    elapsed_time[i] = 1000.0 * time / num_loops;
+  }
+  printf("compAvgPred %3dx%-3d: %7.2f/%7.2fns", w, h, elapsed_time[0],
+         elapsed_time[1]);
+ printf("(%3.2f)\n", elapsed_time[0] / elapsed_time[1]);
+}
+
+TEST_P(AV1CompAvgPredTest, CheckOutput) {
+ RunCheckOutput(GET_PARAM(0), GET_PARAM(1));
+}
+
+TEST_P(AV1CompAvgPredTest, DISABLED_Speed) {
+ RunSpeedTest(GET_PARAM(0), GET_PARAM(1));
+}
+
+#if HAVE_AVX2
+INSTANTIATE_TEST_SUITE_P(
+ AVX2, AV1CompAvgPredTest,
+ ::testing::Combine(::testing::Values(&aom_comp_avg_pred_avx2),
+ ::testing::ValuesIn(kValidBlockSize)));
+#endif
+
+#if HAVE_NEON
+INSTANTIATE_TEST_SUITE_P(
+ NEON, AV1CompAvgPredTest,
+ ::testing::Combine(::testing::Values(&aom_comp_avg_pred_neon),
+ ::testing::ValuesIn(kValidBlockSize)));
+#endif
+
+#if CONFIG_AV1_HIGHBITDEPTH
+class AV1HighbdCompMaskPredTestBase : public ::testing::Test {
+ public:
+ ~AV1HighbdCompMaskPredTestBase();
+ void SetUp();
+
+ void TearDown();
+
+ protected:
+ bool CheckResult(int width, int height) {
+ for (int y = 0; y < height; ++y) {
+ for (int x = 0; x < width; ++x) {
+ const int idx = y * width + x;
+ if (comp_pred1_[idx] != comp_pred2_[idx]) {
+ printf("%dx%d mismatch @%d(%d,%d) ", width, height, idx, y, x);
+ printf("%d != %d ", comp_pred1_[idx], comp_pred2_[idx]);
+ return false;
+ }
+ }
+ }
+ return true;
+ }
+
+ libaom_test::ACMRandom rnd_;
+ uint16_t *comp_pred1_;
+ uint16_t *comp_pred2_;
+ uint16_t *pred_;
+ uint16_t *ref_buffer_;
+ uint16_t *ref_;
+};
+
+AV1HighbdCompMaskPredTestBase::~AV1HighbdCompMaskPredTestBase() {}
+
+void AV1HighbdCompMaskPredTestBase::SetUp() {
+ rnd_.Reset(libaom_test::ACMRandom::DeterministicSeed());
+ av1_init_wedge_masks();
+
+ comp_pred1_ =
+ (uint16_t *)aom_memalign(16, MAX_SB_SQUARE * sizeof(*comp_pred1_));
+ ASSERT_NE(comp_pred1_, nullptr);
+ comp_pred2_ =
+ (uint16_t *)aom_memalign(16, MAX_SB_SQUARE * sizeof(*comp_pred2_));
+ ASSERT_NE(comp_pred2_, nullptr);
+ pred_ = (uint16_t *)aom_memalign(16, MAX_SB_SQUARE * sizeof(*pred_));
+ ASSERT_NE(pred_, nullptr);
+  // The biggest block size is MAX_SB_SQUARE (128*128). However, the
+  // convolution needs to access 3 elements before and 4 elements after (for
+  // an 8-tap filter) in both directions, so we need to allocate
+  // (128 + 7) * (128 + 7) = (MAX_SB_SQUARE + (14 * MAX_SB_SIZE) + 49)
+  // elements, each of sizeof(*ref_buffer_) bytes.
+ ref_buffer_ = (uint16_t *)aom_memalign(
+ 16, (MAX_SB_SQUARE + (14 * MAX_SB_SIZE) + 49) * sizeof(*ref_buffer_));
+ ASSERT_NE(ref_buffer_, nullptr);
+ // Start of the actual block where the convolution will be computed
+ ref_ = ref_buffer_ + (3 * MAX_SB_SIZE + 3);
+}
+
+void AV1HighbdCompMaskPredTestBase::TearDown() {
+ aom_free(comp_pred1_);
+ aom_free(comp_pred2_);
+ aom_free(pred_);
+ aom_free(ref_buffer_);
+}
+
+typedef void (*highbd_comp_mask_pred_func)(uint8_t *comp_pred8,
+ const uint8_t *pred8, int width,
+ int height, const uint8_t *ref8,
+ int ref_stride, const uint8_t *mask,
+ int mask_stride, int invert_mask);
+
+typedef std::tuple<highbd_comp_mask_pred_func, BLOCK_SIZE, int>
+ HighbdCompMaskPredParam;
+
+class AV1HighbdCompMaskPredTest
+ : public AV1HighbdCompMaskPredTestBase,
+ public ::testing::WithParamInterface<HighbdCompMaskPredParam> {
+ public:
+ ~AV1HighbdCompMaskPredTest();
+
+ protected:
+  void RunCheckOutput(highbd_comp_mask_pred_func test_impl, BLOCK_SIZE bsize,
+                      int inv);
+  void RunSpeedTest(highbd_comp_mask_pred_func test_impl, BLOCK_SIZE bsize);
+};
+
+AV1HighbdCompMaskPredTest::~AV1HighbdCompMaskPredTest() {}
+
+void AV1HighbdCompMaskPredTest::RunCheckOutput(
+ highbd_comp_mask_pred_func test_impl, BLOCK_SIZE bsize, int inv) {
+ int bd_ = GET_PARAM(2);
+ const int w = block_size_wide[bsize];
+ const int h = block_size_high[bsize];
+ const int wedge_types = get_wedge_types_lookup(bsize);
+
+ for (int i = 0; i < MAX_SB_SQUARE; ++i) {
+ pred_[i] = rnd_.Rand16() & ((1 << bd_) - 1);
+ }
+ for (int i = 0; i < MAX_SB_SQUARE + (8 * MAX_SB_SIZE); ++i) {
+ ref_buffer_[i] = rnd_.Rand16() & ((1 << bd_) - 1);
+ }
+
+ for (int wedge_index = 0; wedge_index < wedge_types; ++wedge_index) {
+ const uint8_t *mask = av1_get_contiguous_soft_mask(wedge_index, 1, bsize);
+
+ aom_highbd_comp_mask_pred_c(
+ CONVERT_TO_BYTEPTR(comp_pred1_), CONVERT_TO_BYTEPTR(pred_), w, h,
+ CONVERT_TO_BYTEPTR(ref_), MAX_SB_SIZE, mask, w, inv);
+
+ test_impl(CONVERT_TO_BYTEPTR(comp_pred2_), CONVERT_TO_BYTEPTR(pred_), w, h,
+ CONVERT_TO_BYTEPTR(ref_), MAX_SB_SIZE, mask, w, inv);
+
+ ASSERT_EQ(CheckResult(w, h), true)
+ << " wedge " << wedge_index << " inv " << inv;
+ }
+}
+
+void AV1HighbdCompMaskPredTest::RunSpeedTest(
+ highbd_comp_mask_pred_func test_impl, BLOCK_SIZE bsize) {
+ int bd_ = GET_PARAM(2);
+
+ const int w = block_size_wide[bsize];
+ const int h = block_size_high[bsize];
+ const int wedge_types = get_wedge_types_lookup(bsize);
+ int wedge_index = wedge_types / 2;
+
+ for (int i = 0; i < MAX_SB_SQUARE; ++i) {
+ pred_[i] = rnd_.Rand16() & ((1 << bd_) - 1);
+ }
+ for (int i = 0; i < MAX_SB_SQUARE + (8 * MAX_SB_SIZE); ++i) {
+ ref_buffer_[i] = rnd_.Rand16() & ((1 << bd_) - 1);
+ }
+
+ const uint8_t *mask = av1_get_contiguous_soft_mask(wedge_index, 1, bsize);
+ const int num_loops = 1000000000 / (w + h);
+
+ highbd_comp_mask_pred_func funcs[2] = { aom_highbd_comp_mask_pred_c,
+ test_impl };
+ double elapsed_time[2] = { 0 };
+ for (int i = 0; i < 2; ++i) {
+ aom_usec_timer timer;
+ aom_usec_timer_start(&timer);
+ highbd_comp_mask_pred_func func = funcs[i];
+ for (int j = 0; j < num_loops; ++j) {
+ func(CONVERT_TO_BYTEPTR(comp_pred1_), CONVERT_TO_BYTEPTR(pred_), w, h,
+ CONVERT_TO_BYTEPTR(ref_), MAX_SB_SIZE, mask, w, 0);
+ }
+ aom_usec_timer_mark(&timer);
+ double time = static_cast<double>(aom_usec_timer_elapsed(&timer));
+ elapsed_time[i] = 1000.0 * time / num_loops;
+ }
+ printf("compMask %3dx%-3d: %7.2f/%7.2fns", w, h, elapsed_time[0],
+ elapsed_time[1]);
+ printf("(%3.2f)\n", elapsed_time[0] / elapsed_time[1]);
+}
+
+GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(AV1HighbdCompMaskPredTest);
+
+TEST_P(AV1HighbdCompMaskPredTest, CheckOutput) {
+ // inv = 0, 1
+ RunCheckOutput(GET_PARAM(0), GET_PARAM(1), 0);
+ RunCheckOutput(GET_PARAM(0), GET_PARAM(1), 1);
+}
+
+TEST_P(AV1HighbdCompMaskPredTest, DISABLED_Speed) {
+ RunSpeedTest(GET_PARAM(0), GET_PARAM(1));
+}
+
+#if HAVE_AVX2
+INSTANTIATE_TEST_SUITE_P(
+ AVX2, AV1HighbdCompMaskPredTest,
+ ::testing::Combine(::testing::Values(&aom_highbd_comp_mask_pred_avx2),
+ ::testing::ValuesIn(kCompMaskPredParams),
+ ::testing::Range(8, 13, 2)));
+#endif
+
+#if HAVE_SSE2
+INSTANTIATE_TEST_SUITE_P(
+ SSE2, AV1HighbdCompMaskPredTest,
+ ::testing::Combine(::testing::Values(&aom_highbd_comp_mask_pred_sse2),
+ ::testing::ValuesIn(kCompMaskPredParams),
+ ::testing::Range(8, 13, 2)));
+#endif
+
+typedef void (*highbd_upsampled_pred_func)(
+ MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
+ const MV *const mv, uint8_t *comp_pred8, int width, int height,
+ int subpel_x_q3, int subpel_y_q3, const uint8_t *ref8, int ref_stride,
+ int bd, int subpel_search);
+
+typedef std::tuple<highbd_upsampled_pred_func, BLOCK_SIZE, int>
+ HighbdUpsampledPredParam;
+
+class AV1HighbdUpsampledPredTest
+ : public AV1HighbdCompMaskPredTestBase,
+ public ::testing::WithParamInterface<HighbdUpsampledPredParam> {
+ public:
+ ~AV1HighbdUpsampledPredTest();
+
+ protected:
+ void RunCheckOutput(highbd_upsampled_pred_func test_impl, BLOCK_SIZE bsize);
+ void RunSpeedTest(highbd_upsampled_pred_func test_impl, BLOCK_SIZE bsize,
+ int havSub);
+};
+
+AV1HighbdUpsampledPredTest::~AV1HighbdUpsampledPredTest() {}
+
+void AV1HighbdUpsampledPredTest::RunCheckOutput(
+ highbd_upsampled_pred_func test_impl, BLOCK_SIZE bsize) {
+ int bd_ = GET_PARAM(2);
+ const int w = block_size_wide[bsize];
+ const int h = block_size_high[bsize];
+
+ for (int i = 0; i < MAX_SB_SQUARE; ++i) {
+ pred_[i] = rnd_.Rand16() & ((1 << bd_) - 1);
+ }
+ for (int i = 0; i < MAX_SB_SQUARE + (8 * MAX_SB_SIZE); ++i) {
+ ref_buffer_[i] = rnd_.Rand16() & ((1 << bd_) - 1);
+ }
+
+ for (int subpel_search = 1; subpel_search <= 2; ++subpel_search) {
+ // loop through subx and suby
+ for (int sub = 0; sub < 8 * 8; ++sub) {
+ int subx = sub & 0x7;
+ int suby = (sub >> 3);
+
+ aom_highbd_upsampled_pred_c(nullptr, nullptr, 0, 0, nullptr,
+ CONVERT_TO_BYTEPTR(comp_pred1_), w, h, subx,
+ suby, CONVERT_TO_BYTEPTR(ref_), MAX_SB_SIZE,
+ bd_, subpel_search);
+
+ test_impl(nullptr, nullptr, 0, 0, nullptr,
+ CONVERT_TO_BYTEPTR(comp_pred2_), w, h, subx, suby,
+ CONVERT_TO_BYTEPTR(ref_), MAX_SB_SIZE, bd_, subpel_search);
+
+ ASSERT_EQ(CheckResult(w, h), true)
+ << "sub (" << subx << "," << suby << ")";
+ }
+ }
+}
+
+void AV1HighbdUpsampledPredTest::RunSpeedTest(
+ highbd_upsampled_pred_func test_impl, BLOCK_SIZE bsize, int havSub) {
+ int bd_ = GET_PARAM(2);
+ const int w = block_size_wide[bsize];
+ const int h = block_size_high[bsize];
+ const int subx = havSub ? 3 : 0;
+ const int suby = havSub ? 4 : 0;
+
+ for (int i = 0; i < MAX_SB_SQUARE; ++i) {
+ pred_[i] = rnd_.Rand16() & ((1 << bd_) - 1);
+ }
+ for (int i = 0; i < MAX_SB_SQUARE + (8 * MAX_SB_SIZE); ++i) {
+ ref_buffer_[i] = rnd_.Rand16() & ((1 << bd_) - 1);
+ }
+
+ const int num_loops = 1000000000 / (w + h);
+ highbd_upsampled_pred_func funcs[2] = { &aom_highbd_upsampled_pred_c,
+ test_impl };
+ double elapsed_time[2] = { 0 };
+ for (int i = 0; i < 2; ++i) {
+ aom_usec_timer timer;
+ aom_usec_timer_start(&timer);
+ highbd_upsampled_pred_func func = funcs[i];
+ int subpel_search = 2; // set to 1 to test 4-tap filter.
+ for (int j = 0; j < num_loops; ++j) {
+ func(nullptr, nullptr, 0, 0, nullptr, CONVERT_TO_BYTEPTR(comp_pred1_), w,
+ h, subx, suby, CONVERT_TO_BYTEPTR(ref_), MAX_SB_SIZE, bd_,
+ subpel_search);
+ }
+ aom_usec_timer_mark(&timer);
+ double time = static_cast<double>(aom_usec_timer_elapsed(&timer));
+ elapsed_time[i] = 1000.0 * time / num_loops;
+ }
+ printf("CompMaskUp[%d] %3dx%-3d:%7.2f/%7.2fns", havSub, w, h, elapsed_time[0],
+ elapsed_time[1]);
+ printf("(%3.2f)\n", elapsed_time[0] / elapsed_time[1]);
+}
+
+GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(AV1HighbdUpsampledPredTest);
+
+TEST_P(AV1HighbdUpsampledPredTest, CheckOutput) {
+ RunCheckOutput(GET_PARAM(0), GET_PARAM(1));
+}
+
+TEST_P(AV1HighbdUpsampledPredTest, DISABLED_Speed) {
+ RunSpeedTest(GET_PARAM(0), GET_PARAM(1), 1);
+}
+
+#if HAVE_SSE2
+INSTANTIATE_TEST_SUITE_P(
+ SSE2, AV1HighbdUpsampledPredTest,
+ ::testing::Combine(::testing::Values(&aom_highbd_upsampled_pred_sse2),
+ ::testing::ValuesIn(kValidBlockSize),
+ ::testing::Range(8, 13, 2)));
+#endif
+
+#endif // CONFIG_AV1_HIGHBITDEPTH
+} // namespace
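Note on the new comp_avg_pred_test.cc above: every suite follows the same verify-then-time pattern, i.e. run the C reference and the optimized candidate on identical inputs, compare the outputs element by element, then report nanoseconds per call by dividing elapsed time by the loop count. A standalone sketch of that harness, simplified for illustration and not part of the patch (std::chrono stands in for aom_usec_timer, and the rounding average mirrors what aom_comp_avg_pred_c computes):

    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Reference path: average two 8-bit predictions with rounding, the same
    // arithmetic as aom_comp_avg_pred_c.
    static void comp_avg_c(uint8_t *dst, const uint8_t *pred, int w, int h,
                           const uint8_t *ref, int ref_stride) {
      for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) dst[x] = (pred[x] + ref[x] + 1) >> 1;
        dst += w;
        pred += w;
        ref += ref_stride;
      }
    }

    int main() {
      const int w = 32, h = 32, ref_stride = 128;
      std::vector<uint8_t> pred(w * h, 100), ref(ref_stride * h, 31);
      std::vector<uint8_t> out_ref(w * h), out_tst(w * h);

      // Correctness: a real test would call the SIMD candidate for out_tst;
      // the C path stands in here so the sketch is self-contained.
      comp_avg_c(out_ref.data(), pred.data(), w, h, ref.data(), ref_stride);
      comp_avg_c(out_tst.data(), pred.data(), w, h, ref.data(), ref_stride);
      for (int i = 0; i < w * h; ++i) {
        if (out_ref[i] != out_tst[i]) {
          std::printf("mismatch @%d(%d,%d)\n", i, i % w, i / w);
          return 1;
        }
      }

      // Timing, normalized the same way as the tests: total time / num_loops.
      const int num_loops = 1000000 / (w + h);  // far fewer loops than the tests
      const auto t0 = std::chrono::steady_clock::now();
      for (int j = 0; j < num_loops; ++j)
        comp_avg_c(out_ref.data(), pred.data(), w, h, ref.data(), ref_stride);
      const auto t1 = std::chrono::steady_clock::now();
      const double total_ns =
          std::chrono::duration<double, std::nano>(t1 - t0).count();
      std::printf("%dx%d: %.2f ns/call\n", w, h, total_ns / num_loops);
      return 0;
    }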
diff --git a/test/comp_mask_variance_test.cc b/test/comp_mask_variance_test.cc
deleted file mode 100644
index f958c5d..0000000
--- a/test/comp_mask_variance_test.cc
+++ /dev/null
@@ -1,589 +0,0 @@
-/*
- * Copyright (c) 2018, Alliance for Open Media. All rights reserved
- *
- * This source code is subject to the terms of the BSD 2 Clause License and
- * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
- * was not distributed with this source code in the LICENSE file, you can
- * obtain it at www.aomedia.org/license/software. If the Alliance for Open
- * Media Patent License 1.0 was not distributed with this source code in the
- * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
- */
-
-#include <cstdlib>
-#include <new>
-#include <tuple>
-
-#include "config/aom_config.h"
-#include "config/aom_dsp_rtcd.h"
-
-#include "aom/aom_codec.h"
-#include "aom/aom_integer.h"
-#include "aom_dsp/variance.h"
-#include "aom_mem/aom_mem.h"
-#include "aom_ports/aom_timer.h"
-#include "aom_ports/mem.h"
-#include "av1/common/reconinter.h"
-#include "av1/encoder/reconinter_enc.h"
-#include "test/acm_random.h"
-#include "test/register_state_check.h"
-#include "test/util.h"
-#include "third_party/googletest/src/googletest/include/gtest/gtest.h"
-
-namespace AV1CompMaskVariance {
-typedef void (*comp_mask_pred_func)(uint8_t *comp_pred, const uint8_t *pred,
- int width, int height, const uint8_t *ref,
- int ref_stride, const uint8_t *mask,
- int mask_stride, int invert_mask);
-
-#if HAVE_SSSE3 || HAVE_SSE2 || HAVE_AVX2
-const BLOCK_SIZE kValidBlockSize[] = {
- BLOCK_8X8, BLOCK_8X16, BLOCK_8X32, BLOCK_16X8, BLOCK_16X16,
- BLOCK_16X32, BLOCK_32X8, BLOCK_32X16, BLOCK_32X32, BLOCK_32X64,
- BLOCK_64X32, BLOCK_64X64, BLOCK_64X128, BLOCK_128X64, BLOCK_128X128,
- BLOCK_16X64, BLOCK_64X16
-};
-#endif
-typedef std::tuple<comp_mask_pred_func, BLOCK_SIZE> CompMaskPredParam;
-
-class AV1CompMaskVarianceTest
- : public ::testing::TestWithParam<CompMaskPredParam> {
- public:
- ~AV1CompMaskVarianceTest();
- void SetUp();
-
- void TearDown();
-
- protected:
- void RunCheckOutput(comp_mask_pred_func test_impl, BLOCK_SIZE bsize, int inv);
- void RunSpeedTest(comp_mask_pred_func test_impl, BLOCK_SIZE bsize);
- bool CheckResult(int width, int height) {
- for (int y = 0; y < height; ++y) {
- for (int x = 0; x < width; ++x) {
- const int idx = y * width + x;
- if (comp_pred1_[idx] != comp_pred2_[idx]) {
- printf("%dx%d mismatch @%d(%d,%d) ", width, height, idx, y, x);
- printf("%d != %d ", comp_pred1_[idx], comp_pred2_[idx]);
- return false;
- }
- }
- }
- return true;
- }
-
- libaom_test::ACMRandom rnd_;
- uint8_t *comp_pred1_;
- uint8_t *comp_pred2_;
- uint8_t *pred_;
- uint8_t *ref_buffer_;
- uint8_t *ref_;
-};
-GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(AV1CompMaskVarianceTest);
-
-AV1CompMaskVarianceTest::~AV1CompMaskVarianceTest() {}
-
-void AV1CompMaskVarianceTest::SetUp() {
- rnd_.Reset(libaom_test::ACMRandom::DeterministicSeed());
- av1_init_wedge_masks();
- comp_pred1_ = (uint8_t *)aom_memalign(16, MAX_SB_SQUARE);
- ASSERT_NE(comp_pred1_, nullptr);
- comp_pred2_ = (uint8_t *)aom_memalign(16, MAX_SB_SQUARE);
- ASSERT_NE(comp_pred2_, nullptr);
- pred_ = (uint8_t *)aom_memalign(16, MAX_SB_SQUARE);
- ASSERT_NE(pred_, nullptr);
- ref_buffer_ = (uint8_t *)aom_memalign(16, MAX_SB_SQUARE + (8 * MAX_SB_SIZE));
- ASSERT_NE(ref_buffer_, nullptr);
- ref_ = ref_buffer_ + (8 * MAX_SB_SIZE);
- for (int i = 0; i < MAX_SB_SQUARE; ++i) {
- pred_[i] = rnd_.Rand8();
- }
- for (int i = 0; i < MAX_SB_SQUARE + (8 * MAX_SB_SIZE); ++i) {
- ref_buffer_[i] = rnd_.Rand8();
- }
-}
-
-void AV1CompMaskVarianceTest::TearDown() {
- aom_free(comp_pred1_);
- aom_free(comp_pred2_);
- aom_free(pred_);
- aom_free(ref_buffer_);
-}
-
-void AV1CompMaskVarianceTest::RunCheckOutput(comp_mask_pred_func test_impl,
- BLOCK_SIZE bsize, int inv) {
- const int w = block_size_wide[bsize];
- const int h = block_size_high[bsize];
- const int wedge_types = get_wedge_types_lookup(bsize);
- for (int wedge_index = 0; wedge_index < wedge_types; ++wedge_index) {
- const uint8_t *mask = av1_get_contiguous_soft_mask(wedge_index, 1, bsize);
-
- aom_comp_mask_pred_c(comp_pred1_, pred_, w, h, ref_, MAX_SB_SIZE, mask, w,
- inv);
- test_impl(comp_pred2_, pred_, w, h, ref_, MAX_SB_SIZE, mask, w, inv);
-
- ASSERT_EQ(CheckResult(w, h), true)
- << " wedge " << wedge_index << " inv " << inv;
- }
-}
-
-void AV1CompMaskVarianceTest::RunSpeedTest(comp_mask_pred_func test_impl,
- BLOCK_SIZE bsize) {
- const int w = block_size_wide[bsize];
- const int h = block_size_high[bsize];
- const int wedge_types = get_wedge_types_lookup(bsize);
- int wedge_index = wedge_types / 2;
- const uint8_t *mask = av1_get_contiguous_soft_mask(wedge_index, 1, bsize);
- const int num_loops = 1000000000 / (w + h);
-
- comp_mask_pred_func funcs[2] = { aom_comp_mask_pred_c, test_impl };
- double elapsed_time[2] = { 0 };
- for (int i = 0; i < 2; ++i) {
- aom_usec_timer timer;
- aom_usec_timer_start(&timer);
- comp_mask_pred_func func = funcs[i];
- for (int j = 0; j < num_loops; ++j) {
- func(comp_pred1_, pred_, w, h, ref_, MAX_SB_SIZE, mask, w, 0);
- }
- aom_usec_timer_mark(&timer);
- double time = static_cast<double>(aom_usec_timer_elapsed(&timer));
- elapsed_time[i] = 1000.0 * time / num_loops;
- }
- printf("compMask %3dx%-3d: %7.2f/%7.2fns", w, h, elapsed_time[0],
- elapsed_time[1]);
- printf("(%3.2f)\n", elapsed_time[0] / elapsed_time[1]);
-}
-
-TEST_P(AV1CompMaskVarianceTest, CheckOutput) {
- // inv = 0, 1
- RunCheckOutput(GET_PARAM(0), GET_PARAM(1), 0);
- RunCheckOutput(GET_PARAM(0), GET_PARAM(1), 1);
-}
-
-TEST_P(AV1CompMaskVarianceTest, DISABLED_Speed) {
- RunSpeedTest(GET_PARAM(0), GET_PARAM(1));
-}
-
-#if HAVE_SSSE3
-INSTANTIATE_TEST_SUITE_P(
- SSSE3, AV1CompMaskVarianceTest,
- ::testing::Combine(::testing::Values(&aom_comp_mask_pred_ssse3),
- ::testing::ValuesIn(kValidBlockSize)));
-#endif
-
-#if HAVE_AVX2
-INSTANTIATE_TEST_SUITE_P(
- AVX2, AV1CompMaskVarianceTest,
- ::testing::Combine(::testing::Values(&aom_comp_mask_pred_avx2),
- ::testing::ValuesIn(kValidBlockSize)));
-#endif
-
-#ifndef aom_comp_mask_pred
-// can't run this test if aom_comp_mask_pred is defined to aom_comp_mask_pred_c
-class AV1CompMaskUpVarianceTest : public AV1CompMaskVarianceTest {
- public:
- ~AV1CompMaskUpVarianceTest();
-
- protected:
- void RunCheckOutput(comp_mask_pred_func test_impl, BLOCK_SIZE bsize, int inv);
- void RunSpeedTest(comp_mask_pred_func test_impl, BLOCK_SIZE bsize,
- int havSub);
-};
-
-AV1CompMaskUpVarianceTest::~AV1CompMaskUpVarianceTest() {}
-
-void AV1CompMaskUpVarianceTest::RunCheckOutput(comp_mask_pred_func test_impl,
- BLOCK_SIZE bsize, int inv) {
- const int w = block_size_wide[bsize];
- const int h = block_size_high[bsize];
- const int wedge_types = get_wedge_types_lookup(bsize);
- int subpel_search;
- for (subpel_search = USE_4_TAPS; subpel_search <= USE_8_TAPS;
- ++subpel_search) {
- // loop through subx and suby
- for (int sub = 0; sub < 8 * 8; ++sub) {
- int subx = sub & 0x7;
- int suby = (sub >> 3);
- for (int wedge_index = 0; wedge_index < wedge_types; ++wedge_index) {
- const uint8_t *mask =
- av1_get_contiguous_soft_mask(wedge_index, 1, bsize);
-
- // ref
- aom_comp_mask_upsampled_pred_c(
- nullptr, nullptr, 0, 0, nullptr, comp_pred1_, pred_, w, h, subx,
- suby, ref_, MAX_SB_SIZE, mask, w, inv, subpel_search);
-
- aom_comp_mask_pred = test_impl; // test
- aom_comp_mask_upsampled_pred(nullptr, nullptr, 0, 0, nullptr,
- comp_pred2_, pred_, w, h, subx, suby, ref_,
- MAX_SB_SIZE, mask, w, inv, subpel_search);
- ASSERT_EQ(CheckResult(w, h), true)
- << " wedge " << wedge_index << " inv " << inv << "sub (" << subx
- << "," << suby << ")";
- }
- }
- }
-}
-
-void AV1CompMaskUpVarianceTest::RunSpeedTest(comp_mask_pred_func test_impl,
- BLOCK_SIZE bsize, int havSub) {
- const int w = block_size_wide[bsize];
- const int h = block_size_high[bsize];
- const int subx = havSub ? 3 : 0;
- const int suby = havSub ? 4 : 0;
- const int wedge_types = get_wedge_types_lookup(bsize);
- int wedge_index = wedge_types / 2;
- const uint8_t *mask = av1_get_contiguous_soft_mask(wedge_index, 1, bsize);
-
- const int num_loops = 1000000000 / (w + h);
- comp_mask_pred_func funcs[2] = { &aom_comp_mask_pred_c, test_impl };
- double elapsed_time[2] = { 0 };
- int subpel_search = USE_8_TAPS; // set to USE_4_TAPS to test 4-tap filter.
- for (int i = 0; i < 2; ++i) {
- aom_usec_timer timer;
- aom_usec_timer_start(&timer);
- aom_comp_mask_pred = funcs[i];
- for (int j = 0; j < num_loops; ++j) {
- aom_comp_mask_upsampled_pred(nullptr, nullptr, 0, 0, nullptr, comp_pred1_,
- pred_, w, h, subx, suby, ref_, MAX_SB_SIZE,
- mask, w, 0, subpel_search);
- }
- aom_usec_timer_mark(&timer);
- double time = static_cast<double>(aom_usec_timer_elapsed(&timer));
- elapsed_time[i] = 1000.0 * time / num_loops;
- }
- printf("CompMaskUp[%d] %3dx%-3d:%7.2f/%7.2fns", havSub, w, h, elapsed_time[0],
- elapsed_time[1]);
- printf("(%3.2f)\n", elapsed_time[0] / elapsed_time[1]);
-}
-
-TEST_P(AV1CompMaskUpVarianceTest, CheckOutput) {
- // inv mask = 0, 1
- RunCheckOutput(GET_PARAM(0), GET_PARAM(1), 0);
- RunCheckOutput(GET_PARAM(0), GET_PARAM(1), 1);
-}
-
-TEST_P(AV1CompMaskUpVarianceTest, DISABLED_Speed) {
- RunSpeedTest(GET_PARAM(0), GET_PARAM(1), 1);
-}
-
-#if HAVE_SSSE3
-INSTANTIATE_TEST_SUITE_P(
- SSSE3, AV1CompMaskUpVarianceTest,
- ::testing::Combine(::testing::Values(&aom_comp_mask_pred_ssse3),
- ::testing::ValuesIn(kValidBlockSize)));
-#endif
-
-#if HAVE_AVX2
-INSTANTIATE_TEST_SUITE_P(
- AVX2, AV1CompMaskUpVarianceTest,
- ::testing::Combine(::testing::Values(&aom_comp_mask_pred_avx2),
- ::testing::ValuesIn(kValidBlockSize)));
-#endif
-
-#endif // ifndef aom_comp_mask_pred
-
-#if CONFIG_AV1_HIGHBITDEPTH
-typedef void (*highbd_comp_mask_pred_func)(uint8_t *comp_pred8,
- const uint8_t *pred8, int width,
- int height, const uint8_t *ref8,
- int ref_stride, const uint8_t *mask,
- int mask_stride, int invert_mask);
-
-typedef std::tuple<highbd_comp_mask_pred_func, BLOCK_SIZE, int>
- HighbdCompMaskPredParam;
-
-class AV1HighbdCompMaskVarianceTest
- : public ::testing::TestWithParam<HighbdCompMaskPredParam> {
- public:
- ~AV1HighbdCompMaskVarianceTest();
- void SetUp();
-
- void TearDown();
-
- protected:
- void RunCheckOutput(highbd_comp_mask_pred_func test_impl, BLOCK_SIZE bsize,
- int inv);
- void RunSpeedTest(highbd_comp_mask_pred_func test_impl, BLOCK_SIZE bsize);
- bool CheckResult(int width, int height) {
- for (int y = 0; y < height; ++y) {
- for (int x = 0; x < width; ++x) {
- const int idx = y * width + x;
- if (comp_pred1_[idx] != comp_pred2_[idx]) {
- printf("%dx%d mismatch @%d(%d,%d) ", width, height, idx, y, x);
- printf("%d != %d ", comp_pred1_[idx], comp_pred2_[idx]);
- return false;
- }
- }
- }
- return true;
- }
-
- libaom_test::ACMRandom rnd_;
- uint16_t *comp_pred1_;
- uint16_t *comp_pred2_;
- uint16_t *pred_;
- uint16_t *ref_buffer_;
- uint16_t *ref_;
-};
-GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(AV1HighbdCompMaskVarianceTest);
-
-AV1HighbdCompMaskVarianceTest::~AV1HighbdCompMaskVarianceTest() {}
-
-void AV1HighbdCompMaskVarianceTest::SetUp() {
- rnd_.Reset(libaom_test::ACMRandom::DeterministicSeed());
- av1_init_wedge_masks();
-
- comp_pred1_ =
- (uint16_t *)aom_memalign(16, MAX_SB_SQUARE * sizeof(*comp_pred1_));
- ASSERT_NE(comp_pred1_, nullptr);
- comp_pred2_ =
- (uint16_t *)aom_memalign(16, MAX_SB_SQUARE * sizeof(*comp_pred2_));
- ASSERT_NE(comp_pred2_, nullptr);
- pred_ = (uint16_t *)aom_memalign(16, MAX_SB_SQUARE * sizeof(*pred_));
- ASSERT_NE(pred_, nullptr);
- ref_buffer_ = (uint16_t *)aom_memalign(
- 16, (MAX_SB_SQUARE + (8 * MAX_SB_SIZE)) * sizeof(*ref_buffer_));
- ASSERT_NE(ref_buffer_, nullptr);
- ref_ = ref_buffer_ + (8 * MAX_SB_SIZE);
-}
-
-void AV1HighbdCompMaskVarianceTest::TearDown() {
- aom_free(comp_pred1_);
- aom_free(comp_pred2_);
- aom_free(pred_);
- aom_free(ref_buffer_);
-}
-
-void AV1HighbdCompMaskVarianceTest::RunCheckOutput(
- highbd_comp_mask_pred_func test_impl, BLOCK_SIZE bsize, int inv) {
- int bd_ = GET_PARAM(2);
- const int w = block_size_wide[bsize];
- const int h = block_size_high[bsize];
- const int wedge_types = get_wedge_types_lookup(bsize);
-
- for (int i = 0; i < MAX_SB_SQUARE; ++i) {
- pred_[i] = rnd_.Rand16() & ((1 << bd_) - 1);
- }
- for (int i = 0; i < MAX_SB_SQUARE + (8 * MAX_SB_SIZE); ++i) {
- ref_buffer_[i] = rnd_.Rand16() & ((1 << bd_) - 1);
- }
-
- for (int wedge_index = 0; wedge_index < wedge_types; ++wedge_index) {
- const uint8_t *mask = av1_get_contiguous_soft_mask(wedge_index, 1, bsize);
-
- aom_highbd_comp_mask_pred_c(
- CONVERT_TO_BYTEPTR(comp_pred1_), CONVERT_TO_BYTEPTR(pred_), w, h,
- CONVERT_TO_BYTEPTR(ref_), MAX_SB_SIZE, mask, w, inv);
-
- test_impl(CONVERT_TO_BYTEPTR(comp_pred2_), CONVERT_TO_BYTEPTR(pred_), w, h,
- CONVERT_TO_BYTEPTR(ref_), MAX_SB_SIZE, mask, w, inv);
-
- ASSERT_EQ(CheckResult(w, h), true)
- << " wedge " << wedge_index << " inv " << inv;
- }
-}
-
-void AV1HighbdCompMaskVarianceTest::RunSpeedTest(
- highbd_comp_mask_pred_func test_impl, BLOCK_SIZE bsize) {
- int bd_ = GET_PARAM(2);
-
- const int w = block_size_wide[bsize];
- const int h = block_size_high[bsize];
- const int wedge_types = get_wedge_types_lookup(bsize);
- int wedge_index = wedge_types / 2;
-
- for (int i = 0; i < MAX_SB_SQUARE; ++i) {
- pred_[i] = rnd_.Rand16() & ((1 << bd_) - 1);
- }
- for (int i = 0; i < MAX_SB_SQUARE + (8 * MAX_SB_SIZE); ++i) {
- ref_buffer_[i] = rnd_.Rand16() & ((1 << bd_) - 1);
- }
-
- const uint8_t *mask = av1_get_contiguous_soft_mask(wedge_index, 1, bsize);
- const int num_loops = 1000000000 / (w + h);
-
- highbd_comp_mask_pred_func funcs[2] = { aom_highbd_comp_mask_pred_c,
- test_impl };
- double elapsed_time[2] = { 0 };
- for (int i = 0; i < 2; ++i) {
- aom_usec_timer timer;
- aom_usec_timer_start(&timer);
- highbd_comp_mask_pred_func func = funcs[i];
- for (int j = 0; j < num_loops; ++j) {
- func(CONVERT_TO_BYTEPTR(comp_pred1_), CONVERT_TO_BYTEPTR(pred_), w, h,
- CONVERT_TO_BYTEPTR(ref_), MAX_SB_SIZE, mask, w, 0);
- }
- aom_usec_timer_mark(&timer);
- double time = static_cast<double>(aom_usec_timer_elapsed(&timer));
- elapsed_time[i] = 1000.0 * time / num_loops;
- }
- printf("compMask %3dx%-3d: %7.2f/%7.2fns", w, h, elapsed_time[0],
- elapsed_time[1]);
- printf("(%3.2f)\n", elapsed_time[0] / elapsed_time[1]);
-}
-
-TEST_P(AV1HighbdCompMaskVarianceTest, CheckOutput) {
- // inv = 0, 1
- RunCheckOutput(GET_PARAM(0), GET_PARAM(1), 0);
- RunCheckOutput(GET_PARAM(0), GET_PARAM(1), 1);
-}
-
-TEST_P(AV1HighbdCompMaskVarianceTest, DISABLED_Speed) {
- RunSpeedTest(GET_PARAM(0), GET_PARAM(1));
-}
-
-#if HAVE_AVX2
-INSTANTIATE_TEST_SUITE_P(
- AVX2, AV1HighbdCompMaskVarianceTest,
- ::testing::Combine(::testing::Values(&aom_highbd_comp_mask_pred_avx2),
- ::testing::ValuesIn(kValidBlockSize),
- ::testing::Range(8, 13, 2)));
-#endif
-
-#if HAVE_SSE2
-INSTANTIATE_TEST_SUITE_P(
- SSE2, AV1HighbdCompMaskVarianceTest,
- ::testing::Combine(::testing::Values(&aom_highbd_comp_mask_pred_sse2),
- ::testing::ValuesIn(kValidBlockSize),
- ::testing::Range(8, 13, 2)));
-#endif
-
-#ifndef aom_highbd_comp_mask_pred
-// can't run this test if aom_highbd_comp_mask_pred is defined to
-// aom_highbd_comp_mask_pred_c
-class AV1HighbdCompMaskUpVarianceTest : public AV1HighbdCompMaskVarianceTest {
- public:
- ~AV1HighbdCompMaskUpVarianceTest();
-
- protected:
- void RunCheckOutput(highbd_comp_mask_pred_func test_impl, BLOCK_SIZE bsize,
- int inv);
- void RunSpeedTest(highbd_comp_mask_pred_func test_impl, BLOCK_SIZE bsize,
- int havSub);
-};
-
-AV1HighbdCompMaskUpVarianceTest::~AV1HighbdCompMaskUpVarianceTest() {}
-
-void AV1HighbdCompMaskUpVarianceTest::RunCheckOutput(
- highbd_comp_mask_pred_func test_impl, BLOCK_SIZE bsize, int inv) {
- (void)test_impl;
- int bd_ = GET_PARAM(2);
- const int w = block_size_wide[bsize];
- const int h = block_size_high[bsize];
- const int wedge_types = get_wedge_types_lookup(bsize);
-
- for (int i = 0; i < MAX_SB_SQUARE; ++i) {
- pred_[i] = rnd_.Rand16() & ((1 << bd_) - 1);
- }
- for (int i = 0; i < MAX_SB_SQUARE + (8 * MAX_SB_SIZE); ++i) {
- ref_buffer_[i] = rnd_.Rand16() & ((1 << bd_) - 1);
- }
-
- int subpel_search;
- for (subpel_search = 1; subpel_search <= 2; ++subpel_search) {
- // loop through subx and suby
- for (int sub = 0; sub < 8 * 8; ++sub) {
- int subx = sub & 0x7;
- int suby = (sub >> 3);
- for (int wedge_index = 0; wedge_index < wedge_types; ++wedge_index) {
- const uint8_t *mask =
- av1_get_contiguous_soft_mask(wedge_index, 1, bsize);
-
- // ref
- aom_highbd_upsampled_pred_c(nullptr, nullptr, 0, 0, nullptr,
- CONVERT_TO_BYTEPTR(comp_pred1_), w, h, subx,
- suby, CONVERT_TO_BYTEPTR(ref_), MAX_SB_SIZE,
- bd_, subpel_search);
-
- aom_highbd_comp_mask_pred_c(
- CONVERT_TO_BYTEPTR(comp_pred1_), CONVERT_TO_BYTEPTR(pred_), w, h,
- CONVERT_TO_BYTEPTR(comp_pred1_), w, mask, w, inv);
-
- // test
- aom_highbd_upsampled_pred(nullptr, nullptr, 0, 0, nullptr,
- CONVERT_TO_BYTEPTR(comp_pred2_), w, h, subx,
- suby, CONVERT_TO_BYTEPTR(ref_), MAX_SB_SIZE,
- bd_, subpel_search);
-
- aom_highbd_comp_mask_pred(
- CONVERT_TO_BYTEPTR(comp_pred2_), CONVERT_TO_BYTEPTR(pred_), w, h,
- CONVERT_TO_BYTEPTR(comp_pred2_), w, mask, w, inv);
-
- ASSERT_EQ(CheckResult(w, h), true)
- << " wedge " << wedge_index << " inv " << inv << "sub (" << subx
- << "," << suby << ")";
- }
- }
- }
-}
-
-void AV1HighbdCompMaskUpVarianceTest::RunSpeedTest(
- highbd_comp_mask_pred_func test_impl, BLOCK_SIZE bsize, int havSub) {
- int bd_ = GET_PARAM(2);
- const int w = block_size_wide[bsize];
- const int h = block_size_high[bsize];
- const int subx = havSub ? 3 : 0;
- const int suby = havSub ? 4 : 0;
- const int wedge_types = get_wedge_types_lookup(bsize);
- int wedge_index = wedge_types / 2;
- const uint8_t *mask = av1_get_contiguous_soft_mask(wedge_index, 1, bsize);
-
- for (int i = 0; i < MAX_SB_SQUARE; ++i) {
- pred_[i] = rnd_.Rand16() & ((1 << bd_) - 1);
- }
- for (int i = 0; i < MAX_SB_SQUARE + (8 * MAX_SB_SIZE); ++i) {
- ref_buffer_[i] = rnd_.Rand16() & ((1 << bd_) - 1);
- }
-
- const int num_loops = 1000000000 / (w + h);
- highbd_comp_mask_pred_func funcs[2] = { &aom_highbd_comp_mask_pred_c,
- test_impl };
- double elapsed_time[2] = { 0 };
- for (int i = 0; i < 2; ++i) {
- aom_usec_timer timer;
- aom_usec_timer_start(&timer);
- aom_highbd_comp_mask_pred = funcs[i];
- int subpel_search = 2; // set to 1 to test 4-tap filter.
- for (int j = 0; j < num_loops; ++j) {
- aom_highbd_comp_mask_upsampled_pred(
- nullptr, nullptr, 0, 0, nullptr, CONVERT_TO_BYTEPTR(comp_pred1_),
- CONVERT_TO_BYTEPTR(pred_), w, h, subx, suby, CONVERT_TO_BYTEPTR(ref_),
- MAX_SB_SIZE, mask, w, 0, bd_, subpel_search);
- }
- aom_usec_timer_mark(&timer);
- double time = static_cast<double>(aom_usec_timer_elapsed(&timer));
- elapsed_time[i] = 1000.0 * time / num_loops;
- }
- printf("CompMaskUp[%d] %3dx%-3d:%7.2f/%7.2fns", havSub, w, h, elapsed_time[0],
- elapsed_time[1]);
- printf("(%3.2f)\n", elapsed_time[0] / elapsed_time[1]);
-}
-
-TEST_P(AV1HighbdCompMaskUpVarianceTest, CheckOutput) {
- // inv mask = 0, 1
- RunCheckOutput(GET_PARAM(0), GET_PARAM(1), 0);
- RunCheckOutput(GET_PARAM(0), GET_PARAM(1), 1);
-}
-
-TEST_P(AV1HighbdCompMaskUpVarianceTest, DISABLED_Speed) {
- RunSpeedTest(GET_PARAM(0), GET_PARAM(1), 1);
-}
-
-#if HAVE_AVX2
-INSTANTIATE_TEST_SUITE_P(
- AVX2, AV1HighbdCompMaskUpVarianceTest,
- ::testing::Combine(::testing::Values(&aom_highbd_comp_mask_pred_avx2),
- ::testing::ValuesIn(kValidBlockSize),
- ::testing::Range(8, 13, 2)));
-#endif
-
-#if HAVE_SSE2
-INSTANTIATE_TEST_SUITE_P(
- SSE2, AV1HighbdCompMaskUpVarianceTest,
- ::testing::Combine(::testing::Values(&aom_highbd_comp_mask_pred_sse2),
- ::testing::ValuesIn(kValidBlockSize),
- ::testing::Range(8, 13, 2)));
-#endif
-
-#endif // ifndef aom_highbd_comp_mask_pred
-#endif // CONFIG_AV1_HIGHBITDEPTH
-} // namespace AV1CompMaskVariance
diff --git a/test/convolve_test.cc b/test/convolve_test.cc
index d5232ee..8aed171 100644
--- a/test/convolve_test.cc
+++ b/test/convolve_test.cc
@@ -31,6 +31,10 @@
static const unsigned int kMaxDimension = MAX_SB_SIZE;
+static const int16_t kInvalidFilter[8] = {};
+static const int kNumFilterBanks = SWITCHABLE_FILTERS;
+static const int kNumFilters = 16;
+
typedef void (*ConvolveFunc)(const uint8_t *src, ptrdiff_t src_stride,
uint8_t *dst, ptrdiff_t dst_stride,
const int16_t *filter_x, int filter_x_stride,
@@ -265,7 +269,7 @@
output_width, output_height);
}
-class ConvolveTest : public ::testing::TestWithParam<ConvolveParam> {
+class ConvolveTestBase : public ::testing::TestWithParam<ConvolveParam> {
public:
static void SetUpTestSuite() {
// Force input_ to be unaligned, output to be 16 byte aligned.
@@ -462,6 +466,202 @@
}
}
+ void MatchesReferenceSubpixelFilter() {
+ uint8_t *const in = input();
+ uint8_t *const out = output();
+ uint8_t *ref;
+ if (UUT_->use_highbd_ == 0) {
+ ref = ref8_;
+ } else {
+ ref = CONVERT_TO_BYTEPTR(ref16_);
+ }
+ int subpel_search;
+ for (subpel_search = USE_4_TAPS; subpel_search <= USE_8_TAPS;
+ ++subpel_search) {
+ for (int filter_bank = 0; filter_bank < kNumFilterBanks; ++filter_bank) {
+ const InterpFilter filter = (InterpFilter)filter_bank;
+ const InterpKernel *filters =
+ (const InterpKernel *)av1_get_interp_filter_kernel(filter,
+ subpel_search);
+ for (int filter_x = 0; filter_x < kNumFilters; ++filter_x) {
+ for (int filter_y = 0; filter_y < kNumFilters; ++filter_y) {
+ wrapper_filter_block2d_8_c(in, kInputStride, filters[filter_x],
+ filters[filter_y], ref, kOutputStride,
+ Width(), Height());
+
+ if (filter_x && filter_y)
+ continue;
+ else if (filter_y)
+ UUT_->v8_(in, kInputStride, out, kOutputStride, kInvalidFilter,
+ 16, filters[filter_y], 16, Width(), Height());
+ else if (filter_x)
+ API_REGISTER_STATE_CHECK(UUT_->h8_(
+ in, kInputStride, out, kOutputStride, filters[filter_x], 16,
+ kInvalidFilter, 16, Width(), Height()));
+ else
+ continue;
+
+ CheckGuardBlocks();
+
+ for (int y = 0; y < Height(); ++y)
+ for (int x = 0; x < Width(); ++x)
+ ASSERT_EQ(lookup(ref, y * kOutputStride + x),
+ lookup(out, y * kOutputStride + x))
+ << "mismatch at (" << x << "," << y << "), "
+ << "filters (" << filter_bank << "," << filter_x << ","
+ << filter_y << ")";
+ }
+ }
+ }
+ }
+ }
+
+ void FilterExtremes() {
+ uint8_t *const in = input();
+ uint8_t *const out = output();
+ uint8_t *ref;
+ if (UUT_->use_highbd_ == 0) {
+ ref = ref8_;
+ } else {
+ ref = CONVERT_TO_BYTEPTR(ref16_);
+ }
+
+ // Populate ref and out with some random data
+ ::libaom_test::ACMRandom prng;
+ for (int y = 0; y < Height(); ++y) {
+ for (int x = 0; x < Width(); ++x) {
+ uint16_t r;
+ if (UUT_->use_highbd_ == 0 || UUT_->use_highbd_ == 8) {
+ r = prng.Rand8Extremes();
+ } else {
+ r = prng.Rand16() & mask_;
+ }
+ assign_val(out, y * kOutputStride + x, r);
+ assign_val(ref, y * kOutputStride + x, r);
+ }
+ }
+
+ for (int axis = 0; axis < 2; axis++) {
+ int seed_val = 0;
+ while (seed_val < 256) {
+ for (int y = 0; y < 8; ++y) {
+ for (int x = 0; x < 8; ++x) {
+ assign_val(in, y * kOutputStride + x - SUBPEL_TAPS / 2 + 1,
+ ((seed_val >> (axis ? y : x)) & 1) * mask_);
+ if (axis) seed_val++;
+ }
+ if (axis)
+ seed_val -= 8;
+ else
+ seed_val++;
+ }
+ if (axis) seed_val += 8;
+ int subpel_search;
+ for (subpel_search = USE_4_TAPS; subpel_search <= USE_8_TAPS;
+ ++subpel_search) {
+ for (int filter_bank = 0; filter_bank < kNumFilterBanks;
+ ++filter_bank) {
+ const InterpFilter filter = (InterpFilter)filter_bank;
+ const InterpKernel *filters =
+ (const InterpKernel *)av1_get_interp_filter_kernel(
+ filter, subpel_search);
+ for (int filter_x = 0; filter_x < kNumFilters; ++filter_x) {
+ for (int filter_y = 0; filter_y < kNumFilters; ++filter_y) {
+ wrapper_filter_block2d_8_c(in, kInputStride, filters[filter_x],
+ filters[filter_y], ref,
+ kOutputStride, Width(), Height());
+ if (filter_x && filter_y)
+ continue;
+ else if (filter_y)
+ API_REGISTER_STATE_CHECK(UUT_->v8_(
+ in, kInputStride, out, kOutputStride, kInvalidFilter, 16,
+ filters[filter_y], 16, Width(), Height()));
+ else if (filter_x)
+ API_REGISTER_STATE_CHECK(UUT_->h8_(
+ in, kInputStride, out, kOutputStride, filters[filter_x],
+ 16, kInvalidFilter, 16, Width(), Height()));
+ else
+ continue;
+
+ for (int y = 0; y < Height(); ++y)
+ for (int x = 0; x < Width(); ++x)
+ ASSERT_EQ(lookup(ref, y * kOutputStride + x),
+ lookup(out, y * kOutputStride + x))
+ << "mismatch at (" << x << "," << y << "), "
+ << "filters (" << filter_bank << "," << filter_x << ","
+ << filter_y << ")";
+ }
+ }
+ }
+ }
+ }
+ }
+ }
+
+ void SpeedTest() {
+ uint8_t *const in = input();
+ uint8_t *const out = output();
+ uint8_t *ref;
+ if (UUT_->use_highbd_ == 0) {
+ ref = ref8_;
+ } else {
+ ref = CONVERT_TO_BYTEPTR(ref16_);
+ }
+
+ // Populate ref and out with some random data
+ ::libaom_test::ACMRandom prng;
+ for (int y = 0; y < Height(); ++y) {
+ for (int x = 0; x < Width(); ++x) {
+ uint16_t r;
+ if (UUT_->use_highbd_ == 0 || UUT_->use_highbd_ == 8) {
+ r = prng.Rand8Extremes();
+ } else {
+ r = prng.Rand16() & mask_;
+ }
+ assign_val(out, y * kOutputStride + x, r);
+ assign_val(ref, y * kOutputStride + x, r);
+ }
+ }
+
+ InterpFilter filter = (InterpFilter)1;
+ const InterpKernel *filters =
+ (const InterpKernel *)av1_get_interp_filter_kernel(filter, USE_8_TAPS);
+ wrapper_filter_average_block2d_8_c(in, kInputStride, filters[1], filters[1],
+ out, kOutputStride, Width(), Height());
+
+ aom_usec_timer timer;
+ int tests_num = 1000;
+
+ aom_usec_timer_start(&timer);
+ while (tests_num > 0) {
+ for (int filter_bank = 0; filter_bank < kNumFilterBanks; ++filter_bank) {
+ filter = (InterpFilter)filter_bank;
+ filters = (const InterpKernel *)av1_get_interp_filter_kernel(
+ filter, USE_8_TAPS);
+ for (int filter_x = 0; filter_x < kNumFilters; ++filter_x) {
+ for (int filter_y = 0; filter_y < kNumFilters; ++filter_y) {
+ if (filter_x && filter_y) continue;
+ if (filter_y)
+ API_REGISTER_STATE_CHECK(UUT_->v8_(
+ in, kInputStride, out, kOutputStride, kInvalidFilter, 16,
+ filters[filter_y], 16, Width(), Height()));
+ else if (filter_x)
+ API_REGISTER_STATE_CHECK(UUT_->h8_(
+ in, kInputStride, out, kOutputStride, filters[filter_x], 16,
+ kInvalidFilter, 16, Width(), Height()));
+ }
+ }
+ }
+ tests_num--;
+ }
+ aom_usec_timer_mark(&timer);
+
+ const int elapsed_time =
+ static_cast<int>(aom_usec_timer_elapsed(&timer) / 1000);
+ printf("%dx%d (bitdepth %d) time: %5d ms\n", Width(), Height(),
+ UUT_->use_highbd_, elapsed_time);
+ }
+
const ConvolveFunctions *UUT_;
static uint8_t *input_;
static uint8_t *ref8_;
@@ -474,21 +674,20 @@
int mask_;
};
-uint8_t *ConvolveTest::input_ = nullptr;
-uint8_t *ConvolveTest::ref8_ = nullptr;
-uint8_t *ConvolveTest::output_ = nullptr;
-uint8_t *ConvolveTest::output_ref_ = nullptr;
-uint16_t *ConvolveTest::input16_ = nullptr;
-uint16_t *ConvolveTest::ref16_ = nullptr;
-uint16_t *ConvolveTest::output16_ = nullptr;
-uint16_t *ConvolveTest::output16_ref_ = nullptr;
+uint8_t *ConvolveTestBase::input_ = nullptr;
+uint8_t *ConvolveTestBase::ref8_ = nullptr;
+uint8_t *ConvolveTestBase::output_ = nullptr;
+uint8_t *ConvolveTestBase::output_ref_ = nullptr;
+uint16_t *ConvolveTestBase::input16_ = nullptr;
+uint16_t *ConvolveTestBase::ref16_ = nullptr;
+uint16_t *ConvolveTestBase::output16_ = nullptr;
+uint16_t *ConvolveTestBase::output16_ref_ = nullptr;
-TEST_P(ConvolveTest, GuardBlocks) { CheckGuardBlocks(); }
+using LowbdConvolveTest = ConvolveTestBase;
-const int kNumFilterBanks = SWITCHABLE_FILTERS;
-const int kNumFilters = 16;
+TEST_P(LowbdConvolveTest, GuardBlocks) { CheckGuardBlocks(); }
-TEST(ConvolveTest, FiltersWontSaturateWhenAddedPairwise) {
+void FiltersWontSaturateWhenAddedPairwise() {
int subpel_search;
for (subpel_search = USE_4_TAPS; subpel_search <= USE_8_TAPS;
++subpel_search) {
@@ -515,205 +714,17 @@
}
}
-const int16_t kInvalidFilter[8] = { 0 };
-
-TEST_P(ConvolveTest, MatchesReferenceSubpixelFilter) {
- uint8_t *const in = input();
- uint8_t *const out = output();
- uint8_t *ref;
- if (UUT_->use_highbd_ == 0) {
- ref = ref8_;
- } else {
- ref = CONVERT_TO_BYTEPTR(ref16_);
- }
- int subpel_search;
- for (subpel_search = USE_4_TAPS; subpel_search <= USE_8_TAPS;
- ++subpel_search) {
- for (int filter_bank = 0; filter_bank < kNumFilterBanks; ++filter_bank) {
- const InterpFilter filter = (InterpFilter)filter_bank;
- const InterpKernel *filters =
- (const InterpKernel *)av1_get_interp_filter_kernel(filter,
- subpel_search);
- for (int filter_x = 0; filter_x < kNumFilters; ++filter_x) {
- for (int filter_y = 0; filter_y < kNumFilters; ++filter_y) {
- wrapper_filter_block2d_8_c(in, kInputStride, filters[filter_x],
- filters[filter_y], ref, kOutputStride,
- Width(), Height());
-
- if (filter_x && filter_y)
- continue;
- else if (filter_y)
- API_REGISTER_STATE_CHECK(
- UUT_->v8_(in, kInputStride, out, kOutputStride, kInvalidFilter,
- 16, filters[filter_y], 16, Width(), Height()));
- else if (filter_x)
- API_REGISTER_STATE_CHECK(UUT_->h8_(
- in, kInputStride, out, kOutputStride, filters[filter_x], 16,
- kInvalidFilter, 16, Width(), Height()));
- else
- continue;
-
- CheckGuardBlocks();
-
- for (int y = 0; y < Height(); ++y)
- for (int x = 0; x < Width(); ++x)
- ASSERT_EQ(lookup(ref, y * kOutputStride + x),
- lookup(out, y * kOutputStride + x))
- << "mismatch at (" << x << "," << y << "), "
- << "filters (" << filter_bank << "," << filter_x << ","
- << filter_y << ")";
- }
- }
- }
- }
+TEST(LowbdConvolveTest, FiltersWontSaturateWhenAddedPairwise) {
+ FiltersWontSaturateWhenAddedPairwise();
}
-TEST_P(ConvolveTest, FilterExtremes) {
- uint8_t *const in = input();
- uint8_t *const out = output();
- uint8_t *ref;
- if (UUT_->use_highbd_ == 0) {
- ref = ref8_;
- } else {
- ref = CONVERT_TO_BYTEPTR(ref16_);
- }
-
- // Populate ref and out with some random data
- ::libaom_test::ACMRandom prng;
- for (int y = 0; y < Height(); ++y) {
- for (int x = 0; x < Width(); ++x) {
- uint16_t r;
- if (UUT_->use_highbd_ == 0 || UUT_->use_highbd_ == 8) {
- r = prng.Rand8Extremes();
- } else {
- r = prng.Rand16() & mask_;
- }
- assign_val(out, y * kOutputStride + x, r);
- assign_val(ref, y * kOutputStride + x, r);
- }
- }
-
- for (int axis = 0; axis < 2; axis++) {
- int seed_val = 0;
- while (seed_val < 256) {
- for (int y = 0; y < 8; ++y) {
- for (int x = 0; x < 8; ++x) {
- assign_val(in, y * kOutputStride + x - SUBPEL_TAPS / 2 + 1,
- ((seed_val >> (axis ? y : x)) & 1) * mask_);
- if (axis) seed_val++;
- }
- if (axis)
- seed_val -= 8;
- else
- seed_val++;
- }
- if (axis) seed_val += 8;
- int subpel_search;
- for (subpel_search = USE_4_TAPS; subpel_search <= USE_8_TAPS;
- ++subpel_search) {
- for (int filter_bank = 0; filter_bank < kNumFilterBanks;
- ++filter_bank) {
- const InterpFilter filter = (InterpFilter)filter_bank;
- const InterpKernel *filters =
- (const InterpKernel *)av1_get_interp_filter_kernel(filter,
- subpel_search);
- for (int filter_x = 0; filter_x < kNumFilters; ++filter_x) {
- for (int filter_y = 0; filter_y < kNumFilters; ++filter_y) {
- wrapper_filter_block2d_8_c(in, kInputStride, filters[filter_x],
- filters[filter_y], ref, kOutputStride,
- Width(), Height());
- if (filter_x && filter_y)
- continue;
- else if (filter_y)
- API_REGISTER_STATE_CHECK(UUT_->v8_(
- in, kInputStride, out, kOutputStride, kInvalidFilter, 16,
- filters[filter_y], 16, Width(), Height()));
- else if (filter_x)
- API_REGISTER_STATE_CHECK(UUT_->h8_(
- in, kInputStride, out, kOutputStride, filters[filter_x], 16,
- kInvalidFilter, 16, Width(), Height()));
- else
- continue;
-
- for (int y = 0; y < Height(); ++y)
- for (int x = 0; x < Width(); ++x)
- ASSERT_EQ(lookup(ref, y * kOutputStride + x),
- lookup(out, y * kOutputStride + x))
- << "mismatch at (" << x << "," << y << "), "
- << "filters (" << filter_bank << "," << filter_x << ","
- << filter_y << ")";
- }
- }
- }
- }
- }
- }
+TEST_P(LowbdConvolveTest, MatchesReferenceSubpixelFilter) {
+ MatchesReferenceSubpixelFilter();
}
-TEST_P(ConvolveTest, DISABLED_Speed) {
- uint8_t *const in = input();
- uint8_t *const out = output();
- uint8_t *ref;
- if (UUT_->use_highbd_ == 0) {
- ref = ref8_;
- } else {
- ref = CONVERT_TO_BYTEPTR(ref16_);
- }
+TEST_P(LowbdConvolveTest, FilterExtremes) { FilterExtremes(); }
- // Populate ref and out with some random data
- ::libaom_test::ACMRandom prng;
- for (int y = 0; y < Height(); ++y) {
- for (int x = 0; x < Width(); ++x) {
- uint16_t r;
- if (UUT_->use_highbd_ == 0 || UUT_->use_highbd_ == 8) {
- r = prng.Rand8Extremes();
- } else {
- r = prng.Rand16() & mask_;
- }
- assign_val(out, y * kOutputStride + x, r);
- assign_val(ref, y * kOutputStride + x, r);
- }
- }
-
- const InterpFilter filter = (InterpFilter)1;
- const InterpKernel *filters =
- (const InterpKernel *)av1_get_interp_filter_kernel(filter, USE_8_TAPS);
- wrapper_filter_average_block2d_8_c(in, kInputStride, filters[1], filters[1],
- out, kOutputStride, Width(), Height());
-
- aom_usec_timer timer;
- int tests_num = 1000;
-
- aom_usec_timer_start(&timer);
- while (tests_num > 0) {
- for (int filter_bank = 0; filter_bank < kNumFilterBanks; ++filter_bank) {
- const InterpFilter filter = (InterpFilter)filter_bank;
- const InterpKernel *filters =
- (const InterpKernel *)av1_get_interp_filter_kernel(filter,
- USE_8_TAPS);
- for (int filter_x = 0; filter_x < kNumFilters; ++filter_x) {
- for (int filter_y = 0; filter_y < kNumFilters; ++filter_y) {
- if (filter_x && filter_y) continue;
- if (filter_y)
- API_REGISTER_STATE_CHECK(
- UUT_->v8_(in, kInputStride, out, kOutputStride, kInvalidFilter,
- 16, filters[filter_y], 16, Width(), Height()));
- else if (filter_x)
- API_REGISTER_STATE_CHECK(UUT_->h8_(
- in, kInputStride, out, kOutputStride, filters[filter_x], 16,
- kInvalidFilter, 16, Width(), Height()));
- }
- }
- }
- tests_num--;
- }
- aom_usec_timer_mark(&timer);
-
- const int elapsed_time =
- static_cast<int>(aom_usec_timer_elapsed(&timer) / 1000);
- printf("%dx%d (bitdepth %d) time: %5d ms\n", Width(), Height(),
- UUT_->use_highbd_, elapsed_time);
-}
+TEST_P(LowbdConvolveTest, DISABLED_Speed) { SpeedTest(); }
using std::make_tuple;
@@ -727,14 +738,14 @@
aom_highbd_##func(src, src_stride, dst, dst_stride, filter_x, \
filter_x_stride, filter_y, filter_y_stride, w, h, bd); \
}
-#if HAVE_SSE2 && ARCH_X86_64
+#if HAVE_SSE2 && AOM_ARCH_X86_64
WRAP(convolve8_horiz_sse2, 8)
WRAP(convolve8_vert_sse2, 8)
WRAP(convolve8_horiz_sse2, 10)
WRAP(convolve8_vert_sse2, 10)
WRAP(convolve8_horiz_sse2, 12)
WRAP(convolve8_vert_sse2, 12)
-#endif // HAVE_SSE2 && ARCH_X86_64
+#endif // HAVE_SSE2 && AOM_ARCH_X86_64
WRAP(convolve8_horiz_c, 8)
WRAP(convolve8_vert_c, 8)
@@ -758,25 +769,45 @@
#undef WRAP
#if CONFIG_AV1_HIGHBITDEPTH
+
+using HighbdConvolveTest = ConvolveTestBase;
+
+TEST_P(HighbdConvolveTest, GuardBlocks) { CheckGuardBlocks(); }
+
+TEST(HighbdConvolveTest, FiltersWontSaturateWhenAddedPairwise) {
+ FiltersWontSaturateWhenAddedPairwise();
+}
+
+TEST_P(HighbdConvolveTest, MatchesReferenceSubpixelFilter) {
+ MatchesReferenceSubpixelFilter();
+}
+
+TEST_P(HighbdConvolveTest, FilterExtremes) { FilterExtremes(); }
+
+TEST_P(HighbdConvolveTest, DISABLED_Speed) { SpeedTest(); }
+
const ConvolveFunctions wrap_convolve8_c(wrap_convolve8_horiz_c_8,
wrap_convolve8_vert_c_8, 8);
const ConvolveFunctions wrap_convolve10_c(wrap_convolve8_horiz_c_10,
wrap_convolve8_vert_c_10, 10);
const ConvolveFunctions wrap_convolve12_c(wrap_convolve8_horiz_c_12,
wrap_convolve8_vert_c_12, 12);
-const ConvolveParam kArrayConvolve_c[] = { ALL_SIZES(wrap_convolve8_c),
- ALL_SIZES(wrap_convolve10_c),
- ALL_SIZES(wrap_convolve12_c) };
-#else
+const ConvolveParam kArrayHighbdConvolve_c[] = { ALL_SIZES(wrap_convolve8_c),
+ ALL_SIZES(wrap_convolve10_c),
+ ALL_SIZES(wrap_convolve12_c) };
+
+INSTANTIATE_TEST_SUITE_P(C, HighbdConvolveTest,
+ ::testing::ValuesIn(kArrayHighbdConvolve_c));
+#endif // CONFIG_AV1_HIGHBITDEPTH
+
const ConvolveFunctions convolve8_c(aom_convolve8_horiz_c, aom_convolve8_vert_c,
0);
const ConvolveParam kArrayConvolve_c[] = { ALL_SIZES(convolve8_c) };
-#endif
-INSTANTIATE_TEST_SUITE_P(C, ConvolveTest,
+INSTANTIATE_TEST_SUITE_P(C, LowbdConvolveTest,
::testing::ValuesIn(kArrayConvolve_c));
-#if HAVE_SSE2 && ARCH_X86_64
+#if HAVE_SSE2 && AOM_ARCH_X86_64
#if CONFIG_AV1_HIGHBITDEPTH
const ConvolveFunctions wrap_convolve8_sse2(wrap_convolve8_horiz_sse2_8,
wrap_convolve8_vert_sse2_8, 8);
@@ -784,15 +815,19 @@
wrap_convolve8_vert_sse2_10, 10);
const ConvolveFunctions wrap_convolve12_sse2(wrap_convolve8_horiz_sse2_12,
wrap_convolve8_vert_sse2_12, 12);
-const ConvolveParam kArrayConvolve_sse2[] = { ALL_SIZES(wrap_convolve8_sse2),
- ALL_SIZES(wrap_convolve10_sse2),
- ALL_SIZES(wrap_convolve12_sse2) };
-#else
+const ConvolveParam kArrayHighbdConvolve_sse2[] = {
+ ALL_SIZES(wrap_convolve8_sse2), ALL_SIZES(wrap_convolve10_sse2),
+ ALL_SIZES(wrap_convolve12_sse2)
+};
+
+INSTANTIATE_TEST_SUITE_P(SSE2, HighbdConvolveTest,
+ ::testing::ValuesIn(kArrayHighbdConvolve_sse2));
+#endif
const ConvolveFunctions convolve8_sse2(aom_convolve8_horiz_sse2,
aom_convolve8_vert_sse2, 0);
const ConvolveParam kArrayConvolve_sse2[] = { ALL_SIZES(convolve8_sse2) };
-#endif
-INSTANTIATE_TEST_SUITE_P(SSE2, ConvolveTest,
+
+INSTANTIATE_TEST_SUITE_P(SSE2, LowbdConvolveTest,
::testing::ValuesIn(kArrayConvolve_sse2));
#endif
@@ -801,7 +836,8 @@
aom_convolve8_vert_ssse3, 0);
const ConvolveParam kArrayConvolve8_ssse3[] = { ALL_SIZES(convolve8_ssse3) };
-INSTANTIATE_TEST_SUITE_P(SSSE3, ConvolveTest,
+
+INSTANTIATE_TEST_SUITE_P(SSSE3, LowbdConvolveTest,
::testing::ValuesIn(kArrayConvolve8_ssse3));
#endif
@@ -813,18 +849,29 @@
wrap_convolve8_vert_avx2_10, 10);
const ConvolveFunctions wrap_convolve12_avx2(wrap_convolve8_horiz_avx2_12,
wrap_convolve8_vert_avx2_12, 12);
-const ConvolveParam kArray_Convolve8_avx2[] = {
+const ConvolveParam kArray_HighbdConvolve8_avx2[] = {
ALL_SIZES_64(wrap_convolve8_avx2), ALL_SIZES_64(wrap_convolve10_avx2),
ALL_SIZES_64(wrap_convolve12_avx2)
};
-#else
+
+INSTANTIATE_TEST_SUITE_P(AVX2, HighbdConvolveTest,
+ ::testing::ValuesIn(kArray_HighbdConvolve8_avx2));
+#endif
const ConvolveFunctions convolve8_avx2(aom_convolve8_horiz_avx2,
aom_convolve8_vert_avx2, 0);
const ConvolveParam kArray_Convolve8_avx2[] = { ALL_SIZES(convolve8_avx2) };
-#endif
-INSTANTIATE_TEST_SUITE_P(AVX2, ConvolveTest,
+INSTANTIATE_TEST_SUITE_P(AVX2, LowbdConvolveTest,
::testing::ValuesIn(kArray_Convolve8_avx2));
#endif // HAVE_AVX2
+#if HAVE_NEON
+const ConvolveFunctions convolve8_neon(aom_convolve8_horiz_neon,
+ aom_convolve8_vert_neon, 0);
+const ConvolveParam kArray_Convolve8_neon[] = { ALL_SIZES(convolve8_neon) };
+
+INSTANTIATE_TEST_SUITE_P(NEON, LowbdConvolveTest,
+ ::testing::ValuesIn(kArray_Convolve8_neon));
+#endif // HAVE_NEON
+
} // namespace
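The mechanical part of the convolve_test.cc change above is a common gtest refactor: the TEST_P bodies move into a TestWithParam base class, and each variant becomes a type alias of that base so the low- and high-bitdepth suites can be instantiated side by side instead of being switched with #if/#else. A minimal sketch of the pattern, with invented names, shown only to isolate the idea:

    #include "gtest/gtest.h"

    // Shared fixture: the parameterized test bodies live here.
    class ConvolveLikeTestBase : public ::testing::TestWithParam<int> {
     protected:
      void MatchesReference() { ASSERT_GE(GetParam(), 0); }
    };

    // Aliases give one fixture two distinct test-suite names.
    using LowbdLikeTest = ConvolveLikeTestBase;
    using HighbdLikeTest = ConvolveLikeTestBase;

    TEST_P(LowbdLikeTest, MatchesReference) { MatchesReference(); }
    TEST_P(HighbdLikeTest, MatchesReference) { MatchesReference(); }

    // Each alias gets its own parameter list, so both suites coexist in one
    // binary; previously the low- and high-bitdepth parameter arrays were
    // selected with #if/#else and only one suite could exist.
    INSTANTIATE_TEST_SUITE_P(C, LowbdLikeTest, ::testing::Values(0, 1));
    INSTANTIATE_TEST_SUITE_P(C, HighbdLikeTest, ::testing::Values(8, 10, 12));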
diff --git a/test/corner_match_test.cc b/test/corner_match_test.cc
index 673205a..93ca8ec 100644
--- a/test/corner_match_test.cc
+++ b/test/corner_match_test.cc
@@ -27,9 +27,9 @@
using libaom_test::ACMRandom;
-typedef double (*ComputeCrossCorrFunc)(unsigned char *im1, int stride1, int x1,
- int y1, unsigned char *im2, int stride2,
- int x2, int y2);
+typedef double (*ComputeCrossCorrFunc)(const unsigned char *im1, int stride1,
+ int x1, int y1, const unsigned char *im2,
+ int stride2, int x2, int y2);
using std::make_tuple;
using std::tuple;
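For context on the corner_match_test.cc hunk: a pointer to a function taking non-const buffers does not convert to a pointer type taking const buffers, so tightening the typedef forces every implementation named in the test parameters to adopt const in lockstep. A self-contained illustration with invented names:

    #include <cstdio>

    typedef double (*CrossCorrFunc)(const unsigned char *im1, int stride1,
                                    const unsigned char *im2, int stride2);

    static double corr_const(const unsigned char *im1, int stride1,
                             const unsigned char *im2, int stride2) {
      (void)stride1;
      (void)stride2;
      return (im1[0] == im2[0]) ? 1.0 : 0.0;
    }

    // static double corr_mutable(unsigned char *, int, unsigned char *, int);
    // CrossCorrFunc bad = corr_mutable;  // error: types differ in constness

    int main() {
      const unsigned char a = 1, b = 1;
      CrossCorrFunc good = corr_const;  // OK: signatures match exactly
      std::printf("%f\n", good(&a, 1, &b, 1));
      return 0;
    }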
diff --git a/test/cpu_used_firstpass_test.cc b/test/cpu_used_firstpass_test.cc
index c53db6e..cfffcd7 100644
--- a/test/cpu_used_firstpass_test.cc
+++ b/test/cpu_used_firstpass_test.cc
@@ -9,6 +9,8 @@
* PATENTS file, you can obtain it at www.aomedia.org/license/patent.
*/
+#include <cstdlib>
+
#include "test/codec_factory.h"
#include "test/encode_test_driver.h"
#include "test/i420_video_source.h"
@@ -84,7 +86,7 @@
first_pass_cpu_used_ = GET_PARAM(1);
if (first_pass_cpu_used_ == second_pass_cpu_used_) return;
ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
- psnr_diff = abs(ref_psnr - GetAveragePsnr());
+ psnr_diff = std::abs(ref_psnr - GetAveragePsnr());
EXPECT_LT(psnr_diff, GetPsnrDiffThreshold())
<< "first pass cpu used = " << first_pass_cpu_used_
<< ", second pass cpu used = " << second_pass_cpu_used_;
diff --git a/test/datarate_test.cc b/test/datarate_test.cc
index 8fdc662..21b40d9 100644
--- a/test/datarate_test.cc
+++ b/test/datarate_test.cc
@@ -12,6 +12,7 @@
#include "config/aom_config.h"
#include "third_party/googletest/src/googletest/include/gtest/gtest.h"
+#include "test/acm_random.h"
#include "test/codec_factory.h"
#include "test/datarate_test.h"
#include "test/encode_test_driver.h"
@@ -109,7 +110,7 @@
<< " The datarate for the file is lower than target by too much!";
ASSERT_LE(effective_datarate_, cfg_.rc_target_bitrate * 1.19)
<< " The datarate for the file is greater than target by too much!";
- ASSERT_LT(num_spikes_, 8);
+ ASSERT_LE(num_spikes_, 8);
ASSERT_LT(num_spikes_high_, 1);
}
@@ -347,7 +348,7 @@
ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
ASSERT_GE(effective_datarate_, cfg_.rc_target_bitrate * 0.85)
<< " The datarate for the file is lower than target by too much!";
- ASSERT_LE(effective_datarate_, cfg_.rc_target_bitrate * 1.31)
+ ASSERT_LE(effective_datarate_, cfg_.rc_target_bitrate * 1.40)
<< " The datarate for the file is greater than target by too much!";
if (last_drop > 0) {
ASSERT_LE(first_drop_, last_drop)
@@ -396,7 +397,11 @@
}
// Check basic rate targeting for CBR, for 444 input screen mode.
+#if defined(CONFIG_MAX_DECODE_PROFILE) && CONFIG_MAX_DECODE_PROFILE < 1
+TEST_P(DatarateTestLarge, DISABLED_BasicRateTargeting444CBRScreen) {
+#else
TEST_P(DatarateTestLarge, BasicRateTargeting444CBRScreen) {
+#endif
BasicRateTargeting444CBRScreenTest();
}
@@ -508,7 +513,11 @@
}
// Check basic rate targeting for CBR for 444 screen mode.
+#if defined(CONFIG_MAX_DECODE_PROFILE) && CONFIG_MAX_DECODE_PROFILE < 1
+TEST_P(DatarateTestRealtime, DISABLED_BasicRateTargeting444CBRScreen) {
+#else
TEST_P(DatarateTestRealtime, BasicRateTargeting444CBRScreen) {
+#endif
BasicRateTargeting444CBRScreenTest();
}
@@ -524,6 +533,68 @@
ChangingSpeedTest();
}
+class DatarateTestSetFrameQpRealtime
+ : public DatarateTest,
+ public ::testing::TestWithParam<const libaom_test::AV1CodecFactory *> {
+ public:
+ DatarateTestSetFrameQpRealtime() : DatarateTest(GetParam()), frame_(0) {}
+
+ protected:
+ virtual ~DatarateTestSetFrameQpRealtime() {}
+
+ virtual void SetUp() {
+ InitializeConfig(libaom_test::kRealTime);
+ ResetModel();
+ }
+
+ virtual void PreEncodeFrameHook(::libaom_test::VideoSource *video,
+ ::libaom_test::Encoder *encoder) {
+ set_cpu_used_ = 7;
+ DatarateTest::PreEncodeFrameHook(video, encoder);
+ frame_qp_ = rnd_.PseudoUniform(63);
+ encoder->Control(AV1E_SET_QUANTIZER_ONE_PASS, frame_qp_);
+ frame_++;
+ }
+
+ virtual void PostEncodeFrameHook(::libaom_test::Encoder *encoder) {
+ if (frame_ >= total_frames_) return;
+ int qp = 0;
+ encoder->Control(AOME_GET_LAST_QUANTIZER_64, &qp);
+ ASSERT_EQ(qp, frame_qp_);
+ }
+
+ protected:
+ int total_frames_;
+
+ private:
+ int frame_qp_;
+ int frame_;
+ libaom_test::ACMRandom rnd_;
+};
+
+TEST_P(DatarateTestSetFrameQpRealtime, SetFrameQpOnePass) {
+ cfg_.rc_buf_initial_sz = 500;
+ cfg_.rc_buf_optimal_sz = 500;
+ cfg_.rc_buf_sz = 1000;
+ cfg_.rc_undershoot_pct = 20;
+  cfg_.rc_overshoot_pct = 20;
+ cfg_.rc_min_quantizer = 0;
+ cfg_.rc_max_quantizer = 50;
+ cfg_.rc_end_usage = AOM_CBR;
+ cfg_.rc_target_bitrate = 200;
+ cfg_.g_lag_in_frames = 0;
+ cfg_.g_error_resilient = 1;
+ cfg_.kf_max_dist = 9999;
+ cfg_.rc_dropframe_thresh = 0;
+
+ total_frames_ = 100;
+ ::libaom_test::I420VideoSource video("hantro_collage_w352h288.yuv", 352, 288,
+ 30, 1, 0, 100);
+
+ ResetModel();
+ ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
+}
+
AV1_INSTANTIATE_TEST_SUITE(DatarateTestLarge,
::testing::Values(::libaom_test::kRealTime),
::testing::Range(5, 7), ::testing::Values(0, 3),
@@ -535,16 +606,21 @@
AV1_INSTANTIATE_TEST_SUITE(DatarateTestRealtime,
::testing::Values(::libaom_test::kRealTime),
- ::testing::Range(7, 11), ::testing::Values(0, 3),
+ ::testing::Range(7, 12), ::testing::Values(0, 3),
::testing::Values(0, 1));
AV1_INSTANTIATE_TEST_SUITE(DatarateTestFrameDropRealtime,
::testing::Values(::libaom_test::kRealTime),
- ::testing::Range(7, 11), ::testing::Values(0, 3));
+ ::testing::Range(7, 12), ::testing::Values(0, 3));
AV1_INSTANTIATE_TEST_SUITE(DatarateTestSpeedChangeRealtime,
::testing::Values(::libaom_test::kRealTime),
::testing::Values(0, 3));
+INSTANTIATE_TEST_SUITE_P(
+ AV1, DatarateTestSetFrameQpRealtime,
+ ::testing::Values(
+ static_cast<const libaom_test::CodecFactory *>(&libaom_test::kAV1)));
+
} // namespace
} // namespace datarate_test
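Note: the DatarateTestSetFrameQpRealtime suite added above pairs the new
AV1E_SET_QUANTIZER_ONE_PASS control with AOME_GET_LAST_QUANTIZER_64 to verify
that the requested QP is actually used. A minimal standalone sketch of the
same pattern, assuming an already-wrapped aom_image_t *img and omitting error
checks:

    // Force a fixed QP for the next frame in one-pass RT encoding, then read
    // back the QP the encoder used (both controls are public API).
    aom_codec_iface_t *iface = aom_codec_av1_cx();
    aom_codec_enc_cfg_t cfg;
    aom_codec_enc_config_default(iface, &cfg, AOM_USAGE_REALTIME);
    cfg.g_w = 352;
    cfg.g_h = 288;
    cfg.rc_end_usage = AOM_CBR;
    cfg.g_lag_in_frames = 0;
    aom_codec_ctx_t enc;
    aom_codec_enc_init(&enc, iface, &cfg, 0);
    const int frame_qp = 30;  // valid range is [0, 63]
    aom_codec_control(&enc, AV1E_SET_QUANTIZER_ONE_PASS, frame_qp);
    aom_codec_encode(&enc, img, /*pts=*/0, /*duration=*/1, /*flags=*/0);
    int used_qp = 0;
    aom_codec_control(&enc, AOME_GET_LAST_QUANTIZER_64, &used_qp);
    // used_qp is expected to equal frame_qp, as the test above asserts.
    aom_codec_destroy(&enc);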
diff --git a/test/deltaq_mode_test.cc b/test/deltaq_mode_test.cc
index 0a5e5aa..5960d27 100644
--- a/test/deltaq_mode_test.cc
+++ b/test/deltaq_mode_test.cc
@@ -10,12 +10,14 @@
*/
#include <cstddef>
+#include <cstdint>
#include <vector>
#include "aom/aomcx.h"
#include "aom/aom_codec.h"
#include "aom/aom_encoder.h"
#include "aom/aom_image.h"
+#include "config/aom_config.h"
#include "third_party/googletest/src/googletest/include/gtest/gtest.h"
namespace {
@@ -67,7 +69,7 @@
EXPECT_EQ(aom_codec_encode(&enc, &img, 0, 1, 0), AOM_CODEC_OK);
aom_codec_iter_t iter = nullptr;
const aom_codec_cx_pkt_t *pkt = aom_codec_get_cx_data(&enc, &iter);
- EXPECT_NE(pkt, nullptr);
+ ASSERT_NE(pkt, nullptr);
EXPECT_EQ(pkt->kind, AOM_CODEC_CX_FRAME_PKT);
// pkt->data.frame.flags is 0x1f0011.
EXPECT_EQ(pkt->data.frame.flags & AOM_FRAME_IS_KEY, AOM_FRAME_IS_KEY);
@@ -83,4 +85,125 @@
EXPECT_EQ(AOM_CODEC_OK, aom_codec_destroy(&enc));
}
+// The implementation of multi-threading for deltaq-mode=3 in allintra
+// mode is based on row multi-threading.
+// The test ensures that when row-mt is turned off, deltaq-mode=3 can
+// still encode and decode properly.
+TEST(DeltaqModeTest, DeltaqMode3MultiThreadNoRowMT) {
+ constexpr int kWidth = 1280;
+ constexpr int kHeight = 720;
+ // Dummy buffer of neutral gray samples.
+ constexpr size_t kBufferSize = kWidth * kHeight + kWidth * kHeight / 2;
+ std::vector<unsigned char> buffer(kBufferSize,
+ static_cast<unsigned char>(128));
+
+ aom_image_t img;
+ EXPECT_EQ(&img, aom_img_wrap(&img, AOM_IMG_FMT_I420, kWidth, kHeight, 1,
+ buffer.data()));
+
+ aom_codec_iface_t *iface = aom_codec_av1_cx();
+ aom_codec_enc_cfg_t cfg;
+ EXPECT_EQ(aom_codec_enc_config_default(iface, &cfg, AOM_USAGE_GOOD_QUALITY),
+ AOM_CODEC_OK);
+ cfg.g_w = kWidth;
+ cfg.g_h = kHeight;
+ cfg.g_threads = 10;
+ cfg.rc_end_usage = AOM_Q;
+ cfg.g_profile = 0;
+ cfg.g_bit_depth = AOM_BITS_8;
+ cfg.g_input_bit_depth = 8;
+ cfg.g_lag_in_frames = 0;
+ cfg.rc_min_quantizer = 0;
+ cfg.rc_max_quantizer = 63;
+ cfg.g_pass = AOM_RC_ONE_PASS;
+ cfg.g_limit = 1;
+ aom_codec_ctx_t enc;
+ EXPECT_EQ(aom_codec_enc_init(&enc, iface, &cfg, 0), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_ROW_MT, 0), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AOME_SET_CPUUSED, 6), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AOME_SET_CQ_LEVEL, 14), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_DELTAQ_MODE, 3), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_set_option(&enc, "passes", "1"), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_COLOR_RANGE, AOM_CR_STUDIO_RANGE),
+ AOM_CODEC_OK);
+
+ EXPECT_EQ(aom_codec_encode(&enc, &img, 0, 1, 0), AOM_CODEC_OK);
+ aom_codec_iter_t iter = nullptr;
+ const aom_codec_cx_pkt_t *pkt = aom_codec_get_cx_data(&enc, &iter);
+ ASSERT_NE(pkt, nullptr);
+ EXPECT_EQ(pkt->kind, AOM_CODEC_CX_FRAME_PKT);
+ // pkt->data.frame.flags is 0x1f0011.
+ EXPECT_EQ(pkt->data.frame.flags & AOM_FRAME_IS_KEY, AOM_FRAME_IS_KEY);
+ pkt = aom_codec_get_cx_data(&enc, &iter);
+ EXPECT_EQ(pkt, nullptr);
+
+ // Flush encoder
+ EXPECT_EQ(AOM_CODEC_OK, aom_codec_encode(&enc, nullptr, 0, 1, 0));
+ iter = nullptr;
+ pkt = aom_codec_get_cx_data(&enc, &iter);
+ EXPECT_EQ(pkt, nullptr);
+
+ EXPECT_EQ(AOM_CODEC_OK, aom_codec_destroy(&enc));
+}
+
+#if CONFIG_AV1_HIGHBITDEPTH
+// 10-bit version of the DeltaqMode3MultiThread test.
+TEST(DeltaqModeTest, DeltaqMode3MultiThreadHighbd) {
+ constexpr int kWidth = 1280;
+ constexpr int kHeight = 720;
+ // Dummy buffer of 10-bit neutral gray samples.
+ constexpr size_t kBufferSize = kWidth * kHeight + kWidth * kHeight / 2;
+ std::vector<uint16_t> buffer(kBufferSize, 512);
+
+ aom_image_t img;
+ EXPECT_EQ(&img,
+ aom_img_wrap(&img, AOM_IMG_FMT_I42016, kWidth, kHeight, 1,
+ reinterpret_cast<unsigned char *>(buffer.data())));
+
+ aom_codec_iface_t *iface = aom_codec_av1_cx();
+ aom_codec_enc_cfg_t cfg;
+ EXPECT_EQ(aom_codec_enc_config_default(iface, &cfg, AOM_USAGE_GOOD_QUALITY),
+ AOM_CODEC_OK);
+ cfg.g_w = kWidth;
+ cfg.g_h = kHeight;
+ cfg.g_threads = 10;
+ cfg.rc_end_usage = AOM_Q;
+ cfg.g_profile = 0;
+ cfg.g_bit_depth = AOM_BITS_10;
+ cfg.g_input_bit_depth = 10;
+ cfg.g_lag_in_frames = 0;
+ cfg.rc_min_quantizer = 0;
+ cfg.rc_max_quantizer = 63;
+ cfg.g_pass = AOM_RC_ONE_PASS;
+ cfg.g_limit = 1;
+ aom_codec_ctx_t enc;
+ EXPECT_EQ(aom_codec_enc_init(&enc, iface, &cfg, AOM_CODEC_USE_HIGHBITDEPTH),
+ AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AOME_SET_CPUUSED, 6), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AOME_SET_CQ_LEVEL, 14), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_DELTAQ_MODE, 3), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_set_option(&enc, "passes", "1"), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_COLOR_RANGE, AOM_CR_STUDIO_RANGE),
+ AOM_CODEC_OK);
+
+ EXPECT_EQ(aom_codec_encode(&enc, &img, 0, 1, 0), AOM_CODEC_OK);
+ aom_codec_iter_t iter = nullptr;
+ const aom_codec_cx_pkt_t *pkt = aom_codec_get_cx_data(&enc, &iter);
+ ASSERT_NE(pkt, nullptr);
+ EXPECT_EQ(pkt->kind, AOM_CODEC_CX_FRAME_PKT);
+ // pkt->data.frame.flags is 0x1f0011.
+ EXPECT_EQ(pkt->data.frame.flags & AOM_FRAME_IS_KEY, AOM_FRAME_IS_KEY);
+ pkt = aom_codec_get_cx_data(&enc, &iter);
+ EXPECT_EQ(pkt, nullptr);
+
+ // Flush encoder
+ EXPECT_EQ(AOM_CODEC_OK, aom_codec_encode(&enc, nullptr, 0, 1, 0));
+ iter = nullptr;
+ pkt = aom_codec_get_cx_data(&enc, &iter);
+ EXPECT_EQ(pkt, nullptr);
+
+ EXPECT_EQ(AOM_CODEC_OK, aom_codec_destroy(&enc));
+}
+#endif // CONFIG_AV1_HIGHBITDEPTH
+
} // namespace
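Note: both new tests size their raw input as kWidth * kHeight +
kWidth * kHeight / 2 samples. That is the I420 (4:2:0) layout handed to
aom_img_wrap: one full-resolution luma plane plus two chroma planes
subsampled by two in each direction. A quick sketch of the arithmetic (the
helper name is illustrative):

    #include <cstddef>

    // 4:2:0 sample count: Y is w*h, U and V are (w/2)*(h/2) each.
    // For 1280x720: 921600 + 2 * 230400 = 1382400 = w*h + w*h/2.
    // The 10-bit variant stores the same sample count as uint16_t.
    size_t I420SampleCount(int w, int h) {
      return (size_t)w * h + 2 * ((size_t)(w / 2) * (h / 2));
    }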
diff --git a/test/dropframe_encode_test.cc b/test/dropframe_encode_test.cc
new file mode 100644
index 0000000..c7a801b
--- /dev/null
+++ b/test/dropframe_encode_test.cc
@@ -0,0 +1,62 @@
+/*
+ * Copyright (c) 2023, Alliance for Open Media. All rights reserved
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#include "test/codec_factory.h"
+#include "test/encode_test_driver.h"
+#include "test/i420_video_source.h"
+#include "test/util.h"
+
+namespace {
+
+// Params: test mode, threads.
+class DropFrameEncodeTestLarge
+ : public ::libaom_test::CodecTestWith2Params<libaom_test::TestMode,
+ unsigned int>,
+ public ::libaom_test::EncoderTest {
+ protected:
+ DropFrameEncodeTestLarge()
+ : EncoderTest(GET_PARAM(0)), frame_number_(0), threads_(GET_PARAM(2)) {}
+
+ virtual void SetUp() { InitializeConfig(GET_PARAM(1)); }
+
+ virtual void PreEncodeFrameHook(::libaom_test::VideoSource *video,
+ ::libaom_test::Encoder *encoder) {
+ frame_number_ = video->frame();
+ if (frame_number_ == 0) {
+ encoder->Control(AOME_SET_CPUUSED, 1);
+ }
+ }
+
+ unsigned int frame_number_;
+ unsigned int threads_;
+};
+
+// Test to reproduce the assertion failure related to buf->display_idx in
+// init_gop_frames_for_tpl() and the segmentation fault reported in
+// aomedia:3372 while encoding with --drop-frame=1.
+TEST_P(DropFrameEncodeTestLarge, TestNoMisMatch) {
+ cfg_.rc_end_usage = AOM_CBR;
+ cfg_.rc_buf_sz = 1;
+ cfg_.g_pass = AOM_RC_ONE_PASS;
+ cfg_.rc_dropframe_thresh = 1;
+ cfg_.g_threads = threads_;
+
+ ::libaom_test::I420VideoSource video("desktopqvga2.320_240.yuv", 320, 240, 30,
+ 1, 0, 100);
+
+ ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
+}
+
+AV1_INSTANTIATE_TEST_SUITE(DropFrameEncodeTestLarge,
+ ::testing::Values(::libaom_test::kOnePassGood),
+ ::testing::Values(1, 4));
+
+} // namespace
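Note: frame dropping in this test is driven entirely by ordinary config
fields; a minimal sketch with the real aom_codec_enc_cfg_t field names
(values mirror the test and are deliberately extreme to force drops):

    aom_codec_enc_cfg_t cfg;
    aom_codec_enc_config_default(aom_codec_av1_cx(), &cfg,
                                 AOM_USAGE_GOOD_QUALITY);
    cfg.rc_end_usage = AOM_CBR;   // frame dropping is a CBR mechanism
    cfg.rc_dropframe_thresh = 1;  // drop threshold, as a percent of the buffer
    cfg.rc_buf_sz = 1;            // 1 ms decoder buffer, so drops happen often
    // A dropped frame simply produces no output packet for that input frame.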
diff --git a/test/ducky_encode_test.cc b/test/ducky_encode_test.cc
deleted file mode 100644
index 7bbdc88..0000000
--- a/test/ducky_encode_test.cc
+++ /dev/null
@@ -1,193 +0,0 @@
-/*
- * Copyright (c) 2022, Alliance for Open Media. All rights reserved
- *
- * This source code is subject to the terms of the BSD 2 Clause License and
- * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
- * was not distributed with this source code in the LICENSE file, you can
- * obtain it at www.aomedia.org/license/software. If the Alliance for Open
- * Media Patent License 1.0 was not distributed with this source code in the
- * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
- */
-
-#include <array>
-#include <algorithm>
-#include <cerrno>
-#include <cstring>
-#include <fstream>
-#include <memory>
-#include <numeric>
-#include <string>
-#include <vector>
-
-#include "av1/encoder/encoder.h"
-#include "av1/qmode_rc/ducky_encode.h"
-#include "av1/qmode_rc/ratectrl_qmode.h"
-#include "av1/qmode_rc/ratectrl_qmode_interface.h"
-#include "test/video_source.h"
-#include "third_party/googletest/src/googlemock/include/gmock/gmock.h"
-#include "third_party/googletest/src/googletest/include/gtest/gtest.h"
-
-namespace aom {
-
-constexpr int kMaxRefFrames = 7;
-
-TEST(DuckyEncodeTest, ComputeFirstPassStats) {
- aom_rational_t frame_rate = { 30, 1 };
- VideoInfo video_info = { 352, 288,
- frame_rate, AOM_IMG_FMT_I420,
- 1, "bus_352x288_420_f20_b8.yuv" };
- video_info.file_path =
- libaom_test::GetDataPath() + "/" + video_info.file_path;
- DuckyEncode ducky_encode(video_info, BLOCK_64X64, kMaxRefFrames, 3, 128);
- std::vector<FIRSTPASS_STATS> frame_stats =
- ducky_encode.ComputeFirstPassStats();
- EXPECT_EQ(frame_stats.size(), static_cast<size_t>(video_info.frame_count));
- for (size_t i = 0; i < frame_stats.size(); ++i) {
- // FIRSTPASS_STATS's first element is frame
- EXPECT_EQ(frame_stats[i].frame, i);
- }
-}
-
-TEST(DuckyEncodeTest, EncodeFrame) {
- aom_rational_t frame_rate = { 30, 1 };
- VideoInfo video_info = { 352, 288,
- frame_rate, AOM_IMG_FMT_I420,
- 17, "bus_352x288_420_f20_b8.yuv" };
- video_info.file_path =
- libaom_test::GetDataPath() + "/" + video_info.file_path;
- DuckyEncode ducky_encode(video_info, BLOCK_64X64, kMaxRefFrames, 3, 128);
- std::vector<FIRSTPASS_STATS> frame_stats =
- ducky_encode.ComputeFirstPassStats();
- ducky_encode.StartEncode(frame_stats);
- // We set coding_frame_count to a arbitrary number that smaller than
- // 17 here.
- // TODO(angiebird): Set coding_frame_count properly, once the DuckyEncode can
- // provide proper information.
- int coding_frame_count = 5;
- EncodeFrameDecision decision = { aom::EncodeFrameMode::kNone,
- aom::EncodeGopMode::kNone,
- {} };
- for (int i = 0; i < coding_frame_count; ++i) {
- ducky_encode.AllocateBitstreamBuffer(video_info);
- EncodeFrameResult encode_frame_result = ducky_encode.EncodeFrame(decision);
- }
- ducky_encode.EndEncode();
-}
-
-TEST(DuckyEncodeTest, EncodeFrameWithQindex) {
- aom_rational_t frame_rate = { 30, 1 };
- VideoInfo video_info = { 352, 288,
- frame_rate, AOM_IMG_FMT_I420,
- 17, "bus_352x288_420_f20_b8.yuv" };
- video_info.file_path =
- libaom_test::GetDataPath() + "/" + video_info.file_path;
- DuckyEncode ducky_encode(video_info, BLOCK_64X64, kMaxRefFrames, 3, 128);
- std::vector<FIRSTPASS_STATS> frame_stats =
- ducky_encode.ComputeFirstPassStats();
- ducky_encode.StartEncode(frame_stats);
- // We set coding_frame_count to a arbitrary number that smaller than
- // 17 here.
- // TODO(angiebird): Set coding_frame_count properly, once the DuckyEncode can
- // provide proper information.
- int coding_frame_count = 5;
- int q_index = 0;
- EncodeFrameDecision decision = { aom::EncodeFrameMode::kQindex,
- aom::EncodeGopMode::kNone,
- { q_index, -1, {}, {} } };
- for (int i = 0; i < coding_frame_count; ++i) {
- ducky_encode.AllocateBitstreamBuffer(video_info);
- EncodeFrameResult encode_frame_result = ducky_encode.EncodeFrame(decision);
- EXPECT_EQ(encode_frame_result.dist, 0);
- }
- ducky_encode.EndEncode();
-}
-
-TEST(DuckyEncodeRCTest, EncodeVideoWithRC) {
- aom_rational_t frame_rate = { 30, 1 };
- const int frame_number = 35;
- const int frame_width = 352;
- const int frame_height = 288;
- VideoInfo video_info = { frame_width, frame_height,
- frame_rate, AOM_IMG_FMT_I420,
- frame_number, "bus_352x288_420_f20_b8.yuv" };
- video_info.file_path =
- libaom_test::GetDataPath() + "/" + video_info.file_path;
- DuckyEncode ducky_encode(video_info, BLOCK_64X64, kMaxRefFrames, 3, 128);
-
- AV1RateControlQMode qmode_rc;
- RateControlParam rc_param = {};
- rc_param.max_gop_show_frame_count = 16;
- rc_param.min_gop_show_frame_count = 4;
- rc_param.ref_frame_table_size = 5;
- rc_param.max_ref_frames = 3;
- rc_param.base_q_index = 45;
- rc_param.max_distinct_q_indices_per_frame = 8;
- rc_param.max_distinct_lambda_scales_per_frame = 1;
- rc_param.frame_width = frame_width;
- rc_param.frame_height = frame_height;
- rc_param.tpl_pass_count = TplPassCount::kOneTplPass;
- rc_param.tpl_pass_index = 0;
- const Status status = qmode_rc.SetRcParam(rc_param);
- ASSERT_TRUE(status.ok());
- FirstpassInfo firstpass_info;
- firstpass_info.stats_list = ducky_encode.ComputeFirstPassStats();
- constexpr int kBlockSize = 16;
- firstpass_info.num_mbs_16x16 = ((frame_width + kBlockSize - 1) / kBlockSize) *
- ((frame_height + kBlockSize - 1) / kBlockSize);
- const auto gop_info = qmode_rc.DetermineGopInfo(firstpass_info);
- ASSERT_TRUE(gop_info.status().ok());
- const GopStructList &gop_list = gop_info.value();
-
- std::vector<aom::GopEncodeInfo> tpl_pass_gop_encode_info_list;
- std::vector<aom::TplGopStats> tpl_gop_stats_list;
- for (const auto &gop_struct : gop_list) {
- const auto gop_encode_info =
- qmode_rc.GetTplPassGopEncodeInfo(gop_struct, firstpass_info);
- ASSERT_TRUE(gop_encode_info.status().ok());
- tpl_pass_gop_encode_info_list.push_back(std::move(*gop_encode_info));
- }
-
- tpl_gop_stats_list = ducky_encode.ComputeTplStats(
- firstpass_info.stats_list, gop_list, tpl_pass_gop_encode_info_list);
-
- std::vector<aom::GopEncodeInfo> final_pass_gop_encode_info_list;
- aom::RefFrameTable ref_frame_table;
- for (size_t i = 0; i < gop_list.size(); ++i) {
- const aom::GopStruct &gop_struct = gop_list[i];
- const aom::TplGopStats &tpl_gop_stats = tpl_gop_stats_list[i];
- std::vector<aom::LookaheadStats> lookahead_stats = {};
- for (size_t lookahead_index = 1;
- lookahead_index <= 1 && i + lookahead_index < gop_list.size();
- ++lookahead_index) {
- lookahead_stats.push_back({ &gop_list[i + lookahead_index],
- &tpl_gop_stats_list[i + lookahead_index] });
- }
- const auto gop_encode_info =
- qmode_rc.GetGopEncodeInfo(gop_struct, tpl_gop_stats, lookahead_stats,
- firstpass_info, ref_frame_table);
- ASSERT_TRUE(gop_encode_info.status().ok());
- ref_frame_table = gop_encode_info.value().final_snapshot;
- final_pass_gop_encode_info_list.push_back(std::move(*gop_encode_info));
- }
-
- ducky_encode.StartEncode(firstpass_info.stats_list);
- std::vector<aom::EncodeFrameResult> encoded_frames_list =
- ducky_encode.EncodeVideo(gop_list, final_pass_gop_encode_info_list);
- ducky_encode.EndEncode();
-
- EXPECT_THAT(encoded_frames_list,
- testing::Each(testing::Field(
- "psnr", &aom::EncodeFrameResult::psnr, testing::Gt(37))));
-}
-
-TEST(DuckyEncodeTest, EncodeFrameMode) {
- EXPECT_EQ(DUCKY_ENCODE_FRAME_MODE_NONE,
- static_cast<DUCKY_ENCODE_FRAME_MODE>(EncodeFrameMode::kNone));
- EXPECT_EQ(DUCKY_ENCODE_FRAME_MODE_QINDEX,
- static_cast<DUCKY_ENCODE_FRAME_MODE>(EncodeFrameMode::kQindex));
- EXPECT_EQ(
- DUCKY_ENCODE_FRAME_MODE_QINDEX_RDMULT,
- static_cast<DUCKY_ENCODE_FRAME_MODE>(EncodeFrameMode::kQindexRdmult));
-}
-
-} // namespace aom
diff --git a/test/ec_test.cc b/test/ec_test.cc
index c4b88e3..e0555b4 100644
--- a/test/ec_test.cc
+++ b/test/ec_test.cc
@@ -93,11 +93,8 @@
int dec_method;
unsigned int sym = data[j] + 1; // Initialize sym to an invalid value.
- if (CDF_SHIFT == 0) {
- dec_method = 3 + (rand() & 1);
- } else {
- dec_method = enc_method[j];
- }
+ dec_method = 3 + (rand() & 1);
+
switch (dec_method) {
case 3: {
sym = od_ec_decode_bool_q15(
@@ -128,30 +125,28 @@
}
}
od_ec_enc_reset(&enc);
- if (CDF_SHIFT == 0) {
- od_ec_encode_bool_q15(&enc, 0, OD_ICDF(16384));
- od_ec_encode_bool_q15(&enc, 0, OD_ICDF(16384));
- od_ec_encode_bool_q15(&enc, 0, OD_ICDF(16384));
- od_ec_encode_bool_q15(&enc, 0, OD_ICDF(16384));
- od_ec_encode_bool_q15(&enc, 0, OD_ICDF(24576));
- od_ec_enc_patch_initial_bits(&enc, 3, 2);
- EXPECT_FALSE(enc.error) << "od_ec_enc_patch_initial_bits() failed.\n";
- od_ec_enc_patch_initial_bits(&enc, 0, 5);
- EXPECT_TRUE(enc.error)
- << "od_ec_enc_patch_initial_bits() didn't fail when it should have.\n";
- od_ec_enc_reset(&enc);
- od_ec_encode_bool_q15(&enc, 0, OD_ICDF(16384));
- od_ec_encode_bool_q15(&enc, 0, OD_ICDF(16384));
- od_ec_encode_bool_q15(&enc, 1, OD_ICDF(32256));
- od_ec_encode_bool_q15(&enc, 0, OD_ICDF(24576));
- od_ec_enc_patch_initial_bits(&enc, 0, 2);
- EXPECT_FALSE(enc.error) << "od_ec_enc_patch_initial_bits() failed.\n";
- ptr = od_ec_enc_done(&enc, &ptr_sz);
- EXPECT_EQ(ptr_sz, 2u);
- EXPECT_EQ(ptr[0], 63)
- << "Got " << ptr[0]
- << " when expecting 63 for od_ec_enc_patch_initial_bits().\n";
- }
+ od_ec_encode_bool_q15(&enc, 0, OD_ICDF(16384));
+ od_ec_encode_bool_q15(&enc, 0, OD_ICDF(16384));
+ od_ec_encode_bool_q15(&enc, 0, OD_ICDF(16384));
+ od_ec_encode_bool_q15(&enc, 0, OD_ICDF(16384));
+ od_ec_encode_bool_q15(&enc, 0, OD_ICDF(24576));
+ od_ec_enc_patch_initial_bits(&enc, 3, 2);
+ EXPECT_FALSE(enc.error) << "od_ec_enc_patch_initial_bits() failed.\n";
+ od_ec_enc_patch_initial_bits(&enc, 0, 5);
+ EXPECT_TRUE(enc.error)
+ << "od_ec_enc_patch_initial_bits() didn't fail when it should have.\n";
+ od_ec_enc_reset(&enc);
+ od_ec_encode_bool_q15(&enc, 0, OD_ICDF(16384));
+ od_ec_encode_bool_q15(&enc, 0, OD_ICDF(16384));
+ od_ec_encode_bool_q15(&enc, 1, OD_ICDF(32256));
+ od_ec_encode_bool_q15(&enc, 0, OD_ICDF(24576));
+ od_ec_enc_patch_initial_bits(&enc, 0, 2);
+ EXPECT_FALSE(enc.error) << "od_ec_enc_patch_initial_bits() failed.\n";
+ ptr = od_ec_enc_done(&enc, &ptr_sz);
+ EXPECT_EQ(ptr_sz, 2u);
+ EXPECT_EQ(ptr[0], 63)
+ << "Got " << ptr[0]
+ << " when expecting 63 for od_ec_enc_patch_initial_bits().\n";
od_ec_enc_clear(&enc);
EXPECT_EQ(ret, 0);
}
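Note: the q15 bool API used throughout this test takes an inverse CDF in
Q15: OD_ICDF(16384) describes a 16384/32768 = 1/2 split, so either symbol is
equally likely. A minimal encode/decode round trip under that distribution,
as a sketch assuming an in-tree build where the internal entenc.h/entdec.h
headers are available:

    #include <stdint.h>
    #include "aom_dsp/entdec.h"
    #include "aom_dsp/entenc.h"

    static int BoolRoundTrip(void) {
      od_ec_enc enc;
      od_ec_enc_init(&enc, 1000);                      // 1000-byte buffer
      od_ec_encode_bool_q15(&enc, 0, OD_ICDF(16384));  // encode a 0 at p = 1/2
      uint32_t nbytes;
      unsigned char *buf = od_ec_enc_done(&enc, &nbytes);
      od_ec_dec dec;
      od_ec_dec_init(&dec, buf, nbytes);
      const int bit = od_ec_decode_bool_q15(&dec, OD_ICDF(16384));
      od_ec_enc_clear(&enc);  // frees the encoder's buffer
      return bit;             // expected: 0
    }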
diff --git a/test/encode_api_test.cc b/test/encode_api_test.cc
index 8303880..470bd06 100644
--- a/test/encode_api_test.cc
+++ b/test/encode_api_test.cc
@@ -106,6 +106,30 @@
EXPECT_EQ(aom_codec_destroy(&enc), AOM_CODEC_OK);
}
+TEST(EncodeAPI, MonochromeInProfiles) {
+ aom_codec_iface_t *iface = aom_codec_av1_cx();
+ aom_codec_enc_cfg_t cfg;
+ ASSERT_EQ(AOM_CODEC_OK, aom_codec_enc_config_default(iface, &cfg, kUsage));
+ cfg.g_w = 128;
+ cfg.g_h = 128;
+ cfg.monochrome = 1;
+ aom_codec_ctx_t enc;
+
+ // Test Profile 0
+ cfg.g_profile = 0;
+ ASSERT_EQ(AOM_CODEC_OK, aom_codec_enc_init(&enc, iface, &cfg, 0));
+ EXPECT_EQ(AOM_CODEC_OK, aom_codec_destroy(&enc));
+
+ // Test Profile 1
+ cfg.g_profile = 1;
+ ASSERT_EQ(AOM_CODEC_INVALID_PARAM, aom_codec_enc_init(&enc, iface, &cfg, 0));
+
+ // Test Profile 2
+ cfg.g_profile = 2;
+ ASSERT_EQ(AOM_CODEC_OK, aom_codec_enc_init(&enc, iface, &cfg, 0));
+ EXPECT_EQ(AOM_CODEC_OK, aom_codec_destroy(&enc));
+}
+
#if !CONFIG_REALTIME_ONLY
TEST(EncodeAPI, AllIntraMode) {
aom_codec_iface_t *iface = aom_codec_av1_cx();
diff --git a/test/encodetxb_test.cc b/test/encodetxb_test.cc
index c1b6709..0a58737 100644
--- a/test/encodetxb_test.cc
+++ b/test/encodetxb_test.cc
@@ -66,17 +66,17 @@
for (int tx_type = DCT_DCT; tx_type < TX_TYPES; ++tx_type) {
const TX_CLASS tx_class = tx_type_to_class[tx_type];
for (int tx_size = TX_4X4; tx_size < TX_SIZES_ALL; ++tx_size) {
- const int bwl = get_txb_bwl((TX_SIZE)tx_size);
+ const int bhl = get_txb_bhl((TX_SIZE)tx_size);
const int width = get_txb_wide((TX_SIZE)tx_size);
const int height = get_txb_high((TX_SIZE)tx_size);
const int real_width = tx_size_wide[tx_size];
const int real_height = tx_size_high[tx_size];
const int16_t *const scan = av1_scan_orders[tx_size][tx_type].scan;
- levels_ = set_levels(levels_buf_, width);
+ levels_ = set_levels(levels_buf_, height);
for (int i = 0; i < kNumTests && !result; ++i) {
for (int eob = 1; eob <= width * height && !result; ++eob) {
- InitDataWithEob(scan, bwl, eob);
+ InitDataWithEob(scan, bhl, eob);
av1_get_nz_map_contexts_c(levels_, scan, eob, (TX_SIZE)tx_size,
tx_class, coeff_contexts_ref_);
@@ -86,7 +86,7 @@
result = Compare(scan, eob);
EXPECT_EQ(result, 0)
- << " tx_class " << tx_class << " width " << real_width
+ << " tx_class " << (int)tx_class << " width " << real_width
<< " height " << real_height << " eob " << eob;
}
}
@@ -102,7 +102,7 @@
printf("Note: Only test the largest possible eob case!\n");
for (int tx_size = TX_4X4; tx_size < TX_SIZES_ALL; ++tx_size) {
- const int bwl = get_txb_bwl((TX_SIZE)tx_size);
+ const int bhl = get_txb_bhl((TX_SIZE)tx_size);
const int width = get_txb_wide((TX_SIZE)tx_size);
const int height = get_txb_high((TX_SIZE)tx_size);
const int real_width = tx_size_wide[tx_size];
@@ -113,8 +113,8 @@
const int eob = width * height;
const int numTests = kNumTests / (width * height);
- levels_ = set_levels(levels_buf_, width);
- InitDataWithEob(scan, bwl, eob);
+ levels_ = set_levels(levels_buf_, height);
+ InitDataWithEob(scan, bhl, eob);
aom_usec_timer_start(&timer_ref);
for (int i = 0; i < numTests; ++i) {
@@ -123,8 +123,8 @@
}
aom_usec_timer_mark(&timer_ref);
- levels_ = set_levels(levels_buf_, width);
- InitDataWithEob(scan, bwl, eob);
+ levels_ = set_levels(levels_buf_, height);
+ InitDataWithEob(scan, bhl, eob);
aom_usec_timer_start(&timer);
for (int i = 0; i < numTests; ++i) {
@@ -145,13 +145,13 @@
}
private:
- void InitDataWithEob(const int16_t *const scan, const int bwl,
+ void InitDataWithEob(const int16_t *const scan, const int bhl,
const int eob) {
memset(levels_buf_, 0, sizeof(levels_buf_));
memset(coeff_contexts_, 0, sizeof(*coeff_contexts_) * MAX_TX_SQUARE);
for (int c = 0; c < eob; ++c) {
- levels_[get_padded_idx(scan[c], bwl)] =
+ levels_[get_padded_idx(scan[c], bhl)] =
static_cast<uint8_t>(clamp(rnd_.Rand8(), 0, INT8_MAX));
coeff_contexts_[scan[c]] = static_cast<int8_t>(rnd_.Rand16() >> 1);
}
@@ -224,8 +224,8 @@
tran_low_t coeff[MAX_TX_SQUARE];
uint8_t levels_buf[2][TX_PAD_2D];
- uint8_t *const levels0 = set_levels(levels_buf[0], width);
- uint8_t *const levels1 = set_levels(levels_buf[1], width);
+ uint8_t *const levels0 = set_levels(levels_buf[0], height);
+ uint8_t *const levels1 = set_levels(levels_buf[1], height);
ACMRandom rnd(ACMRandom::DeterministicSeed());
for (int i = 0; i < width * height; i++) {
diff --git a/test/error_block_test.cc b/test/error_block_test.cc
index a6b442f..aadbb44 100644
--- a/test/error_block_test.cc
+++ b/test/error_block_test.cc
@@ -190,11 +190,10 @@
int64_t ssz;
int num_iters = 100000;
int64_t ref_ssz;
- int k;
const int msb = bit_depth_ + 8 - 1;
for (int i = 0; i < 9; ++i) {
block_size = 16 << (i % 9); // All block sizes from 4x4, 8x4 ..64x64
- for (k = 0; k < 9; k++) {
+ for (int k = 0; k < 9; k++) {
for (int j = 0; j < block_size; j++) {
if (k < 5) {
if (rnd(2)) {
@@ -221,7 +220,7 @@
aom_usec_timer ref_timer, test_timer;
aom_usec_timer_start(&ref_timer);
- for (int i = 0; i < num_iters; ++i) {
+ for (int iter = 0; iter < num_iters; ++iter) {
ref_error_block_op_(coeff, dqcoeff, block_size, &ref_ssz, bit_depth_);
}
aom_usec_timer_mark(&ref_timer);
@@ -229,7 +228,7 @@
static_cast<int>(aom_usec_timer_elapsed(&ref_timer));
aom_usec_timer_start(&test_timer);
- for (int i = 0; i < num_iters; ++i) {
+ for (int iter = 0; iter < num_iters; ++iter) {
error_block_op_(coeff, dqcoeff, block_size, &ssz, bit_depth_);
}
aom_usec_timer_mark(&test_timer);
diff --git a/test/ethread_test.cc b/test/ethread_test.cc
index 8e1d750..6b7fcce 100644
--- a/test/ethread_test.cc
+++ b/test/ethread_test.cc
@@ -261,6 +261,16 @@
encoder->Control(AOME_SET_ARNR_STRENGTH, 5);
encoder->Control(AV1E_SET_FRAME_PARALLEL_DECODING, 0);
encoder->Control(AV1E_SET_MAX_GF_INTERVAL, 4);
+ // In the row_mt_=0 case, single-thread (1 thread) output is compared
+ // with multi-thread (4 threads) output (as per line no:340).
+ // Currently, the loop restoration stage is conditionally disabled for
+ // speeds 5 and 6 when num_workers > 1. Because of this, the single-thread
+ // and multi-thread outputs cannot match, so this case alone is tested
+ // with LR disabled.
+ // TODO(aomedia:3446): Remove the constraint on this test case once the
+ // loop restoration state is the same in both the single-thread and
+ // multi-thread paths.
+ if (set_cpu_used_ >= 5 && row_mt_ == 0)
+ encoder->Control(AV1E_SET_ENABLE_RESTORATION, 0);
} else if (encoding_mode_ == ::libaom_test::kRealTime) {
encoder->Control(AOME_SET_ENABLEAUTOALTREF, 0);
encoder->Control(AV1E_SET_AQ_MODE, 3);
diff --git a/test/fft_test.cc b/test/fft_test.cc
index 7fce0f8..5443c99 100644
--- a/test/fft_test.cc
+++ b/test/fft_test.cc
@@ -82,7 +82,8 @@
};
std::ostream &operator<<(std::ostream &os, const FFTTestArg &test_arg) {
- return os << "fft_arg { n:" << test_arg.n << " fft:" << test_arg.fft << " }";
+ return os << "fft_arg { n:" << test_arg.n
+ << " fft:" << reinterpret_cast<const void *>(test_arg.fft) << " }";
}
class FFT2DTest : public ::testing::TestWithParam<FFTTestArg> {
@@ -146,7 +147,7 @@
FFTTestArg(16, aom_fft16x16_float_c),
FFTTestArg(32,
aom_fft32x32_float_c)));
-#if ARCH_X86 || ARCH_X86_64
+#if AOM_ARCH_X86 || AOM_ARCH_X86_64
#if HAVE_SSE2
INSTANTIATE_TEST_SUITE_P(
SSE2, FFT2DTest,
@@ -162,7 +163,7 @@
FFTTestArg(16, aom_fft16x16_float_avx2),
FFTTestArg(32, aom_fft32x32_float_avx2)));
#endif // HAVE_AVX2
-#endif // ARCH_X86 || ARCH_X86_64
+#endif // AOM_ARCH_X86 || AOM_ARCH_X86_64
struct IFFTTestArg {
int n;
@@ -171,8 +172,8 @@
};
std::ostream &operator<<(std::ostream &os, const IFFTTestArg &test_arg) {
- return os << "ifft_arg { n:" << test_arg.n << " fft:" << test_arg.ifft
- << " }";
+ return os << "ifft_arg { n:" << test_arg.n
+ << " fft:" << reinterpret_cast<const void *>(test_arg.ifft) << " }";
}
class IFFT2DTest : public ::testing::TestWithParam<IFFTTestArg> {
@@ -245,7 +246,7 @@
IFFTTestArg(8, aom_ifft8x8_float_c),
IFFTTestArg(16, aom_ifft16x16_float_c),
IFFTTestArg(32, aom_ifft32x32_float_c)));
-#if ARCH_X86 || ARCH_X86_64
+#if AOM_ARCH_X86 || AOM_ARCH_X86_64
#if HAVE_SSE2
INSTANTIATE_TEST_SUITE_P(
SSE2, IFFT2DTest,
@@ -262,6 +263,6 @@
IFFTTestArg(16, aom_ifft16x16_float_avx2),
IFFTTestArg(32, aom_ifft32x32_float_avx2)));
#endif // HAVE_AVX2
-#endif // ARCH_X86 || ARCH_X86_64
+#endif // AOM_ARCH_X86 || AOM_ARCH_X86_64
} // namespace
diff --git a/test/film_grain_table_test.cc b/test/film_grain_table_test.cc
index bf63903..f8937f1 100644
--- a/test/film_grain_table_test.cc
+++ b/test/film_grain_table_test.cc
@@ -14,6 +14,10 @@
#include "aom_dsp/grain_table.h"
#include "aom/internal/aom_codec_internal.h"
#include "av1/encoder/grain_test_vectors.h"
+#include "test/codec_factory.h"
+#include "test/encode_test_driver.h"
+#include "test/i420_video_source.h"
+#include "test/util.h"
#include "test/video_source.h"
void grain_equal(const aom_film_grain_t *expected,
@@ -267,3 +271,66 @@
EXPECT_EQ(0, remove(grain_file.c_str()));
}
+
+const ::libaom_test::TestMode kFilmGrainEncodeTestModes[] = {
+ ::libaom_test::kRealTime,
+#if !CONFIG_REALTIME_ONLY
+ ::libaom_test::kOnePassGood
+#endif
+};
+
+class FilmGrainEncodeTest
+ : public ::libaom_test::CodecTestWith2Params<bool, ::libaom_test::TestMode>,
+ public ::libaom_test::EncoderTest {
+ protected:
+ FilmGrainEncodeTest()
+ : EncoderTest(GET_PARAM(0)), test_monochrome_(GET_PARAM(1)),
+ test_mode_(GET_PARAM(2)) {}
+ ~FilmGrainEncodeTest() override = default;
+
+ void SetUp() override {
+ InitializeConfig(test_mode_);
+ cfg_.monochrome = test_monochrome_;
+ cfg_.rc_target_bitrate = 300;
+ cfg_.kf_max_dist = 0;
+ }
+
+ void PreEncodeFrameHook(::libaom_test::VideoSource *video,
+ ::libaom_test::Encoder *encoder) override {
+ if (video->frame() == 0) {
+ encoder->Control(AOME_SET_CPUUSED, 5);
+ encoder->Control(AV1E_SET_TUNE_CONTENT, AOM_CONTENT_FILM);
+ encoder->Control(AV1E_SET_DENOISE_NOISE_LEVEL, 1);
+ } else if (video->frame() == 1) {
+ cfg_.monochrome = 0;
+ encoder->Config(&cfg_);
+ } else {
+ cfg_.monochrome = test_monochrome_;
+ encoder->Config(&cfg_);
+ }
+ }
+
+ bool DoDecode() const override { return false; }
+
+ void DoTest() {
+ if (test_monochrome_ && test_mode_ == ::libaom_test::kRealTime) {
+ // TODO(bohanli): Running real-time mode with monochrome causes the
+ // encoder to crash. Check whether this is intended or a bug.
+ GTEST_SKIP();
+ }
+ ::libaom_test::I420VideoSource video("hantro_collage_w352h288.yuv", 352,
+ 288, 30, 1, 0, 3);
+ cfg_.g_w = video.img()->d_w;
+ cfg_.g_h = video.img()->d_h;
+ ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
+ }
+
+ private:
+ bool test_monochrome_;
+ ::libaom_test::TestMode test_mode_;
+};
+
+TEST_P(FilmGrainEncodeTest, Test) { DoTest(); }
+
+AV1_INSTANTIATE_TEST_SUITE(FilmGrainEncodeTest, ::testing::Bool(),
+ ::testing::ValuesIn(kFilmGrainEncodeTestModes));
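Note: the controls this suite layers on top of the usual encode loop can be
set in isolation; a hedged sketch assuming an initialized aom_codec_ctx_t
enc (the control names are the real ones used above):

    // Tune for film content and enable grain estimation/denoising; a
    // denoise level > 0 turns the feature on (requires CONFIG_DENOISE).
    aom_codec_control(&enc, AV1E_SET_TUNE_CONTENT, AOM_CONTENT_FILM);
    aom_codec_control(&enc, AV1E_SET_DENOISE_NOISE_LEVEL, 1);
    // Monochrome is a config field rather than a control; the test toggles
    // it between frames via encoder->Config(&cfg_), i.e.
    // aom_codec_enc_config_set().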
diff --git a/test/firstpass_test.cc b/test/firstpass_test.cc
index 718fdab..1f4f3b7 100644
--- a/test/firstpass_test.cc
+++ b/test/firstpass_test.cc
@@ -76,11 +76,13 @@
EXPECT_EQ(firstpass_info.stats_count, FIRSTPASS_INFO_STATIC_BUF_SIZE);
EXPECT_EQ(firstpass_info.stats_count, firstpass_info.stats_buf_size);
- // Push the stats when the queue is full.
- FIRSTPASS_STATS stats;
- av1_zero(stats);
- aom_codec_err_t ret = av1_firstpass_info_push(&firstpass_info, &stats);
- EXPECT_EQ(ret, AOM_CODEC_ERROR);
+ {
+ // Push the stats when the queue is full.
+ FIRSTPASS_STATS stats;
+ av1_zero(stats);
+ aom_codec_err_t ret = av1_firstpass_info_push(&firstpass_info, &stats);
+ EXPECT_EQ(ret, AOM_CODEC_ERROR);
+ }
}
TEST(FirstpassTest, FirstpassInfoTotalStats) {
@@ -110,9 +112,11 @@
EXPECT_EQ(ret, AOM_CODEC_OK);
}
EXPECT_EQ(firstpass_info.cur_index, firstpass_info.start_index);
- aom_codec_err_t ret = av1_firstpass_info_pop(&firstpass_info);
- // We cannot pop when cur_index == start_index
- EXPECT_EQ(ret, AOM_CODEC_ERROR);
+ {
+ aom_codec_err_t ret = av1_firstpass_info_pop(&firstpass_info);
+ // We cannot pop when cur_index == start_index
+ EXPECT_EQ(ret, AOM_CODEC_ERROR);
+ }
int ref_frame_cnt = 0;
const int move_count = FIRSTPASS_INFO_STATIC_BUF_SIZE * 2 / 3;
for (int i = 0; i < move_count; ++i) {
diff --git a/test/forced_max_frame_width_height_test.cc b/test/forced_max_frame_width_height_test.cc
index 98d96fb..2e019b6 100644
--- a/test/forced_max_frame_width_height_test.cc
+++ b/test/forced_max_frame_width_height_test.cc
@@ -15,7 +15,9 @@
// encode two frames of increasing sizes. The second aom_codec_encode() should
// not crash or have memory errors.
+#include <algorithm>
#include <memory>
+#include <vector>
#include "aom/aomcx.h"
#include "aom/aom_encoder.h"
@@ -89,6 +91,114 @@
RunTest(AOM_USAGE_GOOD_QUALITY, /*lag_in_frames=*/1, "ssim");
}
+void FillImageGradient(aom_image_t *image, int bit_depth) {
+ assert(image->range == AOM_CR_FULL_RANGE);
+ for (int plane = 0; plane < 3; plane++) {
+ const int plane_width = aom_img_plane_width(image, plane);
+ const int plane_height = aom_img_plane_height(image, plane);
+ unsigned char *row = image->planes[plane];
+ const int stride = image->stride[plane];
+ for (int y = 0; y < plane_height; ++y) {
+ for (int x = 0; x < plane_width; ++x) {
+ const int value = (x + y) * ((1 << bit_depth) - 1) /
+ std::max(1, plane_width + plane_height - 2);
+ assert(value >= 0 && value <= (1 << bit_depth) - 1);
+ if (bit_depth > 8) {
+ reinterpret_cast<uint16_t *>(row)[x] = static_cast<uint16_t>(value);
+ } else {
+ row[x] = static_cast<unsigned char>(value);
+ }
+ }
+ row += stride;
+ }
+ }
+}
+
+// A test that reproduces bug aomedia:3348: Assertion
+// `ms_params->ms_buffers.ref->stride == ms_params->search_sites->stride'
+// failed.
+TEST(EncodeForcedMaxFrameWidthHeight, DISABLED_DimensionDecreasing) {
+ constexpr int kWidth = 128;
+ constexpr int kHeight = 128;
+ constexpr size_t kBufferSize = 3 * kWidth * kHeight;
+ std::vector<unsigned char> buffer(kBufferSize);
+
+ aom_image_t img;
+ EXPECT_EQ(&img, aom_img_wrap(&img, AOM_IMG_FMT_I420, kWidth, kHeight, 1,
+ buffer.data()));
+ img.cp = AOM_CICP_CP_UNSPECIFIED;
+ img.tc = AOM_CICP_TC_UNSPECIFIED;
+ img.mc = AOM_CICP_MC_UNSPECIFIED;
+ img.range = AOM_CR_FULL_RANGE;
+ FillImageGradient(&img, 8);
+
+ aom_codec_iface_t *iface = aom_codec_av1_cx();
+ aom_codec_enc_cfg_t cfg;
+ EXPECT_EQ(AOM_CODEC_OK,
+ aom_codec_enc_config_default(iface, &cfg, AOM_USAGE_GOOD_QUALITY));
+ cfg.rc_end_usage = AOM_Q;
+ cfg.g_profile = 0;
+ cfg.g_bit_depth = AOM_BITS_8;
+ cfg.g_input_bit_depth = 8;
+ cfg.g_w = kWidth;
+ cfg.g_h = kHeight;
+ cfg.g_forced_max_frame_width = kWidth;
+ cfg.g_forced_max_frame_height = kHeight;
+ cfg.g_lag_in_frames = 1;
+ cfg.rc_min_quantizer = 20;
+ cfg.rc_max_quantizer = 40;
+ aom_codec_ctx_t enc;
+ EXPECT_EQ(AOM_CODEC_OK, aom_codec_enc_init(&enc, iface, &cfg, 0));
+ EXPECT_EQ(AOM_CODEC_OK, aom_codec_control(&enc, AOME_SET_CQ_LEVEL, 30));
+ EXPECT_EQ(AOM_CODEC_OK, aom_codec_control(&enc, AOME_SET_CPUUSED, 6));
+ EXPECT_EQ(AOM_CODEC_OK,
+ aom_codec_control(&enc, AV1E_SET_COLOR_RANGE, AOM_CR_FULL_RANGE));
+ EXPECT_EQ(AOM_CODEC_OK,
+ aom_codec_control(&enc, AOME_SET_TUNING, AOM_TUNE_SSIM));
+
+ // First frame
+ EXPECT_EQ(AOM_CODEC_OK, aom_codec_encode(&enc, &img, 0, 1, 0));
+ aom_codec_iter_t iter = nullptr;
+ const aom_codec_cx_pkt_t *pkt = aom_codec_get_cx_data(&enc, &iter);
+ ASSERT_NE(pkt, nullptr);
+ EXPECT_EQ(pkt->kind, AOM_CODEC_CX_FRAME_PKT);
+ // pkt->data.frame.flags is 0x1f0011.
+ EXPECT_NE(pkt->data.frame.flags & AOM_FRAME_IS_KEY, 0u);
+ pkt = aom_codec_get_cx_data(&enc, &iter);
+ EXPECT_EQ(pkt, nullptr);
+
+ // Second frame
+ constexpr int kWidthSmall = 64;
+ constexpr int kHeightSmall = 64;
+ EXPECT_EQ(&img, aom_img_wrap(&img, AOM_IMG_FMT_I420, kWidthSmall,
+ kHeightSmall, 1, buffer.data()));
+ img.cp = AOM_CICP_CP_UNSPECIFIED;
+ img.tc = AOM_CICP_TC_UNSPECIFIED;
+ img.mc = AOM_CICP_MC_UNSPECIFIED;
+ img.range = AOM_CR_FULL_RANGE;
+ FillImageGradient(&img, 8);
+ cfg.g_w = kWidthSmall;
+ cfg.g_h = kHeightSmall;
+ EXPECT_EQ(AOM_CODEC_OK, aom_codec_enc_config_set(&enc, &cfg));
+ EXPECT_EQ(AOM_CODEC_OK, aom_codec_encode(&enc, &img, 0, 1, 0));
+ iter = nullptr;
+ pkt = aom_codec_get_cx_data(&enc, &iter);
+ ASSERT_NE(pkt, nullptr);
+ EXPECT_EQ(pkt->kind, AOM_CODEC_CX_FRAME_PKT);
+ // pkt->data.frame.flags is 0.
+ EXPECT_EQ(pkt->data.frame.flags & AOM_FRAME_IS_KEY, 0u);
+ pkt = aom_codec_get_cx_data(&enc, &iter);
+ EXPECT_EQ(pkt, nullptr);
+
+ // Flush encoder
+ EXPECT_EQ(AOM_CODEC_OK, aom_codec_encode(&enc, nullptr, 0, 1, 0));
+ iter = nullptr;
+ pkt = aom_codec_get_cx_data(&enc, &iter);
+ EXPECT_EQ(pkt, nullptr);
+
+ EXPECT_EQ(AOM_CODEC_OK, aom_codec_destroy(&enc));
+}
+
#endif // !CONFIG_REALTIME_ONLY
TEST(EncodeForcedMaxFrameWidthHeight, RealtimeLag0TunePSNR) {
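Note: the pattern the disabled test exercises, declaring the largest frame
size up front and then shrinking, comes down to two config fields; a compact
sketch with real field names and illustrative values:

    aom_codec_enc_cfg_t cfg;
    aom_codec_enc_config_default(aom_codec_av1_cx(), &cfg,
                                 AOM_USAGE_GOOD_QUALITY);
    cfg.g_w = 128;  // size of the first frame
    cfg.g_h = 128;
    cfg.g_forced_max_frame_width = 128;   // upper bound for the whole stream
    cfg.g_forced_max_frame_height = 128;
    // ... aom_codec_enc_init() and encode the first frame ...
    // To encode a smaller frame next, update g_w/g_h and push the new
    // config with aom_codec_enc_config_set(&enc, &cfg) before the next
    // aom_codec_encode() call, exactly as the test does.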
diff --git a/test/frame_size_tests.cc b/test/frame_size_tests.cc
index 20aea31..b15be6e 100644
--- a/test/frame_size_tests.cc
+++ b/test/frame_size_tests.cc
@@ -48,7 +48,9 @@
};
#if CONFIG_SIZE_LIMIT
-TEST_F(AV1FrameSizeTests, TestInvalidSizes) {
+// TODO([email protected]): fails due to newer bounds checks (added in
+// ebc2714d71a834fc32a19eef0a81f51fbc47db01) that are hit before the assert
+// below.
+TEST_F(AV1FrameSizeTests, DISABLED_TestInvalidSizes) {
::libaom_test::RandomVideoSource video;
video.SetSize(DECODE_WIDTH_LIMIT + 16, DECODE_HEIGHT_LIMIT + 16);
@@ -57,7 +59,9 @@
ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
}
-TEST_F(AV1FrameSizeTests, LargeValidSizes) {
+// TODO([email protected]): similar to the above test; needs to be
+// updated for the new rejection case.
+TEST_F(AV1FrameSizeTests, DISABLED_LargeValidSizes) {
::libaom_test::RandomVideoSource video;
video.SetSize(DECODE_WIDTH_LIMIT, DECODE_HEIGHT_LIMIT);
diff --git a/test/function_equivalence_test.h b/test/function_equivalence_test.h
index f47800a..fc2a769 100644
--- a/test/function_equivalence_test.h
+++ b/test/function_equivalence_test.h
@@ -36,8 +36,8 @@
template <typename T>
struct FuncParam {
- FuncParam(T ref = nullptr, T tst = nullptr, int bit_depth = 0)
- : ref_func(ref), tst_func(tst), bit_depth(bit_depth) {}
+ FuncParam(T ref = nullptr, T tst = nullptr, int depth = 0)
+ : ref_func(ref), tst_func(tst), bit_depth(depth) {}
T ref_func;
T tst_func;
int bit_depth;
diff --git a/test/fwht4x4_test.cc b/test/fwht4x4_test.cc
index 8b8b4f2..9d27db8 100644
--- a/test/fwht4x4_test.cc
+++ b/test/fwht4x4_test.cc
@@ -113,9 +113,8 @@
ASSERT_NE(output_block, nullptr);
for (int i = 0; i < count_test_block; ++i) {
- int j, k;
- for (j = 0; j < height_; ++j) {
- for (k = 0; k < pitch_; ++k) {
+ for (int j = 0; j < height_; ++j) {
+ for (int k = 0; k < pitch_; ++k) {
int in_idx = j * stride + k;
int out_idx = j * pitch_ + k;
input_block[in_idx] =
@@ -131,7 +130,7 @@
aom_usec_timer c_timer_;
aom_usec_timer_start(&c_timer_);
- for (int i = 0; i < numIter; i++) {
+ for (int iter = 0; iter < numIter; iter++) {
API_REGISTER_STATE_CHECK(
fwd_txfm_c_(input_block, output_ref_block, stride));
}
@@ -140,7 +139,7 @@
aom_usec_timer simd_timer_;
aom_usec_timer_start(&simd_timer_);
- for (int i = 0; i < numIter; i++) {
+ for (int iter = 0; iter < numIter; iter++) {
API_REGISTER_STATE_CHECK(
fwd_txfm_(input_block, output_block, stride));
}
@@ -150,8 +149,8 @@
simd_sum_time += static_cast<int>(aom_usec_timer_elapsed(&simd_timer_));
// The minimum quant value is 4.
- for (j = 0; j < height_; ++j) {
- for (k = 0; k < pitch_; ++k) {
+ for (int j = 0; j < height_; ++j) {
+ for (int k = 0; k < pitch_; ++k) {
int out_idx = j * pitch_ + k;
ASSERT_EQ(output_block[out_idx], output_ref_block[out_idx])
<< "Error: not bit-exact result at index: " << out_idx
@@ -191,10 +190,10 @@
INSTANTIATE_TEST_SUITE_P(
C, Trans4x4WHT,
- ::testing::Values(make_tuple(&av1_highbd_fwht4x4_c, &iwht4x4_10_c, DCT_DCT,
+ ::testing::Values(make_tuple(&av1_fwht4x4_c, &iwht4x4_10_c, DCT_DCT,
AOM_BITS_10, 16,
static_cast<FdctFunc>(nullptr)),
- make_tuple(&av1_highbd_fwht4x4_c, &iwht4x4_12_c, DCT_DCT,
+ make_tuple(&av1_fwht4x4_c, &iwht4x4_12_c, DCT_DCT,
AOM_BITS_12, 16,
static_cast<FdctFunc>(nullptr))));
@@ -202,10 +201,10 @@
INSTANTIATE_TEST_SUITE_P(
SSE4_1, Trans4x4WHT,
- ::testing::Values(make_tuple(&av1_highbd_fwht4x4_sse4_1, &iwht4x4_10_sse4_1,
+ ::testing::Values(make_tuple(&av1_fwht4x4_sse4_1, &iwht4x4_10_sse4_1,
DCT_DCT, AOM_BITS_10, 16,
static_cast<FdctFunc>(nullptr)),
- make_tuple(&av1_highbd_fwht4x4_sse4_1, &iwht4x4_12_sse4_1,
+ make_tuple(&av1_fwht4x4_sse4_1, &iwht4x4_12_sse4_1,
DCT_DCT, AOM_BITS_12, 16,
static_cast<FdctFunc>(nullptr))));
@@ -215,12 +214,10 @@
INSTANTIATE_TEST_SUITE_P(
NEON, Trans4x4WHT,
- ::testing::Values(make_tuple(&av1_highbd_fwht4x4_neon, &iwht4x4_10_c,
- DCT_DCT, AOM_BITS_10, 16,
- &av1_highbd_fwht4x4_c),
- make_tuple(&av1_highbd_fwht4x4_neon, &iwht4x4_12_c,
- DCT_DCT, AOM_BITS_12, 16,
- &av1_highbd_fwht4x4_c)));
+ ::testing::Values(make_tuple(&av1_fwht4x4_neon, &iwht4x4_10_c, DCT_DCT,
+ AOM_BITS_10, 16, &av1_fwht4x4_c),
+ make_tuple(&av1_fwht4x4_neon, &iwht4x4_12_c, DCT_DCT,
+ AOM_BITS_12, 16, &av1_fwht4x4_c)));
#endif // HAVE_NEON
diff --git a/test/hadamard_test.cc b/test/hadamard_test.cc
index 0fe7f42..fc306e6 100644
--- a/test/hadamard_test.cc
+++ b/test/hadamard_test.cc
@@ -242,6 +242,12 @@
virtual void SetUp() { rnd_.Reset(ACMRandom::DeterministicSeed()); }
+ // The Rand() function generates values in the range [-((1 << BitDepth) - 1),
+ // (1 << BitDepth) - 1]. This is because the input to the Hadamard transform
+ // is the residual pixel, which is defined as 'source pixel - predicted
+ // pixel'. Source pixel and predicted pixel take values in the range
+ // [0, (1 << BitDepth) - 1] and thus the residual pixel ranges from
+ // -((1 << BitDepth) - 1) to ((1 << BitDepth) - 1).
virtual int16_t Rand() = 0;
void CompareReferenceRandom() {
@@ -259,9 +265,37 @@
for (int i = 0; i < block_size_; ++i) a[i] = Rand();
ReferenceHadamard(a, bw_, b_ref, bw_, bh_, shift_);
API_REGISTER_STATE_CHECK(h_func_(a, bw_, b));
+
+ // The order of the output is not important. Sort before checking.
+ std::sort(b, b + block_size_);
+ std::sort(b_ref, b_ref + block_size_);
EXPECT_EQ(memcmp(b, b_ref, sizeof(b)), 0);
}
+ void CompareReferenceExtreme() {
+ const int kMaxBlockSize = 32 * 32;
+ const int block_size = bw_ * bh_;
+ const int kBitDepth = 8;
+ DECLARE_ALIGNED(16, int16_t, a[kMaxBlockSize]);
+ DECLARE_ALIGNED(16, OutputType, b[kMaxBlockSize]);
+ memset(b, 0, sizeof(b));
+
+ OutputType b_ref[kMaxBlockSize];
+ memset(b_ref, 0, sizeof(b_ref));
+ for (int i = 0; i < 2; ++i) {
+ const int sign = (i == 0) ? 1 : -1;
+ for (int j = 0; j < block_size; ++j) a[j] = sign * ((1 << kBitDepth) - 1);
+
+ ReferenceHadamard(a, bw_, b_ref, bw_, bh_, shift_);
+ API_REGISTER_STATE_CHECK(h_func_(a, bw_, b));
+
+ // The order of the output is not important. Sort before checking.
+ std::sort(b, b + block_size);
+ std::sort(b_ref, b_ref + block_size);
+ EXPECT_EQ(memcmp(b, b_ref, sizeof(b)), 0);
+ }
+ }
+
void VaryStride() {
const int kMaxBlockSize = 32 * 32;
const int block_size_ = bw_ * bh_;
@@ -278,6 +312,10 @@
ReferenceHadamard(a, i, b_ref, bw_, bh_, shift_);
API_REGISTER_STATE_CHECK(h_func_(a, i, b));
+
+ // The order of the output is not important. Sort before checking.
+ std::sort(b, b + block_size_);
+ std::sort(b_ref, b_ref + block_size_);
EXPECT_EQ(0, memcmp(b, b_ref, sizeof(b)));
}
}
@@ -312,11 +350,20 @@
class HadamardLowbdTest : public HadamardTestBase<tran_low_t, HadamardFunc> {
public:
HadamardLowbdTest() : HadamardTestBase(GetParam(), /*do_shift=*/true) {}
- virtual int16_t Rand() { return rnd_.Rand9Signed(); }
+ // Use values between -255 (0xFF01) and 255 (0x00FF)
+ virtual int16_t Rand() {
+ int16_t src = rnd_.Rand8();
+ int16_t pred = rnd_.Rand8();
+ return src - pred;
+ }
};
TEST_P(HadamardLowbdTest, CompareReferenceRandom) { CompareReferenceRandom(); }
+TEST_P(HadamardLowbdTest, CompareReferenceExtreme) {
+ CompareReferenceExtreme();
+}
+
TEST_P(HadamardLowbdTest, VaryStride) { VaryStride(); }
TEST_P(HadamardLowbdTest, DISABLED_SpeedTest) { SpeedTest(1000000); }
@@ -349,15 +396,62 @@
#if HAVE_NEON
INSTANTIATE_TEST_SUITE_P(
NEON, HadamardLowbdTest,
- ::testing::Values(HadamardFuncWithSize(&aom_hadamard_8x8_neon, 8, 8),
- HadamardFuncWithSize(&aom_hadamard_16x16_neon, 16, 16)));
+ ::testing::Values(HadamardFuncWithSize(&aom_hadamard_4x4_neon, 4, 4),
+ HadamardFuncWithSize(&aom_hadamard_8x8_neon, 8, 8),
+ HadamardFuncWithSize(&aom_hadamard_16x16_neon, 16, 16),
+ HadamardFuncWithSize(&aom_hadamard_32x32_neon, 32, 32)));
#endif // HAVE_NEON
+#if CONFIG_AV1_HIGHBITDEPTH
+class HadamardHighbdTest : public HadamardTestBase<tran_low_t, HadamardFunc> {
+ protected:
+ HadamardHighbdTest() : HadamardTestBase(GetParam(), /*do_shift=*/true) {}
+ // Use values between -4095 (0xF001) and 4095 (0x0FFF)
+ virtual int16_t Rand() {
+ int16_t src = rnd_.Rand12();
+ int16_t pred = rnd_.Rand12();
+ return src - pred;
+ }
+};
+
+TEST_P(HadamardHighbdTest, CompareReferenceRandom) { CompareReferenceRandom(); }
+
+TEST_P(HadamardHighbdTest, VaryStride) { VaryStride(); }
+
+TEST_P(HadamardHighbdTest, DISABLED_Speed) {
+ SpeedTest(10);
+ SpeedTest(10000);
+ SpeedTest(10000000);
+}
+
+INSTANTIATE_TEST_SUITE_P(
+ C, HadamardHighbdTest,
+ ::testing::Values(
+ HadamardFuncWithSize(&aom_highbd_hadamard_8x8_c, 8, 8),
+ HadamardFuncWithSize(&aom_highbd_hadamard_16x16_c, 16, 16),
+ HadamardFuncWithSize(&aom_highbd_hadamard_32x32_c, 32, 32)));
+
+#if HAVE_NEON
+INSTANTIATE_TEST_SUITE_P(
+ NEON, HadamardHighbdTest,
+ ::testing::Values(
+ HadamardFuncWithSize(&aom_highbd_hadamard_8x8_neon, 8, 8),
+ HadamardFuncWithSize(&aom_highbd_hadamard_16x16_neon, 16, 16),
+ HadamardFuncWithSize(&aom_highbd_hadamard_32x32_neon, 32, 32)));
+#endif // HAVE_NEON
+
+#endif // CONFIG_AV1_HIGHBITDEPTH
+
// Tests for low precision
class HadamardLowbdLPTest : public HadamardTestBase<int16_t, HadamardLPFunc> {
public:
HadamardLowbdLPTest() : HadamardTestBase(GetParam(), /*do_shift=*/false) {}
- virtual int16_t Rand() { return rnd_.Rand9Signed(); }
+ // Use values between -255 (0xFF01) and 255 (0x00FF)
+ virtual int16_t Rand() {
+ int16_t src = rnd_.Rand8();
+ int16_t pred = rnd_.Rand8();
+ return src - pred;
+ }
};
TEST_P(HadamardLowbdLPTest, CompareReferenceRandom) {
@@ -402,7 +496,12 @@
public:
HadamardLowbdLP8x8DualTest()
: HadamardTestBase(GetParam(), /*do_shift=*/false) {}
- virtual int16_t Rand() { return rnd_.Rand9Signed(); }
+ // Use values between -255 (0xFF01) and 255 (0x00FF)
+ virtual int16_t Rand() {
+ int16_t src = rnd_.Rand8();
+ int16_t pred = rnd_.Rand8();
+ return src - pred;
+ }
};
TEST_P(HadamardLowbdLP8x8DualTest, CompareReferenceRandom) {
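Note: each of the Rand() overrides above constructs the residual the way the
codec does: as the difference of two n-bit samples, so values span
[-(2^n - 1), 2^n - 1]. A self-contained sketch of the construction, with
hypothetical names:

    #include <cstdint>
    #include <random>

    // src and pred are both uniform in [0, 255], so src - pred covers
    // [-255, 255]; with 12-bit samples the same construction covers
    // [-4095, 4095].
    int16_t RandomResidual(std::mt19937 &gen) {
      std::uniform_int_distribution<int> sample(0, 255);
      return static_cast<int16_t>(sample(gen) - sample(gen));
    }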
diff --git a/test/hbd_metrics_test.cc b/test/hbd_metrics_test.cc
index 6c9fe55..074213a 100644
--- a/test/hbd_metrics_test.cc
+++ b/test/hbd_metrics_test.cc
@@ -112,10 +112,10 @@
memset(&hbd_src, 0, sizeof(hbd_src));
memset(&hbd_dst, 0, sizeof(hbd_dst));
- aom_alloc_frame_buffer(&lbd_src, width, height, 1, 1, 0, 32, 16, 0);
- aom_alloc_frame_buffer(&lbd_dst, width, height, 1, 1, 0, 32, 16, 0);
- aom_alloc_frame_buffer(&hbd_src, width, height, 1, 1, 1, 32, 16, 0);
- aom_alloc_frame_buffer(&hbd_dst, width, height, 1, 1, 1, 32, 16, 0);
+ aom_alloc_frame_buffer(&lbd_src, width, height, 1, 1, 0, 32, 16, 0, 0);
+ aom_alloc_frame_buffer(&lbd_dst, width, height, 1, 1, 0, 32, 16, 0, 0);
+ aom_alloc_frame_buffer(&hbd_src, width, height, 1, 1, 1, 32, 16, 0, 0);
+ aom_alloc_frame_buffer(&hbd_dst, width, height, 1, 1, 1, 32, 16, 0, 0);
memset(lbd_src.buffer_alloc, kPixFiller, lbd_src.buffer_alloc_sz);
while (i < lbd_src.buffer_alloc_sz) {
diff --git a/test/horz_superres_test.cc b/test/horz_superres_test.cc
index 323aa93..cba29e9 100644
--- a/test/horz_superres_test.cc
+++ b/test/horz_superres_test.cc
@@ -53,12 +53,12 @@
const TestVideoParam kTestVideoVectors[] = {
{ "park_joy_90p_8_420.y4m", AOM_IMG_FMT_I420, AOM_BITS_8, 0, 5, 0, 25.3,
- 45.0 },
+ 44.7 },
#if CONFIG_AV1_HIGHBITDEPTH
{ "park_joy_90p_10_444.y4m", AOM_IMG_FMT_I44416, AOM_BITS_10, 1, 5, 0, 27.0,
- 47.9 },
+ 46.8 },
#endif
- { "screendata.y4m", AOM_IMG_FMT_I420, AOM_BITS_8, 0, 4, 1, 23.0, 56.0 },
+ { "screendata.y4m", AOM_IMG_FMT_I420, AOM_BITS_8, 0, 4, 1, 23.0, 52.5 },
// Image coding (single frame).
{ "niklas_1280_720_30.y4m", AOM_IMG_FMT_I420, AOM_BITS_8, 0, 1, 0, 32.0,
49.0 },
diff --git a/test/intrapred_test.cc b/test/intrapred_test.cc
index 3da9293..aced593 100644
--- a/test/intrapred_test.cc
+++ b/test/intrapred_test.cc
@@ -340,26 +340,11 @@
#if HAVE_NEON
const IntraPredFunc<IntraPred> LowbdIntraPredTestVectorNeon[] = {
- lowbd_entry(dc, 4, 4, neon), lowbd_entry(dc, 8, 8, neon),
- lowbd_entry(dc, 16, 16, neon), lowbd_entry(dc, 32, 32, neon),
-
- lowbd_entry(dc_top, 4, 4, neon), lowbd_entry(dc_top, 8, 8, neon),
- lowbd_entry(dc_top, 16, 16, neon), lowbd_entry(dc_top, 32, 32, neon),
-
- lowbd_entry(dc_left, 4, 4, neon), lowbd_entry(dc_left, 8, 8, neon),
- lowbd_entry(dc_left, 16, 16, neon), lowbd_entry(dc_left, 32, 32, neon),
-
- lowbd_entry(dc_128, 4, 4, neon), lowbd_entry(dc_128, 8, 8, neon),
- lowbd_entry(dc_128, 16, 16, neon), lowbd_entry(dc_128, 32, 32, neon),
-
- lowbd_entry(v, 4, 4, neon), lowbd_entry(v, 8, 8, neon),
- lowbd_entry(v, 16, 16, neon), lowbd_entry(v, 32, 32, neon),
-
- lowbd_entry(h, 4, 4, neon), lowbd_entry(h, 8, 8, neon),
- lowbd_entry(h, 16, 16, neon), lowbd_entry(h, 32, 32, neon),
-
- lowbd_intrapred(smooth, neon), lowbd_intrapred(smooth_v, neon),
- lowbd_intrapred(smooth_h, neon), lowbd_intrapred(paeth, neon),
+ lowbd_intrapred(dc, neon), lowbd_intrapred(dc_top, neon),
+ lowbd_intrapred(dc_left, neon), lowbd_intrapred(dc_128, neon),
+ lowbd_intrapred(v, neon), lowbd_intrapred(h, neon),
+ lowbd_intrapred(smooth, neon), lowbd_intrapred(smooth_v, neon),
+ lowbd_intrapred(smooth_h, neon), lowbd_intrapred(paeth, neon),
};
INSTANTIATE_TEST_SUITE_P(NEON, LowbdIntraPredTest,
@@ -416,13 +401,11 @@
#if CONFIG_AV1_HIGHBITDEPTH
#if HAVE_NEON
const IntraPredFunc<HighbdIntraPred> HighbdIntraPredTestVectorNeon[] = {
- highbd_entry(dc, 4, 4, neon, 8), highbd_entry(dc, 8, 8, neon, 8),
- highbd_entry(dc, 16, 16, neon, 8), highbd_entry(dc, 32, 32, neon, 8),
- highbd_entry(dc, 64, 64, neon, 8),
-
- highbd_intrapred(v, neon, 12), highbd_intrapred(paeth, neon, 12),
- highbd_intrapred(smooth, neon, 12), highbd_intrapred(smooth_v, neon, 12),
- highbd_intrapred(smooth_h, neon, 12),
+ highbd_intrapred(dc, neon, 12), highbd_intrapred(dc_top, neon, 12),
+ highbd_intrapred(dc_left, neon, 12), highbd_intrapred(dc_128, neon, 12),
+ highbd_intrapred(v, neon, 12), highbd_intrapred(h, neon, 12),
+ highbd_intrapred(paeth, neon, 12), highbd_intrapred(smooth, neon, 12),
+ highbd_intrapred(smooth_v, neon, 12), highbd_intrapred(smooth_h, neon, 12),
};
INSTANTIATE_TEST_SUITE_P(NEON, HighbdIntraPredTest,
diff --git a/test/invalid_file_test.cc b/test/invalid_file_test.cc
index 10a3bc4..63e15ca 100644
--- a/test/invalid_file_test.cc
+++ b/test/invalid_file_test.cc
@@ -133,10 +133,16 @@
{ 4, "invalid-oss-fuzz-9463.ivf", "invalid-oss-fuzz-9463.ivf.res.2" },
{ 1, "invalid-oss-fuzz-9720.ivf", nullptr },
{ 1, "invalid-oss-fuzz-10389.ivf", "invalid-oss-fuzz-10389.ivf.res.4" },
+#if !CHROMIUM && !CONFIG_SIZE_LIMIT || \
+ (CONFIG_SIZE_LIMIT && DECODE_WIDTH_LIMIT >= 5120 && \
+ DECODE_HEIGHT_LIMIT >= 180)
{ 1, "invalid-oss-fuzz-11523.ivf", "invalid-oss-fuzz-11523.ivf.res.2" },
+#endif
{ 4, "invalid-oss-fuzz-15363.ivf", nullptr },
{ 1, "invalid-oss-fuzz-16437.ivf", "invalid-oss-fuzz-16437.ivf.res.2" },
+#if CONFIG_MAX_DECODE_PROFILE >= 1
{ 1, "invalid-oss-fuzz-24706.ivf", nullptr },
+#endif
#if CONFIG_AV1_HIGHBITDEPTH
// These test vectors contain 10-bit or 12-bit video.
{ 1, "invalid-oss-fuzz-9288.ivf", nullptr },
diff --git a/test/level_test.cc b/test/level_test.cc
index 7ae1a75..cc79926 100644
--- a/test/level_test.cc
+++ b/test/level_test.cc
@@ -9,6 +9,7 @@
* PATENTS file, you can obtain it at www.aomedia.org/license/patent.
*/
#include <memory>
+#include <string>
#include "third_party/googletest/src/googletest/include/gtest/gtest.h"
@@ -78,8 +79,8 @@
int level_[32];
};
-TEST_P(LevelTest, TestTargetLevelApi) {
- static aom_codec_iface_t *codec = aom_codec_av1_cx();
+TEST(LevelTest, TestTargetLevelApi) {
+ aom_codec_iface_t *codec = aom_codec_av1_cx();
aom_codec_ctx_t enc;
aom_codec_enc_cfg_t cfg;
EXPECT_EQ(AOM_CODEC_OK, aom_codec_enc_config_default(codec, &cfg, 0));
@@ -87,10 +88,10 @@
for (int operating_point = 0; operating_point <= 32; ++operating_point) {
for (int level = 0; level <= 32; ++level) {
const int target_level = operating_point * 100 + level;
- if ((level < (CONFIG_CWG_C013 ? 28 : 20) && level != 2 && level != 3 &&
- level != 6 && level != 7 && level != 10 && level != 11) ||
- level == kLevelMax || level == kLevelKeepStats ||
- operating_point > 31) {
+ if (operating_point <= 31 &&
+ ((level < (CONFIG_CWG_C013 ? 28 : 20) && level != 2 && level != 3 &&
+ level != 6 && level != 7 && level != 10 && level != 11) ||
+ level == kLevelMax || level == kLevelKeepStats)) {
EXPECT_EQ(AOM_CODEC_OK,
AOM_CODEC_CONTROL_TYPECHECKED(
&enc, AV1E_SET_TARGET_SEQ_LEVEL_IDX, target_level));
@@ -104,6 +105,23 @@
EXPECT_EQ(AOM_CODEC_OK, aom_codec_destroy(&enc));
}
+TEST(LevelTest, InvalidOperatingPointIndexErrorDetail) {
+ aom_codec_iface_t *codec = aom_codec_av1_cx();
+ aom_codec_ctx_t enc;
+ aom_codec_enc_cfg_t cfg;
+ EXPECT_EQ(aom_codec_enc_config_default(codec, &cfg, 0), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_enc_init(&enc, codec, &cfg, 0), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_TARGET_SEQ_LEVEL_IDX, 3219),
+ AOM_CODEC_INVALID_PARAM);
+ EXPECT_EQ(aom_codec_error_detail(&enc),
+ std::string("Invalid operating point index: 32"));
+ EXPECT_EQ(aom_codec_set_option(&enc, "target-seq-level-idx", "3319"),
+ AOM_CODEC_INVALID_PARAM);
+ EXPECT_EQ(aom_codec_error_detail(&enc),
+ std::string("Invalid operating point index: 33"));
+ EXPECT_EQ(aom_codec_destroy(&enc), AOM_CODEC_OK);
+}
+
TEST_P(LevelTest, TestTargetLevel19) {
std::unique_ptr<libaom_test::VideoSource> video;
video.reset(new libaom_test::Y4mVideoSource("park_joy_90p_8_420.y4m", 0, 10));
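Note: the operand of AV1E_SET_TARGET_SEQ_LEVEL_IDX packs two fields,
target = operating_point * 100 + seq_level_idx, with operating points limited
to 0..31; the values 3219 and 3319 in the new test decode to the out-of-range
operating points 32 and 33. A small sketch of composing a valid value (the
control is real; the mapping comment reflects the AV1 level numbering):

    // seq_level_idx = (major - 2) * 4 + minor, e.g. 13 -> level 5.1.
    const int operating_point = 0;
    const int seq_level_idx = 13;
    const int target = operating_point * 100 + seq_level_idx;  // 13
    aom_codec_control(&enc, AV1E_SET_TARGET_SEQ_LEVEL_IDX, target);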
diff --git a/test/log2_test.cc b/test/log2_test.cc
index d7840c6..71cf8b2 100644
--- a/test/log2_test.cc
+++ b/test/log2_test.cc
@@ -9,6 +9,7 @@
* PATENTS file, you can obtain it at www.aomedia.org/license/patent.
*/
+#include <limits.h>
#include <math.h>
#include "aom_ports/bitops.h"
@@ -42,9 +43,9 @@
const int power_of_2 = 1 << exponent;
EXPECT_EQ(av1_ceil_log2(power_of_2 - 1), exponent);
EXPECT_EQ(av1_ceil_log2(power_of_2), exponent);
- // The current implementation of av1_ceil_log2 only works up to 2^30.
- if (exponent < 30) {
- EXPECT_EQ(av1_ceil_log2(power_of_2 + 1), exponent + 1);
- }
+ EXPECT_EQ(av1_ceil_log2(power_of_2 + 1), exponent + 1);
}
+
+ // INT_MAX = 2^31 - 1
+ EXPECT_EQ(av1_ceil_log2(INT_MAX), 31);
}
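Note: spelled out, the property the test now checks over the full int range,
using the same includes as the test above:

    #include <cassert>
    #include <climits>
    #include "aom_ports/bitops.h"

    void CeilLog2Examples() {
      assert(av1_ceil_log2(7) == 3);         // ceil(log2(7)) = 3
      assert(av1_ceil_log2(8) == 3);         // exact power of two
      assert(av1_ceil_log2(9) == 4);         // rounds up past a power of two
      assert(av1_ceil_log2(INT_MAX) == 31);  // INT_MAX = 2^31 - 1
    }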
diff --git a/test/lossless_test.cc b/test/lossless_test.cc
index c14bc06..ef4e19f 100644
--- a/test/lossless_test.cc
+++ b/test/lossless_test.cc
@@ -76,6 +76,11 @@
return AOM_CODEC_OK == res_dec;
}
+ void TestLosslessEncoding();
+ void TestLosslessEncodingVGALag0();
+ void TestLosslessEncoding444();
+ void TestLosslessEncodingCtrl();
+
private:
double psnr_;
unsigned int nframes_;
@@ -85,7 +90,7 @@
int base_qindex_;
};
-TEST_P(LosslessTestLarge, TestLossLessEncoding) {
+void LosslessTestLarge::TestLosslessEncoding() {
const aom_rational timebase = { 33333333, 1000000000 };
cfg_.g_timebase = timebase;
cfg_.rc_target_bitrate = 2000;
@@ -103,7 +108,24 @@
EXPECT_GE(psnr_lossless, kMaxPsnr);
}
-TEST_P(LosslessTestLarge, TestLossLessEncoding444) {
+void LosslessTestLarge::TestLosslessEncodingVGALag0() {
+ const aom_rational timebase = { 33333333, 1000000000 };
+ cfg_.g_timebase = timebase;
+ cfg_.rc_target_bitrate = 2000;
+ cfg_.g_lag_in_frames = 0;
+ cfg_.rc_min_quantizer = 0;
+ cfg_.rc_max_quantizer = 0;
+
+ init_flags_ = AOM_CODEC_USE_PSNR;
+
+ libaom_test::I420VideoSource video("niklas_640_480_30.yuv", 640, 480,
+ timebase.den, timebase.num, 0, 30);
+ ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
+ const double psnr_lossless = GetMinPsnr();
+ EXPECT_GE(psnr_lossless, kMaxPsnr);
+}
+
+void LosslessTestLarge::TestLosslessEncoding444() {
libaom_test::Y4mVideoSource video("rush_hour_444.y4m", 0, 5);
cfg_.g_profile = 1;
@@ -120,7 +142,7 @@
EXPECT_GE(psnr_lossless, kMaxPsnr);
}
-TEST_P(LosslessTestLarge, TestLossLessEncodingCtrl) {
+void LosslessTestLarge::TestLosslessEncodingCtrl() {
const aom_rational timebase = { 33333333, 1000000000 };
cfg_.g_timebase = timebase;
cfg_.rc_target_bitrate = 2000;
@@ -139,9 +161,23 @@
EXPECT_GE(psnr_lossless, kMaxPsnr);
}
+TEST_P(LosslessTestLarge, TestLosslessEncoding) { TestLosslessEncoding(); }
+
+TEST_P(LosslessTestLarge, TestLosslessEncodingVGALag0) {
+ TestLosslessEncodingVGALag0();
+}
+
+TEST_P(LosslessTestLarge, TestLosslessEncoding444) {
+ TestLosslessEncoding444();
+}
+
+TEST_P(LosslessTestLarge, TestLosslessEncodingCtrl) {
+ TestLosslessEncodingCtrl();
+}
+
class LosslessAllIntraTestLarge : public LosslessTestLarge {};
-TEST_P(LosslessAllIntraTestLarge, TestLossLessEncodingCtrl) {
+TEST_P(LosslessAllIntraTestLarge, TestLosslessEncodingCtrl) {
const aom_rational timebase = { 33333333, 1000000000 };
cfg_.g_timebase = timebase;
// Intentionally set Q > 0, to make sure control can be used to activate
@@ -158,6 +194,24 @@
EXPECT_GE(psnr_lossless, kMaxPsnr);
}
+using LosslessRealtimeTestLarge = LosslessTestLarge;
+
+TEST_P(LosslessRealtimeTestLarge, TestLosslessEncoding) {
+ TestLosslessEncoding();
+}
+
+TEST_P(LosslessRealtimeTestLarge, TestLosslessEncodingVGALag0) {
+ TestLosslessEncodingVGALag0();
+}
+
+TEST_P(LosslessRealtimeTestLarge, TestLosslessEncoding444) {
+ TestLosslessEncoding444();
+}
+
+TEST_P(LosslessRealtimeTestLarge, TestLosslessEncodingCtrl) {
+ TestLosslessEncodingCtrl();
+}
+
AV1_INSTANTIATE_TEST_SUITE(LosslessTestLarge,
::testing::Values(::libaom_test::kOnePassGood,
::libaom_test::kTwoPassGood),
@@ -168,4 +222,9 @@
::testing::Values(::libaom_test::kAllIntra),
::testing::Values(AOM_Q),
::testing::Values(6, 9)); // cpu_used
+
+AV1_INSTANTIATE_TEST_SUITE(LosslessRealtimeTestLarge,
+ ::testing::Values(::libaom_test::kRealTime),
+ ::testing::Values(AOM_Q, AOM_VBR, AOM_CBR, AOM_CQ),
+ ::testing::Range(6, 11)); // cpu_used
} // namespace
diff --git a/test/masked_sad_test.cc b/test/masked_sad_test.cc
index 91f7982..2ef3e4d 100644
--- a/test/masked_sad_test.cc
+++ b/test/masked_sad_test.cc
@@ -187,13 +187,30 @@
int msk_stride = MAX_SB_SIZE;
const int iters = run_times == 1 ? number_of_iterations : 1;
for (int i = 0; i < iters; ++i) {
+ if (run_times == 1 && i == 0) {
+ // The maximum accumulator value occurs when src=0 and
+ // ref/second_pred=255 (or vice versa, since we take the absolute
+ // difference). Check this case explicitly to ensure we do not overflow
+ // during accumulation.
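+ // For a 128x128 block this sums 16384 values of 255 (4,177,920 in total),
+ // stressing any narrow partial accumulators in the SIMD implementations.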
+ for (int j = 0; j < MAX_SB_SIZE * MAX_SB_SIZE; j++) {
+ src_ptr[j] = 0;
+ ref_ptr[j] = 255;
+ (ref_ptr + kBlockSize)[j] = 255;
+ (ref_ptr + 2 * kBlockSize)[j] = 255;
+ (ref_ptr + 3 * kBlockSize)[j] = 255;
+ second_pred_ptr[j] = 255;
+ }
+ } else {
+ for (int j = 0; j < MAX_SB_SIZE * MAX_SB_SIZE; j++) {
+ src_ptr[j] = rnd.Rand8();
+ ref_ptr[j] = rnd.Rand8();
+ (ref_ptr + kBlockSize)[j] = rnd.Rand8();
+ (ref_ptr + 2 * kBlockSize)[j] = rnd.Rand8();
+ (ref_ptr + 3 * kBlockSize)[j] = rnd.Rand8();
+ second_pred_ptr[j] = rnd.Rand8();
+ }
+ }
for (int j = 0; j < MAX_SB_SIZE * MAX_SB_SIZE; j++) {
- src_ptr[j] = rnd.Rand8();
- ref_ptr[j] = rnd.Rand8();
- (ref_ptr + kBlockSize)[j] = rnd.Rand8();
- (ref_ptr + 2 * kBlockSize)[j] = rnd.Rand8();
- (ref_ptr + 3 * kBlockSize)[j] = rnd.Rand8();
- second_pred_ptr[j] = rnd.Rand8();
msk_ptr[j] = ((rnd.Rand8() & 0x7f) > 64) ? rnd.Rand8() & 0x3f : 64;
assert(msk_ptr[j] <= 64);
}
@@ -505,4 +522,65 @@
#endif // CONFIG_AV1_HIGHBITDEPTH
#endif // HAVE_AVX2
+#if HAVE_NEON
+const MaskedSADParam msad_test[] = {
+ make_tuple(&aom_masked_sad4x4_neon, &aom_masked_sad4x4_c),
+ make_tuple(&aom_masked_sad4x8_neon, &aom_masked_sad4x8_c),
+ make_tuple(&aom_masked_sad8x4_neon, &aom_masked_sad8x4_c),
+ make_tuple(&aom_masked_sad8x8_neon, &aom_masked_sad8x8_c),
+ make_tuple(&aom_masked_sad8x16_neon, &aom_masked_sad8x16_c),
+ make_tuple(&aom_masked_sad16x8_neon, &aom_masked_sad16x8_c),
+ make_tuple(&aom_masked_sad16x16_neon, &aom_masked_sad16x16_c),
+ make_tuple(&aom_masked_sad16x32_neon, &aom_masked_sad16x32_c),
+ make_tuple(&aom_masked_sad32x16_neon, &aom_masked_sad32x16_c),
+ make_tuple(&aom_masked_sad32x32_neon, &aom_masked_sad32x32_c),
+ make_tuple(&aom_masked_sad32x64_neon, &aom_masked_sad32x64_c),
+ make_tuple(&aom_masked_sad64x32_neon, &aom_masked_sad64x32_c),
+ make_tuple(&aom_masked_sad64x64_neon, &aom_masked_sad64x64_c),
+ make_tuple(&aom_masked_sad64x128_neon, &aom_masked_sad64x128_c),
+ make_tuple(&aom_masked_sad128x64_neon, &aom_masked_sad128x64_c),
+ make_tuple(&aom_masked_sad128x128_neon, &aom_masked_sad128x128_c),
+#if !CONFIG_REALTIME_ONLY
+ make_tuple(&aom_masked_sad4x16_neon, &aom_masked_sad4x16_c),
+ make_tuple(&aom_masked_sad16x4_neon, &aom_masked_sad16x4_c),
+ make_tuple(&aom_masked_sad8x32_neon, &aom_masked_sad8x32_c),
+ make_tuple(&aom_masked_sad32x8_neon, &aom_masked_sad32x8_c),
+ make_tuple(&aom_masked_sad16x64_neon, &aom_masked_sad16x64_c),
+ make_tuple(&aom_masked_sad64x16_neon, &aom_masked_sad64x16_c),
+#endif
+};
+
+INSTANTIATE_TEST_SUITE_P(NEON, MaskedSADTest, ::testing::ValuesIn(msad_test));
+
+const MaskedSADx4Param msadx4_test[] = {
+ make_tuple(&aom_masked_sad4x4x4d_neon, &aom_masked_sad4x4x4d_c),
+ make_tuple(&aom_masked_sad4x8x4d_neon, &aom_masked_sad4x8x4d_c),
+ make_tuple(&aom_masked_sad8x4x4d_neon, &aom_masked_sad8x4x4d_c),
+ make_tuple(&aom_masked_sad8x8x4d_neon, &aom_masked_sad8x8x4d_c),
+ make_tuple(&aom_masked_sad8x16x4d_neon, &aom_masked_sad8x16x4d_c),
+ make_tuple(&aom_masked_sad16x8x4d_neon, &aom_masked_sad16x8x4d_c),
+ make_tuple(&aom_masked_sad16x16x4d_neon, &aom_masked_sad16x16x4d_c),
+ make_tuple(&aom_masked_sad16x32x4d_neon, &aom_masked_sad16x32x4d_c),
+ make_tuple(&aom_masked_sad32x16x4d_neon, &aom_masked_sad32x16x4d_c),
+ make_tuple(&aom_masked_sad32x32x4d_neon, &aom_masked_sad32x32x4d_c),
+ make_tuple(&aom_masked_sad32x64x4d_neon, &aom_masked_sad32x64x4d_c),
+ make_tuple(&aom_masked_sad64x32x4d_neon, &aom_masked_sad64x32x4d_c),
+ make_tuple(&aom_masked_sad64x64x4d_neon, &aom_masked_sad64x64x4d_c),
+ make_tuple(&aom_masked_sad64x128x4d_neon, &aom_masked_sad64x128x4d_c),
+ make_tuple(&aom_masked_sad128x64x4d_neon, &aom_masked_sad128x64x4d_c),
+ make_tuple(&aom_masked_sad128x128x4d_neon, &aom_masked_sad128x128x4d_c),
+#if !CONFIG_REALTIME_ONLY
+ make_tuple(&aom_masked_sad4x16x4d_neon, &aom_masked_sad4x16x4d_c),
+ make_tuple(&aom_masked_sad16x4x4d_neon, &aom_masked_sad16x4x4d_c),
+ make_tuple(&aom_masked_sad8x32x4d_neon, &aom_masked_sad8x32x4d_c),
+ make_tuple(&aom_masked_sad32x8x4d_neon, &aom_masked_sad32x8x4d_c),
+ make_tuple(&aom_masked_sad16x64x4d_neon, &aom_masked_sad16x64x4d_c),
+ make_tuple(&aom_masked_sad64x16x4d_neon, &aom_masked_sad64x16x4d_c),
+#endif
+};
+
+INSTANTIATE_TEST_SUITE_P(NEON, MaskedSADx4Test,
+ ::testing::ValuesIn(msadx4_test));
+#endif // HAVE_NEON
+
} // namespace
diff --git a/test/masked_variance_test.cc b/test/masked_variance_test.cc
index 4a4cb1a..e76403e 100644
--- a/test/masked_variance_test.cc
+++ b/test/masked_variance_test.cc
@@ -514,4 +514,59 @@
::testing::ValuesIn(hbd_sub_pel_var_test));
#endif // CONFIG_AV1_HIGHBITDEPTH
#endif // HAVE_SSSE3
+
+#if HAVE_NEON
+
+const MaskedSubPixelVarianceParam sub_pel_var_test[] = {
+ make_tuple(&aom_masked_sub_pixel_variance128x128_neon,
+ &aom_masked_sub_pixel_variance128x128_c),
+ make_tuple(&aom_masked_sub_pixel_variance128x64_neon,
+ &aom_masked_sub_pixel_variance128x64_c),
+ make_tuple(&aom_masked_sub_pixel_variance64x128_neon,
+ &aom_masked_sub_pixel_variance64x128_c),
+ make_tuple(&aom_masked_sub_pixel_variance64x64_neon,
+ &aom_masked_sub_pixel_variance64x64_c),
+ make_tuple(&aom_masked_sub_pixel_variance64x32_neon,
+ &aom_masked_sub_pixel_variance64x32_c),
+ make_tuple(&aom_masked_sub_pixel_variance32x64_neon,
+ &aom_masked_sub_pixel_variance32x64_c),
+ make_tuple(&aom_masked_sub_pixel_variance32x32_neon,
+ &aom_masked_sub_pixel_variance32x32_c),
+ make_tuple(&aom_masked_sub_pixel_variance32x16_neon,
+ &aom_masked_sub_pixel_variance32x16_c),
+ make_tuple(&aom_masked_sub_pixel_variance16x32_neon,
+ &aom_masked_sub_pixel_variance16x32_c),
+ make_tuple(&aom_masked_sub_pixel_variance16x16_neon,
+ &aom_masked_sub_pixel_variance16x16_c),
+ make_tuple(&aom_masked_sub_pixel_variance16x8_neon,
+ &aom_masked_sub_pixel_variance16x8_c),
+ make_tuple(&aom_masked_sub_pixel_variance8x16_neon,
+ &aom_masked_sub_pixel_variance8x16_c),
+ make_tuple(&aom_masked_sub_pixel_variance8x8_neon,
+ &aom_masked_sub_pixel_variance8x8_c),
+ make_tuple(&aom_masked_sub_pixel_variance8x4_neon,
+ &aom_masked_sub_pixel_variance8x4_c),
+ make_tuple(&aom_masked_sub_pixel_variance4x8_neon,
+ &aom_masked_sub_pixel_variance4x8_c),
+ make_tuple(&aom_masked_sub_pixel_variance4x4_neon,
+ &aom_masked_sub_pixel_variance4x4_c),
+#if !CONFIG_REALTIME_ONLY
+ make_tuple(&aom_masked_sub_pixel_variance64x16_neon,
+ &aom_masked_sub_pixel_variance64x16_c),
+ make_tuple(&aom_masked_sub_pixel_variance16x64_neon,
+ &aom_masked_sub_pixel_variance16x64_c),
+ make_tuple(&aom_masked_sub_pixel_variance32x8_neon,
+ &aom_masked_sub_pixel_variance32x8_c),
+ make_tuple(&aom_masked_sub_pixel_variance8x32_neon,
+ &aom_masked_sub_pixel_variance8x32_c),
+ make_tuple(&aom_masked_sub_pixel_variance16x4_neon,
+ &aom_masked_sub_pixel_variance16x4_c),
+ make_tuple(&aom_masked_sub_pixel_variance4x16_neon,
+ &aom_masked_sub_pixel_variance4x16_c),
+#endif
+};
+
+INSTANTIATE_TEST_SUITE_P(NEON_C_COMPARE, MaskedSubPixelVarianceTest,
+ ::testing::ValuesIn(sub_pel_var_test));
+#endif // HAVE_NEON
} // namespace
diff --git a/test/minmax_test.cc b/test/minmax_test.cc
new file mode 100644
index 0000000..cf67b7b
--- /dev/null
+++ b/test/minmax_test.cc
@@ -0,0 +1,244 @@
+/*
+ * Copyright (c) 2023 The WebM project authors. All Rights Reserved.
+ * Copyright (c) 2023, Alliance for Open Media. All Rights Reserved.
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#include <stdlib.h>
+#include <string.h>
+
+#include "third_party/googletest/src/googletest/include/gtest/gtest.h"
+
+#include "config/aom_config.h"
+#include "config/aom_dsp_rtcd.h"
+#include "aom_ports/mem.h"
+#include "test/acm_random.h"
+#include "test/register_state_check.h"
+#include "test/util.h"
+
+namespace {
+
+using ::libaom_test::ACMRandom;
+
+typedef void (*MinMaxFunc)(const uint8_t *a, int a_stride, const uint8_t *b,
+ int b_stride, int *min, int *max);
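+
+// A MinMaxFunc finds the smallest and largest absolute difference between
+// two 8x8 blocks, where a_stride/b_stride give each block's row pitch.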
+
+class MinMaxTest : public ::testing::TestWithParam<MinMaxFunc> {
+ public:
+ virtual void SetUp() {
+ mm_func_ = GetParam();
+ rnd_.Reset(ACMRandom::DeterministicSeed());
+ }
+
+ protected:
+ MinMaxFunc mm_func_;
+ ACMRandom rnd_;
+};
+
+void reference_minmax(const uint8_t *a, int a_stride, const uint8_t *b,
+ int b_stride, int *min_ret, int *max_ret) {
+ int min = 255;
+ int max = 0;
+ for (int i = 0; i < 8; i++) {
+ for (int j = 0; j < 8; j++) {
+ const int diff = abs(a[i * a_stride + j] - b[i * b_stride + j]);
+ if (min > diff) min = diff;
+ if (max < diff) max = diff;
+ }
+ }
+
+ *min_ret = min;
+ *max_ret = max;
+}
+
+TEST_P(MinMaxTest, MinValue) {
+ for (int i = 0; i < 64; i++) {
+ uint8_t a[64], b[64];
+ memset(a, 0, sizeof(a));
+ memset(b, 255, sizeof(b));
+ b[i] = i; // Set a minimum difference of i.
+
+ int min, max;
+ API_REGISTER_STATE_CHECK(mm_func_(a, 8, b, 8, &min, &max));
+ EXPECT_EQ(255, max);
+ EXPECT_EQ(i, min);
+ }
+}
+
+TEST_P(MinMaxTest, MaxValue) {
+ for (int i = 0; i < 64; i++) {
+ uint8_t a[64], b[64];
+ memset(a, 0, sizeof(a));
+ memset(b, 0, sizeof(b));
+ b[i] = i; // Set a maximum difference of i.
+
+ int min, max;
+ API_REGISTER_STATE_CHECK(mm_func_(a, 8, b, 8, &min, &max));
+ EXPECT_EQ(i, max);
+ EXPECT_EQ(0, min);
+ }
+}
+
+TEST_P(MinMaxTest, CompareReference) {
+ uint8_t a[64], b[64];
+ for (int j = 0; j < 64; j++) {
+ a[j] = rnd_.Rand8();
+ b[j] = rnd_.Rand8();
+ }
+
+ int min_ref, max_ref, min, max;
+ reference_minmax(a, 8, b, 8, &min_ref, &max_ref);
+ API_REGISTER_STATE_CHECK(mm_func_(a, 8, b, 8, &min, &max));
+ EXPECT_EQ(max_ref, max);
+ EXPECT_EQ(min_ref, min);
+}
+
+TEST_P(MinMaxTest, CompareReferenceAndVaryStride) {
+ uint8_t a[8 * 64], b[8 * 64];
+ for (int i = 0; i < 8 * 64; i++) {
+ a[i] = rnd_.Rand8();
+ b[i] = rnd_.Rand8();
+ }
+ for (int a_stride = 8; a_stride <= 64; a_stride += 8) {
+ for (int b_stride = 8; b_stride <= 64; b_stride += 8) {
+ int min_ref, max_ref, min, max;
+ reference_minmax(a, a_stride, b, b_stride, &min_ref, &max_ref);
+ API_REGISTER_STATE_CHECK(mm_func_(a, a_stride, b, b_stride, &min, &max));
+ EXPECT_EQ(max_ref, max)
+ << "when a_stride = " << a_stride << " and b_stride = " << b_stride;
+ EXPECT_EQ(min_ref, min)
+ << "when a_stride = " << a_stride << " and b_stride = " << b_stride;
+ }
+ }
+}
+
+#if CONFIG_AV1_HIGHBITDEPTH
+
+using HBDMinMaxTest = MinMaxTest;
+
+void highbd_reference_minmax(const uint8_t *a, int a_stride, const uint8_t *b,
+ int b_stride, int *min_ret, int *max_ret) {
+ int min = 65535;
+ int max = 0;
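+ // High-bitdepth functions receive buffers as encoded uint8_t * handles;
+ // CONVERT_TO_SHORTPTR recovers the real uint16_t view of the data.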
+ const uint16_t *a_ptr = CONVERT_TO_SHORTPTR(a);
+ const uint16_t *b_ptr = CONVERT_TO_SHORTPTR(b);
+ for (int i = 0; i < 8; i++) {
+ for (int j = 0; j < 8; j++) {
+ const int diff = abs(a_ptr[i * a_stride + j] - b_ptr[i * b_stride + j]);
+ if (min > diff) min = diff;
+ if (max < diff) max = diff;
+ }
+ }
+
+ *min_ret = min;
+ *max_ret = max;
+}
+
+TEST_P(HBDMinMaxTest, MinValue) {
+ uint8_t *a = CONVERT_TO_BYTEPTR(
+ reinterpret_cast<uint16_t *>(aom_malloc(64 * sizeof(uint16_t))));
+ uint8_t *b = CONVERT_TO_BYTEPTR(
+ reinterpret_cast<uint16_t *>(aom_malloc(64 * sizeof(uint16_t))));
+ for (int i = 0; i < 64; i++) {
+ aom_memset16(CONVERT_TO_SHORTPTR(a), 0, 64);
+ aom_memset16(CONVERT_TO_SHORTPTR(b), 65535, 64);
+ CONVERT_TO_SHORTPTR(b)[i] = i; // Set a minimum difference of i.
+
+ int min, max;
+ API_REGISTER_STATE_CHECK(mm_func_(a, 8, b, 8, &min, &max));
+ EXPECT_EQ(65535, max);
+ EXPECT_EQ(i, min);
+ }
+ aom_free(CONVERT_TO_SHORTPTR(a));
+ aom_free(CONVERT_TO_SHORTPTR(b));
+}
+
+TEST_P(HBDMinMaxTest, MaxValue) {
+ uint8_t *a = CONVERT_TO_BYTEPTR(
+ reinterpret_cast<uint16_t *>(aom_malloc(64 * sizeof(uint16_t))));
+ uint8_t *b = CONVERT_TO_BYTEPTR(
+ reinterpret_cast<uint16_t *>(aom_malloc(64 * sizeof(uint16_t))));
+ for (int i = 0; i < 64; i++) {
+ aom_memset16(CONVERT_TO_SHORTPTR(a), 0, 64);
+ aom_memset16(CONVERT_TO_SHORTPTR(b), 0, 64);
+ CONVERT_TO_SHORTPTR(b)[i] = i; // Set a maximum difference of i.
+
+ int min, max;
+ API_REGISTER_STATE_CHECK(mm_func_(a, 8, b, 8, &min, &max));
+ EXPECT_EQ(i, max);
+ EXPECT_EQ(0, min);
+ }
+ aom_free(CONVERT_TO_SHORTPTR(a));
+ aom_free(CONVERT_TO_SHORTPTR(b));
+}
+
+TEST_P(HBDMinMaxTest, CompareReference) {
+ uint8_t *a = CONVERT_TO_BYTEPTR(
+ reinterpret_cast<uint16_t *>(aom_malloc(64 * sizeof(uint16_t))));
+ uint8_t *b = CONVERT_TO_BYTEPTR(
+ reinterpret_cast<uint16_t *>(aom_malloc(64 * sizeof(uint16_t))));
+ for (int j = 0; j < 64; j++) {
+ CONVERT_TO_SHORTPTR(a)[j] = rnd_.Rand16();
+ CONVERT_TO_SHORTPTR(b)[j] = rnd_.Rand16();
+ }
+
+ int min_ref, max_ref, min, max;
+ highbd_reference_minmax(a, 8, b, 8, &min_ref, &max_ref);
+ API_REGISTER_STATE_CHECK(mm_func_(a, 8, b, 8, &min, &max));
+ aom_free(CONVERT_TO_SHORTPTR(a));
+ aom_free(CONVERT_TO_SHORTPTR(b));
+ EXPECT_EQ(max_ref, max);
+ EXPECT_EQ(min_ref, min);
+}
+
+TEST_P(HBDMinMaxTest, CompareReferenceAndVaryStride) {
+ uint8_t *a = CONVERT_TO_BYTEPTR(
+ reinterpret_cast<uint16_t *>(aom_malloc((8 * 64) * sizeof(uint16_t))));
+ uint8_t *b = CONVERT_TO_BYTEPTR(
+ reinterpret_cast<uint16_t *>(aom_malloc((8 * 64) * sizeof(uint16_t))));
+ for (int i = 0; i < 8 * 64; i++) {
+ CONVERT_TO_SHORTPTR(a)[i] = rnd_.Rand16();
+ CONVERT_TO_SHORTPTR(b)[i] = rnd_.Rand16();
+ }
+ for (int a_stride = 8; a_stride <= 64; a_stride += 8) {
+ for (int b_stride = 8; b_stride <= 64; b_stride += 8) {
+ int min_ref, max_ref, min, max;
+ highbd_reference_minmax(a, a_stride, b, b_stride, &min_ref, &max_ref);
+ API_REGISTER_STATE_CHECK(mm_func_(a, a_stride, b, b_stride, &min, &max));
+ EXPECT_EQ(max_ref, max)
+ << "when a_stride = " << a_stride << " and b_stride = " << b_stride;
+ EXPECT_EQ(min_ref, min)
+ << "when a_stride = " << a_stride << " and b_stride = " << b_stride;
+ }
+ }
+ aom_free(CONVERT_TO_SHORTPTR(a));
+ aom_free(CONVERT_TO_SHORTPTR(b));
+}
+#endif // CONFIG_AV1_HIGHBITDEPTH
+
+INSTANTIATE_TEST_SUITE_P(C, MinMaxTest, ::testing::Values(&aom_minmax_8x8_c));
+#if CONFIG_AV1_HIGHBITDEPTH
+INSTANTIATE_TEST_SUITE_P(C, HBDMinMaxTest,
+ ::testing::Values(&aom_highbd_minmax_8x8_c));
+#if HAVE_NEON
+INSTANTIATE_TEST_SUITE_P(NEON, HBDMinMaxTest,
+ ::testing::Values(&aom_highbd_minmax_8x8_neon));
+#endif
+#endif
+
+#if HAVE_SSE2
+INSTANTIATE_TEST_SUITE_P(SSE2, MinMaxTest,
+ ::testing::Values(&aom_minmax_8x8_sse2));
+#endif
+
+#if HAVE_NEON
+INSTANTIATE_TEST_SUITE_P(NEON, MinMaxTest,
+ ::testing::Values(&aom_minmax_8x8_neon));
+#endif
+} // namespace
diff --git a/test/mock_ratectrl_qmode.h b/test/mock_ratectrl_qmode.h
deleted file mode 100644
index 9c9e6e8..0000000
--- a/test/mock_ratectrl_qmode.h
+++ /dev/null
@@ -1,47 +0,0 @@
-/*
- * Copyright (c) 2022, Alliance for Open Media. All rights reserved
- *
- * This source code is subject to the terms of the BSD 2 Clause License and
- * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
- * was not distributed with this source code in the LICENSE file, you can
- * obtain it at www.aomedia.org/license/software. If the Alliance for Open
- * Media Patent License 1.0 was not distributed with this source code in the
- * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
- */
-
-#ifndef AOM_TEST_MOCK_RATECTRL_QMODE_H_
-#define AOM_TEST_MOCK_RATECTRL_QMODE_H_
-
-#include "av1/qmode_rc/ratectrl_qmode_interface.h"
-#include "third_party/googletest/src/googlemock/include/gmock/gmock.h"
-
-namespace aom {
-
-class MockRateControlQMode : public AV1RateControlQModeInterface {
- public:
- MOCK_METHOD(Status, SetRcParam, (const RateControlParam &rc_param),
- (override));
- MOCK_METHOD(StatusOr<GopStructList>, DetermineGopInfo,
- (const FirstpassInfo &firstpass_info), (override));
- MOCK_METHOD(StatusOr<GopEncodeInfo>, GetGopEncodeInfo,
- (const GopStruct &gop_struct, const TplGopStats &tpl_gop_stats,
- const std::vector<LookaheadStats> &lookahead_stats,
- const RefFrameTable &ref_frame_table_snapshot_init),
- (override));
- MOCK_METHOD(StatusOr<GopEncodeInfo>, GetGopEncodeInfo,
- (const GopStruct &gop_struct, const TplGopStats &tpl_gop_stats,
- const std::vector<LookaheadStats> &lookahead_stats,
- const FirstpassInfo &firstpass_info,
- const RefFrameTable &ref_frame_table_snapshot_init),
- (override));
- MOCK_METHOD(StatusOr<GopEncodeInfo>, GetTplPassGopEncodeInfo,
- (const GopStruct &gop_struct), (override));
- MOCK_METHOD(StatusOr<GopEncodeInfo>, GetTplPassGopEncodeInfo,
- (const GopStruct &gop_struct,
- const FirstpassInfo &firstpass_info),
- (override));
-};
-
-} // namespace aom
-
-#endif // AOM_TEST_MOCK_RATECTRL_QMODE_H_
diff --git a/test/noise_model_test.cc b/test/noise_model_test.cc
index e9cf9e2..650af79 100644
--- a/test/noise_model_test.cc
+++ b/test/noise_model_test.cc
@@ -36,7 +36,6 @@
return sigma * (u * sqrt(-2.0 * log(s) / s));
}
}
- return 0;
}
// Synthesizes noise using the auto-regressive filter of the given lag,
@@ -625,20 +624,20 @@
TYPED_TEST_P(NoiseModelUpdateTest, UpdateSuccessForWhiteRandomNoise) {
aom_noise_model_t &model = this->model_;
- const int kWidth = this->kWidth;
- const int kHeight = this->kHeight;
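+ // Local copies use lowercase names to avoid shadowing the fixture's
+ // kWidth/kHeight constants.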
+ const int width = this->kWidth;
+ const int height = this->kHeight;
const int shift = this->kBitDepth - 8;
- for (int y = 0; y < kHeight; ++y) {
- for (int x = 0; x < kWidth; ++x) {
- this->data_ptr_[0][y * kWidth + x] =
- int(64 + y + randn(&this->random_, 1)) << shift;
- this->denoised_ptr_[0][y * kWidth + x] = (64 + y) << shift;
+ for (int y = 0; y < height; ++y) {
+ for (int x = 0; x < width; ++x) {
+ this->data_ptr_[0][y * width + x] = int(64 + y + randn(&this->random_, 1))
+ << shift;
+ this->denoised_ptr_[0][y * width + x] = (64 + y) << shift;
// Make the chroma planes completely correlated with the Y plane
for (int c = 1; c < 3; ++c) {
- this->data_ptr_[c][y * kWidth + x] = this->data_ptr_[0][y * kWidth + x];
- this->denoised_ptr_[c][y * kWidth + x] =
- this->denoised_ptr_[0][y * kWidth + x];
+ this->data_ptr_[c][y * width + x] = this->data_ptr_[0][y * width + x];
+ this->denoised_ptr_[c][y * width + x] =
+ this->denoised_ptr_[0][y * width + x];
}
}
}
@@ -689,26 +688,26 @@
TYPED_TEST_P(NoiseModelUpdateTest, UpdateSuccessForScaledWhiteNoise) {
aom_noise_model_t &model = this->model_;
- const int kWidth = this->kWidth;
- const int kHeight = this->kHeight;
+ const int width = this->kWidth;
+ const int height = this->kHeight;
const double kCoeffEps = 0.055;
const double kLowStd = 1;
const double kHighStd = 4;
const int shift = this->kBitDepth - 8;
- for (int y = 0; y < kHeight; ++y) {
- for (int x = 0; x < kWidth; ++x) {
+ for (int y = 0; y < height; ++y) {
+ for (int x = 0; x < width; ++x) {
for (int c = 0; c < 3; ++c) {
// The image data is bimodal:
// Bottom half has low intensity and low noise strength
// Top half has high intensity and high noise strength
- const int avg = (y < kHeight / 2) ? 4 : 245;
- const double std = (y < kHeight / 2) ? kLowStd : kHighStd;
- this->data_ptr_[c][y * kWidth + x] =
+ const int avg = (y < height / 2) ? 4 : 245;
+ const double std = (y < height / 2) ? kLowStd : kHighStd;
+ this->data_ptr_[c][y * width + x] =
((uint8_t)std::min((int)255,
(int)(2 + avg + randn(&this->random_, std))))
<< shift;
- this->denoised_ptr_[c][y * kWidth + x] = (2 + avg) << shift;
+ this->denoised_ptr_[c][y * width + x] = (2 + avg) << shift;
}
}
}
@@ -766,8 +765,8 @@
TYPED_TEST_P(NoiseModelUpdateTest, UpdateSuccessForCorrelatedNoise) {
aom_noise_model_t &model = this->model_;
- const int kWidth = this->kWidth;
- const int kHeight = this->kHeight;
+ const int width = this->kWidth;
+ const int height = this->kHeight;
const int kNumCoeffs = 24;
const double kStd = 4;
const double kStdEps = 0.3;
@@ -797,16 +796,16 @@
const int shift = this->kBitDepth - 8;
for (int c = 0; c < 3; ++c) {
noise_synth(&this->random_, model.params.lag, model.n, model.coords,
- kCoeffs[c], this->noise_ptr_[c], kWidth, kHeight);
+ kCoeffs[c], this->noise_ptr_[c], width, height);
const int x_shift = c > 0 ? this->chroma_sub_[0] : 0;
const int y_shift = c > 0 ? this->chroma_sub_[1] : 0;
- for (int y = 0; y < (kHeight >> y_shift); ++y) {
- for (int x = 0; x < (kWidth >> x_shift); ++x) {
+ for (int y = 0; y < (height >> y_shift); ++y) {
+ for (int x = 0; x < (width >> x_shift); ++x) {
const uint8_t value = 64 + x / 2 + y / 4;
- this->data_ptr_[c][y * kWidth + x] =
- (uint8_t(value + this->noise_ptr_[c][y * kWidth + x] * kStd))
+ this->data_ptr_[c][y * width + x] =
+ (uint8_t(value + this->noise_ptr_[c][y * width + x] * kStd))
<< shift;
- this->denoised_ptr_[c][y * kWidth + x] = value << shift;
+ this->denoised_ptr_[c][y * width + x] = value << shift;
}
}
}
@@ -830,10 +829,10 @@
model.latest_state[c].eqns.x, kCoeffs[c], kNumCoeffs));
noise_synth(&this->random_, model.params.lag, model.n, model.coords,
- model.latest_state[c].eqns.x, &this->renoise_[0], kWidth,
- kHeight);
+ model.latest_state[c].eqns.x, &this->renoise_[0], width,
+ height);
- EXPECT_TRUE(aom_noise_data_validate(&this->renoise_[0], kWidth, kHeight));
+ EXPECT_TRUE(aom_noise_data_validate(&this->renoise_[0], width, height));
}
// Check fitted noise strength
@@ -850,15 +849,15 @@
TYPED_TEST_P(NoiseModelUpdateTest,
NoiseStrengthChangeSignalsDifferentNoiseType) {
aom_noise_model_t &model = this->model_;
- const int kWidth = this->kWidth;
- const int kHeight = this->kHeight;
- const int kBlockSize = this->kBlockSize;
+ const int width = this->kWidth;
+ const int height = this->kHeight;
+ const int block_size = this->kBlockSize;
// Create a gradient image with std = 2 uncorrelated noise
const double kStd = 2;
const int shift = this->kBitDepth - 8;
- for (int i = 0; i < kWidth * kHeight; ++i) {
- const uint8_t val = (i % kWidth) < kWidth / 2 ? 64 : 192;
+ for (int i = 0; i < width * height; ++i) {
+ const uint8_t val = (i % width) < width / 2 ? 64 : 192;
for (int c = 0; c < 3; ++c) {
this->noise_ptr_[c][i] = randn(&this->random_, 1);
this->data_ptr_[c][i] = ((uint8_t)(this->noise_ptr_[c][i] * kStd + val))
@@ -869,7 +868,7 @@
this->flat_blocks_.assign(this->flat_blocks_.size(), 1);
EXPECT_EQ(AOM_NOISE_STATUS_OK, this->NoiseModelUpdate());
- const int kNumBlocks = kWidth * kHeight / kBlockSize / kBlockSize;
+ const int kNumBlocks = width * height / block_size / block_size;
EXPECT_EQ(kNumBlocks, model.latest_state[0].strength_solver.num_equations);
EXPECT_EQ(kNumBlocks, model.latest_state[1].strength_solver.num_equations);
EXPECT_EQ(kNumBlocks, model.latest_state[2].strength_solver.num_equations);
@@ -878,8 +877,8 @@
EXPECT_EQ(kNumBlocks, model.combined_state[2].strength_solver.num_equations);
// Bump up noise by an insignificant amount
- for (int i = 0; i < kWidth * kHeight; ++i) {
- const uint8_t val = (i % kWidth) < kWidth / 2 ? 64 : 192;
+ for (int i = 0; i < width * height; ++i) {
+ const uint8_t val = (i % width) < width / 2 ? 64 : 192;
this->data_ptr_[0][i] =
((uint8_t)(this->noise_ptr_[0][i] * (kStd + 0.085) + val)) << shift;
}
@@ -899,9 +898,9 @@
// Bump up the noise strength on half the image for one channel by a
// significant amount.
- for (int i = 0; i < kWidth * kHeight; ++i) {
- const uint8_t val = (i % kWidth) < kWidth / 2 ? 64 : 128;
- if (i % kWidth < kWidth / 2) {
+ for (int i = 0; i < width * height; ++i) {
+ const uint8_t val = (i % width) < width / 2 ? 64 : 128;
+ if (i % width < width / 2) {
this->data_ptr_[0][i] =
((uint8_t)(randn(&this->random_, kStd + 0.5) + val)) << shift;
}
@@ -931,8 +930,8 @@
TYPED_TEST_P(NoiseModelUpdateTest, NoiseCoeffsSignalsDifferentNoiseType) {
aom_noise_model_t &model = this->model_;
- const int kWidth = this->kWidth;
- const int kHeight = this->kHeight;
+ const int width = this->kWidth;
+ const int height = this->kHeight;
const double kCoeffs[2][24] = {
{ 0.02884, -0.03356, 0.00633, 0.01757, 0.02849, -0.04620,
0.02833, -0.07178, 0.07076, -0.11603, -0.10413, -0.16571,
@@ -945,8 +944,8 @@
};
noise_synth(&this->random_, model.params.lag, model.n, model.coords,
- kCoeffs[0], this->noise_ptr_[0], kWidth, kHeight);
- for (int i = 0; i < kWidth * kHeight; ++i) {
+ kCoeffs[0], this->noise_ptr_[0], width, height);
+ for (int i = 0; i < width * height; ++i) {
this->data_ptr_[0][i] = (uint8_t)(128 + this->noise_ptr_[0][i]);
}
this->flat_blocks_.assign(this->flat_blocks_.size(), 1);
@@ -954,8 +953,8 @@
// Now try with the second set of AR coefficients
noise_synth(&this->random_, model.params.lag, model.n, model.coords,
- kCoeffs[1], this->noise_ptr_[0], kWidth, kHeight);
- for (int i = 0; i < kWidth * kHeight; ++i) {
+ kCoeffs[1], this->noise_ptr_[0], width, height);
+ for (int i = 0; i < width * height; ++i) {
this->data_ptr_[0][i] = (uint8_t)(128 + this->noise_ptr_[0][i]);
}
EXPECT_EQ(AOM_NOISE_STATUS_DIFFERENT_NOISE_TYPE, this->NoiseModelUpdate());
@@ -1313,9 +1312,9 @@
}
TYPED_TEST_P(WienerDenoiseTest, GradientTest) {
- const int kWidth = this->kWidth;
- const int kHeight = this->kHeight;
- const int kBlockSize = this->kBlockSize;
+ const int width = this->kWidth;
+ const int height = this->kHeight;
+ const int block_size = this->kBlockSize;
const uint8_t *const data_ptrs[3] = {
reinterpret_cast<uint8_t *>(&this->data_[0][0]),
reinterpret_cast<uint8_t *>(&this->data_[1][0]),
@@ -1327,34 +1326,33 @@
reinterpret_cast<uint8_t *>(&this->denoised_[2][0]),
};
const int ret = aom_wiener_denoise_2d(
- data_ptrs, denoised_ptrs, kWidth, kHeight, this->stride_,
- this->chroma_sub_, this->noise_psd_ptrs_, this->kBlockSize,
- this->kBitDepth, this->kUseHighBD);
+ data_ptrs, denoised_ptrs, width, height, this->stride_, this->chroma_sub_,
+ this->noise_psd_ptrs_, block_size, this->kBitDepth, this->kUseHighBD);
EXPECT_EQ(1, ret);
// Check the noise on the denoised image (from the analytical gradient)
// and make sure that it is less than what we added.
for (int c = 0; c < 3; ++c) {
- std::vector<double> measured_noise(kWidth * kHeight);
+ std::vector<double> measured_noise(width * height);
double var = 0;
const int shift = (c > 0);
- for (int x = 0; x < (kWidth >> shift); ++x) {
- for (int y = 0; y < (kHeight >> shift); ++y) {
+ for (int x = 0; x < (width >> shift); ++x) {
+ for (int y = 0; y < (height >> shift); ++y) {
const double diff = this->denoised_[c][y * this->stride_[c] + x] -
x * this->kScaleNoise;
var += diff * diff;
- measured_noise[y * kWidth + x] = diff;
+ measured_noise[y * width + x] = diff;
}
}
- var /= (kWidth * kHeight);
+ var /= (width * height);
const double std = sqrt(std::max(0.0, var));
EXPECT_LE(std, 1.25f * this->kScaleNoise);
if (c == 0) {
std::vector<float> measured_psd =
- get_noise_psd(&measured_noise[0], kWidth, kHeight, kBlockSize);
- std::vector<double> measured_psd_d(kBlockSize * kBlockSize);
- std::vector<double> noise_psd_d(kBlockSize * kBlockSize);
+ get_noise_psd(&measured_noise[0], width, height, block_size);
+ std::vector<double> measured_psd_d(block_size * block_size);
+ std::vector<double> noise_psd_d(block_size * block_size);
std::copy(measured_psd.begin(), measured_psd.end(),
measured_psd_d.begin());
std::copy(this->noise_psd_[0].begin(), this->noise_psd_[0].end(),
diff --git a/test/obmc_sad_test.cc b/test/obmc_sad_test.cc
index 9b70366..8d13ac1 100644
--- a/test/obmc_sad_test.cc
+++ b/test/obmc_sad_test.cc
@@ -147,6 +147,37 @@
::testing::ValuesIn(avx2_functions));
#endif // HAVE_AVX2
+#if HAVE_NEON
+const ObmcSadTest::ParamType neon_functions[] = {
+ TestFuncs(aom_obmc_sad128x128_c, aom_obmc_sad128x128_neon),
+ TestFuncs(aom_obmc_sad128x64_c, aom_obmc_sad128x64_neon),
+ TestFuncs(aom_obmc_sad64x128_c, aom_obmc_sad64x128_neon),
+ TestFuncs(aom_obmc_sad64x64_c, aom_obmc_sad64x64_neon),
+ TestFuncs(aom_obmc_sad64x32_c, aom_obmc_sad64x32_neon),
+ TestFuncs(aom_obmc_sad32x64_c, aom_obmc_sad32x64_neon),
+ TestFuncs(aom_obmc_sad32x32_c, aom_obmc_sad32x32_neon),
+ TestFuncs(aom_obmc_sad32x16_c, aom_obmc_sad32x16_neon),
+ TestFuncs(aom_obmc_sad16x32_c, aom_obmc_sad16x32_neon),
+ TestFuncs(aom_obmc_sad16x16_c, aom_obmc_sad16x16_neon),
+ TestFuncs(aom_obmc_sad16x8_c, aom_obmc_sad16x8_neon),
+ TestFuncs(aom_obmc_sad8x16_c, aom_obmc_sad8x16_neon),
+ TestFuncs(aom_obmc_sad8x8_c, aom_obmc_sad8x8_neon),
+ TestFuncs(aom_obmc_sad8x4_c, aom_obmc_sad8x4_neon),
+ TestFuncs(aom_obmc_sad4x8_c, aom_obmc_sad4x8_neon),
+ TestFuncs(aom_obmc_sad4x4_c, aom_obmc_sad4x4_neon),
+
+ TestFuncs(aom_obmc_sad64x16_c, aom_obmc_sad64x16_neon),
+ TestFuncs(aom_obmc_sad16x64_c, aom_obmc_sad16x64_neon),
+ TestFuncs(aom_obmc_sad32x8_c, aom_obmc_sad32x8_neon),
+ TestFuncs(aom_obmc_sad8x32_c, aom_obmc_sad8x32_neon),
+ TestFuncs(aom_obmc_sad16x4_c, aom_obmc_sad16x4_neon),
+ TestFuncs(aom_obmc_sad4x16_c, aom_obmc_sad4x16_neon),
+};
+
+INSTANTIATE_TEST_SUITE_P(NEON, ObmcSadTest,
+ ::testing::ValuesIn(neon_functions));
+#endif // HAVE_NEON
+
#if CONFIG_AV1_HIGHBITDEPTH
////////////////////////////////////////////////////////////////////////////////
// High bit-depth
diff --git a/test/obmc_variance_test.cc b/test/obmc_variance_test.cc
index 03b38f7..b2bf42a 100644
--- a/test/obmc_variance_test.cc
+++ b/test/obmc_variance_test.cc
@@ -127,8 +127,9 @@
const int elapsed_time_simd =
static_cast<int>(aom_usec_timer_elapsed(&test_timer));
- printf("c_time=%d \t simd_time=%d \t gain=%d \n", elapsed_time_c,
- elapsed_time_simd, (elapsed_time_c / elapsed_time_simd));
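+ // Compute the gain in floating point; integer division truncated any
+ // speedup below 2x to 1 (or to 0 when the SIMD path was slower).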
+ printf("c_time=%d \t simd_time=%d \t gain=%f \n", elapsed_time_c,
+ elapsed_time_simd,
+ static_cast<double>(elapsed_time_c) / elapsed_time_simd);
}
#if HAVE_SSE4_1
@@ -193,6 +194,37 @@
::testing::ValuesIn(avx2_functions));
#endif // HAVE_AVX2
+#if HAVE_NEON
+const ObmcVarianceTest::ParamType neon_functions[] = {
+ TestFuncs(aom_obmc_variance128x128_c, aom_obmc_variance128x128_neon),
+ TestFuncs(aom_obmc_variance128x64_c, aom_obmc_variance128x64_neon),
+ TestFuncs(aom_obmc_variance64x128_c, aom_obmc_variance64x128_neon),
+ TestFuncs(aom_obmc_variance64x64_c, aom_obmc_variance64x64_neon),
+ TestFuncs(aom_obmc_variance64x32_c, aom_obmc_variance64x32_neon),
+ TestFuncs(aom_obmc_variance32x64_c, aom_obmc_variance32x64_neon),
+ TestFuncs(aom_obmc_variance32x32_c, aom_obmc_variance32x32_neon),
+ TestFuncs(aom_obmc_variance32x16_c, aom_obmc_variance32x16_neon),
+ TestFuncs(aom_obmc_variance16x32_c, aom_obmc_variance16x32_neon),
+ TestFuncs(aom_obmc_variance16x16_c, aom_obmc_variance16x16_neon),
+ TestFuncs(aom_obmc_variance16x8_c, aom_obmc_variance16x8_neon),
+ TestFuncs(aom_obmc_variance8x16_c, aom_obmc_variance8x16_neon),
+ TestFuncs(aom_obmc_variance8x8_c, aom_obmc_variance8x8_neon),
+ TestFuncs(aom_obmc_variance8x4_c, aom_obmc_variance8x4_neon),
+ TestFuncs(aom_obmc_variance4x8_c, aom_obmc_variance4x8_neon),
+ TestFuncs(aom_obmc_variance4x4_c, aom_obmc_variance4x4_neon),
+
+ TestFuncs(aom_obmc_variance64x16_c, aom_obmc_variance64x16_neon),
+ TestFuncs(aom_obmc_variance16x64_c, aom_obmc_variance16x64_neon),
+ TestFuncs(aom_obmc_variance32x8_c, aom_obmc_variance32x8_neon),
+ TestFuncs(aom_obmc_variance8x32_c, aom_obmc_variance8x32_neon),
+ TestFuncs(aom_obmc_variance16x4_c, aom_obmc_variance16x4_neon),
+ TestFuncs(aom_obmc_variance4x16_c, aom_obmc_variance4x16_neon),
+};
+
+INSTANTIATE_TEST_SUITE_P(NEON, ObmcVarianceTest,
+ ::testing::ValuesIn(neon_functions));
+#endif // HAVE_NEON
+
////////////////////////////////////////////////////////////////////////////////
// High bit-depth
////////////////////////////////////////////////////////////////////////////////
diff --git a/test/quantize_func_test.cc b/test/quantize_func_test.cc
index 6f58898..04e8306 100644
--- a/test/quantize_func_test.cc
+++ b/test/quantize_func_test.cc
@@ -768,7 +768,7 @@
::testing::ValuesIn(kQParamArrayNEON));
#endif
-#if HAVE_SSSE3 && ARCH_X86_64
+#if HAVE_SSSE3 && AOM_ARCH_X86_64
INSTANTIATE_TEST_SUITE_P(
SSSE3, FullPrecisionQuantizeTest,
::testing::Values(
@@ -779,7 +779,7 @@
make_tuple(&aom_quantize_b_64x64_c, &aom_quantize_b_64x64_ssse3,
static_cast<TX_SIZE>(TX_64X64), TYPE_B, AOM_BITS_8)));
-#endif // HAVE_SSSE3 && ARCH_X86_64
+#endif // HAVE_SSSE3 && AOM_ARCH_X86_64
#if HAVE_AVX
INSTANTIATE_TEST_SUITE_P(
diff --git a/test/ratectrl_qmode_test.cc b/test/ratectrl_qmode_test.cc
deleted file mode 100644
index fa0c19a..0000000
--- a/test/ratectrl_qmode_test.cc
+++ /dev/null
@@ -1,1180 +0,0 @@
-/*
- * Copyright (c) 2022, Alliance for Open Media. All rights reserved
- *
- * This source code is subject to the terms of the BSD 2 Clause License and
- * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
- * was not distributed with this source code in the LICENSE file, you can
- * obtain it at www.aomedia.org/license/software. If the Alliance for Open
- * Media Patent License 1.0 was not distributed with this source code in the
- * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
- */
-
-#include "av1/qmode_rc/ratectrl_qmode.h"
-
-#include <algorithm>
-#include <array>
-#include <cerrno>
-#include <cstring>
-#include <fstream>
-#include <memory>
-#include <numeric>
-#include <random>
-#include <string>
-#include <unordered_set>
-#include <vector>
-
-#include "av1/qmode_rc/ducky_encode.h"
-#include "av1/qmode_rc/reference_manager.h"
-#include "test/mock_ratectrl_qmode.h"
-#include "test/video_source.h"
-#include "third_party/googletest/src/googlemock/include/gmock/gmock.h"
-#include "third_party/googletest/src/googletest/include/gtest/gtest.h"
-
-namespace {
-
-using ::testing::HasSubstr;
-
-constexpr int kRefFrameTableSize = 7;
-constexpr int kFrameWidth = 352;
-constexpr int kFrameHeight = 288;
-constexpr int kFrameLimit = 250;
-
-MATCHER(IsOkStatus, "") {
- *result_listener << "with code " << arg.code
- << " and message: " << arg.message;
- return arg.ok();
-}
-
-// Reads a whitespace-delimited string from stream, and parses it as a double.
-// Returns an empty string if the entire string was successfully parsed as a
-// double, or an error messaage if not.
-std::string ReadDouble(std::istream &stream, double *value) {
- std::string word;
- stream >> word;
- if (word.empty()) {
- return "Unexpectedly reached end of input";
- }
- char *end;
- *value = std::strtod(word.c_str(), &end);
- if (*end != '\0') {
- return "Unexpected characters found: " + word;
- }
- return "";
-}
-
-void ReadFirstpassInfo(const std::string &filename,
- aom::FirstpassInfo *firstpass_info,
- const int frame_limit) {
- // These golden files are generated by the following command line:
- // ./aomenc --width=352 --height=288 --fps=30/1 --limit=250 --codec=av1
- // --cpu-used=3 --end-usage=q --cq-level=36 --threads=0 --profile=0
- // --lag-in-frames=35 --min-q=0 --max-q=63 --auto-alt-ref=1 --passes=2
- // --kf-max-dist=160 --kf-min-dist=0 --drop-frame=0
- // --static-thresh=0 --minsection-pct=0 --maxsection-pct=2000
- // --arnr-maxframes=7
- // --arnr-strength=5 --sharpness=0 --undershoot-pct=100 --overshoot-pct=100
- // --frame-parallel=0
- // --tile-columns=0 -o output.webm hantro_collage_w352h288.yuv
- // First pass stats are written out in av1_get_second_pass_params right after
- // calculate_gf_length.
- std::string path = libaom_test::GetDataPath() + "/" + filename;
- std::ifstream firstpass_stats_file(path);
- ASSERT_TRUE(firstpass_stats_file.good())
- << "Error opening " << path << ": " << std::strerror(errno);
- firstpass_info->num_mbs_16x16 =
- (kFrameWidth / 16 + 1) * (kFrameHeight / 16 + 1);
- std::string newline;
- int frame_number = 0;
- while (std::getline(firstpass_stats_file, newline) &&
- frame_number < frame_limit) {
- std::istringstream iss(newline);
- FIRSTPASS_STATS firstpass_stats_input = {};
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.frame), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.weight), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.intra_error), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.frame_avg_wavelet_energy),
- "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.coded_error), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.sr_coded_error), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.pcnt_inter), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.pcnt_motion), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.pcnt_second_ref), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.pcnt_neutral), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.intra_skip_pct), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.inactive_zone_rows), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.inactive_zone_cols), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.MVr), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.mvr_abs), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.MVc), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.mvc_abs), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.MVrv), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.MVcv), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.mv_in_out_count), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.new_mv_count), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.duration), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.count), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.raw_error_stdev), "");
- iss >> firstpass_stats_input.is_flash;
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.noise_var), "");
- ASSERT_EQ(ReadDouble(iss, &firstpass_stats_input.cor_coeff), "");
- ASSERT_TRUE(iss.eof()) << "Too many fields on line "
- << firstpass_info->stats_list.size() + 1 << "\n"
- << newline;
- firstpass_info->stats_list.push_back(firstpass_stats_input);
-
- frame_number++;
- }
-}
-} // namespace
-
-namespace aom {
-
-using ::testing::ElementsAre;
-using ::testing::Field;
-using ::testing::Return;
-
-constexpr double kErrorEpsilon = 0.000001;
-
-void TestGopDisplayOrder(const GopStruct &gop_struct) {
- // Test whether show frames' order indices are sequential
- int expected_order_idx = 0;
- int expected_show_frame_count = 0;
- for (const auto &gop_frame : gop_struct.gop_frame_list) {
- if (gop_frame.is_show_frame) {
- EXPECT_EQ(gop_frame.order_idx, expected_order_idx);
- expected_order_idx++;
- expected_show_frame_count++;
- }
- }
- EXPECT_EQ(gop_struct.show_frame_count, expected_show_frame_count);
-}
-
-void TestGopGlobalOrderIdx(const GopStruct &gop_struct,
- int global_order_idx_offset) {
- // Test whether show frames' global order indices are sequential
- EXPECT_EQ(gop_struct.global_order_idx_offset, global_order_idx_offset);
- int expected_global_order_idx = global_order_idx_offset;
- for (const auto &gop_frame : gop_struct.gop_frame_list) {
- if (gop_frame.is_show_frame) {
- EXPECT_EQ(gop_frame.global_order_idx, expected_global_order_idx);
- expected_global_order_idx++;
- }
- }
-}
-
-void TestGopGlobalCodingIdx(const GopStruct &gop_struct,
- int global_coding_idx_offset) {
- EXPECT_EQ(gop_struct.global_coding_idx_offset, global_coding_idx_offset);
- for (const auto &gop_frame : gop_struct.gop_frame_list) {
- EXPECT_EQ(gop_frame.global_coding_idx,
- global_coding_idx_offset + gop_frame.coding_idx);
- }
-}
-
-void TestColocatedShowFrame(const GopStruct &gop_struct) {
- // Test whether each non show frame has a colocated show frame
- int gop_size = static_cast<int>(gop_struct.gop_frame_list.size());
- for (int gop_idx = 0; gop_idx < gop_size; ++gop_idx) {
- auto &gop_frame = gop_struct.gop_frame_list[gop_idx];
- if (gop_frame.is_show_frame == 0) {
- bool found_colocated_ref_frame = false;
- for (int i = gop_idx + 1; i < gop_size; ++i) {
- auto &next_gop_frame = gop_struct.gop_frame_list[i];
- if (gop_frame.order_idx == next_gop_frame.order_idx) {
- found_colocated_ref_frame = true;
- EXPECT_EQ(gop_frame.update_ref_idx, next_gop_frame.colocated_ref_idx);
- EXPECT_TRUE(next_gop_frame.is_show_frame);
- }
- if (gop_frame.update_ref_idx == next_gop_frame.update_ref_idx) {
- break;
- }
- }
- EXPECT_TRUE(found_colocated_ref_frame);
- }
- }
-}
-
-void TestLayerDepth(const GopStruct &gop_struct, int max_layer_depth) {
- int gop_size = static_cast<int>(gop_struct.gop_frame_list.size());
- for (int gop_idx = 0; gop_idx < gop_size; ++gop_idx) {
- const auto &gop_frame = gop_struct.gop_frame_list[gop_idx];
- if (gop_frame.is_key_frame) {
- EXPECT_EQ(gop_frame.layer_depth, 0);
- }
-
- if (gop_frame.is_arf_frame) {
- EXPECT_LT(gop_frame.layer_depth, max_layer_depth);
- }
-
- if (!gop_frame.is_key_frame && !gop_frame.is_arf_frame) {
- EXPECT_EQ(gop_frame.layer_depth, max_layer_depth);
- }
- }
-}
-
-void TestArfInterval(const GopStruct &gop_struct) {
- std::vector<int> arf_order_idx_list;
- for (const auto &gop_frame : gop_struct.gop_frame_list) {
- if (gop_frame.is_arf_frame) {
- arf_order_idx_list.push_back(gop_frame.order_idx);
- }
- }
- std::sort(arf_order_idx_list.begin(), arf_order_idx_list.end());
- int arf_count = static_cast<int>(arf_order_idx_list.size());
- for (int i = 1; i < arf_count; ++i) {
- int arf_interval = arf_order_idx_list[i] - arf_order_idx_list[i - 1];
- EXPECT_GE(arf_interval, kMinArfInterval);
- }
-}
-
-class RateControlQModeTest : public ::testing::Test {
- protected:
- RateControlQModeTest() {
- rc_param_.max_gop_show_frame_count = 32;
- rc_param_.min_gop_show_frame_count = 4;
- rc_param_.ref_frame_table_size = 7;
- rc_param_.max_ref_frames = 7;
- rc_param_.base_q_index = 128;
- rc_param_.frame_height = kFrameHeight;
- rc_param_.frame_width = kFrameWidth;
- }
-
- RateControlParam rc_param_ = {};
-};
-
-TEST_F(RateControlQModeTest, ConstructGopARF) {
- int show_frame_count = 16;
- const bool has_key_frame = false;
- const int global_coding_idx_offset = 5;
- const int global_order_idx_offset = 20;
- RefFrameManager ref_frame_manager(kRefFrameTableSize, 7);
- GopStruct gop_struct =
- ConstructGop(&ref_frame_manager, show_frame_count, has_key_frame,
- global_coding_idx_offset, global_order_idx_offset);
- EXPECT_EQ(gop_struct.show_frame_count, show_frame_count);
- TestGopDisplayOrder(gop_struct);
- TestGopGlobalOrderIdx(gop_struct, global_order_idx_offset);
- TestGopGlobalCodingIdx(gop_struct, global_coding_idx_offset);
- TestColocatedShowFrame(gop_struct);
- const int max_layer_depth = ref_frame_manager.MaxRefFrame();
- TestLayerDepth(gop_struct, max_layer_depth);
- TestArfInterval(gop_struct);
-}
-
-TEST_F(RateControlQModeTest, ConstructGopKey) {
- const int show_frame_count = 16;
- const bool has_key_frame = true;
- const int global_coding_idx_offset = 10;
- const int global_order_idx_offset = 8;
- RefFrameManager ref_frame_manager(kRefFrameTableSize, 7);
- GopStruct gop_struct =
- ConstructGop(&ref_frame_manager, show_frame_count, has_key_frame,
- global_coding_idx_offset, global_order_idx_offset);
- EXPECT_EQ(gop_struct.show_frame_count, show_frame_count);
- TestGopDisplayOrder(gop_struct);
- TestGopGlobalOrderIdx(gop_struct, global_order_idx_offset);
- TestGopGlobalCodingIdx(gop_struct, global_coding_idx_offset);
- TestColocatedShowFrame(gop_struct);
- const int max_layer_depth = ref_frame_manager.MaxRefFrame();
- TestLayerDepth(gop_struct, max_layer_depth);
- TestArfInterval(gop_struct);
-}
-
-TEST_F(RateControlQModeTest, ConstructShortGop) {
- int show_frame_count = 2;
- const bool has_key_frame = false;
- const int global_coding_idx_offset = 5;
- const int global_order_idx_offset = 20;
- RefFrameManager ref_frame_manager(kRefFrameTableSize, 7);
- GopStruct gop_struct =
- ConstructGop(&ref_frame_manager, show_frame_count, has_key_frame,
- global_coding_idx_offset, global_order_idx_offset);
- EXPECT_EQ(gop_struct.show_frame_count, show_frame_count);
- TestGopDisplayOrder(gop_struct);
- TestGopGlobalOrderIdx(gop_struct, global_order_idx_offset);
- TestGopGlobalCodingIdx(gop_struct, global_coding_idx_offset);
- TestColocatedShowFrame(gop_struct);
- const int max_layer_depth = 1 + kLayerDepthOffset;
- TestLayerDepth(gop_struct, max_layer_depth);
- TestArfInterval(gop_struct);
-}
-
-static TplBlockStats CreateToyTplBlockStats(int h, int w, int r, int c,
- int intra_cost, int inter_cost) {
- TplBlockStats tpl_block_stats = {};
- tpl_block_stats.height = h;
- tpl_block_stats.width = w;
- tpl_block_stats.row = r;
- tpl_block_stats.col = c;
- tpl_block_stats.intra_cost = intra_cost;
- tpl_block_stats.inter_cost = inter_cost;
- tpl_block_stats.ref_frame_index = { -1, -1 };
- return tpl_block_stats;
-}
-
-static TplFrameStats CreateToyTplFrameStatsWithDiffSizes(int min_block_size,
- int max_block_size) {
- TplFrameStats frame_stats;
- const int max_h = max_block_size;
- const int max_w = max_h;
- const int count = max_block_size / min_block_size;
- frame_stats.min_block_size = min_block_size;
- frame_stats.frame_height = max_h * count;
- frame_stats.frame_width = max_w * count;
- frame_stats.rate_dist_present = false;
- for (int i = 0; i < count; ++i) {
- for (int j = 0; j < count; ++j) {
- int h = max_h >> i;
- int w = max_w >> j;
- for (int u = 0; u * h < max_h; ++u) {
- for (int v = 0; v * w < max_w; ++v) {
- int r = max_h * i + h * u;
- int c = max_w * j + w * v;
- int intra_cost = std::rand() % 16;
- TplBlockStats block_stats =
- CreateToyTplBlockStats(h, w, r, c, intra_cost, 0);
- frame_stats.block_stats_list.push_back(block_stats);
- }
- }
- }
- }
- return frame_stats;
-}
-
-static void AugmentTplFrameStatsWithRefFrames(
- TplFrameStats *tpl_frame_stats,
- const std::array<int, kBlockRefCount> &ref_frame_index) {
- for (auto &block_stats : tpl_frame_stats->block_stats_list) {
- block_stats.ref_frame_index = ref_frame_index;
- }
-}
-static void AugmentTplFrameStatsWithMotionVector(
- TplFrameStats *tpl_frame_stats,
- const std::array<MotionVector, kBlockRefCount> &mv) {
- for (auto &block_stats : tpl_frame_stats->block_stats_list) {
- block_stats.mv = mv;
- }
-}
-
-static RefFrameTable CreateToyRefFrameTable(int frame_count) {
- RefFrameTable ref_frame_table(kRefFrameTableSize);
- EXPECT_LE(frame_count, kRefFrameTableSize);
- for (int i = 0; i < frame_count; ++i) {
- ref_frame_table[i] =
- GopFrameBasic(0, 0, i, i, 0, 0, GopFrameType::kRegularLeaf);
- }
- for (int i = frame_count; i < kRefFrameTableSize; ++i) {
- ref_frame_table[i] = GopFrameInvalid();
- }
- return ref_frame_table;
-}
-
-static MotionVector CreateFullpelMv(int row, int col) {
- return { row, col, 0 };
-}
-
-double TplFrameStatsAccumulateIntraCost(const TplFrameStats &frame_stats) {
- double sum = 0;
- for (auto &block_stats : frame_stats.block_stats_list) {
- sum += block_stats.intra_cost;
- }
- return std::max(sum, 1.0);
-}
-
-TEST_F(RateControlQModeTest, CreateTplFrameDepStats) {
- TplFrameStats frame_stats = CreateToyTplFrameStatsWithDiffSizes(8, 16);
- StatusOr<TplFrameDepStats> frame_dep_stats =
- CreateTplFrameDepStatsWithoutPropagation(frame_stats);
- ASSERT_THAT(frame_dep_stats.status(), IsOkStatus());
- EXPECT_EQ(frame_stats.min_block_size, frame_dep_stats->unit_size);
- const int unit_rows = static_cast<int>(frame_dep_stats->unit_stats.size());
- const int unit_cols = static_cast<int>(frame_dep_stats->unit_stats[0].size());
- EXPECT_EQ(frame_stats.frame_height, unit_rows * frame_dep_stats->unit_size);
- EXPECT_EQ(frame_stats.frame_width, unit_cols * frame_dep_stats->unit_size);
- const double intra_cost_sum =
- TplFrameDepStatsAccumulateIntraCost(*frame_dep_stats);
-
- const double expected_intra_cost_sum =
- TplFrameStatsAccumulateIntraCost(frame_stats);
- EXPECT_NEAR(intra_cost_sum, expected_intra_cost_sum, kErrorEpsilon);
-}
-
-TEST_F(RateControlQModeTest, BlockRowNotAMultipleOfMinBlockSizeError) {
- TplFrameStats frame_stats = CreateToyTplFrameStatsWithDiffSizes(8, 16);
- frame_stats.block_stats_list.back().row = 1;
- auto result = CreateTplFrameDepStatsWithoutPropagation(frame_stats);
- EXPECT_FALSE(result.ok());
- EXPECT_THAT(result.status().message, HasSubstr("must be a multiple of 8"));
-}
-
-TEST_F(RateControlQModeTest, BlockPositionOutOfRangeError) {
- TplFrameStats frame_stats = CreateToyTplFrameStatsWithDiffSizes(8, 16);
- frame_stats.block_stats_list.back().row += 8;
- auto result = CreateTplFrameDepStatsWithoutPropagation(frame_stats);
- EXPECT_FALSE(result.ok());
- EXPECT_THAT(result.status().message, HasSubstr("out of range"));
-}
-
-TEST_F(RateControlQModeTest, GetBlockOverlapArea) {
- const int size = 8;
- const int r0 = 8;
- const int c0 = 9;
- std::vector<int> r1 = { 8, 10, 16, 10, 8, 100 };
- std::vector<int> c1 = { 9, 12, 17, 5, 100, 9 };
- std::vector<int> ref_overlap = { 64, 30, 0, 24, 0, 0 };
- for (int i = 0; i < static_cast<int>(r1.size()); ++i) {
- const int overlap0 = GetBlockOverlapArea(r0, c0, r1[i], c1[i], size);
- const int overlap1 = GetBlockOverlapArea(r1[i], c1[i], r0, c0, size);
- EXPECT_EQ(overlap0, ref_overlap[i]);
- EXPECT_EQ(overlap1, ref_overlap[i]);
- }
-}
-
-TEST_F(RateControlQModeTest, TplBlockStatsToDepStats) {
- const int intra_cost = 100;
- const int inter_cost = 120;
- const int unit_count = 2;
- TplBlockStats block_stats =
- CreateToyTplBlockStats(8, 4, 0, 0, intra_cost, inter_cost);
- TplUnitDepStats unit_stats = TplBlockStatsToDepStats(block_stats, unit_count);
- double expected_intra_cost = intra_cost * 1.0 / unit_count;
- EXPECT_NEAR(unit_stats.intra_cost, expected_intra_cost, kErrorEpsilon);
- // When inter_cost >= intra_cost in block_stats, in unit_stats,
- // the inter_cost will be modified so that it's upper-bounded by intra_cost.
- EXPECT_LE(unit_stats.inter_cost, unit_stats.intra_cost);
-}
-
-TEST_F(RateControlQModeTest, TplFrameDepStatsPropagateSingleZeroMotion) {
- // cur frame with coding_idx 1 use ref frame with coding_idx 0
- const std::array<int, kBlockRefCount> ref_frame_index = { 0, -1 };
- TplFrameStats frame_stats = CreateToyTplFrameStatsWithDiffSizes(8, 16);
- AugmentTplFrameStatsWithRefFrames(&frame_stats, ref_frame_index);
-
- TplGopDepStats gop_dep_stats;
- const int frame_count = 2;
- // ref frame with coding_idx 0
- TplFrameDepStats frame_dep_stats0 =
- CreateTplFrameDepStats(frame_stats.frame_height, frame_stats.frame_width,
- frame_stats.min_block_size);
- gop_dep_stats.frame_dep_stats_list.push_back(frame_dep_stats0);
-
- // cur frame with coding_idx 1
- const StatusOr<TplFrameDepStats> frame_dep_stats1 =
- CreateTplFrameDepStatsWithoutPropagation(frame_stats);
- ASSERT_THAT(frame_dep_stats1.status(), IsOkStatus());
- gop_dep_stats.frame_dep_stats_list.push_back(std::move(*frame_dep_stats1));
-
- const RefFrameTable ref_frame_table = CreateToyRefFrameTable(frame_count);
- TplFrameDepStatsPropagate(/*coding_idx=*/1, ref_frame_table, &gop_dep_stats);
-
- // cur frame with coding_idx 1
- const double expected_propagation_sum =
- TplFrameStatsAccumulateIntraCost(frame_stats);
-
- // ref frame with coding_idx 0
- const double propagation_sum =
- TplFrameDepStatsAccumulate(gop_dep_stats.frame_dep_stats_list[0]);
-
- // The propagation_sum between coding_idx 0 and coding_idx 1 should be equal
- // because every block in cur frame has zero motion, use ref frame with
- // coding_idx 0 for prediction, and ref frame itself is empty.
- EXPECT_NEAR(propagation_sum, expected_propagation_sum, kErrorEpsilon);
-}
-
-TEST_F(RateControlQModeTest, TplFrameDepStatsPropagateCompoundZeroMotion) {
- // cur frame with coding_idx 2 use two ref frames with coding_idx 0 and 1
- const std::array<int, kBlockRefCount> ref_frame_index = { 0, 1 };
- TplFrameStats frame_stats = CreateToyTplFrameStatsWithDiffSizes(8, 16);
- AugmentTplFrameStatsWithRefFrames(&frame_stats, ref_frame_index);
-
- TplGopDepStats gop_dep_stats;
- const int frame_count = 3;
- // ref frame with coding_idx 0
- const TplFrameDepStats frame_dep_stats0 =
- CreateTplFrameDepStats(frame_stats.frame_height, frame_stats.frame_width,
- frame_stats.min_block_size);
- gop_dep_stats.frame_dep_stats_list.push_back(frame_dep_stats0);
-
- // ref frame with coding_idx 1
- const TplFrameDepStats frame_dep_stats1 =
- CreateTplFrameDepStats(frame_stats.frame_height, frame_stats.frame_width,
- frame_stats.min_block_size);
- gop_dep_stats.frame_dep_stats_list.push_back(frame_dep_stats1);
-
- // cur frame with coding_idx 2
- const StatusOr<TplFrameDepStats> frame_dep_stats2 =
- CreateTplFrameDepStatsWithoutPropagation(frame_stats);
- ASSERT_THAT(frame_dep_stats2.status(), IsOkStatus());
- gop_dep_stats.frame_dep_stats_list.push_back(std::move(*frame_dep_stats2));
-
- const RefFrameTable ref_frame_table = CreateToyRefFrameTable(frame_count);
- TplFrameDepStatsPropagate(/*coding_idx=*/2, ref_frame_table, &gop_dep_stats);
-
- // cur frame with coding_idx 1
- const double expected_ref_sum = TplFrameStatsAccumulateIntraCost(frame_stats);
-
- // ref frame with coding_idx 0
- const double cost_sum0 =
- TplFrameDepStatsAccumulate(gop_dep_stats.frame_dep_stats_list[0]);
- EXPECT_NEAR(cost_sum0, expected_ref_sum * 0.5, kErrorEpsilon);
-
- // ref frame with coding_idx 1
- const double cost_sum1 =
- TplFrameDepStatsAccumulate(gop_dep_stats.frame_dep_stats_list[1]);
- EXPECT_NEAR(cost_sum1, expected_ref_sum * 0.5, kErrorEpsilon);
-}
-
-TEST_F(RateControlQModeTest, TplFrameDepStatsPropagateSingleWithMotion) {
- // cur frame with coding_idx 1 use ref frame with coding_idx 0
- const std::array<int, kBlockRefCount> ref_frame_index = { 0, -1 };
- const int min_block_size = 8;
- TplFrameStats frame_stats =
- CreateToyTplFrameStatsWithDiffSizes(min_block_size, min_block_size * 2);
- AugmentTplFrameStatsWithRefFrames(&frame_stats, ref_frame_index);
-
- const int mv_row = min_block_size / 2;
- const int mv_col = min_block_size / 4;
- const double r_ratio = 1.0 / 2;
- const double c_ratio = 1.0 / 4;
- std::array<MotionVector, kBlockRefCount> mv;
- mv[0] = CreateFullpelMv(mv_row, mv_col);
- mv[1] = CreateFullpelMv(0, 0);
- AugmentTplFrameStatsWithMotionVector(&frame_stats, mv);
-
- TplGopDepStats gop_dep_stats;
- const int frame_count = 2;
- // ref frame with coding_idx 0
- gop_dep_stats.frame_dep_stats_list.push_back(
- CreateTplFrameDepStats(frame_stats.frame_height, frame_stats.frame_width,
- frame_stats.min_block_size));
-
- // cur frame with coding_idx 1
- const StatusOr<TplFrameDepStats> frame_dep_stats =
- CreateTplFrameDepStatsWithoutPropagation(frame_stats);
- ASSERT_THAT(frame_dep_stats.status(), IsOkStatus());
- gop_dep_stats.frame_dep_stats_list.push_back(std::move(*frame_dep_stats));
-
- const RefFrameTable ref_frame_table = CreateToyRefFrameTable(frame_count);
- TplFrameDepStatsPropagate(/*coding_idx=*/1, ref_frame_table, &gop_dep_stats);
-
- const auto &dep_stats0 = gop_dep_stats.frame_dep_stats_list[0];
- const auto &dep_stats1 = gop_dep_stats.frame_dep_stats_list[1];
- const int unit_rows = static_cast<int>(dep_stats0.unit_stats.size());
- const int unit_cols = static_cast<int>(dep_stats0.unit_stats[0].size());
- for (int r = 0; r < unit_rows; ++r) {
- for (int c = 0; c < unit_cols; ++c) {
- double ref_value = 0;
- ref_value += (1 - r_ratio) * (1 - c_ratio) *
- dep_stats1.unit_stats[r][c].intra_cost;
- if (r - 1 >= 0) {
- ref_value += r_ratio * (1 - c_ratio) *
- dep_stats1.unit_stats[r - 1][c].intra_cost;
- }
- if (c - 1 >= 0) {
- ref_value += (1 - r_ratio) * c_ratio *
- dep_stats1.unit_stats[r][c - 1].intra_cost;
- }
- if (r - 1 >= 0 && c - 1 >= 0) {
- ref_value +=
- r_ratio * c_ratio * dep_stats1.unit_stats[r - 1][c - 1].intra_cost;
- }
- EXPECT_NEAR(dep_stats0.unit_stats[r][c].propagation_cost, ref_value,
- kErrorEpsilon);
- }
- }
-}
-
-// TODO(jianj): Add tests for non empty lookahead stats.
-TEST_F(RateControlQModeTest, ComputeTplGopDepStats) {
- TplGopStats tpl_gop_stats;
- std::vector<RefFrameTable> ref_frame_table_list;
- GopStruct gop_struct;
- gop_struct.show_frame_count = 3;
- for (int i = 0; i < 3; i++) {
- // Use the previous frame as reference
- const std::array<int, kBlockRefCount> ref_frame_index = { i - 1, -1 };
- int min_block_size = 8;
- TplFrameStats frame_stats =
- CreateToyTplFrameStatsWithDiffSizes(min_block_size, min_block_size * 2);
- AugmentTplFrameStatsWithRefFrames(&frame_stats, ref_frame_index);
- tpl_gop_stats.frame_stats_list.push_back(frame_stats);
-
- ref_frame_table_list.push_back(CreateToyRefFrameTable(i));
- }
- const StatusOr<TplGopDepStats> gop_dep_stats =
- ComputeTplGopDepStats(tpl_gop_stats, {}, ref_frame_table_list);
- ASSERT_THAT(gop_dep_stats.status(), IsOkStatus());
-
- double expected_sum = 0;
- for (int i = 2; i >= 0; i--) {
- // Due to the linear propagation with zero motion, we can accumulate the
- // frame_stats intra_cost and use it as expected sum for dependency stats
- expected_sum +=
- TplFrameStatsAccumulateIntraCost(tpl_gop_stats.frame_stats_list[i]);
- const double sum =
- TplFrameDepStatsAccumulate(gop_dep_stats->frame_dep_stats_list[i]);
- EXPECT_NEAR(sum, expected_sum, kErrorEpsilon);
- break;
- }
-}
-
-TEST(RefFrameManagerTest, GetRefFrameCount) {
- const std::vector<int> order_idx_list = { 0, 4, 2, 1, 2, 3, 4 };
- const std::vector<GopFrameType> type_list = {
- GopFrameType::kRegularKey,
- GopFrameType::kRegularArf,
- GopFrameType::kIntermediateArf,
- GopFrameType::kRegularLeaf,
- GopFrameType::kIntermediateOverlay,
- GopFrameType::kRegularLeaf,
- GopFrameType::kOverlay
- };
- RefFrameManager ref_manager(kRefFrameTableSize, 7);
- int coding_idx = 0;
- const int first_leaf_idx = 3;
- EXPECT_EQ(type_list[first_leaf_idx], GopFrameType::kRegularLeaf);
- // Update the reference frame table until we see the first kRegularLeaf frame
- for (; coding_idx <= first_leaf_idx; ++coding_idx) {
- GopFrame gop_frame =
- GopFrameBasic(0, 0, coding_idx, order_idx_list[coding_idx], 0, 0,
- type_list[coding_idx]);
- ref_manager.UpdateRefFrameTable(&gop_frame);
- }
- EXPECT_EQ(ref_manager.GetRefFrameCount(), 4);
- EXPECT_EQ(ref_manager.GetRefFrameCountByType(RefUpdateType::kForward), 2);
- EXPECT_EQ(ref_manager.GetRefFrameCountByType(RefUpdateType::kBackward), 1);
- EXPECT_EQ(ref_manager.GetRefFrameCountByType(RefUpdateType::kLast), 1);
- EXPECT_EQ(ref_manager.CurGlobalOrderIdx(), 1);
-
- // Update the reference frame table until we see the first kShowExisting frame
- const int first_show_existing_idx = 4;
- EXPECT_EQ(type_list[first_show_existing_idx],
- GopFrameType::kIntermediateOverlay);
- for (; coding_idx <= first_show_existing_idx; ++coding_idx) {
- GopFrame gop_frame =
- GopFrameBasic(0, 0, coding_idx, order_idx_list[coding_idx], 0, 0,
- type_list[coding_idx]);
- ref_manager.UpdateRefFrameTable(&gop_frame);
- }
- EXPECT_EQ(ref_manager.GetRefFrameCount(), 4);
- EXPECT_EQ(ref_manager.CurGlobalOrderIdx(), 2);
- // After the first kShowExisting, the kIntermediateArf should be moved from
- // kForward to kBackward due to the cur_global_order_idx_ update
- EXPECT_EQ(ref_manager.GetRefFrameCountByType(RefUpdateType::kForward), 1);
- EXPECT_EQ(ref_manager.GetRefFrameCountByType(RefUpdateType::kBackward), 2);
- EXPECT_EQ(ref_manager.GetRefFrameCountByType(RefUpdateType::kLast), 1);
-
- const int second_leaf_idx = 5;
- EXPECT_EQ(type_list[second_leaf_idx], GopFrameType::kRegularLeaf);
- for (; coding_idx <= second_leaf_idx; ++coding_idx) {
- GopFrame gop_frame =
- GopFrameBasic(0, 0, coding_idx, order_idx_list[coding_idx], 0, 0,
- type_list[coding_idx]);
- ref_manager.UpdateRefFrameTable(&gop_frame);
- }
- EXPECT_EQ(ref_manager.GetRefFrameCount(), 5);
- EXPECT_EQ(ref_manager.CurGlobalOrderIdx(), 3);
- // An additional kRegularLeaf frame is added to kLast
- EXPECT_EQ(ref_manager.GetRefFrameCountByType(RefUpdateType::kForward), 1);
- EXPECT_EQ(ref_manager.GetRefFrameCountByType(RefUpdateType::kBackward), 2);
- EXPECT_EQ(ref_manager.GetRefFrameCountByType(RefUpdateType::kLast), 2);
-
- const int first_overlay_idx = 6;
- EXPECT_EQ(type_list[first_overlay_idx], GopFrameType::kOverlay);
- for (; coding_idx <= first_overlay_idx; ++coding_idx) {
- GopFrame gop_frame =
- GopFrameBasic(0, 0, coding_idx, order_idx_list[coding_idx], 0, 0,
- type_list[coding_idx]);
- ref_manager.UpdateRefFrameTable(&gop_frame);
- }
-
- EXPECT_EQ(ref_manager.GetRefFrameCount(), 5);
- EXPECT_EQ(ref_manager.CurGlobalOrderIdx(), 4);
- // After the kOverlay, the kRegularArf should be moved from
- // kForward to kBackward due to the cur_global_order_idx_ update
- EXPECT_EQ(ref_manager.GetRefFrameCountByType(RefUpdateType::kForward), 0);
- EXPECT_EQ(ref_manager.GetRefFrameCountByType(RefUpdateType::kBackward), 3);
- EXPECT_EQ(ref_manager.GetRefFrameCountByType(RefUpdateType::kLast), 2);
-}
-
-void TestRefFrameManagerPriority(const RefFrameManager &ref_manager,
- RefUpdateType type) {
- int ref_count = ref_manager.GetRefFrameCountByType(type);
- int prev_global_order_idx = ref_manager.CurGlobalOrderIdx();
- // The lower the priority, the closer gop_frame.global_order_idx should be
- // to cur_global_order_idx_, with the exception of a base layer ARF.
- for (int priority = 0; priority < ref_count; ++priority) {
- GopFrame gop_frame = ref_manager.GetRefFrameByPriority(type, priority);
- EXPECT_EQ(gop_frame.is_valid, true);
- if (type == RefUpdateType::kForward) {
- if (priority == 0) continue;
- EXPECT_GE(gop_frame.global_order_idx, prev_global_order_idx);
- } else {
- EXPECT_LE(gop_frame.global_order_idx, prev_global_order_idx);
- }
- prev_global_order_idx = gop_frame.global_order_idx;
- }
- GopFrame gop_frame =
- ref_manager.GetRefFrameByPriority(RefUpdateType::kForward, ref_count);
- EXPECT_EQ(gop_frame.is_valid, false);
-}
-
-TEST(RefFrameManagerTest, GetRefFrameByPriority) {
- const std::vector<int> order_idx_list = { 0, 4, 2, 1, 2, 3, 4 };
- const std::vector<GopFrameType> type_list = {
- GopFrameType::kRegularKey,
- GopFrameType::kRegularArf,
- GopFrameType::kIntermediateArf,
- GopFrameType::kRegularLeaf,
- GopFrameType::kIntermediateOverlay,
- GopFrameType::kRegularLeaf,
- GopFrameType::kOverlay
- };
- RefFrameManager ref_manager(kRefFrameTableSize, 7);
- int coding_idx = 0;
- const int first_leaf_idx = 3;
- EXPECT_EQ(type_list[first_leaf_idx], GopFrameType::kRegularLeaf);
- // Update the reference frame table until we see the first kRegularLeaf frame
- for (; coding_idx <= first_leaf_idx; ++coding_idx) {
- GopFrame gop_frame =
- GopFrameBasic(0, 0, coding_idx, order_idx_list[coding_idx], 0, 0,
- type_list[coding_idx]);
- ref_manager.UpdateRefFrameTable(&gop_frame);
- }
- EXPECT_EQ(ref_manager.GetRefFrameCountByType(RefUpdateType::kForward), 2);
- TestRefFrameManagerPriority(ref_manager, RefUpdateType::kForward);
-
- const int first_overlay_idx = 6;
- EXPECT_EQ(type_list[first_overlay_idx], GopFrameType::kOverlay);
- for (; coding_idx <= first_overlay_idx; ++coding_idx) {
- GopFrame gop_frame =
- GopFrameBasic(0, 0, coding_idx, order_idx_list[coding_idx], 0, 0,
- type_list[coding_idx]);
- ref_manager.UpdateRefFrameTable(&gop_frame);
- }
-
- EXPECT_EQ(ref_manager.GetRefFrameCountByType(RefUpdateType::kBackward), 3);
- TestRefFrameManagerPriority(ref_manager, RefUpdateType::kBackward);
- EXPECT_EQ(ref_manager.GetRefFrameCountByType(RefUpdateType::kLast), 2);
- TestRefFrameManagerPriority(ref_manager, RefUpdateType::kLast);
-}
-
-TEST(RefFrameManagerTest, GetRefFrameListByPriority) {
- const std::vector<int> order_idx_list = { 0, 4, 2, 1 };
- const int frame_count = static_cast<int>(order_idx_list.size());
- const std::vector<GopFrameType> type_list = { GopFrameType::kRegularKey,
- GopFrameType::kRegularArf,
- GopFrameType::kIntermediateArf,
- GopFrameType::kRegularLeaf };
- RefFrameManager ref_manager(kRefFrameTableSize, 7);
- for (int coding_idx = 0; coding_idx < frame_count; ++coding_idx) {
- GopFrame gop_frame =
- GopFrameBasic(0, 0, coding_idx, order_idx_list[coding_idx], 0, 0,
- type_list[coding_idx]);
- ref_manager.UpdateRefFrameTable(&gop_frame);
- }
- EXPECT_EQ(ref_manager.GetRefFrameCount(), frame_count);
- EXPECT_EQ(ref_manager.GetRefFrameCountByType(RefUpdateType::kForward), 2);
- EXPECT_EQ(ref_manager.GetRefFrameCountByType(RefUpdateType::kBackward), 1);
- EXPECT_EQ(ref_manager.GetRefFrameCountByType(RefUpdateType::kLast), 1);
- std::vector<ReferenceFrame> ref_frame_list =
- ref_manager.GetRefFrameListByPriority();
- EXPECT_EQ(ref_frame_list.size(), order_idx_list.size());
- std::vector<int> expected_global_order_idx = { 4, 0, 1, 2 };
- std::vector<ReferenceName> expected_names = { ReferenceName::kAltrefFrame,
- ReferenceName::kGoldenFrame,
- ReferenceName::kLastFrame,
- ReferenceName::kBwdrefFrame };
- for (size_t i = 0; i < ref_frame_list.size(); ++i) {
- ReferenceFrame &ref_frame = ref_frame_list[i];
- GopFrame gop_frame = ref_manager.GetRefFrameByIndex(ref_frame.index);
- EXPECT_EQ(gop_frame.global_order_idx, expected_global_order_idx[i]);
- EXPECT_EQ(ref_frame.name, expected_names[i]);
- }
-}
-
-TEST(RefFrameManagerTest, GetPrimaryRefFrame) {
- const std::vector<int> order_idx_list = { 0, 4, 2, 1 };
- const int frame_count = static_cast<int>(order_idx_list.size());
- const std::vector<GopFrameType> type_list = { GopFrameType::kRegularKey,
- GopFrameType::kRegularArf,
- GopFrameType::kIntermediateArf,
- GopFrameType::kRegularLeaf };
- const std::vector<int> layer_depth_list = { 0, 2, 4, 6 };
- RefFrameManager ref_manager(kRefFrameTableSize, 7);
- for (int coding_idx = 0; coding_idx < frame_count; ++coding_idx) {
- GopFrame gop_frame =
- GopFrameBasic(0, 0, coding_idx, order_idx_list[coding_idx],
- layer_depth_list[coding_idx], 0, type_list[coding_idx]);
- ref_manager.UpdateRefFrameTable(&gop_frame);
- }
-
- for (int i = 0; i < frame_count; ++i) {
- // Test a frame that shares the same layer depth with a reference frame
- int layer_depth = layer_depth_list[i];
- // Set different frame type
- GopFrameType type = type_list[(i + 1) % frame_count];
- GopFrame gop_frame = GopFrameBasic(0, 0, 0, 0, layer_depth, 0, type);
- gop_frame.ref_frame_list = ref_manager.GetRefFrameListByPriority();
- ReferenceFrame ref_frame = ref_manager.GetPrimaryRefFrame(gop_frame);
- GopFrame primary_ref_frame =
- ref_manager.GetRefFrameByIndex(ref_frame.index);
- // GetPrimaryRefFrame should find the ref_frame with a matching layer depth
- // because that is its first priority
- EXPECT_EQ(primary_ref_frame.layer_depth, gop_frame.layer_depth);
- }
-
- const std::vector<int> mid_layer_depth_list = { 1, 3, 5 };
- for (int i = 0; i < 3; ++i) {
- // Test a frame that shares the same frame type with a reference frame
- GopFrameType type = type_list[i];
- // Let the frame layer_depth sit midway between two reference frames
- int layer_depth = mid_layer_depth_list[i];
- GopFrame gop_frame = GopFrameBasic(0, 0, 0, 0, layer_depth, 0, type);
- gop_frame.ref_frame_list = ref_manager.GetRefFrameListByPriority();
- ReferenceFrame ref_frame = ref_manager.GetPrimaryRefFrame(gop_frame);
- GopFrame primary_ref_frame =
- ref_manager.GetRefFrameByIndex(ref_frame.index);
- // GetPrimaryRefFrame should find the ref_frame with a matching frame type.
- // Here we use coding_idx to confirm that.
- EXPECT_EQ(primary_ref_frame.coding_idx, i);
- }
-}
-
-TEST_F(RateControlQModeTest, TestKeyframeDetection) {
- FirstpassInfo firstpass_info;
- const std::string kFirstpassStatsFile = "firstpass_stats";
- ASSERT_NO_FATAL_FAILURE(
- ReadFirstpassInfo(kFirstpassStatsFile, &firstpass_info, kFrameLimit));
- EXPECT_THAT(GetKeyFrameList(firstpass_info),
- ElementsAre(0, 30, 60, 90, 120, 150, 180, 210, 240));
-}
-
-MATCHER_P(GopFrameMatches, expected, "") {
-#define COMPARE_FIELD(FIELD) \
- do { \
- if (arg.FIELD != expected.FIELD) { \
- *result_listener << "where " #FIELD " is " << arg.FIELD \
- << " but should be " << expected.FIELD; \
- return false; \
- } \
- } while (0)
- COMPARE_FIELD(is_valid);
- COMPARE_FIELD(order_idx);
- COMPARE_FIELD(coding_idx);
- COMPARE_FIELD(global_order_idx);
- COMPARE_FIELD(global_coding_idx);
- COMPARE_FIELD(is_key_frame);
- COMPARE_FIELD(is_arf_frame);
- COMPARE_FIELD(is_show_frame);
- COMPARE_FIELD(is_golden_frame);
- COMPARE_FIELD(colocated_ref_idx);
- COMPARE_FIELD(update_ref_idx);
- COMPARE_FIELD(layer_depth);
-#undef COMPARE_FIELD
-
- return true;
-}
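GopFrameMatches wraps the field-by-field comparison in a gmock MATCHER_P so a mismatch reports the offending field by name (e.g. "where coding_idx is 3 but should be 5") instead of an opaque struct comparison. A usage sketch, assuming the frames and table are in scope:

// Compare a single frame, or a whole reference table element-wise.
EXPECT_THAT(actual_frame, GopFrameMatches(expected_frame));
EXPECT_THAT(actual_table,
            ElementsAre(GopFrameMatches(frame_a), GopFrameMatches(frame_b)));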
-
-// Helper for tests which need to set update_ref_idx, but for which the indices
-// and depth don't matter (other than to allow creating multiple GopFrames which
-// are distinguishable).
-GopFrame GopFrameUpdateRefIdx(int index, GopFrameType gop_frame_type,
- int update_ref_idx) {
- GopFrame frame =
- GopFrameBasic(0, 0, index, index, /*depth=*/0, 0, gop_frame_type);
- frame.update_ref_idx = update_ref_idx;
- return frame;
-}
-
-TEST_F(RateControlQModeTest, TestInvalidRateControlParam) {
- // Default constructed RateControlParam should not be valid.
- RateControlParam rc_param = {};
- EXPECT_NE(AV1RateControlQMode().SetRcParam(rc_param).code, AOM_CODEC_OK);
-}
-
-TEST_F(RateControlQModeTest, TestInvalidMaxGopShowFrameCount) {
- rc_param_.min_gop_show_frame_count = 2;
- rc_param_.max_gop_show_frame_count = 3;
- Status status = AV1RateControlQMode().SetRcParam(rc_param_);
- EXPECT_EQ(status.code, AOM_CODEC_INVALID_PARAM);
- EXPECT_THAT(status.message,
- HasSubstr("max_gop_show_frame_count (3) must be at least 4"));
-}
-
-TEST_F(RateControlQModeTest, TestInvalidMinGopShowFrameCount) {
- rc_param_.min_gop_show_frame_count = 9;
- rc_param_.max_gop_show_frame_count = 8;
- Status status = AV1RateControlQMode().SetRcParam(rc_param_);
- EXPECT_EQ(status.code, AOM_CODEC_INVALID_PARAM);
- EXPECT_THAT(status.message,
- HasSubstr("may not be less than min_gop_show_frame_count (9)"));
-}
-
-TEST_F(RateControlQModeTest, TestInvalidRefFrameTableSize) {
- rc_param_.ref_frame_table_size = 9;
- Status status = AV1RateControlQMode().SetRcParam(rc_param_);
- EXPECT_EQ(status.code, AOM_CODEC_INVALID_PARAM);
- EXPECT_THAT(status.message,
- HasSubstr("ref_frame_table_size (9) must be in the range"));
-}
-
-TEST_F(RateControlQModeTest, TestInvalidMaxRefFrames) {
- rc_param_.max_ref_frames = 8;
- Status status = AV1RateControlQMode().SetRcParam(rc_param_);
- EXPECT_EQ(status.code, AOM_CODEC_INVALID_PARAM);
- EXPECT_THAT(status.message,
- HasSubstr("max_ref_frames (8) must be in the range"));
-}
-
-TEST_F(RateControlQModeTest, TestInvalidBaseQIndex) {
- rc_param_.base_q_index = 256;
- Status status = AV1RateControlQMode().SetRcParam(rc_param_);
- EXPECT_EQ(status.code, AOM_CODEC_INVALID_PARAM);
- EXPECT_THAT(status.message,
- HasSubstr("base_q_index (256) must be in the range"));
-}
-
-TEST_F(RateControlQModeTest, TestInvalidFrameHeight) {
- rc_param_.frame_height = 15;
- Status status = AV1RateControlQMode().SetRcParam(rc_param_);
- EXPECT_EQ(status.code, AOM_CODEC_INVALID_PARAM);
- EXPECT_THAT(status.message,
- HasSubstr("frame_height (15) must be in the range"));
-}
-
-TEST_F(RateControlQModeTest, TestGetRefFrameTableListFirstGop) {
- AV1RateControlQMode rc;
- rc_param_.ref_frame_table_size = 3;
- ASSERT_THAT(rc.SetRcParam(rc_param_), IsOkStatus());
-
- const auto invalid = GopFrameInvalid();
- const auto frame0 = GopFrameUpdateRefIdx(0, GopFrameType::kRegularKey, -1);
- const auto frame1 = GopFrameUpdateRefIdx(1, GopFrameType::kRegularLeaf, 2);
- const auto frame2 = GopFrameUpdateRefIdx(2, GopFrameType::kRegularLeaf, 0);
-
- const auto matches_invalid = GopFrameMatches(invalid);
- const auto matches_frame0 = GopFrameMatches(frame0);
- const auto matches_frame1 = GopFrameMatches(frame1);
- const auto matches_frame2 = GopFrameMatches(frame2);
-
- GopStruct gop_struct;
- gop_struct.global_coding_idx_offset = 0; // This is the first GOP.
- gop_struct.gop_frame_list = { frame0, frame1, frame2 };
- ASSERT_THAT(
- // For the first GOP only, GetRefFrameTableList can be passed a
- // default-constructed RefFrameTable (because it's all going to be
- // replaced by the key frame anyway).
- rc.GetRefFrameTableList(gop_struct, {}, RefFrameTable()),
- ElementsAre(
- ElementsAre(matches_invalid, matches_invalid, matches_invalid),
- ElementsAre(matches_frame0, matches_frame0, matches_frame0),
- ElementsAre(matches_frame0, matches_frame0, matches_frame1),
- ElementsAre(matches_frame2, matches_frame0, matches_frame1)));
-}
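The four expected rows trace the reference table as each frame is coded. A sketch of the update rule they imply (inferred from this test's expectations, not taken from the library source; it assumes RefFrameTable behaves like std::vector<GopFrame>, as its use in this file suggests):

// Key frames refresh every slot; other frames overwrite only the slot named
// by update_ref_idx, or nothing when it is -1.
RefFrameTable NextTable(RefFrameTable table, const GopFrame &frame) {
  if (frame.is_key_frame) {
    for (GopFrame &slot : table) slot = frame;
  } else if (frame.update_ref_idx >= 0) {
    table[frame.update_ref_idx] = frame;
  }
  return table;
}

Applying this rule to frame0 (key), frame1 (update_ref_idx 2), and frame2 (update_ref_idx 0) reproduces exactly the rows matched above.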
-
-TEST_F(RateControlQModeTest, TestGetRefFrameTableListNotFirstGop) {
- AV1RateControlQMode rc;
- rc_param_.ref_frame_table_size = 3;
- ASSERT_THAT(rc.SetRcParam(rc_param_), IsOkStatus());
-
- const auto previous = GopFrameUpdateRefIdx(0, GopFrameType::kRegularKey, -1);
- const auto frame0 = GopFrameUpdateRefIdx(5, GopFrameType::kRegularLeaf, 2);
- const auto frame1 = GopFrameUpdateRefIdx(6, GopFrameType::kRegularLeaf, -1);
- const auto frame2 = GopFrameUpdateRefIdx(7, GopFrameType::kRegularLeaf, 0);
-
- // Frames in the initial table should have coding_idx of -1
- // to prevent propagating TPL stats to already coded frames.
- auto previous_modified = previous;
- previous_modified.coding_idx = -1;
- const auto matches_previous = GopFrameMatches(previous_modified);
- const auto matches_frame0 = GopFrameMatches(frame0);
- const auto matches_frame2 = GopFrameMatches(frame2);
-
- GopStruct gop_struct;
- gop_struct.global_coding_idx_offset = 5; // This is not the first GOP.
- gop_struct.gop_frame_list = { frame0, frame1, frame2 };
- ASSERT_THAT(
- rc.GetRefFrameTableList(gop_struct, {}, RefFrameTable(3, previous)),
- ElementsAre(
- ElementsAre(matches_previous, matches_previous, matches_previous),
- ElementsAre(matches_previous, matches_previous, matches_frame0),
- ElementsAre(matches_previous, matches_previous, matches_frame0),
- ElementsAre(matches_frame2, matches_previous, matches_frame0)));
-}
-
-TEST_F(RateControlQModeTest, TestGopIntervals) {
- FirstpassInfo firstpass_info;
- ASSERT_NO_FATAL_FAILURE(
- ReadFirstpassInfo("firstpass_stats", &firstpass_info, kFrameLimit));
- AV1RateControlQMode rc;
- ASSERT_THAT(rc.SetRcParam(rc_param_), IsOkStatus());
-
- const auto gop_info = rc.DetermineGopInfo(firstpass_info);
- ASSERT_THAT(gop_info.status(), IsOkStatus());
- std::vector<int> gop_interval_list;
- std::transform(gop_info->begin(), gop_info->end(),
- std::back_inserter(gop_interval_list),
- [](GopStruct const &x) { return x.show_frame_count; });
- EXPECT_THAT(gop_interval_list,
- ElementsAre(21, 9, 30, 30, 16, 14, 21, 9, 30, 12, 16, 2, 30, 10));
-}
-
-// TODO(b/242892473): Add a test which passes lookahead GOPs.
-TEST_F(RateControlQModeTest, TestGetGopEncodeInfo) {
- FirstpassInfo firstpass_info;
- ASSERT_NO_FATAL_FAILURE(
- ReadFirstpassInfo("firstpass_stats", &firstpass_info, 50));
- AV1RateControlQMode rc;
- rc_param_.max_gop_show_frame_count = 16;
- rc_param_.max_ref_frames = 3;
- rc_param_.base_q_index = 117;
- ASSERT_THAT(rc.SetRcParam(rc_param_), IsOkStatus());
- const auto gop_info = rc.DetermineGopInfo(firstpass_info);
- ASSERT_THAT(gop_info.status(), IsOkStatus());
- const GopStructList &gop_list = *gop_info;
- const aom_rational_t frame_rate = { 30, 1 };
- const aom::VideoInfo input_video = {
- kFrameWidth, kFrameHeight,
- frame_rate, AOM_IMG_FMT_I420,
- 50, libaom_test::GetDataPath() + "/hantro_collage_w352h288.yuv"
- };
- DuckyEncode ducky_encode(input_video, BLOCK_64X64, rc_param_.max_ref_frames,
- 3, rc_param_.base_q_index);
-
- std::vector<aom::GopEncodeInfo> gop_encode_info_list;
- for (const auto &gop_struct : gop_list) {
- const auto gop_encode_info =
- rc.GetTplPassGopEncodeInfo(gop_struct, firstpass_info);
- ASSERT_TRUE(gop_encode_info.ok());
- gop_encode_info_list.push_back(gop_encode_info.value());
- }
-
- // Read TPL stats
- std::vector<TplGopStats> tpl_gop_list = ducky_encode.ComputeTplStats(
- firstpass_info.stats_list, gop_list, gop_encode_info_list);
-
- RefFrameTable ref_frame_table;
- int num_gop_skipped = 0;
- for (size_t gop_idx = 0; gop_idx < gop_list.size(); gop_idx++) {
- size_t tpl_gop_idx = gop_idx - num_gop_skipped;
- const auto gop_encode_info =
- rc.GetGopEncodeInfo(gop_list[gop_idx], tpl_gop_list[tpl_gop_idx], {},
- firstpass_info, ref_frame_table);
- ASSERT_THAT(gop_encode_info.status(), IsOkStatus());
- for (auto &frame_param : gop_encode_info->param_list) {
- EXPECT_LE(frame_param.q_index, rc_param_.base_q_index);
- }
- ref_frame_table = gop_encode_info->final_snapshot;
- for (auto &gop_frame : ref_frame_table) {
- EXPECT_LE(static_cast<int>(gop_frame.ref_frame_list.size()),
- rc_param_.max_ref_frames);
- }
- }
-}
-
-TEST_F(RateControlQModeTest, GetGopEncodeInfoWrongGopSize) {
- GopStruct gop_struct;
- gop_struct.gop_frame_list.assign(7, GopFrameInvalid());
- TplGopStats tpl_gop_stats;
- tpl_gop_stats.frame_stats_list.assign(
- 5, CreateToyTplFrameStatsWithDiffSizes(8, 8));
- AV1RateControlQMode rc;
- const Status status =
- rc.GetGopEncodeInfo(gop_struct, tpl_gop_stats, {}, {}, RefFrameTable())
- .status();
- EXPECT_EQ(status.code, AOM_CODEC_INVALID_PARAM);
- EXPECT_THAT(status.message,
- HasSubstr("Frame count of GopStruct (7) doesn't match frame "
- "count of TPL stats (5)"));
-}
-
-TEST_F(RateControlQModeTest, GetGopEncodeInfoRefFrameMissingBlockStats) {
- GopStruct gop_struct;
- // Frames 0 and 2 are reference frames.
- gop_struct.gop_frame_list = {
- GopFrameUpdateRefIdx(0, GopFrameType::kRegularKey, 1),
- GopFrameUpdateRefIdx(1, GopFrameType::kRegularLeaf, -1),
- GopFrameUpdateRefIdx(2, GopFrameType::kRegularLeaf, 2),
- };
- gop_struct.show_frame_count = 3;
-
- // Only frame 0 has TPL block stats.
- TplGopStats tpl_gop_stats;
- tpl_gop_stats.frame_stats_list.assign(3, { 8, 176, 144, false, {}, {} });
- tpl_gop_stats.frame_stats_list[0] = CreateToyTplFrameStatsWithDiffSizes(8, 8);
-
- AV1RateControlQMode rc;
- const Status status =
- rc.GetGopEncodeInfo(gop_struct, tpl_gop_stats, {}, {}, RefFrameTable())
- .status();
- EXPECT_EQ(status.code, AOM_CODEC_INVALID_PARAM);
- EXPECT_THAT(status.message,
- HasSubstr("The frame with global_coding_idx 2 is a reference "
- "frame, but has no TPL stats"));
-}
-
- // MockRateControlQMode is provided for use by clients of libaom, but it is
- // not expected to be used in any real libaom tests.
-// This simple "toy" test exists solely to verify the integration of gmock into
-// the aom build.
-TEST_F(RateControlQModeTest, TestMock) {
- MockRateControlQMode mock_rc;
- EXPECT_CALL(mock_rc,
- DetermineGopInfo(Field(&FirstpassInfo::num_mbs_16x16, 1000)))
- .WillOnce(Return(aom::Status{ AOM_CODEC_ERROR, "message" }));
- FirstpassInfo firstpass_info = {};
- firstpass_info.num_mbs_16x16 = 1000;
- const auto result = mock_rc.DetermineGopInfo(firstpass_info);
- EXPECT_EQ(result.status().code, AOM_CODEC_ERROR);
- EXPECT_EQ(result.status().message, "message");
-}
-
-TEST_F(RateControlQModeTest, TestKMeans) {
- // The distance between the intended centroids is chosen so that each
- // cluster is far enough from the others.
- std::vector<int> centroids_ref = { 16, 48, 80, 112, 144, 176, 208, 240 };
- std::vector<uint8_t> random_input;
- const int num_sample_per_cluster = 10;
- const int num_clusters = 8;
- std::default_random_engine generator;
- for (const int centroid : centroids_ref) {
- // This keeps each cluster far enough from the others.
- std::uniform_int_distribution<int> distribution(centroid - 8, centroid + 8);
- for (int i = 0; i < num_sample_per_cluster; ++i) {
- const int random_sample = distribution(generator);
- random_input.push_back(static_cast<uint8_t>(random_sample));
- }
- }
- std::shuffle(random_input.begin(), random_input.end(), generator);
- std::unordered_map<int, int> kmeans_result =
- aom::internal::KMeans(random_input, num_clusters);
-
- std::unordered_set<int> found_centroids;
- for (const auto &result : kmeans_result) {
- found_centroids.insert(result.second);
- }
- // Verify there are num_clusters distinct centroids in the k-means result.
- EXPECT_EQ(static_cast<int>(found_centroids.size()), num_clusters);
-
- // Verify that for each data point, the assigned centroid is the closest one.
- for (const auto &result : kmeans_result) {
- const int distance_from_cluster_centroid =
- abs(result.first - result.second);
- for (const int centroid : found_centroids) {
- if (centroid == result.second) continue;
- const int distance_from_other_cluster_centroid =
- abs(result.first - centroid);
- EXPECT_LE(distance_from_cluster_centroid,
- distance_from_other_cluster_centroid);
- }
- }
-}
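The property the test checks (every sample maps to its nearest centroid) is the fixed point of Lloyd's iteration. For reference, a self-contained 1-D sketch of that iteration; this is illustrative only, not aom::internal::KMeans:

#include <cmath>
#include <cstdint>
#include <vector>

// Lloyd's iteration in 1-D: alternate nearest-centroid assignment and
// centroid recomputation. A fixed iteration count stands in for a proper
// convergence check to keep the sketch short.
std::vector<double> KMeans1D(const std::vector<uint8_t> &samples, int k,
                             int iters = 20) {
  std::vector<double> centroids(k);
  for (int i = 0; i < k; ++i) {
    centroids[i] = 255.0 * (i + 0.5) / k;  // spread initial centroids
  }
  for (int iter = 0; iter < iters; ++iter) {
    std::vector<double> sum(k, 0.0);
    std::vector<int> count(k, 0);
    for (const uint8_t s : samples) {
      int best = 0;  // index of the nearest centroid
      for (int i = 1; i < k; ++i) {
        if (std::fabs(s - centroids[i]) < std::fabs(s - centroids[best]))
          best = i;
      }
      sum[best] += s;
      ++count[best];
    }
    for (int i = 0; i < k; ++i) {
      if (count[i] > 0) centroids[i] = sum[i] / count[i];
    }
  }
  return centroids;
}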
-
-} // namespace aom
-
-int main(int argc, char **argv) {
- ::testing::InitGoogleTest(&argc, argv);
- std::srand(0);
- return RUN_ALL_TESTS();
-}
diff --git a/test/ratectrl_rtc_test.cc b/test/ratectrl_rtc_test.cc
index 7910b41..0d8d48f 100644
--- a/test/ratectrl_rtc_test.cc
+++ b/test/ratectrl_rtc_test.cc
@@ -16,8 +16,6 @@
#include "test/codec_factory.h"
#include "test/encode_test_driver.h"
#include "test/util.h"
-#include "test/y4m_video_source.h"
-#include "test/yuv_video_source.h"
#include "test/i420_video_source.h"
#include "third_party/googletest/src/googletest/include/gtest/gtest.h"
@@ -25,7 +23,11 @@
constexpr size_t kNumFrames = 450;
-constexpr int kTemporalId[4] = { 0, 2, 1, 2 };
+const int kTemporalId3Layer[4] = { 0, 2, 1, 2 };
+const int kTemporalId2Layer[2] = { 0, 1 };
+const int kTemporalRateAllocation3Layer[3] = { 50, 70, 100 };
+const int kTemporalRateAllocation2Layer[2] = { 60, 100 };
+const int kSpatialLayerBitrate[3] = { 200, 500, 900 };
// Parameter: aq mode: 0 and 3
class RcInterfaceTest : public ::libaom_test::EncoderTest,
@@ -33,7 +35,8 @@
public:
RcInterfaceTest()
: EncoderTest(GET_PARAM(0)), aq_mode_(GET_PARAM(1)), key_interval_(3000),
- encoder_exit_(false), layer_frame_cnt_(0) {
+ encoder_exit_(false), layer_frame_cnt_(0), superframe_cnt_(0),
+ dynamic_temporal_layers_(false), dynamic_spatial_layers_(false) {
memset(&svc_params_, 0, sizeof(svc_params_));
memset(&layer_id_, 0, sizeof(layer_id_));
}
@@ -63,7 +66,14 @@
if (use_svc) {
frame_params_.spatial_layer_id =
layer_frame_cnt_ % rc_cfg_.ss_number_layers;
- frame_params_.temporal_layer_id = kTemporalId[video->frame() % 4];
+ if (rc_cfg_.ts_number_layers == 3)
+ frame_params_.temporal_layer_id =
+ kTemporalId3Layer[superframe_cnt_ % 4];
+ else if (rc_cfg_.ts_number_layers == 2)
+ frame_params_.temporal_layer_id =
+ kTemporalId2Layer[superframe_cnt_ % 2];
+ else
+ frame_params_.temporal_layer_id = 0;
layer_id_.spatial_layer_id = frame_params_.spatial_layer_id;
layer_id_.temporal_layer_id = frame_params_.temporal_layer_id;
encoder->Control(AV1E_SET_SVC_LAYER_ID, &layer_id_);
@@ -72,6 +82,57 @@
frame_params_.frame_type =
layer_frame_cnt_ % key_int == 0 ? aom::kKeyFrame : aom::kInterFrame;
encoder_exit_ = video->frame() == kNumFrames;
+ frame_flags_ = 0;
+
+ if (dynamic_temporal_layers_) {
+ if (superframe_cnt_ == 100 && layer_id_.spatial_layer_id == 0) {
+ // Go down to 2 temporal layers.
+ SetConfigSvc(3, 2);
+ encoder->Control(AV1E_SET_SVC_PARAMS, &svc_params_);
+ ASSERT_TRUE(rc_api_->UpdateRateControl(rc_cfg_));
+ } else if (superframe_cnt_ == 200 && layer_id_.spatial_layer_id == 0) {
+ // Go down to 1 temporal layer.
+ SetConfigSvc(3, 1);
+ encoder->Control(AV1E_SET_SVC_PARAMS, &svc_params_);
+ ASSERT_TRUE(rc_api_->UpdateRateControl(rc_cfg_));
+ } else if (superframe_cnt_ == 300 && layer_id_.spatial_layer_id == 0) {
+ // Go back up to 3 temporal layers.
+ SetConfigSvc(3, 3);
+ encoder->Control(AV1E_SET_SVC_PARAMS, &svc_params_);
+ ASSERT_TRUE(rc_api_->UpdateRateControl(rc_cfg_));
+ }
+ } else if (dynamic_spatial_layers_) {
+ // In this example the number of spatial layers is modified on the fly,
+ // so we go from (120p,240p,480p) to (240p,480p), etc.
+ if (superframe_cnt_ == 100 && layer_id_.spatial_layer_id == 0) {
+ // Change to 2 spatial layers (240p, 480p).
+ SetConfigSvc(2, 3);
+ encoder->Control(AV1E_SET_SVC_PARAMS, &svc_params_);
+ ASSERT_TRUE(rc_api_->UpdateRateControl(rc_cfg_));
+ } else if (superframe_cnt_ == 200 && layer_id_.spatial_layer_id == 0) {
+ // Change to 1 spatial layer (480p).
+ SetConfigSvc(1, 3);
+ encoder->Control(AV1E_SET_SVC_PARAMS, &svc_params_);
+ ASSERT_TRUE(rc_api_->UpdateRateControl(rc_cfg_));
+ } else if (superframe_cnt_ == 300 && layer_id_.spatial_layer_id == 0) {
+ // Go back to 3 spatial layers (120p, 240p, 480p).
+ SetConfigSvc(3, 3);
+ encoder->Control(AV1E_SET_SVC_PARAMS, &svc_params_);
+ // In the fixed SVC mode (which is what is used in this test):
+ // A key frame is required here on SL0 since 120p will try to predict
+ // from LAST, which was the 480p, so the decoder will throw an error
+ // (the reference must be smaller than 4x4). In the flexible mode
+ // (not used here) we can set the frame flags to predict off the 2x2
+ // reference instead.
+ frame_params_.frame_type = aom::kKeyFrame;
+ ASSERT_TRUE(rc_api_->UpdateRateControl(rc_cfg_));
+ }
+ }
+ // TODO(marpan): Add dynamic spatial layers based on the layer 0 bitrate.
+ // That is the actual usage in software, where the configured (#spatial,
+ // #temporal) layers are fixed, but the top layer is dropped or re-enabled
+ // based on bitrate. This requires the external RC to handle dropped
+ // (zero-size) frames.
}
void PostEncodeFrameHook(::libaom_test::Encoder *encoder) override {
@@ -79,10 +140,20 @@
return;
}
layer_frame_cnt_++;
+ if (layer_id_.spatial_layer_id == rc_cfg_.ss_number_layers - 1)
+ superframe_cnt_++;
int qp;
encoder->Control(AOME_GET_LAST_QUANTIZER, &qp);
rc_api_->ComputeQP(frame_params_);
ASSERT_EQ(rc_api_->GetQP(), qp);
+ int encoder_lpf_level;
+ encoder->Control(AOME_GET_LOOPFILTER_LEVEL, &encoder_lpf_level);
+ aom::AV1LoopfilterLevel loopfilter_level = rc_api_->GetLoopfilterLevel();
+ ASSERT_EQ(loopfilter_level.filter_level[0], encoder_lpf_level);
+ aom::AV1CdefInfo cdef_level = rc_api_->GetCdefInfo();
+ int cdef_y_strengths[16];
+ encoder->Control(AV1E_GET_LUMA_CDEF_STRENGTH, cdef_y_strengths);
+ ASSERT_EQ(cdef_level.cdef_strength_y, cdef_y_strengths[0]);
}
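The checks above pin the external RC's per-frame outputs (QP, loopfilter level, CDEF strength) to the encoder's ground truth. A hedged sketch of the client-side loop those calls imply; the type names are taken from this test's usage, and PostEncodeUpdate is an assumption about the feedback call rather than something exercised in this hook (requires "av1/ratectrl_rtc.h" and <memory>):

void DriveExternalRc(const aom::AV1RateControlRtcConfig &cfg,
                     aom::AV1FrameParamsRTC frame_params, int num_frames) {
  std::unique_ptr<aom::AV1RateControlRTC> rc =
      aom::AV1RateControlRTC::Create(cfg);
  for (int i = 0; i < num_frames; ++i) {
    rc->ComputeQP(frame_params);
    const int qp = rc->GetQP();  // hand to the encoder as this frame's QP
    const aom::AV1LoopfilterLevel lf = rc->GetLoopfilterLevel();
    const aom::AV1CdefInfo cdef = rc->GetCdefInfo();
    (void)qp;
    (void)lf;
    (void)cdef;
    // Encode the frame with qp/lf/cdef, then report its size back, e.g.:
    // rc->PostEncodeUpdate(encoded_frame_size);
  }
}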
void FramePktHook(const aom_codec_cx_pkt_t *pkt) override {
@@ -125,7 +196,7 @@
void RunSvc() {
key_interval_ = 10000;
- SetConfigSvc();
+ SetConfigSvc(3, 3);
rc_api_ = aom::AV1RateControlRTC::Create(rc_cfg_);
frame_params_.spatial_layer_id = 0;
frame_params_.temporal_layer_id = 0;
@@ -138,7 +209,35 @@
void RunSvcPeriodicKey() {
key_interval_ = 100;
- SetConfigSvc();
+ SetConfigSvc(3, 3);
+ rc_api_ = aom::AV1RateControlRTC::Create(rc_cfg_);
+ frame_params_.spatial_layer_id = 0;
+ frame_params_.temporal_layer_id = 0;
+
+ ::libaom_test::I420VideoSource video("niklas_640_480_30.yuv", 640, 480, 30,
+ 1, 0, kNumFrames);
+
+ ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
+ }
+
+ void RunSvcDynamicTemporal() {
+ dynamic_temporal_layers_ = true;
+ key_interval_ = 10000;
+ SetConfigSvc(3, 3);
+ rc_api_ = aom::AV1RateControlRTC::Create(rc_cfg_);
+ frame_params_.spatial_layer_id = 0;
+ frame_params_.temporal_layer_id = 0;
+
+ ::libaom_test::I420VideoSource video("niklas_640_480_30.yuv", 640, 480, 30,
+ 1, 0, kNumFrames);
+
+ ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
+ }
+
+ void RunSvcDynamicSpatial() {
+ dynamic_spatial_layers_ = true;
+ key_interval_ = 10000;
+ SetConfigSvc(3, 3);
rc_api_ = aom::AV1RateControlRTC::Create(rc_cfg_);
frame_params_.spatial_layer_id = 0;
frame_params_.temporal_layer_id = 0;
@@ -191,12 +290,11 @@
cfg_.kf_max_dist = key_interval_;
}
- void SetConfigSvc() {
+ void SetConfigSvc(int number_spatial_layers, int number_temporal_layers) {
rc_cfg_.width = 640;
rc_cfg_.height = 480;
- rc_cfg_.max_quantizer = 52;
+ rc_cfg_.max_quantizer = 56;
rc_cfg_.min_quantizer = 2;
- rc_cfg_.target_bandwidth = 1000;
rc_cfg_.buf_initial_sz = 600;
rc_cfg_.buf_optimal_sz = 600;
rc_cfg_.buf_sz = 1000;
@@ -204,85 +302,117 @@
rc_cfg_.overshoot_pct = 50;
rc_cfg_.max_intra_bitrate_pct = 1000;
rc_cfg_.framerate = 30.0;
- rc_cfg_.ss_number_layers = 3;
- rc_cfg_.ts_number_layers = 3;
rc_cfg_.aq_mode = aq_mode_;
-
- rc_cfg_.scaling_factor_num[0] = 1;
- rc_cfg_.scaling_factor_den[0] = 4;
- rc_cfg_.scaling_factor_num[1] = 2;
- rc_cfg_.scaling_factor_den[1] = 4;
- rc_cfg_.scaling_factor_num[2] = 4;
- rc_cfg_.scaling_factor_den[2] = 4;
-
- rc_cfg_.ts_rate_decimator[0] = 4;
- rc_cfg_.ts_rate_decimator[1] = 2;
- rc_cfg_.ts_rate_decimator[2] = 1;
-
- rc_cfg_.layer_target_bitrate[0] = 100;
- rc_cfg_.layer_target_bitrate[1] = 140;
- rc_cfg_.layer_target_bitrate[2] = 200;
- rc_cfg_.layer_target_bitrate[3] = 250;
- rc_cfg_.layer_target_bitrate[4] = 350;
- rc_cfg_.layer_target_bitrate[5] = 500;
- rc_cfg_.layer_target_bitrate[6] = 450;
- rc_cfg_.layer_target_bitrate[7] = 630;
- rc_cfg_.layer_target_bitrate[8] = 900;
-
- for (int sl = 0; sl < rc_cfg_.ss_number_layers; ++sl) {
- for (int tl = 0; tl < rc_cfg_.ts_number_layers; ++tl) {
- const int i = sl * rc_cfg_.ts_number_layers + tl;
- rc_cfg_.max_quantizers[i] = 56;
- rc_cfg_.min_quantizers[i] = 2;
- }
- }
+ rc_cfg_.ss_number_layers = number_spatial_layers;
+ rc_cfg_.ts_number_layers = number_temporal_layers;
// Encoder settings for ground truth.
cfg_.g_w = 640;
cfg_.g_h = 480;
- svc_params_.number_spatial_layers = 3;
- svc_params_.number_temporal_layers = 3;
- cfg_.g_timebase.num = 1;
- cfg_.g_timebase.den = 30;
- svc_params_.scaling_factor_num[0] = 72;
- svc_params_.scaling_factor_den[0] = 288;
- svc_params_.scaling_factor_num[1] = 144;
- svc_params_.scaling_factor_den[1] = 288;
- svc_params_.scaling_factor_num[2] = 288;
- svc_params_.scaling_factor_den[2] = 288;
- for (int i = 0; i < AOM_MAX_LAYERS; ++i) {
- svc_params_.max_quantizers[i] = 56;
- svc_params_.min_quantizers[i] = 2;
- }
- cfg_.rc_end_usage = AOM_CBR;
- cfg_.g_lag_in_frames = 0;
- cfg_.g_error_resilient = 0;
- // 3 temporal layers
- svc_params_.framerate_factor[0] = 4;
- svc_params_.framerate_factor[1] = 2;
- svc_params_.framerate_factor[2] = 1;
-
+ cfg_.rc_max_quantizer = 56;
+ cfg_.rc_min_quantizer = 2;
cfg_.rc_buf_initial_sz = 600;
cfg_.rc_buf_optimal_sz = 600;
cfg_.rc_buf_sz = 1000;
- cfg_.rc_min_quantizer = 2;
- cfg_.rc_max_quantizer = 56;
+ cfg_.rc_overshoot_pct = 50;
+ cfg_.rc_undershoot_pct = 50;
cfg_.g_threads = 1;
cfg_.kf_min_dist = key_interval_;
cfg_.kf_max_dist = key_interval_;
- cfg_.rc_target_bitrate = 1000;
- cfg_.rc_overshoot_pct = 50;
- cfg_.rc_undershoot_pct = 50;
+ cfg_.g_timebase.num = 1;
+ cfg_.g_timebase.den = 30;
+ cfg_.rc_end_usage = AOM_CBR;
+ cfg_.g_lag_in_frames = 0;
+ cfg_.g_error_resilient = 0;
+ svc_params_.number_spatial_layers = number_spatial_layers;
+ svc_params_.number_temporal_layers = number_temporal_layers;
- svc_params_.layer_target_bitrate[0] = 100;
- svc_params_.layer_target_bitrate[1] = 140;
- svc_params_.layer_target_bitrate[2] = 200;
- svc_params_.layer_target_bitrate[3] = 250;
- svc_params_.layer_target_bitrate[4] = 350;
- svc_params_.layer_target_bitrate[5] = 500;
- svc_params_.layer_target_bitrate[6] = 450;
- svc_params_.layer_target_bitrate[7] = 630;
- svc_params_.layer_target_bitrate[8] = 900;
+ // Scale factors.
+ if (number_spatial_layers == 3) {
+ rc_cfg_.scaling_factor_num[0] = 1;
+ rc_cfg_.scaling_factor_den[0] = 4;
+ rc_cfg_.scaling_factor_num[1] = 2;
+ rc_cfg_.scaling_factor_den[1] = 4;
+ rc_cfg_.scaling_factor_num[2] = 4;
+ rc_cfg_.scaling_factor_den[2] = 4;
+ svc_params_.scaling_factor_num[0] = 1;
+ svc_params_.scaling_factor_den[0] = 4;
+ svc_params_.scaling_factor_num[1] = 2;
+ svc_params_.scaling_factor_den[1] = 4;
+ svc_params_.scaling_factor_num[2] = 4;
+ svc_params_.scaling_factor_den[2] = 4;
+ } else if (number_spatial_layers == 2) {
+ rc_cfg_.scaling_factor_num[0] = 1;
+ rc_cfg_.scaling_factor_den[0] = 2;
+ rc_cfg_.scaling_factor_num[1] = 2;
+ rc_cfg_.scaling_factor_den[1] = 2;
+ svc_params_.scaling_factor_num[0] = 1;
+ svc_params_.scaling_factor_den[0] = 2;
+ svc_params_.scaling_factor_num[1] = 2;
+ svc_params_.scaling_factor_den[1] = 2;
+ } else if (number_spatial_layers == 1) {
+ rc_cfg_.scaling_factor_num[0] = 1;
+ rc_cfg_.scaling_factor_den[0] = 1;
+ svc_params_.scaling_factor_num[0] = 1;
+ svc_params_.scaling_factor_den[0] = 1;
+ }
+
+ // TS rate decimator.
+ if (number_temporal_layers == 3) {
+ rc_cfg_.ts_rate_decimator[0] = 4;
+ rc_cfg_.ts_rate_decimator[1] = 2;
+ rc_cfg_.ts_rate_decimator[2] = 1;
+ svc_params_.framerate_factor[0] = 4;
+ svc_params_.framerate_factor[1] = 2;
+ svc_params_.framerate_factor[2] = 1;
+ } else if (number_temporal_layers == 2) {
+ rc_cfg_.ts_rate_decimator[0] = 2;
+ rc_cfg_.ts_rate_decimator[1] = 1;
+ svc_params_.framerate_factor[0] = 2;
+ svc_params_.framerate_factor[1] = 1;
+ } else if (number_temporal_layers == 1) {
+ rc_cfg_.ts_rate_decimator[0] = 1;
+ svc_params_.framerate_factor[0] = 1;
+ }
+
+ // Bitrate.
+ rc_cfg_.target_bandwidth = 0;
+ cfg_.rc_target_bitrate = 0;
+ for (int sl = 0; sl < number_spatial_layers; sl++) {
+ int spatial_bitrate = 0;
+ if (number_spatial_layers <= 3)
+ spatial_bitrate = kSpatialLayerBitrate[sl];
+ for (int tl = 0; tl < number_temporal_layers; tl++) {
+ int layer = sl * number_temporal_layers + tl;
+ if (number_temporal_layers == 3) {
+ rc_cfg_.layer_target_bitrate[layer] =
+ kTemporalRateAllocation3Layer[tl] * spatial_bitrate / 100;
+ svc_params_.layer_target_bitrate[layer] =
+ kTemporalRateAllocation3Layer[tl] * spatial_bitrate / 100;
+ } else if (number_temporal_layers == 2) {
+ rc_cfg_.layer_target_bitrate[layer] =
+ kTemporalRateAllocation2Layer[tl] * spatial_bitrate / 100;
+ svc_params_.layer_target_bitrate[layer] =
+ kTemporalRateAllocation2Layer[tl] * spatial_bitrate / 100;
+ } else if (number_temporal_layers == 1) {
+ rc_cfg_.layer_target_bitrate[layer] = spatial_bitrate;
+ svc_params_.layer_target_bitrate[layer] = spatial_bitrate;
+ }
+ }
+ rc_cfg_.target_bandwidth += spatial_bitrate;
+ cfg_.rc_target_bitrate += spatial_bitrate;
+ }
+
+ // Layer min/max quantizer.
+ for (int sl = 0; sl < number_spatial_layers; ++sl) {
+ for (int tl = 0; tl < number_temporal_layers; ++tl) {
+ const int i = sl * number_temporal_layers + tl;
+ rc_cfg_.max_quantizers[i] = rc_cfg_.max_quantizer;
+ rc_cfg_.min_quantizers[i] = rc_cfg_.min_quantizer;
+ svc_params_.max_quantizers[i] = cfg_.rc_max_quantizer;
+ svc_params_.min_quantizers[i] = cfg_.rc_min_quantizer;
+ }
+ }
}
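As a worked example of the bitrate allocation above: with kSpatialLayerBitrate = { 200, 500, 900 } and the cumulative three-temporal-layer split { 50, 70, 100 }, spatial layer 0 gets per-layer targets of 100/140/200 kbps, layer 1 gets 250/350/500, and layer 2 gets 450/630/900. This reproduces the nine hardcoded layer_target_bitrate values the old SetConfigSvc set, and makes the computed target_bandwidth (200 + 500 + 900 = 1600) consistent with their sum rather than the previously hardcoded 1000.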
std::unique_ptr<aom::AV1RateControlRTC> rc_api_;
@@ -294,6 +424,9 @@
aom_svc_params_t svc_params_;
aom_svc_layer_id_t layer_id_;
int layer_frame_cnt_;
+ int superframe_cnt_;
+ bool dynamic_temporal_layers_;
+ bool dynamic_spatial_layers_;
};
TEST_P(RcInterfaceTest, OneLayer) { RunOneLayer(); }
@@ -304,6 +437,10 @@
TEST_P(RcInterfaceTest, SvcPeriodicKey) { RunSvcPeriodicKey(); }
+TEST_P(RcInterfaceTest, SvcDynamicTemporal) { RunSvcDynamicTemporal(); }
+
+TEST_P(RcInterfaceTest, SvcDynamicSpatial) { RunSvcDynamicSpatial(); }
+
AV1_INSTANTIATE_TEST_SUITE(RcInterfaceTest, ::testing::Values(0, 3));
} // namespace
diff --git a/test/register_state_check.h b/test/register_state_check.h
index 0a150c3..4aad814 100644
--- a/test/register_state_check.h
+++ b/test/register_state_check.h
@@ -26,7 +26,7 @@
// See platform implementations of RegisterStateCheck and
// RegisterStateCheckMMX for details.
-#if defined(_WIN64) && ARCH_X86_64
+#if defined(_WIN64) && AOM_ARCH_X86_64
#undef NOMINMAX
#define NOMINMAX
@@ -86,9 +86,9 @@
class RegisterStateCheck {};
} // namespace libaom_test
-#endif // _WIN64 && ARCH_X86_64
+#endif // _WIN64 && AOM_ARCH_X86_64
-#if (ARCH_X86 || ARCH_X86_64) && defined(__GNUC__)
+#if (AOM_ARCH_X86 || AOM_ARCH_X86_64) && defined(__GNUC__)
namespace libaom_test {
// Checks the FPU tag word pre/post execution to ensure emms has been called.
@@ -122,7 +122,7 @@
class RegisterStateCheckMMX {};
} // namespace libaom_test
-#endif // (ARCH_X86 || ARCH_X86_64) && defined(__GNUC__)
+#endif // (AOM_ARCH_X86 || AOM_ARCH_X86_64) && defined(__GNUC__)
#define API_REGISTER_STATE_CHECK(statement) \
do { \
diff --git a/test/resize_test.cc b/test/resize_test.cc
index e21f4bf..85a72da 100644
--- a/test/resize_test.cc
+++ b/test/resize_test.cc
@@ -96,11 +96,17 @@
};
void ScaleForFrameNumber(unsigned int frame, unsigned int initial_w,
- unsigned int initial_h, unsigned int *w,
- unsigned int *h, int flag_codec) {
+ unsigned int initial_h, int flag_codec,
+ bool change_start_resln, unsigned int *w,
+ unsigned int *h) {
if (frame < 10) {
- *w = initial_w;
- *h = initial_h;
+ if (change_start_resln) {
+ *w = initial_w / 4;
+ *h = initial_h / 4;
+ } else {
+ *w = initial_w;
+ *h = initial_h;
+ }
return;
}
if (frame < 20) {
@@ -179,15 +185,25 @@
limit_ = 150;
}
int flag_codec_;
+ bool change_start_resln_;
virtual ~ResizingVideoSource() {}
protected:
- virtual void Next() {
+ void Begin() override {
+ frame_ = 0;
+ unsigned int width;
+ unsigned int height;
+ ScaleForFrameNumber(frame_, kInitialWidth, kInitialHeight, flag_codec_,
+ change_start_resln_, &width, &height);
+ SetSize(width, height);
+ FillFrame();
+ }
+ void Next() override {
++frame_;
unsigned int width;
unsigned int height;
- ScaleForFrameNumber(frame_, kInitialWidth, kInitialHeight, &width, &height,
- flag_codec_);
+ ScaleForFrameNumber(frame_, kInitialWidth, kInitialHeight, flag_codec_,
+ change_start_resln_, &width, &height);
SetSize(width, height);
FillFrame();
}
@@ -225,6 +241,7 @@
TEST_P(ResizeTest, TestExternalResizeWorks) {
ResizingVideoSource video;
video.flag_codec_ = 0;
+ video.change_start_resln_ = false;
cfg_.g_lag_in_frames = 0;
// We use max(kInitialWidth, kInitialHeight) because during the test
// the width and height of the frame are swapped
@@ -240,8 +257,8 @@
const unsigned int frame = static_cast<unsigned>(info->pts);
unsigned int expected_w;
unsigned int expected_h;
- ScaleForFrameNumber(frame, kInitialWidth, kInitialHeight, &expected_w,
- &expected_h, 0);
+ ScaleForFrameNumber(frame, kInitialWidth, kInitialHeight, video.flag_codec_,
+ video.change_start_resln_, &expected_w, &expected_h);
EXPECT_EQ(expected_w, info->w)
<< "Frame " << frame << " had unexpected width";
EXPECT_EQ(expected_h, info->h)
@@ -596,23 +613,30 @@
mismatch_psnr_ = 0.0;
mismatch_nframes_ = 0;
DefaultConfig();
- ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
+ // Test external resizing with start resolution equal to
+ // 1. kInitialWidth and kInitialHeight
+ // 2. down-scaled kInitialWidth and kInitialHeight
+ for (int i = 0; i < 2; i++) {
+ video.change_start_resln_ = static_cast<bool>(i);
- // Check we decoded the same number of frames as we attempted to encode
- ASSERT_EQ(frame_info_list_.size(), video.limit());
+ ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
- for (std::vector<FrameInfo>::const_iterator info = frame_info_list_.begin();
- info != frame_info_list_.end(); ++info) {
- const unsigned int frame = static_cast<unsigned>(info->pts);
- unsigned int expected_w;
- unsigned int expected_h;
- ScaleForFrameNumber(frame, kInitialWidth, kInitialHeight, &expected_w,
- &expected_h, 1);
- EXPECT_EQ(expected_w, info->w)
- << "Frame " << frame << " had unexpected width";
- EXPECT_EQ(expected_h, info->h)
- << "Frame " << frame << " had unexpected height";
- EXPECT_EQ(static_cast<unsigned int>(0), GetMismatchFrames());
+ // Check we decoded the same number of frames as we attempted to encode
+ ASSERT_EQ(frame_info_list_.size(), video.limit());
+ for (const auto &info : frame_info_list_) {
+ const unsigned int frame = static_cast<unsigned>(info.pts);
+ unsigned int expected_w;
+ unsigned int expected_h;
+ ScaleForFrameNumber(frame, kInitialWidth, kInitialHeight,
+ video.flag_codec_, video.change_start_resln_,
+ &expected_w, &expected_h);
+ EXPECT_EQ(expected_w, info.w)
+ << "Frame " << frame << " had unexpected width";
+ EXPECT_EQ(expected_h, info.h)
+ << "Frame " << frame << " had unexpected height";
+ EXPECT_EQ(static_cast<unsigned int>(0), GetMismatchFrames());
+ }
+ frame_info_list_.clear();
}
}
@@ -788,7 +812,8 @@
virtual ~ResizingCspVideoSource() {}
};
-#if (defined(DISABLE_TRELLISQ_SEARCH) && DISABLE_TRELLISQ_SEARCH)
+#if (defined(DISABLE_TRELLISQ_SEARCH) && DISABLE_TRELLISQ_SEARCH) || \
+ (defined(CONFIG_MAX_DECODE_PROFILE) && CONFIG_MAX_DECODE_PROFILE < 1)
TEST_P(ResizeCspTest, DISABLED_TestResizeCspWorks) {
#else
TEST_P(ResizeCspTest, TestResizeCspWorks) {
diff --git a/test/rt_end_to_end_test.cc b/test/rt_end_to_end_test.cc
index e5cc163..735d799 100644
--- a/test/rt_end_to_end_test.cc
+++ b/test/rt_end_to_end_test.cc
@@ -42,23 +42,23 @@
{ { 5, { { 0, 36.2 }, { 3, 36.7 } } },
{ 6, { { 0, 36.1 }, { 3, 36.48 } } },
{ 7, { { 0, 35.5 }, { 3, 36.0 } } },
- { 8, { { 0, 35.8 }, { 3, 36.48 } } },
+ { 8, { { 0, 35.8 }, { 3, 36.4 } } },
{ 9, { { 0, 35.5 }, { 3, 36.0 } } },
{ 10, { { 0, 35.3 }, { 3, 35.9 } } } } },
{ "niklas_1280_720_30.y4m",
- { { 5, { { 0, 34.4 }, { 3, 34.30 } } },
- { 6, { { 0, 34.2 }, { 3, 34.2 } } },
- { 7, { { 0, 33.5 }, { 3, 33.48 } } },
- { 8, { { 0, 33.48 }, { 3, 33.48 } } },
- { 9, { { 0, 33.4 }, { 3, 33.4 } } },
- { 10, { { 0, 33.2 }, { 3, 33.2 } } } } },
+ { { 5, { { 0, 34.4 }, { 3, 34.2 } } },
+ { 6, { { 0, 34.1 }, { 3, 34.0 } } },
+ { 7, { { 0, 33.5 }, { 3, 33.1 } } },
+ { 8, { { 0, 33.3 }, { 3, 33.3 } } },
+ { 9, { { 0, 33.3 }, { 3, 33.3 } } },
+ { 10, { { 0, 33.1 }, { 3, 33.1 } } } } },
{ "hantro_collage_w352h288_nv12.yuv",
- { { 5, { { 0, 34.4 }, { 3, 34.30 } } },
- { 6, { { 0, 34.2 }, { 3, 34.2 } } },
+ { { 5, { { 0, 34.4 }, { 3, 34.2 } } },
+ { 6, { { 0, 34.1 }, { 3, 34.1 } } },
{ 7, { { 0, 33.6 }, { 3, 33.6 } } },
- { 8, { { 0, 33.48 }, { 3, 33.48 } } },
- { 9, { { 0, 33.4 }, { 3, 33.4 } } },
- { 10, { { 0, 33.2 }, { 3, 33.2 } } } } } };
+ { 8, { { 0, 33.3 }, { 3, 33.3 } } },
+ { 9, { { 0, 33.3 }, { 3, 33.3 } } },
+ { 10, { { 0, 33.1 }, { 3, 33.1 } } } } } };
typedef struct {
const char *filename;
@@ -82,17 +82,17 @@
{ "hantro_collage_w352h288_nv12.yuv", 8, AOM_IMG_FMT_NV12, AOM_BITS_8, 0 },
};
-// Params: test video, speed, aq mode, threads, tile columns.
+// Params: test video, speed, aq mode, threads, tile columns, tile rows.
class RTEndToEndTest
- : public ::libaom_test::CodecTestWith5Params<TestVideoParam, int,
- unsigned int, int, int>,
+ : public ::libaom_test::CodecTestWith6Params<TestVideoParam, int,
+ unsigned int, int, int, int>,
public ::libaom_test::EncoderTest {
protected:
RTEndToEndTest()
: EncoderTest(GET_PARAM(0)), test_video_param_(GET_PARAM(1)),
cpu_used_(GET_PARAM(2)), psnr_(0.0), nframes_(0),
aq_mode_(GET_PARAM(3)), threads_(GET_PARAM(4)),
- tile_columns_(GET_PARAM(5)) {}
+ tile_columns_(GET_PARAM(5)), tile_rows_(GET_PARAM(6)) {}
virtual ~RTEndToEndTest() {}
@@ -128,6 +128,7 @@
encoder->Control(AV1E_SET_ENABLE_TPL_MODEL, 0);
encoder->Control(AV1E_SET_FRAME_PARALLEL_DECODING, 1);
encoder->Control(AV1E_SET_TILE_COLUMNS, tile_columns_);
+ encoder->Control(AV1E_SET_TILE_ROWS, tile_rows_);
encoder->Control(AOME_SET_CPUUSED, cpu_used_);
encoder->Control(AV1E_SET_TUNE_CONTENT, AOM_CONTENT_DEFAULT);
encoder->Control(AV1E_SET_AQ_MODE, aq_mode_);
@@ -183,6 +184,7 @@
unsigned int aq_mode_;
int threads_;
int tile_columns_;
+ int tile_rows_;
};
class RTEndToEndTestThreaded : public RTEndToEndTest {};
@@ -192,13 +194,15 @@
TEST_P(RTEndToEndTestThreaded, EndtoEndPSNRTest) { DoTest(); }
AV1_INSTANTIATE_TEST_SUITE(RTEndToEndTest, ::testing::ValuesIn(kTestVectors),
- ::testing::Range(5, 11),
+ ::testing::Range(5, 12),
::testing::Values<unsigned int>(0, 3),
- ::testing::Values(1), ::testing::Values(1));
+ ::testing::Values(1), ::testing::Values(1),
+ ::testing::Values(1));
AV1_INSTANTIATE_TEST_SUITE(RTEndToEndTestThreaded,
::testing::ValuesIn(kTestVectors),
- ::testing::Range(5, 11),
+ ::testing::Range(5, 12),
::testing::Values<unsigned int>(0, 3),
- ::testing::Range(2, 5), ::testing::Range(2, 5));
+ ::testing::Range(2, 6), ::testing::Range(1, 5),
+ ::testing::Range(1, 5));
} // namespace
diff --git a/test/sad_test.cc b/test/sad_test.cc
index 98c8f51..0a39ca6 100644
--- a/test/sad_test.cc
+++ b/test/sad_test.cc
@@ -481,42 +481,6 @@
}
};
-#if !CONFIG_REALTIME_ONLY
-class SADx4AvgTest : public ::testing::WithParamInterface<SadMxNx4AvgParam>,
- public SADTestBase {
- public:
- SADx4AvgTest() : SADTestBase(GET_PARAM(0), GET_PARAM(1), GET_PARAM(3)) {}
-
- protected:
- void SADs(unsigned int *results) {
- const uint8_t *references[] = { GetReference(0), GetReference(1),
- GetReference(2), GetReference(3) };
-
- API_REGISTER_STATE_CHECK(GET_PARAM(2)(source_data_, source_stride_,
- references, reference_stride_,
- second_pred_, results));
- }
-
- void CheckSADs() {
- unsigned int reference_sad, exp_sad[4];
-
- SADs(exp_sad);
- for (int block = 0; block < 4; ++block) {
- reference_sad = ReferenceSADavg(block);
-
- EXPECT_EQ(reference_sad, exp_sad[block]) << "block " << block;
- }
- }
-
- void SADForSpeedTest(unsigned int *results,
- const uint8_t *const *references) {
- GET_PARAM(2)
- (source_data_, source_stride_, references, reference_stride_, second_pred_,
- results);
- }
-};
-#endif // !CONFIG_REALTIME_ONLY
-
class SADTest : public ::testing::WithParamInterface<SadMxNParam>,
public SADTestBase {
public:
@@ -635,39 +599,6 @@
}
};
-class DistWtdSADTest : public ::testing::WithParamInterface<DistWtdSadMxhParam>,
- public SADTestBase {
- public:
- DistWtdSADTest() : SADTestBase(GET_PARAM(0), GET_PARAM(1), GET_PARAM(3)) {}
-
- protected:
- unsigned int SAD(int block_idx) {
- unsigned int ret;
- const uint8_t *const reference = GetReference(block_idx);
-
- API_REGISTER_STATE_CHECK(ret = GET_PARAM(2)(source_data_, source_stride_,
- reference, reference_stride_,
- GET_PARAM(0), GET_PARAM(1)));
- return ret;
- }
-
- void CheckSAD() {
- const unsigned int reference_sad = ReferenceSAD(0);
- const unsigned int exp_sad = SAD(0);
-
- ASSERT_EQ(reference_sad, exp_sad);
- }
-
- void SADForSpeedTest(unsigned int *results,
- const uint8_t *const *references) {
- GET_PARAM(2)
- (source_data_, source_stride_, references[0], reference_stride_, width_,
- height_);
- (void)results;
- }
-};
-GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(DistWtdSADTest);
-
class DistWtdSADavgTest
: public ::testing::WithParamInterface<DistWtdSadMxNAvgParam>,
public SADTestBase {
@@ -908,52 +839,6 @@
reference_stride_ = tmp_stride;
}
-TEST_P(DistWtdSADTest, MaxRef) {
- FillConstant(source_data_, source_stride_, 0);
- FillConstant(reference_data_, reference_stride_, mask_);
- CheckSAD();
-}
-
-TEST_P(DistWtdSADTest, MaxSrc) {
- FillConstant(source_data_, source_stride_, mask_);
- FillConstant(reference_data_, reference_stride_, 0);
- CheckSAD();
-}
-
-TEST_P(DistWtdSADTest, ShortRef) {
- const int tmp_stride = reference_stride_;
- reference_stride_ >>= 1;
- FillRandom(source_data_, source_stride_);
- FillRandom(reference_data_, reference_stride_);
- CheckSAD();
- reference_stride_ = tmp_stride;
-}
-
-TEST_P(DistWtdSADTest, UnalignedRef) {
- // The reference frame, but not the source frame, may be unaligned for
- // certain types of searches.
- const int tmp_stride = reference_stride_;
- reference_stride_ -= 1;
- FillRandom(source_data_, source_stride_);
- FillRandom(reference_data_, reference_stride_);
- CheckSAD();
- reference_stride_ = tmp_stride;
-}
-
-TEST_P(DistWtdSADTest, ShortSrc) {
- const int tmp_stride = source_stride_;
- source_stride_ >>= 1;
- int test_count = 2000;
- while (test_count > 0) {
- FillRandom(source_data_, source_stride_);
- FillRandom(reference_data_, reference_stride_);
- CheckSAD();
- if (testing::Test::HasFatalFailure()) break;
- test_count -= 1;
- }
- source_stride_ = tmp_stride;
-}
-
TEST_P(DistWtdSADavgTest, MaxRef) {
FillConstant(source_data_, source_stride_, 0);
FillConstant(reference_data_, reference_stride_, mask_);
@@ -1252,69 +1137,6 @@
using std::make_tuple;
-#if !CONFIG_REALTIME_ONLY
-TEST_P(SADx4AvgTest, DISABLED_Speed) {
- int tmp_stride = reference_stride_;
- reference_stride_ >>= 1;
- FillRandom(source_data_, source_stride_);
- FillRandom(GetReference(0), reference_stride_);
- FillRandom(GetReference(1), reference_stride_);
- FillRandom(GetReference(2), reference_stride_);
- FillRandom(GetReference(3), reference_stride_);
- FillRandom(second_pred_, width_);
- SpeedSAD();
- reference_stride_ = tmp_stride;
-}
-
-TEST_P(SADx4AvgTest, MaxRef) {
- FillConstant(source_data_, source_stride_, 0);
- FillConstant(GetReference(0), reference_stride_, mask_);
- FillConstant(GetReference(1), reference_stride_, mask_);
- FillConstant(GetReference(2), reference_stride_, mask_);
- FillConstant(GetReference(3), reference_stride_, mask_);
- FillConstant(second_pred_, width_, 0);
- CheckSADs();
-}
-
-TEST_P(SADx4AvgTest, MaxSrc) {
- FillConstant(source_data_, source_stride_, mask_);
- FillConstant(GetReference(0), reference_stride_, 0);
- FillConstant(GetReference(1), reference_stride_, 0);
- FillConstant(GetReference(2), reference_stride_, 0);
- FillConstant(GetReference(3), reference_stride_, 0);
- FillConstant(second_pred_, width_, 0);
- CheckSADs();
-}
-
-TEST_P(SADx4AvgTest, ShortRef) {
- int tmp_stride = reference_stride_;
- reference_stride_ >>= 1;
- FillRandom(source_data_, source_stride_);
- FillRandom(GetReference(0), reference_stride_);
- FillRandom(GetReference(1), reference_stride_);
- FillRandom(GetReference(2), reference_stride_);
- FillRandom(GetReference(3), reference_stride_);
- FillRandom(second_pred_, width_);
- CheckSADs();
- reference_stride_ = tmp_stride;
-}
-
-TEST_P(SADx4AvgTest, UnalignedRef) {
- // The reference frame, but not the source frame, may be unaligned for
- // certain types of searches.
- int tmp_stride = reference_stride_;
- reference_stride_ -= 1;
- FillRandom(source_data_, source_stride_);
- FillRandom(GetReference(0), reference_stride_);
- FillRandom(GetReference(1), reference_stride_);
- FillRandom(GetReference(2), reference_stride_);
- FillRandom(GetReference(3), reference_stride_);
- FillRandom(second_pred_, width_);
- CheckSADs();
- reference_stride_ = tmp_stride;
-}
-#endif // !CONFIG_REALTIME_ONLY
-
//------------------------------------------------------------------------------
// C functions
const SadMxNParam c_tests[] = {
@@ -1992,34 +1814,6 @@
INSTANTIATE_TEST_SUITE_P(C, SADSkipx4Test,
::testing::ValuesIn(skip_x4d_c_tests));
-#if !CONFIG_REALTIME_ONLY
-const SadMxNx4AvgParam x4d_avg_c_tests[] = {
- make_tuple(128, 128, &aom_sad128x128x4d_avg_c, -1),
- make_tuple(128, 64, &aom_sad128x64x4d_avg_c, -1),
- make_tuple(64, 128, &aom_sad64x128x4d_avg_c, -1),
- make_tuple(64, 64, &aom_sad64x64x4d_avg_c, -1),
- make_tuple(64, 32, &aom_sad64x32x4d_avg_c, -1),
- make_tuple(32, 64, &aom_sad32x64x4d_avg_c, -1),
- make_tuple(32, 32, &aom_sad32x32x4d_avg_c, -1),
- make_tuple(32, 16, &aom_sad32x16x4d_avg_c, -1),
- make_tuple(16, 32, &aom_sad16x32x4d_avg_c, -1),
- make_tuple(16, 16, &aom_sad16x16x4d_avg_c, -1),
- make_tuple(16, 8, &aom_sad16x8x4d_avg_c, -1),
- make_tuple(8, 16, &aom_sad8x16x4d_avg_c, -1),
- make_tuple(8, 8, &aom_sad8x8x4d_avg_c, -1),
- make_tuple(8, 4, &aom_sad8x4x4d_avg_c, -1),
- make_tuple(4, 8, &aom_sad4x8x4d_avg_c, -1),
- make_tuple(4, 4, &aom_sad4x4x4d_avg_c, -1),
- make_tuple(64, 16, &aom_sad64x16x4d_avg_c, -1),
- make_tuple(16, 64, &aom_sad16x64x4d_avg_c, -1),
- make_tuple(32, 8, &aom_sad32x8x4d_avg_c, -1),
- make_tuple(8, 32, &aom_sad8x32x4d_avg_c, -1),
- make_tuple(16, 4, &aom_sad16x4x4d_avg_c, -1),
- make_tuple(4, 16, &aom_sad4x16x4d_avg_c, -1),
-};
-INSTANTIATE_TEST_SUITE_P(C, SADx4AvgTest, ::testing::ValuesIn(x4d_avg_c_tests));
-#endif // !CONFIG_REALTIME_ONLY
-
//------------------------------------------------------------------------------
// ARM functions
#if HAVE_NEON
@@ -2040,6 +1834,56 @@
make_tuple(8, 4, &aom_sad8x4_neon, -1),
make_tuple(4, 8, &aom_sad4x8_neon, -1),
make_tuple(4, 4, &aom_sad4x4_neon, -1),
+#if CONFIG_AV1_HIGHBITDEPTH
+ make_tuple(128, 128, &aom_highbd_sad128x128_neon, 8),
+ make_tuple(128, 64, &aom_highbd_sad128x64_neon, 8),
+ make_tuple(64, 128, &aom_highbd_sad64x128_neon, 8),
+ make_tuple(64, 64, &aom_highbd_sad64x64_neon, 8),
+ make_tuple(64, 32, &aom_highbd_sad64x32_neon, 8),
+ make_tuple(32, 64, &aom_highbd_sad32x64_neon, 8),
+ make_tuple(32, 32, &aom_highbd_sad32x32_neon, 8),
+ make_tuple(32, 16, &aom_highbd_sad32x16_neon, 8),
+ make_tuple(16, 32, &aom_highbd_sad16x32_neon, 8),
+ make_tuple(16, 16, &aom_highbd_sad16x16_neon, 8),
+ make_tuple(16, 8, &aom_highbd_sad16x8_neon, 8),
+ make_tuple(8, 16, &aom_highbd_sad8x16_neon, 8),
+ make_tuple(8, 8, &aom_highbd_sad8x8_neon, 8),
+ make_tuple(8, 4, &aom_highbd_sad8x4_neon, 8),
+ make_tuple(4, 8, &aom_highbd_sad4x8_neon, 8),
+ make_tuple(4, 4, &aom_highbd_sad4x4_neon, 8),
+ make_tuple(128, 128, &aom_highbd_sad128x128_neon, 10),
+ make_tuple(128, 64, &aom_highbd_sad128x64_neon, 10),
+ make_tuple(64, 128, &aom_highbd_sad64x128_neon, 10),
+ make_tuple(64, 64, &aom_highbd_sad64x64_neon, 10),
+ make_tuple(64, 32, &aom_highbd_sad64x32_neon, 10),
+ make_tuple(32, 64, &aom_highbd_sad32x64_neon, 10),
+ make_tuple(32, 32, &aom_highbd_sad32x32_neon, 10),
+ make_tuple(32, 16, &aom_highbd_sad32x16_neon, 10),
+ make_tuple(16, 32, &aom_highbd_sad16x32_neon, 10),
+ make_tuple(16, 16, &aom_highbd_sad16x16_neon, 10),
+ make_tuple(16, 8, &aom_highbd_sad16x8_neon, 10),
+ make_tuple(8, 16, &aom_highbd_sad8x16_neon, 10),
+ make_tuple(8, 8, &aom_highbd_sad8x8_neon, 10),
+ make_tuple(8, 4, &aom_highbd_sad8x4_neon, 10),
+ make_tuple(4, 8, &aom_highbd_sad4x8_neon, 10),
+ make_tuple(4, 4, &aom_highbd_sad4x4_neon, 10),
+ make_tuple(128, 128, &aom_highbd_sad128x128_neon, 12),
+ make_tuple(128, 64, &aom_highbd_sad128x64_neon, 12),
+ make_tuple(64, 128, &aom_highbd_sad64x128_neon, 12),
+ make_tuple(64, 64, &aom_highbd_sad64x64_neon, 12),
+ make_tuple(64, 32, &aom_highbd_sad64x32_neon, 12),
+ make_tuple(32, 64, &aom_highbd_sad32x64_neon, 12),
+ make_tuple(32, 32, &aom_highbd_sad32x32_neon, 12),
+ make_tuple(32, 16, &aom_highbd_sad32x16_neon, 12),
+ make_tuple(16, 32, &aom_highbd_sad16x32_neon, 12),
+ make_tuple(16, 16, &aom_highbd_sad16x16_neon, 12),
+ make_tuple(16, 8, &aom_highbd_sad16x8_neon, 12),
+ make_tuple(8, 16, &aom_highbd_sad8x16_neon, 12),
+ make_tuple(8, 8, &aom_highbd_sad8x8_neon, 12),
+ make_tuple(8, 4, &aom_highbd_sad8x4_neon, 12),
+ make_tuple(4, 8, &aom_highbd_sad4x8_neon, 12),
+ make_tuple(4, 4, &aom_highbd_sad4x4_neon, 12),
+#endif // CONFIG_AV1_HIGHBITDEPTH
#if !CONFIG_REALTIME_ONLY
make_tuple(64, 16, &aom_sad64x16_neon, -1),
make_tuple(32, 8, &aom_sad32x8_neon, -1),
@@ -2047,7 +1891,27 @@
make_tuple(16, 4, &aom_sad16x4_neon, -1),
make_tuple(8, 32, &aom_sad8x32_neon, -1),
make_tuple(4, 16, &aom_sad4x16_neon, -1),
-#endif
+#if CONFIG_AV1_HIGHBITDEPTH
+ make_tuple(64, 16, &aom_highbd_sad64x16_neon, 8),
+ make_tuple(16, 64, &aom_highbd_sad16x64_neon, 8),
+ make_tuple(32, 8, &aom_highbd_sad32x8_neon, 8),
+ make_tuple(8, 32, &aom_highbd_sad8x32_neon, 8),
+ make_tuple(16, 4, &aom_highbd_sad16x4_neon, 8),
+ make_tuple(4, 16, &aom_highbd_sad4x16_neon, 8),
+ make_tuple(64, 16, &aom_highbd_sad64x16_neon, 10),
+ make_tuple(16, 64, &aom_highbd_sad16x64_neon, 10),
+ make_tuple(32, 8, &aom_highbd_sad32x8_neon, 10),
+ make_tuple(8, 32, &aom_highbd_sad8x32_neon, 10),
+ make_tuple(16, 4, &aom_highbd_sad16x4_neon, 10),
+ make_tuple(4, 16, &aom_highbd_sad4x16_neon, 10),
+ make_tuple(64, 16, &aom_highbd_sad64x16_neon, 12),
+ make_tuple(16, 64, &aom_highbd_sad16x64_neon, 12),
+ make_tuple(32, 8, &aom_highbd_sad32x8_neon, 12),
+ make_tuple(8, 32, &aom_highbd_sad8x32_neon, 12),
+ make_tuple(16, 4, &aom_highbd_sad16x4_neon, 12),
+ make_tuple(4, 16, &aom_highbd_sad4x16_neon, 12),
+#endif // CONFIG_AV1_HIGHBITDEPTH
+#endif // !CONFIG_REALTIME_ONLY
};
INSTANTIATE_TEST_SUITE_P(NEON, SADTest, ::testing::ValuesIn(neon_tests));
@@ -2068,6 +1932,56 @@
make_tuple(8, 4, &aom_sad8x4x4d_neon, -1),
make_tuple(4, 8, &aom_sad4x8x4d_neon, -1),
make_tuple(4, 4, &aom_sad4x4x4d_neon, -1),
+#if CONFIG_AV1_HIGHBITDEPTH
+ make_tuple(128, 128, &aom_highbd_sad128x128x4d_neon, 8),
+ make_tuple(128, 64, &aom_highbd_sad128x64x4d_neon, 8),
+ make_tuple(64, 128, &aom_highbd_sad64x128x4d_neon, 8),
+ make_tuple(64, 64, &aom_highbd_sad64x64x4d_neon, 8),
+ make_tuple(64, 32, &aom_highbd_sad64x32x4d_neon, 8),
+ make_tuple(32, 64, &aom_highbd_sad32x64x4d_neon, 8),
+ make_tuple(32, 32, &aom_highbd_sad32x32x4d_neon, 8),
+ make_tuple(32, 16, &aom_highbd_sad32x16x4d_neon, 8),
+ make_tuple(16, 32, &aom_highbd_sad16x32x4d_neon, 8),
+ make_tuple(16, 16, &aom_highbd_sad16x16x4d_neon, 8),
+ make_tuple(16, 8, &aom_highbd_sad16x8x4d_neon, 8),
+ make_tuple(8, 16, &aom_highbd_sad8x16x4d_neon, 8),
+ make_tuple(8, 8, &aom_highbd_sad8x8x4d_neon, 8),
+ make_tuple(8, 4, &aom_highbd_sad8x4x4d_neon, 8),
+ make_tuple(4, 8, &aom_highbd_sad4x8x4d_neon, 8),
+ make_tuple(4, 4, &aom_highbd_sad4x4x4d_neon, 8),
+ make_tuple(128, 128, &aom_highbd_sad128x128x4d_neon, 10),
+ make_tuple(128, 64, &aom_highbd_sad128x64x4d_neon, 10),
+ make_tuple(64, 128, &aom_highbd_sad64x128x4d_neon, 10),
+ make_tuple(64, 64, &aom_highbd_sad64x64x4d_neon, 10),
+ make_tuple(64, 32, &aom_highbd_sad64x32x4d_neon, 10),
+ make_tuple(32, 64, &aom_highbd_sad32x64x4d_neon, 10),
+ make_tuple(32, 32, &aom_highbd_sad32x32x4d_neon, 10),
+ make_tuple(32, 16, &aom_highbd_sad32x16x4d_neon, 10),
+ make_tuple(16, 32, &aom_highbd_sad16x32x4d_neon, 10),
+ make_tuple(16, 16, &aom_highbd_sad16x16x4d_neon, 10),
+ make_tuple(16, 8, &aom_highbd_sad16x8x4d_neon, 10),
+ make_tuple(8, 16, &aom_highbd_sad8x16x4d_neon, 10),
+ make_tuple(8, 8, &aom_highbd_sad8x8x4d_neon, 10),
+ make_tuple(8, 4, &aom_highbd_sad8x4x4d_neon, 10),
+ make_tuple(4, 8, &aom_highbd_sad4x8x4d_neon, 10),
+ make_tuple(4, 4, &aom_highbd_sad4x4x4d_neon, 10),
+ make_tuple(128, 128, &aom_highbd_sad128x128x4d_neon, 12),
+ make_tuple(128, 64, &aom_highbd_sad128x64x4d_neon, 12),
+ make_tuple(64, 128, &aom_highbd_sad64x128x4d_neon, 12),
+ make_tuple(64, 64, &aom_highbd_sad64x64x4d_neon, 12),
+ make_tuple(64, 32, &aom_highbd_sad64x32x4d_neon, 12),
+ make_tuple(32, 64, &aom_highbd_sad32x64x4d_neon, 12),
+ make_tuple(32, 32, &aom_highbd_sad32x32x4d_neon, 12),
+ make_tuple(32, 16, &aom_highbd_sad32x16x4d_neon, 12),
+ make_tuple(16, 32, &aom_highbd_sad16x32x4d_neon, 12),
+ make_tuple(16, 16, &aom_highbd_sad16x16x4d_neon, 12),
+ make_tuple(16, 8, &aom_highbd_sad16x8x4d_neon, 12),
+ make_tuple(8, 16, &aom_highbd_sad8x16x4d_neon, 12),
+ make_tuple(8, 8, &aom_highbd_sad8x8x4d_neon, 12),
+ make_tuple(8, 4, &aom_highbd_sad8x4x4d_neon, 12),
+ make_tuple(4, 8, &aom_highbd_sad4x8x4d_neon, 12),
+ make_tuple(4, 4, &aom_highbd_sad4x4x4d_neon, 12),
+#endif // CONFIG_AV1_HIGHBITDEPTH
#if !CONFIG_REALTIME_ONLY
make_tuple(64, 16, &aom_sad64x16x4d_neon, -1),
make_tuple(32, 8, &aom_sad32x8x4d_neon, -1),
@@ -2075,7 +1989,27 @@
make_tuple(16, 4, &aom_sad16x4x4d_neon, -1),
make_tuple(8, 32, &aom_sad8x32x4d_neon, -1),
make_tuple(4, 16, &aom_sad4x16x4d_neon, -1),
-#endif
+#if CONFIG_AV1_HIGHBITDEPTH
+ make_tuple(64, 16, &aom_highbd_sad64x16x4d_neon, 8),
+ make_tuple(16, 64, &aom_highbd_sad16x64x4d_neon, 8),
+ make_tuple(32, 8, &aom_highbd_sad32x8x4d_neon, 8),
+ make_tuple(8, 32, &aom_highbd_sad8x32x4d_neon, 8),
+ make_tuple(16, 4, &aom_highbd_sad16x4x4d_neon, 8),
+ make_tuple(4, 16, &aom_highbd_sad4x16x4d_neon, 8),
+ make_tuple(64, 16, &aom_highbd_sad64x16x4d_neon, 10),
+ make_tuple(16, 64, &aom_highbd_sad16x64x4d_neon, 10),
+ make_tuple(32, 8, &aom_highbd_sad32x8x4d_neon, 10),
+ make_tuple(8, 32, &aom_highbd_sad8x32x4d_neon, 10),
+ make_tuple(16, 4, &aom_highbd_sad16x4x4d_neon, 10),
+ make_tuple(4, 16, &aom_highbd_sad4x16x4d_neon, 10),
+ make_tuple(64, 16, &aom_highbd_sad64x16x4d_neon, 12),
+ make_tuple(16, 64, &aom_highbd_sad16x64x4d_neon, 12),
+ make_tuple(32, 8, &aom_highbd_sad32x8x4d_neon, 12),
+ make_tuple(8, 32, &aom_highbd_sad8x32x4d_neon, 12),
+ make_tuple(16, 4, &aom_highbd_sad16x4x4d_neon, 12),
+ make_tuple(4, 16, &aom_highbd_sad4x16x4d_neon, 12),
+#endif // CONFIG_AV1_HIGHBITDEPTH
+#endif // !CONFIG_REALTIME_ONLY
};
INSTANTIATE_TEST_SUITE_P(NEON, SADx4Test, ::testing::ValuesIn(x4d_neon_tests));
const SadSkipMxNParam skip_neon_tests[] = {
@@ -2092,14 +2026,87 @@
make_tuple(16, 8, &aom_sad_skip_16x8_neon, -1),
make_tuple(8, 16, &aom_sad_skip_8x16_neon, -1),
make_tuple(8, 8, &aom_sad_skip_8x8_neon, -1),
+ make_tuple(8, 4, &aom_sad_skip_8x4_neon, -1),
make_tuple(4, 8, &aom_sad_skip_4x8_neon, -1),
+ make_tuple(4, 4, &aom_sad_skip_4x4_neon, -1),
+#if CONFIG_AV1_HIGHBITDEPTH
+ make_tuple(128, 128, &aom_highbd_sad_skip_128x128_neon, 8),
+ make_tuple(128, 64, &aom_highbd_sad_skip_128x64_neon, 8),
+ make_tuple(64, 128, &aom_highbd_sad_skip_64x128_neon, 8),
+ make_tuple(64, 64, &aom_highbd_sad_skip_64x64_neon, 8),
+ make_tuple(64, 32, &aom_highbd_sad_skip_64x32_neon, 8),
+ make_tuple(32, 64, &aom_highbd_sad_skip_32x64_neon, 8),
+ make_tuple(32, 32, &aom_highbd_sad_skip_32x32_neon, 8),
+ make_tuple(32, 16, &aom_highbd_sad_skip_32x16_neon, 8),
+ make_tuple(16, 32, &aom_highbd_sad_skip_16x32_neon, 8),
+ make_tuple(16, 16, &aom_highbd_sad_skip_16x16_neon, 8),
+ make_tuple(16, 8, &aom_highbd_sad_skip_16x8_neon, 8),
+ make_tuple(8, 16, &aom_highbd_sad_skip_8x16_neon, 8),
+ make_tuple(8, 8, &aom_highbd_sad_skip_8x8_neon, 8),
+ make_tuple(8, 4, &aom_highbd_sad_skip_8x4_neon, 8),
+ make_tuple(4, 8, &aom_highbd_sad_skip_4x8_neon, 8),
+ make_tuple(4, 4, &aom_highbd_sad_skip_4x4_neon, 8),
+ make_tuple(128, 128, &aom_highbd_sad_skip_128x128_neon, 10),
+ make_tuple(128, 64, &aom_highbd_sad_skip_128x64_neon, 10),
+ make_tuple(64, 128, &aom_highbd_sad_skip_64x128_neon, 10),
+ make_tuple(64, 64, &aom_highbd_sad_skip_64x64_neon, 10),
+ make_tuple(64, 32, &aom_highbd_sad_skip_64x32_neon, 10),
+ make_tuple(32, 64, &aom_highbd_sad_skip_32x64_neon, 10),
+ make_tuple(32, 32, &aom_highbd_sad_skip_32x32_neon, 10),
+ make_tuple(32, 16, &aom_highbd_sad_skip_32x16_neon, 10),
+ make_tuple(16, 32, &aom_highbd_sad_skip_16x32_neon, 10),
+ make_tuple(16, 16, &aom_highbd_sad_skip_16x16_neon, 10),
+ make_tuple(16, 8, &aom_highbd_sad_skip_16x8_neon, 10),
+ make_tuple(8, 16, &aom_highbd_sad_skip_8x16_neon, 10),
+ make_tuple(8, 8, &aom_highbd_sad_skip_8x8_neon, 10),
+ make_tuple(8, 4, &aom_highbd_sad_skip_8x4_neon, 10),
+ make_tuple(4, 8, &aom_highbd_sad_skip_4x8_neon, 10),
+ make_tuple(4, 4, &aom_highbd_sad_skip_4x4_neon, 10),
+ make_tuple(128, 128, &aom_highbd_sad_skip_128x128_neon, 12),
+ make_tuple(128, 64, &aom_highbd_sad_skip_128x64_neon, 12),
+ make_tuple(64, 128, &aom_highbd_sad_skip_64x128_neon, 12),
+ make_tuple(64, 64, &aom_highbd_sad_skip_64x64_neon, 12),
+ make_tuple(64, 32, &aom_highbd_sad_skip_64x32_neon, 12),
+ make_tuple(32, 64, &aom_highbd_sad_skip_32x64_neon, 12),
+ make_tuple(32, 32, &aom_highbd_sad_skip_32x32_neon, 12),
+ make_tuple(32, 16, &aom_highbd_sad_skip_32x16_neon, 12),
+ make_tuple(16, 32, &aom_highbd_sad_skip_16x32_neon, 12),
+ make_tuple(16, 16, &aom_highbd_sad_skip_16x16_neon, 12),
+ make_tuple(16, 8, &aom_highbd_sad_skip_16x8_neon, 12),
+ make_tuple(8, 16, &aom_highbd_sad_skip_8x16_neon, 12),
+ make_tuple(8, 8, &aom_highbd_sad_skip_8x8_neon, 12),
+ make_tuple(8, 4, &aom_highbd_sad_skip_8x4_neon, 12),
+ make_tuple(4, 8, &aom_highbd_sad_skip_4x8_neon, 12),
+ make_tuple(4, 4, &aom_highbd_sad_skip_4x4_neon, 12),
+#endif // CONFIG_AV1_HIGHBITDEPTH
#if !CONFIG_REALTIME_ONLY
make_tuple(64, 16, &aom_sad_skip_64x16_neon, -1),
make_tuple(32, 8, &aom_sad_skip_32x8_neon, -1),
make_tuple(16, 64, &aom_sad_skip_16x64_neon, -1),
+ make_tuple(16, 4, &aom_sad_skip_16x4_neon, -1),
make_tuple(8, 32, &aom_sad_skip_8x32_neon, -1),
make_tuple(4, 16, &aom_sad_skip_4x16_neon, -1),
-#endif
+#if CONFIG_AV1_HIGHBITDEPTH
+ make_tuple(64, 16, &aom_highbd_sad_skip_64x16_neon, 8),
+ make_tuple(16, 64, &aom_highbd_sad_skip_16x64_neon, 8),
+ make_tuple(32, 8, &aom_highbd_sad_skip_32x8_neon, 8),
+ make_tuple(8, 32, &aom_highbd_sad_skip_8x32_neon, 8),
+ make_tuple(16, 4, &aom_highbd_sad_skip_16x4_neon, 8),
+ make_tuple(4, 16, &aom_highbd_sad_skip_4x16_neon, 8),
+ make_tuple(64, 16, &aom_highbd_sad_skip_64x16_neon, 10),
+ make_tuple(16, 64, &aom_highbd_sad_skip_16x64_neon, 10),
+ make_tuple(32, 8, &aom_highbd_sad_skip_32x8_neon, 10),
+ make_tuple(8, 32, &aom_highbd_sad_skip_8x32_neon, 10),
+ make_tuple(16, 4, &aom_highbd_sad_skip_16x4_neon, 10),
+ make_tuple(4, 16, &aom_highbd_sad_skip_4x16_neon, 10),
+ make_tuple(64, 16, &aom_highbd_sad_skip_64x16_neon, 12),
+ make_tuple(16, 64, &aom_highbd_sad_skip_16x64_neon, 12),
+ make_tuple(32, 8, &aom_highbd_sad_skip_32x8_neon, 12),
+ make_tuple(8, 32, &aom_highbd_sad_skip_8x32_neon, 12),
+ make_tuple(16, 4, &aom_highbd_sad_skip_16x4_neon, 12),
+ make_tuple(4, 16, &aom_highbd_sad_skip_4x16_neon, 12),
+#endif // CONFIG_AV1_HIGHBITDEPTH
+#endif // !CONFIG_REALTIME_ONLY
};
INSTANTIATE_TEST_SUITE_P(NEON, SADSkipTest,
::testing::ValuesIn(skip_neon_tests));
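The skip variants instantiated above subsample the block vertically: the SAD is computed over every other row and scaled back up, trading accuracy for speed during motion search. A reference-style sketch of that quantity, assuming the usual convention of doubling the half-height SAD (illustrative only, not the library's code):

    #include <cstdint>
    // Row-subsampled SAD with 2x compensation, as the SADSkip tests exercise.
    static uint32_t sad_skip_ref(int w, int h, const uint8_t *src,
                                 int src_stride, const uint8_t *ref,
                                 int ref_stride) {
      uint32_t sad = 0;
      for (int r = 0; r < h; r += 2) {  // even rows only
        for (int c = 0; c < w; ++c) {
          const int d = src[r * src_stride + c] - ref[r * ref_stride + c];
          sad += d >= 0 ? d : -d;
        }
      }
      return 2 * sad;  // compensate for the skipped rows (assumed convention)
    }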
@@ -2116,16 +2123,89 @@
make_tuple(16, 32, &aom_sad_skip_16x32x4d_neon, -1),
make_tuple(16, 16, &aom_sad_skip_16x16x4d_neon, -1),
make_tuple(16, 8, &aom_sad_skip_16x8x4d_neon, -1),
- make_tuple(8, 8, &aom_sad_skip_8x8x4d_neon, -1),
make_tuple(8, 16, &aom_sad_skip_8x16x4d_neon, -1),
+ make_tuple(8, 8, &aom_sad_skip_8x8x4d_neon, -1),
+ make_tuple(8, 4, &aom_sad_skip_8x4x4d_neon, -1),
make_tuple(4, 8, &aom_sad_skip_4x8x4d_neon, -1),
+ make_tuple(4, 4, &aom_sad_skip_4x4x4d_neon, -1),
+#if CONFIG_AV1_HIGHBITDEPTH
+ make_tuple(128, 128, &aom_highbd_sad_skip_128x128x4d_neon, 8),
+ make_tuple(128, 64, &aom_highbd_sad_skip_128x64x4d_neon, 8),
+ make_tuple(64, 128, &aom_highbd_sad_skip_64x128x4d_neon, 8),
+ make_tuple(64, 64, &aom_highbd_sad_skip_64x64x4d_neon, 8),
+ make_tuple(64, 32, &aom_highbd_sad_skip_64x32x4d_neon, 8),
+ make_tuple(32, 64, &aom_highbd_sad_skip_32x64x4d_neon, 8),
+ make_tuple(32, 32, &aom_highbd_sad_skip_32x32x4d_neon, 8),
+ make_tuple(32, 16, &aom_highbd_sad_skip_32x16x4d_neon, 8),
+ make_tuple(16, 32, &aom_highbd_sad_skip_16x32x4d_neon, 8),
+ make_tuple(16, 16, &aom_highbd_sad_skip_16x16x4d_neon, 8),
+ make_tuple(16, 8, &aom_highbd_sad_skip_16x8x4d_neon, 8),
+ make_tuple(8, 16, &aom_highbd_sad_skip_8x16x4d_neon, 8),
+ make_tuple(8, 8, &aom_highbd_sad_skip_8x8x4d_neon, 8),
+ make_tuple(8, 4, &aom_highbd_sad_skip_8x4x4d_neon, 8),
+ make_tuple(4, 8, &aom_highbd_sad_skip_4x8x4d_neon, 8),
+ make_tuple(4, 4, &aom_highbd_sad_skip_4x4x4d_neon, 8),
+ make_tuple(128, 128, &aom_highbd_sad_skip_128x128x4d_neon, 10),
+ make_tuple(128, 64, &aom_highbd_sad_skip_128x64x4d_neon, 10),
+ make_tuple(64, 128, &aom_highbd_sad_skip_64x128x4d_neon, 10),
+ make_tuple(64, 64, &aom_highbd_sad_skip_64x64x4d_neon, 10),
+ make_tuple(64, 32, &aom_highbd_sad_skip_64x32x4d_neon, 10),
+ make_tuple(32, 64, &aom_highbd_sad_skip_32x64x4d_neon, 10),
+ make_tuple(32, 32, &aom_highbd_sad_skip_32x32x4d_neon, 10),
+ make_tuple(32, 16, &aom_highbd_sad_skip_32x16x4d_neon, 10),
+ make_tuple(16, 32, &aom_highbd_sad_skip_16x32x4d_neon, 10),
+ make_tuple(16, 16, &aom_highbd_sad_skip_16x16x4d_neon, 10),
+ make_tuple(16, 8, &aom_highbd_sad_skip_16x8x4d_neon, 10),
+ make_tuple(8, 16, &aom_highbd_sad_skip_8x16x4d_neon, 10),
+ make_tuple(8, 8, &aom_highbd_sad_skip_8x8x4d_neon, 10),
+ make_tuple(8, 4, &aom_highbd_sad_skip_8x4x4d_neon, 10),
+ make_tuple(4, 8, &aom_highbd_sad_skip_4x8x4d_neon, 10),
+ make_tuple(4, 4, &aom_highbd_sad_skip_4x4x4d_neon, 10),
+ make_tuple(128, 128, &aom_highbd_sad_skip_128x128x4d_neon, 12),
+ make_tuple(128, 64, &aom_highbd_sad_skip_128x64x4d_neon, 12),
+ make_tuple(64, 128, &aom_highbd_sad_skip_64x128x4d_neon, 12),
+ make_tuple(64, 64, &aom_highbd_sad_skip_64x64x4d_neon, 12),
+ make_tuple(64, 32, &aom_highbd_sad_skip_64x32x4d_neon, 12),
+ make_tuple(32, 64, &aom_highbd_sad_skip_32x64x4d_neon, 12),
+ make_tuple(32, 32, &aom_highbd_sad_skip_32x32x4d_neon, 12),
+ make_tuple(32, 16, &aom_highbd_sad_skip_32x16x4d_neon, 12),
+ make_tuple(16, 32, &aom_highbd_sad_skip_16x32x4d_neon, 12),
+ make_tuple(16, 16, &aom_highbd_sad_skip_16x16x4d_neon, 12),
+ make_tuple(16, 8, &aom_highbd_sad_skip_16x8x4d_neon, 12),
+ make_tuple(8, 16, &aom_highbd_sad_skip_8x16x4d_neon, 12),
+ make_tuple(8, 8, &aom_highbd_sad_skip_8x8x4d_neon, 12),
+ make_tuple(8, 4, &aom_highbd_sad_skip_8x4x4d_neon, 12),
+ make_tuple(4, 8, &aom_highbd_sad_skip_4x8x4d_neon, 12),
+ make_tuple(4, 4, &aom_highbd_sad_skip_4x4x4d_neon, 12),
+#endif // CONFIG_AV1_HIGHBITDEPTH
#if !CONFIG_REALTIME_ONLY
make_tuple(64, 16, &aom_sad_skip_64x16x4d_neon, -1),
make_tuple(32, 8, &aom_sad_skip_32x8x4d_neon, -1),
make_tuple(16, 64, &aom_sad_skip_16x64x4d_neon, -1),
+ make_tuple(16, 4, &aom_sad_skip_16x4x4d_neon, -1),
make_tuple(8, 32, &aom_sad_skip_8x32x4d_neon, -1),
make_tuple(4, 16, &aom_sad_skip_4x16x4d_neon, -1),
-#endif
+#if CONFIG_AV1_HIGHBITDEPTH
+ make_tuple(64, 16, &aom_highbd_sad_skip_64x16x4d_neon, 8),
+ make_tuple(16, 64, &aom_highbd_sad_skip_16x64x4d_neon, 8),
+ make_tuple(32, 8, &aom_highbd_sad_skip_32x8x4d_neon, 8),
+ make_tuple(8, 32, &aom_highbd_sad_skip_8x32x4d_neon, 8),
+ make_tuple(16, 4, &aom_highbd_sad_skip_16x4x4d_neon, 8),
+ make_tuple(4, 16, &aom_highbd_sad_skip_4x16x4d_neon, 8),
+ make_tuple(64, 16, &aom_highbd_sad_skip_64x16x4d_neon, 10),
+ make_tuple(16, 64, &aom_highbd_sad_skip_16x64x4d_neon, 10),
+ make_tuple(32, 8, &aom_highbd_sad_skip_32x8x4d_neon, 10),
+ make_tuple(8, 32, &aom_highbd_sad_skip_8x32x4d_neon, 10),
+ make_tuple(16, 4, &aom_highbd_sad_skip_16x4x4d_neon, 10),
+ make_tuple(4, 16, &aom_highbd_sad_skip_4x16x4d_neon, 10),
+ make_tuple(64, 16, &aom_highbd_sad_skip_64x16x4d_neon, 12),
+ make_tuple(16, 64, &aom_highbd_sad_skip_16x64x4d_neon, 12),
+ make_tuple(32, 8, &aom_highbd_sad_skip_32x8x4d_neon, 12),
+ make_tuple(8, 32, &aom_highbd_sad_skip_8x32x4d_neon, 12),
+ make_tuple(16, 4, &aom_highbd_sad_skip_16x4x4d_neon, 12),
+ make_tuple(4, 16, &aom_highbd_sad_skip_4x16x4d_neon, 12),
+#endif // CONFIG_AV1_HIGHBITDEPTH
+#endif // !CONFIG_REALTIME_ONLY
};
INSTANTIATE_TEST_SUITE_P(NEON, SADSkipx4Test,
::testing::ValuesIn(skip_x4d_neon_tests));
@@ -2158,6 +2238,34 @@
};
INSTANTIATE_TEST_SUITE_P(NEON, SADavgTest, ::testing::ValuesIn(avg_neon_tests));
+const SadMxNx4Param x3d_neon_tests[] = {
+ make_tuple(128, 128, &aom_sad128x128x3d_neon, -1),
+ make_tuple(128, 64, &aom_sad128x64x3d_neon, -1),
+ make_tuple(64, 128, &aom_sad64x128x3d_neon, -1),
+ make_tuple(64, 64, &aom_sad64x64x3d_neon, -1),
+ make_tuple(64, 32, &aom_sad64x32x3d_neon, -1),
+ make_tuple(32, 64, &aom_sad32x64x3d_neon, -1),
+ make_tuple(32, 32, &aom_sad32x32x3d_neon, -1),
+ make_tuple(32, 16, &aom_sad32x16x3d_neon, -1),
+ make_tuple(16, 32, &aom_sad16x32x3d_neon, -1),
+ make_tuple(16, 16, &aom_sad16x16x3d_neon, -1),
+ make_tuple(16, 8, &aom_sad16x8x3d_neon, -1),
+ make_tuple(8, 16, &aom_sad8x16x3d_neon, -1),
+ make_tuple(8, 8, &aom_sad8x8x3d_neon, -1),
+ make_tuple(8, 4, &aom_sad8x4x3d_neon, -1),
+ make_tuple(4, 8, &aom_sad4x8x3d_neon, -1),
+ make_tuple(4, 4, &aom_sad4x4x3d_neon, -1),
+#if !CONFIG_REALTIME_ONLY
+ make_tuple(64, 16, &aom_sad64x16x3d_neon, -1),
+ make_tuple(32, 8, &aom_sad32x8x3d_neon, -1),
+ make_tuple(16, 64, &aom_sad16x64x3d_neon, -1),
+ make_tuple(16, 4, &aom_sad16x4x3d_neon, -1),
+ make_tuple(8, 32, &aom_sad8x32x3d_neon, -1),
+ make_tuple(4, 16, &aom_sad4x16x3d_neon, -1),
+#endif // !CONFIG_REALTIME_ONLY
+};
+INSTANTIATE_TEST_SUITE_P(NEON, SADx3Test, ::testing::ValuesIn(x3d_neon_tests));
+
#endif // HAVE_NEON
//------------------------------------------------------------------------------
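For orientation on the x4d/x3d tables above: these kernels evaluate one source block against four (or three) candidate reference blocks per call, writing one SAD per candidate. A hedged reference sketch, assuming the conventional libaom x4d prototype (source and reference pointers with strides, results in a small output array):

    #include <cstdint>
    // Illustrative reference for a w x h x4d SAD; not the library's code.
    static void sad_x4d_ref(int w, int h, const uint8_t *src, int src_stride,
                            const uint8_t *const ref[4], int ref_stride,
                            uint32_t res[4]) {
      for (int k = 0; k < 4; ++k) {  // one SAD per candidate reference
        uint32_t sad = 0;
        for (int r = 0; r < h; ++r) {
          for (int c = 0; c < w; ++c) {
            const int d = src[r * src_stride + c] - ref[k][r * ref_stride + c];
            sad += d >= 0 ? d : -d;
          }
        }
        res[k] = sad;
      }
    }

The x3d variants follow the same shape but fill only three results, which is why they can reuse the SadMxNx4Param test plumbing.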
@@ -2608,75 +2716,35 @@
INSTANTIATE_TEST_SUITE_P(SSE2, SADSkipx4Test,
::testing::ValuesIn(skip_x4d_sse2_tests));
+const DistWtdSadMxNAvgParam dist_wtd_avg_sse2_tests[] = {
+ make_tuple(128, 128, &aom_dist_wtd_sad128x128_avg_sse2, -1),
+ make_tuple(128, 64, &aom_dist_wtd_sad128x64_avg_sse2, -1),
+ make_tuple(64, 128, &aom_dist_wtd_sad64x128_avg_sse2, -1),
+ make_tuple(64, 64, &aom_dist_wtd_sad64x64_avg_sse2, -1),
+ make_tuple(64, 32, &aom_dist_wtd_sad64x32_avg_sse2, -1),
+ make_tuple(32, 64, &aom_dist_wtd_sad32x64_avg_sse2, -1),
+ make_tuple(32, 32, &aom_dist_wtd_sad32x32_avg_sse2, -1),
+ make_tuple(32, 16, &aom_dist_wtd_sad32x16_avg_sse2, -1),
+ make_tuple(16, 32, &aom_dist_wtd_sad16x32_avg_sse2, -1),
+ make_tuple(16, 16, &aom_dist_wtd_sad16x16_avg_sse2, -1),
+ make_tuple(16, 8, &aom_dist_wtd_sad16x8_avg_sse2, -1),
+ make_tuple(8, 16, &aom_dist_wtd_sad8x16_avg_sse2, -1),
+ make_tuple(8, 8, &aom_dist_wtd_sad8x8_avg_sse2, -1),
+ make_tuple(8, 4, &aom_dist_wtd_sad8x4_avg_sse2, -1),
+ make_tuple(4, 8, &aom_dist_wtd_sad4x8_avg_sse2, -1),
+ make_tuple(4, 4, &aom_dist_wtd_sad4x4_avg_sse2, -1),
#if !CONFIG_REALTIME_ONLY
-const SadMxNx4AvgParam x4d_avg_sse2_tests[] = {
- make_tuple(128, 128, &aom_sad128x128x4d_avg_sse2, -1),
- make_tuple(128, 64, &aom_sad128x64x4d_avg_sse2, -1),
- make_tuple(64, 128, &aom_sad64x128x4d_avg_sse2, -1),
- make_tuple(64, 64, &aom_sad64x64x4d_avg_sse2, -1),
- make_tuple(64, 32, &aom_sad64x32x4d_avg_sse2, -1),
- make_tuple(32, 64, &aom_sad32x64x4d_avg_sse2, -1),
- make_tuple(32, 32, &aom_sad32x32x4d_avg_sse2, -1),
- make_tuple(32, 16, &aom_sad32x16x4d_avg_sse2, -1),
- make_tuple(16, 32, &aom_sad16x32x4d_avg_sse2, -1),
- make_tuple(16, 16, &aom_sad16x16x4d_avg_sse2, -1),
- make_tuple(16, 8, &aom_sad16x8x4d_avg_sse2, -1),
- make_tuple(8, 16, &aom_sad8x16x4d_avg_sse2, -1),
- make_tuple(8, 8, &aom_sad8x8x4d_avg_sse2, -1),
- make_tuple(8, 4, &aom_sad8x4x4d_avg_sse2, -1),
- make_tuple(4, 8, &aom_sad4x8x4d_avg_sse2, -1),
- make_tuple(4, 4, &aom_sad4x4x4d_avg_sse2, -1),
- make_tuple(64, 16, &aom_sad64x16x4d_avg_sse2, -1),
- make_tuple(16, 64, &aom_sad16x64x4d_avg_sse2, -1),
- make_tuple(32, 8, &aom_sad32x8x4d_avg_sse2, -1),
- make_tuple(8, 32, &aom_sad8x32x4d_avg_sse2, -1),
- make_tuple(16, 4, &aom_sad16x4x4d_avg_sse2, -1),
- make_tuple(4, 16, &aom_sad4x16x4d_avg_sse2, -1),
-};
-INSTANTIATE_TEST_SUITE_P(SSE2, SADx4AvgTest,
- ::testing::ValuesIn(x4d_avg_sse2_tests));
-#endif // !CONFIG_REALTIME_ONLY
-#endif // HAVE_SSE2
-
-#if HAVE_SSSE3
-// Note: These are named sse2, but part of ssse3 file and only built and linked
-// when ssse3 is enabled.
-const DistWtdSadMxhParam dist_wtd_sad_sse2_tests[] = {
- make_tuple(4, 4, &aom_sad4xh_sse2, -1),
- make_tuple(4, 8, &aom_sad4xh_sse2, -1),
- make_tuple(8, 4, &aom_sad8xh_sse2, -1),
- make_tuple(8, 8, &aom_sad8xh_sse2, -1),
- make_tuple(8, 16, &aom_sad8xh_sse2, -1),
- make_tuple(16, 8, &aom_sad16xh_sse2, -1),
- make_tuple(16, 16, &aom_sad16xh_sse2, -1),
- make_tuple(16, 32, &aom_sad16xh_sse2, -1),
- make_tuple(32, 16, &aom_sad32xh_sse2, -1),
- make_tuple(32, 32, &aom_sad32xh_sse2, -1),
- make_tuple(32, 64, &aom_sad32xh_sse2, -1),
- make_tuple(64, 32, &aom_sad64xh_sse2, -1),
- make_tuple(64, 64, &aom_sad64xh_sse2, -1),
- make_tuple(128, 128, &aom_sad128xh_sse2, -1),
- make_tuple(128, 64, &aom_sad128xh_sse2, -1),
- make_tuple(64, 128, &aom_sad64xh_sse2, -1),
- make_tuple(4, 16, &aom_sad4xh_sse2, -1),
- make_tuple(16, 4, &aom_sad16xh_sse2, -1),
- make_tuple(8, 32, &aom_sad8xh_sse2, -1),
- make_tuple(32, 8, &aom_sad32xh_sse2, -1),
- make_tuple(16, 64, &aom_sad16xh_sse2, -1),
- make_tuple(64, 16, &aom_sad64xh_sse2, -1),
-#if !CONFIG_REALTIME_ONLY
- make_tuple(16, 64, &aom_sad16xh_sse2, -1),
- make_tuple(64, 16, &aom_sad64xh_sse2, -1),
- make_tuple(8, 32, &aom_sad8xh_sse2, -1),
- make_tuple(32, 8, &aom_sad32xh_sse2, -1),
- make_tuple(4, 16, &aom_sad4xh_sse2, -1),
- make_tuple(16, 4, &aom_sad16xh_sse2, -1),
+ make_tuple(64, 16, &aom_dist_wtd_sad64x16_avg_sse2, -1),
+ make_tuple(16, 64, &aom_dist_wtd_sad16x64_avg_sse2, -1),
+ make_tuple(32, 8, &aom_dist_wtd_sad32x8_avg_sse2, -1),
+ make_tuple(8, 32, &aom_dist_wtd_sad8x32_avg_sse2, -1),
+ make_tuple(16, 4, &aom_dist_wtd_sad16x4_avg_sse2, -1),
+ make_tuple(4, 16, &aom_dist_wtd_sad4x16_avg_sse2, -1),
#endif
};
-INSTANTIATE_TEST_SUITE_P(SSE2, DistWtdSADTest,
- ::testing::ValuesIn(dist_wtd_sad_sse2_tests));
-
-#endif // HAVE_SSSE3
+INSTANTIATE_TEST_SUITE_P(SSE2, DistWtdSADavgTest,
+ ::testing::ValuesIn(dist_wtd_avg_sse2_tests));
+#endif // HAVE_SSE2
#if HAVE_SSE3
// The only functions are x3 variants, and they do not have tests.
@@ -2713,35 +2781,6 @@
INSTANTIATE_TEST_SUITE_P(SSSE3, DistWtdCompAvgTest,
::testing::ValuesIn(dist_wtd_comp_avg_ssse3_tests));
-
-const DistWtdSadMxNAvgParam dist_wtd_avg_ssse3_tests[] = {
- make_tuple(128, 128, &aom_dist_wtd_sad128x128_avg_ssse3, -1),
- make_tuple(128, 64, &aom_dist_wtd_sad128x64_avg_ssse3, -1),
- make_tuple(64, 128, &aom_dist_wtd_sad64x128_avg_ssse3, -1),
- make_tuple(64, 64, &aom_dist_wtd_sad64x64_avg_ssse3, -1),
- make_tuple(64, 32, &aom_dist_wtd_sad64x32_avg_ssse3, -1),
- make_tuple(32, 64, &aom_dist_wtd_sad32x64_avg_ssse3, -1),
- make_tuple(32, 32, &aom_dist_wtd_sad32x32_avg_ssse3, -1),
- make_tuple(32, 16, &aom_dist_wtd_sad32x16_avg_ssse3, -1),
- make_tuple(16, 32, &aom_dist_wtd_sad16x32_avg_ssse3, -1),
- make_tuple(16, 16, &aom_dist_wtd_sad16x16_avg_ssse3, -1),
- make_tuple(16, 8, &aom_dist_wtd_sad16x8_avg_ssse3, -1),
- make_tuple(8, 16, &aom_dist_wtd_sad8x16_avg_ssse3, -1),
- make_tuple(8, 8, &aom_dist_wtd_sad8x8_avg_ssse3, -1),
- make_tuple(8, 4, &aom_dist_wtd_sad8x4_avg_ssse3, -1),
- make_tuple(4, 8, &aom_dist_wtd_sad4x8_avg_ssse3, -1),
- make_tuple(4, 4, &aom_dist_wtd_sad4x4_avg_ssse3, -1),
-#if !CONFIG_REALTIME_ONLY
- make_tuple(64, 16, &aom_dist_wtd_sad64x16_avg_ssse3, -1),
- make_tuple(16, 64, &aom_dist_wtd_sad16x64_avg_ssse3, -1),
- make_tuple(32, 8, &aom_dist_wtd_sad32x8_avg_ssse3, -1),
- make_tuple(8, 32, &aom_dist_wtd_sad8x32_avg_ssse3, -1),
- make_tuple(16, 4, &aom_dist_wtd_sad16x4_avg_ssse3, -1),
- make_tuple(4, 16, &aom_dist_wtd_sad4x16_avg_ssse3, -1),
-#endif
-};
-INSTANTIATE_TEST_SUITE_P(SSSE3, DistWtdSADavgTest,
- ::testing::ValuesIn(dist_wtd_avg_ssse3_tests));
#endif // HAVE_SSSE3
#if HAVE_SSE4_1
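The DistWtdSADavg instantiation moved above (from the SSSE3 to the SSE2 section) tests kernels that blend a second predictor into the reference with AV1's distance-weighted compound weights before taking the SAD. A sketch of that structure, assuming the spec's 4-bit weight pairs summing to 16 (e.g. {9,7}, {11,5}, {13,3}) and (a*w0 + b*w1 + 8) >> 4 rounding; which operand takes which weight is an assumption here:

    #include <cstdint>
    // Illustrative only: distance-weighted average, then SAD vs. the source.
    // second_pred is assumed packed with stride w; w0 + w1 == 16.
    static uint32_t dist_wtd_sad_avg_ref(int w, int h, const uint8_t *src,
                                         int src_stride, const uint8_t *ref,
                                         int ref_stride,
                                         const uint8_t *second_pred, int w0,
                                         int w1) {
      uint32_t sad = 0;
      for (int r = 0; r < h; ++r) {
        for (int c = 0; c < w; ++c) {
          const int comp = (w0 * second_pred[r * w + c] +
                            w1 * ref[r * ref_stride + c] + 8) >> 4;
          const int d = src[r * src_stride + c] - comp;
          sad += d >= 0 ? d : -d;
        }
      }
      return sad;
    }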
diff --git a/test/scan_test.cc b/test/scan_test.cc
index dee2ab5..571658e 100644
--- a/test/scan_test.cc
+++ b/test/scan_test.cc
@@ -15,10 +15,10 @@
#include "test/av1_txfm_test.h"
static int scan_test(const int16_t *scan, const int16_t *iscan, int si, int r,
- int c, int w) {
- if (iscan[r * w + c] != si || scan[si] != r * w + c) {
+ int c, int h) {
+ if (iscan[c * h + r] != si || scan[si] != c * h + r) {
printf("r %d c %d ref_iscan %d iscan %d ref_scan %d scan %d\n", r, c, si,
- iscan[r * w + c], r * w + c, scan[si]);
+ iscan[c * h + r], c * h + r, scan[si]);
return 1;
} else {
return 0;
@@ -37,7 +37,7 @@
for (int c = 0; c < w; ++c) {
int r = i - c;
if (r >= 0 && r < h) {
- if (scan_test(scan, iscan, si, r, c, w)) return 1;
+ if (scan_test(scan, iscan, si, r, c, h)) return 1;
++si;
}
}
@@ -45,7 +45,7 @@
for (int r = 0; r < h; ++r) {
int c = i - r;
if (c >= 0 && c < w) {
- if (scan_test(scan, iscan, si, r, c, w)) return 1;
+ if (scan_test(scan, iscan, si, r, c, h)) return 1;
++si;
}
}
@@ -57,7 +57,7 @@
for (int c = 0; c < w; ++c) {
int r = i - c;
if (r >= 0 && r < h) {
- if (scan_test(scan, iscan, si, r, c, w)) return 1;
+ if (scan_test(scan, iscan, si, r, c, h)) return 1;
++si;
}
}
@@ -68,7 +68,7 @@
for (int r = 0; r < h; ++r) {
int c = i - r;
if (c >= 0 && c < w) {
- if (scan_test(scan, iscan, si, r, c, w)) return 1;
+ if (scan_test(scan, iscan, si, r, c, h)) return 1;
++si;
}
}
@@ -77,7 +77,7 @@
int si = 0;
for (int r = 0; r < h; ++r) {
for (int c = 0; c < w; ++c) {
- if (scan_test(scan, iscan, si, r, c, w)) return 1;
+ if (scan_test(scan, iscan, si, r, c, h)) return 1;
++si;
}
}
@@ -86,7 +86,7 @@
int si = 0;
for (int c = 0; c < w; ++c) {
for (int r = 0; r < h; ++r) {
- if (scan_test(scan, iscan, si, r, c, w)) return 1;
+ if (scan_test(scan, iscan, si, r, c, h)) return 1;
++si;
}
}
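The scan_test.cc change above swaps the linear index used by scan_test() from row-major (r * w + c) to column-major (c * h + r), which is why every call site now passes the block height instead of the width. The two mappings side by side, with a worked case: in a w=4, h=2 block, position (r=1, c=2) is index 1*4+2 = 6 row-major but 2*2+1 = 5 column-major.

    // The two linearizations the updated test distinguishes.
    static int row_major_idx(int r, int c, int w) { return r * w + c; }  // old
    static int col_major_idx(int r, int c, int h) { return c * h + r; }  // new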
diff --git a/test/sum_squares_test.cc b/test/sum_squares_test.cc
index 5c049a5..91f172d 100644
--- a/test/sum_squares_test.cc
+++ b/test/sum_squares_test.cc
@@ -238,6 +238,13 @@
#endif // HAVE_SSE2
+#if HAVE_NEON
+INSTANTIATE_TEST_SUITE_P(NEON, SumSquares1DTest,
+ ::testing::Values(TestFuncs1D(
+ aom_sum_squares_i16_c, aom_sum_squares_i16_neon)));
+
+#endif // HAVE_NEON
+
typedef int64_t (*sse_func)(const uint8_t *a, int a_stride, const uint8_t *b,
int b_stride, int width, int height);
typedef libaom_test::FuncParam<sse_func> TestSSEFuncs;
@@ -708,6 +715,14 @@
#endif // HAVE_SSE2
+#if HAVE_NEON
+
+INSTANTIATE_TEST_SUITE_P(NEON, Lowbd2dVarTest,
+ ::testing::Values(TestFuncVar2D(&aom_var_2d_u8_c,
+ &aom_var_2d_u8_neon)));
+
+#endif // HAVE_NEON
+
class Highbd2dVarTest : public ::testing::TestWithParam<TestFuncVar2D> {
public:
virtual ~Highbd2dVarTest() {}
@@ -837,4 +852,12 @@
::testing::Values(TestFuncVar2D(&aom_var_2d_u16_c, &aom_var_2d_u16_avx2)));
#endif // HAVE_SSE2
+
+#if HAVE_NEON
+
+INSTANTIATE_TEST_SUITE_P(
+ NEON, Highbd2dVarTest,
+ ::testing::Values(TestFuncVar2D(&aom_var_2d_u16_c, &aom_var_2d_u16_neon)));
+
+#endif // HAVE_NEON
} // namespace
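The NEON instantiations added above only wire existing C kernels to their NEON counterparts; the quantities themselves are simple. A hedged sketch, assuming aom_sum_squares_i16 returns the sum of x[i]^2 over n values and the var_2d helpers return the unnormalized variance sum(x^2) - sum(x)^2 / N over a w x h block:

    #include <cstdint>
    // Illustrative references under the assumptions stated above.
    static uint64_t sum_squares_i16_ref(const int16_t *src, uint32_t n) {
      uint64_t ss = 0;
      for (uint32_t i = 0; i < n; ++i) ss += (int64_t)src[i] * src[i];
      return ss;
    }
    static uint64_t var_2d_u8_ref(const uint8_t *src, int stride, int w,
                                  int h) {
      uint64_t s = 0, ss = 0;
      for (int r = 0; r < h; ++r) {
        for (int c = 0; c < w; ++c) {
          const uint8_t v = src[r * stride + c];
          s += v;
          ss += (uint64_t)v * v;
        }
      }
      return ss - s * s / (w * h);  // unnormalized variance (assumption)
    }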
diff --git a/test/svc_datarate_test.cc b/test/svc_datarate_test.cc
index a5c3840..d99d6a3 100644
--- a/test/svc_datarate_test.cc
+++ b/test/svc_datarate_test.cc
@@ -92,6 +92,8 @@
screen_mode_ = 0;
rps_mode_ = 0;
rps_recovery_frame_ = 0;
+ user_define_frame_qp_ = 0;
+ set_speed_per_layer_ = false;
}
virtual void PreEncodeFrameHook(::libaom_test::VideoSource *video,
@@ -114,7 +116,15 @@
encoder->Control(AV1E_SET_ENABLE_TPL_MODEL, 0);
encoder->Control(AV1E_SET_DELTAQ_MODE, 0);
if (cfg_.g_threads > 1) {
- encoder->Control(AV1E_SET_TILE_COLUMNS, cfg_.g_threads >> 1);
+ if (cfg_.g_threads == 4) {
+ encoder->Control(AV1E_SET_TILE_COLUMNS, 2);
+ encoder->Control(AV1E_SET_TILE_ROWS, 2);
+ } else if (cfg_.g_threads == 8) {
+ encoder->Control(AV1E_SET_TILE_COLUMNS, 4);
+ encoder->Control(AV1E_SET_TILE_ROWS, 2);
+ } else {
+ encoder->Control(AV1E_SET_TILE_COLUMNS, cfg_.g_threads >> 1);
+ }
encoder->Control(AV1E_SET_ROW_MT, 1);
}
if (screen_mode_) {
@@ -163,6 +173,23 @@
encoder->Control(AV1E_SET_SVC_REF_FRAME_CONFIG, &ref_frame_config_);
encoder->Control(AV1E_SET_SVC_REF_FRAME_COMP_PRED, &ref_frame_comp_pred_);
}
+ if (set_speed_per_layer_) {
+ int speed_per_layer = 10;
+ if (layer_id_.spatial_layer_id == 0) {
+ // For base SL0,TL0: use the speed the test loops over.
+ if (layer_id_.temporal_layer_id == 1) speed_per_layer = 7;
+ if (layer_id_.temporal_layer_id == 2) speed_per_layer = 8;
+ } else if (layer_id_.spatial_layer_id == 1) {
+ if (layer_id_.temporal_layer_id == 0) speed_per_layer = 7;
+ if (layer_id_.temporal_layer_id == 1) speed_per_layer = 8;
+ if (layer_id_.temporal_layer_id == 2) speed_per_layer = 9;
+ } else if (layer_id_.spatial_layer_id == 2) {
+ if (layer_id_.temporal_layer_id == 0) speed_per_layer = 8;
+ if (layer_id_.temporal_layer_id == 1) speed_per_layer = 9;
+ if (layer_id_.temporal_layer_id == 2) speed_per_layer = 10;
+ }
+ encoder->Control(AOME_SET_CPUUSED, speed_per_layer);
+ }
if (set_frame_level_er_) {
int mode =
(layer_id_.spatial_layer_id > 0 || layer_id_.temporal_layer_id > 0);
@@ -193,6 +220,11 @@
}
layer_frame_cnt_++;
DatarateTest::PreEncodeFrameHook(video, encoder);
+
+ if (user_define_frame_qp_) {
+ frame_qp_ = rnd_.PseudoUniform(63);
+ encoder->Control(AV1E_SET_QUANTIZER_ONE_PASS, frame_qp_);
+ }
}
virtual void PostEncodeFrameHook(::libaom_test::Encoder *encoder) {
@@ -200,6 +232,14 @@
encoder->Control(AV1E_GET_NUM_OPERATING_POINTS, &num_operating_points);
ASSERT_EQ(num_operating_points,
number_temporal_layers_ * number_spatial_layers_);
+
+ if (user_define_frame_qp_) {
+ if (current_video_frame_ >= static_cast<unsigned int>(total_frame_))
+ return;
+ int qp;
+ encoder->Control(AOME_GET_LAST_QUANTIZER_64, &qp);
+ ASSERT_EQ(qp, frame_qp_);
+ }
}
virtual void FramePktHook(const aom_codec_cx_pkt_t *pkt) {
@@ -337,7 +377,41 @@
if (rps_mode)
ref_config_rps(ref_frame_config, frame_cnt, rps_recovery_frame);
}
- if (number_temporal_layers_ == 3 && number_spatial_layers_ == 1) {
+ if (number_temporal_layers_ == 2 && number_spatial_layers_ == 1) {
+ // 2-temporal layer.
+ // 1 3 5
+ // 0 2 4
+ // Keep golden fixed at slot 3.
+ base_count = frame_cnt >> 1;
+ ref_frame_config->ref_idx[3] = 3;
+ // Cyclically refresh slots 5, 6, 7, for lag alt ref.
+ lag_index = 5;
+ if (base_count > 0) {
+ lag_index = 5 + (base_count % 3);
+ if (frame_cnt % 2 != 0) lag_index = 5 + ((base_count + 1) % 3);
+ }
+ // Set the altref slot to lag_index.
+ ref_frame_config->ref_idx[6] = lag_index;
+ if (frame_cnt % 2 == 0) {
+ layer_id->temporal_layer_id = 0;
+ // Update LAST on layer 0, reference LAST.
+ ref_frame_config->refresh[0] = 1;
+ ref_frame_config->reference[0] = 1;
+ // Refresh lag_index slot, needed for lagging golden.
+ ref_frame_config->refresh[lag_index] = 1;
+ // Refresh GOLDEN every 32 base layer frames.
+ if (base_count % 32 == 0) ref_frame_config->refresh[3] = 1;
+ } else {
+ layer_id->temporal_layer_id = 1;
+ // No updates on layer 1, reference LAST (TL0).
+ ref_frame_config->reference[0] = 1;
+ }
+ // Always reference golden and altref on TL0.
+ if (layer_id->temporal_layer_id == 0) {
+ ref_frame_config->reference[3] = 1;
+ ref_frame_config->reference[6] = 1;
+ }
+ } else if (number_temporal_layers_ == 3 && number_spatial_layers_ == 1) {
// 3-layer:
// 1 3 5 7
// 2 6
@@ -627,7 +701,7 @@
for (int i = 0; i < number_temporal_layers_ * number_spatial_layers_; i++) {
ASSERT_GE(effective_datarate_tl[i], target_layer_bitrate_[i] * 0.60)
<< " The datarate for the file is lower than target by too much!";
- ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.35)
+ ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.60)
<< " The datarate for the file is greater than target by too much!";
}
// Top temporal layers are non_reference, so exclude them from
@@ -637,6 +711,71 @@
EXPECT_EQ((int)GetMismatchFrames(), 150);
}
+ virtual void SetFrameQpSVC3TL1SLTest() {
+ cfg_.rc_buf_initial_sz = 500;
+ cfg_.rc_buf_optimal_sz = 500;
+ cfg_.rc_buf_sz = 1000;
+ cfg_.rc_dropframe_thresh = 0;
+ cfg_.rc_min_quantizer = 0;
+ cfg_.rc_max_quantizer = 63;
+ cfg_.rc_end_usage = AOM_CBR;
+ cfg_.g_lag_in_frames = 0;
+ cfg_.g_error_resilient = 1;
+
+ user_define_frame_qp_ = 1;
+ total_frame_ = 300;
+
+ ::libaom_test::I420VideoSource video("hantro_collage_w352h288.yuv", 352,
+ 288, 30, 1, 0, 300);
+ const int bitrate_array[2] = { 200, 550 };
+ cfg_.rc_target_bitrate = bitrate_array[GET_PARAM(4)];
+ ResetModel();
+ number_temporal_layers_ = 3;
+ target_layer_bitrate_[0] = 50 * cfg_.rc_target_bitrate / 100;
+ target_layer_bitrate_[1] = 70 * cfg_.rc_target_bitrate / 100;
+ target_layer_bitrate_[2] = cfg_.rc_target_bitrate;
+ ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
+ }
+
+ virtual void SetFrameQpSVC3TL3SLTest() {
+ cfg_.rc_buf_initial_sz = 500;
+ cfg_.rc_buf_optimal_sz = 500;
+ cfg_.rc_buf_sz = 1000;
+ cfg_.rc_dropframe_thresh = 0;
+ cfg_.rc_min_quantizer = 0;
+ cfg_.rc_max_quantizer = 63;
+ cfg_.rc_end_usage = AOM_CBR;
+ cfg_.g_lag_in_frames = 0;
+ cfg_.g_error_resilient = 0;
+
+ user_define_frame_qp_ = 1;
+ total_frame_ = 300;
+
+ ::libaom_test::I420VideoSource video("hantro_collage_w352h288.yuv", 352,
+ 288, 30, 1, 0, 300);
+ const int bitrate_array[2] = { 600, 1200 };
+ cfg_.rc_target_bitrate = bitrate_array[GET_PARAM(4)];
+ ResetModel();
+ number_temporal_layers_ = 3;
+ number_spatial_layers_ = 3;
+ // SL0
+ const int bitrate_sl0 = 1 * cfg_.rc_target_bitrate / 8;
+ target_layer_bitrate_[0] = 50 * bitrate_sl0 / 100;
+ target_layer_bitrate_[1] = 70 * bitrate_sl0 / 100;
+ target_layer_bitrate_[2] = bitrate_sl0;
+ // SL1
+ const int bitrate_sl1 = 3 * cfg_.rc_target_bitrate / 8;
+ target_layer_bitrate_[3] = 50 * bitrate_sl1 / 100;
+ target_layer_bitrate_[4] = 70 * bitrate_sl1 / 100;
+ target_layer_bitrate_[5] = bitrate_sl1;
+ // SL2
+ const int bitrate_sl2 = 4 * cfg_.rc_target_bitrate / 8;
+ target_layer_bitrate_[6] = 50 * bitrate_sl2 / 100;
+ target_layer_bitrate_[7] = 70 * bitrate_sl2 / 100;
+ target_layer_bitrate_[8] = bitrate_sl2;
+ ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
+ }
+
virtual void BasicRateTargetingSVC3TL1SLScreenTest() {
cfg_.rc_buf_initial_sz = 500;
cfg_.rc_buf_optimal_sz = 500;
@@ -663,7 +802,7 @@
for (int i = 0; i < number_temporal_layers_ * number_spatial_layers_; i++) {
ASSERT_GE(effective_datarate_tl[i], target_layer_bitrate_[i] * 0.50)
<< " The datarate for the file is lower than target by too much!";
- ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.5)
+ ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.7)
<< " The datarate for the file is greater than target by too much!";
}
// Top temporal layers are non_reference, so exclude them from
@@ -675,6 +814,44 @@
EXPECT_LE((int)GetMismatchFrames(), 30);
}
+ virtual void BasicRateTargetingSVC2TL1SLScreenDropFrameTest() {
+ cfg_.rc_buf_initial_sz = 500;
+ cfg_.rc_buf_optimal_sz = 500;
+ cfg_.rc_buf_sz = 1000;
+ cfg_.rc_dropframe_thresh = 30;
+ cfg_.rc_min_quantizer = 0;
+ cfg_.rc_max_quantizer = 52;
+ cfg_.rc_end_usage = AOM_CBR;
+ cfg_.g_lag_in_frames = 0;
+ cfg_.g_error_resilient = 0;
+
+ ::libaom_test::I420VideoSource video("hantro_collage_w352h288.yuv", 352,
+ 288, 30, 1, 0, 300);
+
+ const int bitrate_array[2] = { 60, 100 };
+ cfg_.rc_target_bitrate = bitrate_array[GET_PARAM(4)];
+ ResetModel();
+ screen_mode_ = 1;
+ number_temporal_layers_ = 2;
+ number_spatial_layers_ = 1;
+ target_layer_bitrate_[0] = 60 * cfg_.rc_target_bitrate / 100;
+ target_layer_bitrate_[1] = cfg_.rc_target_bitrate;
+ ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
+ for (int i = 0; i < number_temporal_layers_ * number_spatial_layers_; i++) {
+ ASSERT_GE(effective_datarate_tl[i], target_layer_bitrate_[i] * 0.75)
+ << " The datarate for the file is lower than target by too much!";
+ ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.5)
+ << " The datarate for the file is greater than target by too much!";
+ }
+ // Top temporal layers are non_reference, so exclude them from
+ // mismatch count, since loopfilter/cdef is not applied for these on
+ // encoder side, but is always applied on decoder.
+ // This means 150 = #frames(300) - #TL1_frames(150).
+ // We use LE for screen since loopfilter level can become very small
+ // or zero and then the frame is not a mismatch.
+ EXPECT_LE((int)GetMismatchFrames(), 150);
+ }
+
virtual void BasicRateTargetingSVC1TL3SLScreenTest() {
cfg_.rc_buf_initial_sz = 500;
cfg_.rc_buf_optimal_sz = 500;
@@ -810,7 +987,7 @@
for (int i = 0; i < number_temporal_layers_ * number_spatial_layers_; i++) {
ASSERT_GE(effective_datarate_tl[i], target_layer_bitrate_[i] * 0.80)
<< " The datarate for the file is lower than target by too much!";
- ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.35)
+ ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.60)
<< " The datarate for the file is greater than target by too much!";
}
}
@@ -857,7 +1034,7 @@
for (int i = 0; i < number_temporal_layers_; i++) {
ASSERT_GE(effective_datarate_tl[i], target_layer_bitrate_[i] * 0.50)
<< " The datarate for the file is lower than target by too much!";
- ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.35)
+ ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.60)
<< " The datarate for the file is greater than target by too much!";
}
// Only base spatial layer is decoded and there are no non-reference
@@ -905,7 +1082,7 @@
for (int i = 0; i < number_temporal_layers_ * number_spatial_layers_; i++) {
ASSERT_GE(effective_datarate_tl[i], target_layer_bitrate_[i] * 0.585)
<< " The datarate for the file is lower than target by too much!";
- ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.35)
+ ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.60)
<< " The datarate for the file is greater than target by too much!";
}
// All 3 spatial layers are decoded, starting at frame 0, so there are
@@ -938,7 +1115,7 @@
for (int i = 0; i < number_temporal_layers_ * number_spatial_layers_; i++) {
ASSERT_GE(effective_datarate_tl[i], target_layer_bitrate_[i] * 0.80)
<< " The datarate for the file is lower than target by too much!";
- ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.35)
+ ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.60)
<< " The datarate for the file is greater than target by too much!";
}
}
@@ -1129,6 +1306,51 @@
}
}
+ virtual void BasicRateTargetingSVC3TL3SLMultiThreadSpeedPerLayerTest() {
+ cfg_.rc_buf_initial_sz = 500;
+ cfg_.rc_buf_optimal_sz = 500;
+ cfg_.rc_buf_sz = 1000;
+ cfg_.rc_dropframe_thresh = 0;
+ cfg_.rc_min_quantizer = 0;
+ cfg_.rc_max_quantizer = 63;
+ cfg_.rc_end_usage = AOM_CBR;
+ cfg_.g_lag_in_frames = 0;
+ cfg_.g_error_resilient = 0;
+ cfg_.g_threads = 2;
+ ::libaom_test::I420VideoSource video("niklas_640_480_30.yuv", 640, 480, 30,
+ 1, 0, 400);
+ cfg_.g_w = 640;
+ cfg_.g_h = 480;
+ const int bitrate_array[2] = { 600, 1200 };
+ cfg_.rc_target_bitrate = bitrate_array[GET_PARAM(4)];
+ ResetModel();
+ set_speed_per_layer_ = true;
+ number_temporal_layers_ = 3;
+ number_spatial_layers_ = 3;
+ // SL0
+ const int bitrate_sl0 = 1 * cfg_.rc_target_bitrate / 8;
+ target_layer_bitrate_[0] = 50 * bitrate_sl0 / 100;
+ target_layer_bitrate_[1] = 70 * bitrate_sl0 / 100;
+ target_layer_bitrate_[2] = bitrate_sl0;
+ // SL1
+ const int bitrate_sl1 = 3 * cfg_.rc_target_bitrate / 8;
+ target_layer_bitrate_[3] = 50 * bitrate_sl1 / 100;
+ target_layer_bitrate_[4] = 70 * bitrate_sl1 / 100;
+ target_layer_bitrate_[5] = bitrate_sl1;
+ // SL2
+ const int bitrate_sl2 = 4 * cfg_.rc_target_bitrate / 8;
+ target_layer_bitrate_[6] = 50 * bitrate_sl2 / 100;
+ target_layer_bitrate_[7] = 70 * bitrate_sl2 / 100;
+ target_layer_bitrate_[8] = bitrate_sl2;
+ ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
+ for (int i = 0; i < number_temporal_layers_ * number_spatial_layers_; i++) {
+ ASSERT_GE(effective_datarate_tl[i], target_layer_bitrate_[i] * 0.70)
+ << " The datarate for the file is lower than target by too much!";
+ ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.45)
+ << " The datarate for the file is greater than target by too much!";
+ }
+ }
+
virtual void BasicRateTargetingSVC3TL3SLHDMultiThread2Test() {
cfg_.rc_buf_initial_sz = 500;
cfg_.rc_buf_optimal_sz = 500;
@@ -1378,7 +1600,7 @@
for (int i = 0; i < number_temporal_layers_ * number_spatial_layers_; i++) {
ASSERT_GE(effective_datarate_tl[i], target_layer_bitrate_[i] * 0.60)
<< " The datarate for the file is lower than target by too much!";
- ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.35)
+ ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.60)
<< " The datarate for the file is greater than target by too much!";
}
// Test that no mismatches have been found.
@@ -1423,7 +1645,7 @@
for (int i = 0; i < number_temporal_layers_ * number_spatial_layers_; i++) {
ASSERT_GE(effective_datarate_tl[i], target_layer_bitrate_[i] * 0.60)
<< " The datarate for the file is lower than target by too much!";
- ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.35)
+ ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.60)
<< " The datarate for the file is greater than target by too much!";
}
// Test that no mismatches have been found.
@@ -1468,7 +1690,7 @@
for (int i = 0; i < number_temporal_layers_ * number_spatial_layers_; i++) {
ASSERT_GE(effective_datarate_tl[i], target_layer_bitrate_[i] * 0.60)
<< " The datarate for the file is lower than target by too much!";
- ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.35)
+ ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.60)
<< " The datarate for the file is greater than target by too much!";
}
// Test that no mismatches have been found.
@@ -1514,7 +1736,7 @@
for (int i = 0; i < number_temporal_layers_ * number_spatial_layers_; i++) {
ASSERT_GE(effective_datarate_tl[i], target_layer_bitrate_[i] * 0.60)
<< " The datarate for the file is lower than target by too much!";
- ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.35)
+ ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.60)
<< " The datarate for the file is greater than target by too much!";
}
// Test that no mismatches have been found.
@@ -1565,7 +1787,57 @@
for (int i = 0; i < number_temporal_layers_ * number_spatial_layers_; i++) {
ASSERT_GE(effective_datarate_tl[i], target_layer_bitrate_[i] * 0.60)
<< " The datarate for the file is lower than target by too much!";
- ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.35)
+ ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.60)
+ << " The datarate for the file is greater than target by too much!";
+ }
+ // Test that no mismatches have been found.
+ std::cout << " Decoded frames: " << GetDecodedFrames() << "\n";
+ std::cout << " Mismatch frames: " << GetMismatchFrames() << "\n";
+ EXPECT_EQ(300 - GetDecodedFrames(), drop_frames_);
+ EXPECT_EQ((int)GetMismatchFrames(), num_nonref);
+ }
+
+ virtual void BasicRateTargetingSVC2TL1SLDropSetEnhER0Test() {
+ cfg_.rc_buf_initial_sz = 500;
+ cfg_.rc_buf_optimal_sz = 500;
+ cfg_.rc_buf_sz = 1000;
+ cfg_.rc_dropframe_thresh = 0;
+ cfg_.rc_min_quantizer = 0;
+ cfg_.rc_max_quantizer = 63;
+ cfg_.rc_end_usage = AOM_CBR;
+ cfg_.g_lag_in_frames = 0;
+
+ ::libaom_test::I420VideoSource video("hantro_collage_w352h288.yuv", 352,
+ 288, 30, 1, 0, 300);
+ const int bitrate_array[2] = { 200, 550 };
+ cfg_.rc_target_bitrate = bitrate_array[GET_PARAM(4)];
+ ResetModel();
+
+ // Set error_resilience off.
+ cfg_.g_error_resilient = 0;
+
+ // Drop TL1 for part of the sequence: start at the first TL1 at
+ // frame 101 and end at frame 199. Frame 200 is TL0,
+ // so we can continue decoding without mismatch (since LAST is the
+ // only reference).
+ int n = 0;
+ int num_nonref = 300 / 2;
+ for (int i = 101; i < 200; i++) {
+ if (i % 2 != 0) {
+ drop_frames_list_[n] = i;
+ n++;
+ num_nonref -= 1;  // every dropped TL1 frame here is non-reference
+ }
+ }
+ drop_frames_ = n;
+ number_temporal_layers_ = 2;
+ target_layer_bitrate_[0] = 70 * cfg_.rc_target_bitrate / 100;
+ target_layer_bitrate_[1] = cfg_.rc_target_bitrate;
+ ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
+ for (int i = 0; i < number_temporal_layers_ * number_spatial_layers_; i++) {
+ ASSERT_GE(effective_datarate_tl[i], target_layer_bitrate_[i] * 0.60)
+ << " The datarate for the file is lower than target by too much!";
+ ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.60)
<< " The datarate for the file is greater than target by too much!";
}
// Test that no mismatches have been found.
@@ -1597,7 +1869,7 @@
// Drop TL1 and TL2: for part of sequence. Start at first TL2 at
// frame 101, and end at second TL2 at frame 199. Frame 200 is TL0,
// so we can continue decoding without mismatch (since LAST is the
- // only reference and error_resil = 1 on TL1/TL2 frames).
+ // only reference).
int n = 0;
int num_nonref = 300 / 2;
for (int i = 101; i < 200; i++) {
@@ -1616,7 +1888,7 @@
for (int i = 0; i < number_temporal_layers_ * number_spatial_layers_; i++) {
ASSERT_GE(effective_datarate_tl[i], target_layer_bitrate_[i] * 0.60)
<< " The datarate for the file is lower than target by too much!";
- ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.35)
+ ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.60)
<< " The datarate for the file is greater than target by too much!";
}
// Test that no mismatches have been found.
@@ -1645,7 +1917,7 @@
// Drop TL1 and TL2: for part of sequence. Start at first TL2 at
// frame 101, and end at second TL2 at frame 199. Frame 200 is TL0,
// so we can continue decoding without mismatch (since LAST is the
- // only reference and error_resil = 1 on TL1/TL2 frames).
+ // only reference).
// Drop here means drop whole superframe.
int n = 0;
int num_nonref = 300 / 2;
@@ -1679,7 +1951,7 @@
for (int i = 0; i < number_temporal_layers_ * number_spatial_layers_; i++) {
ASSERT_GE(effective_datarate_tl[i], target_layer_bitrate_[i] * 0.60)
<< " The datarate for the file is lower than target by too much!";
- ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.35)
+ ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.60)
<< " The datarate for the file is greater than target by too much!";
}
// Test that no mismatches have been found.
@@ -1822,7 +2094,7 @@
for (int i = 0; i < number_temporal_layers_ * number_spatial_layers_; i++) {
ASSERT_GE(effective_datarate_tl[i], target_layer_bitrate_[i] * 0.60)
<< " The datarate for the file is lower than target by too much!";
- ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.35)
+ ASSERT_LE(effective_datarate_tl[i], target_layer_bitrate_[i] * 1.60)
<< " The datarate for the file is greater than target by too much!";
}
// Test that no mismatches have been found.
@@ -1861,6 +2133,12 @@
int screen_mode_;
int rps_mode_;
int rps_recovery_frame_;
+
+ int user_define_frame_qp_;
+ int frame_qp_;
+ int total_frame_;
+ bool set_speed_per_layer_;
+ libaom_test::ACMRandom rnd_;
};
// Check basic rate targeting for CBR, for 3 temporal layers, 1 spatial.
@@ -1868,12 +2146,21 @@
BasicRateTargetingSVC3TL1SLTest();
}
+TEST_P(DatarateTestSVC, SetFrameQpSVC3TL1SL) { SetFrameQpSVC3TL1SLTest(); }
+
+TEST_P(DatarateTestSVC, SetFrameQpSVC3TL3SL) { SetFrameQpSVC3TL3SLTest(); }
+
// Check basic rate targeting for CBR, for 3 temporal layers, 1 spatial
// for screen mode.
TEST_P(DatarateTestSVC, BasicRateTargetingSVC3TL1SLScreen) {
BasicRateTargetingSVC3TL1SLScreenTest();
}
+// Check basic rate targeting for CBR, for 2 temporal layers, 1 spatial
+// for screen mode, with frame dropper on at low bitrates.
+TEST_P(DatarateTestSVC, BasicRateTargetingSVC2TL1SLScreenDropFrame) {
+ BasicRateTargetingSVC2TL1SLScreenDropFrameTest();
+}
// Check basic rate targeting for CBR, for 3 spatial layers, 1 temporal
// for screen mode.
TEST_P(DatarateTestSVC, BasicRateTargetingSVC1TL3SLScreen) {
@@ -1946,6 +2233,13 @@
}
// Check basic rate targeting for CBR, for 3 spatial, 3 temporal layers,
+// for 2 threads, 2 tile_columns, row-mt enabled, and different speed
+// per layer.
+TEST_P(DatarateTestSVC, BasicRateTargetingSVC3TL3SLMultiThreadSpeedPerLayer) {
+ BasicRateTargetingSVC3TL3SLMultiThreadSpeedPerLayerTest();
+}
+
+// Check basic rate targeting for CBR, for 3 spatial, 3 temporal layers,
// for 2 threads, 2 tile_columns, row-mt enabled.
TEST_P(DatarateTestSVC, BasicRateTargetingSVC3TL3SLHDMultiThread2) {
BasicRateTargetingSVC3TL3SLHDMultiThread2Test();
@@ -1970,7 +2264,11 @@
// Check basic rate targeting for CBR, for 3 spatial, 3 temporal layers,
// for 4:4:4 input.
+#if defined(CONFIG_MAX_DECODE_PROFILE) && CONFIG_MAX_DECODE_PROFILE < 1
+TEST_P(DatarateTestSVC, DISABLED_BasicRateTargeting444SVC3TL3SL) {
+#else
TEST_P(DatarateTestSVC, BasicRateTargeting444SVC3TL3SL) {
+#endif
BasicRateTargeting444SVC3TL3SLTest();
}
@@ -2019,6 +2317,15 @@
BasicRateTargetingSVC3TL1SLDropSetEnhFrameERTest();
}
+// Check basic rate targeting for CBR, for 2 temporal layers, 1 spatial layer,
+// with dropping a set of enhancement layers (TL 1) in the middle of the
+// sequence. Test that the error_resilient flag can be 0/off for all frames.
+// This allows for successful decoding after dropping a set of enhancement
+// layer frames in the sequence.
+TEST_P(DatarateTestSVC, BasicRateTargetingSVC2TL1SLDropSetEnhER0) {
+ BasicRateTargetingSVC2TL1SLDropSetEnhER0Test();
+}
+
// Check basic rate targeting for CBR, for 3 temporal layers, 1 spatial layer,
// with dropping set of enhancement layers (TL 1 and TL2) in middle of sequence.
// Test that the error_resilient flag can be 0/off for all frames.
@@ -2068,7 +2375,7 @@
AV1_INSTANTIATE_TEST_SUITE(DatarateTestSVC,
::testing::Values(::libaom_test::kRealTime),
- ::testing::Range(7, 11), ::testing::Values(0, 3),
+ ::testing::Range(7, 12), ::testing::Values(0, 3),
::testing::Values(0, 1));
} // namespace
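The new SetFrameQp tests above drive AV1E_SET_QUANTIZER_ONE_PASS, one of this release's new controls: a QP in [0, 63] is set before each frame and read back with AOME_GET_LAST_QUANTIZER_64 after encoding. A minimal usage sketch against the public encoder API (setup and error handling elided; ctx and img are assumed to be an initialized encoder and input frame):

    #include <assert.h>
    #include "aom/aomcx.h"
    // Mirrors what the test's Pre/PostEncodeFrameHook pair does per frame.
    static void encode_frame_with_qp(aom_codec_ctx_t *ctx, aom_image_t *img,
                                     int frame_index, int qp) {
      aom_codec_control(ctx, AV1E_SET_QUANTIZER_ONE_PASS, qp);  // qp in [0, 63]
      aom_codec_encode(ctx, img, frame_index, /*duration=*/1, /*flags=*/0);
      int used_qp = -1;
      aom_codec_control(ctx, AOME_GET_LAST_QUANTIZER_64, &used_qp);
      assert(used_qp == qp);  // honored in one-pass CBR, the tested setup
    }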
diff --git a/test/svc_encoder_rtc.sh b/test/svc_encoder_rtc.sh
new file mode 100644
index 0000000..735166d
--- /dev/null
+++ b/test/svc_encoder_rtc.sh
@@ -0,0 +1,85 @@
+#!/bin/sh
+## Copyright (c) 2023, Alliance for Open Media. All rights reserved
+##
+## This source code is subject to the terms of the BSD 2 Clause License and
+## the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+## was not distributed with this source code in the LICENSE file, you can
+## obtain it at www.aomedia.org/license/software. If the Alliance for Open
+## Media Patent License 1.0 was not distributed with this source code in the
+## PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+##
+
+. $(dirname $0)/tools_common.sh
+
+# Environment check: $YUV_RAW_INPUT is required.
+svc_encoder_verify_environment() {
+ if [ ! -e "${YUV_RAW_INPUT}" ]; then
+ echo "Libaom test data must exist in LIBAOM_TEST_DATA_PATH."
+ return 1
+ fi
+}
+
+common_flags="-k 10000"
+common_flags="${common_flags} --max-q=63"
+common_flags="${common_flags} --error-resilient=0"
+
+# Runs svc_encoder_rtc with 1 spatial layer and 3 temporal layers.
+svc_encoder_s1_t3() {
+ local encoder="${LIBAOM_BIN_PATH}/svc_encoder_rtc${AOM_TEST_EXE_SUFFIX}"
+ local output_file="${AOM_TEST_OUTPUT_DIR}/svc_encoder_rtc"
+
+ if [ ! -x "${encoder}" ]; then
+ elog "${encoder} does not exist or is not executable."
+ return 1
+ fi
+
+ eval "${AOM_TEST_PREFIX}" "${encoder}" "${common_flags}" \
+ "--width=${YUV_RAW_INPUT_WIDTH}" \
+ "--height=${YUV_RAW_INPUT_HEIGHT}" \
+ "-lm 2" \
+ "--speed=8" \
+ "--target-bitrate=400" \
+ "--bitrates=220,300,400" \
+ "--spatial-layers=1" \
+ "--temporal-layers=3" \
+ "--timebase=1/30" \
+ "${YUV_RAW_INPUT}" \
+ "-o ${output_file}" \
+ ${devnull} || return 1
+
+ [ -e "${output_file}" ] || return 1
+}
+
+# Runs svc_encoder_rtc with 1 spatial layer and 2 temporal layers at
+# speed 10.
+svc_encoder_s1_t2() {
+ local encoder="${LIBAOM_BIN_PATH}/svc_encoder_rtc${AOM_TEST_EXE_SUFFIX}"
+ local output_file="${AOM_TEST_OUTPUT_DIR}/svc_encoder_rtc"
+
+ if [ ! -x "${encoder}" ]; then
+ elog "${encoder} does not exist or is not executable."
+ return 1
+ fi
+
+ eval "${AOM_TEST_PREFIX}" "${encoder}" "${common_flags}" \
+ "--width=${YUV_RAW_INPUT_WIDTH}" \
+ "--height=${YUV_RAW_INPUT_HEIGHT}" \
+ "-lm 1" \
+ "--speed=10" \
+ "--target-bitrate=400" \
+ "--bitrates=220,400" \
+ "--spatial-layers=1" \
+ "--temporal-layers=2" \
+ "--timebase=1/30" \
+ "${YUV_RAW_INPUT}" \
+ "-o ${output_file}" \
+ ${devnull} || return 1
+
+ [ -e "${output_file}" ] || return 1
+}
+
+if [ "$(av1_encode_available)" = "yes" ]; then
+ svc_encoder_rtc_tests="svc_encoder_s1_t3
+ svc_encoder_s1_t2"
+ run_tests svc_encoder_verify_environment "${svc_encoder_rtc_tests}"
+fi
diff --git a/test/temporal_filter_test.cc b/test/temporal_filter_test.cc
index 154fd5d..e689cd3 100644
--- a/test/temporal_filter_test.cc
+++ b/test/temporal_filter_test.cc
@@ -31,9 +31,7 @@
#include "test/function_equivalence_test.h"
using libaom_test::ACMRandom;
-using libaom_test::FunctionEquivalenceTest;
using ::testing::Combine;
-using ::testing::Range;
using ::testing::Values;
using ::testing::ValuesIn;
@@ -47,11 +45,11 @@
} ColorFormat;
static const char *color_fmt_str[] = { "I400", "I420", "I422", "I444" };
typedef void (*TemporalFilterFunc)(
- const YV12_BUFFER_CONFIG *ref_frame, const MACROBLOCKD *mbd,
+ const YV12_BUFFER_CONFIG *frame_to_filter, const MACROBLOCKD *mbd,
const BLOCK_SIZE block_size, const int mb_row, const int mb_col,
const int num_planes, const double *noise_level, const MV *subblock_mvs,
- const int *subblock_mses, const int q_factor, const int filter_strenght,
- const uint8_t *pred, uint32_t *accum, uint16_t *count);
+ const int *subblock_mses, const int q_factor, const int filter_strength,
+ int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum, uint16_t *count);
typedef libaom_test::FuncParam<TemporalFilterFunc> TemporalFilterFuncParam;
typedef std::tuple<TemporalFilterFuncParam, int> TemporalFilterWithParam;
@@ -62,6 +60,7 @@
virtual ~TemporalFilterTest() {}
virtual void SetUp() {
params_ = GET_PARAM(0);
+ tf_wgt_calc_lvl_ = GET_PARAM(1);
rnd_.Reset(ACMRandom::DeterministicSeed());
src1_ = reinterpret_cast<uint8_t *>(
aom_memalign(8, sizeof(uint8_t) * MAX_MB_PLANE * BH * BW));
@@ -121,6 +120,7 @@
protected:
TemporalFilterFuncParam params_;
+ int32_t tf_wgt_calc_lvl_;
uint8_t *src1_;
uint8_t *src2_;
ACMRandom rnd_;
@@ -131,8 +131,9 @@
ColorFormat color_fmt) {
aom_usec_timer ref_timer, test_timer;
const BLOCK_SIZE block_size = TF_BLOCK_SIZE;
- const int width = block_size_wide[block_size];
- const int height = block_size_high[block_size];
+ static_assert(block_size == BLOCK_32X32, "");
+ const int width = 32;
+ const int height = 32;
int num_planes = MAX_MB_PLANE;
int subsampling_x = 0;
int subsampling_y = 0;
@@ -173,25 +174,25 @@
memset(accumulator_mod, 0, 1024 * 3 * sizeof(accumulator_mod[0]));
memset(count_mod, 0, 1024 * 3 * sizeof(count_mod[0]));
- assert(width == 32 && height == 32);
+ static_assert(width == 32 && height == 32, "");
const MV subblock_mvs[4] = { { 0, 0 }, { 5, 5 }, { 7, 8 }, { 2, 10 } };
const int subblock_mses[4] = { 15, 16, 17, 18 };
const int q_factor = 12;
const int filter_strength = 5;
const int mb_row = 0;
const int mb_col = 0;
- std::unique_ptr<YV12_BUFFER_CONFIG> ref_frame(new (std::nothrow)
- YV12_BUFFER_CONFIG);
- ASSERT_NE(ref_frame, nullptr);
- ref_frame->y_crop_height = 360;
- ref_frame->y_crop_width = 540;
- ref_frame->heights[PLANE_TYPE_Y] = height;
- ref_frame->heights[PLANE_TYPE_UV] = height >> subsampling_y;
- ref_frame->strides[PLANE_TYPE_Y] = stride;
- ref_frame->strides[PLANE_TYPE_UV] = stride >> subsampling_x;
+ std::unique_ptr<YV12_BUFFER_CONFIG> frame_to_filter(new (std::nothrow)
+ YV12_BUFFER_CONFIG);
+ ASSERT_NE(frame_to_filter, nullptr);
+ frame_to_filter->y_crop_height = 360;
+ frame_to_filter->y_crop_width = 540;
+ frame_to_filter->heights[PLANE_TYPE_Y] = height;
+ frame_to_filter->heights[PLANE_TYPE_UV] = height >> subsampling_y;
+ frame_to_filter->strides[PLANE_TYPE_Y] = stride;
+ frame_to_filter->strides[PLANE_TYPE_UV] = stride >> subsampling_x;
DECLARE_ALIGNED(16, uint8_t, src[1024 * 3]);
- ref_frame->buffer_alloc = src;
- ref_frame->flags = 0; // Only support low bit-depth test.
+ frame_to_filter->buffer_alloc = src;
+ frame_to_filter->flags = 0; // Only support low bit-depth test.
memcpy(src, src1_, 1024 * 3 * sizeof(uint8_t));
std::unique_ptr<MACROBLOCKD> mbd(new (std::nothrow) MACROBLOCKD);
@@ -200,26 +201,28 @@
for (int plane = AOM_PLANE_Y; plane < num_planes; plane++) {
int plane_height = plane ? height >> subsampling_y : height;
int plane_stride = plane ? stride >> subsampling_x : stride;
- ref_frame->buffers[plane] =
- ref_frame->buffer_alloc + plane * plane_stride * plane_height;
+ frame_to_filter->buffers[plane] =
+ frame_to_filter->buffer_alloc + plane * plane_stride * plane_height;
mbd->plane[plane].subsampling_x = plane ? subsampling_x : 0;
mbd->plane[plane].subsampling_y = plane ? subsampling_y : 0;
}
- params_.ref_func(ref_frame.get(), mbd.get(), block_size, mb_row, mb_col,
- num_planes, sigma, subblock_mvs, subblock_mses, q_factor,
- filter_strength, src2_, accumulator_ref, count_ref);
- params_.tst_func(ref_frame.get(), mbd.get(), block_size, mb_row, mb_col,
- num_planes, sigma, subblock_mvs, subblock_mses, q_factor,
- filter_strength, src2_, accumulator_mod, count_mod);
+ params_.ref_func(frame_to_filter.get(), mbd.get(), block_size, mb_row,
+ mb_col, num_planes, sigma, subblock_mvs, subblock_mses,
+ q_factor, filter_strength, tf_wgt_calc_lvl_, src2_,
+ accumulator_ref, count_ref);
+ params_.tst_func(frame_to_filter.get(), mbd.get(), block_size, mb_row,
+ mb_col, num_planes, sigma, subblock_mvs, subblock_mses,
+ q_factor, filter_strength, tf_wgt_calc_lvl_, src2_,
+ accumulator_mod, count_mod);
if (run_times > 1) {
aom_usec_timer_start(&ref_timer);
for (int j = 0; j < run_times; j++) {
- params_.ref_func(ref_frame.get(), mbd.get(), block_size, mb_row, mb_col,
- num_planes, sigma, subblock_mvs, subblock_mses,
- q_factor, filter_strength, src2_, accumulator_ref,
- count_ref);
+ params_.ref_func(frame_to_filter.get(), mbd.get(), block_size, mb_row,
+ mb_col, num_planes, sigma, subblock_mvs, subblock_mses,
+ q_factor, filter_strength, tf_wgt_calc_lvl_, src2_,
+ accumulator_ref, count_ref);
}
aom_usec_timer_mark(&ref_timer);
const int elapsed_time_c =
@@ -227,10 +230,10 @@
aom_usec_timer_start(&test_timer);
for (int j = 0; j < run_times; j++) {
- params_.tst_func(ref_frame.get(), mbd.get(), block_size, mb_row, mb_col,
- num_planes, sigma, subblock_mvs, subblock_mses,
- q_factor, filter_strength, src2_, accumulator_mod,
- count_mod);
+ params_.tst_func(frame_to_filter.get(), mbd.get(), block_size, mb_row,
+ mb_col, num_planes, sigma, subblock_mvs, subblock_mses,
+ q_factor, filter_strength, tf_wgt_calc_lvl_, src2_,
+ accumulator_mod, count_mod);
}
aom_usec_timer_mark(&test_timer);
const int elapsed_time_simd =
@@ -286,7 +289,7 @@
&av1_apply_temporal_filter_c, &av1_apply_temporal_filter_avx2) };
INSTANTIATE_TEST_SUITE_P(AVX2, TemporalFilterTest,
Combine(ValuesIn(temporal_filter_test_avx2),
- Range(64, 65, 4)));
+ Values(0, 1)));
#endif // HAVE_AVX2
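Note on the parameterization change in the instantiation above and in the SSE2/NEON ones below: the second Combine() argument previously iterated the placeholder Range(64, 65, 4); it now enumerates the new tf_wgt_calc_lvl argument, so each C/SIMD function pair is exercised with both weight-calculation levels. A minimal sketch of how such a two-way parameterization behaves (names here are illustrative, not from the patch):

#include <tuple>
#include "gtest/gtest.h"

// One test instance is generated per (function-pair index, tf_wgt_calc_lvl)
// combination produced by Combine().
class WgtLvlSketchTest
    : public ::testing::TestWithParam<std::tuple<int, int>> {};

TEST_P(WgtLvlSketchTest, LevelIsZeroOrOne) {
  const int tf_wgt_calc_lvl = std::get<1>(GetParam());
  EXPECT_TRUE(tf_wgt_calc_lvl == 0 || tf_wgt_calc_lvl == 1);
}

INSTANTIATE_TEST_SUITE_P(Sketch, WgtLvlSketchTest,
                         ::testing::Combine(::testing::Values(42),
                                            ::testing::Values(0, 1)));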
#if HAVE_SSE2
@@ -294,7 +297,7 @@
&av1_apply_temporal_filter_c, &av1_apply_temporal_filter_sse2) };
INSTANTIATE_TEST_SUITE_P(SSE2, TemporalFilterTest,
Combine(ValuesIn(temporal_filter_test_sse2),
- Range(64, 65, 4)));
+ Values(0, 1)));
#endif // HAVE_SSE2
#if HAVE_NEON
@@ -302,17 +305,109 @@
&av1_apply_temporal_filter_c, &av1_apply_temporal_filter_neon) };
INSTANTIATE_TEST_SUITE_P(NEON, TemporalFilterTest,
Combine(ValuesIn(temporal_filter_test_neon),
- Range(64, 65, 4)));
+ Values(0, 1)));
#endif // HAVE_NEON
+typedef double (*EstimateNoiseFunc)(const uint8_t *src, int height, int width,
+ int stride, int edge_thresh);
+
+typedef std::tuple<EstimateNoiseFunc, EstimateNoiseFunc, int, int>
+ EstimateNoiseWithParam;
+
+class EstimateNoiseTest
+ : public ::testing::TestWithParam<EstimateNoiseWithParam> {
+ public:
+ virtual ~EstimateNoiseTest() {}
+ virtual void SetUp() {
+ ref_func = GET_PARAM(0);
+ tst_func = GET_PARAM(1);
+ width_ = GET_PARAM(2);
+ height_ = GET_PARAM(3);
+ rnd_.Reset(ACMRandom::DeterministicSeed());
+ src1_ = reinterpret_cast<uint8_t *>(
+ aom_memalign(8, sizeof(uint8_t) * width_ * height_));
+    ASSERT_NE(src1_, nullptr);
+    GenRandomData(width_ * height_);
+ }
+
+ virtual void TearDown() { aom_free(src1_); }
+
+ void RunTest(int run_times) {
+ stride_ = width_;
+
+ for (int i = 0; i < run_times; i++) {
+ double ref_out = ref_func(src1_, height_, width_, stride_,
+ NOISE_ESTIMATION_EDGE_THRESHOLD);
+
+ double tst_out = tst_func(src1_, height_, width_, stride_,
+ NOISE_ESTIMATION_EDGE_THRESHOLD);
+
+ EXPECT_EQ(ref_out, tst_out);
+ }
+ }
+
+ void SpeedTest(int run_times) {
+ stride_ = width_;
+ aom_usec_timer timer;
+ aom_usec_timer_start(&timer);
+ for (int i = 0; i < run_times; i++) {
+ ref_func(src1_, height_, width_, stride_,
+ NOISE_ESTIMATION_EDGE_THRESHOLD);
+ }
+ aom_usec_timer_mark(&timer);
+ const double time1 = static_cast<double>(aom_usec_timer_elapsed(&timer));
+ aom_usec_timer_start(&timer);
+ for (int i = 0; i < run_times; i++) {
+ tst_func(src1_, height_, width_, stride_,
+ NOISE_ESTIMATION_EDGE_THRESHOLD);
+ }
+ aom_usec_timer_mark(&timer);
+ const double time2 = static_cast<double>(aom_usec_timer_elapsed(&timer));
+
+ printf("(%3.2f)\n", time1 / time2);
+ }
+
+ void GenRandomData(int size) {
+ for (int ii = 0; ii < size; ii++) src1_[ii] = rnd_.Rand8();
+ }
+
+ protected:
+ EstimateNoiseFunc ref_func;
+ EstimateNoiseFunc tst_func;
+ ACMRandom rnd_;
+ uint8_t *src1_;
+ int width_;
+ int height_;
+ int stride_;
+};
+GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(EstimateNoiseTest);
+
+TEST_P(EstimateNoiseTest, RandomValues) { RunTest(1); }
+
+TEST_P(EstimateNoiseTest, DISABLED_Speed) { SpeedTest(2000); }
+
+#if HAVE_AVX2
+// Widths and heights for which av1_estimate_noise_from_single_plane() will be
+// tested.
+const int kWidths[] = { 3840, 1920, 1280, 800, 640, 360, 357 };
+const int kHeights[] = { 2160, 1080, 720, 600, 480, 240, 237 };
+
+INSTANTIATE_TEST_SUITE_P(
+ AVX2, EstimateNoiseTest,
+ ::testing::Combine(
+ ::testing::Values(av1_estimate_noise_from_single_plane_c),
+ ::testing::Values(av1_estimate_noise_from_single_plane_avx2),
+ ::testing::ValuesIn(kWidths), ::testing::ValuesIn(kHeights)));
+#endif // HAVE_AVX2
+
#if CONFIG_AV1_HIGHBITDEPTH
typedef void (*HBDTemporalFilterFunc)(
- const YV12_BUFFER_CONFIG *ref_frame, const MACROBLOCKD *mbd,
+ const YV12_BUFFER_CONFIG *frame_to_filter, const MACROBLOCKD *mbd,
const BLOCK_SIZE block_size, const int mb_row, const int mb_col,
const int num_planes, const double *noise_level, const MV *subblock_mvs,
- const int *subblock_mses, const int q_factor, const int filter_strenght,
- const uint8_t *pred, uint32_t *accum, uint16_t *count);
+ const int *subblock_mses, const int q_factor, const int filter_strength,
+ int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum, uint16_t *count);
typedef libaom_test::FuncParam<HBDTemporalFilterFunc>
HBDTemporalFilterFuncParam;
@@ -324,6 +419,7 @@
virtual ~HBDTemporalFilterTest() {}
virtual void SetUp() {
params_ = GET_PARAM(0);
+ tf_wgt_calc_lvl_ = GET_PARAM(1);
rnd_.Reset(ACMRandom::DeterministicSeed());
src1_ = reinterpret_cast<uint16_t *>(
aom_memalign(16, sizeof(uint16_t) * MAX_MB_PLANE * BH * BW));
@@ -385,6 +481,7 @@
protected:
HBDTemporalFilterFuncParam params_;
+ int tf_wgt_calc_lvl_;
uint16_t *src1_;
uint16_t *src2_;
ACMRandom rnd_;
@@ -396,8 +493,9 @@
ColorFormat color_fmt) {
aom_usec_timer ref_timer, test_timer;
const BLOCK_SIZE block_size = TF_BLOCK_SIZE;
- const int width = block_size_wide[block_size];
- const int height = block_size_high[block_size];
+ static_assert(block_size == BLOCK_32X32, "");
+ const int width = 32;
+ const int height = 32;
int num_planes = MAX_MB_PLANE;
int subsampling_x = 0;
int subsampling_y = 0;
@@ -438,25 +536,26 @@
memset(accumulator_mod, 0, 1024 * 3 * sizeof(accumulator_mod[0]));
memset(count_mod, 0, 1024 * 3 * sizeof(count_mod[0]));
- assert(width == 32 && height == 32);
+ static_assert(width == 32 && height == 32, "");
const MV subblock_mvs[4] = { { 0, 0 }, { 5, 5 }, { 7, 8 }, { 2, 10 } };
const int subblock_mses[4] = { 15, 16, 17, 18 };
const int q_factor = 12;
const int filter_strength = 5;
const int mb_row = 0;
const int mb_col = 0;
- std::unique_ptr<YV12_BUFFER_CONFIG> ref_frame(new (std::nothrow)
- YV12_BUFFER_CONFIG);
- ASSERT_NE(ref_frame, nullptr);
- ref_frame->y_crop_height = 360;
- ref_frame->y_crop_width = 540;
- ref_frame->heights[PLANE_TYPE_Y] = height;
- ref_frame->heights[PLANE_TYPE_UV] = height >> subsampling_y;
- ref_frame->strides[PLANE_TYPE_Y] = stride;
- ref_frame->strides[PLANE_TYPE_UV] = stride >> subsampling_x;
+ std::unique_ptr<YV12_BUFFER_CONFIG> frame_to_filter(new (std::nothrow)
+ YV12_BUFFER_CONFIG);
+ ASSERT_NE(frame_to_filter, nullptr);
+ frame_to_filter->y_crop_height = 360;
+ frame_to_filter->y_crop_width = 540;
+ frame_to_filter->heights[PLANE_TYPE_Y] = height;
+ frame_to_filter->heights[PLANE_TYPE_UV] = height >> subsampling_y;
+ frame_to_filter->strides[PLANE_TYPE_Y] = stride;
+ frame_to_filter->strides[PLANE_TYPE_UV] = stride >> subsampling_x;
DECLARE_ALIGNED(16, uint16_t, src[1024 * 3]);
- ref_frame->buffer_alloc = CONVERT_TO_BYTEPTR(src);
- ref_frame->flags = YV12_FLAG_HIGHBITDEPTH; // Only Hihgbd bit-depth test.
+ frame_to_filter->buffer_alloc = CONVERT_TO_BYTEPTR(src);
+  frame_to_filter->flags =
+      YV12_FLAG_HIGHBITDEPTH;  // Only high bit-depth test.
memcpy(src, src1_, 1024 * 3 * sizeof(uint16_t));
std::unique_ptr<MACROBLOCKD> mbd(new (std::nothrow) MACROBLOCKD);
@@ -465,28 +564,28 @@
for (int plane = AOM_PLANE_Y; plane < num_planes; plane++) {
int plane_height = plane ? height >> subsampling_y : height;
int plane_stride = plane ? stride >> subsampling_x : stride;
- ref_frame->buffers[plane] =
- ref_frame->buffer_alloc + plane * plane_stride * plane_height;
+ frame_to_filter->buffers[plane] =
+ frame_to_filter->buffer_alloc + plane * plane_stride * plane_height;
mbd->plane[plane].subsampling_x = plane ? subsampling_x : 0;
mbd->plane[plane].subsampling_y = plane ? subsampling_y : 0;
}
- params_.ref_func(ref_frame.get(), mbd.get(), block_size, mb_row, mb_col,
- num_planes, sigma, subblock_mvs, subblock_mses, q_factor,
- filter_strength, CONVERT_TO_BYTEPTR(src2_),
- accumulator_ref, count_ref);
- params_.tst_func(ref_frame.get(), mbd.get(), block_size, mb_row, mb_col,
- num_planes, sigma, subblock_mvs, subblock_mses, q_factor,
- filter_strength, CONVERT_TO_BYTEPTR(src2_),
- accumulator_mod, count_mod);
+ params_.ref_func(frame_to_filter.get(), mbd.get(), block_size, mb_row,
+ mb_col, num_planes, sigma, subblock_mvs, subblock_mses,
+ q_factor, filter_strength, tf_wgt_calc_lvl_,
+ CONVERT_TO_BYTEPTR(src2_), accumulator_ref, count_ref);
+ params_.tst_func(frame_to_filter.get(), mbd.get(), block_size, mb_row,
+ mb_col, num_planes, sigma, subblock_mvs, subblock_mses,
+ q_factor, filter_strength, tf_wgt_calc_lvl_,
+ CONVERT_TO_BYTEPTR(src2_), accumulator_mod, count_mod);
if (run_times > 1) {
aom_usec_timer_start(&ref_timer);
for (int j = 0; j < run_times; j++) {
- params_.ref_func(ref_frame.get(), mbd.get(), block_size, mb_row, mb_col,
- num_planes, sigma, subblock_mvs, subblock_mses,
- q_factor, filter_strength, CONVERT_TO_BYTEPTR(src2_),
- accumulator_ref, count_ref);
+ params_.ref_func(frame_to_filter.get(), mbd.get(), block_size, mb_row,
+ mb_col, num_planes, sigma, subblock_mvs, subblock_mses,
+ q_factor, filter_strength, tf_wgt_calc_lvl_,
+ CONVERT_TO_BYTEPTR(src2_), accumulator_ref, count_ref);
}
aom_usec_timer_mark(&ref_timer);
const int elapsed_time_c =
@@ -494,10 +593,10 @@
aom_usec_timer_start(&test_timer);
for (int j = 0; j < run_times; j++) {
- params_.tst_func(ref_frame.get(), mbd.get(), block_size, mb_row, mb_col,
- num_planes, sigma, subblock_mvs, subblock_mses,
- q_factor, filter_strength, CONVERT_TO_BYTEPTR(src2_),
- accumulator_mod, count_mod);
+ params_.tst_func(frame_to_filter.get(), mbd.get(), block_size, mb_row,
+ mb_col, num_planes, sigma, subblock_mvs, subblock_mses,
+ q_factor, filter_strength, tf_wgt_calc_lvl_,
+ CONVERT_TO_BYTEPTR(src2_), accumulator_mod, count_mod);
}
aom_usec_timer_mark(&test_timer);
const int elapsed_time_simd =
@@ -554,7 +653,7 @@
};
INSTANTIATE_TEST_SUITE_P(SSE2, HBDTemporalFilterTest,
Combine(ValuesIn(HBDtemporal_filter_test_sse2),
- Range(64, 65, 4)));
+ Values(0, 1)));
#endif // HAVE_SSE2
#if HAVE_AVX2
HBDTemporalFilterFuncParam HBDtemporal_filter_test_avx2[] = {
@@ -563,7 +662,7 @@
};
INSTANTIATE_TEST_SUITE_P(AVX2, HBDTemporalFilterTest,
Combine(ValuesIn(HBDtemporal_filter_test_avx2),
- Range(64, 65, 4)));
+ Values(0, 1)));
#endif // HAVE_AVX2
#endif // CONFIG_AV1_HIGHBITDEPTH
} // namespace
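The EstimateNoiseTest suite added above checks av1_estimate_noise_from_single_plane_c against its AVX2 counterpart for bit-exact agreement across the listed frame sizes. As a rough sketch of the kind of computation being verified, assuming an Immerkaer-style estimator (the kernel, edge test, and constant below are illustrative, not copied from libaom):

#include <cstdint>
#include <cstdlib>

// Convolve the plane with a 3x3 Laplacian, skip pixels whose gradient exceeds
// edge_thresh, and scale the mean absolute response to a noise sigma.
static double estimate_noise_sketch(const uint8_t *src, int height, int width,
                                    int stride, int edge_thresh) {
  int64_t accum = 0;
  int64_t count = 0;
  for (int i = 1; i < height - 1; ++i) {
    for (int j = 1; j < width - 1; ++j) {
      const uint8_t *p = src + i * stride + j;
      // Reject likely edges so only flat-area noise is measured.
      const int gx = p[1] - p[-1];
      const int gy = p[stride] - p[-stride];
      if (abs(gx) + abs(gy) > edge_thresh) continue;
      const int lap = 4 * p[0] - 2 * (p[-1] + p[1] + p[-stride] + p[stride]) +
                      p[-stride - 1] + p[-stride + 1] + p[stride - 1] +
                      p[stride + 1];
      accum += abs(lap);
      ++count;
    }
  }
  if (count == 0) return -1.0;  // Not enough flat area to estimate from.
  // 0.2088857 is sqrt(pi / 2) / 6, the normalization for this kernel.
  return 0.2088857 * (double)accum / (double)count;
}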
diff --git a/test/test-data.sha1 b/test/test-data.sha1
index 3ac50a4..4bd0ddc 100644
--- a/test/test-data.sha1
+++ b/test/test-data.sha1
@@ -570,3 +570,4 @@
c7f336958e7af6162c20ddc84d67c7dfa9826910 *av1-1-b8-16-intra_only-intrabc-extreme-dv.ivf
36a4fcf07e645ed522cde5845dd9c6ab2b2d1502 *av1-1-b8-16-intra_only-intrabc-extreme-dv.ivf.md5
9f935d391fdf4a6f7c320355d45770d2e7d6095c *desktopqvga2.320_240.yuv
+4d1ad6d3070268ccb000d7fc3ae0f5a9447bfe82 *test_input_w1h1.yuv
diff --git a/test/test.cmake b/test/test.cmake
index a173246..672edb3 100644
--- a/test/test.cmake
+++ b/test/test.cmake
@@ -21,8 +21,16 @@
set(AOM_IDE_TEST_FOLDER "test")
set(AOM_IDE_TESTDATA_FOLDER "testdata")
+# Appends |src_list_name| to |AOM_TEST_SOURCE_VARS| at the caller's scope.
+# This collects all variables containing libaom test source files.
+function(add_to_libaom_test_srcs src_list_name)
+ list(APPEND AOM_TEST_SOURCE_VARS ${src_list_name})
+ set(AOM_TEST_SOURCE_VARS "${AOM_TEST_SOURCE_VARS}" PARENT_SCOPE)
+endfunction()
+
list(APPEND AOM_UNIT_TEST_WRAPPER_SOURCES "${AOM_GEN_SRC_DIR}/usage_exit.c"
"${AOM_ROOT}/test/test_libaom.cc")
+add_to_libaom_test_srcs(AOM_UNIT_TEST_WRAPPER_SOURCES)
list(APPEND AOM_UNIT_TEST_COMMON_SOURCES
"${AOM_ROOT}/test/acm_random.h"
@@ -41,6 +49,7 @@
"${AOM_ROOT}/test/transform_test_base.h"
"${AOM_ROOT}/test/util.h"
"${AOM_ROOT}/test/video_source.h")
+add_to_libaom_test_srcs(AOM_UNIT_TEST_COMMON_SOURCES)
list(APPEND AOM_UNIT_TEST_DECODER_SOURCES "${AOM_ROOT}/test/decode_api_test.cc"
"${AOM_ROOT}/test/decode_scalability_test.cc"
@@ -48,6 +57,7 @@
"${AOM_ROOT}/test/invalid_file_test.cc"
"${AOM_ROOT}/test/test_vector_test.cc"
"${AOM_ROOT}/test/ivf_video_source.h")
+add_to_libaom_test_srcs(AOM_UNIT_TEST_DECODER_SOURCES)
list(APPEND AOM_UNIT_TEST_ENCODER_SOURCES
"${AOM_ROOT}/test/active_map_test.cc"
@@ -60,6 +70,7 @@
"${AOM_ROOT}/test/datarate_test.cc"
"${AOM_ROOT}/test/datarate_test.h"
"${AOM_ROOT}/test/deltaq_mode_test.cc"
+ "${AOM_ROOT}/test/dropframe_encode_test.cc"
"${AOM_ROOT}/test/svc_datarate_test.cc"
"${AOM_ROOT}/test/encode_api_test.cc"
"${AOM_ROOT}/test/encode_small_width_height_test.cc"
@@ -86,9 +97,11 @@
"${AOM_ROOT}/test/y4m_video_source.h"
"${AOM_ROOT}/test/yuv_video_source.h"
"${AOM_ROOT}/test/time_stamp_test.cc")
+add_to_libaom_test_srcs(AOM_UNIT_TEST_ENCODER_SOURCES)
list(APPEND AOM_ENCODE_PERF_TEST_SOURCES "${AOM_ROOT}/test/encode_perf_test.cc")
list(APPEND AOM_UNIT_TEST_WEBM_SOURCES "${AOM_ROOT}/test/webm_video_source.h")
+add_to_libaom_test_srcs(AOM_UNIT_TEST_WEBM_SOURCES)
list(APPEND AOM_TEST_INTRA_PRED_SPEED_SOURCES "${AOM_GEN_SRC_DIR}/usage_exit.c"
"${AOM_ROOT}/test/test_intra_pred_speed.cc")
@@ -114,6 +127,7 @@
"${AOM_ROOT}/test/cpu_speed_test.cc"
"${AOM_ROOT}/test/cpu_used_firstpass_test.cc"
"${AOM_ROOT}/test/deltaq_mode_test.cc"
+ "${AOM_ROOT}/test/dropframe_encode_test.cc"
"${AOM_ROOT}/test/end_to_end_psnr_test.cc"
"${AOM_ROOT}/test/force_key_frame_test.cc"
"${AOM_ROOT}/test/gf_pyr_height_test.cc"
@@ -145,15 +159,19 @@
list(APPEND AOM_UNIT_TEST_COMMON_INTRIN_NEON
"${AOM_ROOT}/test/simd_cmp_neon.cc")
+ add_to_libaom_test_srcs(AOM_UNIT_TEST_COMMON_INTRIN_NEON)
list(APPEND AOM_UNIT_TEST_COMMON_INTRIN_SSE2
"${AOM_ROOT}/test/simd_cmp_sse2.cc")
+ add_to_libaom_test_srcs(AOM_UNIT_TEST_COMMON_INTRIN_SSE2)
list(APPEND AOM_UNIT_TEST_COMMON_INTRIN_SSSE3
"${AOM_ROOT}/test/simd_cmp_ssse3.cc")
+ add_to_libaom_test_srcs(AOM_UNIT_TEST_COMMON_INTRIN_SSSE3)
list(APPEND AOM_UNIT_TEST_COMMON_INTRIN_AVX2
"${AOM_ROOT}/test/simd_cmp_avx2.cc")
+ add_to_libaom_test_srcs(AOM_UNIT_TEST_COMMON_INTRIN_AVX2)
list(APPEND AOM_UNIT_TEST_ENCODER_SOURCES
"${AOM_ROOT}/test/arf_freq_test.cc"
@@ -173,7 +191,7 @@
"${AOM_ROOT}/test/blend_a64_mask_test.cc"
"${AOM_ROOT}/test/comp_avg_pred_test.cc"
"${AOM_ROOT}/test/comp_avg_pred_test.h"
- "${AOM_ROOT}/test/comp_mask_variance_test.cc"
+ "${AOM_ROOT}/test/comp_mask_pred_test.cc"
"${AOM_ROOT}/test/encodemb_test.cc"
"${AOM_ROOT}/test/encodetxb_test.cc"
"${AOM_ROOT}/test/end_to_end_qmpsnr_test.cc"
@@ -187,6 +205,7 @@
"${AOM_ROOT}/test/horver_correlation_test.cc"
"${AOM_ROOT}/test/masked_sad_test.cc"
"${AOM_ROOT}/test/masked_variance_test.cc"
+ "${AOM_ROOT}/test/minmax_test.cc"
"${AOM_ROOT}/test/motion_vector_test.cc"
"${AOM_ROOT}/test/mv_cost_test.cc"
"${AOM_ROOT}/test/noise_model_test.cc"
@@ -209,6 +228,7 @@
list(APPEND AOM_UNIT_TEST_ENCODER_INTRIN_SSE4_1
"${AOM_ROOT}/test/simd_cmp_sse4.cc")
+ add_to_libaom_test_srcs(AOM_UNIT_TEST_ENCODER_INTRIN_SSE4_1)
if(NOT CONFIG_REALTIME_ONLY)
list(APPEND AOM_UNIT_TEST_ENCODER_INTRIN_SSE4_1
@@ -334,6 +354,12 @@
endif()
+ if(HAVE_NEON)
+ list(APPEND AOM_UNIT_TEST_ENCODER_SOURCES
+ "${AOM_ROOT}/test/av1_convolve_scale_test.cc"
+ "${AOM_ROOT}/test/av1_horz_only_frame_superres_test.cc")
+ endif()
+
if(HAVE_SSE4_2 OR HAVE_ARM_CRC32)
list(APPEND AOM_UNIT_TEST_ENCODER_SOURCES "${AOM_ROOT}/test/hash_test.cc")
endif()
@@ -356,26 +382,20 @@
endif()
if(CONFIG_AV1_ENCODER AND ENABLE_TESTS)
- list(APPEND AOM_RC_INTERFACE_SOURCES
- "${AOM_ROOT}/test/encode_test_driver.cc"
- "${AOM_ROOT}/test/encode_test_driver.h"
+ list(APPEND AOM_RC_TEST_SOURCES "${AOM_ROOT}/test/codec_factory.h"
"${AOM_ROOT}/test/decode_test_driver.cc"
"${AOM_ROOT}/test/decode_test_driver.h"
- "${AOM_ROOT}/test/codec_factory.h"
- "${AOM_ROOT}/test/test_aom_rc_interface.cc"
+ "${AOM_ROOT}/test/encode_test_driver.cc"
+ "${AOM_ROOT}/test/encode_test_driver.h"
+ "${AOM_ROOT}/test/i420_video_source.h"
"${AOM_ROOT}/test/ratectrl_rtc_test.cc"
- "${AOM_ROOT}/common/y4minput.c"
- "${AOM_ROOT}/common/y4minput.h"
- "${AOM_ROOT}/test/y4m_video_source.h"
- "${AOM_ROOT}/test/yuv_video_source.h")
-
- list(APPEND AV1_RC_QMODE_SOURCES "${AOM_ROOT}/test/mock_ratectrl_qmode.h"
- "${AOM_ROOT}/test/ratectrl_qmode_test.cc"
- "${AOM_ROOT}/test/ducky_encode_test.cc"
- "${AOM_ROOT}/common/y4minput.c" "${AOM_ROOT}/common/y4minput.h"
- "${AOM_ROOT}/common/tools_common.c"
- "${AOM_ROOT}/common/tools_common.h"
- "${AOM_GEN_SRC_DIR}/usage_exit.c")
+ "${AOM_ROOT}/test/test_aom_rc.cc" "${AOM_ROOT}/test/util.h")
+ if(CONFIG_THREE_PASS)
+ # Add the dependencies of "${AOM_ROOT}/common/ivfdec.c".
+ list(APPEND AOM_RC_TEST_SOURCES "${AOM_ROOT}/common/tools_common.c"
+ "${AOM_ROOT}/common/tools_common.h"
+ "${AOM_GEN_SRC_DIR}/usage_exit.c")
+ endif()
endif()
if(ENABLE_TESTS)
@@ -575,65 +595,61 @@
endif()
endif()
- # Collect all variables containing libaom test source files.
- get_cmake_property(all_cmake_vars VARIABLES)
- foreach(var ${all_cmake_vars})
-
- # https://github.com/cheshirekow/cmake_format/issues/34
- # cmake-format: off
- if (("${var}" MATCHES "_TEST_" AND NOT
- "${var}" MATCHES
- "_DATA_\|_CMAKE_\|INTRA_PRED\|_COMPILED\|_HOSTING\|_PERF_\|CODER_")
- OR (CONFIG_AV1_ENCODER AND ENABLE_ENCODE_PERF_TESTS AND
- "${var}" MATCHES "_ENCODE_PERF_TEST_")
- OR (CONFIG_AV1_DECODER AND ENABLE_DECODE_PERF_TESTS AND
- "${var}" MATCHES "_DECODE_PERF_TEST_")
- OR (CONFIG_AV1_ENCODER AND "${var}" MATCHES "_TEST_ENCODER_")
- OR (CONFIG_AV1_DECODER AND "${var}" MATCHES "_TEST_DECODER_"))
- list(APPEND aom_test_source_vars ${var})
- endif()
- # cmake-format: on
- endforeach()
-
# Libaom_test_srcs.txt generation.
set(libaom_test_srcs_txt_file "${AOM_CONFIG_DIR}/libaom_test_srcs.txt")
file(WRITE "${libaom_test_srcs_txt_file}"
"# This file is generated. DO NOT EDIT.\n")
# Static source file list first.
- foreach(aom_test_source_var ${aom_test_source_vars})
+ list(SORT AOM_TEST_SOURCE_VARS)
+ foreach(aom_test_source_var ${AOM_TEST_SOURCE_VARS})
+ if("${aom_test_source_var}" STREQUAL "${last_aom_test_source_var}")
+ message(
+ FATAL_ERROR
+ "Duplicate AOM_TEST_SOURCE_VARS entry: ${aom_test_source_var}")
+ endif()
foreach(file ${${aom_test_source_var}})
if(NOT "${file}" MATCHES "${AOM_CONFIG_DIR}")
string(REPLACE "${AOM_ROOT}/" "" file "${file}")
file(APPEND "${libaom_test_srcs_txt_file}" "${file}\n")
endif()
endforeach()
+ set(last_aom_test_source_var ${aom_test_source_var})
+ endforeach()
+
+ # libaom_test_srcs.gni generation
+ set(libaom_test_srcs_gni_file "${AOM_CONFIG_DIR}/libaom_test_srcs.gni")
+ file(WRITE "${libaom_test_srcs_gni_file}"
+ "# This file is generated. DO NOT EDIT.\n")
+
+ foreach(aom_test_source_var ${AOM_TEST_SOURCE_VARS})
+ string(TOLOWER "${aom_test_source_var}" aom_test_source_var_lowercase)
+ file(APPEND "${libaom_test_srcs_gni_file}"
+ "\n${aom_test_source_var_lowercase} = [\n")
+
+ foreach(file ${${aom_test_source_var}})
+ if(NOT "${file}" MATCHES "${AOM_CONFIG_DIR}")
+ string(REPLACE "${AOM_ROOT}/" "//third_party/libaom/source/libaom/" file
+ "${file}")
+ file(APPEND "${libaom_test_srcs_gni_file}" " \"${file}\",\n")
+ endif()
+ endforeach()
+
+ file(APPEND "${libaom_test_srcs_gni_file}" "]\n")
endforeach()
# Set up test for rc interface
- if(CONFIG_AV1_RC_RTC
- AND CONFIG_AV1_ENCODER
- AND ENABLE_TESTS
- AND CONFIG_WEBM_IO
- AND NOT BUILD_SHARED_LIBS)
- add_executable(test_aom_rc_interface ${AOM_RC_INTERFACE_SOURCES})
- target_link_libraries(test_aom_rc_interface ${AOM_LIB_LINK_TYPE} aom
- aom_av1_rc aom_gtest webm)
- set_property(TARGET test_aom_rc_interface
- PROPERTY FOLDER ${AOM_IDE_TEST_FOLDER})
- list(APPEND AOM_APP_TARGETS test_aom_rc_interface)
- endif()
-
if(CONFIG_AV1_ENCODER
AND ENABLE_TESTS
+ AND CONFIG_WEBM_IO
AND NOT BUILD_SHARED_LIBS
AND NOT CONFIG_REALTIME_ONLY)
- add_executable(test_av1_rc_qmode ${AV1_RC_QMODE_SOURCES})
- target_link_libraries(test_av1_rc_qmode ${AOM_LIB_LINK_TYPE} aom
- av1_rc_qmode aom_gtest aom_gmock)
- set_property(TARGET test_av1_rc_qmode
- PROPERTY FOLDER ${AOM_IDE_TEST_FOLDER})
- list(APPEND AOM_APP_TARGETS test_av1_rc_qmode)
+ add_executable(test_aom_rc ${AOM_RC_TEST_SOURCES})
+ target_link_libraries(test_aom_rc ${AOM_LIB_LINK_TYPE} aom aom_av1_rc
+ aom_gtest aom_gmock webm)
+ set_property(TARGET test_aom_rc PROPERTY FOLDER ${AOM_IDE_TEST_FOLDER})
+ list(APPEND AOM_APP_TARGETS test_aom_rc)
endif()
+
set(AOM_APP_TARGETS ${AOM_APP_TARGETS} PARENT_SCOPE)
endfunction()
diff --git a/test/test_aom_rc_interface.cc b/test/test_aom_rc.cc
similarity index 100%
rename from test/test_aom_rc_interface.cc
rename to test/test_aom_rc.cc
diff --git a/test/test_data_util.cmake b/test/test_data_util.cmake
index b5d6fda..de7d153 100644
--- a/test/test_data_util.cmake
+++ b/test/test_data_util.cmake
@@ -38,8 +38,8 @@
"niklas_640_480_30.yuv"
"vase10x10.yuv"
"vase10x10_tiles.txt"
- "firstpass_stats"
- "bus_352x288_420_f20_b8.yuv")
+ "bus_352x288_420_f20_b8.yuv"
+ "test_input_w1h1.yuv")
if(ENABLE_DECODE_PERF_TESTS AND CONFIG_AV1_ENCODER)
list(APPEND AOM_TEST_DATA_FILE_NAMES "niklas_1280_720_30.yuv")
diff --git a/test/test_intra_pred_speed.cc b/test/test_intra_pred_speed.cc
index bf90d4a..d5c94be 100644
--- a/test/test_intra_pred_speed.cc
+++ b/test/test_intra_pred_speed.cc
@@ -468,12 +468,16 @@
aom_h_predictor_4x4_neon, aom_paeth_predictor_4x4_neon,
aom_smooth_predictor_4x4_neon, aom_smooth_v_predictor_4x4_neon,
aom_smooth_h_predictor_4x4_neon)
-INTRA_PRED_TEST(NEON, TX_4X8, nullptr, nullptr, nullptr, nullptr, nullptr,
- nullptr, aom_paeth_predictor_4x8_neon,
+INTRA_PRED_TEST(NEON, TX_4X8, aom_dc_predictor_4x8_neon,
+ aom_dc_left_predictor_4x8_neon, aom_dc_top_predictor_4x8_neon,
+ aom_dc_128_predictor_4x8_neon, aom_v_predictor_4x8_neon,
+ aom_h_predictor_4x8_neon, aom_paeth_predictor_4x8_neon,
aom_smooth_predictor_4x8_neon, aom_smooth_v_predictor_4x8_neon,
aom_smooth_h_predictor_4x8_neon)
-INTRA_PRED_TEST(NEON, TX_4X16, nullptr, nullptr, nullptr, nullptr, nullptr,
- nullptr, aom_paeth_predictor_4x16_neon,
+INTRA_PRED_TEST(NEON, TX_4X16, aom_dc_predictor_4x16_neon,
+ aom_dc_left_predictor_4x16_neon, aom_dc_top_predictor_4x16_neon,
+ aom_dc_128_predictor_4x16_neon, aom_v_predictor_4x16_neon,
+ aom_h_predictor_4x16_neon, aom_paeth_predictor_4x16_neon,
aom_smooth_predictor_4x16_neon,
aom_smooth_v_predictor_4x16_neon,
aom_smooth_h_predictor_4x16_neon)
@@ -555,17 +559,23 @@
aom_h_predictor_8x8_neon, aom_paeth_predictor_8x8_neon,
aom_smooth_predictor_8x8_neon, aom_smooth_v_predictor_8x8_neon,
aom_smooth_h_predictor_8x8_neon)
-INTRA_PRED_TEST(NEON, TX_8X4, nullptr, nullptr, nullptr, nullptr, nullptr,
- nullptr, aom_paeth_predictor_8x4_neon,
+INTRA_PRED_TEST(NEON, TX_8X4, aom_dc_predictor_8x4_neon,
+ aom_dc_left_predictor_8x4_neon, aom_dc_top_predictor_8x4_neon,
+ aom_dc_128_predictor_8x4_neon, aom_v_predictor_8x4_neon,
+ aom_h_predictor_8x4_neon, aom_paeth_predictor_8x4_neon,
aom_smooth_predictor_8x4_neon, aom_smooth_v_predictor_8x4_neon,
aom_smooth_h_predictor_8x4_neon)
-INTRA_PRED_TEST(NEON, TX_8X16, nullptr, nullptr, nullptr, nullptr, nullptr,
- nullptr, aom_paeth_predictor_8x16_neon,
+INTRA_PRED_TEST(NEON, TX_8X16, aom_dc_predictor_8x16_neon,
+ aom_dc_left_predictor_8x16_neon, aom_dc_top_predictor_8x16_neon,
+ aom_dc_128_predictor_8x16_neon, aom_v_predictor_8x16_neon,
+ aom_h_predictor_8x16_neon, aom_paeth_predictor_8x16_neon,
aom_smooth_predictor_8x16_neon,
aom_smooth_v_predictor_8x16_neon,
aom_smooth_h_predictor_8x16_neon)
-INTRA_PRED_TEST(NEON, TX_8X32, nullptr, nullptr, nullptr, nullptr, nullptr,
- nullptr, aom_paeth_predictor_8x32_neon,
+INTRA_PRED_TEST(NEON, TX_8X32, aom_dc_predictor_8x32_neon,
+ aom_dc_left_predictor_8x32_neon, aom_dc_top_predictor_8x32_neon,
+ aom_dc_128_predictor_8x32_neon, aom_v_predictor_8x32_neon,
+ aom_h_predictor_8x32_neon, aom_paeth_predictor_8x32_neon,
aom_smooth_predictor_8x32_neon,
aom_smooth_v_predictor_8x32_neon,
aom_smooth_h_predictor_8x32_neon)
@@ -683,23 +693,33 @@
aom_smooth_predictor_16x16_neon,
aom_smooth_v_predictor_16x16_neon,
aom_smooth_h_predictor_16x16_neon)
-INTRA_PRED_TEST(NEON, TX_16X8, nullptr, nullptr, nullptr, nullptr, nullptr,
- nullptr, aom_paeth_predictor_16x8_neon,
+INTRA_PRED_TEST(NEON, TX_16X8, aom_dc_predictor_16x8_neon,
+ aom_dc_left_predictor_16x8_neon, aom_dc_top_predictor_16x8_neon,
+ aom_dc_128_predictor_16x8_neon, aom_v_predictor_16x8_neon,
+ aom_h_predictor_16x8_neon, aom_paeth_predictor_16x8_neon,
aom_smooth_predictor_16x8_neon,
aom_smooth_v_predictor_16x8_neon,
aom_smooth_h_predictor_16x8_neon)
-INTRA_PRED_TEST(NEON, TX_16X32, nullptr, nullptr, nullptr, nullptr, nullptr,
- nullptr, aom_paeth_predictor_16x32_neon,
+INTRA_PRED_TEST(NEON, TX_16X32, aom_dc_predictor_16x32_neon,
+ aom_dc_left_predictor_16x32_neon,
+ aom_dc_top_predictor_16x32_neon,
+ aom_dc_128_predictor_16x32_neon, aom_v_predictor_16x32_neon,
+ aom_h_predictor_16x32_neon, aom_paeth_predictor_16x32_neon,
aom_smooth_predictor_16x32_neon,
aom_smooth_v_predictor_16x32_neon,
aom_smooth_h_predictor_16x32_neon)
-INTRA_PRED_TEST(NEON, TX_16X4, nullptr, nullptr, nullptr, nullptr, nullptr,
- nullptr, aom_paeth_predictor_16x4_neon,
+INTRA_PRED_TEST(NEON, TX_16X4, aom_dc_predictor_16x4_neon,
+ aom_dc_left_predictor_16x4_neon, aom_dc_top_predictor_16x4_neon,
+ aom_dc_128_predictor_16x4_neon, aom_v_predictor_16x4_neon,
+ aom_h_predictor_16x4_neon, aom_paeth_predictor_16x4_neon,
aom_smooth_predictor_16x4_neon,
aom_smooth_v_predictor_16x4_neon,
aom_smooth_h_predictor_16x4_neon)
-INTRA_PRED_TEST(NEON, TX_16X64, nullptr, nullptr, nullptr, nullptr, nullptr,
- nullptr, aom_paeth_predictor_16x64_neon,
+INTRA_PRED_TEST(NEON, TX_16X64, aom_dc_predictor_16x64_neon,
+ aom_dc_left_predictor_16x64_neon,
+ aom_dc_top_predictor_16x64_neon,
+ aom_dc_128_predictor_16x64_neon, aom_v_predictor_16x64_neon,
+ aom_h_predictor_16x64_neon, aom_paeth_predictor_16x64_neon,
aom_smooth_predictor_16x64_neon,
aom_smooth_v_predictor_16x64_neon,
aom_smooth_h_predictor_16x64_neon)
@@ -808,18 +828,26 @@
aom_smooth_predictor_32x32_neon,
aom_smooth_v_predictor_32x32_neon,
aom_smooth_h_predictor_32x32_neon)
-INTRA_PRED_TEST(NEON, TX_32X16, nullptr, nullptr, nullptr, nullptr, nullptr,
- nullptr, aom_paeth_predictor_32x16_neon,
+INTRA_PRED_TEST(NEON, TX_32X16, aom_dc_predictor_32x16_neon,
+ aom_dc_left_predictor_32x16_neon,
+ aom_dc_top_predictor_32x16_neon,
+ aom_dc_128_predictor_32x16_neon, aom_v_predictor_32x16_neon,
+ aom_h_predictor_32x16_neon, aom_paeth_predictor_32x16_neon,
aom_smooth_predictor_32x16_neon,
aom_smooth_v_predictor_32x16_neon,
aom_smooth_h_predictor_32x16_neon)
-INTRA_PRED_TEST(NEON, TX_32X64, nullptr, nullptr, nullptr, nullptr, nullptr,
- nullptr, aom_paeth_predictor_32x64_neon,
+INTRA_PRED_TEST(NEON, TX_32X64, aom_dc_predictor_32x64_neon,
+ aom_dc_left_predictor_32x64_neon,
+ aom_dc_top_predictor_32x64_neon,
+ aom_dc_128_predictor_32x64_neon, aom_v_predictor_32x64_neon,
+ aom_h_predictor_32x64_neon, aom_paeth_predictor_32x64_neon,
aom_smooth_predictor_32x64_neon,
aom_smooth_v_predictor_32x64_neon,
aom_smooth_h_predictor_32x64_neon)
-INTRA_PRED_TEST(NEON, TX_32X8, nullptr, nullptr, nullptr, nullptr, nullptr,
- nullptr, aom_paeth_predictor_32x8_neon,
+INTRA_PRED_TEST(NEON, TX_32X8, aom_dc_predictor_32x8_neon,
+ aom_dc_left_predictor_32x8_neon, aom_dc_top_predictor_32x8_neon,
+ aom_dc_128_predictor_32x8_neon, aom_v_predictor_32x8_neon,
+ aom_h_predictor_32x8_neon, aom_paeth_predictor_32x8_neon,
aom_smooth_predictor_32x8_neon,
aom_smooth_v_predictor_32x8_neon,
aom_smooth_h_predictor_32x8_neon)
@@ -905,18 +933,27 @@
#endif
#if HAVE_NEON
-INTRA_PRED_TEST(NEON, TX_64X64, nullptr, nullptr, nullptr, nullptr, nullptr,
- nullptr, aom_paeth_predictor_64x64_neon,
+INTRA_PRED_TEST(NEON, TX_64X64, aom_dc_predictor_64x64_neon,
+ aom_dc_left_predictor_64x64_neon,
+ aom_dc_top_predictor_64x64_neon,
+ aom_dc_128_predictor_64x64_neon, aom_v_predictor_64x64_neon,
+ aom_h_predictor_64x64_neon, aom_paeth_predictor_64x64_neon,
aom_smooth_predictor_64x64_neon,
aom_smooth_v_predictor_64x64_neon,
aom_smooth_h_predictor_64x64_neon)
-INTRA_PRED_TEST(NEON, TX_64X32, nullptr, nullptr, nullptr, nullptr, nullptr,
- nullptr, aom_paeth_predictor_64x32_neon,
+INTRA_PRED_TEST(NEON, TX_64X32, aom_dc_predictor_64x32_neon,
+ aom_dc_left_predictor_64x32_neon,
+ aom_dc_top_predictor_64x32_neon,
+ aom_dc_128_predictor_64x32_neon, aom_v_predictor_64x32_neon,
+ aom_h_predictor_64x32_neon, aom_paeth_predictor_64x32_neon,
aom_smooth_predictor_64x32_neon,
aom_smooth_v_predictor_64x32_neon,
aom_smooth_h_predictor_64x32_neon)
-INTRA_PRED_TEST(NEON, TX_64X16, nullptr, nullptr, nullptr, nullptr, nullptr,
- nullptr, aom_paeth_predictor_64x16_neon,
+INTRA_PRED_TEST(NEON, TX_64X16, aom_dc_predictor_64x16_neon,
+ aom_dc_left_predictor_64x16_neon,
+ aom_dc_top_predictor_64x16_neon,
+ aom_dc_128_predictor_64x16_neon, aom_v_predictor_64x16_neon,
+ aom_h_predictor_64x16_neon, aom_paeth_predictor_64x16_neon,
aom_smooth_predictor_64x16_neon,
aom_smooth_v_predictor_64x16_neon,
aom_smooth_h_predictor_64x16_neon)
@@ -1268,20 +1305,32 @@
nullptr, nullptr)
#endif
#if HAVE_NEON
-HIGHBD_INTRA_PRED_TEST(NEON, TX_4X4, aom_highbd_dc_predictor_4x4_neon, nullptr,
- nullptr, nullptr, aom_highbd_v_predictor_4x4_neon,
- nullptr, aom_highbd_paeth_predictor_4x4_neon,
+HIGHBD_INTRA_PRED_TEST(NEON, TX_4X4, aom_highbd_dc_predictor_4x4_neon,
+ aom_highbd_dc_left_predictor_4x4_neon,
+ aom_highbd_dc_top_predictor_4x4_neon,
+ aom_highbd_dc_128_predictor_4x4_neon,
+ aom_highbd_v_predictor_4x4_neon,
+ aom_highbd_h_predictor_4x4_neon,
+ aom_highbd_paeth_predictor_4x4_neon,
aom_highbd_smooth_predictor_4x4_neon,
aom_highbd_smooth_v_predictor_4x4_neon,
aom_highbd_smooth_h_predictor_4x4_neon)
-HIGHBD_INTRA_PRED_TEST(NEON, TX_4X8, nullptr, nullptr, nullptr, nullptr,
- aom_highbd_v_predictor_4x8_neon, nullptr,
+HIGHBD_INTRA_PRED_TEST(NEON, TX_4X8, aom_highbd_dc_predictor_4x8_neon,
+ aom_highbd_dc_left_predictor_4x8_neon,
+ aom_highbd_dc_top_predictor_4x8_neon,
+ aom_highbd_dc_128_predictor_4x8_neon,
+ aom_highbd_v_predictor_4x8_neon,
+ aom_highbd_h_predictor_4x8_neon,
aom_highbd_paeth_predictor_4x8_neon,
aom_highbd_smooth_predictor_4x8_neon,
aom_highbd_smooth_v_predictor_4x8_neon,
aom_highbd_smooth_h_predictor_4x8_neon)
-HIGHBD_INTRA_PRED_TEST(NEON, TX_4X16, nullptr, nullptr, nullptr, nullptr,
- aom_highbd_v_predictor_4x16_neon, nullptr,
+HIGHBD_INTRA_PRED_TEST(NEON, TX_4X16, aom_highbd_dc_predictor_4x16_neon,
+ aom_highbd_dc_left_predictor_4x16_neon,
+ aom_highbd_dc_top_predictor_4x16_neon,
+ aom_highbd_dc_128_predictor_4x16_neon,
+ aom_highbd_v_predictor_4x16_neon,
+ aom_highbd_h_predictor_4x16_neon,
aom_highbd_paeth_predictor_4x16_neon,
aom_highbd_smooth_predictor_4x16_neon,
aom_highbd_smooth_v_predictor_4x16_neon,
@@ -1350,26 +1399,42 @@
#endif
#if HAVE_NEON
-HIGHBD_INTRA_PRED_TEST(NEON, TX_8X8, aom_highbd_dc_predictor_8x8_neon, nullptr,
- nullptr, nullptr, aom_highbd_v_predictor_8x8_neon,
- nullptr, aom_highbd_paeth_predictor_8x8_neon,
+HIGHBD_INTRA_PRED_TEST(NEON, TX_8X8, aom_highbd_dc_predictor_8x8_neon,
+ aom_highbd_dc_left_predictor_8x8_neon,
+ aom_highbd_dc_top_predictor_8x8_neon,
+ aom_highbd_dc_128_predictor_8x8_neon,
+ aom_highbd_v_predictor_8x8_neon,
+ aom_highbd_h_predictor_8x8_neon,
+ aom_highbd_paeth_predictor_8x8_neon,
aom_highbd_smooth_predictor_8x8_neon,
aom_highbd_smooth_v_predictor_8x8_neon,
aom_highbd_smooth_h_predictor_8x8_neon)
-HIGHBD_INTRA_PRED_TEST(NEON, TX_8X4, nullptr, nullptr, nullptr, nullptr,
- aom_highbd_v_predictor_8x4_neon, nullptr,
+HIGHBD_INTRA_PRED_TEST(NEON, TX_8X4, aom_highbd_dc_predictor_8x4_neon,
+ aom_highbd_dc_left_predictor_8x4_neon,
+ aom_highbd_dc_top_predictor_8x4_neon,
+ aom_highbd_dc_128_predictor_8x4_neon,
+ aom_highbd_v_predictor_8x4_neon,
+ aom_highbd_h_predictor_8x4_neon,
aom_highbd_paeth_predictor_8x4_neon,
aom_highbd_smooth_predictor_8x4_neon,
aom_highbd_smooth_v_predictor_8x4_neon,
aom_highbd_smooth_h_predictor_8x4_neon)
-HIGHBD_INTRA_PRED_TEST(NEON, TX_8X16, nullptr, nullptr, nullptr, nullptr,
- aom_highbd_v_predictor_8x16_neon, nullptr,
+HIGHBD_INTRA_PRED_TEST(NEON, TX_8X16, aom_highbd_dc_predictor_8x16_neon,
+ aom_highbd_dc_left_predictor_8x16_neon,
+ aom_highbd_dc_top_predictor_8x16_neon,
+ aom_highbd_dc_128_predictor_8x16_neon,
+ aom_highbd_v_predictor_8x16_neon,
+ aom_highbd_h_predictor_8x16_neon,
aom_highbd_paeth_predictor_8x16_neon,
aom_highbd_smooth_predictor_8x16_neon,
aom_highbd_smooth_v_predictor_8x16_neon,
aom_highbd_smooth_h_predictor_8x16_neon)
-HIGHBD_INTRA_PRED_TEST(NEON, TX_8X32, nullptr, nullptr, nullptr, nullptr,
- aom_highbd_v_predictor_8x32_neon, nullptr,
+HIGHBD_INTRA_PRED_TEST(NEON, TX_8X32, aom_highbd_dc_predictor_8x32_neon,
+ aom_highbd_dc_left_predictor_8x32_neon,
+ aom_highbd_dc_top_predictor_8x32_neon,
+ aom_highbd_dc_128_predictor_8x32_neon,
+ aom_highbd_v_predictor_8x32_neon,
+ aom_highbd_h_predictor_8x32_neon,
aom_highbd_paeth_predictor_8x32_neon,
aom_highbd_smooth_predictor_8x32_neon,
aom_highbd_smooth_v_predictor_8x32_neon,
@@ -1457,32 +1522,51 @@
#if HAVE_NEON
HIGHBD_INTRA_PRED_TEST(NEON, TX_16X16, aom_highbd_dc_predictor_16x16_neon,
- nullptr, nullptr, nullptr,
- aom_highbd_v_predictor_16x16_neon, nullptr,
+ aom_highbd_dc_left_predictor_16x16_neon,
+ aom_highbd_dc_top_predictor_16x16_neon,
+ aom_highbd_dc_128_predictor_16x16_neon,
+ aom_highbd_v_predictor_16x16_neon,
+ aom_highbd_h_predictor_16x16_neon,
aom_highbd_paeth_predictor_16x16_neon,
aom_highbd_smooth_predictor_16x16_neon,
aom_highbd_smooth_v_predictor_16x16_neon,
aom_highbd_smooth_h_predictor_16x16_neon)
-HIGHBD_INTRA_PRED_TEST(NEON, TX_16X8, nullptr, nullptr, nullptr, nullptr,
- aom_highbd_v_predictor_16x8_neon, nullptr,
+HIGHBD_INTRA_PRED_TEST(NEON, TX_16X8, aom_highbd_dc_predictor_16x8_neon,
+ aom_highbd_dc_left_predictor_16x8_neon,
+ aom_highbd_dc_top_predictor_16x8_neon,
+ aom_highbd_dc_128_predictor_16x8_neon,
+ aom_highbd_v_predictor_16x8_neon,
+ aom_highbd_h_predictor_16x8_neon,
aom_highbd_paeth_predictor_16x8_neon,
aom_highbd_smooth_predictor_16x8_neon,
aom_highbd_smooth_v_predictor_16x8_neon,
aom_highbd_smooth_h_predictor_16x8_neon)
-HIGHBD_INTRA_PRED_TEST(NEON, TX_16X32, nullptr, nullptr, nullptr, nullptr,
- aom_highbd_v_predictor_16x32_neon, nullptr,
+HIGHBD_INTRA_PRED_TEST(NEON, TX_16X32, aom_highbd_dc_predictor_16x32_neon,
+ aom_highbd_dc_left_predictor_16x32_neon,
+ aom_highbd_dc_top_predictor_16x32_neon,
+ aom_highbd_dc_128_predictor_16x32_neon,
+ aom_highbd_v_predictor_16x32_neon,
+ aom_highbd_h_predictor_16x32_neon,
aom_highbd_paeth_predictor_16x32_neon,
aom_highbd_smooth_predictor_16x32_neon,
aom_highbd_smooth_v_predictor_16x32_neon,
aom_highbd_smooth_h_predictor_16x32_neon)
-HIGHBD_INTRA_PRED_TEST(NEON, TX_16X4, nullptr, nullptr, nullptr, nullptr,
- aom_highbd_v_predictor_16x4_neon, nullptr,
+HIGHBD_INTRA_PRED_TEST(NEON, TX_16X4, aom_highbd_dc_predictor_16x4_neon,
+ aom_highbd_dc_left_predictor_16x4_neon,
+ aom_highbd_dc_top_predictor_16x4_neon,
+ aom_highbd_dc_128_predictor_16x4_neon,
+ aom_highbd_v_predictor_16x4_neon,
+ aom_highbd_h_predictor_16x4_neon,
aom_highbd_paeth_predictor_16x4_neon,
aom_highbd_smooth_predictor_16x4_neon,
aom_highbd_smooth_v_predictor_16x4_neon,
aom_highbd_smooth_h_predictor_16x4_neon)
-HIGHBD_INTRA_PRED_TEST(NEON, TX_16X64, nullptr, nullptr, nullptr, nullptr,
- aom_highbd_v_predictor_16x64_neon, nullptr,
+HIGHBD_INTRA_PRED_TEST(NEON, TX_16X64, aom_highbd_dc_predictor_16x64_neon,
+ aom_highbd_dc_left_predictor_16x64_neon,
+ aom_highbd_dc_top_predictor_16x64_neon,
+ aom_highbd_dc_128_predictor_16x64_neon,
+ aom_highbd_v_predictor_16x64_neon,
+ aom_highbd_h_predictor_16x64_neon,
aom_highbd_paeth_predictor_16x64_neon,
aom_highbd_smooth_predictor_16x64_neon,
aom_highbd_smooth_v_predictor_16x64_neon,
@@ -1553,26 +1637,41 @@
#if HAVE_NEON
HIGHBD_INTRA_PRED_TEST(NEON, TX_32X32, aom_highbd_dc_predictor_32x32_neon,
- nullptr, nullptr, nullptr,
- aom_highbd_v_predictor_32x32_neon, nullptr,
+ aom_highbd_dc_left_predictor_32x32_neon,
+ aom_highbd_dc_top_predictor_32x32_neon,
+ aom_highbd_dc_128_predictor_32x32_neon,
+ aom_highbd_v_predictor_32x32_neon,
+ aom_highbd_h_predictor_32x32_neon,
aom_highbd_paeth_predictor_32x32_neon,
aom_highbd_smooth_predictor_32x32_neon,
aom_highbd_smooth_v_predictor_32x32_neon,
aom_highbd_smooth_h_predictor_32x32_neon)
-HIGHBD_INTRA_PRED_TEST(NEON, TX_32X16, nullptr, nullptr, nullptr, nullptr,
- aom_highbd_v_predictor_32x16_neon, nullptr,
+HIGHBD_INTRA_PRED_TEST(NEON, TX_32X16, aom_highbd_dc_predictor_32x16_neon,
+ aom_highbd_dc_left_predictor_32x16_neon,
+ aom_highbd_dc_top_predictor_32x16_neon,
+ aom_highbd_dc_128_predictor_32x16_neon,
+ aom_highbd_v_predictor_32x16_neon,
+ aom_highbd_h_predictor_32x16_neon,
aom_highbd_paeth_predictor_32x16_neon,
aom_highbd_smooth_predictor_32x16_neon,
aom_highbd_smooth_v_predictor_32x16_neon,
aom_highbd_smooth_h_predictor_32x16_neon)
-HIGHBD_INTRA_PRED_TEST(NEON, TX_32X64, nullptr, nullptr, nullptr, nullptr,
- aom_highbd_v_predictor_32x64_neon, nullptr,
+HIGHBD_INTRA_PRED_TEST(NEON, TX_32X64, aom_highbd_dc_predictor_32x64_neon,
+ aom_highbd_dc_left_predictor_32x64_neon,
+ aom_highbd_dc_top_predictor_32x64_neon,
+ aom_highbd_dc_128_predictor_32x64_neon,
+ aom_highbd_v_predictor_32x64_neon,
+ aom_highbd_h_predictor_32x64_neon,
aom_highbd_paeth_predictor_32x64_neon,
aom_highbd_smooth_predictor_32x64_neon,
aom_highbd_smooth_v_predictor_32x64_neon,
aom_highbd_smooth_h_predictor_32x64_neon)
-HIGHBD_INTRA_PRED_TEST(NEON, TX_32X8, nullptr, nullptr, nullptr, nullptr,
- aom_highbd_v_predictor_32x8_neon, nullptr,
+HIGHBD_INTRA_PRED_TEST(NEON, TX_32X8, aom_highbd_dc_predictor_32x8_neon,
+ aom_highbd_dc_left_predictor_32x8_neon,
+ aom_highbd_dc_top_predictor_32x8_neon,
+ aom_highbd_dc_128_predictor_32x8_neon,
+ aom_highbd_v_predictor_32x8_neon,
+ aom_highbd_h_predictor_32x8_neon,
aom_highbd_paeth_predictor_32x8_neon,
aom_highbd_smooth_predictor_32x8_neon,
aom_highbd_smooth_v_predictor_32x8_neon,
@@ -1606,20 +1705,31 @@
#if HAVE_NEON
HIGHBD_INTRA_PRED_TEST(NEON, TX_64X64, aom_highbd_dc_predictor_64x64_neon,
- nullptr, nullptr, nullptr,
- aom_highbd_v_predictor_64x64_neon, nullptr,
+ aom_highbd_dc_left_predictor_64x64_neon,
+ aom_highbd_dc_top_predictor_64x64_neon,
+ aom_highbd_dc_128_predictor_64x64_neon,
+ aom_highbd_v_predictor_64x64_neon,
+ aom_highbd_h_predictor_64x64_neon,
aom_highbd_paeth_predictor_64x64_neon,
aom_highbd_smooth_predictor_64x64_neon,
aom_highbd_smooth_v_predictor_64x64_neon,
aom_highbd_smooth_h_predictor_64x64_neon)
-HIGHBD_INTRA_PRED_TEST(NEON, TX_64X32, nullptr, nullptr, nullptr, nullptr,
- aom_highbd_v_predictor_64x32_neon, nullptr,
+HIGHBD_INTRA_PRED_TEST(NEON, TX_64X32, aom_highbd_dc_predictor_64x32_neon,
+ aom_highbd_dc_left_predictor_64x32_neon,
+ aom_highbd_dc_top_predictor_64x32_neon,
+ aom_highbd_dc_128_predictor_64x32_neon,
+ aom_highbd_v_predictor_64x32_neon,
+ aom_highbd_h_predictor_64x32_neon,
aom_highbd_paeth_predictor_64x32_neon,
aom_highbd_smooth_predictor_64x32_neon,
aom_highbd_smooth_v_predictor_64x32_neon,
aom_highbd_smooth_h_predictor_64x32_neon)
-HIGHBD_INTRA_PRED_TEST(NEON, TX_64X16, nullptr, nullptr, nullptr, nullptr,
- aom_highbd_v_predictor_64x16_neon, nullptr,
+HIGHBD_INTRA_PRED_TEST(NEON, TX_64X16, aom_highbd_dc_predictor_64x16_neon,
+ aom_highbd_dc_left_predictor_64x16_neon,
+ aom_highbd_dc_top_predictor_64x16_neon,
+ aom_highbd_dc_128_predictor_64x16_neon,
+ aom_highbd_v_predictor_64x16_neon,
+ aom_highbd_h_predictor_64x16_neon,
aom_highbd_paeth_predictor_64x16_neon,
aom_highbd_smooth_predictor_64x16_neon,
aom_highbd_smooth_v_predictor_64x16_neon,
diff --git a/test/test_libaom.cc b/test/test_libaom.cc
index b55d762..6ffbbc5 100644
--- a/test/test_libaom.cc
+++ b/test/test_libaom.cc
@@ -17,7 +17,7 @@
#include "config/aom_config.h"
-#if ARCH_X86 || ARCH_X86_64
+#if AOM_ARCH_X86 || AOM_ARCH_X86_64
#include "aom_ports/x86.h"
#endif
extern "C" {
@@ -26,30 +26,30 @@
extern void aom_scale_rtcd();
}
-#if ARCH_X86 || ARCH_X86_64
+#if AOM_ARCH_X86 || AOM_ARCH_X86_64
static void append_negative_gtest_filter(const char *str) {
- std::string filter = ::testing::FLAGS_gtest_filter;
+ std::string flag_value = GTEST_FLAG_GET(filter);
// Negative patterns begin with one '-' followed by a ':' separated list.
- if (filter.find('-') == std::string::npos) filter += '-';
+ if (flag_value.find('-') == std::string::npos) flag_value += '-';
// OPT.* matches TEST() functions
// OPT/* matches TEST_P() functions
// OPT_* matches tests which have been manually sharded.
// We do not match OPT* because of SSE/SSE2 collisions.
const char *search_terminators = "./_";
for (size_t pos = 0; pos < strlen(search_terminators); ++pos) {
- filter += ":";
- filter += str;
- filter += search_terminators[pos];
- filter += "*";
+ flag_value += ":";
+ flag_value += str;
+ flag_value += search_terminators[pos];
+ flag_value += "*";
}
- ::testing::FLAGS_gtest_filter = filter;
+ GTEST_FLAG_SET(filter, flag_value);
}
-#endif // ARCH_X86 || ARCH_X86_64
+#endif // AOM_ARCH_X86 || AOM_ARCH_X86_64
int main(int argc, char **argv) {
::testing::InitGoogleTest(&argc, argv);
-#if ARCH_X86 || ARCH_X86_64
+#if AOM_ARCH_X86 || AOM_ARCH_X86_64
const int simd_caps = x86_simd_caps();
if (!(simd_caps & HAS_MMX)) append_negative_gtest_filter("MMX");
if (!(simd_caps & HAS_SSE)) append_negative_gtest_filter("SSE");
@@ -60,7 +60,7 @@
if (!(simd_caps & HAS_SSE4_2)) append_negative_gtest_filter("SSE4_2");
if (!(simd_caps & HAS_AVX)) append_negative_gtest_filter("AVX");
if (!(simd_caps & HAS_AVX2)) append_negative_gtest_filter("AVX2");
-#endif // ARCH_X86 || ARCH_X86_64
+#endif // AOM_ARCH_X86 || AOM_ARCH_X86_64
// Shared library builds don't support whitebox tests that exercise internal
// symbols.
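The test_libaom.cc hunks replace direct access to the private ::testing::FLAGS_gtest_filter variable with the public GTEST_FLAG_GET / GTEST_FLAG_SET accessors available in current GoogleTest. A condensed sketch of the effect: starting from a filter of "*", filtering out AVX2 yields "*-:AVX2.*:AVX2/*:AVX2_*".

#include <string>
#include "gtest/gtest.h"

// Append negative patterns for one ISA suffix, creating the '-' section on
// first use (mirrors append_negative_gtest_filter above).
static void drop_isa_tests(const char *isa) {
  std::string flag_value = GTEST_FLAG_GET(filter);
  if (flag_value.find('-') == std::string::npos) flag_value += '-';
  for (const char *t = "./_"; *t != '\0'; ++t) {
    flag_value += ':';
    flag_value += isa;
    flag_value += *t;
    flag_value += '*';
  }
  GTEST_FLAG_SET(filter, flag_value);
}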
diff --git a/test/tpl_model_test.cc b/test/tpl_model_test.cc
index 674f202..91eb5e9 100644
--- a/test/tpl_model_test.cc
+++ b/test/tpl_model_test.cc
@@ -202,6 +202,7 @@
}
}
+#if CONFIG_BITRATE_ACCURACY
TEST(TplModelTest, TxfmStatsAccumulateTest) {
TplTxfmStats sub_stats;
av1_init_tpl_txfm_stats(&sub_stats);
@@ -248,6 +249,7 @@
EXPECT_DOUBLE_EQ(stats2.abs_coeff_sum[i], 2 * stats1.abs_coeff_sum[i]);
}
}
+#endif // CONFIG_BITRATE_ACCURACY
TEST(TplModelTest, ComputeMVDifferenceTest) {
TplDepFrame tpl_frame_small;
@@ -418,7 +420,7 @@
double min_bits_diff = fabs(curr_estimate - bit_budget);
// Start at q = 254 because we already have an estimate for q = 255.
for (int q = 254; q >= 0; q--) {
- double curr_estimate = av1_vbr_rc_info_estimate_gop_bitrate(
+ curr_estimate = av1_vbr_rc_info_estimate_gop_bitrate(
q, bit_depth, update_type_scale_factors, frame_count, update_type_list,
qstep_ratio_list, stats_list, q_index_list, estimated_bitrate_byframe);
double bits_diff = fabs(curr_estimate - bit_budget);
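The last tpl_model_test.cc hunk removes a redeclaration: the loop previously introduced a second double curr_estimate that shadowed the outer variable of the same name, which -Wshadow flags and which left the outer variable stuck at its q = 255 estimate. A minimal sketch of the pattern, with hypothetical values:

#include <cstdio>

int main() {
  double curr_estimate = 255.0;
  for (int q = 254; q >= 253; q--) {
    double curr_estimate = (double)q;  // Shadows the outer variable.
    (void)curr_estimate;
  }
  // Prints 255.0: the loop never updated the outer variable. Dropping the
  // inner 'double', as the patch does, turns this into a plain assignment.
  printf("%f\n", curr_estimate);
  return 0;
}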
diff --git a/test/variance_test.cc b/test/variance_test.cc
index 25b8c8d..2863aea 100644
--- a/test/variance_test.cc
+++ b/test/variance_test.cc
@@ -42,6 +42,11 @@
uint32_t *sse8x8, int *sum8x8,
unsigned int *tot_sse, int *tot_sum,
uint32_t *var8x8);
+typedef void (*GetSseSum16x16DualFunc)(const uint8_t *a, int a_stride,
+ const uint8_t *b, int b_stride,
+ uint32_t *sse16x16,
+ unsigned int *tot_sse, int *tot_sum,
+ uint32_t *var16x16);
typedef unsigned int (*SubpixVarMxNFunc)(const uint8_t *a, int a_stride,
int xoffset, int yoffset,
const uint8_t *b, int b_stride,
@@ -51,8 +56,6 @@
const uint8_t *b, int b_stride,
uint32_t *sse,
const uint8_t *second_pred);
-typedef unsigned int (*Get4x4SseFunc)(const uint8_t *a, int a_stride,
- const uint8_t *b, int b_stride);
typedef unsigned int (*SumOfSquaresFunction)(const int16_t *src);
typedef unsigned int (*DistWtdSubpixAvgVarMxNFunc)(
const uint8_t *a, int a_stride, int xoffset, int yoffset, const uint8_t *b,
@@ -707,6 +710,12 @@
void MaxTestSseSum();
void SseSum_SpeedTest();
+ // SSE&SUM dual tests
+ void RefTestSseSumDual();
+ void MinTestSseSumDual();
+ void MaxTestSseSumDual();
+ void SseSum_SpeedTestDual();
+
// MSE/SSE tests
void RefTestMse();
void RefTestSse();
@@ -833,9 +842,11 @@
if (!use_high_bit_depth()) {
src_[j] = rnd_.Rand8();
ref_[j] = rnd_.Rand8();
+#if CONFIG_AV1_HIGHBITDEPTH
} else {
CONVERT_TO_SHORTPTR(src_)[j] = rnd_.Rand16() & mask();
CONVERT_TO_SHORTPTR(ref_)[j] = rnd_.Rand16() & mask();
+#endif // CONFIG_AV1_HIGHBITDEPTH
}
}
unsigned int sse;
@@ -872,14 +883,15 @@
const int stride = width();
int k = 0;
- for (int i = 0; i < height(); i += 8) {
- for (int j = 0; j < width(); j += 32) {
- API_REGISTER_STATE_CHECK(params_.func(
- src_ + stride * i + j, stride, ref_ + stride * i + j, stride,
- &sse1[k], &sum1[k], &sse_tot_simd, &sum_tot_simd, &var1[k]));
+ for (int row = 0; row < height(); row += 8) {
+ for (int col = 0; col < width(); col += 32) {
+ API_REGISTER_STATE_CHECK(params_.func(src_ + stride * row + col, stride,
+ ref_ + stride * row + col, stride,
+ &sse1[k], &sum1[k], &sse_tot_simd,
+ &sum_tot_simd, &var1[k]));
aom_get_var_sse_sum_8x8_quad_c(
- src_ + stride * i + j, stride, ref_ + stride * i + j, stride,
- &sse2[k], &sum2[k], &sse_tot_c, &sum_tot_c, &var2[k]);
+ src_ + stride * row + col, stride, ref_ + stride * row + col,
+ stride, &sse2[k], &sum2[k], &sse_tot_c, &sum_tot_c, &var2[k]);
k += 4;
}
}
@@ -976,12 +988,12 @@
ref_[j] = rnd_.Rand8();
}
- unsigned int sse1 = 0;
- unsigned int sse2 = 0;
- unsigned int var1 = 0;
- unsigned int var2 = 0;
- int sum1 = 0;
- int sum2 = 0;
+ unsigned int sse1[4] = { 0 };
+ unsigned int sse2[4] = { 0 };
+ unsigned int var1[4] = { 0 };
+ unsigned int var2[4] = { 0 };
+ int sum1[4] = { 0 };
+ int sum2[4] = { 0 };
unsigned int sse_tot_c = 0;
unsigned int sse_tot_simd = 0;
int sum_tot_c = 0;
@@ -994,8 +1006,8 @@
for (int i = 0; i < height(); i += 8) {
for (int j = 0; j < width(); j += 32) {
aom_get_var_sse_sum_8x8_quad_c(src_ + stride * i + j, stride,
- ref_ + stride * i + j, stride, &sse2,
- &sum2, &sse_tot_c, &sum_tot_c, &var2);
+ ref_ + stride * i + j, stride, sse2,
+ sum2, &sse_tot_c, &sum_tot_c, var2);
}
}
}
@@ -1008,7 +1020,7 @@
for (int i = 0; i < height(); i += 8) {
for (int j = 0; j < width(); j += 32) {
params_.func(src_ + stride * i + j, stride, ref_ + stride * i + j,
- stride, &sse1, &sum1, &sse_tot_simd, &sum_tot_simd, &var1);
+ stride, sse1, sum1, &sse_tot_simd, &sum_tot_simd, var1);
}
}
}
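The hunk below adds reference, extreme-value, and speed tests for the new aom_get_var_sse_sum_16x16_dual functions, which report per-16x16-block SSE and variance while accumulating running totals over two adjacent blocks. For a single block the expected relationship, sketched here with reference semantics only, is var = sse - sum^2 / 256:

#include <cstdint>

static uint32_t variance_16x16_sketch(const uint8_t *src, int src_stride,
                                      const uint8_t *ref, int ref_stride,
                                      uint32_t *sse_out, int *sum_out) {
  uint32_t sse = 0;
  int sum = 0;
  for (int r = 0; r < 16; ++r) {
    for (int c = 0; c < 16; ++c) {
      const int d = src[r * src_stride + c] - ref[r * ref_stride + c];
      sum += d;
      sse += (uint32_t)(d * d);
    }
  }
  *sse_out = sse;
  *sum_out = sum;
  // var = sse - sum^2 / N, with N = 256 pixels in a 16x16 block.
  return sse - (uint32_t)(((int64_t)sum * sum) / 256);
}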
@@ -1022,6 +1034,171 @@
width(), height(), elapsed_time_ref, elapsed_time_simd,
elapsed_time_ref / elapsed_time_simd);
}
+
+template <typename GetSseSum16x16DualFuncType>
+void MainTestClass<GetSseSum16x16DualFuncType>::RefTestSseSumDual() {
+ for (int iter = 0; iter < 10; ++iter) {
+ for (int idx = 0; idx < block_size(); ++idx) {
+ src_[idx] = rnd_.Rand8();
+ ref_[idx] = rnd_.Rand8();
+ }
+ unsigned int sse1[64] = { 0 };
+ unsigned int sse2[64] = { 0 };
+ unsigned int var1[64] = { 0 };
+ unsigned int var2[64] = { 0 };
+ unsigned int sse_tot_c = 0;
+ unsigned int sse_tot_simd = 0;
+ int sum_tot_c = 0;
+ int sum_tot_simd = 0;
+ const int stride = width();
+ int k = 0;
+
+ for (int row = 0; row < height(); row += 16) {
+ for (int col = 0; col < width(); col += 32) {
+ API_REGISTER_STATE_CHECK(params_.func(
+ src_ + stride * row + col, stride, ref_ + stride * row + col,
+ stride, &sse1[k], &sse_tot_simd, &sum_tot_simd, &var1[k]));
+ aom_get_var_sse_sum_16x16_dual_c(
+ src_ + stride * row + col, stride, ref_ + stride * row + col,
+ stride, &sse2[k], &sse_tot_c, &sum_tot_c, &var2[k]);
+ k += 2;
+ }
+ }
+ EXPECT_EQ(sse_tot_c, sse_tot_simd);
+ EXPECT_EQ(sum_tot_c, sum_tot_simd);
+    for (int p = 0; p < 64; p++) {
+      EXPECT_EQ(sse1[p], sse2[p]);
+      EXPECT_EQ(var1[p], var2[p]);
+    }
+ }
+}
+
+template <typename GetSseSum16x16DualFuncType>
+void MainTestClass<GetSseSum16x16DualFuncType>::MinTestSseSumDual() {
+ memset(src_, 0, block_size());
+ memset(ref_, 255, block_size());
+ unsigned int sse1[64] = { 0 };
+ unsigned int sse2[64] = { 0 };
+ unsigned int var1[64] = { 0 };
+ unsigned int var2[64] = { 0 };
+ unsigned int sse_tot_c = 0;
+ unsigned int sse_tot_simd = 0;
+ int sum_tot_c = 0;
+ int sum_tot_simd = 0;
+ const int stride = width();
+ int k = 0;
+
+ for (int row = 0; row < height(); row += 16) {
+ for (int col = 0; col < width(); col += 32) {
+ API_REGISTER_STATE_CHECK(params_.func(
+ src_ + stride * row + col, stride, ref_ + stride * row + col, stride,
+ &sse1[k], &sse_tot_simd, &sum_tot_simd, &var1[k]));
+ aom_get_var_sse_sum_16x16_dual_c(
+ src_ + stride * row + col, stride, ref_ + stride * row + col, stride,
+ &sse2[k], &sse_tot_c, &sum_tot_c, &var2[k]);
+ k += 2;
+ }
+ }
+ EXPECT_EQ(sse_tot_simd, sse_tot_c);
+ EXPECT_EQ(sum_tot_simd, sum_tot_c);
+ for (int p = 0; p < 64; p++) {
+ EXPECT_EQ(sse1[p], sse2[p]);
+ EXPECT_EQ(var1[p], var2[p]);
+ }
+}
+
+template <typename GetSseSum16x16DualFuncType>
+void MainTestClass<GetSseSum16x16DualFuncType>::MaxTestSseSumDual() {
+ memset(src_, 255, block_size());
+ memset(ref_, 0, block_size());
+ unsigned int sse1[64] = { 0 };
+ unsigned int sse2[64] = { 0 };
+ unsigned int var1[64] = { 0 };
+ unsigned int var2[64] = { 0 };
+ unsigned int sse_tot_c = 0;
+ unsigned int sse_tot_simd = 0;
+ int sum_tot_c = 0;
+ int sum_tot_simd = 0;
+ const int stride = width();
+ int k = 0;
+
+ for (int row = 0; row < height(); row += 16) {
+ for (int col = 0; col < width(); col += 32) {
+ API_REGISTER_STATE_CHECK(params_.func(
+ src_ + stride * row + col, stride, ref_ + stride * row + col, stride,
+ &sse1[k], &sse_tot_simd, &sum_tot_simd, &var1[k]));
+ aom_get_var_sse_sum_16x16_dual_c(
+ src_ + stride * row + col, stride, ref_ + stride * row + col, stride,
+ &sse2[k], &sse_tot_c, &sum_tot_c, &var2[k]);
+ k += 2;
+ }
+ }
+ EXPECT_EQ(sse_tot_c, sse_tot_simd);
+ EXPECT_EQ(sum_tot_c, sum_tot_simd);
+
+ for (int p = 0; p < 64; p++) {
+ EXPECT_EQ(sse1[p], sse2[p]);
+ EXPECT_EQ(var1[p], var2[p]);
+ }
+}
+
+template <typename GetSseSum16x16DualFuncType>
+void MainTestClass<GetSseSum16x16DualFuncType>::SseSum_SpeedTestDual() {
+ const int loop_count = 1000000000 / block_size();
+ for (int idx = 0; idx < block_size(); ++idx) {
+ src_[idx] = rnd_.Rand8();
+ ref_[idx] = rnd_.Rand8();
+ }
+
+ unsigned int sse1[2] = { 0 };
+ unsigned int sse2[2] = { 0 };
+ unsigned int var1[2] = { 0 };
+ unsigned int var2[2] = { 0 };
+ unsigned int sse_tot_c = 0;
+ unsigned int sse_tot_simd = 0;
+ int sum_tot_c = 0;
+ int sum_tot_simd = 0;
+ const int stride = width();
+
+ aom_usec_timer timer;
+ aom_usec_timer_start(&timer);
+ for (int r = 0; r < loop_count; ++r) {
+ for (int row = 0; row < height(); row += 16) {
+ for (int col = 0; col < width(); col += 32) {
+ aom_get_var_sse_sum_16x16_dual_c(src_ + stride * row + col, stride,
+ ref_ + stride * row + col, stride,
+ sse2, &sse_tot_c, &sum_tot_c, var2);
+ }
+ }
+ }
+ aom_usec_timer_mark(&timer);
+ const double elapsed_time_ref =
+ static_cast<double>(aom_usec_timer_elapsed(&timer));
+
+ aom_usec_timer_start(&timer);
+ for (int r = 0; r < loop_count; ++r) {
+ for (int row = 0; row < height(); row += 16) {
+ for (int col = 0; col < width(); col += 32) {
+ params_.func(src_ + stride * row + col, stride,
+ ref_ + stride * row + col, stride, sse1, &sse_tot_simd,
+ &sum_tot_simd, var1);
+ }
+ }
+ }
+ aom_usec_timer_mark(&timer);
+ const double elapsed_time_simd =
+ static_cast<double>(aom_usec_timer_elapsed(&timer));
+
+ printf(
+ "aom_getvar_16x16_dual for block=%dx%d : ref_time=%lf \t simd_time=%lf "
+ "\t "
+ "gain=%lf \n",
+ width(), height(), elapsed_time_ref, elapsed_time_simd,
+ elapsed_time_ref / elapsed_time_simd);
+}
+
////////////////////////////////////////////////////////////////////////////////
// Tests related to MSE / SSE.
@@ -1029,14 +1206,21 @@
void MainTestClass<FunctionType>::RefTestMse() {
for (int i = 0; i < 10; ++i) {
for (int j = 0; j < block_size(); ++j) {
- src_[j] = rnd_.Rand8();
- ref_[j] = rnd_.Rand8();
+ if (!use_high_bit_depth()) {
+ src_[j] = rnd_.Rand8();
+ ref_[j] = rnd_.Rand8();
+#if CONFIG_AV1_HIGHBITDEPTH
+ } else {
+ CONVERT_TO_SHORTPTR(src_)[j] = rnd_.Rand16() & mask();
+ CONVERT_TO_SHORTPTR(ref_)[j] = rnd_.Rand16() & mask();
+#endif // CONFIG_AV1_HIGHBITDEPTH
+ }
}
unsigned int sse1, sse2;
const int stride = width();
API_REGISTER_STATE_CHECK(params_.func(src_, stride, ref_, stride, &sse1));
variance_ref(src_, ref_, params_.log2width, params_.log2height, stride,
- stride, &sse2, false, AOM_BITS_8);
+ stride, &sse2, use_high_bit_depth(), params_.bit_depth);
EXPECT_EQ(sse1, sse2);
}
}
@@ -1060,11 +1244,25 @@
template <typename FunctionType>
void MainTestClass<FunctionType>::MaxTestMse() {
- memset(src_, 255, block_size());
- memset(ref_, 0, block_size());
+ int max_value = (1 << params_.bit_depth) - 1;
+ if (!use_high_bit_depth()) {
+ memset(src_, max_value, block_size());
+ memset(ref_, 0, block_size());
+#if CONFIG_AV1_HIGHBITDEPTH
+ } else {
+ aom_memset16(CONVERT_TO_SHORTPTR(src_), max_value, block_size());
+ aom_memset16(CONVERT_TO_SHORTPTR(ref_), 0, block_size());
+#endif // CONFIG_AV1_HIGHBITDEPTH
+ }
unsigned int sse;
API_REGISTER_STATE_CHECK(params_.func(src_, width(), ref_, width(), &sse));
- const unsigned int expected = block_size() * 255 * 255;
+ unsigned int expected = (unsigned int)block_size() * max_value * max_value;
+ switch (params_.bit_depth) {
+ case AOM_BITS_12: expected = ROUND_POWER_OF_TWO(expected, 8); break;
+ case AOM_BITS_10: expected = ROUND_POWER_OF_TWO(expected, 4); break;
+ case AOM_BITS_8:
+ default: break;
+ }
EXPECT_EQ(expected, sse);
}
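MaxTestMse now covers high bit-depth inputs: the raw expectation block_size * max_value^2 is rounded down by the same shift the highbd MSE kernels apply (8 bits for 12-bit, 4 bits for 10-bit, per the switch above). A worked check for the 12-bit 16x16 case, with an illustrative helper:

#include <cstdint>

// expected_hbd_max_sse(256, 4095, 8) == 16769025, since
// 256 * 4095 * 4095 = 4292870400 and (4292870400 + 128) >> 8 = 16769025.
static unsigned int expected_hbd_max_sse(unsigned int block_size,
                                         unsigned int max_value, int shift) {
  const uint64_t raw = (uint64_t)block_size * max_value * max_value;
  if (shift == 0) return (unsigned int)raw;  // 8-bit case: no rounding shift.
  return (unsigned int)((raw + ((uint64_t)1 << (shift - 1))) >> shift);
}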
@@ -1496,10 +1694,10 @@
typedef MseWxHTestClass<MseWxH16bitFunc> MseWxHTest;
typedef Mse16xHTestClass<Mse16xH16bitFunc> Mse16xHTest;
-typedef MainTestClass<Get4x4SseFunc> AvxSseTest;
typedef MainTestClass<VarianceMxNFunc> AvxMseTest;
typedef MainTestClass<VarianceMxNFunc> AvxVarianceTest;
typedef MainTestClass<GetSseSum8x8QuadFunc> GetSseSum8x8QuadTest;
+typedef MainTestClass<GetSseSum16x16DualFunc> GetSseSum16x16DualTest;
typedef SubpelVarianceTest<SubpixVarMxNFunc> AvxSubpelVarianceTest;
typedef SubpelVarianceTest<SubpixAvgVarMxNFunc> AvxSubpelAvgVarianceTest;
typedef SubpelVarianceTest<DistWtdSubpixAvgVarMxNFunc>
@@ -1510,8 +1708,6 @@
typedef TestParams<MseWxH16bitFunc> MseWxHParams;
typedef TestParams<Mse16xH16bitFunc> Mse16xHParams;
-TEST_P(AvxSseTest, RefSse) { RefTestSse(); }
-TEST_P(AvxSseTest, MaxSse) { MaxTestSse(); }
TEST_P(MseWxHTest, RefMse) { RefMatchTestMse(); }
TEST_P(MseWxHTest, DISABLED_SpeedMse) { SpeedTest(); }
TEST_P(Mse16xHTest, RefMse) { RefMatchTestMse(); }
@@ -1528,6 +1724,10 @@
TEST_P(GetSseSum8x8QuadTest, MinSseSum) { MinTestSseSum(); }
TEST_P(GetSseSum8x8QuadTest, MaxMseSum) { MaxTestSseSum(); }
TEST_P(GetSseSum8x8QuadTest, DISABLED_Speed) { SseSum_SpeedTest(); }
+TEST_P(GetSseSum16x16DualTest, RefMseSum) { RefTestSseSumDual(); }
+TEST_P(GetSseSum16x16DualTest, MinSseSum) { MinTestSseSumDual(); }
+TEST_P(GetSseSum16x16DualTest, MaxMseSum) { MaxTestSseSumDual(); }
+TEST_P(GetSseSum16x16DualTest, DISABLED_Speed) { SseSum_SpeedTestDual(); }
TEST_P(SumOfSquaresTest, Const) { ConstTest(); }
TEST_P(SumOfSquaresTest, Ref) { RefTest(); }
TEST_P(AvxSubpelVarianceTest, Ref) { RefTest(); }
@@ -1558,11 +1758,6 @@
INSTANTIATE_TEST_SUITE_P(C, SumOfSquaresTest,
::testing::Values(aom_get_mb_ss_c));
-typedef TestParams<Get4x4SseFunc> SseParams;
-INSTANTIATE_TEST_SUITE_P(C, AvxSseTest,
- ::testing::Values(SseParams(2, 2,
- &aom_get4x4sse_cs_c)));
-
typedef TestParams<VarianceMxNFunc> MseParams;
INSTANTIATE_TEST_SUITE_P(C, AvxMseTest,
::testing::Values(MseParams(4, 4, &aom_mse16x16_c),
@@ -1610,6 +1805,17 @@
INSTANTIATE_TEST_SUITE_P(C, GetSseSum8x8QuadTest,
::testing::ValuesIn(kArrayGetSseSum8x8Quad_c));
+typedef TestParams<GetSseSum16x16DualFunc> GetSseSumParamsDual;
+const GetSseSumParamsDual kArrayGetSseSum16x16Dual_c[] = {
+ GetSseSumParamsDual(7, 7, &aom_get_var_sse_sum_16x16_dual_c, 0),
+ GetSseSumParamsDual(6, 6, &aom_get_var_sse_sum_16x16_dual_c, 0),
+ GetSseSumParamsDual(5, 5, &aom_get_var_sse_sum_16x16_dual_c, 0),
+ GetSseSumParamsDual(5, 4, &aom_get_var_sse_sum_16x16_dual_c, 0)
+};
+
+INSTANTIATE_TEST_SUITE_P(C, GetSseSum16x16DualTest,
+ ::testing::ValuesIn(kArrayGetSseSum16x16Dual_c));
+
typedef TestParams<SubpixVarMxNFunc> SubpelVarianceParams;
const SubpelVarianceParams kArraySubpelVariance_c[] = {
SubpelVarianceParams(7, 7, &aom_sub_pixel_variance128x128_c, 0),
@@ -1865,6 +2071,7 @@
TEST_P(MseHBDWxHTest, DISABLED_SpeedMse) { SpeedTest(); }
TEST_P(AvxHBDMseTest, RefMse) { RefTestMse(); }
TEST_P(AvxHBDMseTest, MaxMse) { MaxTestMse(); }
+TEST_P(AvxHBDMseTest, DISABLED_SpeedMse) { SpeedTest(); }
TEST_P(AvxHBDVarianceTest, Zero) { ZeroTest(); }
TEST_P(AvxHBDVarianceTest, Ref) { RefTest(); }
TEST_P(AvxHBDVarianceTest, RefStride) { RefStrideTest(); }
@@ -1882,22 +2089,37 @@
MseHBDWxHParams(2, 3, &aom_mse_wxh_16bit_highbd_c, 10),
MseHBDWxHParams(2, 2, &aom_mse_wxh_16bit_highbd_c, 10)));
-/* TODO(debargha): This test does not support the highbd version
INSTANTIATE_TEST_SUITE_P(
C, AvxHBDMseTest,
- ::testing::Values(make_tuple(4, 4, &aom_highbd_12_mse16x16_c),
- make_tuple(4, 4, &aom_highbd_12_mse16x8_c),
- make_tuple(4, 4, &aom_highbd_12_mse8x16_c),
- make_tuple(4, 4, &aom_highbd_12_mse8x8_c),
- make_tuple(4, 4, &aom_highbd_10_mse16x16_c),
- make_tuple(4, 4, &aom_highbd_10_mse16x8_c),
- make_tuple(4, 4, &aom_highbd_10_mse8x16_c),
- make_tuple(4, 4, &aom_highbd_10_mse8x8_c),
- make_tuple(4, 4, &aom_highbd_8_mse16x16_c),
- make_tuple(4, 4, &aom_highbd_8_mse16x8_c),
- make_tuple(4, 4, &aom_highbd_8_mse8x16_c),
- make_tuple(4, 4, &aom_highbd_8_mse8x8_c)));
-*/
+ ::testing::Values(MseParams(4, 4, &aom_highbd_12_mse16x16_c, 12),
+ MseParams(4, 3, &aom_highbd_12_mse16x8_c, 12),
+ MseParams(3, 4, &aom_highbd_12_mse8x16_c, 12),
+ MseParams(3, 3, &aom_highbd_12_mse8x8_c, 12),
+ MseParams(4, 4, &aom_highbd_10_mse16x16_c, 10),
+ MseParams(4, 3, &aom_highbd_10_mse16x8_c, 10),
+ MseParams(3, 4, &aom_highbd_10_mse8x16_c, 10),
+ MseParams(3, 3, &aom_highbd_10_mse8x8_c, 10),
+ MseParams(4, 4, &aom_highbd_8_mse16x16_c, 8),
+ MseParams(4, 3, &aom_highbd_8_mse16x8_c, 8),
+ MseParams(3, 4, &aom_highbd_8_mse8x16_c, 8),
+ MseParams(3, 3, &aom_highbd_8_mse8x8_c, 8)));
+
+#if HAVE_NEON
+INSTANTIATE_TEST_SUITE_P(
+ NEON, AvxHBDMseTest,
+ ::testing::Values(MseParams(4, 4, &aom_highbd_12_mse16x16_neon, 12),
+ MseParams(4, 3, &aom_highbd_12_mse16x8_neon, 12),
+ MseParams(3, 4, &aom_highbd_12_mse8x16_neon, 12),
+ MseParams(3, 3, &aom_highbd_12_mse8x8_neon, 12),
+ MseParams(4, 4, &aom_highbd_10_mse16x16_neon, 10),
+ MseParams(4, 3, &aom_highbd_10_mse16x8_neon, 10),
+ MseParams(3, 4, &aom_highbd_10_mse8x16_neon, 10),
+ MseParams(3, 3, &aom_highbd_10_mse8x8_neon, 10),
+ MseParams(4, 4, &aom_highbd_8_mse16x16_neon, 8),
+ MseParams(4, 3, &aom_highbd_8_mse16x8_neon, 8),
+ MseParams(3, 4, &aom_highbd_8_mse8x16_neon, 8),
+ MseParams(3, 3, &aom_highbd_8_mse8x8_neon, 8)));
+#endif // HAVE_NEON
const VarianceParams kArrayHBDVariance_c[] = {
VarianceParams(7, 7, &aom_highbd_12_variance128x128_c, 12),
@@ -2351,6 +2573,15 @@
INSTANTIATE_TEST_SUITE_P(SSE2, GetSseSum8x8QuadTest,
::testing::ValuesIn(kArrayGetSseSum8x8Quad_sse2));
+const GetSseSumParamsDual kArrayGetSseSum16x16Dual_sse2[] = {
+ GetSseSumParamsDual(7, 7, &aom_get_var_sse_sum_16x16_dual_sse2, 0),
+ GetSseSumParamsDual(6, 6, &aom_get_var_sse_sum_16x16_dual_sse2, 0),
+ GetSseSumParamsDual(5, 5, &aom_get_var_sse_sum_16x16_dual_sse2, 0),
+ GetSseSumParamsDual(5, 4, &aom_get_var_sse_sum_16x16_dual_sse2, 0)
+};
+INSTANTIATE_TEST_SUITE_P(SSE2, GetSseSum16x16DualTest,
+ ::testing::ValuesIn(kArrayGetSseSum16x16Dual_sse2));
+
const SubpelVarianceParams kArraySubpelVariance_sse2[] = {
SubpelVarianceParams(7, 7, &aom_sub_pixel_variance128x128_sse2, 0),
SubpelVarianceParams(7, 6, &aom_sub_pixel_variance128x64_sse2, 0),
@@ -2444,22 +2675,14 @@
12)));
#endif // HAVE_SSE4_1
-/* TODO(debargha): This test does not support the highbd version
INSTANTIATE_TEST_SUITE_P(
SSE2, AvxHBDMseTest,
- ::testing::Values(MseParams(4, 4, &aom_highbd_12_mse16x16_sse2),
- MseParams(4, 3, &aom_highbd_12_mse16x8_sse2),
- MseParams(3, 4, &aom_highbd_12_mse8x16_sse2),
- MseParams(3, 3, &aom_highbd_12_mse8x8_sse2),
- MseParams(4, 4, &aom_highbd_10_mse16x16_sse2),
- MseParams(4, 3, &aom_highbd_10_mse16x8_sse2),
- MseParams(3, 4, &aom_highbd_10_mse8x16_sse2),
- MseParams(3, 3, &aom_highbd_10_mse8x8_sse2),
- MseParams(4, 4, &aom_highbd_8_mse16x16_sse2),
- MseParams(4, 3, &aom_highbd_8_mse16x8_sse2),
- MseParams(3, 4, &aom_highbd_8_mse8x16_sse2),
- MseParams(3, 3, &aom_highbd_8_mse8x8_sse2)));
-*/
+ ::testing::Values(MseParams(4, 4, &aom_highbd_12_mse16x16_sse2, 12),
+ MseParams(3, 3, &aom_highbd_12_mse8x8_sse2, 12),
+ MseParams(4, 4, &aom_highbd_10_mse16x16_sse2, 10),
+ MseParams(3, 3, &aom_highbd_10_mse8x8_sse2, 10),
+ MseParams(4, 4, &aom_highbd_8_mse16x16_sse2, 8),
+ MseParams(3, 3, &aom_highbd_8_mse8x8_sse2, 8)));
const VarianceParams kArrayHBDVariance_sse2[] = {
VarianceParams(7, 7, &aom_highbd_12_variance128x128_sse2, 12),
@@ -2969,6 +3192,15 @@
INSTANTIATE_TEST_SUITE_P(AVX2, GetSseSum8x8QuadTest,
::testing::ValuesIn(kArrayGetSseSum8x8Quad_avx2));
+const GetSseSumParamsDual kArrayGetSseSum16x16Dual_avx2[] = {
+ GetSseSumParamsDual(7, 7, &aom_get_var_sse_sum_16x16_dual_avx2, 0),
+ GetSseSumParamsDual(6, 6, &aom_get_var_sse_sum_16x16_dual_avx2, 0),
+ GetSseSumParamsDual(5, 5, &aom_get_var_sse_sum_16x16_dual_avx2, 0),
+ GetSseSumParamsDual(5, 4, &aom_get_var_sse_sum_16x16_dual_avx2, 0)
+};
+INSTANTIATE_TEST_SUITE_P(AVX2, GetSseSum16x16DualTest,
+ ::testing::ValuesIn(kArrayGetSseSum16x16Dual_avx2));
+
const SubpelVarianceParams kArraySubpelVariance_avx2[] = {
SubpelVarianceParams(7, 7, &aom_sub_pixel_variance128x128_avx2, 0),
SubpelVarianceParams(7, 6, &aom_sub_pixel_variance128x64_avx2, 0),
@@ -3015,10 +3247,6 @@
MseWxHParams(2, 3, &aom_mse_wxh_16bit_neon, 8),
MseWxHParams(2, 2, &aom_mse_wxh_16bit_neon, 8)));
-INSTANTIATE_TEST_SUITE_P(NEON, AvxSseTest,
- ::testing::Values(SseParams(2, 2,
- &aom_get4x4sse_cs_neon)));
-
INSTANTIATE_TEST_SUITE_P(NEON, AvxMseTest,
::testing::Values(MseParams(3, 3, &aom_mse8x8_neon),
MseParams(3, 4, &aom_mse8x16_neon),
@@ -3114,6 +3342,35 @@
INSTANTIATE_TEST_SUITE_P(NEON, AvxSubpelAvgVarianceTest,
::testing::ValuesIn(kArraySubpelAvgVariance_neon));
+#if !CONFIG_REALTIME_ONLY
+const ObmcSubpelVarianceParams kArrayObmcSubpelVariance_neon[] = {
+ ObmcSubpelVarianceParams(7, 7, &aom_obmc_sub_pixel_variance128x128_neon, 0),
+ ObmcSubpelVarianceParams(7, 6, &aom_obmc_sub_pixel_variance128x64_neon, 0),
+ ObmcSubpelVarianceParams(6, 7, &aom_obmc_sub_pixel_variance64x128_neon, 0),
+ ObmcSubpelVarianceParams(6, 6, &aom_obmc_sub_pixel_variance64x64_neon, 0),
+ ObmcSubpelVarianceParams(6, 5, &aom_obmc_sub_pixel_variance64x32_neon, 0),
+ ObmcSubpelVarianceParams(5, 6, &aom_obmc_sub_pixel_variance32x64_neon, 0),
+ ObmcSubpelVarianceParams(5, 5, &aom_obmc_sub_pixel_variance32x32_neon, 0),
+ ObmcSubpelVarianceParams(5, 4, &aom_obmc_sub_pixel_variance32x16_neon, 0),
+ ObmcSubpelVarianceParams(4, 5, &aom_obmc_sub_pixel_variance16x32_neon, 0),
+ ObmcSubpelVarianceParams(4, 4, &aom_obmc_sub_pixel_variance16x16_neon, 0),
+ ObmcSubpelVarianceParams(4, 3, &aom_obmc_sub_pixel_variance16x8_neon, 0),
+ ObmcSubpelVarianceParams(3, 4, &aom_obmc_sub_pixel_variance8x16_neon, 0),
+ ObmcSubpelVarianceParams(3, 3, &aom_obmc_sub_pixel_variance8x8_neon, 0),
+ ObmcSubpelVarianceParams(3, 2, &aom_obmc_sub_pixel_variance8x4_neon, 0),
+ ObmcSubpelVarianceParams(2, 3, &aom_obmc_sub_pixel_variance4x8_neon, 0),
+ ObmcSubpelVarianceParams(2, 2, &aom_obmc_sub_pixel_variance4x4_neon, 0),
+ ObmcSubpelVarianceParams(6, 4, &aom_obmc_sub_pixel_variance64x16_neon, 0),
+ ObmcSubpelVarianceParams(4, 6, &aom_obmc_sub_pixel_variance16x64_neon, 0),
+ ObmcSubpelVarianceParams(5, 3, &aom_obmc_sub_pixel_variance32x8_neon, 0),
+ ObmcSubpelVarianceParams(3, 5, &aom_obmc_sub_pixel_variance8x32_neon, 0),
+ ObmcSubpelVarianceParams(4, 2, &aom_obmc_sub_pixel_variance16x4_neon, 0),
+ ObmcSubpelVarianceParams(2, 4, &aom_obmc_sub_pixel_variance4x16_neon, 0),
+};
+INSTANTIATE_TEST_SUITE_P(NEON, AvxObmcSubpelVarianceTest,
+ ::testing::ValuesIn(kArrayObmcSubpelVariance_neon));
+#endif  // !CONFIG_REALTIME_ONLY
+
const GetSseSumParams kArrayGetSseSum8x8Quad_neon[] = {
GetSseSumParams(7, 7, &aom_get_var_sse_sum_8x8_quad_neon, 0),
GetSseSumParams(6, 6, &aom_get_var_sse_sum_8x8_quad_neon, 0),
@@ -3123,8 +3380,33 @@
INSTANTIATE_TEST_SUITE_P(NEON, GetSseSum8x8QuadTest,
::testing::ValuesIn(kArrayGetSseSum8x8Quad_neon));
+const GetSseSumParamsDual kArrayGetSseSum16x16Dual_neon[] = {
+ GetSseSumParamsDual(7, 7, &aom_get_var_sse_sum_16x16_dual_neon, 0),
+ GetSseSumParamsDual(6, 6, &aom_get_var_sse_sum_16x16_dual_neon, 0),
+ GetSseSumParamsDual(5, 5, &aom_get_var_sse_sum_16x16_dual_neon, 0),
+ GetSseSumParamsDual(5, 4, &aom_get_var_sse_sum_16x16_dual_neon, 0)
+};
+INSTANTIATE_TEST_SUITE_P(NEON, GetSseSum16x16DualTest,
+ ::testing::ValuesIn(kArrayGetSseSum16x16Dual_neon));
+
#if CONFIG_AV1_HIGHBITDEPTH
const VarianceParams kArrayHBDVariance_neon[] = {
+ VarianceParams(7, 7, &aom_highbd_12_variance128x128_neon, 12),
+ VarianceParams(7, 6, &aom_highbd_12_variance128x64_neon, 12),
+ VarianceParams(6, 7, &aom_highbd_12_variance64x128_neon, 12),
+ VarianceParams(6, 6, &aom_highbd_12_variance64x64_neon, 12),
+ VarianceParams(6, 5, &aom_highbd_12_variance64x32_neon, 12),
+ VarianceParams(5, 6, &aom_highbd_12_variance32x64_neon, 12),
+ VarianceParams(5, 5, &aom_highbd_12_variance32x32_neon, 12),
+ VarianceParams(5, 4, &aom_highbd_12_variance32x16_neon, 12),
+ VarianceParams(4, 5, &aom_highbd_12_variance16x32_neon, 12),
+ VarianceParams(4, 4, &aom_highbd_12_variance16x16_neon, 12),
+ VarianceParams(4, 3, &aom_highbd_12_variance16x8_neon, 12),
+ VarianceParams(3, 4, &aom_highbd_12_variance8x16_neon, 12),
+ VarianceParams(3, 3, &aom_highbd_12_variance8x8_neon, 12),
+ VarianceParams(3, 2, &aom_highbd_12_variance8x4_neon, 12),
+ VarianceParams(2, 3, &aom_highbd_12_variance4x8_neon, 12),
+ VarianceParams(2, 2, &aom_highbd_12_variance4x4_neon, 12),
VarianceParams(7, 7, &aom_highbd_10_variance128x128_neon, 10),
VarianceParams(7, 6, &aom_highbd_10_variance128x64_neon, 10),
VarianceParams(6, 7, &aom_highbd_10_variance64x128_neon, 10),
@@ -3141,13 +3423,41 @@
VarianceParams(3, 2, &aom_highbd_10_variance8x4_neon, 10),
VarianceParams(2, 3, &aom_highbd_10_variance4x8_neon, 10),
VarianceParams(2, 2, &aom_highbd_10_variance4x4_neon, 10),
+ VarianceParams(7, 7, &aom_highbd_8_variance128x128_neon, 8),
+ VarianceParams(7, 6, &aom_highbd_8_variance128x64_neon, 8),
+ VarianceParams(6, 7, &aom_highbd_8_variance64x128_neon, 8),
+ VarianceParams(6, 6, &aom_highbd_8_variance64x64_neon, 8),
+ VarianceParams(6, 5, &aom_highbd_8_variance64x32_neon, 8),
+ VarianceParams(5, 6, &aom_highbd_8_variance32x64_neon, 8),
+ VarianceParams(5, 5, &aom_highbd_8_variance32x32_neon, 8),
+ VarianceParams(5, 4, &aom_highbd_8_variance32x16_neon, 8),
+ VarianceParams(4, 5, &aom_highbd_8_variance16x32_neon, 8),
+ VarianceParams(4, 4, &aom_highbd_8_variance16x16_neon, 8),
+ VarianceParams(4, 3, &aom_highbd_8_variance16x8_neon, 8),
+ VarianceParams(3, 4, &aom_highbd_8_variance8x16_neon, 8),
+ VarianceParams(3, 3, &aom_highbd_8_variance8x8_neon, 8),
+ VarianceParams(3, 2, &aom_highbd_8_variance8x4_neon, 8),
+ VarianceParams(2, 3, &aom_highbd_8_variance4x8_neon, 8),
+ VarianceParams(2, 2, &aom_highbd_8_variance4x4_neon, 8),
#if !CONFIG_REALTIME_ONLY
+ VarianceParams(6, 4, &aom_highbd_12_variance64x16_neon, 12),
+ VarianceParams(4, 6, &aom_highbd_12_variance16x64_neon, 12),
+ VarianceParams(5, 3, &aom_highbd_12_variance32x8_neon, 12),
+ VarianceParams(3, 5, &aom_highbd_12_variance8x32_neon, 12),
+ VarianceParams(4, 2, &aom_highbd_12_variance16x4_neon, 12),
+ VarianceParams(2, 4, &aom_highbd_12_variance4x16_neon, 12),
VarianceParams(6, 4, &aom_highbd_10_variance64x16_neon, 10),
VarianceParams(4, 6, &aom_highbd_10_variance16x64_neon, 10),
VarianceParams(5, 3, &aom_highbd_10_variance32x8_neon, 10),
VarianceParams(3, 5, &aom_highbd_10_variance8x32_neon, 10),
VarianceParams(4, 2, &aom_highbd_10_variance16x4_neon, 10),
VarianceParams(2, 4, &aom_highbd_10_variance4x16_neon, 10),
+ VarianceParams(6, 4, &aom_highbd_8_variance64x16_neon, 8),
+ VarianceParams(4, 6, &aom_highbd_8_variance16x64_neon, 8),
+ VarianceParams(5, 3, &aom_highbd_8_variance32x8_neon, 8),
+ VarianceParams(3, 5, &aom_highbd_8_variance8x32_neon, 8),
+ VarianceParams(4, 2, &aom_highbd_8_variance16x4_neon, 8),
+ VarianceParams(2, 4, &aom_highbd_8_variance4x16_neon, 8),
#endif
};
diff --git a/test/warp_filter_test_util.cc b/test/warp_filter_test_util.cc
index b4376d8..e42671e 100644
--- a/test/warp_filter_test_util.cc
+++ b/test/warp_filter_test_util.cc
@@ -185,7 +185,6 @@
const int is_delta_zero = GET_PARAM(4);
const int out_w = std::get<0>(params), out_h = std::get<1>(params);
const int num_iters = std::get<2>(params);
- int i, j, sub_x, sub_y;
const int bd = 8;
// The warp functions always write rows with widths that are multiples of 8.
@@ -209,7 +208,7 @@
ASSERT_NE(dstb, nullptr);
for (int i = 0; i < output_n; ++i) output[i] = output2[i] = rnd_.Rand8();
- for (i = 0; i < num_iters; ++i) {
+ for (int i = 0; i < num_iters; ++i) {
// Generate an input block and extend its borders horizontally
for (int r = 0; r < h; ++r)
for (int c = 0; c < w; ++c) input[r * stride + c] = rnd_.Rand8();
@@ -218,8 +217,8 @@
memset(input + r * stride + w, input[r * stride + (w - 1)], border);
}
const int use_no_round = rnd_.Rand8() & 1;
- for (sub_x = 0; sub_x < 2; ++sub_x)
- for (sub_y = 0; sub_y < 2; ++sub_y) {
+ for (int sub_x = 0; sub_x < 2; ++sub_x)
+ for (int sub_y = 0; sub_y < 2; ++sub_y) {
generate_warped_model(&rnd_, mat, &alpha, &beta, &gamma, &delta,
is_alpha_zero, is_beta_zero, is_gamma_zero,
is_delta_zero);
@@ -258,18 +257,18 @@
out_h, out_w, sub_x, sub_y, &conv_params, alpha, beta,
gamma, delta);
if (use_no_round) {
- for (j = 0; j < out_w * out_h; ++j)
+ for (int j = 0; j < out_w * out_h; ++j)
ASSERT_EQ(dsta[j], dstb[j])
<< "Pixel mismatch at index " << j << " = ("
<< (j % out_w) << ", " << (j / out_w) << ") on iteration "
<< i;
- for (j = 0; j < out_w * out_h; ++j)
+ for (int j = 0; j < out_w * out_h; ++j)
ASSERT_EQ(output[j], output2[j])
<< "Pixel mismatch at index " << j << " = ("
<< (j % out_w) << ", " << (j / out_w) << ") on iteration "
<< i;
} else {
- for (j = 0; j < out_w * out_h; ++j)
+ for (int j = 0; j < out_w * out_h; ++j)
ASSERT_EQ(output[j], output2[j])
<< "Pixel mismatch at index " << j << " = ("
<< (j % out_w) << ", " << (j / out_w) << ") on iteration "
@@ -386,7 +385,6 @@
const int bd = std::get<3>(param);
const int num_iters = std::get<2>(param);
const int mask = (1 << bd) - 1;
- int i, j, sub_x, sub_y;
// The warp functions always write rows with widths that are multiples of 8.
// So to avoid a buffer overflow, we may need to pad rows to a multiple of 8.
@@ -409,7 +407,7 @@
ASSERT_NE(dstb, nullptr);
for (int i = 0; i < output_n; ++i) output[i] = output2[i] = rnd_.Rand16();
- for (i = 0; i < num_iters; ++i) {
+ for (int i = 0; i < num_iters; ++i) {
// Generate an input block and extend its borders horizontally
for (int r = 0; r < h; ++r)
for (int c = 0; c < w; ++c) input[r * stride + c] = rnd_.Rand16() & mask;
@@ -420,8 +418,8 @@
}
}
const int use_no_round = rnd_.Rand8() & 1;
- for (sub_x = 0; sub_x < 2; ++sub_x)
- for (sub_y = 0; sub_y < 2; ++sub_y) {
+ for (int sub_x = 0; sub_x < 2; ++sub_x)
+ for (int sub_y = 0; sub_y < 2; ++sub_y) {
generate_warped_model(&rnd_, mat, &alpha, &beta, &gamma, &delta,
is_alpha_zero, is_beta_zero, is_gamma_zero,
is_delta_zero);
@@ -464,18 +462,18 @@
beta, gamma, delta);
if (use_no_round) {
- for (j = 0; j < out_w * out_h; ++j)
+ for (int j = 0; j < out_w * out_h; ++j)
ASSERT_EQ(dsta[j], dstb[j])
<< "Pixel mismatch at index " << j << " = ("
<< (j % out_w) << ", " << (j / out_w) << ") on iteration "
<< i;
- for (j = 0; j < out_w * out_h; ++j)
+ for (int j = 0; j < out_w * out_h; ++j)
ASSERT_EQ(output[j], output2[j])
<< "Pixel mismatch at index " << j << " = ("
<< (j % out_w) << ", " << (j / out_w) << ") on iteration "
<< i;
} else {
- for (j = 0; j < out_w * out_h; ++j)
+ for (int j = 0; j < out_w * out_h; ++j)
ASSERT_EQ(output[j], output2[j])
<< "Pixel mismatch at index " << j << " = ("
<< (j % out_w) << ", " << (j / out_w) << ") on iteration "
diff --git a/test/wiener_test.cc b/test/wiener_test.cc
index d44dd92..8be6a64 100644
--- a/test/wiener_test.cc
+++ b/test/wiener_test.cc
@@ -35,11 +35,14 @@
// C implementation of the algorithm implemented by the SIMD code.
// This is a little more efficient than the version in av1_compute_stats_c().
static void compute_stats_win_opt_c(int wiener_win, const uint8_t *dgd,
- const uint8_t *src, int h_start, int h_end,
- int v_start, int v_end, int dgd_stride,
- int src_stride, int64_t *M, int64_t *H,
+ const uint8_t *src, int16_t *d, int16_t *s,
+ int h_start, int h_end, int v_start,
+ int v_end, int dgd_stride, int src_stride,
+ int64_t *M, int64_t *H,
int use_downsampled_wiener_stats) {
ASSERT_TRUE(wiener_win == WIENER_WIN || wiener_win == WIENER_WIN_CHROMA);
+ (void)d;
+ (void)s;
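+ // The d and s scratch buffers are there for the SIMD implementations of
+ // compute_stats; this C reference path does not need them, hence the
+ // (void) casts above.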
int i, j, k, l, m, n;
const int pixel_count = (h_end - h_start) * (v_end - v_start);
const int wiener_win2 = wiener_win * wiener_win;
@@ -156,23 +159,25 @@
}
void compute_stats_opt_c(int wiener_win, const uint8_t *dgd, const uint8_t *src,
- int h_start, int h_end, int v_start, int v_end,
- int dgd_stride, int src_stride, int64_t *M, int64_t *H,
+ int16_t *d, int16_t *s, int h_start, int h_end,
+ int v_start, int v_end, int dgd_stride, int src_stride,
+ int64_t *M, int64_t *H,
int use_downsampled_wiener_stats) {
if (wiener_win == WIENER_WIN || wiener_win == WIENER_WIN_CHROMA) {
- compute_stats_win_opt_c(wiener_win, dgd, src, h_start, h_end, v_start,
+ compute_stats_win_opt_c(wiener_win, dgd, src, d, s, h_start, h_end, v_start,
v_end, dgd_stride, src_stride, M, H,
use_downsampled_wiener_stats);
} else {
- av1_compute_stats_c(wiener_win, dgd, src, h_start, h_end, v_start, v_end,
- dgd_stride, src_stride, M, H,
+ av1_compute_stats_c(wiener_win, dgd, src, d, s, h_start, h_end, v_start,
+ v_end, dgd_stride, src_stride, M, H,
use_downsampled_wiener_stats);
}
}
static const int kIterations = 100;
typedef void (*compute_stats_Func)(int wiener_win, const uint8_t *dgd,
- const uint8_t *src, int h_start, int h_end,
+ const uint8_t *src, int16_t *dgd_avg,
+ int16_t *src_avg, int h_start, int h_end,
int v_start, int v_end, int dgd_stride,
int src_stride, int64_t *M, int64_t *H,
int use_downsampled_wiener_stats);
@@ -192,11 +197,17 @@
dgd_buf = (uint8_t *)aom_memalign(
32, MAX_DATA_BLOCK * MAX_DATA_BLOCK * sizeof(*dgd_buf));
ASSERT_NE(dgd_buf, nullptr);
+ const int buf_size =
+ sizeof(*buf) * 6 * RESTORATION_UNITSIZE_MAX * RESTORATION_UNITSIZE_MAX;
+ buf = (int16_t *)aom_memalign(32, buf_size);
+ ASSERT_NE(buf, nullptr);
+ memset(buf, 0, buf_size);
target_func_ = GET_PARAM(0);
}
virtual void TearDown() {
aom_free(src_buf);
aom_free(dgd_buf);
+ aom_free(buf);
}
void RunWienerTest(const int32_t wiener_win, int32_t run_times);
void RunWienerTest_ExtremeValues(const int32_t wiener_win);
@@ -206,6 +217,7 @@
libaom_test::ACMRandom rng_;
uint8_t *src_buf;
uint8_t *dgd_buf;
+ int16_t *buf;
};
void WienerTest::RunWienerTest(const int32_t wiener_win, int32_t run_times) {
@@ -232,6 +244,9 @@
const int src_stride = MAX_DATA_BLOCK;
const int iters = run_times == 1 ? kIterations : 2;
const int max_value_downsample_stats = 1;
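+ // buf was allocated in SetUp() as 6 * RESTORATION_UNITSIZE_MAX *
+ // RESTORATION_UNITSIZE_MAX int16_t values; split it evenly between the dgd
+ // and src scratch averages passed to the compute_stats functions.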
+ int16_t *dgd_avg = buf;
+ int16_t *src_avg =
+ buf + (3 * RESTORATION_UNITSIZE_MAX * RESTORATION_UNITSIZE_MAX);
for (int iter = 0; iter < iters && !HasFatalFailure(); ++iter) {
for (int i = 0; i < MAX_DATA_BLOCK * MAX_DATA_BLOCK; ++i) {
@@ -246,16 +261,16 @@
aom_usec_timer timer;
aom_usec_timer_start(&timer);
for (int i = 0; i < run_times; ++i) {
- av1_compute_stats_c(wiener_win, dgd, src, h_start, h_end, v_start,
- v_end, dgd_stride, src_stride, M_ref, H_ref,
- use_downsampled_stats);
+ av1_compute_stats_c(wiener_win, dgd, src, dgd_avg, src_avg, h_start,
+ h_end, v_start, v_end, dgd_stride, src_stride,
+ M_ref, H_ref, use_downsampled_stats);
}
aom_usec_timer_mark(&timer);
const double time1 = static_cast<double>(aom_usec_timer_elapsed(&timer));
aom_usec_timer_start(&timer);
for (int i = 0; i < run_times; ++i) {
- target_func_(wiener_win, dgd, src, h_start, h_end, v_start, v_end,
- dgd_stride, src_stride, M_test, H_test,
+ target_func_(wiener_win, dgd, src, dgd_avg, src_avg, h_start, h_end,
+ v_start, v_end, dgd_stride, src_stride, M_test, H_test,
use_downsampled_stats);
}
aom_usec_timer_mark(&timer);
@@ -302,6 +317,9 @@
const int src_stride = MAX_DATA_BLOCK;
const int iters = 1;
const int max_value_downsample_stats = 1;
+ int16_t *dgd_avg = buf;
+ int16_t *src_avg =
+ buf + (3 * RESTORATION_UNITSIZE_MAX * RESTORATION_UNITSIZE_MAX);
for (int iter = 0; iter < iters && !HasFatalFailure(); ++iter) {
for (int i = 0; i < MAX_DATA_BLOCK * MAX_DATA_BLOCK; ++i) {
@@ -313,12 +331,12 @@
for (int use_downsampled_stats = 0;
use_downsampled_stats <= max_value_downsample_stats;
use_downsampled_stats++) {
- av1_compute_stats_c(wiener_win, dgd, src, h_start, h_end, v_start, v_end,
- dgd_stride, src_stride, M_ref, H_ref,
- use_downsampled_stats);
+ av1_compute_stats_c(wiener_win, dgd, src, dgd_avg, src_avg, h_start,
+ h_end, v_start, v_end, dgd_stride, src_stride, M_ref,
+ H_ref, use_downsampled_stats);
- target_func_(wiener_win, dgd, src, h_start, h_end, v_start, v_end,
- dgd_stride, src_stride, M_test, H_test,
+ target_func_(wiener_win, dgd, src, dgd_avg, src_avg, h_start, h_end,
+ v_start, v_end, dgd_stride, src_stride, M_test, H_test,
use_downsampled_stats);
int failed = 0;
@@ -710,5 +728,648 @@
::testing::Values(av1_compute_stats_highbd_avx2));
#endif // HAVE_AVX2
+// A test that reproduces b/274668506: signed integer overflow in
+// update_a_sep_sym().
+TEST(SearchWienerTest, 10bitSignedIntegerOverflowInUpdateASepSym) {
+ constexpr int kWidth = 427;
+ constexpr int kHeight = 1;
+ std::vector<uint16_t> buffer(3 * kWidth * kHeight);
+ // The values in the buffer alternate between 0 and 1023.
+ uint16_t value = 0;
+ for (size_t i = 0; i < buffer.size(); ++i) {
+ buffer[i] = value;
+ value = 1023 - value;
+ }
+ unsigned char *img_data = reinterpret_cast<unsigned char *>(buffer.data());
+
+ aom_image_t img;
+ EXPECT_EQ(
+ aom_img_wrap(&img, AOM_IMG_FMT_I44416, kWidth, kHeight, 1, img_data),
+ &img);
+ img.cp = AOM_CICP_CP_UNSPECIFIED;
+ img.tc = AOM_CICP_TC_UNSPECIFIED;
+ img.mc = AOM_CICP_MC_UNSPECIFIED;
+ img.range = AOM_CR_FULL_RANGE;
+
+ aom_codec_iface_t *iface = aom_codec_av1_cx();
+ aom_codec_enc_cfg_t cfg;
+ EXPECT_EQ(aom_codec_enc_config_default(iface, &cfg, AOM_USAGE_ALL_INTRA),
+ AOM_CODEC_OK);
+ cfg.rc_end_usage = AOM_Q;
+ cfg.g_profile = 1;
+ cfg.g_bit_depth = AOM_BITS_10;
+ cfg.g_input_bit_depth = 10;
+ cfg.g_w = kWidth;
+ cfg.g_h = kHeight;
+ cfg.g_limit = 1;
+ cfg.g_lag_in_frames = 0;
+ cfg.kf_mode = AOM_KF_DISABLED;
+ cfg.kf_max_dist = 0;
+ cfg.g_threads = 61;
+ cfg.rc_min_quantizer = 2;
+ cfg.rc_max_quantizer = 20;
+ aom_codec_ctx_t enc;
+ EXPECT_EQ(aom_codec_enc_init(&enc, iface, &cfg, AOM_CODEC_USE_HIGHBITDEPTH),
+ AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AOME_SET_CQ_LEVEL, 11), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_ROW_MT, 1), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_TILE_ROWS, 4), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AOME_SET_CPUUSED, 3), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_COLOR_RANGE, AOM_CR_FULL_RANGE),
+ AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_SKIP_POSTPROC_FILTERING, 1),
+ AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AOME_SET_TUNING, AOM_TUNE_SSIM),
+ AOM_CODEC_OK);
+
+ // Encode frame
+ EXPECT_EQ(aom_codec_encode(&enc, &img, 0, 1, 0), AOM_CODEC_OK);
+ aom_codec_iter_t iter = nullptr;
+ const aom_codec_cx_pkt_t *pkt = aom_codec_get_cx_data(&enc, &iter);
+ ASSERT_NE(pkt, nullptr);
+ EXPECT_EQ(pkt->kind, AOM_CODEC_CX_FRAME_PKT);
+ // pkt->data.frame.flags is 0x1f0011.
+ EXPECT_EQ(pkt->data.frame.flags & AOM_FRAME_IS_KEY, AOM_FRAME_IS_KEY);
+ pkt = aom_codec_get_cx_data(&enc, &iter);
+ EXPECT_EQ(pkt, nullptr);
+
+ // Flush encoder
+ EXPECT_EQ(aom_codec_encode(&enc, nullptr, 0, 1, 0), AOM_CODEC_OK);
+ iter = nullptr;
+ pkt = aom_codec_get_cx_data(&enc, &iter);
+ EXPECT_EQ(pkt, nullptr);
+
+ EXPECT_EQ(aom_codec_destroy(&enc), AOM_CODEC_OK);
+}
+
+// A test that reproduces b/281219978: signed integer overflow in
+// update_b_sep_sym().
+TEST(SearchWienerTest, 12bitSignedIntegerOverflowInUpdateBSepSym) {
+ constexpr int kWidth = 311;
+ constexpr int kHeight = 3;
+ static const uint16_t buffer[3 * kWidth * kHeight] = {
+ // Y plane:
+ 0, 0, 0, 2156, 2513, 2211, 4095, 4095, 0, 2538, 0, 0, 0, 0, 4095, 0, 258,
+ 941, 4095, 907, 0, 0, 2325, 2485, 2408, 4095, 1513, 0, 3644, 2080, 4095,
+ 4095, 0, 2135, 0, 2461, 4095, 0, 4095, 4095, 0, 1987, 0, 3629, 0, 4095,
+ 3918, 4095, 0, 4095, 4095, 4095, 0, 1065, 0, 2072, 3597, 102, 0, 534, 0, 0,
+ 0, 4095, 0, 0, 4095, 0, 4095, 0, 4095, 0, 3611, 0, 1139, 4095, 0, 0, 0, 0,
+ 0, 4095, 0, 0, 0, 0, 4095, 4095, 4095, 0, 0, 0, 3070, 3224, 0, 0, 4095,
+ 4051, 4095, 0, 4095, 3712, 0, 1465, 4095, 1699, 4095, 4095, 0, 0, 0, 3885,
+ 0, 4095, 0, 0, 4095, 1686, 4095, 4095, 4095, 4095, 1330, 0, 0, 0, 4095, 0,
+ 4095, 4095, 3919, 4095, 781, 2371, 2055, 4095, 912, 3710, 0, 2045, 0, 4095,
+ 4095, 4095, 1811, 0, 1298, 1115, 0, 3327, 0, 0, 4095, 0, 253, 2386, 4095,
+ 1791, 3657, 1444, 0, 4095, 1918, 4095, 4095, 0, 4095, 305, 1587, 0, 4095, 0,
+ 3759, 0, 0, 4095, 2387, 4095, 4095, 0, 0, 4095, 4095, 0, 1015, 4095, 0, 768,
+ 2598, 1667, 130, 4095, 0, 0, 435, 4095, 3683, 4095, 0, 4095, 4095, 1888,
+ 2828, 4095, 3349, 0, 4095, 4095, 4095, 4095, 0, 4095, 0, 0, 4095, 0, 2491,
+ 1598, 0, 0, 383, 3712, 4095, 0, 0, 4095, 760, 4095, 4095, 4095, 2030, 4095,
+ 0, 0, 3236, 0, 1040, 0, 0, 4095, 0, 0, 4095, 4095, 4095, 0, 0, 1043, 3897,
+ 2446, 233, 1589, 427, 4095, 4095, 4095, 4095, 0, 1656, 3786, 4095, 0, 840,
+ 4095, 4095, 1429, 4095, 0, 4095, 2734, 4095, 0, 2431, 1801, 278, 0, 4095, 0,
+ 4095, 0, 0, 420, 0, 0, 746, 0, 0, 3281, 3006, 4095, 4095, 0, 0, 0, 3605,
+ 4095, 4095, 0, 4095, 4095, 4095, 4095, 2660, 496, 4095, 0, 0, 0, 0, 4095, 0,
+ 1317, 4095, 4095, 510, 1919, 0, 3893, 0, 4095, 4095, 4095, 4095, 4095, 2071,
+ 2006, 0, 3316, 4095, 0, 0, 4095, 852, 2982, 0, 2073, 0, 2728, 1499, 4095,
+ 852, 361, 3137, 4095, 4095, 1502, 1575, 0, 4095, 0, 0, 0, 0, 1585, 4095, 0,
+ 4095, 0, 3188, 3244, 4095, 2958, 4095, 4095, 0, 4095, 4095, 4095, 1706,
+ 2896, 4095, 1788, 730, 1146, 4095, 0, 0, 4095, 0, 0, 0, 2791, 3613, 2175,
+ 2925, 0, 0, 0, 0, 0, 1279, 4095, 4095, 0, 4095, 0, 0, 2336, 0, 3462, 4095,
+ 0, 4095, 1997, 2328, 2860, 0, 4095, 4095, 3241, 4095, 4095, 4095, 4095,
+ 4095, 4095, 118, 0, 4095, 4095, 4095, 0, 3734, 0, 0, 0, 4095, 1952, 4095,
+ 413, 4095, 1183, 4095, 0, 4095, 0, 0, 4095, 4095, 4095, 3805, 0, 1398, 0,
+ 4095, 0, 0, 0, 4095, 4095, 4095, 2802, 3658, 4095, 4095, 0, 0, 0, 4095, 0,
+ 897, 0, 4095, 2163, 0, 0, 0, 4095, 1440, 2487, 4095, 4095, 0, 4095, 4095,
+ 4095, 2808, 0, 1999, 0, 0, 4095, 4095, 4095, 1563, 124, 2179, 754, 0, 0,
+ 2407, 2798, 0, 4095, 4095, 0, 0, 1929, 0, 0, 0, 1387, 4095, 4095, 0, 0,
+ 3911, 562, 4095, 0, 4095, 2639, 2673, 4095, 4095, 0, 0, 4095, 4095, 0, 4095,
+ 4095, 901, 0, 321, 3961, 4095, 0, 4095, 4095, 4095, 0, 0, 0, 0, 3035, 3713,
+ 3441, 0, 4095, 0, 0, 854, 1544, 3963, 1968, 4095, 0, 0, 0, 0, 2897, 4095, 0,
+ 4095, 4095, 0, 235, 1011, 4095, 0, 3452, 4095, 4095, 0, 0, 4095, 4095, 4095,
+ 4095, 4095, 3312, 0, 3064, 4095, 3981, 4095, 4095, 4095, 4095, 4095, 0, 791,
+ 3243, 4095, 799, 0, 0, 0, 523, 2117, 3776, 0, 4095, 3311, 0, 543, 4095,
+ 4095, 4095, 0, 0, 4095, 4095, 4095, 4095, 0, 0, 4095, 4095, 225, 0, 1195,
+ 3070, 1210, 4095, 0, 4095, 498, 782, 0, 0, 4095, 4095, 4095, 4095, 4095,
+ 1456, 4095, 3898, 1472, 4095, 4095, 0, 4095, 4026, 0, 0, 2354, 1554, 0,
+ 4095, 0, 2986, 0, 1053, 1228, 0, 0, 4095, 4095, 0, 0, 4095, 0, 0, 4095, 0,
+ 0, 0, 606, 0, 4095, 3563, 4095, 2016, 4095, 0, 0, 4095, 0, 4095, 4095, 4095,
+ 0, 0, 0, 929, 0, 0, 4095, 0, 3069, 4095, 0, 2687, 4095, 4095, 4095, 2015,
+ 4095, 4095, 4095, 0, 4095, 0, 0, 2860, 3668, 0, 0, 4095, 2523, 2104, 0, 0,
+ 3063, 4095, 3674, 4095, 0, 2762, 0, 4095, 2582, 3473, 930, 0, 1012, 108, 38,
+ 4095, 1148, 3568, 4036, 4095, 4095, 0, 1120, 1873, 3028, 4095, 515, 1902,
+ 4095, 0, 815, 4095, 1548, 0, 1073, 3919, 4095, 2374, 0, 3126, 4095, 2268, 0,
+ 0, 0, 4095, 425, 4095, 0, 0, 4095, 4095, 2710, 4095, 2067, 4095, 4095, 2201,
+ 4095, 4095, 0, 4095, 4095, 2933, 0, 417, 2801, 4095, 4095, 3274, 0, 2870,
+ 4095, 4095, 0, 0, 973, 0, 0, 3129, 4095, 0, 0, 0, 4095, 4095, 4095, 0, 242,
+ 4095, 0, 4095, 0, 0, 0, 0, 987, 0, 2426, 4045, 2780, 0, 4095, 3762, 3361,
+ 3095, 4095, 596, 1072, 4071, 4095, 4095, 0, 0, 81, 0, 1001, 1683, 4095,
+ 4095, 3105, 2673, 0, 3300, 104, 4030, 0, 2615, 4095, 4095, 0, 4095, 1830,
+ 3917, 4095, 4095, 4095, 0, 4095, 3637, 0, 4095, 4095, 3677, 4095, 4095, 0,
+ 880, 4095, 4095, 0, 2797, 0, 0, 0, 0, 3225, 4095, 4095, 1925, 2885, 1879, 0,
+ 0, 4095, 0, 0, 0, 2974, 559, 0, 0, 0, 699, 997, 1491, 423, 4012, 0, 2315,
+ 4095, 0, 0, 4095, 0, 836, 4095, 0, 4095, 0, 1752, 0, 0, 0, 4095, 4095, 0, 0,
+ 51, 4095, 350, 0, 2143, 2588, 0, 4095, 0, 4095, 0, 2757, 2370, 4095, 668,
+ 4095, 0, 4095, 0, 3652, 3890, 0, 4095, 0, 4095, 4095, 4095, 4095, 4095,
+ // U plane:
+ 4095, 4095, 1465, 0, 588, 4095, 0, 4095, 4095, 4095, 0, 2167, 4095, 4095,
+ 918, 3223, 4095, 4095, 0, 696, 4095, 4095, 0, 0, 594, 4095, 2935, 0, 0, 0,
+ 2036, 4095, 0, 2492, 4095, 4095, 0, 0, 0, 3883, 0, 4095, 483, 4095, 4095,
+ 324, 923, 0, 3079, 0, 4095, 4095, 810, 0, 3371, 4095, 4095, 0, 4095, 2756,
+ 0, 723, 0, 3338, 1084, 0, 4095, 4095, 3764, 0, 4095, 4095, 4095, 2323, 0,
+ 3693, 682, 0, 0, 909, 4095, 2348, 4095, 4095, 4095, 1509, 4095, 0, 4095,
+ 4095, 4095, 4095, 3977, 3652, 1580, 637, 4095, 0, 593, 4095, 1199, 1773,
+ 4095, 4095, 4095, 0, 3447, 0, 0, 4095, 3873, 0, 0, 2094, 0, 1195, 0, 3892,
+ 4095, 4095, 729, 4095, 0, 0, 4095, 449, 4095, 4095, 2900, 0, 4095, 0, 2114,
+ 4095, 4095, 4095, 1174, 995, 2933, 360, 0, 1970, 0, 4095, 1208, 0, 4095, 0,
+ 4095, 0, 4095, 4095, 0, 4095, 0, 0, 0, 1976, 0, 0, 921, 4095, 4095, 192,
+ 1006, 0, 0, 2725, 4095, 0, 2813, 0, 0, 2375, 4095, 1982, 0, 2725, 4095,
+ 1225, 3566, 4095, 0, 344, 863, 2747, 0, 4095, 4095, 1928, 4095, 4095, 0,
+ 3640, 0, 1744, 3191, 4095, 4095, 0, 4095, 4095, 4095, 0, 0, 748, 4095, 0,
+ 2609, 0, 0, 0, 0, 0, 3508, 4095, 4095, 2463, 0, 4095, 0, 4095, 4095, 4095,
+ 3175, 419, 2193, 0, 0, 4095, 0, 0, 4095, 4051, 2159, 4095, 4095, 2262, 379,
+ 4095, 0, 0, 3399, 4095, 4095, 4095, 3769, 2510, 4054, 3336, 730, 3968, 0, 0,
+ 3354, 0, 1822, 0, 4095, 0, 3847, 3823, 3262, 0, 0, 2936, 0, 4095, 4095,
+ 2120, 0, 3147, 0, 2838, 3480, 474, 1194, 4095, 4095, 2820, 4095, 0, 4095,
+ 1882, 4095, 1085, 0, 4095, 2234, 3371, 4095, 0, 4095, 0, 0, 0, 2586, 4095,
+ 4095, 4095, 4095, 0, 3818, 1401, 2273, 4095, 0, 4095, 0, 3907, 4095, 4095,
+ 694, 0, 4066, 4095, 0, 0, 4095, 2116, 4095, 4095, 4095, 4095, 4095, 0, 2821,
+ 29, 0, 0, 663, 1711, 652, 1271, 4095, 4095, 2401, 3726, 4095, 3453, 1803,
+ 3614, 0, 4095, 3439, 4095, 0, 4095, 0, 816, 0, 0, 4095, 4095, 2635, 0, 1918,
+ 0, 2663, 381, 0, 0, 3670, 0, 4095, 3065, 965, 4095, 4095, 4095, 2993, 4095,
+ 4095, 0, 4095, 973, 4095, 0, 4095, 4095, 0, 3071, 0, 2777, 4095, 4095, 0,
+ 3996, 4095, 1637, 0, 4095, 67, 3784, 0, 0, 4095, 2603, 579, 4095, 4095,
+ 2854, 4095, 3016, 0, 4095, 0, 0, 4095, 4095, 4095, 4095, 3998, 3023, 4095,
+ 4095, 0, 0, 0, 4095, 4095, 4095, 4095, 0, 0, 2623, 1308, 55, 4095, 0, 0,
+ 2554, 2311, 0, 4095, 4095, 4095, 1134, 2112, 0, 4095, 4095, 0, 4095, 0, 645,
+ 0, 0, 4095, 0, 909, 0, 0, 1719, 4095, 0, 3542, 0, 575, 0, 4095, 4095, 4095,
+ 3428, 1172, 481, 1521, 4095, 3199, 1265, 4095, 3518, 4017, 4095, 760, 2042,
+ 3986, 0, 4095, 42, 4095, 0, 4095, 4095, 4095, 4095, 2235, 346, 3865, 0,
+ 4095, 4095, 4095, 4095, 4095, 4095, 845, 4095, 0, 2826, 4095, 4095, 0, 0,
+ 335, 1614, 1465, 0, 4095, 4095, 0, 2771, 4095, 0, 2810, 4095, 4095, 0, 1254,
+ 4095, 2589, 4095, 4095, 2252, 0, 0, 0, 4095, 0, 73, 4095, 4095, 0, 1341, 0,
+ 0, 0, 0, 4095, 0, 0, 2645, 1985, 492, 914, 3996, 4095, 4095, 4095, 0, 2383,
+ 2556, 433, 0, 4095, 1094, 4095, 4095, 642, 4095, 1722, 0, 3460, 4095, 4095,
+ 4095, 4095, 4095, 0, 154, 4095, 92, 4095, 0, 0, 0, 4095, 0, 4095, 4095, 444,
+ 0, 2925, 0, 0, 0, 0, 1628, 0, 4095, 1731, 2418, 697, 4095, 0, 2513, 4095, 0,
+ 4095, 4095, 4095, 4095, 4095, 0, 2510, 4095, 3850, 0, 0, 4095, 2480, 4095,
+ 4095, 2661, 4095, 0, 4095, 0, 0, 4095, 4095, 847, 4095, 4095, 3257, 443, 0,
+ 67, 0, 0, 0, 4095, 0, 0, 3073, 4095, 0, 4095, 0, 4095, 0, 4095, 1224, 4095,
+ 4095, 4095, 0, 4095, 958, 0, 4095, 0, 2327, 684, 0, 0, 0, 0, 4095, 4095, 0,
+ 3693, 795, 4095, 0, 621, 1592, 2314, 4095, 0, 928, 1897, 4095, 4095, 0,
+ 4095, 0, 0, 4095, 2619, 4095, 0, 4095, 0, 0, 4095, 2485, 4095, 4095, 0, 435,
+ 4095, 1818, 4095, 4095, 0, 0, 0, 4095, 4095, 4095, 4095, 0, 1671, 4095,
+ 4095, 0, 2617, 0, 2572, 0, 0, 4095, 3471, 0, 0, 4095, 2719, 3979, 1307, 0,
+ 0, 0, 0, 1794, 642, 447, 913, 4095, 3927, 0, 2686, 0, 0, 4095, 0, 857, 0,
+ 4095, 4095, 567, 2385, 0, 0, 4095, 893, 0, 289, 0, 0, 0, 4095, 4095, 2566,
+ 0, 1913, 0, 2350, 1033, 2764, 0, 4095, 0, 4095, 0, 0, 0, 0, 4095, 3952,
+ 3969, 0, 3476, 0, 4095, 4095, 393, 0, 2613, 0, 0, 1422, 0, 3359, 491, 3263,
+ 4095, 4095, 0, 0, 4095, 697, 3601, 4095, 0, 4095, 4095, 0, 4095, 0, 0, 4095,
+ 0, 4095, 4095, 4095, 2506, 0, 0, 1403, 0, 3836, 3976, 0, 4095, 4095, 4095,
+ 2497, 4095, 4095, 4095, 4095, 0, 4095, 3317, 4095, 4095, 4095, 0, 0, 1131,
+ 0, 0, 0, 4095, 0, 0, 4095, 0, 0, 2988, 4095, 4095, 2711, 2487, 1335, 0, 0,
+ 0, 4095, 261, 4095, 86, 0, 0, 1138, 4095, 0, 0, 4095, 4095, 0, 0, 0, 334, 0,
+ 2395, 3297, 4095, 1698, 4095, 1791, 1341, 0, 3559, 0, 4095, 0, 2056, 3238,
+ 3310, 4095, 4095, 779, 2129, 2849, 4095, 2622, 1051, 0, 0, 1282, 4095, 1246,
+ 0, 0, 3696, 4095, 556, 0, 0, 3463, 2658, 3572, 4095, 3982, 4095, 4095, 0, 0,
+ 4053, 4095, 4095, 4095, 2162, 2567, 1621, 4095, 4095, 1522, 293, 4095, 0, 0,
+ 1976, 4095, 3089, 4095, 0, 0, 0, 0, 3650,
+ // V plane:
+ 0, 1892, 4095, 1995, 0, 0, 0, 2208, 1152, 1794, 4095, 4095, 89, 3333, 4095,
+ 2478, 4095, 2505, 4095, 0, 2664, 4095, 1984, 0, 1144, 4095, 0, 4095, 0,
+ 4095, 0, 0, 0, 2404, 1727, 4095, 4095, 0, 1326, 2033, 0, 4095, 0, 4095,
+ 3022, 0, 4095, 0, 1980, 4095, 0, 2284, 4095, 0, 3422, 0, 4095, 2171, 3155,
+ 4095, 0, 4095, 0, 636, 0, 0, 4095, 3264, 3862, 0, 2164, 0, 0, 3879, 3886, 0,
+ 225, 0, 0, 4095, 0, 1956, 523, 464, 738, 0, 1545, 0, 2829, 4095, 4095, 4095,
+ 799, 4095, 358, 4095, 0, 0, 953, 0, 0, 2081, 4095, 1604, 4095, 2086, 0, 954,
+ 0, 0, 2393, 2413, 4095, 4095, 0, 3583, 4095, 4095, 2995, 4095, 0, 4095,
+ 4095, 3501, 4095, 247, 4095, 0, 0, 0, 4095, 1303, 3382, 1059, 4095, 0, 543,
+ 1276, 1801, 0, 0, 0, 2928, 0, 4095, 3931, 70, 0, 0, 3992, 4095, 1278, 1930,
+ 4095, 0, 4095, 4095, 3894, 0, 0, 0, 0, 4095, 0, 0, 0, 0, 0, 0, 4095, 4095,
+ 4095, 1098, 4095, 2059, 0, 380, 3166, 0, 4095, 2215, 0, 0, 2846, 0, 0, 2614,
+ 528, 4095, 0, 4095, 2371, 0, 4095, 0, 0, 0, 0, 4095, 3133, 4095, 4095, 0,
+ 4095, 1283, 3821, 1772, 0, 0, 4095, 4095, 4095, 890, 3475, 4095, 4095, 133,
+ 3292, 1819, 4095, 4095, 4095, 0, 0, 4095, 702, 4095, 0, 0, 0, 4095, 0, 2137,
+ 4095, 4095, 4095, 0, 0, 0, 4095, 4095, 1555, 2435, 2778, 4095, 0, 4095,
+ 3825, 0, 3736, 3054, 0, 0, 4095, 4095, 4095, 0, 0, 0, 0, 371, 4095, 4095, 0,
+ 0, 1565, 4095, 2731, 4095, 0, 756, 925, 0, 0, 0, 4095, 775, 1379, 4095,
+ 1439, 0, 0, 0, 2680, 0, 0, 4095, 1280, 4095, 0, 0, 4095, 4095, 0, 3088, 0,
+ 4095, 4095, 4095, 0, 0, 1526, 4095, 2314, 4095, 4095, 0, 4095, 288, 0, 205,
+ 4095, 4095, 4095, 0, 1247, 2014, 0, 1530, 1985, 0, 0, 4095, 3195, 0, 4095,
+ 4, 2397, 4095, 4095, 4095, 0, 4095, 4095, 4095, 0, 0, 0, 0, 0, 4031, 928,
+ 4095, 0, 0, 4095, 4095, 4095, 1966, 4095, 2299, 1215, 4095, 0, 4095, 1335,
+ 0, 4095, 1991, 4095, 0, 4095, 114, 0, 0, 0, 2123, 2639, 4095, 3323, 4095,
+ 4095, 418, 209, 0, 0, 4095, 4095, 4095, 4095, 963, 0, 0, 0, 4095, 2505, 0,
+ 3627, 0, 311, 3748, 2047, 4095, 2791, 0, 3643, 1852, 0, 0, 4095, 0, 2179, 0,
+ 4095, 2678, 0, 0, 0, 2342, 4095, 4095, 0, 0, 4095, 0, 0, 0, 0, 1076, 0, 0,
+ 4095, 0, 2370, 0, 3530, 0, 0, 0, 0, 0, 4095, 0, 0, 0, 3474, 1201, 0, 379,
+ 699, 4095, 777, 4095, 0, 4095, 4095, 0, 1213, 1762, 4095, 4095, 4095, 0,
+ 4095, 1090, 1233, 0, 4095, 0, 4095, 0, 0, 0, 2845, 3385, 2718, 0, 0, 2975,
+ 3630, 0, 4095, 4095, 4095, 4095, 3261, 243, 0, 4095, 0, 0, 3836, 4095, 4095,
+ 4095, 963, 0, 0, 2526, 0, 4095, 4000, 4095, 2069, 0, 0, 4095, 0, 4095, 1421,
+ 0, 4095, 0, 4095, 4095, 0, 4095, 0, 4095, 4095, 1537, 4095, 3201, 0, 0,
+ 4095, 2719, 4095, 0, 4095, 4095, 4095, 0, 4095, 0, 4095, 2300, 0, 2876, 0,
+ 4095, 4095, 4095, 3235, 497, 635, 0, 1480, 4095, 0, 3067, 3979, 3741, 0,
+ 3059, 1214, 4095, 4095, 2197, 0, 4095, 4095, 2734, 0, 4095, 4095, 3364,
+ 2369, 4095, 303, 4095, 0, 4095, 4095, 3472, 1733, 4095, 4095, 4095, 0, 55,
+ 0, 10, 1378, 1169, 4095, 0, 0, 688, 3613, 0, 4095, 2832, 867, 4095, 4095,
+ 3514, 4095, 0, 4095, 4095, 2458, 3506, 0, 1920, 0, 1762, 1178, 2549, 4095,
+ 3967, 4095, 0, 2975, 1282, 0, 377, 846, 3434, 97, 0, 0, 1616, 3526, 136,
+ 1888, 0, 147, 334, 4095, 0, 4095, 0, 4095, 1106, 4095, 0, 4095, 3280, 4095,
+ 4095, 0, 2849, 3528, 0, 4095, 4095, 0, 2306, 0, 3412, 0, 4095, 4095, 4095,
+ 4048, 2273, 0, 4095, 4095, 4095, 0, 4095, 3031, 4095, 4095, 4095, 0, 3382,
+ 3812, 2315, 4095, 0, 0, 0, 432, 4095, 3606, 0, 4, 2847, 4095, 0, 4095, 0, 0,
+ 2616, 4095, 4095, 0, 4095, 0, 3394, 4095, 3976, 3119, 0, 0, 0, 0, 4046,
+ 4095, 4095, 3331, 4095, 2127, 0, 4095, 0, 0, 0, 4095, 4095, 4095, 0, 4095,
+ 4095, 4095, 0, 2068, 0, 0, 3882, 2967, 0, 1745, 4095, 2112, 478, 0, 4095, 0,
+ 199, 4095, 4095, 3542, 4095, 2634, 4095, 4095, 1235, 4095, 4095, 167, 1553,
+ 0, 4095, 2649, 0, 3383, 0, 4095, 2803, 4095, 0, 4095, 0, 785, 4095, 0, 4095,
+ 1743, 4095, 0, 3945, 0, 4095, 1894, 4095, 3973, 4095, 0, 0, 4095, 0, 0,
+ 4095, 318, 4095, 4095, 4095, 0, 261, 4095, 4095, 2125, 2690, 4095, 0, 4095,
+ 3863, 1740, 4095, 0, 2899, 1509, 0, 0, 0, 2780, 4095, 1897, 2104, 4095,
+ 1708, 284, 4095, 0, 4095, 3382, 4095, 4095, 483, 0, 0, 0, 3099, 0, 4095, 0,
+ 926, 4095, 2062, 1931, 2121, 0, 4095, 0, 2485, 1535, 4095, 4095, 3662, 4095,
+ 2419, 2487, 0, 4095, 4095, 4095, 0, 0, 4095, 0, 0, 2029, 0, 3008, 2338, 0,
+ 4095, 0, 3854, 0, 4095, 0, 0, 1315, 0, 0, 0, 0, 3492, 0, 1445, 0, 11, 4095,
+ 0, 0, 873, 0, 4095, 0, 4095, 2654, 3040, 0, 0, 0, 4095, 0, 68, 4095, 0, 0,
+ 990, 0, 828, 1015, 88, 3606, 0, 2875, 4095, 0, 3117, 411, 0, 0, 2859, 0, 0,
+ 4095, 3480, 25, 4095, 4095, 4095, 0, 0, 0, 4095, 4095, 4095, 4095, 1724, 0,
+ 0, 0, 3635, 1063, 3728, 4095, 4095, 2025, 3715, 0, 0, 0, 3722, 0, 1648, 0,
+ 4095, 3579, 0, 0, 0, 4095, 4095, 0, 4095
+ };
+ unsigned char *img_data =
+ reinterpret_cast<unsigned char *>(const_cast<uint16_t *>(buffer));
+
+ aom_image_t img;
+ EXPECT_EQ(
+ aom_img_wrap(&img, AOM_IMG_FMT_I44416, kWidth, kHeight, 1, img_data),
+ &img);
+ img.cp = AOM_CICP_CP_UNSPECIFIED;
+ img.tc = AOM_CICP_TC_UNSPECIFIED;
+ img.mc = AOM_CICP_MC_UNSPECIFIED;
+ img.range = AOM_CR_FULL_RANGE;
+
+ aom_codec_iface_t *iface = aom_codec_av1_cx();
+ aom_codec_enc_cfg_t cfg;
+ EXPECT_EQ(aom_codec_enc_config_default(iface, &cfg, AOM_USAGE_ALL_INTRA),
+ AOM_CODEC_OK);
+ cfg.rc_end_usage = AOM_Q;
+ cfg.g_profile = 2;
+ cfg.g_bit_depth = AOM_BITS_12;
+ cfg.g_input_bit_depth = 12;
+ cfg.g_w = kWidth;
+ cfg.g_h = kHeight;
+ cfg.g_limit = 1;
+ cfg.g_lag_in_frames = 0;
+ cfg.kf_mode = AOM_KF_DISABLED;
+ cfg.kf_max_dist = 0;
+ cfg.g_threads = 34;
+ cfg.rc_min_quantizer = 8;
+ cfg.rc_max_quantizer = 20;
+ aom_codec_ctx_t enc;
+ EXPECT_EQ(aom_codec_enc_init(&enc, iface, &cfg, AOM_CODEC_USE_HIGHBITDEPTH),
+ AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AOME_SET_CQ_LEVEL, 14), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_ROW_MT, 1), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_TILE_ROWS, 4), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_TILE_COLUMNS, 4), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AOME_SET_CPUUSED, 0), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_COLOR_RANGE, AOM_CR_FULL_RANGE),
+ AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_SKIP_POSTPROC_FILTERING, 1),
+ AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AOME_SET_TUNING, AOM_TUNE_SSIM),
+ AOM_CODEC_OK);
+
+ // Encode frame
+ EXPECT_EQ(aom_codec_encode(&enc, &img, 0, 1, 0), AOM_CODEC_OK);
+ aom_codec_iter_t iter = nullptr;
+ const aom_codec_cx_pkt_t *pkt = aom_codec_get_cx_data(&enc, &iter);
+ ASSERT_NE(pkt, nullptr);
+ EXPECT_EQ(pkt->kind, AOM_CODEC_CX_FRAME_PKT);
+ // pkt->data.frame.flags is 0x1f0011.
+ EXPECT_EQ(pkt->data.frame.flags & AOM_FRAME_IS_KEY, AOM_FRAME_IS_KEY);
+ pkt = aom_codec_get_cx_data(&enc, &iter);
+ EXPECT_EQ(pkt, nullptr);
+
+ // Flush encoder
+ EXPECT_EQ(aom_codec_encode(&enc, nullptr, 0, 1, 0), AOM_CODEC_OK);
+ iter = nullptr;
+ pkt = aom_codec_get_cx_data(&enc, &iter);
+ EXPECT_EQ(pkt, nullptr);
+
+ EXPECT_EQ(aom_codec_destroy(&enc), AOM_CODEC_OK);
+}
+
+// A test that reproduces b/272139363: signed integer overflow in
+// update_b_sep_sym().
+TEST(SearchWienerTest, 10bitSignedIntegerOverflowInUpdateBSepSym) {
+ constexpr int kWidth = 34;
+ constexpr int kHeight = 3;
+ static const uint16_t buffer[3 * kWidth * kHeight] = {
+ // Y plane:
+ 61, 765, 674, 188, 367, 944, 153, 275, 906, 433, 154, 51, 8, 855, 186, 154,
+ 392, 0, 634, 3, 690, 1023, 1023, 1023, 1023, 1023, 1023, 8, 1, 64, 426, 0,
+ 100, 344, 944, 816, 816, 33, 1023, 1023, 1023, 1023, 295, 1023, 1023, 1023,
+ 1023, 1023, 1023, 1015, 1023, 231, 1020, 254, 439, 439, 894, 439, 150, 1019,
+ 1023, 1023, 1023, 1023, 1023, 1023, 1023, 1023, 1023, 1023, 385, 320, 575,
+ 682, 1023, 1023, 1023, 1023, 1023, 1023, 1023, 1023, 511, 699, 987, 3, 140,
+ 661, 120, 33, 143, 0, 0, 0, 3, 40, 625, 585, 16, 579, 160, 867,
+ // U plane:
+ 739, 646, 13, 603, 7, 328, 91, 32, 488, 870, 330, 330, 330, 330, 330, 330,
+ 109, 330, 330, 330, 3, 545, 945, 249, 35, 561, 801, 32, 931, 639, 801, 91,
+ 1023, 827, 844, 948, 631, 894, 854, 601, 432, 504, 85, 1, 0, 0, 89, 89, 0,
+ 0, 0, 0, 0, 0, 432, 801, 382, 4, 0, 0, 2, 89, 89, 89, 89, 89, 89, 384, 0, 0,
+ 0, 0, 0, 0, 0, 1023, 1019, 1, 3, 691, 575, 691, 691, 691, 691, 691, 691,
+ 691, 691, 691, 691, 691, 84, 527, 4, 485, 8, 682, 698, 340, 1015, 706,
+ // V plane:
+ 49, 10, 28, 1023, 1023, 1023, 0, 32, 32, 872, 114, 1003, 1023, 57, 477, 999,
+ 1023, 309, 309, 309, 309, 309, 309, 309, 309, 309, 309, 309, 309, 309, 309,
+ 9, 418, 418, 418, 418, 418, 418, 0, 0, 0, 1023, 4, 5, 0, 0, 1023, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 64, 0, 155, 709, 3, 331, 807, 633, 1023,
+ 1018, 646, 886, 991, 692, 915, 294, 0, 35, 2, 0, 471, 643, 770, 346, 176,
+ 32, 329, 322, 302, 61, 765, 674, 188, 367, 944, 153, 275, 906, 433, 154
+ };
+ unsigned char *img_data =
+ reinterpret_cast<unsigned char *>(const_cast<uint16_t *>(buffer));
+
+ aom_image_t img;
+ EXPECT_EQ(&img, aom_img_wrap(&img, AOM_IMG_FMT_I44416, kWidth, kHeight, 1,
+ img_data));
+ img.cp = AOM_CICP_CP_UNSPECIFIED;
+ img.tc = AOM_CICP_TC_UNSPECIFIED;
+ img.mc = AOM_CICP_MC_UNSPECIFIED;
+ img.range = AOM_CR_FULL_RANGE;
+
+ aom_codec_iface_t *iface = aom_codec_av1_cx();
+ aom_codec_enc_cfg_t cfg;
+ EXPECT_EQ(AOM_CODEC_OK,
+ aom_codec_enc_config_default(iface, &cfg, AOM_USAGE_ALL_INTRA));
+ cfg.rc_end_usage = AOM_Q;
+ cfg.g_profile = 1;
+ cfg.g_bit_depth = AOM_BITS_10;
+ cfg.g_input_bit_depth = 10;
+ cfg.g_w = kWidth;
+ cfg.g_h = kHeight;
+ cfg.g_limit = 1;
+ cfg.g_lag_in_frames = 0;
+ cfg.kf_mode = AOM_KF_DISABLED;
+ cfg.kf_max_dist = 0;
+ cfg.rc_min_quantizer = 3;
+ cfg.rc_max_quantizer = 54;
+ aom_codec_ctx_t enc;
+ EXPECT_EQ(AOM_CODEC_OK,
+ aom_codec_enc_init(&enc, iface, &cfg, AOM_CODEC_USE_HIGHBITDEPTH));
+ EXPECT_EQ(AOM_CODEC_OK, aom_codec_control(&enc, AOME_SET_CQ_LEVEL, 28));
+ EXPECT_EQ(AOM_CODEC_OK, aom_codec_control(&enc, AV1E_SET_TILE_COLUMNS, 3));
+ EXPECT_EQ(AOM_CODEC_OK, aom_codec_control(&enc, AOME_SET_CPUUSED, 0));
+ EXPECT_EQ(AOM_CODEC_OK,
+ aom_codec_control(&enc, AV1E_SET_COLOR_RANGE, AOM_CR_FULL_RANGE));
+ EXPECT_EQ(AOM_CODEC_OK,
+ aom_codec_control(&enc, AV1E_SET_SKIP_POSTPROC_FILTERING, 1));
+ EXPECT_EQ(AOM_CODEC_OK,
+ aom_codec_control(&enc, AOME_SET_TUNING, AOM_TUNE_SSIM));
+
+ // Encode frame
+ EXPECT_EQ(AOM_CODEC_OK, aom_codec_encode(&enc, &img, 0, 1, 0));
+ aom_codec_iter_t iter = nullptr;
+ const aom_codec_cx_pkt_t *pkt = aom_codec_get_cx_data(&enc, &iter);
+ ASSERT_NE(pkt, nullptr);
+ EXPECT_EQ(pkt->kind, AOM_CODEC_CX_FRAME_PKT);
+ // pkt->data.frame.flags is 0x1f0011.
+ EXPECT_EQ(pkt->data.frame.flags & AOM_FRAME_IS_KEY, AOM_FRAME_IS_KEY);
+ pkt = aom_codec_get_cx_data(&enc, &iter);
+ EXPECT_EQ(pkt, nullptr);
+
+ // Flush encoder
+ EXPECT_EQ(AOM_CODEC_OK, aom_codec_encode(&enc, nullptr, 0, 1, 0));
+ iter = nullptr;
+ pkt = aom_codec_get_cx_data(&enc, &iter);
+ EXPECT_EQ(pkt, nullptr);
+
+ EXPECT_EQ(AOM_CODEC_OK, aom_codec_destroy(&enc));
+}
+
+// A test that reproduces b/277121724: signed integer overflow in
+// update_b_sep_sym().
+TEST(SearchWienerTest, 8bitSignedIntegerOverflowInUpdateBSepSym) {
+ constexpr int kWidth = 198;
+ constexpr int kHeight = 3;
+ // 8-bit YUV 4:2:2
+ static const unsigned char buffer[2 * kWidth * kHeight] = {
+ // Y plane:
+ 35, 225, 56, 91, 8, 142, 137, 143, 224, 49, 217, 57, 202, 163, 159, 246,
+ 232, 134, 135, 14, 76, 101, 239, 88, 186, 159, 118, 23, 114, 20, 108, 41,
+ 72, 17, 58, 242, 45, 146, 230, 14, 135, 140, 34, 61, 189, 181, 222, 71, 98,
+ 221, 5, 199, 244, 85, 229, 163, 105, 87, 144, 105, 64, 150, 36, 233, 235, 1,
+ 179, 190, 50, 222, 176, 109, 166, 18, 80, 129, 45, 9, 218, 144, 234, 10,
+ 148, 117, 37, 10, 232, 139, 206, 92, 208, 247, 128, 79, 202, 79, 212, 89,
+ 185, 152, 206, 182, 83, 105, 21, 86, 150, 84, 21, 165, 34, 251, 174, 240,
+ 172, 155, 254, 85, 98, 25, 96, 78, 230, 253, 36, 19, 247, 155, 112, 216,
+ 166, 114, 229, 118, 197, 149, 186, 194, 128, 45, 219, 26, 36, 77, 110, 45,
+ 252, 238, 183, 161, 171, 96, 232, 108, 73, 61, 243, 58, 155, 38, 91, 209,
+ 187, 206, 16, 165, 236, 145, 69, 126, 102, 10, 4, 43, 191, 106, 193, 240,
+ 132, 226, 38, 78, 7, 152, 101, 255, 254, 39, 33, 86, 35, 247, 199, 179, 239,
+ 198, 165, 58, 190, 171, 226, 94, 158, 21, 190, 151, 75, 176, 11, 53, 199,
+ 87, 91, 1, 226, 20, 117, 96, 75, 192, 101, 200, 125, 106, 233, 176, 63, 204,
+ 114, 16, 31, 222, 15, 14, 71, 2, 25, 47, 100, 174, 26, 209, 138, 138, 211,
+ 147, 164, 204, 9, 104, 135, 250, 9, 201, 88, 218, 71, 251, 61, 199, 0, 34,
+ 59, 115, 228, 161, 100, 132, 50, 4, 117, 100, 191, 126, 53, 28, 193, 42,
+ 155, 206, 79, 80, 117, 11, 3, 253, 181, 181, 138, 239, 107, 142, 216, 57,
+ 202, 126, 229, 250, 60, 62, 150, 128, 95, 32, 251, 207, 236, 208, 247, 183,
+ 59, 19, 117, 40, 106, 87, 140, 57, 109, 190, 51, 105, 226, 116, 156, 3, 35,
+ 86, 255, 138, 52, 211, 245, 76, 83, 109, 113, 77, 106, 77, 18, 56, 235, 158,
+ 24, 53, 151, 104, 152, 21, 15, 46, 163, 144, 217, 168, 154, 44, 80, 25, 11,
+ 37, 100, 235, 145, 154, 113, 0, 140, 153, 80, 64, 19, 121, 185, 144, 43,
+ 206, 16, 16, 72, 189, 175, 231, 177, 40, 177, 206, 116, 4, 82, 43, 244, 237,
+ 22, 252, 71, 194, 106, 4, 112, 0, 108, 137, 126, 80, 122, 142, 43, 205, 22,
+ 209, 217, 165, 32, 208, 100, 70, 3, 120, 159, 203, 7, 233, 152, 37, 96, 212,
+ 177, 1, 133, 218, 161, 172, 202, 192, 186, 114, 150, 121, 177, 227, 175, 64,
+ 127, 153, 113, 91, 198, 0, 111, 227, 226, 218, 71, 62, 5, 43, 128, 27, 3,
+ 82, 5, 10, 68, 153, 215, 181, 138, 246, 224, 170, 1, 241, 191, 181, 151,
+ 167, 14, 80, 45, 4, 252, 29, 66, 125, 58, 225, 253, 255, 248, 224, 40, 24,
+ 236, 46, 11, 219, 154, 134, 12, 76, 72, 97, 239, 50, 39, 85, 182, 55, 219,
+ 19, 109, 81, 119, 125, 206, 159, 239, 67, 193, 180, 132, 80, 127, 2, 169,
+ 99, 53, 47, 5, 100, 174, 151, 124, 246, 202, 93, 82, 65, 53, 214, 238, 32,
+ 218, 15, 254, 153, 95, 79, 189, 67, 233, 47, 83, 48, 125, 144, 206, 82, 69,
+ 186, 112, 134, 244, 96, 21, 143, 187, 248, 8, 224, 161, 227, 185, 236, 6,
+ 175, 237, 169, 154, 89, 143, 106, 205, 26, 47, 155, 42, 28, 162, 7, 8, 45,
+ // U plane:
+ 55, 165, 203, 139, 152, 208, 36, 177, 61, 49, 129, 211, 140, 71, 253, 250,
+ 120, 167, 238, 67, 255, 223, 104, 32, 240, 179, 28, 41, 86, 84, 61, 243,
+ 169, 212, 201, 0, 9, 236, 89, 194, 204, 75, 228, 250, 27, 81, 137, 29, 255,
+ 131, 194, 241, 76, 133, 186, 135, 212, 197, 150, 145, 203, 96, 86, 231, 91,
+ 119, 197, 67, 226, 2, 118, 66, 181, 86, 219, 86, 132, 137, 156, 161, 221,
+ 18, 55, 170, 35, 206, 201, 193, 38, 63, 229, 29, 110, 96, 14, 135, 229, 99,
+ 106, 108, 167, 110, 50, 32, 144, 113, 48, 29, 57, 29, 20, 199, 145, 245, 9,
+ 183, 88, 174, 114, 237, 29, 40, 99, 117, 233, 6, 51, 227, 2, 28, 76, 149,
+ 190, 23, 240, 73, 113, 10, 73, 240, 105, 220, 129, 26, 144, 214, 34, 4, 24,
+ 219, 24, 156, 198, 214, 244, 143, 106, 255, 204, 93, 2, 88, 107, 211, 241,
+ 242, 86, 189, 219, 164, 132, 149, 32, 228, 219, 60, 202, 218, 189, 34, 250,
+ 160, 158, 36, 212, 212, 41, 233, 61, 92, 121, 170, 220, 192, 232, 255, 124,
+ 249, 231, 55, 196, 219, 196, 62, 238, 187, 76, 33, 138, 67, 82, 159, 169,
+ 196, 66, 196, 110, 194, 64, 35, 205, 64, 218, 12, 41, 188, 195, 244, 178,
+ 17, 80, 8, 149, 39, 110, 146, 164, 162, 215, 227, 107, 103, 47, 52, 95, 3,
+ 181, 90, 255, 80, 83, 206, 66, 153, 112, 72, 109, 235, 69, 105, 57, 75, 145,
+ 186, 16, 87, 73, 61, 98, 197, 237, 17, 32, 207, 220, 246, 188, 46, 73, 121,
+ 84, 252, 164, 111, 21, 98, 13, 170, 174, 170, 231, 77, 10, 113, 9, 217, 11,
+ // V plane:
+ 124, 94, 69, 212, 107, 223, 228, 96, 56, 2, 158, 49, 251, 217, 143, 107,
+ 113, 17, 84, 169, 208, 43, 28, 37, 176, 54, 235, 150, 135, 135, 221, 94, 50,
+ 131, 251, 78, 38, 254, 129, 200, 207, 55, 111, 110, 144, 109, 228, 65, 70,
+ 39, 170, 5, 208, 151, 87, 86, 255, 74, 155, 153, 250, 15, 35, 33, 201, 226,
+ 117, 119, 220, 238, 133, 229, 69, 122, 160, 114, 245, 182, 13, 65, 2, 228,
+ 205, 174, 128, 248, 4, 139, 178, 227, 204, 243, 249, 253, 119, 253, 107,
+ 234, 39, 15, 173, 47, 93, 12, 222, 238, 30, 121, 124, 167, 27, 40, 215, 84,
+ 172, 130, 66, 43, 165, 55, 225, 79, 84, 153, 59, 110, 64, 176, 54, 123, 82,
+ 128, 189, 150, 52, 202, 102, 133, 199, 197, 253, 180, 221, 127, 144, 124,
+ 255, 224, 52, 149, 88, 166, 39, 38, 78, 114, 44, 242, 233, 40, 132, 142,
+ 152, 213, 112, 244, 221, 7, 52, 206, 246, 51, 182, 160, 247, 154, 183, 209,
+ 81, 70, 56, 186, 63, 182, 2, 82, 202, 178, 233, 52, 198, 241, 175, 38, 165,
+ 9, 231, 150, 114, 43, 159, 200, 42, 173, 217, 25, 233, 214, 210, 50, 43,
+ 159, 231, 102, 241, 246, 77, 76, 115, 77, 81, 114, 194, 182, 236, 0, 236,
+ 198, 197, 180, 176, 148, 48, 177, 106, 180, 150, 158, 237, 130, 242, 109,
+ 174, 247, 57, 230, 184, 64, 245, 251, 123, 169, 122, 156, 125, 123, 104,
+ 238, 1, 235, 187, 53, 67, 38, 50, 139, 123, 149, 111, 72, 80, 17, 175, 186,
+ 98, 153, 247, 97, 218, 141, 38, 0, 171, 254, 180, 81, 233, 71, 156, 48, 14,
+ 62, 210, 161, 124, 203, 92
+ };
+ unsigned char *img_data = const_cast<unsigned char *>(buffer);
+
+ aom_image_t img;
+ EXPECT_EQ(aom_img_wrap(&img, AOM_IMG_FMT_I422, kWidth, kHeight, 1, img_data),
+ &img);
+ img.cp = AOM_CICP_CP_UNSPECIFIED;
+ img.tc = AOM_CICP_TC_UNSPECIFIED;
+ img.mc = AOM_CICP_MC_UNSPECIFIED;
+ img.range = AOM_CR_FULL_RANGE;
+
+ aom_codec_iface_t *iface = aom_codec_av1_cx();
+ aom_codec_enc_cfg_t cfg;
+ EXPECT_EQ(aom_codec_enc_config_default(iface, &cfg, AOM_USAGE_ALL_INTRA),
+ AOM_CODEC_OK);
+ cfg.rc_end_usage = AOM_Q;
+ cfg.g_profile = 2;
+ cfg.g_bit_depth = AOM_BITS_8;
+ cfg.g_input_bit_depth = 8;
+ cfg.g_w = kWidth;
+ cfg.g_h = kHeight;
+ cfg.g_limit = 1;
+ cfg.g_lag_in_frames = 0;
+ cfg.kf_mode = AOM_KF_DISABLED;
+ cfg.kf_max_dist = 0;
+ cfg.g_threads = 43;
+ cfg.rc_min_quantizer = 30;
+ cfg.rc_max_quantizer = 50;
+ aom_codec_ctx_t enc;
+ EXPECT_EQ(aom_codec_enc_init(&enc, iface, &cfg, 0), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AOME_SET_CQ_LEVEL, 40), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_ROW_MT, 1), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_TILE_ROWS, 4), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_TILE_COLUMNS, 1), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AOME_SET_CPUUSED, 2), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_COLOR_RANGE, AOM_CR_FULL_RANGE),
+ AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_SKIP_POSTPROC_FILTERING, 1),
+ AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AOME_SET_TUNING, AOM_TUNE_SSIM),
+ AOM_CODEC_OK);
+
+ // Encode frame
+ EXPECT_EQ(aom_codec_encode(&enc, &img, 0, 1, 0), AOM_CODEC_OK);
+ aom_codec_iter_t iter = nullptr;
+ const aom_codec_cx_pkt_t *pkt = aom_codec_get_cx_data(&enc, &iter);
+ ASSERT_NE(pkt, nullptr);
+ EXPECT_EQ(pkt->kind, AOM_CODEC_CX_FRAME_PKT);
+ // pkt->data.frame.flags is 0x1f0011.
+ EXPECT_EQ(pkt->data.frame.flags & AOM_FRAME_IS_KEY, AOM_FRAME_IS_KEY);
+ pkt = aom_codec_get_cx_data(&enc, &iter);
+ EXPECT_EQ(pkt, nullptr);
+
+ // Flush encoder
+ EXPECT_EQ(aom_codec_encode(&enc, nullptr, 0, 1, 0), AOM_CODEC_OK);
+ iter = nullptr;
+ pkt = aom_codec_get_cx_data(&enc, &iter);
+ EXPECT_EQ(pkt, nullptr);
+
+ EXPECT_EQ(aom_codec_destroy(&enc), AOM_CODEC_OK);
+}
+
+// A test that reproduces b/259173819: signed integer overflow in
+// linsolve_wiener().
+TEST(SearchWienerTest, 10bitSignedIntegerOverflowInLinsolveWiener) {
+ constexpr int kWidth = 3;
+ constexpr int kHeight = 3;
+ static const uint16_t buffer[3 * kWidth * kHeight] = {
+ // Y plane:
+ 81, 81, 1023, 1020, 81, 1023, 81, 128, 0,
+ // U plane:
+ 273, 273, 273, 273, 273, 273, 273, 273, 273,
+ // V plane:
+ 273, 273, 273, 273, 273, 273, 516, 81, 81
+ };
+ unsigned char *img_data =
+ reinterpret_cast<unsigned char *>(const_cast<uint16_t *>(buffer));
+
+ aom_image_t img;
+ EXPECT_EQ(
+ aom_img_wrap(&img, AOM_IMG_FMT_I44416, kWidth, kHeight, 1, img_data),
+ &img);
+ img.cp = AOM_CICP_CP_UNSPECIFIED;
+ img.tc = AOM_CICP_TC_UNSPECIFIED;
+ img.mc = AOM_CICP_MC_UNSPECIFIED;
+ img.range = AOM_CR_FULL_RANGE;
+
+ aom_codec_iface_t *iface = aom_codec_av1_cx();
+ aom_codec_enc_cfg_t cfg;
+ EXPECT_EQ(aom_codec_enc_config_default(iface, &cfg, AOM_USAGE_ALL_INTRA),
+ AOM_CODEC_OK);
+ cfg.rc_end_usage = AOM_Q;
+ cfg.g_profile = 1;
+ cfg.g_bit_depth = AOM_BITS_10;
+ cfg.g_input_bit_depth = 10;
+ cfg.g_w = kWidth;
+ cfg.g_h = kHeight;
+ cfg.g_limit = 1;
+ cfg.g_lag_in_frames = 0;
+ cfg.kf_mode = AOM_KF_DISABLED;
+ cfg.kf_max_dist = 0;
+ cfg.g_threads = 21;
+ cfg.rc_min_quantizer = 16;
+ cfg.rc_max_quantizer = 54;
+ aom_codec_ctx_t enc;
+ EXPECT_EQ(aom_codec_enc_init(&enc, iface, &cfg, AOM_CODEC_USE_HIGHBITDEPTH),
+ AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AOME_SET_CQ_LEVEL, 35), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_ROW_MT, 1), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_TILE_ROWS, 2), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_TILE_COLUMNS, 5), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AOME_SET_CPUUSED, 1), AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_COLOR_RANGE, AOM_CR_FULL_RANGE),
+ AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AV1E_SET_SKIP_POSTPROC_FILTERING, 1),
+ AOM_CODEC_OK);
+ EXPECT_EQ(aom_codec_control(&enc, AOME_SET_TUNING, AOM_TUNE_SSIM),
+ AOM_CODEC_OK);
+
+ // Encode frame
+ EXPECT_EQ(aom_codec_encode(&enc, &img, 0, 1, 0), AOM_CODEC_OK);
+ aom_codec_iter_t iter = nullptr;
+ const aom_codec_cx_pkt_t *pkt = aom_codec_get_cx_data(&enc, &iter);
+ ASSERT_NE(pkt, nullptr);
+ EXPECT_EQ(pkt->kind, AOM_CODEC_CX_FRAME_PKT);
+ // pkt->data.frame.flags is 0x1f0011.
+ EXPECT_EQ(pkt->data.frame.flags & AOM_FRAME_IS_KEY, AOM_FRAME_IS_KEY);
+ pkt = aom_codec_get_cx_data(&enc, &iter);
+ EXPECT_EQ(pkt, nullptr);
+
+ // Flush encoder
+ EXPECT_EQ(aom_codec_encode(&enc, nullptr, 0, 1, 0), AOM_CODEC_OK);
+ iter = nullptr;
+ pkt = aom_codec_get_cx_data(&enc, &iter);
+ EXPECT_EQ(pkt, nullptr);
+
+ EXPECT_EQ(aom_codec_destroy(&enc), AOM_CODEC_OK);
+}
+
} // namespace wiener_highbd
#endif // CONFIG_AV1_HIGHBITDEPTH
diff --git a/third_party/fastfeat/README.libaom b/third_party/fastfeat/README.libaom
index ce7ce70..8aaee12 100644
--- a/third_party/fastfeat/README.libaom
+++ b/third_party/fastfeat/README.libaom
@@ -39,3 +39,5 @@
Convert tabs to spaces
Prefix global functions with "aom_"
Add error checking
+Add output argument to hold the scores of the detected features
+Add assertion and rewrite comparisons to appease the scan-build static analyzer
diff --git a/third_party/fastfeat/fast.c b/third_party/fastfeat/fast.c
index 30efde8..a684a33 100644
--- a/third_party/fastfeat/fast.c
+++ b/third_party/fastfeat/fast.c
@@ -33,20 +33,21 @@
#include "fast.h"
-xy* aom_fast9_detect_nonmax(const byte* im, int xsize, int ysize, int stride, int b, int* ret_num_corners)
+xy* aom_fast9_detect_nonmax(const byte* im, int xsize, int ysize, int stride, int b,
+ int** ret_scores, int* ret_num_corners)
{
- xy* corners;
- int num_corners;
- int* scores;
- xy* nonmax;
+ xy* corners;
+ int num_corners;
+ int* scores;
+ xy* nonmax;
- corners = aom_fast9_detect(im, xsize, ysize, stride, b, &num_corners);
- scores = aom_fast9_score(im, stride, corners, num_corners, b);
- nonmax = aom_nonmax_suppression(corners, scores, num_corners, ret_num_corners);
+ corners = aom_fast9_detect(im, xsize, ysize, stride, b, &num_corners);
+ scores = aom_fast9_score(im, stride, corners, num_corners, b);
+ nonmax = aom_nonmax_suppression(corners, scores, num_corners, ret_scores, ret_num_corners);
- free(corners);
- free(scores);
+ free(corners);
+ free(scores);
- return nonmax;
+ return nonmax;
}
// clang-format on
diff --git a/third_party/fastfeat/fast.h b/third_party/fastfeat/fast.h
index d7a9617..7fd199f 100644
--- a/third_party/fastfeat/fast.h
+++ b/third_party/fastfeat/fast.h
@@ -41,9 +41,11 @@
int* aom_fast9_score(const byte* i, int stride, xy* corners, int num_corners, int b);
-xy* aom_fast9_detect_nonmax(const byte* im, int xsize, int ysize, int stride, int b, int* ret_num_corners);
+xy* aom_fast9_detect_nonmax(const byte* im, int xsize, int ysize, int stride, int b,
+ int** ret_scores, int* ret_num_corners);
-xy* aom_nonmax_suppression(const xy* corners, const int* scores, int num_corners, int* ret_num_nonmax);
+xy* aom_nonmax_suppression(const xy* corners, const int* scores, int num_corners,
+ int** ret_scores, int* ret_num_nonmax);
#endif
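
For reference, a minimal sketch of how a caller would use the revised fastfeat
API after this change: the detector now also returns the per-corner scores
through the new ret_scores output argument, and the caller owns (and must free)
both returned arrays. The typedefs assumed below (byte as unsigned char, xy as
a struct of two ints), the threshold value, and the helper name are
illustrative assumptions, not taken from this diff.

    /* Sketch only; assumes fastfeat's usual typedefs and an image buffer
       that has already been filled. */
    #include <stdio.h>
    #include <stdlib.h>
    #include "fast.h"

    static void detect_example(const byte *img, int w, int h) {
      int num_corners = 0;
      int *scores = NULL;  /* new output: score of each surviving corner */
      xy *corners = aom_fast9_detect_nonmax(img, w, h, /*stride=*/w,
                                            /*b=*/20, &scores, &num_corners);
      if (!corners) return;  /* failure or no corners; scores is NULL too */
      for (int i = 0; i < num_corners; i++)
        printf("corner (%d, %d) score %d\n", corners[i].x, corners[i].y,
               scores[i]);
      free(corners);  /* caller owns both arrays */
      free(scores);
    }
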
diff --git a/third_party/fastfeat/nonmax.c b/third_party/fastfeat/nonmax.c
index 39ec18c..cc0ada7 100644
--- a/third_party/fastfeat/nonmax.c
+++ b/third_party/fastfeat/nonmax.c
@@ -29,19 +29,22 @@
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
// clang-format off
+#include <assert.h>
#include <stdlib.h>
#include "fast.h"
#define Compare(X, Y) ((X)>=(Y))
-xy* aom_nonmax_suppression(const xy* corners, const int* scores, int num_corners, int* ret_num_nonmax)
+xy* aom_nonmax_suppression(const xy* corners, const int* scores, int num_corners,
+ int** ret_scores, int* ret_num_nonmax)
{
int num_nonmax=0;
int last_row;
int* row_start;
int i, j;
xy* ret_nonmax;
+ int* nonmax_scores;
const int sz = (int)num_corners;
/*Point above points (roughly) to the pixel above the one of interest, if there
@@ -49,6 +52,7 @@
int point_above = 0;
int point_below = 0;
+ *ret_scores = 0;
*ret_num_nonmax = 0;
if(!(corners && scores) || num_corners < 1)
{
@@ -61,6 +65,13 @@
return 0;
}
+ nonmax_scores = (int*)malloc(num_corners * sizeof(*nonmax_scores));
+ if (!nonmax_scores)
+ {
+ free(ret_nonmax);
+ return 0;
+ }
+
/* Find where each row begins
(the corners are output in raster scan order). A beginning of -1 signifies
that there are no corners on that row. */
@@ -69,6 +80,7 @@
if(!row_start)
{
free(ret_nonmax);
+ free(nonmax_scores);
return 0;
}
@@ -91,6 +103,7 @@
{
int score = scores[i];
xy pos = corners[i];
+ assert(pos.y <= last_row);
/*Check left */
if(i > 0)
@@ -103,55 +116,56 @@
continue;
/*Check above (if there is a valid row above)*/
- if(pos.y > 0)
- if (row_start[pos.y - 1] != -1)
+ if(pos.y > 0 && row_start[pos.y - 1] != -1)
+ {
+ /*Make sure that current point_above is one
+ row above.*/
+ if(corners[point_above].y < pos.y - 1)
+ point_above = row_start[pos.y-1];
+
+ /*Make point_above point to the first of the pixels above the current point,
+ if it exists.*/
+ for(; corners[point_above].y < pos.y && corners[point_above].x < pos.x - 1; point_above++)
+ {}
+
+
+ for(j=point_above; corners[j].y < pos.y && corners[j].x <= pos.x + 1; j++)
{
- /*Make sure that current point_above is one
- row above.*/
- if(corners[point_above].y < pos.y - 1)
- point_above = row_start[pos.y-1];
-
- /*Make point_above point to the first of the pixels above the current point,
- if it exists.*/
- for(; corners[point_above].y < pos.y && corners[point_above].x < pos.x - 1; point_above++)
- {}
-
-
- for(j=point_above; corners[j].y < pos.y && corners[j].x <= pos.x + 1; j++)
- {
- int x = corners[j].x;
- if( (x == pos.x - 1 || x ==pos.x || x == pos.x+1) && Compare(scores[j], score))
- goto cont;
- }
-
+ int x = corners[j].x;
+ if( (x == pos.x - 1 || x ==pos.x || x == pos.x+1) && Compare(scores[j], score))
+ goto cont;
}
+ }
+
/*Check below (if there is anything below)*/
- if(pos.y >= 0)
- if (pos.y != last_row && row_start[pos.y + 1] != -1 && point_below < sz) /*Nothing below*/
+ if (pos.y + 1 < last_row+1 && row_start[pos.y + 1] != -1 && point_below < sz) /*Nothing below*/
+ {
+ if(corners[point_below].y < pos.y + 1)
+ point_below = row_start[pos.y+1];
+
+ /* Make point below point to one of the pixels belowthe current point, if it
+ exists.*/
+ for(; point_below < sz && corners[point_below].y == pos.y+1 && corners[point_below].x < pos.x - 1; point_below++)
+ {}
+
+ for(j=point_below; j < sz && corners[j].y == pos.y+1 && corners[j].x <= pos.x + 1; j++)
{
- if(corners[point_below].y < pos.y + 1)
- point_below = row_start[pos.y+1];
-
- /* Make point below point to one of the pixels belowthe current point, if it
- exists.*/
- for(; point_below < sz && corners[point_below].y == pos.y+1 && corners[point_below].x < pos.x - 1; point_below++)
- {}
-
- for(j=point_below; j < sz && corners[j].y == pos.y+1 && corners[j].x <= pos.x + 1; j++)
- {
- int x = corners[j].x;
- if( (x == pos.x - 1 || x ==pos.x || x == pos.x+1) && Compare(scores[j],score))
- goto cont;
- }
+ int x = corners[j].x;
+ if( (x == pos.x - 1 || x ==pos.x || x == pos.x+1) && Compare(scores[j],score))
+ goto cont;
}
+ }
- ret_nonmax[num_nonmax++] = corners[i];
+ ret_nonmax[num_nonmax] = corners[i];
+ nonmax_scores[num_nonmax] = scores[i];
+ num_nonmax++;
cont:
;
}
free(row_start);
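+  /* Hand the score array to the caller, who owns it along with ret_nonmax. */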
+ *ret_scores = nonmax_scores;
*ret_num_nonmax = num_nonmax;
return ret_nonmax;
}
diff --git a/third_party/libwebm/AUTHORS.TXT b/third_party/libwebm/AUTHORS.TXT
index 9686ac1..59b648c 100644
--- a/third_party/libwebm/AUTHORS.TXT
+++ b/third_party/libwebm/AUTHORS.TXT
@@ -2,3 +2,4 @@
# Name or Organization <email address>
Google Inc.
+Elijah Cirioli <[email protected]>
diff --git a/third_party/libwebm/Android.mk b/third_party/libwebm/Android.mk
index 1185198..e6c17df 100644
--- a/third_party/libwebm/Android.mk
+++ b/third_party/libwebm/Android.mk
@@ -1,3 +1,5 @@
+# Ignore this file during non-NDK builds.
+ifdef NDK_ROOT
LOCAL_PATH:= $(call my-dir)
include $(CLEAR_VARS)
@@ -18,3 +20,4 @@
LOCAL_LICENSE_CONDITIONS := notice
LOCAL_NOTICE_FILE := $(LOCAL_PATH)/LICENSE.TXT $(LOCAL_PATH)/PATENTS.TXT
include $(BUILD_STATIC_LIBRARY)
+endif # NDK_ROOT
diff --git a/third_party/libwebm/README.libaom b/third_party/libwebm/README.libaom
index 325604c..ee350a5 100644
--- a/third_party/libwebm/README.libaom
+++ b/third_party/libwebm/README.libaom
@@ -1,7 +1,7 @@
URL: https://chromium.googlesource.com/webm/libwebm
-Version: ee0bab576c338c9807249b99588e352b7268cb62
+Version: 1930e3ca23b007f3ff11d98a570077be6201957e
License: BSD
-License File: LICENSE.txt
+License File: LICENSE.TXT
Description:
libwebm is used to handle WebM container I/O.
diff --git a/third_party/libwebm/mkvmuxer/mkvmuxer.cc b/third_party/libwebm/mkvmuxer/mkvmuxer.cc
index ae36531..faaf016 100644
--- a/third_party/libwebm/mkvmuxer/mkvmuxer.cc
+++ b/third_party/libwebm/mkvmuxer/mkvmuxer.cc
@@ -607,10 +607,10 @@
return true;
}
-uint64_t ContentEncoding::EncodingSize(uint64_t compresion_size,
+uint64_t ContentEncoding::EncodingSize(uint64_t compression_size,
uint64_t encryption_size) const {
// TODO(fgalligan): Add support for compression settings.
- if (compresion_size != 0)
+ if (compression_size != 0)
return 0;
uint64_t encoding_size = 0;
diff --git a/third_party/libwebm/mkvmuxer/mkvmuxer.h b/third_party/libwebm/mkvmuxer/mkvmuxer.h
index f2db377..8602d82 100644
--- a/third_party/libwebm/mkvmuxer/mkvmuxer.h
+++ b/third_party/libwebm/mkvmuxer/mkvmuxer.h
@@ -330,7 +330,7 @@
private:
// Returns the size in bytes for the encoding elements.
- uint64_t EncodingSize(uint64_t compresion_size,
+ uint64_t EncodingSize(uint64_t compression_size,
uint64_t encryption_size) const;
// Returns the size in bytes for the encryption elements.
@@ -1425,7 +1425,7 @@
bool Write(IMkvWriter* writer);
// We are going to put a cap on the number of Seek Entries.
- const static int32_t kSeekEntryCount = 5;
+ constexpr static int32_t kSeekEntryCount = 5;
private:
// Returns the maximum size in bytes of one seek entry.
@@ -1505,8 +1505,8 @@
kBeforeClusters = 0x1 // Position Cues before Clusters
};
- static const uint32_t kDefaultDocTypeVersion = 4;
- static const uint64_t kDefaultMaxClusterDuration = 30000000000ULL;
+ static constexpr uint32_t kDefaultDocTypeVersion = 4;
+ static constexpr uint64_t kDefaultMaxClusterDuration = 30000000000ULL;
Segment();
~Segment();
diff --git a/third_party/libwebm/mkvmuxer/mkvmuxerutil.cc b/third_party/libwebm/mkvmuxer/mkvmuxerutil.cc
index bd2f769..300b155 100644
--- a/third_party/libwebm/mkvmuxer/mkvmuxerutil.cc
+++ b/third_party/libwebm/mkvmuxer/mkvmuxerutil.cc
@@ -607,7 +607,7 @@
void GetVersion(int32* major, int32* minor, int32* build, int32* revision) {
*major = 0;
*minor = 3;
- *build = 0;
+ *build = 1;
*revision = 0;
}
diff --git a/third_party/libwebm/mkvparser/mkvparser.cc b/third_party/libwebm/mkvparser/mkvparser.cc
index de8884b..868afcb 100644
--- a/third_party/libwebm/mkvparser/mkvparser.cc
+++ b/third_party/libwebm/mkvparser/mkvparser.cc
@@ -55,7 +55,7 @@
void GetVersion(int& major, int& minor, int& build, int& revision) {
major = 1;
minor = 1;
- build = 0;
+ build = 1;
revision = 0;
}
@@ -298,7 +298,7 @@
if (status < 0)
return status;
- unsigned long long result = first_byte;
+ unsigned long long result = static_cast<unsigned long long>(first_byte);
++pos;
for (long i = 1; i < size; ++i) {
@@ -2432,7 +2432,7 @@
pos += size; // consume payload
}
- if ((m_pos < 0) || (m_track <= 0)) {
+ if ((m_pos < 0) || (m_track <= 0) || (m_block < 0) || (m_block > LONG_MAX)) {
return false;
}
diff --git a/third_party/libyuv/source/row_x86.asm b/third_party/libyuv/source/row_x86.asm
deleted file mode 100644
index 0cb326f..0000000
--- a/third_party/libyuv/source/row_x86.asm
+++ /dev/null
@@ -1,146 +0,0 @@
-;
-; Copyright 2012 The LibYuv Project Authors. All rights reserved.
-;
-; Use of this source code is governed by a BSD-style license
-; that can be found in the LICENSE file in the root of the source
-; tree. An additional intellectual property rights grant can be found
-; in the file PATENTS. All contributing project authors may
-; be found in the AUTHORS file in the root of the source tree.
-;
-
-%ifdef __YASM_VERSION_ID__
-%if __YASM_VERSION_ID__ < 01020000h
-%error AVX2 is supported only by yasm 1.2.0 or later.
-%endif
-%endif
-%include "x86inc.asm"
-
-SECTION .text
-
-; cglobal numeric constants are parameters, gpr regs, mm regs
-
-; void YUY2ToYRow_SSE2(const uint8* src_yuy2, uint8* dst_y, int pix)
-
-%macro YUY2TOYROW 2-3
-cglobal %1ToYRow%3, 3, 3, 3, src_yuy2, dst_y, pix
-%ifidn %1,YUY2
- pcmpeqb m2, m2, m2 ; generate mask 0x00ff00ff
- psrlw m2, m2, 8
-%endif
-
- ALIGN 4
-.convertloop:
- mov%2 m0, [src_yuy2q]
- mov%2 m1, [src_yuy2q + mmsize]
- lea src_yuy2q, [src_yuy2q + mmsize * 2]
-%ifidn %1,YUY2
- pand m0, m0, m2 ; YUY2 even bytes are Y
- pand m1, m1, m2
-%else
- psrlw m0, m0, 8 ; UYVY odd bytes are Y
- psrlw m1, m1, 8
-%endif
- packuswb m0, m0, m1
-%if cpuflag(AVX2)
- vpermq m0, m0, 0xd8
-%endif
- sub pixd, mmsize
- mov%2 [dst_yq], m0
- lea dst_yq, [dst_yq + mmsize]
- jg .convertloop
- REP_RET
-%endmacro
-
-; TODO(fbarchard): Remove MMX. Add SSSE3 pshufb version.
-INIT_MMX MMX
-YUY2TOYROW YUY2,a,
-YUY2TOYROW YUY2,u,_Unaligned
-YUY2TOYROW UYVY,a,
-YUY2TOYROW UYVY,u,_Unaligned
-INIT_XMM SSE2
-YUY2TOYROW YUY2,a,
-YUY2TOYROW YUY2,u,_Unaligned
-YUY2TOYROW UYVY,a,
-YUY2TOYROW UYVY,u,_Unaligned
-INIT_YMM AVX2
-YUY2TOYROW YUY2,a,
-YUY2TOYROW UYVY,a,
-
-; void SplitUVRow_SSE2(const uint8* src_uv, uint8* dst_u, uint8* dst_v, int pix)
-
-%macro SplitUVRow 1-2
-cglobal SplitUVRow%2, 4, 4, 5, src_uv, dst_u, dst_v, pix
- pcmpeqb m4, m4, m4 ; generate mask 0x00ff00ff
- psrlw m4, m4, 8
- sub dst_vq, dst_uq
-
- ALIGN 4
-.convertloop:
- mov%1 m0, [src_uvq]
- mov%1 m1, [src_uvq + mmsize]
- lea src_uvq, [src_uvq + mmsize * 2]
- psrlw m2, m0, 8 ; odd bytes
- psrlw m3, m1, 8
- pand m0, m0, m4 ; even bytes
- pand m1, m1, m4
- packuswb m0, m0, m1
- packuswb m2, m2, m3
-%if cpuflag(AVX2)
- vpermq m0, m0, 0xd8
- vpermq m2, m2, 0xd8
-%endif
- mov%1 [dst_uq], m0
- mov%1 [dst_uq + dst_vq], m2
- lea dst_uq, [dst_uq + mmsize]
- sub pixd, mmsize
- jg .convertloop
- REP_RET
-%endmacro
-
-INIT_MMX MMX
-SplitUVRow a,
-SplitUVRow u,_Unaligned
-INIT_XMM SSE2
-SplitUVRow a,
-SplitUVRow u,_Unaligned
-INIT_YMM AVX2
-SplitUVRow a,
-
-; void MergeUVRow_SSE2(const uint8* src_u, const uint8* src_v, uint8* dst_uv,
-; int width);
-
-%macro MergeUVRow_ 1-2
-cglobal MergeUVRow_%2, 4, 4, 3, src_u, src_v, dst_uv, pix
- sub src_vq, src_uq
-
- ALIGN 4
-.convertloop:
- mov%1 m0, [src_uq]
- mov%1 m1, [src_vq]
- lea src_uq, [src_uq + mmsize]
- punpcklbw m2, m0, m1 // first 8 UV pairs
- punpckhbw m0, m0, m1 // next 8 UV pairs
-%if cpuflag(AVX2)
- vperm2i128 m1, m2, m0, 0x20 // low 128 of ymm2 and low 128 of ymm0
- vperm2i128 m2, m2, m0, 0x31 // high 128 of ymm2 and high 128 of ymm0
- mov%1 [dst_uvq], m1
- mov%1 [dst_uvq + mmsize], m2
-%else
- mov%1 [dst_uvq], m2
- mov%1 [dst_uvq + mmsize], m0
-%endif
- lea dst_uvq, [dst_uvq + mmsize * 2]
- sub pixd, mmsize
- jg .convertloop
- REP_RET
-%endmacro
-
-INIT_MMX MMX
-MergeUVRow_ a,
-MergeUVRow_ u,_Unaligned
-INIT_XMM SSE2
-MergeUVRow_ a,
-MergeUVRow_ u,_Unaligned
-INIT_YMM AVX2
-MergeUVRow_ a,
-
diff --git a/third_party/libyuv/source/x86inc.asm b/third_party/libyuv/source/x86inc.asm
deleted file mode 100644
index cb5c32d..0000000
--- a/third_party/libyuv/source/x86inc.asm
+++ /dev/null
@@ -1,1136 +0,0 @@
-;*****************************************************************************
-;* x86inc.asm: x264asm abstraction layer
-;*****************************************************************************
-;* Copyright (C) 2005-2012 x264 project
-;*
-;* Authors: Loren Merritt <[email protected]>
-;* Anton Mitrofanov <[email protected]>
-;* Jason Garrett-Glaser <[email protected]>
-;* Henrik Gramner <[email protected]>
-;*
-;* Permission to use, copy, modify, and/or distribute this software for any
-;* purpose with or without fee is hereby granted, provided that the above
-;* copyright notice and this permission notice appear in all copies.
-;*
-;* THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
-;* WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
-;* MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
-;* ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
-;* WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
-;* ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
-;* OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
-;*****************************************************************************
-
-; This is a header file for the x264ASM assembly language, which uses
-; NASM/YASM syntax combined with a large number of macros to provide easy
-; abstraction between different calling conventions (x86_32, win64, linux64).
-; It also has various other useful features to simplify writing the kind of
-; DSP functions that are most often used in x264.
-
-; Unlike the rest of x264, this file is available under an ISC license, as it
-; has significant usefulness outside of x264 and we want it to be available
-; to the largest audience possible. Of course, if you modify it for your own
-; purposes to add a new feature, we strongly encourage contributing a patch
-; as this feature might be useful for others as well. Send patches or ideas
-; to [email protected] .
-
-; Local changes for libyuv:
-; remove %define program_name and references in labels
-; rename cpus to uppercase
-
-%define WIN64 0
-%define UNIX64 0
-%if ARCH_X86_64
- %ifidn __OUTPUT_FORMAT__,win32
- %define WIN64 1
- %elifidn __OUTPUT_FORMAT__,win64
- %define WIN64 1
- %else
- %define UNIX64 1
- %endif
-%endif
-
-%ifdef PREFIX
- %define mangle(x) _ %+ x
-%else
- %define mangle(x) x
-%endif
-
-; Name of the .rodata section.
-; Kludge: Something on OS X fails to align .rodata even given an align attribute,
-; so use a different read-only section.
-%macro SECTION_RODATA 0-1 16
- %ifidn __OUTPUT_FORMAT__,macho64
- SECTION .text align=%1
- %elifidn __OUTPUT_FORMAT__,macho
- SECTION .text align=%1
- fakegot:
- %elifidn __OUTPUT_FORMAT__,aout
- section .text
- %else
- SECTION .rodata align=%1
- %endif
-%endmacro
-
-; aout does not support align=
-%macro SECTION_TEXT 0-1 16
- %ifidn __OUTPUT_FORMAT__,aout
- SECTION .text
- %else
- SECTION .text align=%1
- %endif
-%endmacro
-
-%if WIN64
- %define PIC
-%elif ARCH_X86_64 == 0
-; x86_32 doesn't require PIC.
-; Some distros prefer shared objects to be PIC, but nothing breaks if
-; the code contains a few textrels, so we'll skip that complexity.
- %undef PIC
-%endif
-%ifdef PIC
- default rel
-%endif
-
-; Always use long nops (reduces 0x90 spam in disassembly on x86_32)
-CPU amdnop
-
-; Macros to eliminate most code duplication between x86_32 and x86_64:
-; Currently this works only for leaf functions which load all their arguments
-; into registers at the start, and make no other use of the stack. Luckily that
-; covers most of x264's asm.
-
-; PROLOGUE:
-; %1 = number of arguments. loads them from stack if needed.
-; %2 = number of registers used. pushes callee-saved regs if needed.
-; %3 = number of xmm registers used. pushes callee-saved xmm regs if needed.
-; %4 = list of names to define to registers
-; PROLOGUE can also be invoked by adding the same options to cglobal
-
-; e.g.
-; cglobal foo, 2,3,0, dst, src, tmp
-; declares a function (foo), taking two args (dst and src) and one local variable (tmp)
-
-; TODO Some functions can use some args directly from the stack. If they're the
-; last args then you can just not declare them, but if they're in the middle
-; we need more flexible macro.
-
-; RET:
-; Pops anything that was pushed by PROLOGUE, and returns.
-
-; REP_RET:
-; Same, but if it doesn't pop anything it becomes a 2-byte ret, for athlons
-; which are slow when a normal ret follows a branch.
-
-; registers:
-; rN and rNq are the native-size register holding function argument N
-; rNd, rNw, rNb are dword, word, and byte size
-; rNh is the high 8 bits of the word size
-; rNm is the original location of arg N (a register or on the stack), dword
-; rNmp is native size
-
-%macro DECLARE_REG 2-3
- %define r%1q %2
- %define r%1d %2d
- %define r%1w %2w
- %define r%1b %2b
- %define r%1h %2h
- %if %0 == 2
- %define r%1m %2d
- %define r%1mp %2
- %elif ARCH_X86_64 ; memory
- %define r%1m [rsp + stack_offset + %3]
- %define r%1mp qword r %+ %1m
- %else
- %define r%1m [esp + stack_offset + %3]
- %define r%1mp dword r %+ %1m
- %endif
- %define r%1 %2
-%endmacro
-
-%macro DECLARE_REG_SIZE 3
- %define r%1q r%1
- %define e%1q r%1
- %define r%1d e%1
- %define e%1d e%1
- %define r%1w %1
- %define e%1w %1
- %define r%1h %3
- %define e%1h %3
- %define r%1b %2
- %define e%1b %2
-%if ARCH_X86_64 == 0
- %define r%1 e%1
-%endif
-%endmacro
-
-DECLARE_REG_SIZE ax, al, ah
-DECLARE_REG_SIZE bx, bl, bh
-DECLARE_REG_SIZE cx, cl, ch
-DECLARE_REG_SIZE dx, dl, dh
-DECLARE_REG_SIZE si, sil, null
-DECLARE_REG_SIZE di, dil, null
-DECLARE_REG_SIZE bp, bpl, null
-
-; t# defines for when per-arch register allocation is more complex than just function arguments
-
-%macro DECLARE_REG_TMP 1-*
- %assign %%i 0
- %rep %0
- CAT_XDEFINE t, %%i, r%1
- %assign %%i %%i+1
- %rotate 1
- %endrep
-%endmacro
-
-%macro DECLARE_REG_TMP_SIZE 0-*
- %rep %0
- %define t%1q t%1 %+ q
- %define t%1d t%1 %+ d
- %define t%1w t%1 %+ w
- %define t%1h t%1 %+ h
- %define t%1b t%1 %+ b
- %rotate 1
- %endrep
-%endmacro
-
-DECLARE_REG_TMP_SIZE 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
-
-%if ARCH_X86_64
- %define gprsize 8
-%else
- %define gprsize 4
-%endif
-
-%macro PUSH 1
- push %1
- %assign stack_offset stack_offset+gprsize
-%endmacro
-
-%macro POP 1
- pop %1
- %assign stack_offset stack_offset-gprsize
-%endmacro
-
-%macro PUSH_IF_USED 1-*
- %rep %0
- %if %1 < regs_used
- PUSH r%1
- %endif
- %rotate 1
- %endrep
-%endmacro
-
-%macro POP_IF_USED 1-*
- %rep %0
- %if %1 < regs_used
- pop r%1
- %endif
- %rotate 1
- %endrep
-%endmacro
-
-%macro LOAD_IF_USED 1-*
- %rep %0
- %if %1 < num_args
- mov r%1, r %+ %1 %+ mp
- %endif
- %rotate 1
- %endrep
-%endmacro
-
-%macro SUB 2
- sub %1, %2
- %ifidn %1, rsp
- %assign stack_offset stack_offset+(%2)
- %endif
-%endmacro
-
-%macro ADD 2
- add %1, %2
- %ifidn %1, rsp
- %assign stack_offset stack_offset-(%2)
- %endif
-%endmacro
-
-%macro movifnidn 2
- %ifnidn %1, %2
- mov %1, %2
- %endif
-%endmacro
-
-%macro movsxdifnidn 2
- %ifnidn %1, %2
- movsxd %1, %2
- %endif
-%endmacro
-
-%macro ASSERT 1
- %if (%1) == 0
- %error assert failed
- %endif
-%endmacro
-
-%macro DEFINE_ARGS 0-*
- %ifdef n_arg_names
- %assign %%i 0
- %rep n_arg_names
- CAT_UNDEF arg_name %+ %%i, q
- CAT_UNDEF arg_name %+ %%i, d
- CAT_UNDEF arg_name %+ %%i, w
- CAT_UNDEF arg_name %+ %%i, h
- CAT_UNDEF arg_name %+ %%i, b
- CAT_UNDEF arg_name %+ %%i, m
- CAT_UNDEF arg_name %+ %%i, mp
- CAT_UNDEF arg_name, %%i
- %assign %%i %%i+1
- %endrep
- %endif
-
- %xdefine %%stack_offset stack_offset
- %undef stack_offset ; so that the current value of stack_offset doesn't get baked in by xdefine
- %assign %%i 0
- %rep %0
- %xdefine %1q r %+ %%i %+ q
- %xdefine %1d r %+ %%i %+ d
- %xdefine %1w r %+ %%i %+ w
- %xdefine %1h r %+ %%i %+ h
- %xdefine %1b r %+ %%i %+ b
- %xdefine %1m r %+ %%i %+ m
- %xdefine %1mp r %+ %%i %+ mp
- CAT_XDEFINE arg_name, %%i, %1
- %assign %%i %%i+1
- %rotate 1
- %endrep
- %xdefine stack_offset %%stack_offset
- %assign n_arg_names %0
-%endmacro
-
-%if WIN64 ; Windows x64 ;=================================================
-
-DECLARE_REG 0, rcx
-DECLARE_REG 1, rdx
-DECLARE_REG 2, R8
-DECLARE_REG 3, R9
-DECLARE_REG 4, R10, 40
-DECLARE_REG 5, R11, 48
-DECLARE_REG 6, rax, 56
-DECLARE_REG 7, rdi, 64
-DECLARE_REG 8, rsi, 72
-DECLARE_REG 9, rbx, 80
-DECLARE_REG 10, rbp, 88
-DECLARE_REG 11, R12, 96
-DECLARE_REG 12, R13, 104
-DECLARE_REG 13, R14, 112
-DECLARE_REG 14, R15, 120
-
-%macro PROLOGUE 2-4+ 0 ; #args, #regs, #xmm_regs, arg_names...
- %assign num_args %1
- %assign regs_used %2
- ASSERT regs_used >= num_args
- ASSERT regs_used <= 15
- PUSH_IF_USED 7, 8, 9, 10, 11, 12, 13, 14
- %if mmsize == 8
- %assign xmm_regs_used 0
- %else
- WIN64_SPILL_XMM %3
- %endif
- LOAD_IF_USED 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
- DEFINE_ARGS %4
-%endmacro
-
-%macro WIN64_SPILL_XMM 1
- %assign xmm_regs_used %1
- ASSERT xmm_regs_used <= 16
- %if xmm_regs_used > 6
- SUB rsp, (xmm_regs_used-6)*16+16
- %assign %%i xmm_regs_used
- %rep (xmm_regs_used-6)
- %assign %%i %%i-1
- movdqa [rsp + (%%i-6)*16+(~stack_offset&8)], xmm %+ %%i
- %endrep
- %endif
-%endmacro
-
-%macro WIN64_RESTORE_XMM_INTERNAL 1
- %if xmm_regs_used > 6
- %assign %%i xmm_regs_used
- %rep (xmm_regs_used-6)
- %assign %%i %%i-1
- movdqa xmm %+ %%i, [%1 + (%%i-6)*16+(~stack_offset&8)]
- %endrep
- add %1, (xmm_regs_used-6)*16+16
- %endif
-%endmacro
-
-%macro WIN64_RESTORE_XMM 1
- WIN64_RESTORE_XMM_INTERNAL %1
- %assign stack_offset stack_offset-(xmm_regs_used-6)*16+16
- %assign xmm_regs_used 0
-%endmacro
-
-%define has_epilogue regs_used > 7 || xmm_regs_used > 6 || mmsize == 32
-
-%macro RET 0
- WIN64_RESTORE_XMM_INTERNAL rsp
- POP_IF_USED 14, 13, 12, 11, 10, 9, 8, 7
-%if mmsize == 32
- vzeroupper
-%endif
- ret
-%endmacro
-
-%elif ARCH_X86_64 ; *nix x64 ;=============================================
-
-DECLARE_REG 0, rdi
-DECLARE_REG 1, rsi
-DECLARE_REG 2, rdx
-DECLARE_REG 3, rcx
-DECLARE_REG 4, R8
-DECLARE_REG 5, R9
-DECLARE_REG 6, rax, 8
-DECLARE_REG 7, R10, 16
-DECLARE_REG 8, R11, 24
-DECLARE_REG 9, rbx, 32
-DECLARE_REG 10, rbp, 40
-DECLARE_REG 11, R12, 48
-DECLARE_REG 12, R13, 56
-DECLARE_REG 13, R14, 64
-DECLARE_REG 14, R15, 72
-
-%macro PROLOGUE 2-4+ ; #args, #regs, #xmm_regs, arg_names...
- %assign num_args %1
- %assign regs_used %2
- ASSERT regs_used >= num_args
- ASSERT regs_used <= 15
- PUSH_IF_USED 9, 10, 11, 12, 13, 14
- LOAD_IF_USED 6, 7, 8, 9, 10, 11, 12, 13, 14
- DEFINE_ARGS %4
-%endmacro
-
-%define has_epilogue regs_used > 9 || mmsize == 32
-
-%macro RET 0
- POP_IF_USED 14, 13, 12, 11, 10, 9
-%if mmsize == 32
- vzeroupper
-%endif
- ret
-%endmacro
-
-%else ; X86_32 ;==============================================================
-
-DECLARE_REG 0, eax, 4
-DECLARE_REG 1, ecx, 8
-DECLARE_REG 2, edx, 12
-DECLARE_REG 3, ebx, 16
-DECLARE_REG 4, esi, 20
-DECLARE_REG 5, edi, 24
-DECLARE_REG 6, ebp, 28
-%define rsp esp
-
-%macro DECLARE_ARG 1-*
- %rep %0
- %define r%1m [esp + stack_offset + 4*%1 + 4]
- %define r%1mp dword r%1m
- %rotate 1
- %endrep
-%endmacro
-
-DECLARE_ARG 7, 8, 9, 10, 11, 12, 13, 14
-
-%macro PROLOGUE 2-4+ ; #args, #regs, #xmm_regs, arg_names...
- %assign num_args %1
- %assign regs_used %2
- %if regs_used > 7
- %assign regs_used 7
- %endif
- ASSERT regs_used >= num_args
- PUSH_IF_USED 3, 4, 5, 6
- LOAD_IF_USED 0, 1, 2, 3, 4, 5, 6
- DEFINE_ARGS %4
-%endmacro
-
-%define has_epilogue regs_used > 3 || mmsize == 32
-
-%macro RET 0
- POP_IF_USED 6, 5, 4, 3
-%if mmsize == 32
- vzeroupper
-%endif
- ret
-%endmacro
-
-%endif ;======================================================================
-
-%if WIN64 == 0
-%macro WIN64_SPILL_XMM 1
-%endmacro
-%macro WIN64_RESTORE_XMM 1
-%endmacro
-%endif
-
-%macro REP_RET 0
- %if has_epilogue
- RET
- %else
- rep ret
- %endif
-%endmacro
-
-%macro TAIL_CALL 2 ; callee, is_nonadjacent
- %if has_epilogue
- call %1
- RET
- %elif %2
- jmp %1
- %endif
-%endmacro
-
-;=============================================================================
-; arch-independent part
-;=============================================================================
-
-%assign function_align 16
-
-; Begin a function.
-; Applies any symbol mangling needed for C linkage, and sets up a define such that
-; subsequent uses of the function name automatically refer to the mangled version.
-; Appends cpuflags to the function name if cpuflags has been specified.
-%macro cglobal 1-2+ ; name, [PROLOGUE args]
-%if %0 == 1
- cglobal_internal %1 %+ SUFFIX
-%else
- cglobal_internal %1 %+ SUFFIX, %2
-%endif
-%endmacro
-%macro cglobal_internal 1-2+
- %ifndef cglobaled_%1
- %xdefine %1 mangle(%1)
- %xdefine %1.skip_prologue %1 %+ .skip_prologue
- CAT_XDEFINE cglobaled_, %1, 1
- %endif
- %xdefine current_function %1
- %ifidn __OUTPUT_FORMAT__,elf
- global %1:function hidden
- %else
- global %1
- %endif
- align function_align
- %1:
- RESET_MM_PERMUTATION ; not really needed, but makes disassembly somewhat nicer
- %assign stack_offset 0
- %if %0 > 1
- PROLOGUE %2
- %endif
-%endmacro
-
-%macro cextern 1
- %xdefine %1 mangle(%1)
- CAT_XDEFINE cglobaled_, %1, 1
- extern %1
-%endmacro
-
-; like cextern, but without the prefix
-%macro cextern_naked 1
- %xdefine %1 mangle(%1)
- CAT_XDEFINE cglobaled_, %1, 1
- extern %1
-%endmacro
-
-%macro const 2+
- %xdefine %1 mangle(%1)
- global %1
- %1: %2
-%endmacro
-
-; This is needed for ELF, otherwise the GNU linker assumes the stack is
-; executable by default.
-%ifidn __OUTPUT_FORMAT__,elf
-SECTION .note.GNU-stack noalloc noexec nowrite progbits
-%endif
-%ifidn __OUTPUT_FORMAT__,elf32
-section .note.GNU-stack noalloc noexec nowrite progbits
-%endif
-%ifidn __OUTPUT_FORMAT__,elf64
-section .note.GNU-stack noalloc noexec nowrite progbits
-%endif
-
-; cpuflags
-
-%assign cpuflags_MMX (1<<0)
-%assign cpuflags_MMX2 (1<<1) | cpuflags_MMX
-%assign cpuflags_3dnow (1<<2) | cpuflags_MMX
-%assign cpuflags_3dnow2 (1<<3) | cpuflags_3dnow
-%assign cpuflags_SSE (1<<4) | cpuflags_MMX2
-%assign cpuflags_SSE2 (1<<5) | cpuflags_SSE
-%assign cpuflags_SSE2slow (1<<6) | cpuflags_SSE2
-%assign cpuflags_SSE3 (1<<7) | cpuflags_SSE2
-%assign cpuflags_SSSE3 (1<<8) | cpuflags_SSE3
-%assign cpuflags_SSE4 (1<<9) | cpuflags_SSSE3
-%assign cpuflags_SSE42 (1<<10)| cpuflags_SSE4
-%assign cpuflags_AVX (1<<11)| cpuflags_SSE42
-%assign cpuflags_xop (1<<12)| cpuflags_AVX
-%assign cpuflags_fma4 (1<<13)| cpuflags_AVX
-%assign cpuflags_AVX2 (1<<14)| cpuflags_AVX
-%assign cpuflags_fma3 (1<<15)| cpuflags_AVX
-
-%assign cpuflags_cache32 (1<<16)
-%assign cpuflags_cache64 (1<<17)
-%assign cpuflags_slowctz (1<<18)
-%assign cpuflags_lzcnt (1<<19)
-%assign cpuflags_misalign (1<<20)
-%assign cpuflags_aligned (1<<21) ; not a cpu feature, but a function variant
-%assign cpuflags_atom (1<<22)
-%assign cpuflags_bmi1 (1<<23)
-%assign cpuflags_bmi2 (1<<24)|cpuflags_bmi1
-%assign cpuflags_tbm (1<<25)|cpuflags_bmi1
-
-%define cpuflag(x) ((cpuflags & (cpuflags_ %+ x)) == (cpuflags_ %+ x))
-%define notcpuflag(x) ((cpuflags & (cpuflags_ %+ x)) != (cpuflags_ %+ x))
-
-; Takes up to 2 cpuflags from the above list.
-; All subsequent functions (up to the next INIT_CPUFLAGS) is built for the specified cpu.
-; You shouldn't need to invoke this macro directly, it's a subroutine for INIT_MMX &co.
-%macro INIT_CPUFLAGS 0-2
- %if %0 >= 1
- %xdefine cpuname %1
- %assign cpuflags cpuflags_%1
- %if %0 >= 2
- %xdefine cpuname %1_%2
- %assign cpuflags cpuflags | cpuflags_%2
- %endif
- %xdefine SUFFIX _ %+ cpuname
- %if cpuflag(AVX)
- %assign AVX_enabled 1
- %endif
- %if mmsize == 16 && notcpuflag(SSE2)
- %define mova movaps
- %define movu movups
- %define movnta movntps
- %endif
- %if cpuflag(aligned)
- %define movu mova
- %elifidn %1, SSE3
- %define movu lddqu
- %endif
- %else
- %xdefine SUFFIX
- %undef cpuname
- %undef cpuflags
- %endif
-%endmacro
-
-; merge MMX and SSE*
-
-%macro CAT_XDEFINE 3
- %xdefine %1%2 %3
-%endmacro
-
-%macro CAT_UNDEF 2
- %undef %1%2
-%endmacro
-
-%macro INIT_MMX 0-1+
- %assign AVX_enabled 0
- %define RESET_MM_PERMUTATION INIT_MMX %1
- %define mmsize 8
- %define num_mmregs 8
- %define mova movq
- %define movu movq
- %define movh movd
- %define movnta movntq
- %assign %%i 0
- %rep 8
- CAT_XDEFINE m, %%i, mm %+ %%i
- CAT_XDEFINE nmm, %%i, %%i
- %assign %%i %%i+1
- %endrep
- %rep 8
- CAT_UNDEF m, %%i
- CAT_UNDEF nmm, %%i
- %assign %%i %%i+1
- %endrep
- INIT_CPUFLAGS %1
-%endmacro
-
-%macro INIT_XMM 0-1+
- %assign AVX_enabled 0
- %define RESET_MM_PERMUTATION INIT_XMM %1
- %define mmsize 16
- %define num_mmregs 8
- %if ARCH_X86_64
- %define num_mmregs 16
- %endif
- %define mova movdqa
- %define movu movdqu
- %define movh movq
- %define movnta movntdq
- %assign %%i 0
- %rep num_mmregs
- CAT_XDEFINE m, %%i, xmm %+ %%i
- CAT_XDEFINE nxmm, %%i, %%i
- %assign %%i %%i+1
- %endrep
- INIT_CPUFLAGS %1
-%endmacro
-
-%macro INIT_YMM 0-1+
- %assign AVX_enabled 1
- %define RESET_MM_PERMUTATION INIT_YMM %1
- %define mmsize 32
- %define num_mmregs 8
- %if ARCH_X86_64
- %define num_mmregs 16
- %endif
- %define mova vmovaps
- %define movu vmovups
- %undef movh
- %define movnta vmovntps
- %assign %%i 0
- %rep num_mmregs
- CAT_XDEFINE m, %%i, ymm %+ %%i
- CAT_XDEFINE nymm, %%i, %%i
- %assign %%i %%i+1
- %endrep
- INIT_CPUFLAGS %1
-%endmacro
-
-INIT_XMM
-
-; I often want to use macros that permute their arguments. e.g. there's no
-; efficient way to implement butterfly or transpose or dct without swapping some
-; arguments.
-;
-; I would like to not have to manually keep track of the permutations:
-; If I insert a permutation in the middle of a function, it should automatically
-; change everything that follows. For more complex macros I may also have multiple
-; implementations, e.g. the SSE2 and SSSE3 versions may have different permutations.
-;
-; Hence these macros. Insert a PERMUTE or some SWAPs at the end of a macro that
-; permutes its arguments. It's equivalent to exchanging the contents of the
-; registers, except that this way you exchange the register names instead, so it
-; doesn't cost any cycles.
-
-%macro PERMUTE 2-* ; takes a list of pairs to swap
-%rep %0/2
- %xdefine tmp%2 m%2
- %xdefine ntmp%2 nm%2
- %rotate 2
-%endrep
-%rep %0/2
- %xdefine m%1 tmp%2
- %xdefine nm%1 ntmp%2
- %undef tmp%2
- %undef ntmp%2
- %rotate 2
-%endrep
-%endmacro
-
-%macro SWAP 2-* ; swaps a single chain (sometimes more concise than pairs)
-%rep %0-1
-%ifdef m%1
- %xdefine tmp m%1
- %xdefine m%1 m%2
- %xdefine m%2 tmp
- CAT_XDEFINE n, m%1, %1
- CAT_XDEFINE n, m%2, %2
-%else
- ; If we were called as "SWAP m0,m1" rather than "SWAP 0,1" infer the original numbers here.
- ; Be careful using this mode in nested macros though, as in some cases there may be
- ; other copies of m# that have already been dereferenced and don't get updated correctly.
- %xdefine %%n1 n %+ %1
- %xdefine %%n2 n %+ %2
- %xdefine tmp m %+ %%n1
- CAT_XDEFINE m, %%n1, m %+ %%n2
- CAT_XDEFINE m, %%n2, tmp
- CAT_XDEFINE n, m %+ %%n1, %%n1
- CAT_XDEFINE n, m %+ %%n2, %%n2
-%endif
- %undef tmp
- %rotate 1
-%endrep
-%endmacro
-
-; If SAVE_MM_PERMUTATION is placed at the end of a function, then any later
-; calls to that function will automatically load the permutation, so values can
-; be returned in mmregs.
-%macro SAVE_MM_PERMUTATION 0-1
- %if %0
- %xdefine %%f %1_m
- %else
- %xdefine %%f current_function %+ _m
- %endif
- %assign %%i 0
- %rep num_mmregs
- CAT_XDEFINE %%f, %%i, m %+ %%i
- %assign %%i %%i+1
- %endrep
-%endmacro
-
-%macro LOAD_MM_PERMUTATION 1 ; name to load from
- %ifdef %1_m0
- %assign %%i 0
- %rep num_mmregs
- CAT_XDEFINE m, %%i, %1_m %+ %%i
- CAT_XDEFINE n, m %+ %%i, %%i
- %assign %%i %%i+1
- %endrep
- %endif
-%endmacro
-
-; Append cpuflags to the callee's name iff the appended name is known and the plain name isn't
-%macro call 1
- call_internal %1, %1 %+ SUFFIX
-%endmacro
-%macro call_internal 2
- %xdefine %%i %1
- %ifndef cglobaled_%1
- %ifdef cglobaled_%2
- %xdefine %%i %2
- %endif
- %endif
- call %%i
- LOAD_MM_PERMUTATION %%i
-%endmacro
-
-; Substitutions that reduce instruction size but are functionally equivalent
-%macro add 2
- %ifnum %2
- %if %2==128
- sub %1, -128
- %else
- add %1, %2
- %endif
- %else
- add %1, %2
- %endif
-%endmacro
-
-%macro sub 2
- %ifnum %2
- %if %2==128
- add %1, -128
- %else
- sub %1, %2
- %endif
- %else
- sub %1, %2
- %endif
-%endmacro
-
-;=============================================================================
-; AVX abstraction layer
-;=============================================================================
-
-%assign i 0
-%rep 16
- %if i < 8
- CAT_XDEFINE sizeofmm, i, 8
- %endif
- CAT_XDEFINE sizeofxmm, i, 16
- CAT_XDEFINE sizeofymm, i, 32
-%assign i i+1
-%endrep
-%undef i
-
-%macro CHECK_AVX_INSTR_EMU 3-*
- %xdefine %%opcode %1
- %xdefine %%dst %2
- %rep %0-2
- %ifidn %%dst, %3
- %error non-AVX emulation of ``%%opcode'' is not supported
- %endif
- %rotate 1
- %endrep
-%endmacro
-
-;%1 == instruction
-;%2 == 1 if float, 0 if int
-;%3 == 1 if 4-operand (xmm, xmm, xmm, imm), 0 if 2- or 3-operand (xmm, xmm, xmm)
-;%4 == number of operands given
-;%5+: operands
-%macro RUN_AVX_INSTR 6-7+
- %ifid %6
- %define %%sizeofreg sizeof%6
- %elifid %5
- %define %%sizeofreg sizeof%5
- %else
- %define %%sizeofreg mmsize
- %endif
- %if %%sizeofreg==32
- %if %4>=3
- v%1 %5, %6, %7
- %else
- v%1 %5, %6
- %endif
- %else
- %if %%sizeofreg==8
- %define %%regmov movq
- %elif %2
- %define %%regmov movaps
- %else
- %define %%regmov movdqa
- %endif
-
- %if %4>=3+%3
- %ifnidn %5, %6
- %if AVX_enabled && %%sizeofreg==16
- v%1 %5, %6, %7
- %else
- CHECK_AVX_INSTR_EMU {%1 %5, %6, %7}, %5, %7
- %%regmov %5, %6
- %1 %5, %7
- %endif
- %else
- %1 %5, %7
- %endif
- %elif %4>=3
- %1 %5, %6, %7
- %else
- %1 %5, %6
- %endif
- %endif
-%endmacro
-
-; 3arg AVX ops with a memory arg can only have it in src2,
-; whereas SSE emulation of 3arg prefers to have it in src1 (i.e. the mov).
-; So, if the op is symmetric and the wrong one is memory, swap them.
-%macro RUN_AVX_INSTR1 8
- %assign %%swap 0
- %if AVX_enabled
- %ifnid %6
- %assign %%swap 1
- %endif
- %elifnidn %5, %6
- %ifnid %7
- %assign %%swap 1
- %endif
- %endif
- %if %%swap && %3 == 0 && %8 == 1
- RUN_AVX_INSTR %1, %2, %3, %4, %5, %7, %6
- %else
- RUN_AVX_INSTR %1, %2, %3, %4, %5, %6, %7
- %endif
-%endmacro
-
-;%1 == instruction
-;%2 == 1 if float, 0 if int
-;%3 == 1 if 4-operand (xmm, xmm, xmm, imm), 0 if 2- or 3-operand (xmm, xmm, xmm)
-;%4 == 1 if symmetric (i.e. doesn't matter which src arg is which), 0 if not
-%macro AVX_INSTR 4
- %macro %1 2-9 fnord, fnord, fnord, %1, %2, %3, %4
- %ifidn %3, fnord
- RUN_AVX_INSTR %6, %7, %8, 2, %1, %2
- %elifidn %4, fnord
- RUN_AVX_INSTR1 %6, %7, %8, 3, %1, %2, %3, %9
- %elifidn %5, fnord
- RUN_AVX_INSTR %6, %7, %8, 4, %1, %2, %3, %4
- %else
- RUN_AVX_INSTR %6, %7, %8, 5, %1, %2, %3, %4, %5
- %endif
- %endmacro
-%endmacro
-
-AVX_INSTR addpd, 1, 0, 1
-AVX_INSTR addps, 1, 0, 1
-AVX_INSTR addsd, 1, 0, 1
-AVX_INSTR addss, 1, 0, 1
-AVX_INSTR addsubpd, 1, 0, 0
-AVX_INSTR addsubps, 1, 0, 0
-AVX_INSTR andpd, 1, 0, 1
-AVX_INSTR andps, 1, 0, 1
-AVX_INSTR andnpd, 1, 0, 0
-AVX_INSTR andnps, 1, 0, 0
-AVX_INSTR blendpd, 1, 0, 0
-AVX_INSTR blendps, 1, 0, 0
-AVX_INSTR blendvpd, 1, 0, 0
-AVX_INSTR blendvps, 1, 0, 0
-AVX_INSTR cmppd, 1, 0, 0
-AVX_INSTR cmpps, 1, 0, 0
-AVX_INSTR cmpsd, 1, 0, 0
-AVX_INSTR cmpss, 1, 0, 0
-AVX_INSTR cvtdq2ps, 1, 0, 0
-AVX_INSTR cvtps2dq, 1, 0, 0
-AVX_INSTR divpd, 1, 0, 0
-AVX_INSTR divps, 1, 0, 0
-AVX_INSTR divsd, 1, 0, 0
-AVX_INSTR divss, 1, 0, 0
-AVX_INSTR dppd, 1, 1, 0
-AVX_INSTR dpps, 1, 1, 0
-AVX_INSTR haddpd, 1, 0, 0
-AVX_INSTR haddps, 1, 0, 0
-AVX_INSTR hsubpd, 1, 0, 0
-AVX_INSTR hsubps, 1, 0, 0
-AVX_INSTR maxpd, 1, 0, 1
-AVX_INSTR maxps, 1, 0, 1
-AVX_INSTR maxsd, 1, 0, 1
-AVX_INSTR maxss, 1, 0, 1
-AVX_INSTR minpd, 1, 0, 1
-AVX_INSTR minps, 1, 0, 1
-AVX_INSTR minsd, 1, 0, 1
-AVX_INSTR minss, 1, 0, 1
-AVX_INSTR movhlps, 1, 0, 0
-AVX_INSTR movlhps, 1, 0, 0
-AVX_INSTR movsd, 1, 0, 0
-AVX_INSTR movss, 1, 0, 0
-AVX_INSTR mpsadbw, 0, 1, 0
-AVX_INSTR mulpd, 1, 0, 1
-AVX_INSTR mulps, 1, 0, 1
-AVX_INSTR mulsd, 1, 0, 1
-AVX_INSTR mulss, 1, 0, 1
-AVX_INSTR orpd, 1, 0, 1
-AVX_INSTR orps, 1, 0, 1
-AVX_INSTR pabsb, 0, 0, 0
-AVX_INSTR pabsw, 0, 0, 0
-AVX_INSTR pabsd, 0, 0, 0
-AVX_INSTR packsswb, 0, 0, 0
-AVX_INSTR packssdw, 0, 0, 0
-AVX_INSTR packuswb, 0, 0, 0
-AVX_INSTR packusdw, 0, 0, 0
-AVX_INSTR paddb, 0, 0, 1
-AVX_INSTR paddw, 0, 0, 1
-AVX_INSTR paddd, 0, 0, 1
-AVX_INSTR paddq, 0, 0, 1
-AVX_INSTR paddsb, 0, 0, 1
-AVX_INSTR paddsw, 0, 0, 1
-AVX_INSTR paddusb, 0, 0, 1
-AVX_INSTR paddusw, 0, 0, 1
-AVX_INSTR palignr, 0, 1, 0
-AVX_INSTR pand, 0, 0, 1
-AVX_INSTR pandn, 0, 0, 0
-AVX_INSTR pavgb, 0, 0, 1
-AVX_INSTR pavgw, 0, 0, 1
-AVX_INSTR pblendvb, 0, 0, 0
-AVX_INSTR pblendw, 0, 1, 0
-AVX_INSTR pcmpestri, 0, 0, 0
-AVX_INSTR pcmpestrm, 0, 0, 0
-AVX_INSTR pcmpistri, 0, 0, 0
-AVX_INSTR pcmpistrm, 0, 0, 0
-AVX_INSTR pcmpeqb, 0, 0, 1
-AVX_INSTR pcmpeqw, 0, 0, 1
-AVX_INSTR pcmpeqd, 0, 0, 1
-AVX_INSTR pcmpeqq, 0, 0, 1
-AVX_INSTR pcmpgtb, 0, 0, 0
-AVX_INSTR pcmpgtw, 0, 0, 0
-AVX_INSTR pcmpgtd, 0, 0, 0
-AVX_INSTR pcmpgtq, 0, 0, 0
-AVX_INSTR phaddw, 0, 0, 0
-AVX_INSTR phaddd, 0, 0, 0
-AVX_INSTR phaddsw, 0, 0, 0
-AVX_INSTR phsubw, 0, 0, 0
-AVX_INSTR phsubd, 0, 0, 0
-AVX_INSTR phsubsw, 0, 0, 0
-AVX_INSTR pmaddwd, 0, 0, 1
-AVX_INSTR pmaddubsw, 0, 0, 0
-AVX_INSTR pmaxsb, 0, 0, 1
-AVX_INSTR pmaxsw, 0, 0, 1
-AVX_INSTR pmaxsd, 0, 0, 1
-AVX_INSTR pmaxub, 0, 0, 1
-AVX_INSTR pmaxuw, 0, 0, 1
-AVX_INSTR pmaxud, 0, 0, 1
-AVX_INSTR pminsb, 0, 0, 1
-AVX_INSTR pminsw, 0, 0, 1
-AVX_INSTR pminsd, 0, 0, 1
-AVX_INSTR pminub, 0, 0, 1
-AVX_INSTR pminuw, 0, 0, 1
-AVX_INSTR pminud, 0, 0, 1
-AVX_INSTR pmovmskb, 0, 0, 0
-AVX_INSTR pmulhuw, 0, 0, 1
-AVX_INSTR pmulhrsw, 0, 0, 1
-AVX_INSTR pmulhw, 0, 0, 1
-AVX_INSTR pmullw, 0, 0, 1
-AVX_INSTR pmulld, 0, 0, 1
-AVX_INSTR pmuludq, 0, 0, 1
-AVX_INSTR pmuldq, 0, 0, 1
-AVX_INSTR por, 0, 0, 1
-AVX_INSTR psadbw, 0, 0, 1
-AVX_INSTR pshufb, 0, 0, 0
-AVX_INSTR pshufd, 0, 1, 0
-AVX_INSTR pshufhw, 0, 1, 0
-AVX_INSTR pshuflw, 0, 1, 0
-AVX_INSTR psignb, 0, 0, 0
-AVX_INSTR psignw, 0, 0, 0
-AVX_INSTR psignd, 0, 0, 0
-AVX_INSTR psllw, 0, 0, 0
-AVX_INSTR pslld, 0, 0, 0
-AVX_INSTR psllq, 0, 0, 0
-AVX_INSTR pslldq, 0, 0, 0
-AVX_INSTR psraw, 0, 0, 0
-AVX_INSTR psrad, 0, 0, 0
-AVX_INSTR psrlw, 0, 0, 0
-AVX_INSTR psrld, 0, 0, 0
-AVX_INSTR psrlq, 0, 0, 0
-AVX_INSTR psrldq, 0, 0, 0
-AVX_INSTR psubb, 0, 0, 0
-AVX_INSTR psubw, 0, 0, 0
-AVX_INSTR psubd, 0, 0, 0
-AVX_INSTR psubq, 0, 0, 0
-AVX_INSTR psubsb, 0, 0, 0
-AVX_INSTR psubsw, 0, 0, 0
-AVX_INSTR psubusb, 0, 0, 0
-AVX_INSTR psubusw, 0, 0, 0
-AVX_INSTR ptest, 0, 0, 0
-AVX_INSTR punpckhbw, 0, 0, 0
-AVX_INSTR punpckhwd, 0, 0, 0
-AVX_INSTR punpckhdq, 0, 0, 0
-AVX_INSTR punpckhqdq, 0, 0, 0
-AVX_INSTR punpcklbw, 0, 0, 0
-AVX_INSTR punpcklwd, 0, 0, 0
-AVX_INSTR punpckldq, 0, 0, 0
-AVX_INSTR punpcklqdq, 0, 0, 0
-AVX_INSTR pxor, 0, 0, 1
-AVX_INSTR shufps, 1, 1, 0
-AVX_INSTR subpd, 1, 0, 0
-AVX_INSTR subps, 1, 0, 0
-AVX_INSTR subsd, 1, 0, 0
-AVX_INSTR subss, 1, 0, 0
-AVX_INSTR unpckhpd, 1, 0, 0
-AVX_INSTR unpckhps, 1, 0, 0
-AVX_INSTR unpcklpd, 1, 0, 0
-AVX_INSTR unpcklps, 1, 0, 0
-AVX_INSTR xorpd, 1, 0, 1
-AVX_INSTR xorps, 1, 0, 1
-
-; 3DNow instructions, for sharing code between AVX, SSE and 3DN
-AVX_INSTR pfadd, 1, 0, 1
-AVX_INSTR pfsub, 1, 0, 0
-AVX_INSTR pfmul, 1, 0, 1
-
-; base-4 constants for shuffles
-%assign i 0
-%rep 256
- %assign j ((i>>6)&3)*1000 + ((i>>4)&3)*100 + ((i>>2)&3)*10 + (i&3)
- %if j < 10
- CAT_XDEFINE q000, j, i
- %elif j < 100
- CAT_XDEFINE q00, j, i
- %elif j < 1000
- CAT_XDEFINE q0, j, i
- %else
- CAT_XDEFINE q, j, i
- %endif
-%assign i i+1
-%endrep
-%undef i
-%undef j
-
-%macro FMA_INSTR 3
- %macro %1 4-7 %1, %2, %3
- %if cpuflag(xop)
- v%5 %1, %2, %3, %4
- %else
- %6 %1, %2, %3
- %7 %1, %4
- %endif
- %endmacro
-%endmacro
-
-FMA_INSTR pmacsdd, pmulld, paddd
-FMA_INSTR pmacsww, pmullw, paddw
-FMA_INSTR pmadcswd, pmaddwd, paddd
-
-; tzcnt is equivalent to "rep bsf" and is backwards-compatible with bsf.
-; This lets us use tzcnt without bumping the yasm version requirement yet.
-%define tzcnt rep bsf
diff --git a/third_party/x86inc/README.libaom b/third_party/x86inc/README.libaom
index 2f3e5c2..6b92358 100644
--- a/third_party/x86inc/README.libaom
+++ b/third_party/x86inc/README.libaom
@@ -16,3 +16,4 @@
Use .text instead of .rodata on macho to avoid broken tables in PIC mode.
Use .text with no alignment for aout.
Only use 'hidden' visibility with Chromium.
+Prefix ARCH_* with AOM_.
diff --git a/third_party/x86inc/x86inc.asm b/third_party/x86inc/x86inc.asm
index e48d644..b0421f5 100644
--- a/third_party/x86inc/x86inc.asm
+++ b/third_party/x86inc/x86inc.asm
@@ -45,7 +45,7 @@
%endif
%ifndef STACK_ALIGNMENT
- %if ARCH_X86_64
+ %if AOM_ARCH_X86_64
%define STACK_ALIGNMENT 16
%else
%define STACK_ALIGNMENT 4
@@ -54,7 +54,7 @@
%define WIN64 0
%define UNIX64 0
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
%ifidn __OUTPUT_FORMAT__,win32
%define WIN64 1
%elifidn __OUTPUT_FORMAT__,win64
@@ -168,7 +168,7 @@
%endif
%endif
- %if ARCH_X86_64 == 0
+ %if AOM_ARCH_X86_64 == 0
%undef PIC
%endif
@@ -277,7 +277,7 @@
%if %0 == 2
%define r%1m %2d
%define r%1mp %2
- %elif ARCH_X86_64 ; memory
+ %elif AOM_ARCH_X86_64 ; memory
%define r%1m [rstk + stack_offset + %3]
%define r%1mp qword r %+ %1 %+ m
%else
@@ -298,7 +298,7 @@
%define e%1h %3
%define r%1b %2
%define e%1b %2
- %if ARCH_X86_64 == 0
+ %if AOM_ARCH_X86_64 == 0
%define r%1 e%1
%endif
%endmacro
@@ -335,14 +335,14 @@
DECLARE_REG_TMP_SIZE 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
%define gprsize 8
%else
%define gprsize 4
%endif
%macro LEA 2
-%if ARCH_X86_64
+%if AOM_ARCH_X86_64
lea %1, [%2]
%elif PIC
call $+5 ; special-cased to not affect the RSB on most CPU:s
@@ -414,7 +414,7 @@
%endif
%endmacro
-%if ARCH_X86_64 == 0
+%if AOM_ARCH_X86_64 == 0
%define movsxd movifnidn
%endif
@@ -466,7 +466,7 @@
%endmacro
%define required_stack_alignment ((mmsize + 15) & ~15)
-%define vzeroupper_required (mmsize > 16 && (ARCH_X86_64 == 0 || xmm_regs_used > 16 || notcpuflag(avx512)))
+%define vzeroupper_required (mmsize > 16 && (AOM_ARCH_X86_64 == 0 || xmm_regs_used > 16 || notcpuflag(avx512)))
%define high_mm_regs (16*cpuflag(avx512))
%macro ALLOC_STACK 1-2 0 ; stack_size, n_xmm_regs (for win64 only)
@@ -521,13 +521,13 @@
; Reserve an additional register for storing the original stack pointer, but avoid using
; eax/rax for this purpose since it can potentially get overwritten as a return value.
%assign regs_used (regs_used + 1)
- %if ARCH_X86_64 && regs_used == 7
+ %if AOM_ARCH_X86_64 && regs_used == 7
%assign regs_used 8
- %elif ARCH_X86_64 == 0 && regs_used == 1
+ %elif AOM_ARCH_X86_64 == 0 && regs_used == 1
%assign regs_used 2
%endif
%endif
- %if ARCH_X86_64 && regs_used < 5 + UNIX64 * 3
+ %if AOM_ARCH_X86_64 && regs_used < 5 + UNIX64 * 3
; Ensure that we don't clobber any registers containing arguments. For UNIX64 we also preserve r6 (rax)
; since it's used as a hidden argument in vararg functions to specify the number of vector registers used.
%assign regs_used 5 + UNIX64 * 3
@@ -654,7 +654,7 @@
AUTO_REP_RET
%endmacro
-%elif ARCH_X86_64 ; *nix x64 ;=============================================
+%elif AOM_ARCH_X86_64 ; *nix x64 ;=============================================
DECLARE_REG 0, rdi
DECLARE_REG 1, rsi
@@ -1002,7 +1002,7 @@
%endif
%endif
- %if ARCH_X86_64 || cpuflag(sse2)
+ %if AOM_ARCH_X86_64 || cpuflag(sse2)
%ifdef __NASM_VER__
ALIGNMODE p6
%else
@@ -1039,7 +1039,7 @@
%endif
%assign num_mmregs 8
- %if ARCH_X86_64 && mmsize >= 16
+ %if AOM_ARCH_X86_64 && mmsize >= 16
%assign num_mmregs 16
%if cpuflag(avx512) || mmsize == 64
%assign num_mmregs 32
@@ -1064,7 +1064,7 @@
; Prefer registers 16-31 over 0-15 to avoid having to use vzeroupper
%macro AVX512_MM_PERMUTATION 0-1 0 ; start_reg
- %if ARCH_X86_64 && cpuflag(avx512)
+ %if AOM_ARCH_X86_64 && cpuflag(avx512)
%assign %%i %1
%rep 16-%1
%assign %%i_high %%i+16
diff --git a/tools/frame_size_variation_analyzer.py b/tools/frame_size_variation_analyzer.py
new file mode 100644
index 0000000..5c02319
--- /dev/null
+++ b/tools/frame_size_variation_analyzer.py
@@ -0,0 +1,74 @@
+# RTC frame size variation analyzer
+# Usage:
+# 1. Config with "-DCONFIG_OUTPUT_FRAME_SIZE=1".
+# 2. Build aomenc. Encode a file, and generate output file: frame_sizes.csv
+# 3. Run: python ./frame_size_variation_analyzer.py frame_sizes.csv target-bitrate fps
+#    Where target-bitrate is the bitrate in kbps and fps is frames per second.
+# Example: python ../aom/tools/frame_size_variation_analyzer.py frame_sizes.csv
+# 1000 30
+
+import numpy as np
+import csv
+import sys
+import matplotlib.pyplot as plt
+
+# Return the moving average of x over a sliding window of w samples.
+def moving_average(x, w):
+ return np.convolve(x, np.ones(w), 'valid') / w
+
+def frame_size_analysis(filename, target_br, fps):
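+  # Per-frame bit budget: target_br (kbps) * 1000 gives bits/sec; divide by fps.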
+ tbr = target_br * 1000 / fps
+
+ with open(filename, 'r') as infile:
+ raw_data = list(csv.reader(infile, delimiter=','))
+
+ data = np.array(raw_data).astype(float)
+ fsize = data[:, 0].astype(float) # frame size
+ qindex = data[:, 1].astype(float) # qindex
+
+ # Frame bit rate mismatch
+ mismatch = np.absolute(fsize - np.full(fsize.size, tbr))
+
+ # Count how many frames are more than 2.5x of frame target bit rate.
+ tbr_thr = tbr * 2.5
+ cnt = 0
+ idx = np.arange(fsize.size)
+ for i in idx:
+ if fsize[i] > tbr_thr:
+ cnt = cnt + 1
+
+ # Use the 15-frame moving window
+ win = 15
+ avg_fsize = moving_average(fsize, win)
+ win_mismatch = np.absolute(avg_fsize - np.full(avg_fsize.size, tbr))
+
+ print('[Target frame rate (bit)]:', "%.2f"%tbr)
+ print('[Average frame rate (bit)]:', "%.2f"%np.average(fsize))
+ print('[Frame rate standard deviation]:', "%.2f"%np.std(fsize))
+ print('[Max/min frame rate (bit)]:', "%.2f"%np.max(fsize), '/', "%.2f"%np.min(fsize))
+ print('[Average frame rate mismatch (bit)]:', "%.2f"%np.average(mismatch))
+ print('[Number of frames (frame rate > 2.5x of target frame rate)]:', cnt)
+ print(' Moving window size:', win)
+ print('[Moving average frame rate mismatch (bit)]:', "%.2f"%np.average(win_mismatch))
+ print('------------------------------')
+
+ figure, axis = plt.subplots(2)
+ x = np.arange(fsize.size)
+ axis[0].plot(x, fsize, color='blue')
+ axis[0].set_title("frame sizes")
+ axis[1].plot(x, qindex, color='blue')
+ axis[1].set_title("frame qindex")
+ plt.tight_layout()
+
+ # Save the plot
+ plotname = filename + '.png'
+ plt.savefig(plotname)
+ plt.show()
+
+if __name__ == '__main__':
+ if (len(sys.argv) < 4):
+ print(sys.argv[0], 'input_file, target_bitrate, fps')
+ sys.exit()
+ target_br = int(sys.argv[2])
+ fps = int(sys.argv[3])
+ frame_size_analysis(sys.argv[1], target_br, fps)